In the previous posts I worked step by step from scraping to file processing; this time I split the code into separate files. One of the goals is to add a test mode: connecting to S3 means a huge number of input words, so test runs take a long time.
I divided the files like this:
file organization
├── app
│   ├── drivers              place the Selenium drivers here
│   └── source
│       ├── run.py           main executable
│       ├── scraping.py      scraping process (2nd post)
│       ├── make_outputs.py  processing of the downloaded files (3rd post)
│       ├── s3_operator.py   S3-related processing (this time: 4th post)
│       └── configs.py       environment variables, passwords, etc.
└── tmp
    ├── files
    │   ├── fromS3           files fetched from S3 by the main run
    │   ├── toS3             files the main run sends to S3
    │   └── download         files downloaded by the scraping step
    └── logs                 logs (Selenium logs etc.)
The processing in each file is called from the main executable, run.py. Since the word list fetched from S3 is huge, run.py accepts a "test execution mode" flag so that test runs finish quickly.
run.py

import argparse
import os

import configs
import make_outputs
import s3_operator
import scraping

if __name__ == '__main__':
    # Provide a test execution mode
    parser = argparse.ArgumentParser(description='scraping batch')
    parser.add_argument('--run_mode', dest='run_mode', default='normal', help='run_mode: test | normal')
    args = parser.parse_args()
    # Load the settings for the current environment
    env = os.getenv('BATCH_ENV', 'LOCAL')
    config = configs.Config.load(env)
    # Main processing
    words = s3_operator.getFromS3(config)  # fetch the file from S3 and read the input words
    scraping.main(config, args, words)     # scraping (in test mode it stops after 2 words)
    make_outputs.main(config, words)       # process the downloaded files
    s3_operator.sendToS3(config)           # send the results to S3
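For reference, here is a minimal sketch of how scraping.main could honor run_mode; the two-word cutoff matches the comment above, but the loop body is a placeholder of my own.

scraping.py (hypothetical sketch)

def main(config, args, words):
    # In test mode, process only the first 2 words so the run stays short
    if args.run_mode == 'test':
        words = words[:2]
    for word in words:
        print('scraping: {}'.format(word))  # stands in for the real scraping logic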
By the way, configs.py looks like this
configs.py

class Config:
    login_url = "XXXXXXXXXX"
    login_id = "XXXXXXXXXX"
    login_password = "XXXXXXXXX"
    # S3
    region_name = "XXXXXXXXXX"
    input_path = "XXXXXXXXXXX"
    output_path = "XXXXXXXXXX"

    @staticmethod
    def load(env_name):
        if env_name == 'PRD':
            return PrdConfig()
        elif env_name == 'STG':
            return StgConfig()
        elif env_name == 'DEV':
            return DevConfig()
        else:
            return LocalConfig()

class LocalConfig(Config):
    access_key_id = 'XXXXXXXXX'
    secret_access_key = 'XXXXXXXXXX'
    bucket_name = 'XXXXX'

# ... DevConfig, StgConfig, and PrdConfig follow the same pattern
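As a quick sanity check, the environment switch behaves like this (a sketch; setting BATCH_ENV in code is only for illustration, normally it comes from outside the program):

import os
import configs

os.environ['BATCH_ENV'] = 'DEV'
config = configs.Config.load(os.getenv('BATCH_ENV', 'LOCAL'))
print(type(config).__name__)  # -> DevConfig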
Next, the process that gets the file from S3. It does the following:

- create a storage location for the downloaded files
- get the list of file names on S3
- download them one by one
- read the words from the downloaded file

The file sits on S3 under a dated path, /YYYYMMDD/words.csv, so the code is a bit redundant. Whatever.

s3_operator.py
import os
from datetime import datetime
from pathlib import Path

import boto3
import pandas as pd

def getFromS3(config):
    # Create a storage destination for the files downloaded from S3
    date = datetime.now().strftime('%Y%m%d')
    dldir_name = os.path.abspath(__file__ + '/../../../tmp/files/fromS3/{}'.format(date))
    dldir_path = Path(dldir_name)
    dldir_path.mkdir(parents=True, exist_ok=True)
    download_dir = str(dldir_path.resolve())
    # Define the client used to connect to S3
    s3_client = boto3.client(
        's3',
        region_name=config.region_name,
        aws_access_key_id=config.access_key_id,
        aws_secret_access_key=config.secret_access_key,
    )
    # List the objects under the given path in the given bucket
    key_prefix = config.output_path + '/{}/'.format(date)
    response = s3_client.list_objects_v2(Bucket=config.bucket_name, Prefix=key_prefix)
    if response["KeyCount"] == 0:
        print('Object does not exist: {}'.format(key_prefix))
        return
    # Download the files one by one
    for obj in response["Contents"]:
        object_name = obj["Key"]
        download_file_name = os.path.basename(obj["Key"])
        if len(download_file_name) == 0:
            continue  # skip the "directory" key itself
        download_file_name = download_dir + '/' + download_file_name
        s3_client.download_file(Bucket=config.bucket_name, Key=object_name, Filename=download_file_name)
        print('Downloaded object "{}" to {}'.format(object_name, download_file_name))
    # Read the words
    download_file_name = download_dir + '/words.csv'
    return pd.read_csv(download_file_name).values.tolist()
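One caveat: list_objects_v2 returns at most 1,000 keys per call. The daily folder here is small, so the single call above is fine, but if it could ever grow past that, a paginator-based listing like the sketch below (my own helper, not part of the batch) would be safer.

def list_all_keys(s3_client, bucket, prefix):
    # The paginator follows ContinuationToken so no key is missed
    paginator = s3_client.get_paginator('list_objects_v2')
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            keys.append(obj['Key'])
    return keys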
Here is the process for sending files. This one is easy, because you just upload what you made.

s3_operator.py
import glob
import os
from datetime import datetime

import boto3

def sendToS3(config):
    date = datetime.now().strftime('%Y%m%d')
    s3_client = boto3.client(
        's3',
        region_name=config.region_name,
        aws_access_key_id=config.access_key_id,
        aws_secret_access_key=config.secret_access_key,
    )
    # Upload every file under tmp/files/toS3/YYYYMMDD/
    upfiles = glob.glob(os.path.abspath(__file__ + '/../../../tmp/files/toS3/{}/*'.format(date)))
    for f in upfiles:
        filename = os.path.split(f)[1]
        object_name = config.input_path + "/{}/{}".format(date, filename)
        s3_client.upload_file(f, config.bucket_name, object_name)
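If you want to confirm that an upload actually landed, head_object is a cheap check. The helper below is a hypothetical sketch, not part of the batch.

from botocore.exceptions import ClientError

def existsOnS3(s3_client, bucket, key):
    # head_object fetches metadata only; a ClientError (404) means the object is absent
    try:
        s3_client.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError:
        return False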
Now, with python run.py, you have a program that achieves the goal (pass --run_mode test to keep a trial run short).
But…

- you have to run it on your PC every week
- you have to solve the reCAPTCHA yourself
- you have to keep your PC on while it runs

So next time, I'll take care of those.