Automatic scraping of reCAPTCHA site every day (4/7: S3 file processing)

  1. Requirement definition ~ python environment construction
  2. Create a site scraping mechanism
  3. Process the downloaded file (xls) to create the final product (csv)
  4. **Download files from S3 / upload files to S3** (this time)
  5. Implement 2captcha
  6. Allow it to be launched in a Docker container
  7. Register with AWS Batch

S3 operation

Change file structure

Up to the previous post I had written everything in one go, from scraping to file processing, but at this point I split it into separate files.

One of the reasons is to add a test mode: connecting to S3 means a large number of input words, so a full test run takes a long time.

This time I tried to divide it like this.

File organization


├── app
│   ├── drivers              place the selenium driver here
│   └── source
│       ├── run.py           main executable
│       ├── scraping.py      scraping process (2nd)
│       ├── make_outputs.py  processing of downloaded files (3rd)
│       ├── s3_operator.py   S3-related processing (this time: 4th)
│       └── configs.py       environment variables, passwords, etc.
└── tmp
    ├── files
    │   ├── fromS3           files fetched from S3 during the main run
    │   ├── toS3             files to send to S3 during the main run
    │   └── download         files downloaded by scraping
    └── logs                 logs (selenium logs, etc.)

The main executable, run.py, calls the processing in each of these files. Since the word list fetched from S3 is huge, I made it possible to set a "test execution mode" so that test runs finish quickly.

run.py


import argparse
import os

import configs
import scraping
import make_outputs
import s3_operator

if __name__ == '__main__':
    # Add a test execution mode via a command-line option
    parser = argparse.ArgumentParser(description='scraping batch')
    parser.add_argument('--run_mode', dest='run_mode', default='normal', help='run_mode: test | normal')
    args = parser.parse_args()

    # Load the variables for the current environment
    env = os.getenv('BATCH_ENV', 'LOCAL')
    config = configs.Config.load(env)

    # Main processing
    words = s3_operator.getFromS3(config)   # fetch the input word list from S3
    scraping.main(config, args, words)      # scraping (in test mode, stops after 2 words)
    make_outputs.main(config, words)        # process the downloaded files
    s3_operator.sendToS3(config)            # send the results to S3
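
scraping.py itself was covered in part 2, so it is not repeated here. Just to illustrate the test mode (a sketch based on my own assumptions, not the actual scraping.py), the word list can simply be truncated when run_mode is "test":

scraping.py (sketch)


def main(config, args, words):
    # Test mode: keep the run short by scraping only the first 2 words
    if args.run_mode == 'test':
        words = words[:2]
    for word in words:
        # ... log in, search for the word, download the xls (see part 2) ...
        pass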

By the way, configs.py looks like this

configs.py


class Config:
    
    login_url = "XXXXXXXXXX"
    login_id =  "XXXXXXXXXX"
    login_password = "XXXXXXXXX"

    #S3
    region_name ="XXXXXXXXXX"
    input_path = "XXXXXXXXXXX"
    output_path = "XXXXXXXXXX"

    @staticmethod
    def load(env_name):
        if env_name == 'PRD':
            return PrdConfig()
        elif env_name == 'STG':
            return StgConfig()
        elif env_name == 'DEV':
            return DevConfig()
        else:
            return LocalConfig()

class LocalConfig(Config):
    access_key_id = 'XXXXXXXXX'
    secret_access_key = 'XXXXXXXXXX'
    bucket_name = 'XXXXX'

# ... DevConfig, StgConfig, and PrdConfig are defined in the same way below
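
As a minimal usage sketch (it only repeats what run.py does above), the environment name decides which subclass comes back:

config usage (sketch)


import os
import configs

env = os.getenv('BATCH_ENV', 'LOCAL')   # 'PRD' / 'STG' / 'DEV', or unset when local
config = configs.Config.load(env)       # anything else falls back to LocalConfig
print(config.bucket_name, config.region_name)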

Process to get a file from S3

I will write the process to get the file from S3.

- Create a storage location for the downloaded files
- Get the list of file names on S3
- Download the files one by one
- Read the input words from the downloaded file

s3_operator.py


import os
from datetime import datetime
from pathlib import Path

import boto3
import pandas as pd


def getFromS3(config):
    # Create the local directory for files downloaded from S3
    date = datetime.now().strftime('%Y%m%d')
    dldir_name = os.path.abspath(__file__ + '/../../../tmp/files/fromS3/{}'.format(date))
    dldir_path = Path(dldir_name)
    dldir_path.mkdir(parents=True, exist_ok=True)
    download_dir = str(dldir_path.resolve())

    # Create the client used to connect to S3
    s3_client = boto3.client(
        's3',
        region_name=config.region_name,
        aws_access_key_id=config.access_key_id,
        aws_secret_access_key=config.secret_access_key,
    )

    # List the files under the given path in the given bucket
    key_prefix = config.output_path + '/{}/'.format(date)
    response = s3_client.list_objects_v2(Bucket=config.bucket_name, Prefix=key_prefix)
    if response["KeyCount"] == 0:
        print('Object does not exist: {}'.format(key_prefix))
        return

    # Download the files one by one
    for obj in response["Contents"]:
        object_name = obj["Key"]
        download_file_name = os.path.basename(obj["Key"])
        if len(download_file_name) == 0:
            continue  # skip the "directory" placeholder key itself
        download_file_name = download_dir + '/' + download_file_name
        s3_client.download_file(Bucket=config.bucket_name, Key=object_name, Filename=download_file_name)
        print('Downloaded object "{}" to: {}'.format(object_name, download_file_name))

    # Read the words (the scraping input)
    download_file_name = download_dir + '/words.csv'
    return pd.read_csv(download_file_name).values.tolist()
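
One caveat that the code above does not handle: list_objects_v2 returns at most 1,000 keys per call. If the prefix could ever hold more objects than that, a paginator is the usual fix; here is a minimal sketch of the same download loop with pagination (an addition of mine, not part of the original batch):

s3_operator.py (pagination sketch)


paginator = s3_client.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=config.bucket_name, Prefix=key_prefix):
    for obj in page.get('Contents', []):
        file_name = os.path.basename(obj['Key'])
        if file_name:  # skip the "directory" placeholder key itself
            s3_client.download_file(Bucket=config.bucket_name, Key=obj['Key'],
                                    Filename=download_dir + '/' + file_name)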
    

Processing to send a file to S3

This is the process of sending files to S3. It is simple, since we just upload what was created earlier.

s3_operator.py


import glob
import os
from datetime import datetime

import boto3


def sendToS3(config):
    date = datetime.now().strftime('%Y%m%d')
    s3_client = boto3.client(
        's3',
        region_name=config.region_name,
        aws_access_key_id=config.access_key_id,
        aws_secret_access_key=config.secret_access_key,
    )

    # Upload every file under tmp/files/toS3/<date>/
    upfiles = glob.glob(os.path.abspath(__file__ + '/../../../tmp/files/toS3/{}/*'.format(date)))
    for f in upfiles:
        filename = os.path.split(f)[1]
        object_name = config.input_path + "/{}/{}".format(date, filename)
        s3_client.upload_file(f, config.bucket_name, object_name)
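
As a small sanity check (my own addition, not in the original post), head_object can confirm that the last uploaded object really exists; it raises a ClientError if the key is missing:

upload check (sketch)


s3_client.head_object(Bucket=config.bucket_name, Key=object_name)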

Complete

Now, running python run.py gives you a program that accomplishes the goal.
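
For example (assuming the placeholders in configs.py are filled in), local runs look like this; --run_mode and BATCH_ENV are the option and environment variable defined in run.py above:

run examples


python run.py                    # normal run with LocalConfig
python run.py --run_mode test    # test mode: only 2 words are scraped
BATCH_ENV=DEV python run.py      # switch to DevConfig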

But…

- You have to run it manually on your PC every week
- You have to solve the reCAPTCHA yourself
- You have to keep your PC running while it executes

I'll address these issues from the next post onward.
