In the previous posts I worked step by step from scraping to file processing; this time I split the code into separate files. One of the goals is to add a test mode: connecting to S3 means a huge number of input words, so test runs take a long time.
I divided the files like this:
file organization
├── app
│   ├── drivers              place the Selenium drivers here
│   └── source
│       ├── run.py           main executable
│       ├── scraping.py      scraping process (2nd post)
│       ├── make_outputs.py  processing of the downloaded files (3rd post)
│       ├── s3_operator.py   S3-related processing (this time: 4th post)
│       └── configs.py       environment variables, passwords, etc.
└── tmp
    ├── files
    │   ├── fromS3           files fetched from S3 by the main run
    │   ├── toS3             files the main run sends to S3
    │   └── download         files downloaded by the scraping step
    └── logs                 logs (Selenium logs etc.)
The processing in each file is called from the main executable, run.py. Since the word list fetched from S3 is huge, run.py accepts a "test execution mode" flag so that test runs finish quickly.
run.py

import argparse
import os

import configs
import make_outputs
import s3_operator
import scraping

if __name__ == '__main__':
    # Provide a test execution mode
    parser = argparse.ArgumentParser(description='scraping batch')
    parser.add_argument('--run_mode', dest='run_mode', default='normal', help='run_mode: test | normal')
    args = parser.parse_args()
    # Load the settings for the current environment
    env = os.getenv('BATCH_ENV', 'LOCAL')
    config = configs.Config.load(env)
    # Main processing
    words = s3_operator.getFromS3(config)  # fetch the file from S3 and read the input words
    scraping.main(config, args, words)     # scraping (in test mode it stops after 2 words)
    make_outputs.main(config, words)       # process the downloaded files
    s3_operator.sendToS3(config)           # send the results to S3
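For reference, here is a minimal sketch of how scraping.main could honor run_mode; the two-word cutoff matches the comment above, but the loop body is a placeholder of my own.

scraping.py (hypothetical sketch)

def main(config, args, words):
    # In test mode, process only the first 2 words so the run stays short
    if args.run_mode == 'test':
        words = words[:2]
    for word in words:
        print('scraping: {}'.format(word))  # stands in for the real scraping logic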
By the way, configs.py looks like this
configs.py

class Config:
    login_url = "XXXXXXXXXX"
    login_id = "XXXXXXXXXX"
    login_password = "XXXXXXXXX"
    # S3
    region_name = "XXXXXXXXXX"
    input_path = "XXXXXXXXXXX"
    output_path = "XXXXXXXXXX"

    @staticmethod
    def load(env_name):
        if env_name == 'PRD':
            return PrdConfig()
        elif env_name == 'STG':
            return StgConfig()
        elif env_name == 'DEV':
            return DevConfig()
        else:
            return LocalConfig()

class LocalConfig(Config):
    access_key_id = 'XXXXXXXXX'
    secret_access_key = 'XXXXXXXXXX'
    bucket_name = 'XXXXX'

# ... DevConfig, StgConfig, and PrdConfig follow the same pattern
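As a quick sanity check, the environment switch behaves like this (a sketch; setting BATCH_ENV in code is only for illustration, normally it comes from outside the program):

import os
import configs

os.environ['BATCH_ENV'] = 'DEV'
config = configs.Config.load(os.getenv('BATCH_ENV', 'LOCAL'))
print(type(config).__name__)  # -> DevConfig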
Next, the process that gets the file from S3. It does the following:

- create a storage location for the downloaded files
- get the list of file names on S3
- download them one by one
- read the words from the downloaded file

The file sits on S3 under a dated path, /YYYYMMDD/words.csv, so the code is a bit redundant. Whatever.

s3_operator.py
import os
from datetime import datetime
from pathlib import Path

import boto3
import pandas as pd

def getFromS3(config):
    # Create a storage destination for the files downloaded from S3
    date = datetime.now().strftime('%Y%m%d')
    dldir_name = os.path.abspath(__file__ + '/../../../tmp/files/fromS3/{}'.format(date))
    dldir_path = Path(dldir_name)
    dldir_path.mkdir(parents=True, exist_ok=True)
    download_dir = str(dldir_path.resolve())
    # Define the client used to connect to S3
    s3_client = boto3.client(
        's3',
        region_name=config.region_name,
        aws_access_key_id=config.access_key_id,
        aws_secret_access_key=config.secret_access_key,
    )
    # List the objects under the given path in the given bucket
    key_prefix = config.output_path + '/{}/'.format(date)
    response = s3_client.list_objects_v2(Bucket=config.bucket_name, Prefix=key_prefix)
    if response["KeyCount"] == 0:
        print('Object does not exist: {}'.format(key_prefix))
        return
    # Download the files one by one
    for obj in response["Contents"]:
        object_name = obj["Key"]
        download_file_name = os.path.basename(obj["Key"])
        if len(download_file_name) == 0:
            continue  # skip the "directory" key itself
        download_file_name = download_dir + '/' + download_file_name
        s3_client.download_file(Bucket=config.bucket_name, Key=object_name, Filename=download_file_name)
        print('Downloaded object "{}" to {}'.format(object_name, download_file_name))
    # Read the words
    download_file_name = download_dir + '/words.csv'
    return pd.read_csv(download_file_name).values.tolist()
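One caveat: list_objects_v2 returns at most 1,000 keys per call. The daily folder here is small, so the single call above is fine, but if it could ever grow past that, a paginator-based listing like the sketch below (my own helper, not part of the batch) would be safer.

def list_all_keys(s3_client, bucket, prefix):
    # The paginator follows ContinuationToken so no key is missed
    paginator = s3_client.get_paginator('list_objects_v2')
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            keys.append(obj['Key'])
    return keys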
Here is the process for sending files. This one is easy, because you just upload what you made.

s3_operator.py
import glob
import os
from datetime import datetime

import boto3

def sendToS3(config):
    date = datetime.now().strftime('%Y%m%d')
    s3_client = boto3.client(
        's3',
        region_name=config.region_name,
        aws_access_key_id=config.access_key_id,
        aws_secret_access_key=config.secret_access_key,
    )
    # Upload every file under tmp/files/toS3/YYYYMMDD/
    upfiles = glob.glob(os.path.abspath(__file__ + '/../../../tmp/files/toS3/{}/*'.format(date)))
    for f in upfiles:
        filename = os.path.split(f)[1]
        object_name = config.input_path + "/{}/{}".format(date, filename)
        s3_client.upload_file(f, config.bucket_name, object_name)
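If you want to confirm that an upload actually landed, head_object is a cheap check. The helper below is a hypothetical sketch, not part of the batch.

from botocore.exceptions import ClientError

def existsOnS3(s3_client, bucket, key):
    # head_object fetches metadata only; a ClientError (404) means the object is absent
    try:
        s3_client.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError:
        return False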
Now, with python run.py, you have a program that achieves the goal (pass --run_mode test to keep a trial run short).
But…

- you have to run it on your PC every week
- you have to solve the reCAPTCHA yourself
- you have to keep your PC on while it runs

So next time, I'll take care of those.