I was participating in a project to improve operations through RPA ... The RPA tool was ** too expensive (7.2 million yen / year) **, so I replaced it with python. It is a summary of the work at that time.
--Log in to a certain site, download the search results with specific word
, and get the data.
--The specific word
that becomes the INPUT is placed on Amazon S3 every day.
--The OUTPUT is to format the acquired data and place it in another position on S3.
――In addition, we have contacted the company of the site and obtained permission for scraping.
--ReCAPTCHA is installed in that specific site when you log in (!) --AnglarJS is used on that particular site --You have to click to download --You have to be able to do it every day
When I wrote only the conclusion (although I made a lot of trial and error), it became as follows. It was quite difficult ...
--Get data using python selenium --Break through reCAPTCHA using a service called "2captcha" --Docker containerization. Launch a virtual display using Xvfb and run the app --Launch daily using AWS batch
Story
We developed it according to the following flow.
I think I will divide Qiita on each page.
First, build a normal python environment!
This time we will use python3.6.3
. The terminal is the default bash
.
Considering that I want to use multiple versions of python in the future, I will create a dedicated python environment using pyenv
and pyenv-virtualenv
.
git clone https://github.com/pyenv/pyenv.git ~/.pyenv
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bash_profile
echo 'export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bash_profile
echo -e 'if command -v pyenv 1>/dev/null 2>&1; then\n eval "$(pyenv init -)"\nfi' >> ~/.bash_profile
Then install python3.6
pyenv install 3.6.3
I'm not sure if it's needed, but someone said it's good, so I'll include pyenv-virtualenv as well.
git clone https://github.com/pyenv/pyenv-virtualenv.git $(pyenv root)/plugins/pyenv-virtualenv
echo 'eval "$(pyenv virtualenv-init -)"' >> ~/.bash_profile
And create a virtualenv for this time.
pyenv virtualenv 3.6.3 myproject
Prepare the directory to be used for this project. (Although I actually changed this through trial and error)
mkdir myproject
cd myproject
pyenv local myproject
Under this myproject
, prepare folders and files as shown below.
├── app
│ ├── drivers selenium put drivers
│ └── source
│ └── scraping.py processing
└── tmp
├── files
│ └── download Place the file downloaded by scraping
└── logs logs(selenium log etc.)
Click here for more information. https://qiita.com/kamyu1201@github/items/a07c7d175c051b8ab4c0
Recommended Posts