[DOCKER] Automatic scraping of reCAPTCHA site every day (1/7: python environment construction)

I was participating in a project to improve operations through RPA, but the RPA tool was **too expensive (7.2 million yen/year)**, so I replaced it with Python. This is a summary of that work.

Requirements

Overview

- Log in to a certain site, search for a specific word, download the results, and extract the data.
- The specific word used as INPUT is placed on Amazon S3 every day.
- The OUTPUT is the acquired data, formatted and placed in a different location on S3.
- Note that we contacted the company that runs the site and obtained permission for scraping.

Difficulty

- reCAPTCHA is shown at login on that site (!)
- The site is built with AngularJS
- The download has to be triggered by clicking
- It has to run every day

Technology selection

To state only the conclusion (reached after a lot of trial and error), the setup ended up as follows. It was quite difficult ...

- Get the data using Python selenium
- Break through reCAPTCHA using a service called "2captcha"
- Containerize with Docker: launch a virtual display using Xvfb and run the app on it
- Launch daily using AWS Batch

Story

We developed it according to the following flow.

  1. Build python environment
  2. Create a site scraping mechanism
  3. Process the downloaded file (xls) to create the final product (csv)
  4. Create a file download from S3 / file upload to S3
  5. Implement 2captcha
  6. Allow it to start in a Docker container
  7. Register for AWS Batch
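
As a rough picture of how these steps fit together, the daily job can be thought of as a chain of small functions. This is only a sketch under assumptions: every function name below is hypothetical, each body is a placeholder that touches neither S3 nor the site, and "one xls per search word" is assumed purely for illustration.

```python
def fetch_keywords_from_s3():
    """Placeholder: the real step would download the day's search words from S3."""
    return ["example-word"]


def scrape_site(keywords):
    """Placeholder: the real step would log in, solve reCAPTCHA, and download xls files."""
    # Assumption for illustration: one xls file per search word.
    return ["{}.xls".format(word) for word in keywords]


def convert_to_csv(xls_files):
    """Placeholder: the real step would reshape each xls into the final csv."""
    return [name[:-4] + ".csv" for name in xls_files]


def upload_to_s3(csv_files):
    """Placeholder: the real step would upload; here it only counts the files."""
    return len(csv_files)


def run_daily_job():
    """Run the steps in order and return the number of files produced."""
    keywords = fetch_keywords_from_s3()
    xls_files = scrape_site(keywords)
    csv_files = convert_to_csv(xls_files)
    return upload_to_s3(csv_files)


if __name__ == "__main__":
    print("files produced:", run_daily_job())
```

Keeping each step behind its own function makes it possible to develop and test them one at a time, which matches the order of the posts in this series.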

I plan to split these steps across separate Qiita posts.

1. Environment construction

First, build a normal python environment!

This time we will use Python 3.6.3. The shell is the default bash. Since I may want to use multiple versions of Python in the future, I will create a dedicated Python environment using pyenv and pyenv-virtualenv.

Install pyenv

git clone https://github.com/pyenv/pyenv.git ~/.pyenv
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bash_profile
echo 'export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bash_profile
echo -e 'if command -v pyenv 1>/dev/null 2>&1; then\n  eval "$(pyenv init -)"\nfi' >> ~/.bash_profile

Then install Python 3.6.3.

pyenv install 3.6.3

Install pyenv-virtualenv

I'm not sure if it's needed, but someone said it's good, so I'll include pyenv-virtualenv as well.

git clone https://github.com/pyenv/pyenv-virtualenv.git $(pyenv root)/plugins/pyenv-virtualenv
echo 'eval "$(pyenv virtualenv-init -)"' >> ~/.bash_profile

Then create a virtualenv for this project.

pyenv virtualenv 3.6.3 myproject

Creating a directory

Prepare the directory to be used for this project. (In practice, I revised this layout several times through trial and error.)

mkdir myproject
cd myproject
pyenv local myproject
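
To confirm that the virtualenv is really the interpreter in use inside the project directory, a small check like the following may help. It is only a sketch: the helper name `is_expected_python` is hypothetical, and the expected version (3.6) is taken from the install step above.

```python
import sys


def is_expected_python(version_info, expected=(3, 6)):
    """Return True when the major/minor version matches `expected`."""
    return tuple(version_info[:2]) == tuple(expected)


if __name__ == "__main__":
    # Show which interpreter is running and its version.
    print("interpreter:", sys.executable)
    print("version:", sys.version.split()[0])
    if not is_expected_python(sys.version_info):
        print("warning: not running Python 3.6; check `pyenv local`")
```

If the printed interpreter path is not under `~/.pyenv`, the lines added to `~/.bash_profile` have probably not been loaded yet; opening a new login shell is worth trying first.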

Under this myproject, prepare folders and files as shown below.

├── app
│   ├── drivers              put the selenium drivers here
│   └── source
│       └── scraping.py      processing
└── tmp
    ├── files
    │   └── download         place the files downloaded by scraping
    └── logs                 logs (selenium log etc.)
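
The layout above can also be created with a short script instead of by hand. This is a minimal sketch, not part of the original procedure: the function name `create_layout` and the `PROJECT_DIRS` list are hypothetical, and `scraping.py` is created as an empty file to be filled in later.

```python
from pathlib import Path

# Directories from the layout above, relative to the project root.
PROJECT_DIRS = [
    "app/drivers",
    "app/source",
    "tmp/files/download",
    "tmp/logs",
]


def create_layout(root):
    """Create the project directories under `root` plus an empty scraping.py."""
    root = Path(root)
    for rel in PROJECT_DIRS:
        # parents=True builds intermediate folders; exist_ok=True makes reruns safe.
        (root / rel).mkdir(parents=True, exist_ok=True)
    script = root / "app" / "source" / "scraping.py"
    script.touch(exist_ok=True)
    return root


if __name__ == "__main__":
    # Run from inside the myproject directory.
    create_layout(".")
```

Because existing directories and files are left alone, the script can be rerun safely after the layout changes.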

Click here for more information. https://qiita.com/kamyu1201@github/items/a07c7d175c051b8ab4c0
