As of the previous post, the scraper runs from start to finish without any manual steps once it is launched. Scheduling it with cron would run it every day, but that means leaving my PC on all the time, so I want to run it on a server instead.
The obstacle is that **this scraping requires a display**. I also tried Chrome's headless mode, but it didn't work.
So this time I decided to use Xvfb, a mechanism that creates a virtual display, even though it is a somewhat old technology.
Xvfb runs on Linux, so I decided to build everything in a Linux Docker container and eventually run it as a batch job on AWS Batch.
First, create the Docker image to use.
I will use CentOS, which came up often while I was researching Xvfb.
To begin, start from the existing CentOS image and install the required applications by hand.
mac(host)
docker pull centos  # Pull the CentOS image from Docker Hub
docker run -it -d centos  # Start a container in the background
docker ps  # Confirm it is running and get the container ID
docker exec -it b7948c7802eb /bin/bash  # Open a shell inside the container
Everything I need appears in the commands below (Python 3 and its packages, Xvfb, and Firefox), so I will install each one.
Trying things out inside the container
yum install -y python36  # Install Python 3.6
python3 -m pip install --upgrade pip  # Upgrade pip
pip install requests  # Try installing the required packages
...
yum -y install xorg-x11-server-Xvfb  # Install Xvfb
yum -y install firefox  # Install Firefox
Xvfb :1 -screen 0 1600x1200x16 &  # Start Xvfb on display :1 in the background
export DISPLAY=:1  # Use the display defined as :1
firefox  # Start Firefox
OK. I can't see the screen, but Firefox appears to be running...? Next, let's run my program here.
At this point I started writing the Dockerfile.
In practice, it may be more efficient to try things out first using the docker cp command.
FROM centos
# (1)
ENV TZ=JST-9
# Set the home directory
ENV HOME=/home
WORKDIR $HOME
# Copy my app (the app directory below) under the home directory
COPY . $HOME/
RUN yum install -y python36
RUN python3 -m pip install --upgrade pip
# (2)
RUN pip install -r app/requirements.txt
RUN yum -y install xorg-x11-server-Xvfb
RUN yum -y install firefox
RUN chmod 744 startup.sh
# (3)
CMD ["./startup.sh"]
(1) By default the container's time zone is apparently UTC. Time matters for this batch, so I change the time zone.
(2) At first I wrote one pip install line per package, but it was cleaner to consolidate them, so I listed the required packages in requirements.txt. The list is the output of the pip freeze command.
(3) I found that Xvfb has to be started at docker run time, so I wrote a shell script that groups the startup commands.
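For note (1), one way to see what a value like `JST-9` actually does is to set it inside a Python process and look at the resulting UTC offset. This is only a minimal sketch, assuming a Unix-like system where `time.tzset` is available; the helper name `utc_offset_hours` is hypothetical and not part of the app in this post.

```python
import os
import time


def utc_offset_hours(tz_value):
    """Return the UTC offset, in hours, produced by a POSIX-style TZ string.

    Hypothetical helper for illustration. It changes TZ for the whole
    process and relies on time.tzset(), which exists only on Unix.
    """
    os.environ["TZ"] = tz_value
    time.tzset()  # make the C library re-read TZ from the environment
    # time.timezone is expressed in seconds WEST of UTC, so negate it
    # to get the conventional "hours ahead of UTC" sign.
    return -time.timezone / 3600


print(utc_offset_hours("JST-9"))
```

Note that the sign in a POSIX TZ string is the reverse of the usual notation: `JST-9` means a zone named JST that is nine hours ahead of UTC.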
startup.sh
#!/usr/bin/env bash
Xvfb :1 -screen 0 1600x1200x16 &
export DISPLAY=:1
python3 app/source/run.py --run_mode test  # Test mode for now; I will switch to normal mode at the end
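Because startup.sh launches Xvfb in the background and moves straight on, the Python program could in principle start before the virtual display is ready. One possible guard, sketched here, is to wait for the X server's Unix socket, which X servers on Linux conventionally create at /tmp/.X11-unix/X followed by the display number. The function `wait_for_display` and its parameters are hypothetical names introduced for this sketch, not something from the original app.

```python
import os
import time


def wait_for_display(display=":1", timeout=10.0, interval=0.1,
                     socket_dir="/tmp/.X11-unix"):
    """Poll until the X socket for `display` exists.

    Returns True if the socket appeared before the timeout, else False.
    """
    # ":1", ":1.0" or "host:1.0" all map to display number "1".
    number = display.split(":")[-1].split(".")[0]
    socket_path = os.path.join(socket_dir, "X" + number)
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(socket_path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling something like `wait_for_display(os.environ.get("DISPLAY", ":1"))` near the top of run.py, and exiting with an error when it returns False, would make a display startup failure visible instead of surfacing later as a browser launch error.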
Other changes I made:
- Downloaded the geckodriver for Linux
- Changed the code to branch on the OS and choose which driver to use
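The OS-based driver choice could look roughly like the sketch below. It assumes that the plain `geckodriver` file is the macOS build used on the host and that `geckodriver_linux` is the one for the container; the function name `geckodriver_path` and the `DRIVER_NAMES` table are hypothetical, and the real code may be organized differently.

```python
import os
import platform

# Assumption: file names follow the app/drivers directory in this post.
DRIVER_NAMES = {
    "Darwin": "geckodriver",       # assumed to be the macOS build (host)
    "Linux": "geckodriver_linux",  # Linux build (Docker container)
}


def geckodriver_path(drivers_dir, system=None):
    """Return the path of the geckodriver binary for the current OS."""
    # platform.system() returns "Darwin" on macOS and "Linux" on Linux.
    system = system or platform.system()
    try:
        name = DRIVER_NAMES[system]
    except KeyError:
        raise RuntimeError("No geckodriver bundled for OS: " + system) from None
    return os.path.join(drivers_dir, name)
```

The returned path can then be handed to whatever creates the Firefox driver, so the same code runs unchanged on the Mac and inside the container.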
The final file structure looks like this.
├── Dockerfile
├── README.md
├── app
│   ├── drivers
│   │   ├── geckodriver
│   │   └── geckodriver_linux
│   ├── requirements.txt
│   └── source
│       ├── run.py
│       ├── scraping.py
│       ├── make_outputs.py
│       ├── s3_operator.py
│       └── configs.py
├── startup.sh
└── tmp
    ├── files
    │   ├── download
    │   ├── fromS3
    │   └── toS3
    └── logs
It can be built and run with the following commands.
docker build -t myapp .
docker run -it myapp
In reality it did not go this smoothly... I feel like I typed the commands above about 40 times.
But when it finally works, it is quite moving! I can't see the screen at all, but the console output appears and the files are in S3.
All that is left is to run it on AWS... the goal is in sight.