Python web scraping for beginners (4)-1

The eventual goal is to put the scraping program from the previous article on the cloud and run it automatically. As a first step, this article puts a test program on the cloud and gets it running normally there.

Roadmap for learning web scraping in Python

(1) Get scraping of the target data working locally.
(2) Link the locally scraped results to a Google spreadsheet.
(3) Run the scraping automatically with cron locally.
(4) Try free automatic execution on a cloud server (Google Compute Engine).
(4)-1 Put the test program on the cloud and run it normally on Cloud Shell ← you are here
(4)-2 Add the scraping program to the repository and run it normally on Cloud Shell.
(4)-3 Create a Compute Engine VM instance and have it run the scraping automatically.
(5) Try free, serverless automatic execution on the cloud (probably Cloud Functions + Cloud Scheduler).

Steps to upload resources to GCP

(1) Create a git repository on GCP using git (GitHub account required)
(2) Create a clone locally
(3) Add the program you want to upload to GCP to the local repository and commit
(4) Push to master on GCP
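
The four steps above can be sketched as a sequence of commands. This is a minimal sketch rather than the author's own script: the helper names `build_upload_commands` and `run_commands` are hypothetical, and the repository name and commit message are simply the ones used in the transcripts in this article. By default it only prints the commands; it executes them only when `dry_run=False` is passed.

```python
import subprocess


def build_upload_commands(repo_name, commit_message):
    """Return (argv, cwd) pairs for steps (1) to (4).

    cwd is None for commands run from the current directory and
    repo_name for commands run inside the local clone.
    """
    return [
        # (1) Create the repository in Cloud Source Repositories.
        (["gcloud", "source", "repos", "create", repo_name], None),
        # (2) Clone it locally (creates ./<repo_name>).
        (["gcloud", "source", "repos", "clone", repo_name], None),
        # (3) Stage and commit whatever has been copied into the clone.
        (["git", "add", "."], repo_name),
        (["git", "commit", "-m", commit_message], repo_name),
        # (4) Push to master on GCP.
        (["git", "push", "origin", "master"], repo_name),
    ]


def run_commands(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv, cwd in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, cwd=cwd, check=True)


if __name__ == "__main__":
    run_commands(build_upload_commands(
        "gce-cron-test", "Add cron-test to Cloud Source Repositories"))
```

Note that the program file has to be copied into the clone between steps (2) and (3); the sketch does not do that for you, so running everything in one go with `dry_run=False` would have nothing to commit.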

(1) Create a git repository on GCP using git

If you do not have the Google Cloud SDK installed, install it. Make sure the gcloud command is configured for the desired project. (For a new project, set the project with the gcloud init command.)


16:03:04 [~] % gcloud config list
account = [email protected]
disable_usage_reporting = False
project = my-hoge-app

Your active configuration is: [default]
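
If you want to check the configured project from a script instead of by eye, the `key = value` lines in that output are easy to parse. The following is a minimal sketch under that assumption; `parse_gcloud_config` and `check_project` are hypothetical helper names, and the sample text is a shortened copy of the output shown above.

```python
def parse_gcloud_config(text):
    """Collect 'key = value' pairs from `gcloud config list` output."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" = ")
        if sep:  # skip blank lines, section headers, and status messages
            settings[key.strip()] = value.strip()
    return settings


def check_project(text, expected_project):
    """Return True when the configured project matches the expected one."""
    return parse_gcloud_config(text).get("project") == expected_project


if __name__ == "__main__":
    sample = """\
disable_usage_reporting = False
project = my-hoge-app

Your active configuration is: [default]
"""
    print(check_project(sample, "my-hoge-app"))
```

In real use, the text could be captured with `subprocess.run(["gcloud", "config", "list"], capture_output=True, text=True).stdout`.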

Create a new repository in Cloud Source Repositories.


16:41:59 [~] % 
16:42:00 [~] % gcloud source repos create gce-cron-test
Created [gce-cron-test].
WARNING: You may be billed for this repository. See for details.

An empty repository is created in the target project. [Screenshot 2020-09-24 21.47.24.png]

(2) Create a clone locally

Clone the repository you created in Cloud Source Repositories locally.


16:44:10 [~] % 
16:44:10 [~] % gcloud source repos clone gce-cron-test
Cloning into '/Users/hoge/gce-cron-test'...
warning: You appear to have cloned an empty repository.
Project [my-hoge-app] repository [gce-cron-test] was cloned to [/Users/hoge/gce-cron-test].

Here is the local repository after the .py file has been placed in it. The .git directory shows that it is a git repository.


16:46:15 [~] % 
16:46:15 [~] % cd gce-cron-test
16:46:44 [~/gce-cron-test] % ls -la
total 8
drwxr-xr-x   4 hoge  staff   128  9 23 16:45 .
drwxr-xr-x+ 45 hoge  staff  1440  9 23 16:45 ..
drwxr-xr-x   9 hoge  staff   288  9 23 16:45 .git
-rw-r--r--   1 hoge  staff   146  9 21 15:29
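
The same check can be done from Python. This is a simplified sketch with hypothetical helper names (`is_git_repo`, `python_files`): it only looks for a `.git` directory directly inside the given path, so it would not recognize setups where `.git` is a file, such as linked worktrees or submodules.

```python
from pathlib import Path


def is_git_repo(path):
    """Return True if path directly contains a .git directory."""
    return (Path(path) / ".git").is_dir()


def python_files(path):
    """Return the sorted names of the .py files directly inside path."""
    return sorted(p.name for p in Path(path).glob("*.py"))


if __name__ == "__main__":
    # Run from inside the cloned repository.
    print(is_git_repo("."))
    print(python_files("."))
```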

(3) Add the program you want to upload to GCP to the local repository and commit

Add the file to the index with the git add command, then commit it to your local repository with the git commit command.


16:47:21 [~/gce-cron-test] % 
16:47:21 [~/gce-cron-test] % git add .
16:48:03 [~/gce-cron-test] % 
16:48:04 [~/gce-cron-test] % git commit -m "Add cron-test to Cloud Source Repositories"
[master (root-commit) 938ea70] Add cron-test to Cloud Source Repositories
 1 file changed, 5 insertions(+)
 create mode 100644

(4) Push to master on GCP

Push to the master branch on Cloud Source Repositories.


16:50:15 [~/gce-cron-test] % 
16:50:15 [~/gce-cron-test] % git push origin master
Enumerating objects: 3, done.
Counting objects: 100% (3/3), done.
Delta compression using up to 4 threads
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 349 bytes | 116.00 KiB/s, done.
Total 3 (delta 0), reused 0 (delta 0)
 * [new branch]      master -> master

The push to master succeeded, and the commit message appears in the repository. [Screenshot 2020-09-24 21.30.53.png]

Verifying operation on Cloud Shell

Let's test it on Cloud Shell on GCP.

Select the desired project and launch Cloud Shell. [Screenshot 2020-09-25 16.53.01.png]

The terminal will start. [Screenshot 2020-09-25 16.53.41.png]

Clone the git repository from Cloud Source Repositories, just as you did locally.


cloudshell:09/25/20 02:59:00 ~ $ gcloud source repos clone gce-cron-test
Cloning into '/home/hoge/gce-cron-test'...
remote: Total 3 (delta 0), reused 3 (delta 0)
Unpacking objects: 100% (3/3), done.
Project [my-xxx-app] repository [gce-cron-test] was cloned to [/home/hoge/gce-cron-test].

The repository has been cloned.


cloudshell:09/25/20 03:01:49 ~ $ cd gce-cron-test
cloudshell:09/25/20 03:02:09 ~/gce-cron-test $ ls -la
total 20
drwxr-xr-x  3 hoge hoge 4096 Sep 23 10:59 .
drwxr-xr-x 13 hoge rvm  4096 Sep 23 11:18 ..
-rw-r--r--  1 hoge hoge  146 Sep 23 09:03
drwxr-xr-x  8 hoge hoge 4096 Sep 23 09:03 .git

Check the Python path and version. Python 3.8.5 comes pre-installed in this environment via pyenv.


cloudshell:09/25/20 03:02:21 ~/gce-cron-test $ which python
cloudshell:09/25/20 03:02:42 ~/gce-cron-test $ python -V
Python 3.8.5
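
The same information is available from inside Python, which can be handy when a script later runs somewhere you cannot type commands interactively. A minimal sketch using only the standard library; `interpreter_info` and `python_on_path` are hypothetical helper names.

```python
import platform
import shutil
import sys


def interpreter_info():
    """Return the running interpreter's path and version string."""
    return sys.executable, platform.python_version()


def python_on_path():
    """Return the path `which python` would report, or None if absent."""
    return shutil.which("python")


if __name__ == "__main__":
    path, version = interpreter_info()
    print(path)
    print("Python " + version)
    print(python_on_path())
```

The two paths can differ: `sys.executable` is the interpreter actually running the script, while `shutil.which("python")` is whatever the current PATH resolves to.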

As shown below, the program runs normally on Cloud Shell.


cloudshell:09/25/20 03:02:50 ~/gce-cron-test $ python
2020/09/25 03:03:11 cron works!
cloudshell:09/25/20 03:03:12 ~/gce-cron-test $
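
The test program itself is not reproduced in this article, so the following is only a guess at a script that would print a line in the format shown above. The function name `build_message` is hypothetical, and the timestamp format is inferred from the output.

```python
from datetime import datetime


def build_message(now=None):
    """Format a timestamp followed by the fixed text 'cron works!'."""
    if now is None:
        now = datetime.now()
    return now.strftime("%Y/%m/%d %H:%M:%S") + " cron works!"


if __name__ == "__main__":
    print(build_message())
```

Taking the time as an optional argument keeps the formatting testable without depending on the clock.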

However, crontab did not work. The Cloud Shell environment appears to accept only interactive commands.

Next time, I will add the scraping program to the repository and run it normally on Cloud Shell.

Bonus: About Cloud Shell

Cloud Shell is a development environment available on Google Cloud. It is a kind of virtual machine with a 5 GB disk, and it also provides a Theia-based code editor.

You can also open hidden files in the editor:


$ cloudshell edit $HOME/.bashrc

You can also download files:


$ cloudshell download $HOME/.bashrc


