Continuing from last time, we will now add the scraping PGM to the Cloud Source Repositories repository we created.
(1) Succeed in scraping the target locally for the time being.
(2) Link the local scraping results to a Google Spreadsheet.
(3) Run the scraping automatically with cron locally.
(4) Try free automatic execution on a cloud server. (Google Compute Engine)
(4)-1 Put the test PGM on the cloud and confirm it runs on Cloud Shell.
(4)-2 Add the scraping PGM to the repository and confirm it runs on Cloud Shell. ← Now here
(4)-3 Create a Compute Engine VM instance and have it execute the scraping automatically.
(5) Try free automatic execution without a server on the cloud. (Maybe Cloud Functions + Cloud Scheduler)
[1] Add the scraping PGM to the local repository
[2] Push to master on Cloud Source Repositories
[3] Pull from master into the clone on Cloud Shell
[4] Bulk-install the required modules using requirements.txt
[5] Run the scraping on Cloud Shell
Add the files to your local repository.
Mac zsh
11:28:14 [~] % cd gce-cron-test
11:28:25 [~/gce-cron-test] % ls -la
total 40
drwxr-xr-x 7 hoge staff 224 9 26 11:27 .
drwxr-xr-x+ 45 hoge staff 1440 9 23 16:45 ..
-rw-r--r--@ 1 hoge staff 6148 9 26 11:26 .DS_Store
drwxr-xr-x 13 hoge staff 416 9 23 16:49 .git
-rw-r--r-- 1 hoge staff 146 9 21 15:29 cron-test.py
-rw-r--r--@ 1 hoge staff 2352 9 16 17:54 my-web-hoge-app-hogehoge.json
-rw-r--r-- 1 hoge staff 2763 9 17 13:22 requests-test2.py
Make sure there are files that need to be committed, then add and commit.
Mac zsh
11:28:28 [~/gce-cron-test] % git status
On branch master
Your branch is up to date with 'origin/master'.
Untracked files:
(use "git add <file>..." to include in what will be committed)
.DS_Store
my-web-hoge-app-hogehoge.json
requests-test2.py
nothing added to commit but untracked files present (use "git add" to track)
11:28:34 [~/gce-cron-test] %
11:28:52 [~/gce-cron-test] %
11:28:53 [~/gce-cron-test] % git add .
11:28:58 [~/gce-cron-test] %
11:29:38 [~/gce-cron-test] %
11:29:38 [~/gce-cron-test] % git commit -m "Add requests-test to Cloud Source Repositories"
[master 44abc4d] Add requests-test to Cloud Source Repositories
3 files changed, 73 insertions(+)
create mode 100644 .DS_Store
create mode 100644 my-web-hoge-app-hogehoge.json
create mode 100644 requests-test2.py
Push to master.
Mac zsh
11:30:13 [~/gce-cron-test] %
11:30:23 [~/gce-cron-test] %
11:30:23 [~/gce-cron-test] % git push origin master
Enumerating objects: 6, done.
Counting objects: 100% (6/6), done.
Delta compression using up to 4 threads
Compressing objects: 100% (5/5), done.
Writing objects: 100% (5/5), 3.48 KiB | 891.00 KiB/s, done.
Total 5 (delta 0), reused 0 (delta 0)
To https://source.developers.google.com/p/my-gce-app/r/gce-cron-test
938ea70..44abc4d master -> master
11:31:37 [~/gce-cron-test] %
Pull into the repository we cloned on Cloud Shell last time.
cloudshell
cloudshell:09/26/20 02:54:33 ~/gce-cron-test $ git pull origin master
Confirm that the files have been added to the Cloud Shell repository. (I later added requirements.txt, which I had initially missed.)
cloudshell
cloudshell:09/26/20 02:55:06 ~/gce-cron-test $
cloudshell:09/26/20 02:55:06 ~/gce-cron-test $ ls -la
total 40
drwxr-xr-x 3 hoge hoge 4096 Sep 26 02:52 .
drwxr-xr-x 13 hoge rvm 4096 Sep 23 11:18 ..
-rw-r--r-- 1 hoge hoge 80 Sep 23 11:09 cron.log
-rw-r--r-- 1 hoge hoge 146 Sep 23 09:03 cron-test.py
-rw-r--r-- 1 hoge hoge 6148 Sep 26 02:47 .DS_Store
drwxr-xr-x 8 hoge hoge 4096 Sep 26 02:52 .git
-rw-r--r-- 1 hoge hoge 2352 Sep 26 02:47 my-web-hoge-app-hogehoge.json
-rw-r--r-- 1 hoge hoge 2763 Sep 26 02:47 requests-test2.py
-rw-r--r-- 1 hoge hoge 334 Sep 26 02:52 requirements.txt
Install the required modules in bulk using requirements.txt.
cloudshell
cloudshell:09/26/20 02:55:10 ~/gce-cron-test $ pip install -r requirements.txt
Check the list of installed packages with pip list. I generated requirements.txt locally on the Mac with "pip freeze > requirements.txt", so naturally all the necessary modules are included.
cloudshell
cloudshell:09/26/20 02:55:41 ~/gce-cron-test $ pip list
Package Version
-------------------- ---------
appdirs 1.4.4
beautifulsoup4 4.9.1
cachetools 4.1.1
certifi 2020.6.20
chardet 3.0.4
distlib 0.3.1
filelock 3.0.12
google-auth 1.21.0
google-auth-oauthlib 0.4.1
gspread 3.6.0
httplib2 0.18.1
idna 2.10
oauth2client 4.1.3
oauthlib 3.1.0
pip 20.1.1
pyasn1 0.4.8
pyasn1-modules 0.2.8
requests 2.24.0
requests-oauthlib 1.3.0
rsa 4.6
setuptools 47.1.0
six 1.15.0
soupsieve 2.0.1
urllib3 1.25.10
virtualenv 20.0.31
WARNING: You are using pip version 20.1.1; however, version 20.2.3 is available.
You should consider upgrading via the '/home/hoge/.pyenv/versions/3.8.5/bin/python3.8 -m pip install --upgrade pip' command.
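For reference, "pip freeze" pins the exact version of every installed package, one per line. Based on the pip list above, requirements.txt should contain entries like the excerpt below (this is my reconstruction, not a dump of the actual file).
requirements.txt (excerpt)
beautifulsoup4==4.9.1
google-auth==1.21.0
google-auth-oauthlib==0.4.1
gspread==3.6.0
oauth2client==4.1.3
requests==2.24.0
urllib3==1.25.10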
Try running the scraping PGM "requests-test2.py".
cloudshell
cloudshell:09/26/20 02:55:49 ~/gce-cron-test $ python requests-test2.py
Traceback (most recent call last):
File "requests-test2.py", line 40, in <module>
sheet = get_gspread_book(secret_key, book_name).worksheet(sheet_name)
File "requests-test2.py", line 20, in get_gspread_book
credentials = ServiceAccountCredentials.from_json_keyfile_name(secret_key, scope)
File "/home/hoge/.pyenv/versions/3.8.5/lib/python3.8/site-packages/oauth2client/service_account.py", line 219, in from_json_keyfile_name
with open(filename, 'r') as file_obj:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/hoge/git-repository/env2/my-web-hoge-app-hogehoge.json'
Oops, no such file. Not surprisingly, the absolute path from the local Mac environment was still hard-coded in the script. Locally I edit with VS Code, but here I will fix it in the Cloud Shell code editor.
cloudshell
cloudshell:09/26/20 02:55:55 ~/gce-cron-test $ pwd
/home/hoge/gce-cron-test
cloudshell:09/26/20 02:56:12 ~/gce-cron-test $ cloudshell open requests-test2.py
The "cloudshell open" command will bring up the code editor, so modify the json path.
Re-run it.
cloudshell
cloudshell:09/26/20 03:00:32 ~/gce-cron-test $
cloudshell:09/26/20 03:00:33 ~/gce-cron-test $ python requests-test2.py
2020/09/26 03:01:15 Finished scraping.
cloudshell:09/26/20 03:01:18 ~/gce-cron-test $
The scraping finished without a problem. For the full source, see "Beginners use Python for web scraping (2)". Note that the time on GCP defaults to UTC, so timestamps are 9 hours behind Tokyo time.
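If you would rather log the timestamp in Tokyo time, one option is to attach a UTC+9 timezone explicitly instead of relying on the machine's default. A minimal sketch using only the Python standard library:
Python
# Format the current time as JST (UTC+9) regardless of the machine's
# default timezone; no third-party modules required.
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9))
print(datetime.now(JST).strftime('%Y/%m/%d %H:%M:%S'))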
Next time, I will create a VM instance on Google Compute Engine, verify that the scraping works there, and try to run it automatically with cron.