Hello, this is HanBei.
Recently I have been hooked on machine learning and have been studying it.
Some of you may be wondering, "I want to try machine learning, but what kind of execution environment should I use?" These days there are plenty of handy tools for that, such as Google Colaboratory and [Jupyter Notebook](https://jupyter.org/).
I am deeply grateful to all the engineers who enrich our lives every day.
This time, I will use Google Colaboratory to scrape PDF data from the Web and download it to Google Drive.
**I hope this will be useful for anyone who wants to collect PDF data from the Web for machine learning.**
Scraping can be a **crime** if you do not use it correctly and in moderation.
If you are thinking "I just want to scrape, so I don't care about that!", or if you are at all unsure, I recommend reading at least the two articles below first.

- miyabisun: "Don't ask questions on Q&A sites about how to scrape"
- nezuq: "List of precautions for web scraping"

This article explains how to scrape, but **I take no responsibility for how you use it**.
Think for yourself and scrape with the right ethics.
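One concrete courtesy you can build into a script is checking the site's robots.txt before fetching anything. Below is a minimal sketch using only the standard library; the houdou_list URL is the monthly list page targeted later in this article, and passing this check is of course no substitute for reading the site's terms yourself.

```python
import urllib.robotparser

# Ask the site's robots.txt whether a generic crawler may fetch the target page
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.mhlw.go.jp/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://www.mhlw.go.jp/stf/houdou/houdou_list_202005.html"))
```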
This time, the target is the COVID-19 data published by the Ministry of Health, Labor and Welfare.
The page "Current status of the novel coronavirus infection and the response by the Ministry of Health, Labor and Welfare (Reiwa 2, 〇/〇 edition)" is updated daily. On that page there is a PDF called "Number of positive PCR tests by prefecture in Japan (posted 〇/〇, 2020)", and that is what I want to download.
If you haven't created a Google account to use Google Colaboratory, create one.
How to create a new notebook ...
```python
from bs4 import BeautifulSoup
import urllib
import urllib.request as req
import time
import requests
import os
```
The data we want is spread across multiple pages, so we have to work our way through them.
For example, to get the data for May 9, 2020, the site has the hierarchy "Press releases / Press release materials / Current status of the novel coronavirus infection and the response by the Ministry of Health, Labor and Welfare (Reiwa 2, May 9 edition)", so we start from the top-level list page and drill down step by step.
The way I pinpoint the pages here is rather roundabout; there are smarter approaches, so please treat this as just one reference (I sketch a more direct alternative right after the code below).
```python
# Stores the press-release page URLs ("pr" stands for press release)
pr_url_list = []

# Build and crawl the monthly press-release list page for a given year and month
def get_base_url(year, month):
    # Base URL (the "0" works because only single-digit months are searched)
    base_url = "https://www.mhlw.go.jp/stf/houdou/houdou_list_" + str(year) + "0" + str(month) + ".html"
    res = req.urlopen(base_url)
    soup = BeautifulSoup(res, "html.parser")
    # Narrow the page down to the list of press releases
    all_text = str(soup.find_all('ul', class_='m-listNews'))
    # Split the text obtained here line by line and store it in a list
    all_text_list = all_text.split("\n")
    # Read the list line by line and keep only the lines that partially match
    for text in all_text_list:
        if "Current status of new coronavirus infection and response by the Ministry of Health, Labor and Welfare" in text:
            print("Line number: " + str(all_text_list.index(text)) + ", url: " + str(text))
            # The URL sits on the line just before the matching one
            pr_url = all_text_list[all_text_list.index(text) - 1]
            # Slice out the 5-character page ID and build the press-release page URL
            date_pr_url = pr_url[26:31]
            PR_URL_LIST = "https://www.mhlw.go.jp/stf/newpage_" + date_pr_url + ".html"
            pr_url_list.append(PR_URL_LIST)
            print(PR_URL_LIST)
    # Follow the 1-second rule for scraping
    time.sleep(1)
```
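As one example of a less roundabout approach, BeautifulSoup can read the href attribute of each link directly instead of slicing fixed character positions out of the page text. The sketch below is only an illustration and untested against the actual markup; get_base_url_by_href is a made-up name, and it assumes the same m-listNews list and link text as the code above.

```python
from urllib.parse import urljoin

def get_base_url_by_href(year, month):
    # Same monthly list page as above
    base_url = "https://www.mhlw.go.jp/stf/houdou/houdou_list_" + str(year) + "0" + str(month) + ".html"
    soup = BeautifulSoup(req.urlopen(base_url), "html.parser")
    # Walk the press-release list and keep links whose text matches
    for ul in soup.find_all('ul', class_='m-listNews'):
        for a in ul.find_all('a'):
            if "Current status of new coronavirus infection and response by the Ministry of Health, Labor and Welfare" in a.get_text():
                # urljoin resolves relative hrefs such as /stf/newpage_XXXXX.html
                pr_url_list.append(urljoin(base_url, a.get("href")))
    # Follow the 1-second rule for scraping
    time.sleep(1)
```

It would be called in exactly the same way as get_base_url in the loop below.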
I could not think of a smarter way, so I simply loop over January through May with a for statement.
```python
# The search range is January through May 2020
year = 2020
for month in range(1, 6):
    get_base_url(year, month)
```
↓ If you run the code up to this point, you can confirm the press-release page URLs through May.
```python
# Stores the PDF URLs (by prefecture) for each date
date_pt_urls = []
# Stores the March/April PDF URLs (declared here but ends up unused; everything goes into date_pt_urls)
pdf_urls_march_april = []

# Get the PDF URLs for March and April
def march_april_get_pt_url(date_pt):
    # Slice out the 18-character content path from the matching line
    date_pt_url = date_pt[19:37]
    PT_PDF_URL = "https://www.mhlw.go.jp/content/" + str(date_pt_url) + ".pdf"
    # print(PT_PDF_URL)
    date_pt_urls.append(PT_PDF_URL)

# Get the PDF URLs for May (I did not rename things halfway through...)
def may_get_pt_url(date_pt):
    # Slice out the 18-character content path from the matching line
    date_pt_url = date_pt[19:37]
    PT_PDF_URL = "https://www.mhlw.go.jp/content/" + str(date_pt_url) + ".pdf"
    print(PT_PDF_URL)
    date_pt_urls.append(PT_PDF_URL)

for num in range(len(pr_url_list)):
    print(num)
    # Visit the press-release page for each day in turn
    PR_URL = pr_url_list[num]
    res = req.urlopen(PR_URL)
    soup = BeautifulSoup(res, "html.parser")
    # Collect every link on the page as text
    all_text = str(soup.find_all('a'))
    # Split the text obtained here line by line and store it in a list
    all_text_list = all_text.split("\n")
    # Read the list line by line and keep only the lines that partially match
    for text in all_text_list:
        if "Number of patient reports by prefecture in domestic cases" in text:
            march_april_get_pt_url(text)
    # Read the list line by line and keep only the lines that partially match
    for text in all_text_list:
        if "Number of positive PCR tests by prefecture in Japan" in text:
            may_get_pt_url(text)
    # Follow the 1-second rule for scraping
    time.sleep(1)
```
↓ If you run the code up to this point, you can check the PDF URLs.
The list of URLs you get is sparse, because the PDF we want is not published every single day.
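Incidentally, march_april_get_pt_url and may_get_pt_url differ only in whether they print the URL, so they could be folded into a single helper. A small sketch; get_pt_url and its verbose flag are illustrative names, not part of the original code.

```python
def get_pt_url(date_pt, verbose=False):
    # Slice out the 18-character content path, exactly as the two helpers above do
    date_pt_url = date_pt[19:37]
    pt_pdf_url = "https://www.mhlw.go.jp/content/" + str(date_pt_url) + ".pdf"
    if verbose:
        print(pt_pdf_url)
    date_pt_urls.append(pt_pdf_url)
```

The two calls in the loop would then become get_pt_url(text) and get_pt_url(text, verbose=True).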
```python
# Mount Google Drive so the notebook can write files to it
from google.colab import drive
drive.mount('/content/drive')
```
```python
# Download a single PDF from `url` and save it to `file_path`
def download_pdf(url, file_path):
    response = requests.get(url, stream=True)
    # Save the file only if the request succeeded
    if response.status_code == 200:
        with open(file_path, "wb") as file:
            file.write(response.content)
```
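The PDFs here are small, so writing response.content in one go is fine. If you ever download larger files, requests can also stream the body to disk in chunks; below is a minimal variant of the same function, with download_pdf_streaming as a purely illustrative name.

```python
def download_pdf_streaming(url, file_path):
    response = requests.get(url, stream=True)
    # Save the file only if the request succeeded
    if response.status_code == 200:
        with open(file_path, "wb") as file:
            # Write the body in 8 KB chunks instead of holding it all in memory
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)
```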
Check that a single download works. Replace 〇〇〇 below with the folder name and file name you created in Google Drive.

```python
# Try the download with the first PDF URL that was collected
download_pdf(url=date_pt_urls[0], file_path="drive/My Drive/Colab Notebooks/〇〇〇/〇〇〇.pdf")
```

In the code below as well, replace 〇〇〇 with your own folder name.
```python
# Where to save the downloaded PDFs
google_drive_save_dir = "./drive/My Drive/Colab Notebooks/〇〇〇/pdf"
# Make sure the destination folder exists before writing into it
os.makedirs(google_drive_save_dir, exist_ok=True)

for index, url in enumerate(date_pt_urls):
    # Name the files 0.pdf, 1.pdf, ... in the order they were collected
    file_name = "{}.pdf".format(index)
    print(file_name)
    pdf_path = os.path.join(google_drive_save_dir, file_name)
    print(pdf_path)
    download_pdf(url=url, file_path=pdf_path)
```
When you run this, the PDFs are saved to the specified folder in Google Drive.
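If you want to confirm from the notebook that the files really landed in Drive, listing the destination folder is enough; this simply reuses the google_drive_save_dir path defined above.

```python
# List the PDFs that were saved to the destination folder
for name in sorted(os.listdir(google_drive_save_dir)):
    print(name)
```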
This was my first attempt at scraping, and it was harder than I expected.
The site's HTML is ..., and the format of the downloaded PDFs is ...
I do not usually write front-end code, so it was interesting to think about how to write code from a new perspective.
After getting this far, I actually stopped the machine learning part. The reason is that I later realized I could not get the data I wanted (laughs). Such poor planning ...
So I decided to publish this anyway, for anyone who is looking for exactly this kind of content!
Thank you to everyone who has read this far.
I would be grateful for any comments and advice ^^