[Python] Save PDF from Google Colaboratory to Google Drive! -Let's collect data for machine learning-

1. First of all

Hello, this is HanBei.

Recently I've been hooked on machine learning and have been studying it.

"I want to do machine learning, but what kind of execution environment can I do?" I think some people think that, but this time Google Colaboratory and [Jupyter Notebook] ](Https://jupyter.org/) and other useful tools.

I am deeply grateful to all the engineers who enrich our lives every day.

1-1. Purpose

This time, I will use Google Colaboratory to scrape PDF data from the Web and download it to Google Drive.

1-2. Target (reason to read this article)

**I hope this will be useful for those who want to collect PDF data from the Web as data for machine learning.**

1-3. Attention

Scraping can be a **crime** if you do not use it correctly and in moderation.

"I want to do scraping, so don't worry about it!" For those who are optimistic or worried, we recommend that you read at least the two articles below.

  1. miyabisun: "Don't ask questions on the Q&A site about scraping methods"
  2. nezuq: "List of precautions for web scraping"

1-4. Items to check

This article explains how to scrape, but **I take no responsibility for how you use it**.

Think for yourself and use it with the right ethics.
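
One concrete check worth automating (my own addition, not something the articles above prescribe) is to consult the site's robots.txt before crawling. Python's standard library ships urllib.robotparser for exactly this:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.mhlw.go.jp/robots.txt")
rp.read()

#True means anonymous crawlers are allowed to fetch this path
print(rp.can_fetch("*", "https://www.mhlw.go.jp/stf/houdou/houdou_list_202005.html"))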

2. Preparation

2-1. Decide what to scrape

This time, the target is the data that the Ministry of Health, Labor and Welfare announced about COVID-19.

The page Current status of new coronavirus infection and response by the Ministry of Health, Labor and Welfare (Reiwa 2nd October 0th edition) is updated daily. On this page there is a PDF called Number of PCR test positives by prefecture in Japan (posted on October 0, 2020), and this is what I would like to download.

2-2. Make Google Colaboratory available

If you haven't created a Google account to use Google Colaboratory, create one.

How to create a new notebook ...

  1. Click Google Colaboratory and get started
  2. Create a new one from Google Drive

Reference: shoji9x9 "Summary of how to use Google Colab"

3. Practice

3-1. Importing the library

from bs4 import BeautifulSoup   #HTML parsing
import urllib
import urllib.request as req    #Fetching pages
import time                     #1-second wait between requests
import requests                 #Downloading the PDFs
import os                       #Building file paths

3-2. Go to the desired data while transitioning pages

Since the desired data is spread across multiple pages, we have to handle that.

For example, to reach the data for May 9, 2020, you have to follow the hierarchy "Press release materials / Press release materials / Current status of new coronavirus infection and response by the Ministry of Health, Labor and Welfare (Reiwa 2, May 9 edition)", so we drill down step by step from the very first page.

This time, the pages are reached in a rather roundabout way; a sketch of a more robust approach follows the code block below.


#Stores the press release page urls ("pr" stands for press release)
pr_url_list = []

#URL generation of press release materials by year and month
def get_base_url(year,month):
  #Base URL
  base_url = "https://www.mhlw.go.jp/stf/houdou/houdou_list_" + str(year) + "0" + str(month) + ".html"
  res = req.urlopen(base_url)
  soup = BeautifulSoup(res, "html.parser")

  #Specify a list of press release materials
  all_text = str(soup.find_all('ul', class_= 'm-listNews'))

  #The text acquired here is divided line by line and stored in the list.
  all_text_list = all_text.split("\n")

  #Read the list line by line and keep only the lines containing the target heading
  #(the live page is Japanese, so use the original Japanese heading when actually running this)
  for text in all_text_list:
      if "Current status of new coronavirus infection and response by the Ministry of Health, Labor and Welfare" in text:
        print("Number of lines: " + str(all_text_list.index(text)) + ", url: " + str(text))

        #Get the url one line before
        pr_url = all_text_list[all_text_list.index(text) - 1]
        #Slice the 5-digit page id out of the href by position (fragile; see the sketch after this block)
        date_pr_url = pr_url[26:31]

        PR_URL_LIST = "https://www.mhlw.go.jp/stf/newpage_" + date_pr_url + ".html"

        pr_url_list.append(PR_URL_LIST)
        print(PR_URL_LIST)

  #Apply 1 second rule for scraping
  time.sleep(1)
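
As an aside, the fixed-position slice pr_url[26:31] above breaks the moment the markup shifts. A more robust variant is to let BeautifulSoup walk the <a> tags and read their href attributes directly. The following is a minimal sketch, not the article's code: it assumes the links sit inside ul.m-listNews, and the match string is kept in English to mirror this translation, so substitute the original Japanese heading when running against the live page.

from urllib.parse import urljoin

def get_base_url_robust(year, month):
  #Same list page as above (zero-padded month, so valid for months 1-9)
  base_url = "https://www.mhlw.go.jp/stf/houdou/houdou_list_" + str(year) + "0" + str(month) + ".html"
  soup = BeautifulSoup(req.urlopen(base_url), "html.parser")

  #Walk the anchor tags instead of slicing raw HTML strings
  for a in soup.select("ul.m-listNews a"):
    if "Current status of new coronavirus infection" in a.get_text():
      #urljoin resolves relative hrefs such as "/stf/newpage_12345.html"
      pr_url_list.append(urljoin(base_url, a["href"]))

  #Apply 1 second rule for scraping
  time.sleep(1)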

3-3. Get the URL until May 2020

I don't know a smarter way, so I simply loop from January to May with a for statement.


#Search range: January through May 2020
year = 2020

for month in range(1, 6):
  get_base_url(year, month)

↓ If you execute up to this point, you can check the generated urls up to May. (screenshot: COVID_19_Colaboratory.png)

3-4. Get the PDF URL

#Stores the pdf urls by prefecture for each day
date_pt_urls = []
#(This list ends up unused; both helpers below append to date_pt_urls)
pdf_urls_march_april = []

#Get url for March and April
def march_april_get_pt_url(date_pt):
  date_pt_url = date_pt[19:37]

  PT_PDF_URL = "https://www.mhlw.go.jp/content/" + str(date_pt_url) + ".pdf"
  # print(PT_PDF_URL)

  date_pt_urls.append(PT_PDF_URL)

#Get the url for May (I regret changing the name partway through...)
def may_get_pt_url(date_pt):
  date_pt_url = date_pt[19:37]

  PT_PDF_URL = "https://www.mhlw.go.jp/content/" + str(date_pt_url) + ".pdf"
  print(PT_PDF_URL)

  date_pt_urls.append(PT_PDF_URL)
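
#Editor's note (not in the original article): the two helpers above are
#identical except for the print, so a single get_pt_url(date_pt, verbose=False)
#would cover both cases.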

for num in range(len(pr_url_list)):
  print(num)

  #Look at the url for each day in turn
  PR_URL = pr_url_list[num]
  res = req.urlopen(PR_URL)
  soup = BeautifulSoup(res, "html.parser")

  #Stringify the whole parsed page so it can be scanned line by line
  all_text = str(soup)

  #The text acquired here is divided line by line and stored in the list.
  all_text_list = all_text.split("\n")

  #Keep only the lines containing the March/April heading (again, Japanese on the live page)
  for text in all_text_list:
      if "Number of patient reports by prefecture in domestic cases" in text:
         march_april_get_pt_url(text)

  #Keep only the lines containing the May heading
  for text in all_text_list:
      if "Number of positive PCR tests by prefecture in Japan" in text:
         may_get_pt_url(text)

  #Apply 1 second rule for scraping
  time.sleep(1)

↓ By executing everything so far, you can check the urls of the pdfs. (screenshot: COVID_19_Colaboratory (1).png)

The obtained urls are sparse because the desired PDF is not posted every day.
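
If you want to weed out the dead urls before downloading, one option (a sketch, not part of the original article, and it assumes the server answers HEAD requests) is to keep only the urls that return status 200:

valid_pt_urls = []
for url in date_pt_urls:
  #HEAD fetches only the response headers, not the PDF body
  if requests.head(url).status_code == 200:
    valid_pt_urls.append(url)
  #Apply 1 second rule for scraping
  time.sleep(1)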

3-5. Upload PDF to Google Drive

#Mount Google Drive so that Colab can write into it
from google.colab import drive
drive.mount('/content/drive')

def download_pdf(url, file_path):
  response = requests.get(url, stream=True)

  #Request succeeded: write the PDF bytes to the given path
  if response.status_code == 200:
    with open(file_path, "wb") as file:
      file.write(response.content)
  else:
    #Some days have no PDF, so report failures instead of passing silently
    print("Failed: " + url + " (status " + str(response.status_code) + ")")

Check that it can be downloaded. Replace 〇〇〇 below with the folder and file name you created.


#PT_PDF_URL was local to the helper functions, so use one of the collected urls instead
download_pdf(url=date_pt_urls[0], file_path="drive/My Drive/Colab Notebooks/〇〇〇/〇〇〇.pdf")

3-6. Download PDF

Replace 〇〇〇 below with your own folder name.

#Specify the data save destination
google_drive_save_dir = "./drive/My Drive/Colab Notebooks/〇〇〇/pdf"
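#Create the destination folder first if it does not exist yet
#(editor's addition, not in the original article)
os.makedirs(google_drive_save_dir, exist_ok=True)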

for index, url in enumerate(date_pt_urls):
  file_name = "{}.pdf".format(index)

  print(file_name)
  pdf_path = os.path.join(google_drive_save_dir, file_name)
  print(pdf_path)

  download_pdf(url=url, file_path=pdf_path)

  #Apply 1 second rule for scraping here as well
  time.sleep(1)

When this is run, the PDFs are saved into the specified Drive folder.
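
As a quick sanity check (using the os module already imported above), you can list what was saved:

print(os.listdir(google_drive_save_dir))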

4. Having tried scraping

This was my first attempt at scraping, and it turned out to be quite difficult.

The site's HTML was one thing, the format of the downloaded PDFs another...

I don't usually write front-end code, so it was interesting to think about how to write code from a new perspective.

Having come this far, I stopped short of the machine learning itself. The reason: I realized only afterwards that I couldn't get the data I wanted this way (laughs). This lack of planning...

So I decided to publish this for anyone who is looking for exactly this content!

Thank you to everyone who has read this far.

I would be grateful for your comments and advice ^^
