I didn't know the first thing about the cloud or about Python, and it's been a month since I started studying Python + GCP. I got interested in web scraping with Python, and while learning how to use requests, the various attributes of the response object it returns, and HTML parsing with BeautifulSoup, I decided to start by scraping Yahoo News.
(1) Get the scraping I want working locally. ← Now here
(2) Send the locally scraped results to a Google Spreadsheet.
(3) Run it automatically on my local machine with cron.
(4) Try running it automatically, for free, on a cloud server (Google Compute Engine).
(5) Try running it automatically, for free, without a server in the cloud (maybe Cloud Functions + Cloud Scheduler).
- Get the website content using requests
- Parse the HTML with Beautiful Soup
- Use the re library (regular-expression string search) to find a specific string and identify the headline news
- Print every news title and link from the resulting list to the console
requests is an external library for HTTP communication in Python. It makes collecting information from websites simple. You can also fetch URLs with urllib from the Python standard library, but requests needs less code and reads more simply. However, since it is a third-party library, it has to be installed.
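As a rough comparison, here is a minimal sketch of fetching the same page with the standard library urllib and with requests (using the Yahoo News URL from later in this article; error handling is omitted):

python
import requests
from urllib.request import urlopen

url = 'https://news.yahoo.co.jp/'

# urllib: you get raw bytes back and have to decode them yourself
with urlopen(url) as res:
    html_urllib = res.read().decode('utf-8')

# requests: one call, and .text is already decoded using the guessed encoding
html_requests = requests.get(url).text

print(len(html_urllib), len(html_requests))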
requests can be installed with pip. Here is the clean state of a virtual environment created with virtualenv.
bash
$ virtualenv -p python3.7 env3
% source env3/bin/activate
(env3) % pip list
Package Version
---------- -------
pip 20.2.3
setuptools 49.2.1
wheel 0.34.2
Install it with pip, then check pip list to confirm it is there (and its version). Its dependencies are pulled in along with it.
bash
(env3) % pip install requests
Collecting requests
Using cached requests-2.24.0-py2.py3-none-any.whl (61 kB)
Collecting idna<3,>=2.5
Using cached idna-2.10-py2.py3-none-any.whl (58 kB)
Collecting chardet<4,>=3.0.2
Using cached chardet-3.0.4-py2.py3-none-any.whl (133 kB)
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1
Using cached urllib3-1.25.10-py2.py3-none-any.whl (127 kB)
Collecting certifi>=2017.4.17
Using cached certifi-2020.6.20-py2.py3-none-any.whl (156 kB)
Installing collected packages: idna, chardet, urllib3, certifi, requests
Successfully installed certifi-2020.6.20 chardet-3.0.4 idna-2.10 requests-2.24.0 urllib3-1.25.10
(env3) % pip list
Package Version
---------- ---------
certifi 2020.6.20
chardet 3.0.4
idna 2.10
pip 20.2.3
requests 2.24.0
setuptools 49.2.1
urllib3 1.25.10
wheel 0.34.2
requests supports the common HTTP request methods: get, post, put, delete, and so on. This time we will use get.
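For reference, the other methods are called in exactly the same way and also return a response object. A minimal sketch (httpbin.org is just a public echo service used for testing, not part of this article's target):

python
import requests

r_get = requests.get('https://httpbin.org/get')
r_post = requests.post('https://httpbin.org/post', data={'key': 'value'})
r_put = requests.put('https://httpbin.org/put', data={'key': 'value'})
r_delete = requests.delete('https://httpbin.org/delete')
print(r_get.status_code, r_post.status_code, r_put.status_code, r_delete.status_code)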
The response object returned by requests.get has various attributes. In this sample program, the following attributes were checked with print.
Attribute | What you can get |
---|---|
url | The URL that was actually accessed. |
status_code | The status code (HTTP status). |
headers | The response headers. |
encoding | The encoding that requests guessed. |
In addition, there are the text attribute (the response body decoded to a string) and the content attribute (the raw bytes).
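Roughly speaking, text is the body decoded to a str using the guessed encoding, while content is the raw bytes as received. A minimal sketch to see the difference:

python
import requests

response = requests.get('https://news.yahoo.co.jp/')
print(type(response.text))     # <class 'str'>   (decoded with response.encoding)
print(type(response.content))  # <class 'bytes'> (raw body as received)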
The headers attribute is a dict (dictionary), and for Yahoo News it contains many keys, as shown below, so in the sample program only the 'Content-Type' key is extracted from headers and printed.
bash
{'Cache-Control': 'private, no-cache, no-store, must-revalidate', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html;charset=UTF-8', 'Date': 'Wed, 09 Sep 2020 02:24:04 GMT', 'Set-Cookie': 'B=6rffcc5flgf64&b=3&s=sv; expires=Sat, 10-Sep-2022 02:24:04 GMT; path=/; domain=.yahoo.co.jp, XB=6rffcc5flgf64&b=3&s=sv; expires=Sat, 10-Sep-2022 02:24:04 GMT; path=/; domain=.yahoo.co.jp; secure; samesite=none', 'Vary': 'Accept-Encoding', 'X-Content-Type-Options': 'nosniff', 'X-Download-Options': 'noopen', 'X-Frame-Options': 'DENY', 'X-Vcap-Request-Id': 'd130bb1e-4e53-4738-4b02-8419633dd825', 'X-Xss-Protection': '1; mode=block', 'Age': '0', 'Server': 'ATS', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Via': 'http/1.1 edge2821.img.kth.yahoo.co.jp (ApacheTrafficServer [c sSf ])'}
Here is the source excerpt that calls requests.get and prints each attribute of the returned response object.
python
url = 'https://news.yahoo.co.jp/'
response = requests.get(url)
#print(response.text)
print('url: ',response.url)
print('status-code:',response.status_code) #HTTP status code, usually[200 OK]
print('headers[Content-Type]:',response.headers['Content-Type']) #Since headers is a dictionary, you can specify the key to content-type output
print('encoding: ',response.encoding) #encoding
Here are the results.
bash
(env3) % python requests-test.py
url: https://news.yahoo.co.jp/
status-code: 200
headers[Content-Type]: text/html;charset=UTF-8
encoding: UTF-8
Beautiful Soup is a library for web scraping in Python. It can pull data out of HTML and XML files and parse them, and it makes it easy to extract a specific HTML tag.
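As a minimal sketch of what "extracting a specific tag" looks like (a made-up HTML snippet, not the actual Yahoo News page):

python
from bs4 import BeautifulSoup

html = '<html><body><h1>Headline</h1><a href="/pickup/1">First news</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.h1.text)            # Headline
print(soup.find('a')['href'])  # /pickup/1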
Like requests, it can be installed with pip.
bash
(env3) % pip install beautifulsoup4
Collecting beautifulsoup4
Using cached beautifulsoup4-4.9.1-py3-none-any.whl (115 kB)
Collecting soupsieve>1.2
Using cached soupsieve-2.0.1-py3-none-any.whl (32 kB)
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.9.1 soupsieve-2.0.1
(env3) % pip list
Package Version
-------------- ---------
beautifulsoup4 4.9.1
certifi 2020.6.20
chardet 3.0.4
idna 2.10
pip 20.2.3
requests 2.24.0
setuptools 49.2.1
soupsieve 2.0.1
urllib3 1.25.10
wheel 0.34.2
BeautifulSoup takes the object to be parsed (HTML or XML) as its first argument (in this sample, the text of the response obtained with requests) and the parser to use as its second argument.
Parser | Example of use | Strengths | Weaknesses |
---|---|---|---|
Python’s html.parser | BeautifulSoup(response.text, "html.parser") | Standard library | Not very lenient on versions before Python 2.7.3 / 3.2.2 |
lxml’s HTML parser | BeautifulSoup(response.text, "lxml") | Blazing fast | Requires installation |
lxml’s XML parser | BeautifulSoup(response.text, "xml") | Blazing fast. The only XML parser | Requires installation |
html5lib | BeautifulSoup(response.text, "html5lib") | Can handle HTML5 correctly | Requires installation. Very slow |
python
soup = BeautifulSoup(response.text, "html.parser")
BeautifulSoup has various methods, but this time we will use the find_all method. find_all itself accepts various arguments, but this time we will use keyword arguments.
By specifying a tag attribute as a keyword argument, you can get the information of every tag that matches it.
The value of a keyword argument can be a string, a regular expression, a list, a function, or True, and you can combine multiple keyword arguments (a few variants are sketched after the example below).
For example, if you pass a value for href as a keyword argument, Beautiful Soup will filter on the href attribute of each HTML tag.
Quote: https://ai-inter1.com/beautifulsoup_1/#find_all_detail
In other words, by calling find_all on the soup object with the condition "the value of the href attribute matches the specified regular expression", the example below extracts every element whose href attribute contains "news.yahoo.co.jp/pickup".
elems = soup.find_all(href = re.compile("news.yahoo.co.jp/pickup"))
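Besides a regular expression, the keyword argument value can also be a plain string or True, and several keywords can be combined, as mentioned above. A minimal sketch against the same soup object (the id value is hypothetical, not actual Yahoo News markup):

python
# Variations of find_all keyword arguments
any_links = soup.find_all(href=True)                            # every tag that has an href attribute
by_id = soup.find_all(id="contentsWrap")                        # exact string match on the id attribute (hypothetical id)
pickup_anchors = soup.find_all("a", href=re.compile("pickup"))  # tag name plus a regex on href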
Finally, loop over the results with a for statement and print the title and link of each extracted news item to the console. Here is the final sample source.
requests-test.py
import requests
from bs4 import BeautifulSoup
import re
#Download website information using requests
url = 'https://news.yahoo.co.jp/'
response = requests.get(url)
#print(response.text)
print('url: ',response.url)
print('status-code:',response.status_code) #HTTP status code, usually[200 OK]
print('headers[Content-Type]:',response.headers['Content-Type']) #Since headers is a dictionary, you can specify the key to content-type output
print('encoding: ',response.encoding) #encoding
# Pass the fetched website text and the "html.parser" parser to BeautifulSoup()
soup = BeautifulSoup(response.text, "html.parser")
# Extract only the elements whose href attribute contains "news.yahoo.co.jp/pickup"
elems = soup.find_all(href = re.compile("news.yahoo.co.jp/pickup"))
#The title and link of the extracted news are displayed on the console.
for elem in elems:
    print(elem.contents[0])
    print(elem.attrs['href'])
The program itself is pretty much lifted from the site listed in the references. It was a great help.
Excluding the imports and the prints of the response attributes that are only there for confirmation, you can do this web scraping in just 7 lines. Python and the libraries built by those who came before us are terrifyingly powerful.
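For reference, those 7 lines (with the imports kept so the sketch runs on its own) boil down to this:

python
import requests
from bs4 import BeautifulSoup
import re

url = 'https://news.yahoo.co.jp/'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
elems = soup.find_all(href=re.compile("news.yahoo.co.jp/pickup"))
for elem in elems:
    print(elem.contents[0])
    print(elem.attrs['href'])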
Here are the results. I was able to scrape, for the time being! The last item, the photo news, is extraneous, but I don't know what to do about it yet, so I'll leave it as it is...
bash
% python requests-test.py
url: https://news.yahoo.co.jp/
status-code: 200
headers[Content-Type]: text/html;charset=UTF-8
encoding: UTF-8
Docomo account cooperation silver majority suspension
https://news.yahoo.co.jp/pickup/6370639
Mr. Suga Corrected remarks about the Self-Defense Forces
https://news.yahoo.co.jp/pickup/6370647
Flooded strawberry farmer suffering for 3 consecutive years
https://news.yahoo.co.jp/pickup/6370631
Two people died when four people got on the sea
https://news.yahoo.co.jp/pickup/6370633
Mulan shooting in Xinjiang Repulsion again
https://news.yahoo.co.jp/pickup/6370640
Parents suffer from prejudice panic disorder
https://news.yahoo.co.jp/pickup/6370643
Taku Hiraoka Defendant imprisonment for 2 years and 6 months
https://news.yahoo.co.jp/pickup/6370646
Iseya suspect seized 500 rolls
https://news.yahoo.co.jp/pickup/6370638
<span class="topics_photo_img" style="background-image:url(https://lpt.c.yimg.jp/amd/20200909-00000031-asahi-000-view.jpg)"></span>
https://news.yahoo.co.jp/pickup/6370647
Reference sites:
- https://requests-docs-ja.readthedocs.io/en/latest/
- https://ai-inter1.com/beautifulsoup_1/
- http://kondou.com/BS4/