I didn't know the first thing about the cloud or about Python, and it's been a month since I started studying Python + GCP. I got interested in web scraping with Python, and while learning how to use requests, the various attributes of the response object it returns, and HTML parsing with BeautifulSoup, I decided to start by scraping Yahoo News.
(1) Get the scraping I want working locally. ← Now here
(2) Send the locally scraped results to a Google Spreadsheet.
(3) Run it automatically on my local machine with cron.
(4) Try running it automatically, for free, on a cloud server (Google Compute Engine).
(5) Try running it automatically, for free, without a server in the cloud (maybe Cloud Functions + Cloud Scheduler).
- Get the website content using requests
- Parse the HTML with Beautiful Soup
- Use the re library (regular-expression string search) to find a specific string and identify the headline news
- Print every news title and link from the resulting list to the console
requests is an external library for HTTP communication in Python. It makes collecting information from websites simple. You can also fetch URLs with urllib from the Python standard library, but requests needs less code and reads more simply. However, since it is a third-party library, it has to be installed.
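As a rough comparison, here is a minimal sketch of fetching the same page with the standard library urllib and with requests (using the Yahoo News URL from later in this article; error handling is omitted):

python
import requests
from urllib.request import urlopen

url = 'https://news.yahoo.co.jp/'

# urllib: you get raw bytes back and have to decode them yourself
with urlopen(url) as res:
    html_urllib = res.read().decode('utf-8')

# requests: one call, and .text is already decoded using the guessed encoding
html_requests = requests.get(url).text

print(len(html_urllib), len(html_requests))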
requests can be installed with pip. Here is the clean state of a virtual environment created with virtualenv.
bash
$ virtualenv -p python3.7 env3
% source env3/bin/activate
(env3) % pip list
Package Version
---------- -------
pip 20.2.3
setuptools 49.2.1
wheel 0.34.2
Install it with pip, then check pip list to confirm it is there (and its version). Its dependencies are pulled in along with it.
bash
(env3) % pip install requests
Collecting requests
Using cached requests-2.24.0-py2.py3-none-any.whl (61 kB)
Collecting idna<3,>=2.5
Using cached idna-2.10-py2.py3-none-any.whl (58 kB)
Collecting chardet<4,>=3.0.2
Using cached chardet-3.0.4-py2.py3-none-any.whl (133 kB)
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1
Using cached urllib3-1.25.10-py2.py3-none-any.whl (127 kB)
Collecting certifi>=2017.4.17
Using cached certifi-2020.6.20-py2.py3-none-any.whl (156 kB)
Installing collected packages: idna, chardet, urllib3, certifi, requests
Successfully installed certifi-2020.6.20 chardet-3.0.4 idna-2.10 requests-2.24.0 urllib3-1.25.10
(env3) % pip list
Package Version
---------- ---------
certifi 2020.6.20
chardet 3.0.4
idna 2.10
pip 20.2.3
requests 2.24.0
setuptools 49.2.1
urllib3 1.25.10
wheel 0.34.2
requests supports the common HTTP request methods: get, post, put, delete, and so on. This time we will use get.
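For reference, the other methods are called in exactly the same way and also return a response object. A minimal sketch (httpbin.org is just a public echo service used for testing, not part of this article's target):

python
import requests

r_get = requests.get('https://httpbin.org/get')
r_post = requests.post('https://httpbin.org/post', data={'key': 'value'})
r_put = requests.put('https://httpbin.org/put', data={'key': 'value'})
r_delete = requests.delete('https://httpbin.org/delete')
print(r_get.status_code, r_post.status_code, r_put.status_code, r_delete.status_code)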
The response object returned by requests.get has various attributes. In this sample program, the following attributes were checked with print.
Attribute | What you can get |
---|---|
url | The URL that was actually accessed. |
status_code | The status code (HTTP status). |
headers | The response headers. |
encoding | The encoding that requests guessed. |
In addition, there are the text attribute (the response body decoded to a string) and the content attribute (the raw bytes).
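Roughly speaking, text is the body decoded to a str using the guessed encoding, while content is the raw bytes as received. A minimal sketch to see the difference:

python
import requests

response = requests.get('https://news.yahoo.co.jp/')
print(type(response.text))     # <class 'str'>   (decoded with response.encoding)
print(type(response.content))  # <class 'bytes'> (raw body as received)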
The headers attribute is a dict (dictionary), and for Yahoo News it contains many keys, as shown below, so in the sample program only the 'Content-Type' key is extracted from headers and printed.
bash
{'Cache-Control': 'private, no-cache, no-store, must-revalidate', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html;charset=UTF-8', 'Date': 'Wed, 09 Sep 2020 02:24:04 GMT', 'Set-Cookie': 'B=6rffcc5flgf64&b=3&s=sv; expires=Sat, 10-Sep-2022 02:24:04 GMT; path=/; domain=.yahoo.co.jp, XB=6rffcc5flgf64&b=3&s=sv; expires=Sat, 10-Sep-2022 02:24:04 GMT; path=/; domain=.yahoo.co.jp; secure; samesite=none', 'Vary': 'Accept-Encoding', 'X-Content-Type-Options': 'nosniff', 'X-Download-Options': 'noopen', 'X-Frame-Options': 'DENY', 'X-Vcap-Request-Id': 'd130bb1e-4e53-4738-4b02-8419633dd825', 'X-Xss-Protection': '1; mode=block', 'Age': '0', 'Server': 'ATS', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Via': 'http/1.1 edge2821.img.kth.yahoo.co.jp (ApacheTrafficServer [c sSf ])'}
Here is the source excerpt that calls requests.get and prints each attribute of the returned response object.
python
url = 'https://news.yahoo.co.jp/'
response = requests.get(url)
#print(response.text)
print('url: ',response.url)
print('status-code:',response.status_code) #HTTP status code, usually[200 OK]
print('headers[Content-Type]:',response.headers['Content-Type']) #Since headers is a dictionary, you can specify the key to content-type output
print('encoding: ',response.encoding) #encoding
Here are the results.
bash
(env3) % python requests-test.py
url: https://news.yahoo.co.jp/
status-code: 200
headers[Content-Type]: text/html;charset=UTF-8
encoding: UTF-8
Beautiful Soup is a library for web scraping in Python. It can pull data out of HTML and XML files and parse them, and it makes it easy to extract a specific HTML tag.
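As a minimal sketch of what "extracting a specific tag" looks like (a made-up HTML snippet, not the actual Yahoo News page):

python
from bs4 import BeautifulSoup

html = '<html><body><h1>Headline</h1><a href="/pickup/1">First news</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.h1.text)            # Headline
print(soup.find('a')['href'])  # /pickup/1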
Like requests, it can be installed with pip.
bash
(env3) % pip install beautifulsoup4
Collecting beautifulsoup4
Using cached beautifulsoup4-4.9.1-py3-none-any.whl (115 kB)
Collecting soupsieve>1.2
Using cached soupsieve-2.0.1-py3-none-any.whl (32 kB)
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.9.1 soupsieve-2.0.1
(env3) % pip list
Package Version
-------------- ---------
beautifulsoup4 4.9.1
certifi 2020.6.20
chardet 3.0.4
idna 2.10
pip 20.2.3
requests 2.24.0
setuptools 49.2.1
soupsieve 2.0.1
urllib3 1.25.10
wheel 0.34.2
BeautifulSoup takes the object to be parsed (HTML or XML) as its first argument (in this sample, the text of the response obtained with requests) and the parser to use as its second argument.
Parser | Example of use | Strengths | Weaknesses |
---|---|---|---|
Python’s html.parser | BeautifulSoup(response.text, "html.parser") | Standard library | Not very lenient on versions before Python 2.7.3 / 3.2.2 |
lxml’s HTML parser | BeautifulSoup(response.text, "lxml") | Blazing fast | Requires installation |
lxml’s XML parser | BeautifulSoup(response.text, "xml") | Blazing fast. The only XML parser | Requires installation |
html5lib | BeautifulSoup(response.text, "html5lib") | Can handle HTML5 correctly | Requires installation. Very slow |
python
soup = BeautifulSoup(response.text, "html.parser")
BeautifulSoup has various methods, but this time we will use the find_all method. find_all itself accepts various arguments, but this time we will use keyword arguments.
By specifying a tag attribute as a keyword argument, you can get the information of every tag that matches it.
The value of a keyword argument can be a string, a regular expression, a list, a function, or True, and you can combine multiple keyword arguments (a few variants are sketched after the example below).
For example, if you pass a value for href as a keyword argument, Beautiful Soup will filter on the href attribute of each HTML tag.
Quote: https://ai-inter1.com/beautifulsoup_1/#find_all_detail
In other words, by calling find_all on the soup object with the condition "the value of the href attribute matches the specified regular expression", the example below extracts every element whose href attribute contains "news.yahoo.co.jp/pickup".
elems = soup.find_all(href = re.compile("news.yahoo.co.jp/pickup"))
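Besides a regular expression, the keyword argument value can also be a plain string or True, and several keywords can be combined, as mentioned above. A minimal sketch against the same soup object (the id value is hypothetical, not actual Yahoo News markup):

python
# Variations of find_all keyword arguments
any_links = soup.find_all(href=True)                            # every tag that has an href attribute
by_id = soup.find_all(id="contentsWrap")                        # exact string match on the id attribute (hypothetical id)
pickup_anchors = soup.find_all("a", href=re.compile("pickup"))  # tag name plus a regex on href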
Finally, loop over the results with a for statement and print the title and link of each extracted news item to the console. Here is the final sample source.
requests-test.py
import requests
from bs4 import BeautifulSoup
import re
#Download website information using requests
url = 'https://news.yahoo.co.jp/'
response = requests.get(url)
#print(response.text)
print('url: ',response.url)
print('status-code:',response.status_code) #HTTP status code, usually[200 OK]
print('headers[Content-Type]:',response.headers['Content-Type']) #Since headers is a dictionary, you can specify the key to content-type output
print('encoding: ',response.encoding) #encoding
# Pass the fetched website text and the "html.parser" parser to BeautifulSoup()
soup = BeautifulSoup(response.text, "html.parser")
# Extract only the elements whose href attribute contains "news.yahoo.co.jp/pickup"
elems = soup.find_all(href = re.compile("news.yahoo.co.jp/pickup"))
#The title and link of the extracted news are displayed on the console.
for elem in elems:
    print(elem.contents[0])
    print(elem.attrs['href'])
The program itself is pretty much lifted from the site listed in the references. It was a great help.
Excluding the imports and the prints of the response attributes that are only there for confirmation, you can do this web scraping in just 7 lines. Python and the libraries built by those who came before us are terrifyingly powerful.
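For reference, those 7 lines (with the imports kept so the sketch runs on its own) boil down to this:

python
import requests
from bs4 import BeautifulSoup
import re

url = 'https://news.yahoo.co.jp/'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
elems = soup.find_all(href=re.compile("news.yahoo.co.jp/pickup"))
for elem in elems:
    print(elem.contents[0])
    print(elem.attrs['href'])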
Here are the results. I was able to scrape, for the time being! The last item, the photo news, is extraneous, but I don't know what to do about it yet, so I'll leave it as it is...
bash
% python requests-test.py
url: https://news.yahoo.co.jp/
status-code: 200
headers[Content-Type]: text/html;charset=UTF-8
encoding: UTF-8
Docomo account cooperation silver majority suspension
https://news.yahoo.co.jp/pickup/6370639
Mr. Suga Corrected remarks about the Self-Defense Forces
https://news.yahoo.co.jp/pickup/6370647
Flooded strawberry farmer suffering for 3 consecutive years
https://news.yahoo.co.jp/pickup/6370631
Two people died when four people got on the sea
https://news.yahoo.co.jp/pickup/6370633
Mulan shooting in Xinjiang Repulsion again
https://news.yahoo.co.jp/pickup/6370640
Parents suffer from prejudice panic disorder
https://news.yahoo.co.jp/pickup/6370643
Taku Hiraoka Defendant imprisonment for 2 years and 6 months
https://news.yahoo.co.jp/pickup/6370646
Iseya suspect seized 500 rolls
https://news.yahoo.co.jp/pickup/6370638
<span class="topics_photo_img" style="background-image:url(https://lpt.c.yimg.jp/amd/20200909-00000031-asahi-000-view.jpg)"></span>
https://news.yahoo.co.jp/pickup/6370647
Reference sites:
- https://requests-docs-ja.readthedocs.io/en/latest/
- https://ai-inter1.com/beautifulsoup_1/
- http://kondou.com/BS4/