This section covers basic knowledge for scraping, such as key terms and the structure of web pages.
Scraping is the process of automatically extracting necessary information from a web page.
The "information" here can take many forms: text, images, links, and so on. Scraping saves you the trouble of opening a browser and collecting information by hand, and lets you gather large amounts of data.
Scraping is used in a variety of situations, from personal use to work and research.
In machine learning, the vast amount of data on the Web is useful, especially for natural language processing and image recognition.
In recent years, open data (data released by the government, local governments, companies, etc. with permission for free use) has been attracting attention.
Scraping is also useful when collecting such data.
However, scraping can put a heavy load on the server hosting the web page if used carelessly, so be careful when using it.
Unlike an API, scraping is not something the service provider offers to developers. Doing it without the other party's permission may violate the website's terms of service.
In some cases, publishing the acquired data is prohibited, and some websites forbid scraping outright, so read the terms of use carefully before scraping.
Let's look at the general flow of scraping. The whole process is sometimes collectively called scraping, while the terms crawling and scraping are also used for the individual steps.
First, check whether an API exists that provides the desired information, whether publishing the acquired data is allowed, and whether there are any copyright or other infringement concerns.
To extract specific information from a web page, you first need to get the web page itself.
This step is known as crawling.
Commonly used libraries include urllib and requests.
Once you have a web page, extract the information you need from it.
Libraries such as re, lxml, and Beautiful Soup are used for this.
Strictly speaking, this step is what is called scraping.
Data acquired by scraping is saved in a database or locally (on your PC).
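The three steps above (crawl → scrape → save) can be sketched with the standard library alone. This is a minimal sketch, not from the text: the function names are made up, and only the title extraction is run here so that no network access happens.

```python
import re
from urllib.request import urlopen


def crawl(url):
    """Fetch a page and decode it using the charset from the Content-Type header."""
    with urlopen(url) as f:
        encoding = f.info().get_content_charset(failobj="utf-8")
        return f.read().decode(encoding)


def scrape_title(html):
    """Extract the contents of the <title> tag with a regular expression."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1) if match else None


def save(text, path):
    """Save the result locally as a text file."""
    with open(path, "w", encoding="utf-8") as out:
        out.write(text)


# Demonstrate the scraping step on a local HTML string (no network access)
sample = "<html><head><title>Example Domain</title></head></html>"
print(scrape_title(sample))
```

In actual use you would chain the three, e.g. `save(scrape_title(crawl("https://...")), "title.txt")`.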
Here are three simple ways to do scraping.
One of the easiest scraping methods is to download the web page with the wget command and then process the text with unix commands or regular expressions.
This can all be done from the command line and requires only basic knowledge of unix commands, so it is very easy.
On the other hand, precisely because it is simple, it lacks practical features and is hard to apply to complex data. It is useful, for example, when retrieving an already formatted dataset.
$ wget http://scraping.aidemy.net
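The regular-expression part of this approach can also be reproduced in Python. As a sketch, the HTML string below stands in for a page downloaded with wget (it is illustrative, not from the text):

```python
import re

# Sample HTML standing in for a downloaded page (illustrative)
html = '<a href="https://example.com/a">A</a> <a href="https://example.com/b">B</a>'

# Extract every href value with a regular expression
links = re.findall(r'href="([^"]+)"', html)
print(links)  # → ['https://example.com/a', 'https://example.com/b']
```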
There are various tools: dedicated software, Google Chrome extensions, Google Spreadsheets, and so on.
They have the advantage of requiring almost no code, but also disadvantages: limited functionality, tool-specific usage to learn, and possible fees.
You can also perform scraping in various programming languages.
This approach is practical because it can handle complex and dynamic data, but you need to understand the language's syntax and processing methods.
Python can scrape with the standard library alone, and there are also more powerful third-party libraries (libraries published by users around the world, separate from the standard library).
Using such tools, you can scrape quickly with a relatively short program.
Python is especially useful for preprocessing in data analysis and machine learning, so it pairs well with post-scraping processing and visualization.
What is HTML
HTML is an abbreviation for HyperText Markup Language, one of the most basic markup languages for creating web pages.
As the name implies, HTML is a markup language for hypertext, i.e. text with extended functionality such as embedded links.
A markup language is a language for describing the structure of a document.
It is mainly used to make web pages easier to read.
As the word markup suggests, marks (tags) are attached to the text to embed things such as character size and position, image links, and so on.
Markup languages include XML as well as HTML.
XML has a structure suitable for exchanging data between multiple applications via the Internet.
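As a small illustration of that data-exchange structure, Python's standard library can parse XML directly. The document and element names below are made up for this sketch:

```python
import xml.etree.ElementTree as ET

# A made-up XML document, as might be exchanged between applications
xml_data = "<users><user><name>Taro</name><age>20</age></user></users>"

# Parse the string into an element tree and read the values back out
root = ET.fromstring(xml_data)
for user in root.findall("user"):
    print(user.find("name").text, user.find("age").text)
```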
Encoding is the conversion of analog signals or digital data into a particular format according to fixed rules.
Computers process everything numerically. Letters and symbols are only displayed so that humans can read them easily; internally, every letter and symbol corresponds to a number (bit string).
The method of determining which character is assigned to which bit string is called a character code.
For example, there are character codes called UTF-8 and Shift-JIS.
The Japanese character "あ" is "0xe38182" in the UTF-8 character code table and "0x82a0" in the Shift-JIS character code table.
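These byte values can be checked directly in Python with the str.encode method:

```python
# Encode "あ" with each character code and show the resulting bytes in hex
print("あ".encode("utf-8").hex())      # → e38182
print("あ".encode("shift_jis").hex())  # → 82a0
```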
Decoding is the conversion of encoded data back to its original format.
For example, decoding a video means converting the digital data recorded on a DVD back into the original video.
If you do not decode with the same method used for encoding, text data becomes garbled.
For example, there is a file written in UTF-8 as follows.
My name is Taro Yamada.
If you decode this file with Shift-JIS, which is different from the encoding, it will be as follows.
聘 √.
Basically, web browsers automatically detect the character code and decode the page, so users can usually browse without being aware of any of this.
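This garbling can be reproduced in Python by decoding UTF-8 bytes as Shift-JIS. This is a sketch with sample strings; errors="replace" substitutes a placeholder for bytes that cannot be decoded (without it, decoding would raise an error here):

```python
# Encode a string as UTF-8
data = "My name is Taro Yamada.".encode("utf-8")

# Decoding with the same character code restores the original text
print(data.decode("utf-8"))

# Decoding with a different character code garbles any non-ASCII text
garbled = "こんにちは".encode("utf-8").decode("shift_jis", errors="replace")
print(garbled)  # mojibake, not こんにちは
```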
Now let's use Python to acquire (crawl) web pages step by step.
To get a web page, use the urllib.request module from the standard library.
Pass the URL as the argument to the urlopen function in this module; urlopen() returns an object of type HTTPResponse.
from urllib.request import urlopen

# Specify the URL of the Google homepage as an example
f = urlopen("https://www.google.co.jp")

# urlopen() returns an object of type HTTPResponse
print(type(f))

# >>> Output result
<class 'http.client.HTTPResponse'>
To refer to the body of the acquired web page, use the read method.
Note, however, that reading with the read method alone leaves the data encoded (not decoded), so the data type is bytes.
from urllib.request import urlopen

# Specify the URL of the Google homepage as an example
f = urlopen("https://www.google.co.jp")

# Refer to the HTML body obtained by urlopen() with read()
read = f.read()
print(type(read))
print(read)

# >>> Output result
# The data type returned by read() is bytes
<class 'bytes'>
# This is the HTML content referenced by read()
b'<!doctype html><html itemscope=.....(the rest is omitted)
As in the sample above, calling read() with no argument retrieves the entire body of the web page, while read(n) retrieves only the first n bytes.
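The difference can be illustrated without a network connection using io.BytesIO, which provides the same read interface as HTTPResponse (the byte string below is a sample):

```python
from io import BytesIO

# BytesIO mimics the read() interface of HTTPResponse (sample data)
f = BytesIO(b"<!doctype html><html></html>")

print(f.read(9))  # → b'<!doctype'  (first 9 bytes only)
print(f.read())   # → b' html><html></html>'  (the remaining bytes)
```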
In the previous section we confirmed that the urlopen function returns an HTTPResponse object, and that the response obtained with that object's read method is of type bytes.
The response is bytes because the loaded page is raw encoded data that has not yet been decoded.
To handle the acquired response as type str, you need to decode it with the correct character code.
To find out the character code used by a web page, refer to the Content-Type header of the HTTPResponse object returned by urlopen.
To refer to the Content-Type header, use the getheader method with "Content-Type" as its argument.
from urllib.request import urlopen

# Specify the URL of the Google homepage as an example
f = urlopen("https://www.google.co.jp")

# Get the value of the HTTP header Content-Type
f.getheader("Content-Type")

# >>> Output result
'text/html; charset=Shift_JIS'
The value of the Content-Type header is returned as 'text/html; charset=Shift_JIS'.
The part after charset= is the character code of the web page.
In this sample, the character code of google.co.jp is Shift_JIS.
If charset= is not set, most Japanese web pages can be assumed to be UTF-8.
You could also extract the character code from the value of the Content-Type header with a regular expression, but in practice it is more convenient to combine info() and get_content_charset() as follows.
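For reference, a regex-based extraction might look like this sketch (the header value is hard-coded as a sample rather than fetched):

```python
import re

# Sample Content-Type header value
content_type = "text/html; charset=Shift_JIS"

# Pull out the charset with a regular expression, falling back to utf-8
match = re.search(r"charset=([\w-]+)", content_type)
encoding = match.group(1) if match else "utf-8"
print(encoding)  # → Shift_JIS
```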
The info method returns an object of type HTTPMessage.
The get_content_charset method of that object retrieves the charset.
from urllib.request import urlopen

# Specify the URL of the Google homepage as an example
f = urlopen("https://www.google.co.jp")

# Get an HTTPMessage object with info()
# Get the charset with get_content_charset()
# The failobj= argument specifies the encoding to use when charset is not set
encoding = f.info().get_content_charset(failobj="utf-8")
print(encoding)

# >>> Output result
shift_jis
Now that we have the character code of the web page, let's get the HTML body decoded with the correct character code.
To decode, use the decode method and specify the encoding as its argument.
from urllib.request import urlopen

# Specify the URL of the Google homepage as an example
f = urlopen("https://www.google.co.jp")

# Get the charset
encoding = f.info().get_content_charset(failobj="utf-8")

# Get the HTML by specifying the encoding in decode()
text = f.read().decode(encoding)
print(text)

# >>> Output result
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head>.....(the rest is omitted)
So far we have used urllib to get the elements of a web page. Next, let's introduce the requests module, which is not part of the standard library but lets you retrieve web pages even more easily than urllib.
urllib is easy enough if you only need simple GET and POST requests, but slightly tricky tasks, such as adding HTTP headers or performing Basic authentication (a common form of password authentication), require somewhat troublesome preprocessing.
The requests module handles character-code decoding and file/folder compression automatically, and lets you work at a higher layer with simpler code.
# Import the requests module
import requests

# requests.get() gets the Response object for the URL specified in its argument
r = requests.get("https://www.google.co.jp")

# You can get the character code from the Response object with the encoding attribute
print(r.encoding)

# >>> Output result
Shift_JIS

# The text attribute gives the response automatically decoded into str type
print(r.text)

# >>> Output result
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head>.....(the rest is omitted)