Aidemy https://aidemy.net 2020/9/21
Hello, this is Yope! I'm a liberal arts college student, but I'm interested in the AI field, so I'm studying at the AI-specialized school "Aidemy". I'm very happy that so many people read my previous summary article. Thank you! This is my first post on scraping. Nice to meet you.
What to learn this time
・ What is scraping?
・ Get a web page (crawling)
- Scraping is the work of automatically extracting the necessary information from __web pages__.
- With scraping, you can collect the large amounts of data that machine learning requires.
- However, note that data on the web is not necessarily __open data__ (data that anyone is allowed to use freely).
・ Check whether scraping the site is allowed. If the site provides an API for getting the data, use that instead.
・ Get the web page that contains the data. This is called __crawling__.
・ Extract the necessary information from the web page. (Scraping)
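As a rough illustration of the first step, one common signal (not a complete answer, since the site's terms of use also matter) is the site's robots.txt file. The sketch below uses the standard library's `urllib.robotparser` on a made-up robots.txt; the rules, the `my-crawler` user agent name, and the example.com URLs are all assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real site publishes its own at /robots.txt.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() takes the file's lines, so no network access is needed here.
parser.parse(robots_txt.splitlines())

# "my-crawler" is a made-up user agent name for this sketch.
allowed = parser.can_fetch("my-crawler", "https://example.com/index.html")
blocked = parser.can_fetch("my-crawler", "https://example.com/private/data.html")

print(allowed)  # True: /index.html is not under a Disallow rule
print(blocked)  # False: /private/ is disallowed for every user agent
```

For a real site you would point the parser at the actual file with `set_url()` and `read()` instead of parsing a string.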
- __wget command__: Download the web page with the wget command, then scrape it with Unix commands or regular expressions. __Simple and easy, but limited in functionality.__
- __Web scraping tools__: Use a Chrome extension, a spreadsheet, or another scraping tool. Note that __their functions are limited and some of them cost money__.
- __Programming__: Write the scraping program yourself. __It can handle complex data.__ This is the approach I use this time.
- Encoding is converting __data into another format__.
- Decoding is returning __encoded data to its original format__.
- A web page arrives as encoded data (bytes), so in scraping you get that encoded data first and then decode it into text you can work with.
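In Python terms, encoding turns a str into bytes and decoding turns bytes back into a str. Here is a minimal sketch of that round trip; the sample text is just a placeholder I picked.

```python
# Hypothetical sample text; any string works.
text = "スクレイピング"

# Encoding: str -> bytes in a chosen character code.
utf8_bytes = text.encode("utf-8")
sjis_bytes = text.encode("shift_jis")

# Decoding: bytes -> str, using the same character code that encoded them.
restored = utf8_bytes.decode("utf-8")

# Decoding with the wrong character code does not give back the original text.
garbled = sjis_bytes.decode("utf-8", errors="replace")

print(type(utf8_bytes).__name__, type(restored).__name__)
print(restored == text)
print(garbled == text)
```

This is why the character code of the page has to be known before decoding: the same bytes mean different text under a different character code.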
- To get a web page, use __urlopen("URL")__, which becomes available by importing the urllib.request module.
- You can read the fetched page with the read() method, but note that the result is __not decoded__, so it is a bytes object rather than a character string (str type). (The following sections show how to decode it.)
from urllib.request import urlopen
# Get the Google top page
url = urlopen("https://www.google.co.jp")
- Decoding requires the "character code" (character encoding) used by the web page, so get that first. You can get it with the __info().get_content_charset(failobj="utf-8")__ method.
- The failobj="utf-8" argument means that "utf-8" is returned when the web page does not specify a character code (charset). Japanese pages are basically "utf-8", so I specify it this way.
url = urlopen("https://www.google.co.jp")
#Acquisition and display of character code
encode = url.info().get_content_charset(failobj="utf-8")
print(encode) #shift_jis
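To see what failobj does without touching the network, the sketch below builds the headers by hand. It relies on the fact that info() returns an http.client.HTTPMessage, which is a subclass of email.message.Message, so get_content_charset() behaves the same way here. The `charset_of` helper and the header values are my own assumptions for illustration.

```python
from email.message import Message


def charset_of(content_type):
    """Return what get_content_charset() reports for one Content-Type value."""
    headers = Message()
    headers["Content-Type"] = content_type
    return headers.get_content_charset(failobj="utf-8")


# The page declares a charset: it is returned in lowercase.
declared = charset_of("text/html; charset=Shift_JIS")

# The page declares no charset: the failobj value is returned instead.
missing = charset_of("text/html")

print(declared)  # shift_jis
print(missing)   # utf-8
```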
- Once you have the character code, decode the page with it to get the HTML code as a str.
- Decoding is done with __url.read().decode(character code)__: read() returns the page as bytes, and decode() turns those bytes into a str.
# Decode the bytes returned by read() with the acquired character code (encode)
url_decoded = url.read().decode(encode)
print(url_decoded) # Abbreviation (HTML code is output)
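Putting the steps together, the whole flow (open the page, look up its character code, decode the bytes from read()) could be wrapped in one small function. This is only a sketch: `fetch_html` and `fallback_charset` are names I made up, and the `data:` URL is used so that the example runs without network access. With a real page you would pass an `https://` URL instead.

```python
from urllib.request import urlopen


def fetch_html(url, fallback_charset="utf-8"):
    """Fetch a page and return its HTML as a str (hypothetical helper)."""
    with urlopen(url) as response:
        # Character code declared by the page, or the fallback if none is declared.
        charset = response.info().get_content_charset(failobj=fallback_charset)
        # read() returns bytes, not str.
        raw = response.read()
    # Decode the bytes with the page's character code to get a str.
    return raw.decode(charset)


# A data: URL carrying "<p>hello</p>", so no network access is needed.
sample_url = "data:text/html;charset=utf-8,%3Cp%3Ehello%3C%2Fp%3E"
html = fetch_html(sample_url)
print(html)  # <p>hello</p>
```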
- If you import and use the requests module, you can get a web page more easily than with urllib. However, preprocessing becomes harder when you need to perform complicated operations. Use __requests.get("URL")__ to get the page, the encoding attribute to get its character code, and the text attribute to get the decoded HTML code.
import requests
url=requests.get("https://www.google.co.jp")
print(url.encoding) #shift_jis
print(url.text) #Abbreviation
- Scraping means getting a web page and extracting the necessary data from it. In machine learning, it is used to collect the data required for training.
- To get (crawl) a web page, use the __urlopen()__ function of the urllib.request module. The result cannot be handled as text until it is decoded.
- To decode, first get the character code of the web page, then call the __decode()__ method with that character code.
- With the requests module, you get the page with __requests.get()__, the character code with __encoding__, and the already decoded HTML code with __text__. It's very easy.
That's all for this time. Thank you for reading this far.