Aidemy https://aidemy.net 2020/9/21

Introduction

Hello, this is Yope! I'm a liberal arts college student, but I'm interested in the AI field, so I'm studying at the AI-specialized school "Aidemy". I'm very happy that so many people read my previous summary article. Thank you! This is my first post on scraping. Nice to meet you.

This article summarizes what I learned at "Aidemy" in my own words. It may contain mistakes and misunderstandings, so please keep that in mind.

What to learn this time
- What is scraping?
- Get a web page (crawling)

What is scraping?

About scraping

- Scraping is the task of automatically extracting the information you need from __web pages__.
- With scraping, you can collect the large amounts of data that machine learning requires.
- However, note that data on the web is not necessarily __open data__, that is, data you are allowed to use freely.

Flow of scraping

- Check whether scraping the site is permitted. Also, if the site provides an API for getting the data, use that instead.
- Fetch the web page that contains the data. This is called __crawling__.
- Extract the necessary information from the web page. (Scraping)
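The first step can be partly automated. As a minimal sketch, assuming the site publishes its crawling rules in a robots.txt file (which the course material above does not mention), the standard library's `urllib.robotparser` can tell you whether a path is allowed. The rules, the `my-crawler` user agent name, and the example.com URLs below are all made up for illustration, and the rules are parsed from a string so the sketch runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# "my-crawler" is an invented user agent name.
print(parser.can_fetch("my-crawler", "https://example.com/index.html"))         # True
print(parser.can_fetch("my-crawler", "https://example.com/private/data.html"))  # False
```

robots.txt is only one signal. The site's terms of use still need to be checked by hand.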

Three main methods of scraping

- __wget command__: Download a web page with the wget command, then scrape it with Unix commands or regular expressions. __Simple and easy, but limited in functionality.__
- __Web scraping tools__: Use Chrome extensions, spreadsheets, or other scraping tools. Note that __their functionality is limited and some of them are paid__.
- __Programming__: Write the scraping program yourself. __It can handle complex data.__ This is the approach I use in this article.
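To give a feel for the programming approach, here is a rough sketch that extracts the title from an HTML string using the standard library's `html.parser`. The `TitleParser` class and the sample HTML are invented for this illustration. It is not the method taught in the course, just one possible way to write the extraction step by hand.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Keep only the text that appears between <title> and </title>.
        if self.in_title:
            self.title += data


# Hypothetical page content standing in for a downloaded web page.
sample_html = "<html><head><title>Sample page</title></head><body><p>Hello</p></body></html>"

title_parser = TitleParser()
title_parser.feed(sample_html)
print(title_parser.title)  # Sample page
```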

Encoding and decoding

- Encoding is __converting data into another format__.
- Decoding is __converting encoded data back to its original format__.
- In scraping, a web page arrives as encoded data (bytes), and you decode it to get text you can work with.
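The round trip can be tried with Python's built-in `str.encode()` and `bytes.decode()`. This is a small sketch with a made-up sample string. It also shows why the character code matters: decoding with the wrong one produces garbled text.

```python
text = "スクレイピング"

# Encoding: str -> bytes
encoded = text.encode("utf-8")
print(type(encoded))  # <class 'bytes'>

# Decoding with the same character code restores the original string.
decoded = encoded.decode("utf-8")
print(decoded == text)  # True

# Decoding with a different character code gives garbled text.
# errors="replace" substitutes a placeholder for bytes that cannot be decoded.
garbled = encoded.decode("shift_jis", errors="replace")
print(garbled == text)  # False
```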

Get a web page (crawling)

Get web page

- To get a web page, use __urlopen("URL")__, which is available after importing it from the urllib.request module.
- You can read the fetched page with the read() method, but note that the result is not a character string (str type) because it has __not been decoded__. (Decoding is covered in the following sections.)

from urllib.request import urlopen
# Fetch the Google top page
url = urlopen("https://www.google.co.jp")
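To confirm that read() returns bytes rather than str, here is a small sketch. So that it runs without network access, it assumes a `data:` URL (which `urlopen` also accepts) as a stand-in for a real page. The sample HTML and the variable names are made up for illustration.

```python
import base64
from urllib.request import urlopen

# Hypothetical stand-in page, packed into a data: URL so the sketch runs offline.
sample_html = "<html><body><p>Hello</p></body></html>"
sample_url = "data:text/html;base64," + base64.b64encode(
    sample_html.encode("utf-8")
).decode("ascii")

# The with statement closes the response when the block ends.
with urlopen(sample_url) as response:
    body = response.read()

print(type(body))  # <class 'bytes'>
```

With a real site, you would pass an ordinary https:// URL to `urlopen` in the same way.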

Decode the retrieved web page

- Decoding requires the "character code" (character encoding) used by the web page, so get that first. You can get it with the __info().get_content_charset(failobj="utf-8")__ method.
- The failobj="utf-8" argument means that "utf-8" is used as the fallback when the web page does not specify a character code (charset). Japanese pages are mostly "utf-8", so it is specified this way.

from urllib.request import urlopen
url = urlopen("https://www.google.co.jp")
# Get and display the character code
encode = url.info().get_content_charset(failobj="utf-8")
print(encode)  # shift_jis

- Once you have the character code, decode the page with it to get the HTML as a str.
- The response object itself has no decode() method. Read the raw bytes with read(), then decode them: __url.read().decode(character code)__.

# Read the raw bytes and decode them with the character code obtained above (encode)
url_decoded = url.read().decode(encode)
print(url_decoded)  # Output omitted (the HTML code is printed)
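Putting the charset lookup and the decoding together, a minimal end-to-end sketch might look like this. The helper name `fetch_html` is invented for this example, and a `data:` URL that declares `charset=shift_jis` stands in for a real Japanese page so that the sketch runs offline.

```python
import base64
from urllib.request import urlopen


def fetch_html(url):
    """Fetch a page and return (character code, decoded HTML)."""
    with urlopen(url) as response:
        # Fall back to utf-8 when the page does not declare a charset.
        charset = response.info().get_content_charset(failobj="utf-8")
        raw = response.read()  # bytes, not yet decoded
    return charset, raw.decode(charset)


# Hypothetical Shift_JIS page, packed into a data: URL.
sample_html = "<html><body><p>こんにちは</p></body></html>"
payload = base64.b64encode(sample_html.encode("shift_jis")).decode("ascii")
sample_url = "data:text/html;charset=shift_jis;base64," + payload

charset, html = fetch_html(sample_url)
print(charset)  # shift_jis
print(html)
```

Reading the character code from the response instead of hard-coding it keeps the same function usable for pages in other encodings.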

Get web pages easier

- If you import and use the requests module, you can get a web page more easily than with urllib. However, it can be harder to work with when you need more complicated operations.
- Use __requests.get("URL")__ to fetch the page. On the object it returns, the encoding attribute gives the character code and the text attribute gives the decoded HTML.

import requests
url = requests.get("https://www.google.co.jp")
print(url.encoding)  # shift_jis
print(url.text)  # Output omitted

Summary

- Scraping means fetching a web page and extracting the necessary data from it. In machine learning, it is used to collect the data required for training.
- To fetch (crawl) a web page, use the __urlopen()__ function of the urllib.request module. The result cannot be handled as text until it is decoded.
- To decode, first get the character code of the web page, then read the page with read() and apply the __decode()__ method with that character code.
- With the requests module, __requests.get()__ fetches the page, __encoding__ gives the character code, and text gives the HTML already decoded. It's very easy.

That's all for this time. Thank you for reading this far.

Scraping 1