You will be an engineer in 100 days ――Day 71 ――Programming ――About scraping 2

Click here for the previous articles in this series

You will become an engineer in 100 days-Day 70-Programming-About scraping

You will become an engineer in 100 days --Day 66 --Programming --About natural language processing

You will become an engineer in 100 days --Day 63 --Programming --Probability 1

You will become an engineer in 100 days-Day 59-Programming-Algorithms

You will become an engineer in 100 days --- Day 53 --Git --About Git

You will become an engineer in 100 days --Day 42 --Cloud --About cloud services

You will become an engineer in 100 days --Day 36 --Database --About the database

You will be an engineer in 100 days-Day 24-Python-Basics of Python language 1

You will become an engineer in 100 days --Day 18 --Javascript --JavaScript basics 1

You will become an engineer in 100 days --Day 14 --CSS --CSS Basics 1

You will become an engineer in 100 days --Day 6 --HTML --HTML basics 1

This article continues the topic of scraping.

About communication

We will scrape using Python. Since scraping involves network communication, you need to know how that communication works.

Websites are located on servers around the world. On the web, communication with a server is basically performed using a protocol (a set of communication rules) called HTTP (HyperText Transfer Protocol).

A message sent from the browser to the server is called a request, and the message the server sends back to the browser is called a response.

Basic exchanges on the web consist of request/response (R/R) pairs, which are essentially exchanges of text messages.

** Site search example **

You run a search with the search tool from your browser (request)
The server answers the request with the results (response)
The browser displays the search results based on the response

There are several specifications for HTTP communication, and there are multiple ways to send requests.

** GET communication **

GET sends the request with the parameters appended to the URL.

Example: http://otupy.com?p=abc&u=u123

Everything after the ? is the parameters. Each parameter is a key=value pair, and the pairs are joined with &.
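As a small illustration of how such a URL is put together, here is a minimal sketch using only Python's standard urllib.parse module (not the requests library used later). The host and the parameter values are the ones from the example above; the variable names are just for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = "http://otupy.com"        # host taken from the example above
params = {"p": "abc", "u": "u123"}   # key/value pairs to send

# Join the pairs as key=value with "&" and append them after "?"
url = base_url + "?" + urlencode(params)
print(url)  # http://otupy.com?p=abc&u=u123

# Going the other way: pull the parameters back out of the URL
query = urlsplit(url).query
print(parse_qs(query))  # {'p': ['abc'], 'u': ['u123']}
```

parse_qs returns each value as a list because the same key may appear more than once in a query string.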

** POST communication **

POST sends the request with the parameters included in the request body, not in the URL.

http://otupy.com

Request Body param:p:ab,u:u123

** Choosing between POST and GET **

A browser selects the appropriate communication method for you, but a program must specify the method explicitly.

Request

A message sent from a browser to a website's server is called a request.

When you open a web page in your browser, the browser sends a request message to the server, such as:

GET example:

Request header:
GET http://www.otupy.com/ex/http.htm HTTP/1.1
Host: www.otupy.com
Proxy-Connection: keep-alive
Cache-Control: max-age=0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Referer: https://www.google.co.jp/
Accept-Encoding: gzip, deflate
Accept-Language: ja,en-US;q=0.9,en;q=0.8

POST example:

Request header:
POST /hoge/ HTTP/1.1
Host: localhost:8080
Connection: keep-alive
Content-Length: 22
Cache-Control: max-age=0
Origin: http://localhost:8080
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36
Content-Type: application/x-www-form-urlencoded
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Referer: http://localhost:8080/hoge/
Accept-Encoding: gzip, deflate, br
Accept-Language: ja,en-US;q=0.8,en;q=0.6

Request body:
name=hoge&comment=hoge

A request has a header part and a body part, and what information is packed into each depends on the communication method.

Therefore, when you access a site programmatically, you need to fill in the appropriate information yourself before making the request.
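To make the header/body split concrete, the following sketch rebuilds a reduced version of the POST message shown above as plain text, using only the standard library. It only assembles the text and sends nothing. Only three of the headers are included, and the variable names are assumptions made for illustration.

```python
from urllib.parse import urlencode

form = {"name": "hoge", "comment": "hoge"}   # fields from the POST example above

# The body is the form fields encoded as key=value pairs joined by "&"
body = urlencode(form)

# Content-Length is the size of the body in bytes
headers = {
    "Host": "localhost:8080",
    "Content-Type": "application/x-www-form-urlencoded",
    "Content-Length": str(len(body.encode("utf-8"))),
}

# Request line, then the headers, then a blank line, then the body
lines = ["POST /hoge/ HTTP/1.1"]
lines += [f"{name}: {value}" for name, value in headers.items()]
message = "\r\n".join(lines) + "\r\n\r\n" + body
print(message)
```

The computed Content-Length comes out as 22, which matches the header in the POST example.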

Programmatic access

Let's try scraping right away.

In Python, you can communicate with a library called requests.


import requests

Specify the URL of the website you want to access and communicate with GET using requests.get(URL).

url = 'http://www.otupy.net/'
res = requests.get(url)
print(res)

<Response [200]>

The communication returns a response. If it succeeds, you can get the information from the site you accessed.

Of course, since this is communication, it can also fail.

Communication result (response)

The response carries a status code, and the codes are grouped into several classes. Codes in the 200s mean the communication succeeded. Codes in the 400s and 500s mean it failed, so check whether the URL is mistyped and whether the other party's server can be reached.

| Classification | Code | Message | Description |
|---|---|---|---|
| Information | 100 | Continue | Processing is continuing. Please send the rest of the request. |
| Information | 101 | Switching Protocols | Switching to the protocol specified in the Upgrade header. |
| Success | 200 | OK | Succeeded. |
| Success | 201 | Created | New content has been created at the location specified in the Location header. |
| Success | 202 | Accepted | The request has been accepted, but processing is not complete. |
| Success | 203 | Non-Authoritative Information | The response headers differ from what the origin server returned, but the process succeeded. |
| Success | 204 | No Content | There is no content, but the process succeeded. |
| Success | 205 | Reset Content | The request has been accepted. Please discard the current content (screen). |
| Success | 206 | Partial Content | Only part of the content is returned. |
| Redirection | 300 | Multiple Choices | There are multiple options for how to get the content. |
| Redirection | 301 | Moved Permanently | The content has moved permanently to the location specified in the Location header. |
| Redirection | 302 | Found | The content was found at another location specified in the Location header. Please look there. |
| Redirection | 303 | See Other | Look at the other location specified in the Location header. |
| Redirection | 304 | Not Modified | Not updated. Returned when the If-Modified-Since header is used. |
| Redirection | 305 | Use Proxy | Use the proxy specified in the Location header. |
| Redirection | 306 | (Unused) | Unused. |
| Redirection | 307 | Temporary Redirect | The content has temporarily moved to another location. |
| Client error | 400 | Bad Request | The request is invalid. |
| Client error | 401 | Unauthorized | Not authenticated. |
| Client error | 402 | Payment Required | Payment is required. |
| Client error | 403 | Forbidden | Access is not allowed. |
| Client error | 404 | Not Found | Not found. |
| Client error | 405 | Method Not Allowed | The specified method is not supported. |
| Client error | 406 | Not Acceptable | No content acceptable under the request's Accept headers is available. |
| Client error | 407 | Proxy Authentication Required | Proxy authentication is required. |
| Client error | 408 | Request Timeout | The request has timed out. |
| Client error | 409 | Conflict | The request has a conflict. |
| Client error | 410 | Gone | The requested content is gone. |
| Client error | 411 | Length Required | Please add a Content-Length header and request again. |
| Client error | 412 | Precondition Failed | The conditions specified in the If-... headers were not met. |
| Client error | 413 | Request Entity Too Large | The requested entity is too large. |
| Client error | 414 | Request-URI Too Long | The requested URI is too long. |
| Client error | 415 | Unsupported Media Type | Unsupported media type. |
| Client error | 416 | Requested Range Not Satisfiable | The requested range is invalid. |
| Client error | 417 | Expectation Failed | The expectation specified in the Expect header could not be met. |
| Server error | 500 | Internal Server Error | An unexpected error has occurred on the server. |
| Server error | 501 | Not Implemented | Not implemented. |
| Server error | 502 | Bad Gateway | The gateway is invalid. |
| Server error | 503 | Service Unavailable | Service is not available. |
| Server error | 504 | Gateway Timeout | The gateway has timed out. |
| Server error | 505 | HTTP Version Not Supported | This HTTP version is not supported. |
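Because the class of a status code is determined by its hundreds digit, the grouping in the table can be sketched in a few lines. The classify function below is a hypothetical helper written for this illustration, while http.HTTPStatus is part of Python's standard library and provides the standard message for each code.

```python
from http import HTTPStatus

def classify(status_code):
    """Return the class of a status code based on its hundreds digit."""
    classes = {
        1: "information",
        2: "success",
        3: "redirection",   # the 300s: the content is at another location
        4: "client error",
        5: "server error",
    }
    return classes.get(status_code // 100, "unknown")

for code in (200, 301, 404, 503):
    # HTTPStatus(code).phrase is the standard message for that code
    print(code, HTTPStatus(code).phrase, "->", classify(code))
```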

Checking the communication result in the program

Now let's check the communication result programmatically.

You can check the status code with res.status_code, where res is the variable that holds the response.


url = 'http://www.otupy.net/'
res = requests.get(url)
print(res.status_code)

200

If the code is not in the 200s, the communication has failed and the information on the website could not be obtained.

If the code is 200, the communication succeeded and you can look at the information obtained from the website.
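Since a request can fail either before a response arrives (bad URL, no connection, timeout) or with an error status code, it can be convenient to handle both in one place. The following is a minimal sketch, not the only way to do it. The helper name fetch and the 10-second timeout are assumptions chosen for illustration. raise_for_status() and requests.exceptions.RequestException are part of the requests library.

```python
import requests

def fetch(url, timeout=10):
    """Return the response for url, or None if the request failed."""
    try:
        res = requests.get(url, timeout=timeout)
        # Raises requests.exceptions.HTTPError for codes in the 400s and 500s
        res.raise_for_status()
    except requests.exceptions.RequestException as e:
        # Covers invalid URLs, connection errors, timeouts and HTTP errors
        print(f"Request failed: {e}")
        return None
    return res

# This string has no scheme (http://), so requests rejects it before connecting
res = fetch("not-a-valid-url")
print(res)  # None
```

With a reachable URL such as the one used in this article, the same call would return the response object instead of None.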

Since the communication result is stored in a variable (res here), you can inspect its various contents.

Request URL: res.url

Status code: res.status_code

Response body in text format: res.text

Response body in binary format: res.content

Cookies: res.cookies

Encoding information: res.encoding

From here on, we will take the acquired text and extract the necessary information from it.

# Get the response in binary format, decode it to text, and display the first 1000 characters
# (decoding before slicing avoids cutting a multi-byte character in half)
print(res.content.decode('utf-8')[:1000])

....
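One thing to watch when cutting a binary body down to a fixed length is that it is counted in bytes, not characters. The sketch below uses a hypothetical short string in place of a real response body to show the difference. It needs no network access.

```python
text = "こんにちは"               # hypothetical page text (5 characters)
content = text.encode("utf-8")   # what a binary response body would hold

print(len(text))     # 5 characters
print(len(content))  # 15 bytes: each of these characters takes 3 bytes in UTF-8

# Decoding first and slicing afterwards always cuts on a character boundary
print(content.decode("utf-8")[:2])  # こん

# Slicing the bytes first can cut a character in half, which fails to decode
try:
    content[:4].decode("utf-8")
except UnicodeDecodeError as e:
    print("decode failed:", e.reason)
```

For pages that contain only ASCII characters the two approaches give the same result, which is why the problem is easy to miss.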

Custom header

When making a request, you can pack information into the request header and the request body.

To specify headers in a GET request, do as follows.

requests.get(url, headers=header data as a dictionary)

For example, specify the header like this to change the user agent sent with the request.

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
res = requests.get(url, headers=headers)

To send parameters in a GET request, specify them as follows.

requests.get(url, params=parameter data as a dictionary)

params = {'key1': 'value1', 'key2': 'value2'}
res = requests.get(url, params=params)

To pack information into the request body and send it with a POST request, do as follows.

requests.post(url, data=body data as a dictionary)

payload = {'send': 'data'}
res = requests.post(url, data=payload)
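To see where these arguments end up without actually sending anything, requests can build a request and stop before transmission using requests.Request(...).prepare(). The sketch below is only an inspection aid. The user agent string is a hypothetical value, and the URL is the one used in this article.

```python
import requests

url = "http://www.otupy.net/"

# GET: params are appended to the URL as a query string
get_req = requests.Request(
    "GET", url,
    params={"key1": "value1", "key2": "value2"},
    headers={"User-Agent": "my-scraper/0.1"},   # hypothetical user agent
).prepare()
print(get_req.method, get_req.url)
print(get_req.headers["User-Agent"])

# POST: data is form-encoded and placed in the request body
post_req = requests.Request("POST", url, data={"send": "data"}).prepare()
print(post_req.method, post_req.url)
print(post_req.body)
print(post_req.headers["Content-Type"])
```

The GET parameters show up in the URL after the ?, while the POST data shows up in the body and leaves the URL unchanged.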

Summary

Make sure you understand the communication mechanism required for scraping so that you can acquire information. Tomorrow, continuing from here, we will start extracting the necessary information from what we have acquired.

29 days until you become an engineer

Author information

Otsu py's HP: http://www.otupy.net/

Youtube: https://www.youtube.com/channel/UCaT7xpeq8n1G_HcJKKSOXMw

Twitter: https://twitter.com/otupython
