Python scraping basics

The basics of web scraping with Python's requests module and Beautiful Soup.

Get content (mainly HTML) from the web

- Code to fetch and print the full HTML of a URL:

`Get HTML`


import requests

url = "https://hogehoge12345.html"

response = requests.get(url)
response.encoding = response.apparent_encoding

print(response.text)

- requests.get() sends an HTTP request to the URL given as its argument and returns the HTTP response from the server.
- Setting response.encoding to response.apparent_encoding, which is guessed from the response body, helps prevent garbled characters as much as possible.
- response.text is the body of the response decoded as text.
- Anything reachable by URL, such as CSV, image, or video files, can be fetched with the same requests.get() call. For binary files, read the raw bytes from response.content instead of response.text.
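
The points above can be sketched as a small helper. This is a minimal sketch, not a definitive implementation: the function name `fetch`, its `timeout` default, and the `get` parameter (which only exists so the HTTP call can be swapped out) are assumptions introduced for illustration.

```python
import requests


def fetch(url, timeout=10, get=requests.get):
    """Fetch url and return (text, raw_bytes).

    `fetch`, `timeout` and `get` are illustrative names, not part of requests.
    """
    response = get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of processing an error page
    response.raise_for_status()
    # Guess the encoding from the body to reduce garbled characters
    response.encoding = response.apparent_encoding
    # .text is the decoded string; .content is the undecoded bytes
    return response.text, response.content
```

For HTML or CSV the first return value is usually what you want. For images or video, use the second one and write it to a file opened with mode='wb'.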

`1 second interval`


import time

time.sleep(1)

- When fetching content from multiple URLs in succession, wait at least 1 second between requests so as not to burden the target site.
- Before scraping at all, check the site's usage restrictions: whether programmatic access is allowed and whether the published content may be turned into data.
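
Both points can be sketched with the standard library alone. The helper names `allowed_by_robots` and `fetch_politely`, and their parameters, are assumptions made up for this illustration. Note that robots.txt covers only part of a site's usage restrictions, so the terms of use still need to be read by a person.

```python
import time
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def fetch_politely(urls, fetch, interval=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `interval` seconds between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first
            sleep(interval)
        results.append(fetch(url))
    return results
```

Here `fetch` can be any function that takes a URL, for example one built on requests.get(). The `sleep` parameter is there only so the pause can be replaced when experimenting.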

`Save the retrieved web content to a file`


response = requests.get(url)
response.encoding = response.apparent_encoding

exam_html = response.text

with open('exam.html', mode='w', encoding='utf-8') as fp:
    fp.write(exam_html)

HTML parsing

- **Use a library called Beautiful Soup**.
  - A program that parses HTML text and exposes its tags and other parts as a data structure is called an HTML parser.

`python`


import requests
from bs4 import BeautifulSoup

url = "https://hogehoge12345.html"
response = requests.get(url)
response.encoding = response.apparent_encoding

# Parse the HTML
bs = BeautifulSoup(response.text, 'html.parser')

# Extract the first ul element
ul_tag = bs.find('ul')

# Loop over the a tags inside the ul element
for a_tag in ul_tag.find_all('a'):

    # Get the text of the a tag
    text = a_tag.text        # => "Click to jump to the link"

    # Get the href attribute of the a tag
    link_url = a_tag['href'] # => "https://hogehoge12345.html/next"

    print('{}: {}'.format(text, link_url))

- bs.find('ul') returns the first ul element in the document, that is, the HTML from the opening <ul> tag through its closing </ul> tag.
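
On real pages the ul tag or an href attribute may be missing. The following is a minimal sketch of guarding against both cases; the HTML string and the variable names are made up for this illustration so that it runs without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration
html = """
<html><body>
  <ul>
    <li><a href="/first">First page</a></li>
    <li><a>No link here</a></li>
    <li><a href="/second">Second page</a></li>
  </ul>
</body></html>
"""

bs = BeautifulSoup(html, "html.parser")

links = []
ul_tag = bs.find("ul")
# find() returns None when no matching tag exists, so check before using it
if ul_tag is not None:
    for a_tag in ul_tag.find_all("a"):
        # get() returns None for a missing attribute; a_tag["href"] would raise KeyError
        href = a_tag.get("href")
        if href is not None:
            links.append((a_tag.text, href))

print(links)  # [('First page', '/first'), ('Second page', '/second')]
```

The a tag without an href is skipped rather than stopping the loop with an exception.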

CSS selector

- Use a **CSS selector** to extract specific tags (such as tags with a certain CSS class).
- Write the tag name and the CSS class joined by a dot: <div class="exam1"> -> div.exam1

`select method`


# Extract the elements matching div.exam1 (returned as a list)
div_exam1 = bs.select('div.exam1')

- select() works like find_all(): it collects the matching HTML elements and returns them as a list, but it takes a CSS selector as the search condition. find(), by contrast, returns only the first match (for details, see the linked blog post: blog/difference-find-and-select-in-beautiful-soup-of-python/).
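
To make the comparison concrete, here is a minimal sketch that runs the same search with select() and find_all(), and also shows select_one(), which returns only the first match. The HTML string and variable names are assumptions made up for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration
html = """
<div class="exam1"><p>first</p></div>
<div class="exam2"><p>other</p></div>
<div class="exam1"><p>second</p></div>
"""

bs = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector and always returns a list
by_selector = bs.select("div.exam1")

# find_all() expresses the same condition with a tag name and the class_ argument
by_find_all = bs.find_all("div", class_="exam1")

# select_one() returns only the first match, or None if nothing matches
first = bs.select_one("div.exam1")

# A descendant selector: p tags anywhere inside div.exam1
paragraphs = [p.text for p in bs.select("div.exam1 p")]

print(len(by_selector), len(by_find_all))  # 2 2
print(paragraphs)                          # ['first', 'second']
```

A CSS selector is convenient when the condition spans several levels of nesting, as in the descendant selector here, which would otherwise take nested find_all() calls.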