Python scraping basics

The basics of web scraping with Python's requests module and Beautiful Soup.

Get content (mainly HTML) from the web

- Code to fetch and display the full HTML of a URL:

Get HTML


import requests

url = "https://hogehoge12345.html"

response = requests.get(url)
response.encoding = response.apparent_encoding

print(response.text)

- requests.get() sends an HTTP request to the URL passed as its argument and returns the HTTP response from the server.
- Setting response.encoding to response.apparent_encoding (the encoding guessed from the response body) reduces the chance of garbled characters.
- response.text holds the body of the retrieved content as text.
- Anything reachable by URL, such as CSV, image, or video files, can be fetched with the same requests.get() call. For binary files such as images and videos, read the raw bytes from response.content instead of response.text.
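
As a minimal sketch of the last point, the helper below writes the raw bytes of a response to disk. The name save_binary is hypothetical (it is not part of requests), and the function assumes it is handed a requests.Response object, using only its raise_for_status() method and content attribute.

```python
from pathlib import Path


def save_binary(response, path):
    """Write the raw bytes of an HTTP response to `path`; return the byte count.

    `save_binary` is a hypothetical helper, not part of requests. It assumes
    `response` behaves like requests.Response.
    """
    # Raise an exception for 4xx/5xx responses instead of saving an error page
    response.raise_for_status()

    # .content is the undecoded body, so binary data is not corrupted
    data = response.content
    Path(path).write_bytes(data)
    return len(data)
```

With requests imported, a call might look like save_binary(requests.get(image_url), 'image.png'), where image_url is a placeholder for the file you want to download.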

1 second interval


import time

time.sleep(1)

- When fetching pages from multiple URLs in succession, wait at least 1 second between requests so as not to burden the target site.
- Before scraping at all, check the site's usage restrictions: whether programmatic access is allowed and whether the published content may be converted into data.
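
The two points above could be sketched roughly as follows. This is only an illustration under some assumptions: robots.txt is used as one machine-readable hint about what a site allows (it does not replace reading the terms of use), and the function names allowed_urls and fetch_politely are made up for this example. The fetch argument is any callable that takes a URL, for instance requests.get.

```python
import time
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt, urls, user_agent="*"):
    """Return the URLs that the given robots.txt text permits for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(user_agent, u)]


def fetch_politely(urls, fetch, delay=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between calls."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # wait before every request except the first
        results.append(fetch(url))
    return results
```

A typical call would be fetch_politely(allowed_urls(robots_text, urls), requests.get), where robots_text is the text of the site's robots.txt that you downloaded beforehand.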

Save the retrieved web content to a file


response = requests.get(url)
response.encoding = response.apparent_encoding

exam_html = response.text

with open('exam.html', mode='w', encoding='utf-8') as fp:
    fp.write(exam_html)
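
One reason to save pages is to avoid requesting the same URL again while you experiment with parsing. The sketch below illustrates that idea; load_or_fetch is a hypothetical helper name, and fetch_text stands for any zero-argument function that returns the page text.

```python
from pathlib import Path


def load_or_fetch(path, fetch_text):
    """Return the text cached at `path`, calling fetch_text() only if it is missing."""
    cache = Path(path)
    if cache.exists():
        # Reuse the saved copy instead of hitting the site again
        return cache.read_text(encoding="utf-8")

    text = fetch_text()
    cache.write_text(text, encoding="utf-8")
    return text
```

For example, load_or_fetch('exam.html', lambda: requests.get(url).text) would download the page on the first run and read exam.html on later runs.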

HTML parsing

- **Use a library called Beautiful Soup.**
  - A program that parses HTML text and exposes its tags and other elements as a data structure is called an HTML parser.
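
To make the idea of an HTML parser more concrete, here is a small sketch that deliberately uses the standard library's html.parser module rather than Beautiful Soup. It turns HTML text into a list of (text, href) pairs for the a tags it finds. The names LinkCollector and extract_links are invented for this illustration; Beautiful Soup does the same kind of work behind a much more convenient interface.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_links(html):
    """Parse an HTML string and return the collected links."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```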


import requests
from bs4 import BeautifulSoup

url = "https://hogehoge12345.html"
response = requests.get(url)
response.encoding = response.apparent_encoding

# Parse the HTML
bs = BeautifulSoup(response.text, 'html.parser')

# Extract the first ul tag in the document
ul_tag = bs.find('ul')

# Loop over the a tags inside the ul tag
for a_tag in ul_tag.find_all('a'):

    # Get the text of the a tag
    text = a_tag.text        # => "Click to jump to the link"

    # Get the href attribute of the a tag
    link_url = a_tag['href'] # => "https://hogehoge12345.html/next"

    print('{}: {}'.format(text, link_url))

-Get the HTML code from \