Scraping basics using Python's requests module and Beautiful Soup
-Code to get and display the entire HTML of a URL:
Get HTML
import requests
url = "https://hogehoge12345.html"
response = requests.get(url)
response.encoding = response.apparent_encoding
print(response.text)
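-requests.get() sends an HTTP request to the URL passed as its argument and returns the HTTP response from the server as its return value.
-Setting response.encoding to response.apparent_encoding (the encoding guessed from the response body) prevents garbled characters as much as possible.
-response.text is the body of the retrieved content, decoded as text.
-Anything that can be accessed by URL can be retrieved the same way, including CSV, image, and video files (the request code is the same as above; for binary files, read response.content instead of response.text).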
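For binary files such as images, a minimal sketch might look like the following. The helper name download_file, the timeout value, and the URL in the comment are assumptions made for illustration; they are not part of the original code.

```python
import requests


def download_file(url, path, timeout=10):
    """Fetch url and save the raw response body to path (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of saving an error page
    response.raise_for_status()
    # response.content holds the undecoded bytes, which suits images, video, etc.
    with open(path, mode="wb") as fp:
        fp.write(response.content)
    return len(response.content)


# Example call (placeholder URL, replace with a file you are allowed to download):
# download_file("https://example.com/sample.png", "sample.png")
```

Opening the file in "wb" mode matters here: writing response.text would first decode the bytes as text, which usually corrupts binary data.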
1 second interval
import time
time.sleep(1)
-When retrieving HTML from multiple URLs in succession, leave an interval of at least 1 second between requests so as not to burden the target site.
-Before scraping at all, check the site's usage restrictions, such as whether programmatic access is permitted and whether the published content may be converted into data.
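The two points above could be sketched roughly as follows. The function names is_allowed and fetch_politely are hypothetical, and the robots.txt check uses the standard library's urllib.robotparser. Note that robots.txt is only one signal; a site's terms of use still have to be read separately.

```python
import time
from urllib.robotparser import RobotFileParser

import requests


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def fetch_politely(urls, interval=1.0, timeout=10):
    """Fetch each URL in turn, sleeping interval seconds between requests."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(interval)  # wait before every request except the first
        response = requests.get(url, timeout=timeout)
        response.encoding = response.apparent_encoding
        pages[url] = response.text
    return pages
```

In practice the robots.txt text would itself be fetched once from the site (for example with requests.get) and passed to is_allowed before any page is requested.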
Save the retrieved web content to a file
response = requests.get(url)
response.encoding = response.apparent_encoding
exam_html = response.text
with open('exam.html', mode='w', encoding='utf-8') as fp:
    fp.write(exam_html)
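One possible use of a saved file is as a local cache, so that repeated runs do not send the same request again. The sketch below assumes a hypothetical helper named load_or_fetch; the name and the timeout value are illustrative only.

```python
from pathlib import Path

import requests


def load_or_fetch(url, path, timeout=10):
    """Return the HTML for url, reusing the copy saved at path if one exists."""
    cache = Path(path)
    if cache.exists():
        # Reuse the saved copy instead of sending another request
        return cache.read_text(encoding="utf-8")
    response = requests.get(url, timeout=timeout)
    response.encoding = response.apparent_encoding
    cache.write_text(response.text, encoding="utf-8")
    return response.text
```

This keeps the load on the target site low while you experiment with parsing, at the cost of possibly working with a stale copy.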
-**Use a library called Beautiful Soup** to parse the retrieved HTML.
--A program that parses HTML text and exposes its tags and other elements as a data structure is called an HTML parser.
python
import requests
from bs4 import BeautifulSoup
url = "https://hogehoge12345.html"
response = requests.get(url)
response.encoding = response.apparent_encoding
# Parse the HTML
bs = BeautifulSoup(response.text, 'html.parser')
# Extract the part enclosed by the ul tag
ul_tag = bs.find('ul')
# Extract each a tag inside the ul tag
for a_tag in ul_tag.find_all('a'):
    # Get the text of the a tag
    text = a_tag.text  # => "Click to jump to the link"
    # Get the href attribute of the a tag
    link_url = a_tag['href']  # => "https://hogehoge12345.html/next"
    print('{}: {}'.format(text, link_url))
-bs.find('ul') gets the HTML code enclosed by the ul tag.
-The find method searches from the beginning of the document and returns only the first matching element, whereas the find_all method returns all matching elements as an iterable (so it can be used in a for loop).
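The difference between find and find_all can be checked on a small inline document without any network access. The HTML string below is made up for illustration. The last lines also show one way to guard against find returning None when nothing matches, a case the code above does not handle.

```python
from bs4 import BeautifulSoup

# Small inline document (made up for illustration)
html = """
<ul>
  <li><a href="/first">First</a></li>
  <li><a href="/second">Second</a></li>
</ul>
"""

bs = BeautifulSoup(html, "html.parser")

first_a = bs.find("a")      # a single Tag: the first <a> in document order
all_a = bs.find_all("a")    # a list-like result containing every <a>
missing = bs.find("table")  # None, because the document has no <table>

print(first_a["href"])          # /first
print([a.text for a in all_a])  # ['First', 'Second']
print(missing is None)          # True

# Guard against None before chaining calls on the result of find()
ul_tag = bs.find("ul")
links = [a["href"] for a in ul_tag.find_all("a")] if ul_tag is not None else []
print(links)                    # ['/first', '/second']
```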
-Use a **CSS selector** to extract specific tags (such as tags with a certain CSS class).
-Write the selector by joining the tag name and the CSS class with a dot: <div class="exam1"> -> div.exam1
select method
# Extract the parts enclosed by div.exam1
div_exam1 = bs.select('div.exam1')
-select() gets HTML elements and returns them as a list, like find_all (select_one() is the counterpart of find), but it takes a CSS selector as the search condition (for details, see the linked article "difference-find-and-select-in-beautiful-soup-of-python").
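As a small illustration of how select relates to find_all, the sketch below runs both on an inline document. The HTML string and the extra class name are made up for the example.

```python
from bs4 import BeautifulSoup

# Inline document (made up for illustration)
html = """
<div class="exam1"><p>one</p></div>
<div class="exam2"><p>two</p></div>
<div class="exam1 extra"><p>three</p></div>
"""

bs = BeautifulSoup(html, "html.parser")

by_selector = bs.select("div.exam1")              # every div with class exam1
by_find_all = bs.find_all("div", class_="exam1")  # same elements via find_all
first_only = bs.select_one("div.exam1")           # first match only, like find
nested = bs.select("div.exam1 > p")               # p tags directly inside those divs

print([d.p.text for d in by_selector])        # ['one', 'three']
print(list(by_selector) == list(by_find_all)) # True
print(first_only.p.text)                      # one
print([p.text for p in nested])               # ['one', 'three']
```

For a simple class match the two approaches are interchangeable. Selectors become more convenient once the condition involves nesting, as in the last query.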