This article is for beginners in web scraping with Python 3 and BeautifulSoup4.
I consulted earlier articles, but some of their code displayed warnings or no longer worked because of version differences, so I have summarized the procedure again.
The basic process of web scraping is as follows.
① Get the web page.
② Split the acquired page into elements and extract the parts you need.
③ Save the result in a database.
Use requests to get the web page in ① and BeautifulSoup4 for the processing in ②. Since ③ depends on your environment, it is not covered in this article.
After installing Python 3, use the pip command to install three packages: BeautifulSoup4, requests, and lxml.
$ pip install requests
$ pip install lxml
$ pip install beautifulsoup4
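To confirm that the three packages were actually installed, a small check like the following may help. This is only a sketch: the helper name `installed_versions` is made up for illustration, and it assumes Python 3.8 or later, where the standard library provides `importlib.metadata`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # The names are the same ones passed to pip install above.
    for name, ver in installed_versions(["requests", "lxml", "beautifulsoup4"]).items():
        print(name, ver or "NOT INSTALLED")
```

A package reported as NOT INSTALLED most likely means pip installed into a different Python environment than the one running the script.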
Create the following script file.
import requests
from bs4 import BeautifulSoup
target_url = ''  # A fictitious domain was used here; change this to any URL
r = requests.get(target_url)  # Fetch the page from the web using requests
soup = BeautifulSoup(r.text, 'lxml')  # Parse the HTML with the lxml parser
for a in soup.find_all('a'):
    print(a.get('href'))  # Show each link's href
Start a command prompt and run the script with the python command, followed by the name of the script file you created.
$ python
After running it, if you see the page's links printed on the console, you're good to go!
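The script above prints None for a tags without an href and waits indefinitely if the server does not respond. A slightly more defensive variant might look like the sketch below. The function names `extract_links` and `fetch_links`, the 10-second timeout, and the sample HTML are all assumptions made for illustration. The built-in 'html.parser' is used only so the sample runs without lxml; passing 'lxml' should work the same way.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin


def extract_links(html, base_url):
    """Return absolute URLs for every <a> tag that has an href attribute."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):  # href=True skips <a> tags without href
        links.append(urljoin(base_url, a["href"]))  # resolve relative links
    return links


def fetch_links(url):
    """Download a page and return its links (needs network access)."""
    r = requests.get(url, timeout=10)  # hypothetical timeout value
    r.raise_for_status()  # raise an exception on 4xx/5xx responses
    return extract_links(r.text, url)


# Offline demonstration with a made-up HTML fragment.
SAMPLE_HTML = (
    '<p><a href="/about">About</a> <a name="top">no href</a> '
    '<a href="https://example.org/x">X</a></p>'
)
print(extract_links(SAMPLE_HTML, "https://example.com/"))
# ['https://example.com/about', 'https://example.org/x']
```

Keeping the parsing step separate from the download makes it possible to try the extraction on a saved or hand-written HTML string before touching a real site.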
Here are some useful methods for BeautifulSoup.
soup.a.string  # Get the text string of the first a tag
soup.a.attrs  # Get all attributes as a dictionary
soup.a.parent  # Returns the parent element
soup.find('a')  # Returns the first matching element
soup.find_all(id='log')  # Returns all matching elements
soup.select('head > title')  # Select elements with a CSS selector
BeautifulSoup has many other methods you can use. For details, please refer to the official documentation.
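To see what these methods return, it may help to run them against a small, known document. The HTML below is invented for this sketch, and the built-in 'html.parser' is assumed so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Made-up sample document for illustration.
html = """
<html><head><title>Sample page</title></head>
<body>
<div id="log"><a href="/first" class="nav">First</a></div>
<p><a href="/second">Second</a></p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

first_text = soup.a.string              # text of the first <a>: 'First'
first_attrs = soup.a.attrs              # dict of attributes; class is stored as a list
parent_name = soup.a.parent.name        # name of the enclosing tag: 'div'
same_as_first = soup.find("a")          # the same element that soup.a returns
log_elements = soup.find_all(id="log")  # list of every element with id="log"
titles = soup.select("head > title")    # list of elements matching the CSS selector

print(first_text, first_attrs, parent_name)
print(len(log_elements), titles[0].string)
```

Note that find returns a single element (or None), while find_all and select always return a list, even when only one element matches.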
Regular expressions from the re module are a convenient way to narrow down the target elements.
import re
soup.find_all('a', href=re.compile("^http")) #Links that start with http
import re
soup.find_all('a', href=re.compile("^(?!http)"))  # Links that do not start with http (negative lookahead)
import re
soup.find_all('a', string=re.compile("N"), title=re.compile("W"))  # Elements whose text contains N and whose title contains W (string= replaces the older text= argument)
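The three filters above can be tried on a small sample. The HTML below is made up for this sketch, and 'html.parser' is assumed so that it runs without lxml. Note that the patterns are case-sensitive.

```python
import re
from bs4 import BeautifulSoup

# Made-up sample links for illustration.
html = """
<a href="http://example.com/news" title="World">News</a>
<a href="https://example.com/blog" title="Tech">Blog</a>
<a href="/contact" title="Write to us">Contact</a>
<a href="mailto:someone@example.com" title="Web mail">Now mail</a>
"""
soup = BeautifulSoup(html, "html.parser")

# href starts with "http" (this also matches "https").
absolute = soup.find_all("a", href=re.compile("^http"))

# href does NOT start with "http" (negative lookahead).
relative = soup.find_all("a", href=re.compile("^(?!http)"))

# Text contains "N" AND title contains "W".
both = soup.find_all("a", string=re.compile("N"), title=re.compile("W"))

print([a["href"] for a in absolute])  # the two example.com URLs
print([a["href"] for a in relative])  # ['/contact', 'mailto:someone@example.com']
print([a.string for a in both])       # ['News', 'Now mail']
```

"Contact" is not in the last result even though its title contains W, because its text has only a lowercase n.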
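As a supplement, here are some string operations that are useful to remember when scraping.
" abc ".strip()  # Remove leading and trailing whitespace
→'abc'
"a, b, c,".split(',')  # Split on commas (spaces and the trailing empty string are kept)
→['a', ' b', ' c', '']
"abcde".find('c')  # Returns the index of the first match, or -1 if not found
→2
"abcdc".replace('c', 'x')  # Replace every 'c' with 'x'
→'abxdx'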
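When scraped text is split on a delimiter, the pieces often keep stray spaces and a trailing empty string. One way to tidy this up is sketched below; the helper name `split_clean` is hypothetical and not part of any library.

```python
def split_clean(raw, sep=","):
    """Split raw on sep, strip whitespace from each piece, and drop empty pieces."""
    return [piece.strip() for piece in raw.split(sep) if piece.strip()]


print(" abc ".strip())            # abc
print("a, b, c,".split(","))      # ['a', ' b', ' c', '']
print(split_clean("a, b, c,"))    # ['a', 'b', 'c']
print("abcde".find("c"))          # 2
print("abcde".find("z"))          # -1 (not found)
print("abcdc".replace("c", "x"))  # abxdx
```

Because find returns -1 rather than raising an error when nothing matches, check for -1 before using the result as an index.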