This article is for beginners who want to try web scraping with Python 3 and BeautifulSoup4.
I referred to past articles, but some of their code displayed warnings or did not work because of version differences, so I have summarized the procedure again.
The basic process of web scraping is as follows.
① Get the web page.
② Split the acquired page into elements and extract the parts you need.
③ Save the result in a database.
Use requests to get the web page in ① and BeautifulSoup4 for the processing in ②. Since ③ differs depending on the environment, its explanation is omitted in this article.
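The three steps above can be sketched as three small functions. This is only a minimal sketch under stated assumptions: the function names, the SQLite in-memory database, and the `links` table are hypothetical choices made for illustration, and `html.parser` is used so the demo runs without lxml. The demo parses an inline HTML string, so `fetch` is defined but not called.

```python
import sqlite3

import requests
from bs4 import BeautifulSoup


def fetch(url):
    """Step 1: download the page and return its HTML text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


def extract_links(html):
    """Step 2: parse the HTML and collect the href of every a tag that has one."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]


def save_links(conn, links):
    """Step 3: store the links in a table (hypothetical schema)."""
    conn.execute("CREATE TABLE IF NOT EXISTS links (href TEXT)")
    conn.executemany("INSERT INTO links (href) VALUES (?)", [(link,) for link in links])
    conn.commit()


# Demo on an inline HTML string so that no network access is needed.
sample_html = (
    '<html><body><a href="/about">About</a>'
    '<a href="http://example.co.jp/">Home</a><a>no href</a></body></html>'
)
conn = sqlite3.connect(":memory:")
save_links(conn, extract_links(sample_html))
rows = conn.execute("SELECT href FROM links").fetchall()
print(rows)  # [('/about',), ('http://example.co.jp/',)]
```

In a real run you would pass `fetch(target_url)` to `extract_links` and replace the in-memory database with whatever storage your environment uses.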
After installing Python 3, use the pip command to install the three packages BeautifulSoup4, requests, and lxml.
$ pip install requests
$ pip install lxml
$ pip install beautifulsoup4
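To confirm that the installation worked, one option is to check whether Python can locate each package. The sketch below uses only the standard library; the helper name `missing_packages` is made up for this example. Note that beautifulsoup4 is imported under the name `bs4`.

```python
import importlib.util


def missing_packages(names):
    """Return the import names that Python cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# beautifulsoup4 is imported as "bs4"
missing = missing_packages(["requests", "lxml", "bs4"])
print("missing:", missing if missing else "none")
```

If any name is listed as missing, run the corresponding pip command again.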
Create the following script file.
sample.py
import requests
from bs4 import BeautifulSoup
target_url = 'http://example.co.jp'  # example.co.jp is a fictitious domain. Change it to any URL
r = requests.get(target_url)  # Get the page from the web using requests
soup = BeautifulSoup(r.text, 'lxml')  # Parse the HTML with the lxml parser
for a in soup.find_all('a'):
    print(a.get('href'))  # Show the link
Start a command prompt and execute the following command.
$ python sample.py
After running it, if the links on the page are printed to the console, you're good to go!
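The printed links may include relative paths such as `/news`, and `a.get('href')` returns `None` for a tags without an href. As a possible refinement, the sketch below skips missing hrefs and converts the rest to absolute URLs with `urljoin` from the standard library. The function name `absolute_links` and the sample HTML are assumptions for illustration, and `html.parser` is used so the snippet runs without lxml.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def absolute_links(html, base_url):
    """Return every link in the HTML as an absolute URL."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a"):
        href = a.get("href")  # None when the tag has no href attribute
        if href:
            links.append(urljoin(base_url, href))
    return links


sample_html = (
    '<a href="/news">News</a>'
    '<a name="top">Top</a>'
    '<a href="http://example.co.jp/faq">FAQ</a>'
)
print(absolute_links(sample_html, "http://example.co.jp/index.html"))
# ['http://example.co.jp/news', 'http://example.co.jp/faq']
```

With the sample script, you would call it as `absolute_links(r.text, target_url)`.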
Here are some useful methods for BeautifulSoup.
soup.a.string  # Returns the text of the first a tag
soup.a.attrs  # Returns all attributes of the first a tag as a dict
soup.a.parent  # Returns the parent element of the first a tag
soup.find('a')  # Returns the first matching element
soup.find_all(id='log')  # Returns all elements whose id is 'log' as a list
soup.select('head > title')  # Selects elements with a CSS selector
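To see what these methods actually return, you can try them on a small HTML string. The HTML below is made up for this example, and `html.parser` is used so the snippet runs without lxml; with `'lxml'` the results should be the same for this input.

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>Sample page</title></head>
<body>
<p id="log"><a href="/first" class="nav">First</a></p>
<p><a href="/second">Second</a></p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.a.string)                          # First
print(soup.a.attrs)                           # {'href': '/first', 'class': ['nav']}
print(soup.a.parent.name)                     # p
print(soup.find("a")["href"])                 # /first
print(len(soup.find_all("a")))                # 2
print(soup.find_all(id="log")[0].name)        # p
print(soup.select("head > title")[0].string)  # Sample page
```

Note that `soup.a` is a shortcut for the first a tag only, while `find_all` and `select` return lists.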
BeautifulSoup has many other methods you can use. For details, please refer to the official documentation. https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Regular expressions from the re module are convenient for narrowing down the target elements.
import re
soup.find_all('a', href=re.compile("^http")) #Links that start with http
import re
soup.find_all('a', href=re.compile("^(?!http)"))  # Links that do not start with http (negative lookahead)
import re
soup.find_all('a', string=re.compile("N"), title=re.compile("W"))  # a tags whose text contains N and whose title contains W (string= replaces the older text= argument)
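The three filters above can be checked against a small sample. The HTML and variable names below are made up for this example, `html.parser` is used so the snippet runs without lxml, and the `string=` argument assumes BeautifulSoup 4.4.0 or later.

```python
import re

from bs4 import BeautifulSoup

html = """
<a href="http://example.co.jp/news" title="World">News</a>
<a href="/about" title="Who we are">About</a>
<a href="https://example.co.jp/shop" title="Shop">Now open</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Links whose href starts with http (this also matches https)
external = soup.find_all("a", href=re.compile(r"^http"))
print([a["href"] for a in external])
# ['http://example.co.jp/news', 'https://example.co.jp/shop']

# Links whose href does not start with http
internal = soup.find_all("a", href=re.compile(r"^(?!http)"))
print([a["href"] for a in internal])
# ['/about']

# Links whose text contains N and whose title contains W
both = soup.find_all("a", string=re.compile("N"), title=re.compile("W"))
print([a["href"] for a in both])
# ['http://example.co.jp/news']
```

When several conditions are passed to `find_all`, only elements that satisfy all of them are returned.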
Here is a supplementary explanation of string operations that are useful to remember when scraping.
" abc ".strip()
→'abc'
"a, b, c,".split(',')
→['a', ' b', ' c', '']
"abcde".find('c')  # Returns the index of the first occurrence, or -1 if not found
→2
"abcdc".replace('c', 'x')
→'abxdx'
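These operations are often combined when cleaning scraped text. The sketch below is one possible combination; the sample strings and variable names are made up for this example.

```python
raw = "  Python, web scraping , BeautifulSoup4,  "

# Split on commas, strip the whitespace around each piece, and drop empty pieces
tags = [part.strip() for part in raw.strip().split(",") if part.strip()]
print(tags)  # ['Python', 'web scraping', 'BeautifulSoup4']

url = "http://example.co.jp/items?page=2"

# find() returns -1 when the character is absent, so check before slicing
pos = url.find("?")
path = url[:pos] if pos != -1 else url
print(path)  # http://example.co.jp/items

print(path.replace("http://", "https://"))  # https://example.co.jp/items
```

Stripping each piece after `split` removes the leading spaces and the trailing empty string that a plain `split(',')` leaves behind.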
Reference: http://qiita.com/itkr/items/513318a9b5b92bd56185