Environment: Mac, Python 3
Install Beautiful Soup and lxml
$ pip install beautifulsoup4
$ pip install lxml
I hit an error partway through, but the installation completed successfully, and I have had no problems so far.
from bs4 import BeautifulSoup
import urllib.request
# When fetching HTML from the web
url = '××××××××××××'
req = urllib.request.Request(url)
response = urllib.request.urlopen(req)
html = response.read()
soup = BeautifulSoup(html, "lxml")
# When opening a local HTML file directly
# (the with block closes the file once parsing is done)
with open("index.html") as f:
    soup = BeautifulSoup(f, "lxml")
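If you just want to experiment without a network connection or a local file, BeautifulSoup also accepts an HTML string directly. The following is a minimal sketch: the markup and the variable name html_text are made up for illustration, and it assumes Python's built-in "html.parser" instead of lxml so that it runs even where lxml is not installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration
html_text = (
    "<html><head><title>test title</title></head>"
    "<body><p>hello</p></body></html>"
)

# "html.parser" ships with Python; swap in "lxml" if it is installed
soup = BeautifulSoup(html_text, "html.parser")

print(soup.title.string)  # text of the <title> tag
print(soup.p.string)      # text of the first <p> tag
```

Everything below works the same way regardless of whether soup came from a URL, a file, or a string.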
Get the element by specifying the tag that contains the information you want.
- Specify a class
soup.find(class_='class_name')
# The trailing underscore is required: class is a reserved word in Python, so writing class= is a syntax error.
- Specify an id
soup.find(id="id_name")
# id is not a reserved word, so it needs no underscore.
- Specify the tag name as well
soup.find('li', class_='class_name')
soup.find('div', id="id_name")
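As a rough, self-contained illustration of the three patterns above, here is a sketch run against a small made-up document. The HTML, the class names, and the id are hypothetical, and "html.parser" is assumed so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only
html_text = """
<ul>
  <li class="item">first</li>
  <li class="item special">second</li>
</ul>
<div id="main">main area</div>
"""
soup = BeautifulSoup(html_text, "html.parser")

# Class only: returns the first tag of any kind that has the class
first_item = soup.find(class_="item")

# Tag name plus id
main_div = soup.find("div", id="main")

# Tag name plus class: a tag with several classes matches on any one of them
special = soup.find("li", class_="special")

# When nothing matches, find() returns None rather than raising
missing = soup.find("li", class_="no_such_class")

print(first_item.string, main_div.string, special.string, missing)
```

Because find() returns None on a miss, it is worth checking the result before accessing attributes on it.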
find() returns only the first match. To get every match, use find_all(), which returns a list-like collection of tags.
images = soup.find_all('img')
for img in images:
    ...  # process each img here
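One possible form of the per-image processing is collecting each src attribute. This is only a sketch with made-up markup and file names; it assumes "html.parser" and uses Tag.get() so that an img without a src does not raise KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only
html_text = """
<div>
  <img src="a.jpg"/>
  <img src="b.jpg"/>
  <img alt="no source"/>
</div>
"""
soup = BeautifulSoup(html_text, "html.parser")

sources = []
for img in soup.find_all("img"):
    src = img.get("src")  # None when the attribute is absent
    if src is not None:
        sources.append(src)

print(sources)
```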
# CSS selectors also work, via select(), which returns a list of every match
soup.select("p > a")
soup.select('a[href="http://example.com/"]')
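To make the two selectors concrete, the sketch below runs them against a small made-up document. The markup is hypothetical and "html.parser" is assumed. select_one(), which returns only the first match, is shown alongside as well.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only
html_text = """
<p><a href="http://example.com/">direct child</a></p>
<p><span><a href="http://example.com/other">nested deeper</a></span></p>
"""
soup = BeautifulSoup(html_text, "html.parser")

# "p > a" matches only <a> tags that are direct children of a <p>
children = soup.select("p > a")

# Attribute selector: exact match on the href value
exact = soup.select('a[href="http://example.com/"]')

# "p a" (descendant selector) matches at any depth below <p>
descendants = soup.select("p a")

# select_one() returns the first match, or None if there is none
first = soup.select_one("p a")

print(len(children), len(exact), len(descendants), first["href"])
```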
The samples below assume the HTML has already been loaded into soup.
sample.html
<html>
<title>test title</title>
</html>
>>> soup.title
<title>test title</title>
>>> soup.title.string
'test title'
Appending .string to a tag returns the text inside it.
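One caveat worth knowing: .string is None when a tag contains more than one child node, and get_text() can be used in that case. The following sketch uses made-up markup and assumes "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only
html_text = "<html><title>test title</title><p>Hello <b>world</b></p></html>"
soup = BeautifulSoup(html_text, "html.parser")

title_text = soup.title.string   # a single text child, so .string works
p_string = soup.p.string         # <p> has two children (text and <b>), so None
p_text = soup.p.get_text()       # concatenates all text below <p>

print(repr(title_text), repr(p_string), repr(p_text))
```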
sample.html
<html>
<div id="hoge">
<img class="fuga" src="http://××.com/sample.jpg "/>
</div>
</html>
First, get the div tag with id="hoge".
>>> div = soup.find('div', id="hoge")
>>> div
<div id="hoge">
<img class="fuga" src="http://××.com/sample.jpg "/>
</div>
Next, get the img tag with class="fuga" from inside the div.
>>> img = div.find('img', class_='fuga')
>>> img
<img class="fuga" src="http://××.com/sample.jpg "/>
>>> img['src']
'http://××.com/sample.jpg '
In this case you don't actually need to get the div first. I added it only because I wanted a sample that narrows the search down step by step.
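For comparison, here is a sketch of getting the same img in a single step. The domain in the original sample is elided, so example.com is used purely as a placeholder; "html.parser" is assumed, and the variable names are made up.

```python
from bs4 import BeautifulSoup

# Same structure as sample.html; example.com is a placeholder domain
html_text = """
<html>
<div id="hoge">
<img class="fuga" src="http://example.com/sample.jpg"/>
</div>
</html>
"""
soup = BeautifulSoup(html_text, "html.parser")

# Search the whole document for the img directly
img_direct = soup.find("img", class_="fuga")

# Or express the div-then-img narrowing as one CSS selector
img_css = soup.select_one("div#hoge img.fuga")

print(img_direct["src"])
print(img_css["src"])
```

The CSS selector form keeps the narrowing by div, which helps when the same class also appears elsewhere on the page.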
Reference: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup