This time I will use Beautiful Soup. Environment: Python 3.6.0, BeautifulSoup 4.6.0
Documentation (English): http://www.crummy.com/software/BeautifulSoup/bs4/doc/
Documentation (Japanese): http://kondou.com/BS4/
$ pip install beautifulsoup4
The following program fetches this page and displays the contents of its h1 tag: https://pythonscraping.com/pages/page1.html
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://pythonscraping.com/pages/page1.html")
# Name the parser explicitly to avoid the "no parser was specified" warning
bsobj = BeautifulSoup(html.read(), "html.parser")
print(bsobj.h1)
Without any error handling, the scraper crashes when the web page cannot be found or when the data arrives in an unexpected format, so you should write exception handling.
html=urlopen("https://pythonscraping.com/pages/page1.html")
This line raises an error if the page cannot be found, so rewrite it as follows.
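from urllib.error import URLError

try:
    html = urlopen("https://pythonscraping.com/pages/page1.html")
except URLError:  # also covers HTTPError (e.g. 404), which is a subclass
    print("Page Not Found")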
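As a rough illustration of the same idea, the fetch can be wrapped in a small function that returns None on failure, so later code can check the result before parsing. This is only a sketch: `fetch_html` is a hypothetical helper name, and the 10-second timeout is an arbitrary assumption, neither of which comes from the original program.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError


def fetch_html(url):
    """Return the raw page bytes, or None if the page cannot be fetched.

    fetch_html is a hypothetical helper name, not part of urllib.
    """
    try:
        # The timeout value is an assumption chosen for this sketch.
        with urlopen(url, timeout=10) as response:
            return response.read()
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        print("HTTP error:", e.code)
    except URLError as e:
        # The resource could not be reached (bad host, refused connection, ...).
        print("Could not reach the page:", e.reason)
    return None
```

Calling `fetch_html("https://pythonscraping.com/pages/page1.html")` and testing the result with `is None` replaces the bare `try`/`except` above. HTTPError is listed first because it is a subclass of URLError.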
The following line can also cause an error:
bsobj = BeautifulSoup(html.read(), "html.parser")
So I rewrote it like this.
try:
    bsobj = BeautifulSoup(html.read(), "html.parser")
    print(bsobj.h1)
except Exception:
    print("error")
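One detail worth seeing in isolation: asking for a tag that does not exist, as in `bsobj.h1`, does not raise an error but returns None, so a None check is often needed in addition to `try`/`except`. The sketch below assumes a made-up HTML string in place of the downloaded page, and `get_h1` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def get_h1(html):
    """Return the first <h1> Tag in html, or None if there is none.

    get_h1 is a hypothetical helper name used only for this illustration.
    """
    # Naming the parser explicitly avoids the "no parser was explicitly
    # specified" warning.
    bsobj = BeautifulSoup(html, "html.parser")
    # Accessing a missing tag as an attribute returns None instead of raising.
    return bsobj.h1


# A made-up page body so the example runs without a network request.
sample = "<html><body><h1>An Interesting Title</h1><p>Some text.</p></body></html>"

title = get_h1(sample)
if title is None:
    print("h1 tag not found")
else:
    print(title.get_text())
```

Checking for None before using the tag avoids an AttributeError later, for example when calling `get_text()` on a tag that was never found.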
You can find the tags you want by using find() and findAll().
The following code retrieves every `<span class="green"></span>` element on the page.
span_list = bsobj.findAll("span",{"class":"green"})
If you want to match not only class="green" but also class="red", rewrite it as follows.
span_list = bsobj.findAll("span",{"class":{"red","green"}})
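To compare the two filters without a network request, here is a minimal sketch on a made-up HTML fragment (the span contents are placeholders, not text from the real page). It uses `find_all`, which is the current name of the same method as `findAll`, and passes the class names as a list.

```python
from bs4 import BeautifulSoup

# A made-up fragment for illustration only.
sample = """
<p>
  <span class="green">first</span>
  <span class="red">second</span>
  <span class="green">third</span>
  <span class="blue">fourth</span>
</p>
"""
bsobj = BeautifulSoup(sample, "html.parser")

# Only spans whose class is "green".
green_only = bsobj.find_all("span", {"class": "green"})
# A list of class names matches spans that have any one of them.
red_or_green = bsobj.find_all("span", {"class": ["red", "green"]})

print(len(green_only))
print(len(red_or_green))
```

The results come back in document order, so the red span sits between the two green ones in `red_or_green`.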
To display each match, loop over the list:
span_list = bsobj.findAll("span", {"class": "green"})
for i in span_list:
    print(i)
This code displays the text inside `<span class="green"></span>`, but the tags are displayed as well. If you want only the text inside, rewrite it as follows.
# Display with tags
print(i)
# Display without tags
print(i.get_text())
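The parentheses on `get_text()` matter: without them, Python prints the method object itself rather than the text. The following sketch shows the three cases side by side on a made-up fragment (the sample HTML and variable names are assumptions for illustration).

```python
from bs4 import BeautifulSoup

# A made-up fragment with a nested tag inside the span.
sample = '<p><span class="green">hello <b>world</b></span></p>'
span = BeautifulSoup(sample, "html.parser").find("span", {"class": "green"})

with_tags = str(span)          # the same string that print(span) shows
text_only = span.get_text()    # nested tags removed, text kept
method_object = span.get_text  # no parentheses: the method, not its result

print(with_tags)
print(text_only)
print(callable(method_object))
```

Note that `get_text()` also strips nested tags such as `<b>`, so call it last, after you have finished navigating the tag structure.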