A memo on searching HTML tags with beautifulsoup4
# soup is a BeautifulSoup object
# All p tags
soup.find_all("p")
# Only the first p tag found
soup.find("p")
# a tags whose href starts with hogehoge
import re
soup.find_all("a", href=re.compile("^hogehoge"))
# Descendant relationship (loose): p at any depth inside a div inside body
soup.select('body div p')
# Parent-child relationship (strict): direct children only
soup.select('body > div > p')
# Class name
soup.select('.myclass')
# id
soup.select('#myid')
# AND condition: elements that have both classes
soup.select('.myclass1.myclass2')
# Find the third <li> tag in the HTML below
# <html>
# <body>
# <ul>
# <li>Not specified</li>
# <li>Not specified</li>
# <li>It is specified</li>
# <li>Not specified</li>
# </ul>
# </body>
# </html>
soup.select('body > ul > li:nth-of-type(3)')
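To check the searches above without touching a real site, here is a minimal sketch that runs them against a small document. The sample HTML and the variable names are made up for illustration, and it uses the built-in "html.parser" so that only bs4 needs to be installed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample document, invented for this sketch
html = """
<html>
<body>
<div id="myid" class="myclass1 myclass2">
<p>direct child</p>
<section><p>nested deeper</p></section>
</div>
<a href="hogehoge/page1">match</a>
<a href="other/page2">no match</a>
<ul>
<li>Not specified</li>
<li>Not specified</li>
<li>It is specified</li>
<li>Not specified</li>
</ul>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

all_p = soup.find_all("p")                                 # every p tag
first_p = soup.find("p")                                   # first p tag only
links = soup.find_all("a", href=re.compile("^hogehoge"))   # href prefix match
loose = soup.select("body div p")                          # p at any depth under div
strict = soup.select("body > div > p")                     # p directly under div
both_classes = soup.select(".myclass1.myclass2")           # both classes required
third_li = soup.select("body > ul > li:nth-of-type(3)")    # third li in the list

print(len(loose), len(strict))
print(third_li[0].get_text())
```

The loose selector matches both paragraphs, while the strict one matches only the paragraph that is a direct child of the div.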
My search did not work because the HTML of the site I was scraping had an opening tag with no closing tag. The solution was to remove the opening tag. (Incidentally, Chrome's developer tools displayed a closing tag, because they show the DOM after the browser has repaired the markup, so I did not notice the problem until I looked at the page source.)
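import requests
from bs4 import BeautifulSoup

url = "http://hogehoge/"
response = requests.get(url)  # fetch the page; .text belongs to the response, not the URL string
soup = BeautifulSoup(response.text, "lxml")
# Remove each dd tag, because the dd tags have no closing tag
for tag in soup.find_all('dd'):
    tag.unwrap()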
This removes all <dd> tags. Note that .decompose() would also delete the elements that follow <dd>, because the parser treats them as the contents of the unclosed tag, so use .unwrap() to remove only the tag itself.
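The difference between the two methods can be seen in a small sketch. The markup below is a made-up example of an unclosed <dd>, and it is parsed with the built-in "html.parser" instead of "lxml" so that it runs with bs4 alone; other parsers may repair the markup differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <dd> is opened but never closed
broken = "<div><dd><p>first</p><p>second</p></div>"

# unwrap() removes the <dd> tag itself and keeps its contents
soup_unwrap = BeautifulSoup(broken, "html.parser")
for tag in soup_unwrap.find_all("dd"):
    tag.unwrap()

# decompose() destroys the <dd> tag together with everything inside it
soup_decompose = BeautifulSoup(broken, "html.parser")
for tag in soup_decompose.find_all("dd"):
    tag.decompose()

kept = [p.get_text() for p in soup_unwrap.find_all("p")]
lost = [p.get_text() for p in soup_decompose.find_all("p")]

print(kept)
print(lost)
```

After unwrap() both paragraphs are still in the tree, while after decompose() they are gone along with the <dd>.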