For the basic usage of Beautiful Soup, please see Scraping with Python and Beautiful Soup.
I recently had occasion to process HTML with Beautiful Soup, so this post collects the tips and notes from that work. I may update it from time to time.
for texttag in content.find_all('text'):
    texttag.name = 'p'
This replaces every <text> tag with <p>.
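As a self-contained sketch of the rename above: the sample markup here is made up for illustration, and `html.parser` is an assumed parser choice (the original does not say which parser was used).

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; <text> is not a standard HTML element.
html = "<div><text>first</text><text>second</text></div>"
content = BeautifulSoup(html, "html.parser")

for texttag in content.find_all("text"):
    # Assigning to .name renames both the opening and the closing tag.
    texttag.name = "p"

result = str(content)
print(result)  # <div><p>first</p><p>second</p></div>
```

Attributes and children are untouched by the rename; only the tag name changes.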
for imgtag in content.find_all('img'):
    if imgtag.parent.name != 'figure':
        imgtag.wrap(content.new_tag('figure'))
This finds each <img> whose direct parent is not a <figure> and wraps it in a new <figure>.
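The parent check only looks one level up, so an <img> nested deeper inside a <figure> (for example inside a link) would be wrapped a second time. A possible variant, sketched here with made-up sample markup, uses `find_parent` to look for a <figure> anywhere among the ancestors:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the first <img> sits inside <figure><a>...</a></figure>.
html = '<figure><a href="#"><img src="a.png"/></a></figure><img src="b.png"/>'
content = BeautifulSoup(html, "html.parser")

for imgtag in content.find_all("img"):
    # find_parent returns None when no ancestor with that name exists.
    if imgtag.find_parent("figure") is None:
        imgtag.wrap(content.new_tag("figure"))

result = str(content)
print(result)
```

Whether the stricter or the looser check is right depends on the markup you are cleaning up.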
A similar kind of wrapping can also be done with a CSS selector:
for notwrap_a in content.select("p ~ a"):
    notwrap_a.wrap(content.new_tag("p"))
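This finds each <a> that is not enclosed in a <p> and wraps it in one. Note that the selector "p ~ a" only matches an <a> that comes after a <p> at the same level, so an unwrapped <a> that appears before any <p> is not picked up.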
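To see exactly which links the "p ~ a" selector picks up, here is a small sketch. The markup and href values are invented for the example, and `select` relies on the soupsieve package that normally ships with Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one link before a <p>, one after it, one already inside a <p>.
html = (
    '<div><a href="/0">before</a><p>text</p>'
    '<a href="/1">after</a><p><a href="/2">inside</a></p></div>'
)
content = BeautifulSoup(html, "html.parser")

for notwrap_a in content.select("p ~ a"):
    # Only <a> elements that follow a sibling <p> are matched.
    notwrap_a.wrap(content.new_tag("p"))

wrapped = [a["href"] for a in content.select("p > a")]
print(wrapped)
print(content.find("a", href="/0").parent.name)
```

The link at "/0" keeps the <div> as its parent, because no <p> precedes it among its siblings.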
for tag in content.find_all('ul'):
    tag.find('li').unwrap()
for unwrap_ul in content.find_all('ul'):
    unwrap_ul.unwrap()
for delete_li in content.find_all('li'):
    delete_li.decompose()
The first loop finds each <ul> and unwraps its first <li>, located with find('li'), so that the contents of the first item are kept without the tag.
The second loop unwraps the <ul> itself, and the third deletes the <li> elements that remain, together with their contents.
After this, the first item is left without its <li>. If you want to put it in a new tag instead, replace
tag.find('li').unwrap()
with
first_li = tag.find('li')
first_li.name = 'p'
in the first loop.
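Putting the pieces together, a minimal end-to-end sketch of the "keep only the first item, as a <p>" variant might look like this. The sample list is made up, and the `None` guard for an empty <ul> is my addition, not part of the original snippet.

```python
from bs4 import BeautifulSoup

# Hypothetical sample list.
html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
content = BeautifulSoup(html, "html.parser")

for tag in content.find_all("ul"):
    first_li = tag.find("li")
    if first_li is not None:  # an empty <ul> has no <li> to rename
        first_li.name = "p"
    tag.unwrap()  # drop the <ul>, keep its children in place

# Any <li> still in the tree is a non-first item; delete it with its contents.
for leftover_li in content.find_all("li"):
    leftover_li.decompose()

result = str(content)
print(result)  # <p>one</p>
```

Merging the first two loops is safe here because `find_all` returns its results up front, before the tree is modified.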
for p in soup.find_all('p'):
    p.parent.unwrap()
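This removes the parent element of each <p> while keeping the <p> itself. Be careful when several <p> share one parent: after the first iteration removes that parent, each later iteration unwraps the next ancestor up.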
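If the intent is to remove only the immediate parents, one possible approach is to collect the distinct parents first and unwrap each once. This is a sketch under that assumption, with invented sample markup; parents are keyed by `id()` because Beautiful Soup tags compare equal when they merely look the same.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two <p> elements share one <div> parent.
html = "<section><div><p>a</p><p>b</p></div></section>"
soup = BeautifulSoup(html, "html.parser")

# Gather each distinct parent object once, before changing the tree.
parents = {id(p.parent): p.parent for p in soup.find_all("p")}

for parent in parents.values():
    parent.unwrap()

result = str(soup)
print(result)  # <section><p>a</p><p>b</p></section>
```

With the plain loop, the same input would also lose the <section>, since the second <p> sees it as its parent once the <div> is gone.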
Suppose you have the following HTML:
<img src="00001.jp">
<figcaption>caption string1</figcaption>
<img src="00002.jp">
<img src="00003.jp">
<figcaption>caption string3</figcaption>
If an <img> is followed by a <figcaption> and you want to enclose both in a <figure>, you can do the following. An <img> without a caption is wrapped on its own.
from bs4 import BeautifulSoup

html = """<img src="00001.jp">
<figcaption>caption string1</figcaption>
<img src="00002.jp">
<img src="00003.jp">
<figcaption>caption string3</figcaption>"""
content = BeautifulSoup(html, 'html.parser')
for img_tag in content.find_all('img'):
    fig = content.new_tag('figure')
    img_tag.wrap(fig)
    # Next tag in document order after the <img> (True matches any tag)
    next_node = img_tag.find_next(True)
    if next_node and next_node.name == 'figcaption':
        # Move the caption into the new <figure>
        fig.append(next_node)
print(content)
Running this rewrites the HTML as follows (shown here with line breaks added for readability):
<figure>
<img src="00001.jp"/>
<figcaption>caption string1</figcaption>
</figure>
<figure><img src="00002.jp"/></figure>
<figure>
<img src="00003.jp"/>
<figcaption>caption string3</figcaption>
</figure>
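One thing to keep in mind is that find_next walks the document in order, so it can also reach a <figcaption> that is not a sibling of the <img>. If you only want captions that sit next to the image at the same level, a possible variant uses find_next_sibling instead. The sample markup below is made up for the illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the first caption is outside the <div> that holds its
# nearest <img>; the second caption directly follows its <img>.
html = (
    '<div><img src="a.png"></div><figcaption>outside</figcaption>'
    '<img src="b.png"><figcaption>adjacent</figcaption>'
)
content = BeautifulSoup(html, "html.parser")

for img_tag in content.find_all("img"):
    # Look up the next sibling element before wrapping, because the <img>
    # has no siblings once it is inside the new <figure>.
    caption = img_tag.find_next_sibling(True)
    fig = img_tag.wrap(content.new_tag("figure"))  # wrap() returns the wrapper
    if caption is not None and caption.name == "figcaption":
        fig.append(caption)

result = str(content)
print(result)
```

Here the "outside" caption stays where it is, and only the "adjacent" one is moved into a <figure>. Text nodes between the <img> and the caption are skipped by the sibling lookup.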