I wanted to extract only the text I needed from the report (HTML format) spit out by a certain vendor's C++ static analysis tool, so I started with Python 2.7 + BeautifulSoup4.
bs4test.py
# Parse the vendor's HTML report and dump it re-encoded as Shift_JIS
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("rep_38248_dev1.html"))
print soup.prettify("shift_jis")
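By the way, once the whole report parses, the extraction step I had in mind looks roughly like the sketch below. The tr/td structure is just a placeholder assumption (the vendor report's real layout isn't shown here); it only illustrates the find_all()/get_text() pattern.
extract_sketch.py (placeholder)
# -*- coding: utf-8 -*-
# Sketch only: the tr/td tags below are an assumption, not the real
# structure of the vendor's report.
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("rep_38248_dev1.html"))
for row in soup.find_all("tr"):
    # Collect the text of each cell in the row, stripped of whitespace
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if cells:
        print "\t".join(cells).encode("shift_jis")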
What? It only reads about 2,500 lines, roughly 1/10 of the entire HTML (about 26,000 lines)!! I was stumped.
My usual move in this situation is to look for people who have hit the same problem online. So I immediately searched Google... and found nothing. Sigh.
With no other option, I poked around in the bs4 source code myself. The cause was a bug in the feed() method of the lxml tree builder that bs4 delegates the parsing to: when fed a huge HTML text in one go, part of it got dropped along the way.
The fix is simply to comment out LXMLTreeBuilder.feed() in bs4/builder/_lxml.py so that the inherited implementation is used instead. (For some reason, the XML-side parser, LXMLTreeBuilderForXML.feed(), had already been fixed.)
bs4/builder/_lxml.py
class LXMLTreeBuilder(HTMLTreeBuilder, LXMLTreeBuilderForXML):

    features = [LXML, HTML, FAST, PERMISSIVE]
    is_xml = False

    def default_parser(self, encoding):
        return etree.HTMLParser

    # def feed(self, markup):
    #     encoding = self.soup.original_encoding
    #     try:
    #         self.parser = self.parser_for(encoding)
    #         self.parser.feed(markup)
    #         self.parser.close()
    #     except (UnicodeDecodeError, LookupError, etree.ParserError), e:
    #         raise ParserRejectedMarkup(str(e))
Googling again, I found a related post on the beautifulsoup Google Groups forum. It looks like LXMLTreeBuilderForXML.feed() was bug-fixed at that time, but the corresponding change to LXMLTreeBuilder was overlooked...
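For reference, the gist of that fix is to feed the markup to lxml in small chunks instead of one huge feed() call. The following standalone sketch (my own simplification, not the library's exact code; CHUNK_SIZE and the function name are assumptions) shows the chunked-feed pattern using lxml's feed-parser interface. This chunked behavior is what the workaround above falls back to once LXMLTreeBuilder.feed() is commented out.
chunked_feed_sketch.py (simplified illustration)
# -*- coding: utf-8 -*-
# Standalone sketch of the chunked-feed idea, using lxml's feed-parser
# interface directly. CHUNK_SIZE is an assumed value for illustration only.
from StringIO import StringIO
from lxml import etree

CHUNK_SIZE = 512  # assumed chunk size

def parse_in_chunks(markup):
    parser = etree.HTMLParser()
    io = StringIO(markup)
    data = io.read(CHUNK_SIZE)
    parser.feed(data)              # call feed() at least once
    while data:
        data = io.read(CHUNK_SIZE)
        if data:
            parser.feed(data)
    return parser.close()          # close() returns the parsed root element

if __name__ == "__main__":
    root = parse_in_chunks(open("rep_38248_dev1.html").read())
    print etree.tostring(root, pretty_print=True)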