I wanted to extract only the text I needed from the report (HTML format) spit out by a certain vendor's C++ static analysis tool, so I started with Python 2.7 + BeautifulSoup4.
bs4test.py
# Parse the vendor's HTML report and dump it re-encoded as Shift_JIS
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("rep_38248_dev1.html"))
print soup.prettify("shift_jis")
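By the way, once the whole report parses, the extraction step I had in mind looks roughly like the sketch below. The tr/td structure is just a placeholder assumption (the vendor report's real layout isn't shown here); it only illustrates the find_all()/get_text() pattern.
extract_sketch.py (placeholder)
# -*- coding: utf-8 -*-
# Sketch only: the tr/td tags below are an assumption, not the real
# structure of the vendor's report.
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("rep_38248_dev1.html"))
for row in soup.find_all("tr"):
    # Collect the text of each cell in the row, stripped of whitespace
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if cells:
        print "\t".join(cells).encode("shift_jis")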
What? It only reads about 2,500 lines, roughly 1/10 of the entire HTML (about 26,000 lines)!! I was stumped.
My usual move in this situation is to look for people who have hit the same problem online. So I immediately searched Google... and found nothing. Sigh.
With no other option, I poked around in the bs4 source code myself. The cause was a bug in the feed() method of the lxml tree builder that bs4 delegates the parsing to: when fed a huge HTML text in one go, part of it got dropped along the way.
The fix is simply to comment out LXMLTreeBuilder.feed() in bs4/builder/_lxml.py so that the inherited implementation is used instead. (For some reason, the XML-side parser, LXMLTreeBuilderForXML.feed(), had already been fixed.)
bs4/builder/_lxml.py
class LXMLTreeBuilder(HTMLTreeBuilder, LXMLTreeBuilderForXML):

    features = [LXML, HTML, FAST, PERMISSIVE]
    is_xml = False

    def default_parser(self, encoding):
        return etree.HTMLParser

    # def feed(self, markup):
    #     encoding = self.soup.original_encoding
    #     try:
    #         self.parser = self.parser_for(encoding)
    #         self.parser.feed(markup)
    #         self.parser.close()
    #     except (UnicodeDecodeError, LookupError, etree.ParserError), e:
    #         raise ParserRejectedMarkup(str(e))
Googling again, I found a related post on the beautifulsoup Google Groups forum. It looks like LXMLTreeBuilderForXML.feed() was bug-fixed at that time, but the corresponding change to LXMLTreeBuilder was overlooked...
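For reference, the gist of that fix is to feed the markup to lxml in small chunks instead of one huge feed() call. The following standalone sketch (my own simplification, not the library's exact code; CHUNK_SIZE and the function name are assumptions) shows the chunked-feed pattern using lxml's feed-parser interface. This chunked behavior is what the workaround above falls back to once LXMLTreeBuilder.feed() is commented out.
chunked_feed_sketch.py (simplified illustration)
# -*- coding: utf-8 -*-
# Standalone sketch of the chunked-feed idea, using lxml's feed-parser
# interface directly. CHUNK_SIZE is an assumed value for illustration only.
from StringIO import StringIO
from lxml import etree

CHUNK_SIZE = 512  # assumed chunk size

def parse_in_chunks(markup):
    parser = etree.HTMLParser()
    io = StringIO(markup)
    data = io.read(CHUNK_SIZE)
    parser.feed(data)              # call feed() at least once
    while data:
        data = io.read(CHUNK_SIZE)
        if data:
            parser.feed(data)
    return parser.close()          # close() returns the parsed root element

if __name__ == "__main__":
    root = parse_in_chunks(open("rep_38248_dev1.html").read())
    print etree.tostring(root, pretty_print=True)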