This post covers what I learned while improving the execution speed of an image search bot that uses BeautifulSoup. I hope it helps anyone troubled by slow scraping.
You can speed things up by passing the site's character encoding to BeautifulSoup's **from_encoding** argument.
from urllib import request
import bs4
page = request.urlopen("https://news.yahoo.co.jp/")
html = page.read()
# Pass the character encoding of the site being scraped to from_encoding (UTF-8 for Yahoo News)
soup = bs4.BeautifulSoup(html, "html.parser", from_encoding="utf-8")
Basically, the encoding is written after `charset=` in the page's meta tag.
<!-- Yahoo News example -->
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
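If you'd rather pull the charset out of the HTML programmatically instead of eyeballing the page source, a simple regular expression is enough. This is just a minimal sketch of my own (the regex and variable names are not part of BeautifulSoup):
from urllib import request
import re
html = request.urlopen("https://news.yahoo.co.jp/").read()
# Look for charset=... in the raw bytes; fall back to utf-8 if nothing is declared
match = re.search(rb'charset=["\']?([\w-]+)', html, re.IGNORECASE)
encoding = match.group(1).decode("ascii") if match else "utf-8"
print(encoding)  # UTF-8 for Yahoo News
The result can then be passed straight to from_encoding.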
I verified this with the following script, measuring the time taken to create each BeautifulSoup instance.
verification_bs4.py
from urllib import request as req
from urllib import parse
import bs4
import time
import copy
url = "https://news.yahoo.co.jp/"
page = req.urlopen(url)
html = page.read()
page.close()
start = time.time()
soup = bs4.BeautifulSoup(html, "html.parser")
print('{:.5f}'.format(time.time() - start) + "[s] html.parser, None")
start = time.time()
soup = bs4.BeautifulSoup(html, "lxml")
print('{:.5f}'.format(time.time() - start) + "[s] lxml, None")
start = time.time()
hoge = copy.copy(soup)
print('{:.5f}'.format(time.time() - start) + "[s] copy(lxml, None)")
start = time.time()
soup = bs4.BeautifulSoup(html, "html.parser", from_encoding="utf-8")
print('{:.5f}'.format(time.time() - start) + "[s] html.parser, utf-8")
start = time.time()
soup = bs4.BeautifulSoup(html, "lxml", from_encoding="utf-8")
print('{:.5f}'.format(time.time() - start) + "[s] lxml, utf-8")
start = time.time()
hoge = copy.copy(soup)
print('{:.5f}'.format(time.time() - start) + "[s] copy(lxml, utf-8)")
start = time.time()
soup = bs4.BeautifulSoup(html, "lxml", from_encoding="utf-16")
# The parse result is empty because the specified encoding does not match the document.
print('{:.5f}'.format(time.time() - start) + "[s] lxml, utf-16")
The output is as follows.
% python verification_bs4.py
2.10937[s] html.parser, None
2.00081[s] lxml, None
0.04704[s] copy(lxml, None)
0.03124[s] html.parser, utf-8
0.03115[s] lxml, utf-8
0.04188[s] copy(lxml, utf-8)
0.01651[s] lxml, utf-16
By specifying the character encoding in **from_encoding**, instantiation became much faster. In the code I've seen from people complaining that BeautifulSoup is slow, from_encoding was never set, so I suspect that's the cause.
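If you don't want to hard-code the encoding, another option is to take it from the HTTP Content-Type response header and let BeautifulSoup fall back to its own detection only when the server doesn't declare one. A rough sketch, assuming the server actually sends a charset (not all do):
from urllib import request
import bs4
page = request.urlopen("https://news.yahoo.co.jp/")
html = page.read()
# get_content_charset() returns e.g. "utf-8", or None if the header carries no charset
declared = page.headers.get_content_charset()
soup = bs4.BeautifulSoup(html, "html.parser", from_encoding=declared)
When declared is None, from_encoding=None is the same as not passing it, so the slow auto-detection still kicks in.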
I wondered why it is designed this way, so I read the source code. I don't use Python all that often, though, so I may be missing something. The source code is here
The cause is probably the **EncodingDetector** class defined in **bs4/dammit.py**. A partial excerpt follows.
class EncodingDetector:
"""Suggests a number of possible encodings for a bytestring.
Order of precedence:
1. Encodings you specifically tell EncodingDetector to try first
(the override_encodings argument to the constructor).
2. An encoding declared within the bytestring itself, either in an
XML declaration (if the bytestring is to be interpreted as an XML
document), or in a <meta> tag (if the bytestring is to be
interpreted as an HTML document.)
3. An encoding detected through textual analysis by chardet,
cchardet, or a similar external library.
4. UTF-8.
5. Windows-1252.
"""
@property
def encodings(self):
"""Yield a number of encodings that might work for this markup.
:yield: A sequence of strings.
"""
tried = set()
for e in self.override_encodings:
if self._usable(e, tried):
yield e
# Did the document originally start with a byte-order mark
# that indicated its encoding?
if self._usable(self.sniffed_encoding, tried):
yield self.sniffed_encoding
# Look within the document for an XML or HTML encoding
# declaration.
if self.declared_encoding is None:
self.declared_encoding = self.find_declared_encoding(
self.markup, self.is_html)
if self._usable(self.declared_encoding, tried):
yield self.declared_encoding
# Use third-party character set detection to guess at the
# encoding.
if self.chardet_encoding is None:
self.chardet_encoding = chardet_dammit(self.markup)
if self._usable(self.chardet_encoding, tried):
yield self.chardet_encoding
# As a last-ditch effort, try utf-8 and windows-1252.
for e in ('utf-8', 'windows-1252'):
if self._usable(e, tried):
yield e
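To see the candidate order for a concrete page, you can drive EncodingDetector directly. A sketch, assuming the constructor takes the raw markup plus an is_html flag (the exact signature may vary between bs4 versions):
from urllib import request
from bs4.dammit import EncodingDetector
html = request.urlopen("https://news.yahoo.co.jp/").read()
detector = EncodingDetector(html, is_html=True)
# The encodings property yields candidates in the priority order described above
print(list(detector.encodings))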
Judging from the docstring and the code, parsing is slow because the detector works through candidates 1 to 5 above in order until one succeeds. Item 2 shows that the charset guess from the meta tag mentioned earlier is done automatically, so I assume the design is meant to let you use the library without ever checking a site's source for its encoding. Still, when you scrape a site you usually look at its source anyway, so I don't think it needs to be this slow. (I haven't verified which step is the bottleneck; a rough check is sketched below, and I'd be glad if someone dug deeper.)
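The chardet step (item 3) is my main suspect, and it's easy to time on its own. A minimal sketch, assuming the chardet package is installed (BeautifulSoup only falls back to it when such a library is available):
from urllib import request
import time
import chardet
html = request.urlopen("https://news.yahoo.co.jp/").read()
start = time.time()
# chardet scans the whole byte string to guess its encoding
result = chardet.detect(html)
print('{:.5f}'.format(time.time() - start) + "[s] chardet.detect -> " + str(result))
If this single call takes roughly as long as the slow BeautifulSoup calls above, detection is where the time goes.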
In the timing script above, the instance is duplicated with the copy.copy() method; the reason this is fast lies in `__copy__` in `bs4/__init__.py`. A partial excerpt follows.
__init__.py
class BeautifulSoup(Tag):
def __copy__(self):
"""Copy a BeautifulSoup object by converting the document to a string and parsing it again."""
copy = type(self)(
self.encode('utf-8'), builder=self.builder, from_encoding='utf-8'
)
# Although we encoded the tree to UTF-8, that may not have
# been the encoding of the original markup. Set the copy's
# .original_encoding to reflect the original object's
# .original_encoding.
copy.original_encoding = self.original_encoding
return copy
It's fast because utf-8 is hard-coded here. Conversely, if the scraped site uses an encoding other than utf-8, copying becomes slower. The following script measures this against kakaku.com, which serves its pages in shift_jis.
verification_bs4_2.py
from urllib import request as req
from urllib import parse
import bs4
import time
import copy
url = "https://kakaku.com/"
page = req.urlopen(url)
html = page.read()
page.close()
start = time.time()
soup = bs4.BeautifulSoup(html, "html.parser")
print('{:.5f}'.format(time.time() - start) + "[s] html.parser, None")
start = time.time()
soup = bs4.BeautifulSoup(html, "lxml")
print('{:.5f}'.format(time.time() - start) + "[s] lxml, None")
start = time.time()
soup = bs4.BeautifulSoup(html, "lxml", from_encoding="shift_jis")
print('{:.5f}'.format(time.time() - start) + "[s] lxml, shift_jis")
start = time.time()
hoge = copy.copy(soup)
print('{:.5f}'.format(time.time() - start) + "[s] copy(lxml, shift_jis)")
The output is as follows.
% python verification_bs4_2.py
0.11084[s] html.parser, None
0.08563[s] lxml, None
0.08643[s] lxml, shift_jis
0.13631[s] copy(lxml, shift_jis)
As shown above, copy.copy() is slower than in the utf-8 case. Oddly, though, with shift_jis the execution time barely changed even when nothing was passed to **from_encoding**. ~~I don't understand this part anymore.~~
Thanks for reading this far! Sorry it got a bit messy at the end. It still puzzles me that BeautifulSoup is slow by default when more than 90% of websites use utf-8. I wrote this article because none of the top search results about BeautifulSoup being slow mention this point. If you found it useful, an "LGTM" would be encouraging.
Reference: https://stackoverrun.com/ja/q/12619706