chardet is a Python library that takes a bytes string and guesses which character encoding was used to produce it.
I wanted to use chardet with Python 3, but the official release does not support Python 3 yet.
Searching around, I found a fork of chardet called python3-chardet, so I decided to use that.
Download it from GitHub and install it.
$ git clone git@github.com:bsidhom/python3-chardet.git
In the directory that the clone creates, run
$ python setup.py install
and the installation is complete.
ipython3
import chardet
chardet.detect('abc'.encode('utf-8'))
> {'confidence': 1.0, 'encoding': 'ascii'}
chardet.detect('あいうえお'.encode('utf-8'))
> {'confidence': 0.9690625, 'encoding': 'utf-8'}
chardet.detect('あいうえお'.encode('Shift-JIS'))
> {'confidence': 0.5, 'encoding': 'windows-1252'}
It works. I was a little concerned that 'あいうえお'.encode('Shift-JIS') was detected as windows-1252, but the confidence is only 0.5, so chardet itself is only half sure. The input was too short, so that can't be helped.
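To see why such a short input is ambiguous, here is a minimal sketch using only the standard library (it does not call chardet). It shows that the Shift-JIS bytes of a short Japanese string such as 'あいうえお' also decode without error as windows-1252 (cp1252 in Python), so byte validity alone cannot rule that encoding out. How chardet actually scores the candidates is not shown here.

```python
# Shift-JIS bytes for a short hiragana string (5 characters, 10 bytes).
raw = 'あいうえお'.encode('shift_jis')

# The same bytes also decode without error as windows-1252 (cp1252),
# because every byte value here happens to be assigned in that code page.
as_sjis = raw.decode('shift_jis')
as_cp1252 = raw.decode('cp1252')
print(len(raw), repr(as_sjis), repr(as_cp1252))

# UTF-8, by contrast, can be ruled out: 0x82 is not a valid leading byte.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print('valid UTF-8:', utf8_ok)
```

With only ten bytes to go on, a detector has very little statistical evidence to prefer one valid reading over the other.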
I ran one more experiment to see whether it can be used when scraping web pages.
The target site is Kakaku.com (http://kakaku.com/). It is a good fit because it uses Shift_JIS.
ipython3
import chardet
import requests
r = requests.get('http://kakaku.com')
chardet.detect(r.content)
> {'confidence': 0.99, 'encoding': 'SHIFT_JIS'}
It detected the encoding correctly. Unlike the 'あいうえお'.encode('Shift-JIS') example, it judged the content to be SHIFT_JIS rather than windows-1252, apparently because the input was the long byte string of an entire web page. The confidence is higher, too.
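When scraping, the detection result is usually used to decode the page. The following is a minimal sketch of one way to do that, assuming a result dict with the 'encoding' and 'confidence' keys seen in the output above. The function name decode_with_guess, the 0.8 threshold, and the UTF-8 fallback are hypothetical choices for illustration, not part of chardet.

```python
def decode_with_guess(raw, guess, min_confidence=0.8, fallback='utf-8'):
    """Decode raw bytes using a detect()-style result dict.

    decode_with_guess, min_confidence and fallback are hypothetical names
    and values chosen for this sketch; they are not part of chardet.
    """
    encoding = guess.get('encoding')
    confidence = guess.get('confidence') or 0.0
    # Distrust a missing or low-confidence guess and use the fallback instead.
    if not encoding or confidence < min_confidence:
        encoding = fallback
    # errors='replace' keeps going on bytes that are invalid in the encoding.
    return raw.decode(encoding, errors='replace')


# Stand-in data so the sketch runs without a network request.
page = 'こんにちは'.encode('shift_jis')
text = decode_with_guess(page, {'encoding': 'SHIFT_JIS', 'confidence': 0.99})
print(text)
```

In the session above, this would be called as decode_with_guess(r.content, chardet.detect(r.content)).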
Later I noticed that there is also a C-extension Python library called cChardet. It works with Python 3. PyYoshi is amazing.
It is on PyPI (https://pypi.python.org/pypi/cchardet/), so you can install it with pip.
$ pip install cchardet
While I was at it, I compared the speed of the two libraries using the top page of Kakaku.com. The code is as follows.
compare.py
import chardet
import cchardet
import requests
import time
if __name__ == '__main__':
    r = requests.get('http://kakaku.com')
    # time.clock() was removed in Python 3.8; time.perf_counter() replaces it.
    begin_time = time.perf_counter()
    guessed_encoding = chardet.detect(r.content)
    end_time = time.perf_counter()
    print('chardet: %f, %s' % (end_time - begin_time, guessed_encoding))
    # Same measurement for cChardet on the same bytes.
    begin_time_of_cc = time.perf_counter()
    guessed_encoding_by_cc = cchardet.detect(r.content)
    end_time_of_cc = time.perf_counter()
    print('cChardet: %f, %s' % (end_time_of_cc - begin_time_of_cc, guessed_encoding_by_cc))
And the result is as follows.
chardet: 1.440141, {'confidence': 0.99, 'encoding': 'SHIFT_JIS'}
cChardet: 0.000589, {'confidence': 0.9900000095367432, 'encoding': 'SHIFT_JIS'}
Overwhelming, isn't it? In this run cChardet was roughly 2,400 times faster.
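A single timed call can be noisy, so it may be worth repeating the measurement. The following is a minimal sketch of one way to do that with time.perf_counter from the standard library. The helper name best_time and the repeat count are hypothetical, and the workload is a stand-in so the sketch runs without chardet or cChardet installed.

```python
import time


def best_time(func, data, repeat=5):
    """Return the fastest wall-clock time of func(data) over several runs.

    best_time and repeat are hypothetical names for this sketch.
    """
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        func(data)
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in workload: decoding a repeated Shift-JIS byte string.
sample = 'あいうえお'.encode('shift_jis') * 1000
elapsed = best_time(lambda b: b.decode('shift_jis'), sample)
print('fastest of 5 runs: %f s' % elapsed)
```

For the comparison above, this would be called as best_time(chardet.detect, r.content) and best_time(cchardet.detect, r.content).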
Use cChardet!!!