There is a Python library called chardet. Given a byte string, it guesses which character encoding was used to produce those bytes.

I wanted to use chardet with Python 3, but the official release does not support Python 3 yet.

After some searching, I found python3-chardet, a fork of chardet, and decided to use that.

Addendum: cChardet is faster and easier to use. I have added a postscript about it at the end of this article.

Installation

Download it from GitHub and install it.

$ git clone git@github.com:bsidhom/python3-chardet.git

Then, in the directory that was created, run:

$ python setup.py install

That completes the installation.

Experiment

`ipython3`


import chardet

chardet.detect('abc'.encode('utf-8'))
> {'confidence': 1.0, 'encoding': 'ascii'}

chardet.detect('あいうえお'.encode('utf-8'))
> {'confidence': 0.9690625, 'encoding': 'utf-8'}

chardet.detect('あいうえお'.encode('Shift-JIS'))
> {'confidence': 0.5, 'encoding': 'windows-1252'}

It worked. I'm a little concerned that the Shift-JIS input was judged to be windows-1252, but the confidence is only 0.5, so chardet itself is only half sure. The input was too short, so it can't be helped.
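To illustrate why such a short Shift-JIS input is ambiguous, here is a minimal sketch that uses only the standard library (no chardet). The sample string is chosen for illustration. The point is that its Shift-JIS bytes also form a valid windows-1252 (cp1252) byte sequence, so a detector has very little evidence to prefer one over the other.

```python
# Illustration only: the sample text is an assumption made for this sketch.
sample = 'あいうえお'
data = sample.encode('shift_jis')


def can_decode(raw, encoding):
    """Return True if raw is a valid byte sequence in the given encoding."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Check which candidate encodings accept these ten bytes without error.
for name in ('shift_jis', 'cp1252', 'utf-8'):
    print(name, can_decode(data, name))
```

Both shift_jis and cp1252 accept the bytes, while utf-8 rejects them. With only ten bytes, a statistical detector cannot tell the first two apart with much certainty.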

I ran a further experiment to see whether it could be used when scraping web pages.

The target website is Kakaku.com (http://kakaku.com/). It is a good fit because it uses Shift_JIS.

`ipython3`


import chardet
import requests

r = requests.get('http://kakaku.com')
chardet.detect(r.content)
> {'confidence': 0.99, 'encoding': 'SHIFT_JIS'}

It detected the encoding correctly. Unlike the short Shift-JIS string in the earlier experiment, the input was judged to be SHIFT_JIS rather than windows-1252, apparently because it was a long byte sequence covering the entire web page. The confidence is also higher.
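When scraping, the guess and its confidence could be used together to decode a page. The following is a minimal sketch, not part of chardet: `decode_with_guess`, `min_confidence`, and `fallback` are hypothetical names and defaults chosen for illustration, and a stand-in detector is used so the block runs without chardet installed.

```python
def decode_with_guess(data, detect, min_confidence=0.8, fallback='utf-8'):
    """Decode bytes using a detector's guess, falling back when it is unsure.

    detect is any callable returning a dict with 'encoding' and
    'confidence' keys (the shape shown in the results above).
    min_confidence and fallback are hypothetical defaults for this sketch.
    """
    guess = detect(data)
    encoding = guess.get('encoding')
    confidence = guess.get('confidence') or 0.0
    if encoding is None or confidence < min_confidence:
        encoding = fallback
    # errors='replace' keeps going even if the guess turns out to be wrong.
    return data.decode(encoding, errors='replace')


# Stand-in detector so the sketch runs without chardet installed.
def fake_detect(data):
    return {'encoding': 'SHIFT_JIS', 'confidence': 0.99}


text = decode_with_guess('あいうえお'.encode('shift_jis'), fake_detect)
print(text)
```

With chardet installed, the call would presumably look like `decode_with_guess(r.content, chardet.detect)`. The threshold of 0.8 is arbitrary and would need tuning for real pages.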

Postscript

I later noticed that there is a Python library called cChardet, implemented as a C extension. It works with Python 3. PyYoshi is amazing.

It's on PyPI (https://pypi.python.org/pypi/cchardet/), so you can install it with pip.

$ pip install cchardet

While I was at it, I compared the speed of the two libraries using the top page of Kakaku.com. The code is as follows.

`compare.py`


import chardet
import cchardet
import requests
import time

if __name__ == '__main__':
    # Fetch the page once so both detectors see the same bytes.
    r = requests.get('http://kakaku.com')

    # time.clock() was removed in Python 3.8; time.perf_counter() is used instead.
    begin_time = time.perf_counter()
    guessed_encoding = chardet.detect(r.content)
    end_time = time.perf_counter()
    print('chardet: %f, %s' % (end_time - begin_time, guessed_encoding))

    begin_time_of_cc = time.perf_counter()
    guessed_encoding_by_cc = cchardet.detect(r.content)
    end_time_of_cc = time.perf_counter()
    print('cChardet: %f, %s' % (end_time_of_cc - begin_time_of_cc, guessed_encoding_by_cc))

The results are as follows.

chardet: 1.440141, {'confidence': 0.99, 'encoding': 'SHIFT_JIS'}
cChardet: 0.000589, {'confidence': 0.9900000095367432, 'encoding': 'SHIFT_JIS'}

Overwhelming, isn't it? In this run cChardet was roughly 2,400 times faster than chardet.
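A single timed call can be noisy, so a comparison like this could be repeated and the best run taken. The following is a minimal sketch using the standard `timeit` module; `best_time` is a hypothetical helper, and a stand-in workload replaces the detectors so the block runs without chardet or cChardet installed.

```python
import timeit


def best_time(detect, data, repeat=5):
    """Return the fastest of `repeat` single-call timings, in seconds."""
    timings = timeit.repeat(lambda: detect(data), repeat=repeat, number=1)
    return min(timings)


# Stand-in workload; in a real comparison, pass a detector's detect
# function and the downloaded page bytes instead.
payload = b'x' * 100000
elapsed = best_time(len, payload)
print('%f seconds' % elapsed)
```

In the comparison above, this would presumably be called as `best_time(chardet.detect, r.content)` and `best_time(cchardet.detect, r.content)`.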

Conclusion

Use cChardet!!!

Notes on using cChardet and python3-chardet with Python 3.3.1.
