I looked into whether Python can detect a file's character encoding automatically, and I am leaving a note here.
It turned out to be easy with a package called chardet.
Usage — chardet 2.3.0 documentation
test.py
from chardet.universaldetector import UniversalDetector


def check_encoding(file_path):
    detector = UniversalDetector()
    with open(file_path, mode='rb') as f:
        # Feed the file line by line; stop once the detector has decided.
        for binary in f:
            detector.feed(binary)
            if detector.done:
                break
    # close() finalizes detection and populates detector.result.
    detector.close()
    print(detector.result)
    print(detector.result['encoding'])


def main():
    check_encoding('/path/to/sjis.txt')
    check_encoding('/path/to/utf8.txt')


if __name__ == '__main__':
    main()
Output example
$ python test.py
{'encoding': 'CP932', 'confidence': 0.99}
CP932
{'encoding': 'utf-8', 'confidence': 0.99}
utf-8
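Once the encoding is known, a natural next step is to decode the file with it. The following is only a minimal sketch and not part of the original note: `read_text_auto`, its `detect` parameter, and the `fallback` default are hypothetical names introduced here. It assumes `chardet.detect`, which takes a bytes object and returns a dict containing an `encoding` key, and it falls back to UTF-8 when the detector cannot decide.

```python
def read_text_auto(file_path, detect=None, fallback='utf-8'):
    """Read a file and decode it with the detected encoding.

    `detect` is any callable mapping bytes to a dict with an 'encoding'
    key. It defaults to chardet.detect. (Hypothetical helper.)
    """
    if detect is None:
        # Imported lazily so a custom detector can be used without chardet.
        import chardet
        detect = chardet.detect

    with open(file_path, mode='rb') as f:
        raw = f.read()

    # 'encoding' is None when the detector cannot reach a decision.
    encoding = detect(raw).get('encoding') or fallback
    return raw.decode(encoding), encoding
```

For example, `text, encoding = read_text_auto('/path/to/sjis.txt')` would return the decoded text together with the encoding that was used. Since detection is a statistical guess, decoding can still fail or produce wrong characters when the confidence is low.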
Note that detection may take some time for large files. (The `UniversalDetector` above appears to stop as soon as it has reached a decision.)
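One way to keep detection time bounded on large files is to cap how much data is fed to the detector. This is a sketch under stated assumptions rather than anything from the chardet documentation: `detect_encoding_limited`, `max_bytes`, `chunk_size`, and the injectable `detector` argument are hypothetical names, and the 1 MiB cap is an arbitrary choice. It relies only on the `feed`, `done`, `close`, and `result` members used in test.py.

```python
def detect_encoding_limited(file_path, max_bytes=1024 * 1024,
                            chunk_size=64 * 1024, detector=None):
    """Detect the encoding from at most `max_bytes` of the file.

    `detector` is any object with feed(), done, close(), and result.
    It defaults to a new UniversalDetector. (Hypothetical helper.)
    """
    if detector is None:
        # Imported lazily so a custom detector can be used without chardet.
        from chardet.universaldetector import UniversalDetector
        detector = UniversalDetector()

    remaining = max_bytes
    with open(file_path, mode='rb') as f:
        # Stop at the byte cap, at end of file, or once the detector is done.
        while remaining > 0 and not detector.done:
            chunk = f.read(min(chunk_size, remaining))
            if not chunk:
                break
            detector.feed(chunk)
            remaining -= len(chunk)

    # close() finalizes detection and populates detector.result.
    detector.close()
    return detector.result
```

Reading fixed-size chunks instead of iterating over lines also avoids loading a huge "line" into memory when the file contains no newline bytes. The trade-off is that a smaller sample may lower the confidence of the result.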