- Since it is tedious to check and set the character encoding every time a file is read, I wrote a module that detects it automatically.
- It is especially useful when importing CSV files containing Japanese that were created in Excel.
- It also supports files on the web.
- Passing the return value as the `encoding` argument when opening the file has worked without problems so far.
```python
def check_encoding(file_path):
    '''Return the character encoding detected for a local file or URL.'''
    from chardet.universaldetector import UniversalDetector
    import requests

    detector = UniversalDetector()
    if file_path[:4] == 'http':
        # Stream the response so we can stop downloading once detection is done
        r = requests.get(file_path, stream=True)
        for binary in r.iter_content(chunk_size=1024):
            detector.feed(binary)
            if detector.done:
                break
        detector.close()
    else:
        with open(file_path, mode='rb') as f:
            # Feed the file line by line until the detector is confident
            for binary in f:
                detector.feed(binary)
                if detector.done:
                    break
        detector.close()
    print(" ", detector.result, end=' => ')
    print(detector.result['encoding'], end='\n')
    return detector.result['encoding']
```
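As a quick sketch of how this is used, the following self-contained example detects the encoding of a local file and opens it with the detected value. The file name and its contents are hypothetical, and only the local-file branch of the function is reproduced here (no network access):

```python
from chardet.universaldetector import UniversalDetector

def check_encoding(file_path):
    '''Detect the encoding of a local file (local-file branch only).'''
    detector = UniversalDetector()
    with open(file_path, mode='rb') as f:
        for binary in f:
            detector.feed(binary)
            if detector.done:
                break
    detector.close()
    return detector.result['encoding']

# Hypothetical sample: a Shift_JIS (cp932) CSV like one saved from Japanese Excel.
with open('sample.csv', mode='w', encoding='cp932') as f:
    f.write('名前,年齢\n佐藤,二十\n鈴木,三十\n高橋,四十\n田中,五十\n')

encoding = check_encoding('sample.csv')
print(encoding)

# Open the file with the detected encoding, as described in the article.
with open('sample.csv', encoding=encoding) as f:
    print(f.read())
```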
- CSV files containing Japanese are often detected as Shift_JIS, so the next function converts such labels to cp932, Microsoft's superset of Shift_JIS that also covers the vendor extensions common in files saved from Excel.
- Pass the return value of the first function as the argument, and it returns the most suitable encoding name.
```python
def change_encoding(encoding):
    '''Convert Shift_JIS-family labels to cp932'''
    if encoding in ['Shift_JIS', 'SHIFT_JIS', 'shift_jis', 'sjis', 's_jis']:
        encoding = 'cp932'
    return encoding
```
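For reference, a minimal self-contained sketch of the conversion, showing one label that is normalized and one that passes through unchanged:

```python
def change_encoding(encoding):
    '''Map Shift_JIS-style labels to cp932; pass anything else through.'''
    if encoding in ['Shift_JIS', 'SHIFT_JIS', 'shift_jis', 'sjis', 's_jis']:
        encoding = 'cp932'
    return encoding

print(change_encoding('SHIFT_JIS'))  # cp932
print(change_encoding('utf-8'))      # utf-8 (unchanged)
```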
Comments and corrections are welcome. Thank you for reading.