Although zip files can store filenames in UTF-8 in recent specifications, they often use a legacy environment-dependent character code format that stores filenames. In the case of Japanese, Shift-JIS (cp932) is often used according to Windows.
In Python 2, the file name returned by the zipfile module was a byte string, so the file name of cp932 was returned as it was, but in Python 3, the character string was unified to Unicode, so when the zip file is read, the file name is decoded. It will be returned as a character string. However, of course, Japanese customs are not the default behavior, so the characters will be garbled as they are.
When I read the zipfile module in Python 3.4, it looked like this:
if flags & 0x800:
# UTF-8 file names extension
filename = filename.decode('utf-8')
else:
# Historical ZIP filename encoding
filename = filename.decode('cp437')
Wouldn't it be possible to get a UnicodeDecodeError by decoding a cp932 encoded string?
>>> len(bytes(range(256)).decode('cp437'))
256
cp437 seems to decode all bytes one-to-one per character. So, it seems good to re-encode with cp437 and then decode with cp932 again.
import zipfile
zf = zipfile.ZipFile('foo.zip')
for name in zf.namelist():
print(name.encode('cp437').decode('cp932')
Recommended Posts