Extract zip with Python (Japanese file name support)

File name in zip

In the current version of the zip format, each archived entry can carry a UTF-8 flag that says whether its file name is encoded in UTF-8, but it seems there is no way to declare any encoding other than UTF-8.

When a zip file is created on Japanese-locale Windows, the file names are written in Shift_JIS (CP932), depending on the tool. Most modern Linux and Mac systems use UTF-8.

When you extract a zip file that was created on a different OS, entries with the UTF-8 flag extract without problems, but file names stored in Shift_JIS (CP932) may come out garbled.

Specifically, this is the Windows → Linux / Mac case. There you can use an archiver such as `unar`, or correct the garbled file names afterwards with `convmv`.

Separately from those tools, Python's ZipFile library has the same problem: it cannot decode such file names correctly either.

Python3 ZipFile library

The ZipFile library decodes each file name from bytes to a string as UTF-8 if the entry has the UTF-8 flag, and as CP437 otherwise.

https://github.com/python/cpython/blob/3.7/Lib/zipfile.py#L1358-L1365
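
The effect of this rule can be reproduced without a zip file. The following is a minimal sketch, not the library's actual code: `decode_entry_name` is a hypothetical helper that mimics the behavior described above, and `UTF8_FLAG` is a name made up here for the UTF-8 bit (0x800) in the entry's flags.

```python
UTF8_FLAG = 0x800  # hypothetical constant name; bit 11 marks a UTF-8 file name


def decode_entry_name(raw: bytes, flag_bits: int) -> str:
    """Hypothetical helper mimicking how the codec is chosen."""
    if flag_bits & UTF8_FLAG:
        return raw.decode("utf-8")
    return raw.decode("cp437")


# A name as written by a Japanese-locale Windows tool:
# CP932 bytes, with the UTF-8 flag not set.
raw_name = "日本語.txt".encode("cp932")

# Read as CP437, the name turns into mojibake.
garbled = decode_entry_name(raw_name, 0)

# CP437 maps every byte value, so the original bytes are still recoverable.
restored = garbled.encode("cp437").decode("cp932")
```

Here `garbled` is an unreadable string, while `restored` is the original name again. This reversibility is what the workaround below relies on.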

Therefore, if you extract with ZipFile.extractall() or similar, Japanese file names come out garbled.

The workaround is to encode ZipInfo.filename back into a byte string as CP437, decode that byte string with the correct encoding, and then call ZipFile.extract(ZipInfo).

import zipfile

f = r'/file/to/path'  # path to the zip file

with zipfile.ZipFile(f) as z:
    for info in z.infolist():
        # Undo the CP437 decoding, then decode the bytes as CP932.
        info.filename = info.filename.encode('cp437').decode('cp932')
        z.extract(info)

The code above assumes the original encoding is CP932, but that is not always the case in practice, so it is better to add encoding detection or exception handling.

But... no good...!

There is no problem when the ZipFile extraction runs on a Mac, but on Windows an error occurs for certain file names.

In ZipInfo.filename, `os.sep` has been replaced with `/`. That is, on Windows, `\` (\x5c) is replaced with `/` (\x2f).

https://github.com/python/cpython/blob/3.7/Lib/zipfile.py#L347-L351

CP437 is a single-byte character encoding, and its printable ASCII range (\x20-\x7e) is ASCII-compatible, so this replacement is applied even to a byte that was originally part of a multibyte character. As a result, once you convert back to a byte string, a pattern like b'\x90\x2f' appears.

Shift_JIS (CP932) never uses \x2f as a second byte, so trying to decode such a byte string as CP932 raises a decoding error.

This problem occurs for characters whose second byte is \x5c.

Yes. These are the so-called "bad characters" (dame-moji), although the reason they usually cause trouble is different from this one.

Honestly, I had completely forgotten about the bad characters of Shift_JIS (CP932) for the past 10 years. What's more, this time they came as a curveball: replacing the byte with \x2f turns the name into an invalid Shift_JIS (CP932) byte string.
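
The failure can be reproduced on any OS by simulating the separator replacement by hand. This is only a sketch under assumptions: the character "申" is chosen here as an example whose CP932 encoding ends in \x5c, and the `str.replace` call stands in for what happens to ZipInfo.filename on Windows.

```python
# "申" is one of the CP932 characters whose second byte is \x5c.
raw = "申.txt".encode("cp932")          # b'\x90\\.txt'
as_cp437 = raw.decode("cp437")          # the name as seen without the UTF-8 flag

# Simulate the os.sep -> "/" replacement that happens on Windows.
replaced = as_cp437.replace("\\", "/")
broken = replaced.encode("cp437")       # b'\x90/.txt'

try:
    broken.decode("cp932")
    decode_failed = False
except UnicodeDecodeError:
    # \x2f is not a valid second byte in CP932.
    decode_failed = True

# Without the replacement, the round trip recovers the name.
restored = as_cp437.encode("cp437").decode("cp932")
```

With the replacement applied, `decode_failed` ends up `True`. Without it, `restored` is the original name.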

Solution

The file name information (decoded as CP437) before the replacement process is stored in ZipInfo.orig_filename, which can be used to solve the problem.

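import os
import zipfile

f = r'/file/to/path'  # path to the zip file

with zipfile.ZipFile(f) as z:
    for info in z.infolist():
        # orig_filename has not had os.sep replaced, so the CP932 bytes survive.
        info.filename = info.orig_filename.encode('cp437').decode('cp932')
        # Redo the separator replacement on the correctly decoded name.
        if os.sep != "/" and os.sep in info.filename:
            info.filename = info.filename.replace(os.sep, "/")
        z.extract(info)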
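
As noted earlier, CP932 is only an assumption about the original encoding. The following sketch adds the exception handling suggested there. `recover_name`, `extract_all`, `UTF8_FLAG`, and the candidate order (UTF-8 first because it is stricter, then CP932) are all hypothetical choices for illustration, not part of the ZipFile library.

```python
import os
import zipfile

UTF8_FLAG = 0x800  # hypothetical constant name for the UTF-8 flag bit


def recover_name(info, candidates=("utf-8", "cp932")):
    """Hypothetical helper: guess the real file name of a zip entry.

    Falls back to the name ZipFile produced when no candidate fits.
    """
    if info.flag_bits & UTF8_FLAG:
        return info.filename  # already decoded as UTF-8
    try:
        raw = info.orig_filename.encode("cp437")
    except UnicodeEncodeError:
        return info.filename
    for encoding in candidates:
        try:
            name = raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        # Redo the separator replacement on the decoded name.
        if os.sep != "/" and os.sep in name:
            name = name.replace(os.sep, "/")
        return name
    return info.filename


def extract_all(zip_path, dest="."):
    """Extract every entry using the recovered file names."""
    with zipfile.ZipFile(zip_path) as z:
        for info in z.infolist():
            info.filename = recover_name(info)
            z.extract(info, dest)
```

Trying candidates in order is a crude form of encoding detection. Byte strings that happen to be valid in more than one encoding can still be guessed wrong, so the candidate list should match where the archives actually come from.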