Extract zip with Python (Japanese file name support)

File name in zip

In the zip format, each archived entry can carry a flag indicating that its file name is UTF-8 (in the current version of the format), but there seems to be no way to declare any encoding other than UTF-8.

When an archive is created on Japanese-locale Windows, the file names are (depending on the tool) written in Shift_JIS (CP932). Most modern Linux and Mac systems use UTF-8.

When you extract a zip file that was created on a different OS, names with the UTF-8 flag extract without problems, but names written in Shift_JIS (CP932) may come out garbled.

The typical case is Windows → Linux / Mac. There you can use an archiver such as `unar`, or fix the garbled file names afterwards with `convmv`.

Separately from this, Python's `zipfile` module also fails to recognize such file names correctly.

Python3 ZipFile library

The `zipfile` module decodes the file name bytes of an entry to a string as UTF-8 if the entry has the UTF-8 flag, and as CP437 otherwise.
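
This decoding rule can be reproduced without a real archive. The sketch below is a simplified imitation, not the module's actual code: `decode_like_zipfile` and the sample name are made up for illustration, and 0x800 is bit 11 of the zip general-purpose flags, which marks a UTF-8 name.

```python
UTF8_FLAG = 0x800  # bit 11 of the zip general-purpose bit flag


def decode_like_zipfile(raw_name: bytes, flag_bits: int) -> str:
    """Imitate how zipfile turns stored file name bytes into a str."""
    if flag_bits & UTF8_FLAG:
        return raw_name.decode("utf-8")
    return raw_name.decode("cp437")


name = "日本語.txt"

# Entry written as UTF-8 with the flag set: the name survives.
flagged = decode_like_zipfile(name.encode("utf-8"), UTF8_FLAG)

# Entry written as CP932 without the flag: the bytes are read as CP437.
unflagged = decode_like_zipfile(name.encode("cp932"), 0)

print(flagged == name)    # True
print(unflagged == name)  # False (garbled)
```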

Therefore, if you extract with `ZipFile.extractall()` and the like, Japanese file names are extracted with garbled characters.

The workaround is to encode `ZipInfo.filename` back into a byte string as CP437, decode it again with the correct encoding, and then call `ZipFile.extract(ZipInfo)`.

import zipfile

f = r'/file/to/path'  # path to the zip archive

with zipfile.ZipFile(f) as z:
    for info in z.infolist():
        # Undo zipfile's CP437 decoding, then decode the raw bytes as CP932.
        info.filename = info.filename.encode('cp437').decode('cp932')
        z.extract(info)

The code above treats the original encoding as CP932, but in practice that is not always the case, so it is better to add encoding detection or exception handling.
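
One possible shape for that exception handling is sketched below. The helper name `redecode_name` and the candidate list are assumptions made for illustration, not part of `zipfile`. UTF-8 is tried first because it rejects most non-UTF-8 byte sequences, and the name is left unchanged if no candidate decodes cleanly.

```python
def redecode_name(cp437_name, encodings=("utf-8", "cp932")):
    """Re-decode a name that zipfile decoded as CP437.

    Tries each candidate encoding in order and returns the first result
    that decodes cleanly; falls back to the unchanged name otherwise.
    """
    try:
        raw = cp437_name.encode("cp437")
    except UnicodeEncodeError:
        # Not representable in CP437, so it was not decoded as CP437
        # (for example, an entry that had the UTF-8 flag).
        return cp437_name
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return cp437_name
```

The helper takes the CP437-decoded string, so it can be called on whichever name attribute of `ZipInfo` you decide to use. Entries where `info.flag_bits & 0x800` is set are already decoded as UTF-8 and can be skipped.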

But... no good...!

There is no problem when the extraction with ZipFile runs on a Mac, but when you try to extract on Windows, an error occurs for certain file names.

In `ZipInfo.filename`, `os.sep` has been replaced with `/`. That is, on Windows, `\` (\x5c) is replaced with `/` (\x2f).

CP437 is a single-byte encoding that is ASCII-compatible in the printable range (\x20-\x7e), so this replacement is applied even to a \x5c byte that was originally part of a multibyte character. As a result, once you convert back to a byte string, a pattern like b'\x90\x2f' appears.

Shift_JIS (CP932) never uses \x2f as the second byte, so when you try to decode such a byte string as CP932 again, a decoding error occurs.

This problem occurs for characters whose second byte is \x5c.
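
The failure can be reproduced on any OS by simulating the replacement by hand. In this sketch, 表 is chosen as a sample because its CP932 encoding is b'\x95\x5c', and the `replace()` call only stands in for what happens to `ZipInfo.filename` on Windows.

```python
raw = "表.txt".encode("cp932")           # second byte of 表 is \x5c
as_cp437 = raw.decode("cp437")           # how zipfile reads an unflagged name
replaced = as_cp437.replace("\\", "/")   # simulated os.sep replacement on Windows
broken = replaced.encode("cp437")        # \x5c has become \x2f

try:
    broken.decode("cp932")
    failed = False
except UnicodeDecodeError:
    # \x2f is not a valid second byte in CP932.
    failed = True

print(raw)     # b'\x95\\.txt'
print(broken)  # b'\x95/.txt'
print(failed)  # True
```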

Yes. These are the so-called "bad characters". (Although the reason they fail here is different from the usual one.)

Well, I had completely forgotten about the bad characters of Shift_JIS (CP932) for the past 10 years. What's more, this time they come with a curveball: replacing the byte with \x2f turns the name into an invalid Shift_JIS (CP932) byte string.

Solution

The file name as it was before the replacement (decoded as CP437) is kept in `ZipInfo.orig_filename`, so using that solves the problem.

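import os
import zipfile

f = r'/file/to/path'  # path to the zip archive

with zipfile.ZipFile(f) as z:
    for info in z.infolist():
        # orig_filename holds the CP437-decoded name before os.sep was replaced.
        info.filename = info.orig_filename.encode('cp437').decode('cp932')
        # Redo the separator replacement on the correctly decoded name.
        if os.sep != "/" and os.sep in info.filename:
            info.filename = info.filename.replace(os.sep, "/")
        z.extract(info)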
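
As an aside, on Python 3.11 or later `zipfile.ZipFile` accepts a `metadata_encoding` argument (read mode only) that replaces CP437 for entries without the UTF-8 flag. The wrapper below is a minimal sketch assuming the archive really uses CP932; `open_zip_as_cp932` is a hypothetical helper name. Because the name is decoded as CP932 before any separator handling, this should sidestep the \x5c problem as well.

```python
import sys
import zipfile


def open_zip_as_cp932(source):
    """Open a zip so that unflagged names are decoded as CP932."""
    if sys.version_info < (3, 11):
        raise RuntimeError("metadata_encoding requires Python 3.11 or later")
    # Entries with the UTF-8 flag are still decoded as UTF-8.
    return zipfile.ZipFile(source, metadata_encoding="cp932")
```

With this helper, `with open_zip_as_cp932(f) as z: z.extractall()` can be used in place of the loop above. On older Python versions the `orig_filename` approach is still needed.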
