The other day I was upgrading a Django application from Python 2.7 to Python 3, and after releasing the Python 3 code we hit this bug: **files with Japanese names that had been uploaded under Python 2 could no longer be downloaded**.
By design, the application stores each uploaded file in storage as a ZIP archive, and at download time it passes the file name to `zipfile.extract()` to extract it. That `extract()` call was failing with an error like `KeyError: "There is no item named 'xxx.yyy' in the archive"`.
The following article was very helpful for troubleshooting, but it did not solve everything on its own, so I am writing this article as a supplement.
Zip extraction with Python (Japanese file name support) - Qiita
As that article explains, Python 2's `zipfile` handled file names as byte strings as-is, so you did not have to worry about the character encoding. Python 3, however, looks at the header information of the ZIP file and, **if the UTF-8 flag is not set, decodes every file name as CP437**.
That said, for someone like me who is not familiar with the ZIP specification or with Python's string handling, that explanation alone does not quite click, so I would like to look at it in a little more detail.
First, Python's `zipfile.ZipFile` object holds metadata (a `ZipInfo` object) for each stored file. `ZipFile.extract()` uses the path held in that `ZipInfo` to access the target file and extract its data.
def extract(self, member, path=None, pwd=None):
    ...
    if not isinstance(member, ZipInfo):
        member = self.getinfo(member)
    ...
    return self._extract_member(member, path, pwd)
If a file name is passed to `extract()`, `ZipFile.getinfo()` is called, and `getinfo()` looks up the `ZipInfo` object of the target file in the `NameToInfo` attribute of the `ZipFile` object. `NameToInfo` is a dictionary of the form `{filename: ZipInfo}`.
class ZipFile(object):
    ...
    def __init__(self, file, mode="r", compression=ZIP_STORED, allowZip64=False):
        ...
        self.NameToInfo = {}    # Find file info given name
        ...
    def getinfo(self, name):
        """Return the instance of ZipInfo given 'name'."""
        info = self.NameToInfo.get(name)
        if info is None:
            raise KeyError(
                'There is no item named %r in the archive' % name)
        return info
This flow is the same in Python 2 and Python 3.
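The lookup described above can be tried out with a minimal sketch. It assumes a small in-memory archive; the member name `hello.txt` is only illustrative and not from the application.

```python
import io
import zipfile

# Build a small archive in memory; "hello.txt" is just an illustrative name.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("hello.txt", b"hello")

with zipfile.ZipFile(buf) as zf:
    names = list(zf.NameToInfo)        # keys are the (decoded) file names
    info = zf.getinfo("hello.txt")     # a plain dictionary lookup
    same_object = zf.NameToInfo["hello.txt"] is info
    try:
        zf.getinfo("missing.txt")
        lookup_error = None
    except KeyError as exc:            # same error the application hit
        lookup_error = str(exc)

print(names, same_object, lookup_error)
```

If the key in `NameToInfo` differs from the name you pass in by even one character, the lookup fails with the `KeyError` seen in the bug.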
What changed in Python 3 is, as explained at the beginning, that the file name is automatically decoded when it is stored in `ZipInfo` and `NameToInfo`. More specifically, the behavior of `zipfile.ZipFile._RealGetContents()` changed.
In Python 2, the name is decoded as UTF-8 only when the UTF-8 flag is set, as shown below; otherwise the byte string is stored as-is.
class ZipFile(object):
    ...
    def __init__(self, file, mode="r", compression=ZIP_STORED, allowZip64=False):
        ...
        try:
            if key == 'r':
                self._RealGetContents()
        ...
    def _RealGetContents(self):
        """Read in the table of contents for the ZIP file."""
        ...
        filename = fp.read(centdir[_CD_FILENAME_LENGTH])
        # Create ZipInfo instance to store file information
        x = ZipInfo(filename)
        ...
        x.filename = x._decodeFilename()
        self.filelist.append(x)
        self.NameToInfo[x.filename] = x
        ...
# _decodeFilename() is a method of ZipInfo, not of ZipFile
class ZipInfo(object):
    ...
    def _decodeFilename(self):
        if self.flag_bits & 0x800:
            return self.filename.decode('utf-8')
        else:
            return self.filename
Python 3's `_RealGetContents()`, on the other hand, always decodes the file name with **either UTF-8 or CP437**, as shown below.
def _RealGetContents(self):
    """Read in the table of contents for the ZIP file."""
    ...
    filename = fp.read(centdir[_CD_FILENAME_LENGTH])
    flags = centdir[5]
    if flags & 0x800:
        # UTF-8 file names extension
        filename = filename.decode('utf-8')
    else:
        # Historical ZIP filename encoding
        filename = filename.decode('cp437')
Because of this change, file names in encodings that the ZIP specification does not cover, such as Shift_JIS, as well as file names that are **encoded in UTF-8 but do not have the UTF-8 flag set**, are forcibly decoded as CP437. As a result, the name of the extracted file is garbled or an error occurs.
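The effect is easy to reproduce without a ZIP file at all. The following is a minimal sketch: `decode_zip_name` is a hypothetical helper that mirrors the Python 3 rule quoted above, not a function of the `zipfile` module.

```python
UTF8_FLAG = 0x800  # general purpose bit 11: "file name is UTF-8"


def decode_zip_name(raw: bytes, flag_bits: int) -> str:
    """Mirror the Python 3 rule: UTF-8 if flagged, CP437 otherwise."""
    if flag_bits & UTF8_FLAG:
        return raw.decode("utf-8")
    return raw.decode("cp437")


original = "テスト用です.txt"
sjis_bytes = original.encode("shift_jis")

# Shift_JIS bytes without the UTF-8 flag are read as CP437 and come out garbled.
garbled = decode_zip_name(sjis_bytes, flag_bits=0)   # 'âeâXâgùpé┼é╖.txt'

# CP437 maps every byte value to a character, so the step can be undone.
restored = garbled.encode("cp437").decode("shift_jis")
```

Because the CP437 decoding loses no information, encoding the garbled name back to CP437 recovers the original bytes, which is what makes the re-decoding workaround possible.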
In fact, you can specify the file to extract by passing either the file name or a `ZipInfo` object as the first argument of `extract()`.
If you pass a `ZipInfo` object, the workaround from the article mentioned in the introduction (re-decode `ZipInfo.filename` and pass the modified `ZipInfo` object to `extract()`) works. If you pass the file name, however, the error remains. Let's check with a concrete example.
In a directory named `python_zip`, create a file named `テスト用です.txt` whose file name is encoded in Shift_JIS, and compress it as `test.zip` in the same directory (because LANG is UTF-8, `ls` shows the name garbled). Also create a directory called `extracted` as the destination for extracted files.
~/python_zip$ ls
''$'\203''e'$'\203''X'$'\203''g'$'\227''p'$'\202''ł'$'\267''.txt' extracted test.zip
Try to extract the txt file from this `test.zip` with code that rewrites `ZipInfo.filename`, as follows.
import zipfile
zf = zipfile.ZipFile("test.zip", 'r')
for info in zf.infolist():
    bad_filename = info.filename
    # Undo the CP437 decoding and decode the raw bytes as Shift_JIS instead
    info.filename = info.filename.encode('cp437').decode('shift_jis')
zf.extract("テスト用です.txt", "./extracted")
zf.close()
This raises a `KeyError`, as shown below.
~/python_zip$ python extract_zip_py3.py
Traceback (most recent call last):
  File "extract_zip_py3.py", line 24, in <module>
    zf.extract("テスト用です.txt", "./extracted")
(Omitted)
KeyError: "There is no item named 'テスト用です.txt' in the archive"
As described in the overview of `extract()` above, when a file is specified by name, the `NameToInfo` attribute is consulted. Dumping `NameToInfo` for `test.zip` after rewriting `ZipInfo.filename` shows the following.
{'âeâXâgùpé┼é╖.txt': <ZipInfo filename='テスト用です.txt' filemode='-rw-r--r--' file_size=19>}
Sure enough, `filename` has been fixed, but the key is still the garbled name, so `self.NameToInfo.get(name)` in `getinfo()` finds no match and the error occurs.
This means that rewriting `NameToInfo` as well should make it work. Modify the earlier `extract_zip_py3.py` as follows.
import zipfile
zf = zipfile.ZipFile("test.zip", 'r')
for info in zf.infolist():
    bad_filename = info.filename
    info.filename = info.filename.encode('cp437').decode('shift_jis')
    # Remove the stale key first so an unchanged (ASCII-only) name is not deleted afterwards
    del zf.NameToInfo[bad_filename]
    zf.NameToInfo[info.filename] = info
print(zf.NameToInfo)  # For debugging
zf.extract("テスト用です.txt", "./extracted")
zf.close()
When you do this, you get:
~/python_zip$ python extract_zip_py3.py
{'テスト用です.txt': <ZipInfo filename='テスト用です.txt' filemode='-rw-r--r--' file_size=19>}
~/python_zip$ ls extracted
テスト用です.txt
With `NameToInfo` rewritten, the `ZipInfo` of the target file is found correctly, and the file is extracted without a garbled name.
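The same fix can be wrapped in a function and tried without any files on disk. This is a sketch under some assumptions: `redecode_names` and `make_sjis_zip` are hypothetical names, and `make_sjis_zip` is only a test helper that fakes an old archive by writing an ASCII placeholder name of the same byte length and then patching the name bytes.

```python
import io
import zipfile


def make_sjis_zip(name: str, data: bytes) -> bytes:
    """Build a test ZIP whose member name is raw Shift_JIS with no UTF-8 flag.

    Hypothetical test helper: the placeholder is swapped for the Shift_JIS
    bytes in both the local header and the central directory.
    """
    raw_name = name.encode("shift_jis")
    placeholder = b"A" * len(raw_name)
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(placeholder.decode("ascii"), data)
    return buf.getvalue().replace(placeholder, raw_name)


def redecode_names(zf: zipfile.ZipFile, encoding: str) -> None:
    """Re-decode CP437-decoded names and re-key NameToInfo to match."""
    for info in zf.infolist():
        if info.flag_bits & 0x800:
            continue  # flagged as UTF-8, so it was decoded correctly
        bad_name = info.filename
        good_name = bad_name.encode("cp437").decode(encoding)
        info.filename = good_name
        del zf.NameToInfo[bad_name]
        zf.NameToInfo[good_name] = info


archive = make_sjis_zip("テスト用です.txt", b"hello")
with zipfile.ZipFile(io.BytesIO(archive)) as zf:
    garbled_keys = list(zf.NameToInfo)
    redecode_names(zf, "shift_jis")
    fixed_keys = list(zf.NameToInfo)
    content = zf.read("テスト用です.txt")  # lookup by name now succeeds
```

Skipping entries that carry the UTF-8 flag is a design choice of this sketch: those names were already decoded correctly and would be damaged by a second round of decoding.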
I think Solution 1 is basically sufficient for migrating a Django application from Python 2 to 3, but if the encoding used for re-decoding can be detected automatically instead of being hard-coded, that is arguably safer.
So, in addition to the Shift_JIS `テスト用です.txt`, I created `テスト2.txt` with a UTF-8 file name and compressed it into `test2.zip`.
~/python_zip$ ls
''$'\203''e'$'\203''X'$'\203''g'$'\227''p'$'\202''ł'$'\267''.txt' extracted test2.zip
extract_zip_py3.py test.zip テスト2.txt
Then I modified `extract_zip_py3.py` as follows.
import sys
import zipfile
import chardet  # third-party package for character encoding detection
args = sys.argv
zname = args[1]  # ZIP file name
fname = args[2]  # Name of the file to extract
zf = zipfile.ZipFile(zname, 'r')
for info in zf.infolist():
    bad_filename = info.filename
    code = chardet.detect(info.filename.encode('cp437'))  # Detect the character encoding automatically
    print(code)
    info.filename = info.filename.encode('cp437').decode(code['encoding'])
    # Remove the stale key first so an unchanged (ASCII-only) name is not deleted afterwards
    del zf.NameToInfo[bad_filename]
    zf.NameToInfo[info.filename] = info
print(zf.NameToInfo)
zf.extract(fname, "./extracted")
zf.close()
Let's run it.
~/python_zip$ python extract_zip_py3.py test.zip テスト用です.txt
{'encoding': 'SHIFT_JIS', 'confidence': 0.99, 'language': 'Japanese'}
{'テスト用です.txt': <ZipInfo filename='テスト用です.txt' filemode='-rw-r--r--' file_size=19>}
~/python_zip$ python extract_zip_py3.py test2.zip テスト2.txt
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
{'テスト2.txt': <ZipInfo filename='テスト2.txt' filemode='-rw-r--r--' file_size=28>}
~/python_zip$ ls extracted
テスト用です.txt テスト2.txt
As intended, the file name was re-decoded with the automatically detected encoding, and the file was extracted without a garbled name.
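If adding `chardet` as a dependency is not an option, a cruder alternative is to try a short list of candidate encodings in order. Note that this is a different technique from the statistical detection above, shown only as a sketch; `guess_decode` is a hypothetical helper and the candidate list is an assumption about which encodings your archives may contain.

```python
def guess_decode(raw: bytes, candidates=("utf-8", "shift_jis")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit: %r" % (raw,))


# Raw name bytes as they would come out of info.filename.encode('cp437')
sjis_result = guess_decode("テスト用です.txt".encode("shift_jis"))
utf8_result = guess_decode("テスト2.txt".encode("utf-8"))
```

UTF-8 is tried first because its strict validation rejects most Shift_JIS byte sequences, whereas UTF-8 bytes may happen to decode as Shift_JIS without an error and yield the wrong characters.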
This time, the change in Python's string handling and the change in the `zipfile` module overlapped, and it took me quite a while before I could explain it in my own words. I would like to write more about Python 2 to 3 migration.
That is the end of the main part; thank you for reading this far. From here on I will write about some details I became curious about while writing, so if you are interested, please bear with me a little longer.
This workaround modifies the metadata of the `ZipFile` object, and I found it odd that the target file can still be located in the ZIP archive after `filename` and `NameToInfo` have been rewritten arbitrarily.
Following the source a little further, I found a separate attribute, `ZipInfo.orig_filename`. It is compared with the file name written in the local header of the ZIP archive (the per-file metadata stored in the ZIP; see the figure below) after that name has been decoded, and the file is `open()`ed only if the two match.
When I added debug prints to `extract_zip_py3.py`, `orig_filename` was still garbled after `NameToInfo` had been rewritten, and it matched the file name read from the local header (`fname`) decoded as CP437.
~/python_zip$ python extract_zip_py3.py test.zip テスト用です.txt
{'テスト用です.txt': <ZipInfo filename='テスト用です.txt' filemode='-rw-r--r--' file_size=19>}
orig_filename: âeâXâgùpé┼é╖.txt
fname: b'\x83e\x83X\x83g\x97p\x82\xc5\x82\xb7.txt'
~/python_zip$ python
Python 3.7.5 (default, Nov 22 2020, 16:16:44)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> s = b'\x83e\x83X\x83g\x97p\x82\xc5\x82\xb7.txt'
>>> s.decode('cp437')
'âeâXâgùpé┼é╖.txt'
The information used to locate the file lives in `orig_filename` and the ZIP local header, which the workaround does not touch, while `filename` is used for the name of the extracted file. That is why everything works out.
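To make the check concrete, here is a minimal sketch, assuming an in-memory archive with an illustrative member `data.txt`. `local_header_name` is a hypothetical helper, not part of `zipfile`; it reads the name bytes straight from the 30-byte fixed part of the local file header.

```python
import io
import struct
import zipfile

# Fixed part of a local file header: signature, five 16-bit fields,
# three 32-bit fields (CRC, sizes), then name length and extra length.
LOCAL_HEADER = struct.Struct("<4s5H3L2H")


def local_header_name(archive_bytes: bytes, info: zipfile.ZipInfo) -> bytes:
    """Return the raw file-name bytes stored in the local header of `info`."""
    fields = LOCAL_HEADER.unpack_from(archive_bytes, info.header_offset)
    name_length = fields[9]
    start = info.header_offset + LOCAL_HEADER.size
    return archive_bytes[start:start + name_length]


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("data.txt", b"payload")
archive = buf.getvalue()

with zipfile.ZipFile(io.BytesIO(archive)) as zf:
    info = zf.infolist()[0]
    info.filename = "renamed.txt"          # what the workaround does
    raw_name = local_header_name(archive, info)
    orig_name = info.orig_filename         # untouched by the assignment
    content = zf.read(info)                # still passes the header check
```

Only `filename` changes; `orig_filename` and the bytes in the archive stay as they were, so the comparison made when the member is opened still succeeds.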
The bug that prompted this article arose because files in ZIP archives created under Python 2 had UTF-8 encoded names but no UTF-8 flag. Why did that happen?
Looking at the source of the `zipfile` module in Python 2, I found that it sets the UTF-8 flag only if the file name is of type `unicode` and cannot be encoded as ASCII.
class ZipInfo(object):
    ...
    def FileHeader(self, zip64=None):
        ...
        filename, flag_bits = self._encodeFilenameFlags()
        ...
    def _encodeFilenameFlags(self):
        if isinstance(self.filename, unicode):
            try:
                return self.filename.encode('ascii'), self.flag_bits
            except UnicodeEncodeError:
                return self.filename.encode('utf-8'), self.flag_bits | 0x800
        else:
            return self.filename, self.flag_bits
So I wrote a Python 2 program that creates a file whose name is a Unicode string, compresses it into a ZIP, and extracts it, added some debug output, and ran it.
# -*- coding: utf-8 -*-
import zipfile
# Create a file, store it in a ZIP, and extract it to another directory
with open(u"テスト用です.txt", 'w') as f:
    f.write("Write to file\n")
zf = zipfile.ZipFile("test.zip", 'w')
zf.write(u"テスト用です.txt")
zf.close()
zf = zipfile.ZipFile("test.zip", 'r')
print zf.NameToInfo
print zf.infolist()[0].flag_bits
zf.extract(u"テスト用です.txt", "./extracted")
zf.close()
The result is below. `0x800` is `2048` in decimal, so the UTF-8 flag is indeed set.
~/python_zip$ python --version
Python 2.7.17
~/python_zip$ python zip_py2.py
{u'\u30c6\u30b9\u30c8\u7528\u3067\u3059.txt': <zipfile.ZipInfo object at 0x7f68ec012940>}
2048
~/python_zip$ ls extracted
テスト用です.txt
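For comparison, a small Python 3 sketch (the member names are illustrative) suggests that the current `zipfile` sets the UTF-8 flag whenever a name is not pure ASCII, so archives written under Python 3 should not run into this problem.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("ascii.txt", b"a")          # pure ASCII name
    zf.writestr("テスト用です.txt", b"b")    # non-ASCII name

with zipfile.ZipFile(buf) as zf:
    # Map each member name to whether its UTF-8 flag is set.
    flags = {info.filename: bool(info.flag_bits & UTF8_FLAG)
             for info in zf.infolist()}
```

In Python 3 every `str` is Unicode, so the Python 2 case of a non-ASCII byte-string name passing through unflagged does not come up here.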
All's well that ends well.
cpython/zipfile.py at 2.7 · python/cpython · GitHub
cpython/zipfile.py at 3.7 · python/cpython · GitHub
.ZIP File Format Specification (English)
ZIP specifications summarized in Japanese · GitHub (a Japanese translation of the specification above; thank you)
[ZIP (File Format) - Wikipedia](https://ja.wikipedia.org/wiki/ZIP_(%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB%E3%83%95%E3%82%A9%E3%83%BC%E3%83%9E%E3%83%83%E3%83%88)#%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB%E3%83%98%E3%83%83%E3%83%80)
It seems that the handling of file names in zipfile has become decent - Tschinoko, this one. (beta)