On Linux, when you unzip a zip file that was created in Japan, the Japanese file names often come out garbled. I usually just use the files as they are without worrying about it, and throw them away once I no longer need them. Even when I do need some of the files, there are never many, so I rename them by hand. This time, though, I couldn't get anywhere without fixing a fair number of files, so I did a little research.
The basic cause is that Windows applications running in a multibyte locale embed the file name in the zip as raw cp932 bytes. If those bytes were written to the file system unchanged when the archive is extracted on Linux, conversion would be easy. It would just be a matter of renaming each file in the shell by passing its name through
iconv -f shift-jis -t utf-8
When names are written out raw, the only byte that would be a problem in a file name is 0x2F ('/'), and 0x2F never appears as the second byte of a cp932 character, so that should not be an issue. In practice, however, unzip seems to apply some conversion to the non-ASCII part, and the names cannot be restored cleanly.
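The claim about 0x2F can be checked mechanically. The following is only a minimal sketch, assuming Python 3 and its built-in cp932 codec; the function name is made up for illustration. It encodes every BMP character that cp932 can represent and collects any whose two-byte encoding ends in 0x2F.

```python
def chars_with_slash_trail_byte():
    """Return every character whose two-byte cp932 encoding ends in 0x2F."""
    offenders = []
    for code_point in range(0x10000):
        try:
            encoded = chr(code_point).encode("cp932")
        except UnicodeEncodeError:
            continue  # not representable in cp932 (includes lone surrogates)
        if len(encoded) == 2 and encoded[1] == 0x2F:
            offenders.append(chr(code_point))
    return offenders


# An empty list means '/' can never hide inside a double-byte character.
print(chars_with_slash_trail_byte())
```

If the list comes back empty, a raw cp932 name can be written to a Linux file system without accidentally producing a path separator.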
When I searched, I found a Stack Overflow thread about doing this in Python, so the script was easy to write. I'm not someone who usually writes handy little tools in a glue language.
unzip.py
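#!/usr/bin/env python
# Written for Python 2, where names without the UTF-8 flag are raw byte strings.
import sys
import zipfile


def main(filename):
    with zipfile.ZipFile(filename) as zf:
        for info in zf.infolist():
            # Re-encode the Shift_JIS name as UTF-8 before extracting.
            info.filename = info.filename.decode('shift-jis').encode('utf-8')
            zf.extract(info)


if __name__ == '__main__':
    sys.exit(main(sys.argv[1]))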
The script simply extracts the zip file given as the first argument, assuming its file names are in SJIS, with no error handling whatsoever.
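For reference, the same idea under Python 3 might look like the sketch below. It relies on the assumption that Python 3's zipfile decodes names lacking the UTF-8 flag as cp437, so the original bytes can be recovered by encoding back to cp437 and then decoding as cp932. recover_name and extract_cp932 are hypothetical names for illustration, not part of any library, and there is still no error handling.

```python
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: the name is stored as UTF-8


def recover_name(name):
    """Undo zipfile's cp437 decoding and reinterpret the bytes as cp932."""
    return name.encode("cp437").decode("cp932")


def extract_cp932(archive, dest="."):
    """Extract every member, assuming unflagged names are really cp932."""
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if not info.flag_bits & UTF8_FLAG:
                # Only the target path changes; the archive lookup still
                # goes through the ZipInfo object itself.
                info.filename = recover_name(info.filename)
            zf.extract(info, dest)
```

Calling extract_cp932('foo.zip', 'out') would extract into out/. The sketch uses the cp932 codec rather than shift-jis because cp932 is the Windows superset that also covers vendor extension characters.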
Then, thinking I should write this up as a note on Qiita, I dug into the cause of the garbling, and it turned out to be a case of "it is darkest under the lighthouse". Orz. unzip already has an option that converts the character encoding properly:
$ unzip -O sjis foo.zip
It seems that this is all you need. The names -O and -I look counterintuitive, but going by unzip's help text, -O specifies the file name encoding for archives created on DOS, Windows, and OS/2, while -I specifies it for archives created on UNIX and other systems. It also seems that the strange conversion happened because automatic detection of the encoding failed.
Note to self: read the help before reading the source. On top of that, Qiita already has an answer.
I went out of my way to do something I normally wouldn't, and it turned out to be completely wasted effort. Still, why is it that I enjoy writing large programs but dislike writing short ones? Maybe because the ratio of boilerplate is so high.