In a small number of environments, when the Windows 8 Store mail app receives an attachment with a Japanese file name, the name sometimes ends up looking like `%1B%24%42~.ext` or `%EF%BF%%BA.ext`. It seems that the original file name is being treated as percent-encoded.
I don't know the fundamental fix that would make the app receive the correct file name, but for the time being I only wanted to find out what the file name was, so I tried decoding it with Python.
In Python, you can decode percent-encoding with `urllib.parse.unquote`. I converted the name with the following code:
import urllib.parse
a = '%1B%24%42%4A%3F%40%2E%1B%28%42%32%37%1B%24%42%47%2F%1B%28%42%2E%70%64%66'
urllib.parse.unquote(a)
The result is `'\x1b$BJ?@.\x1b(B27\x1b$BG/\x1b(B.pdf'`, which cannot be read. `unquote` uses `encoding='utf-8'` by default, so the bytes were interpreted as UTF-8. Since that does not produce a readable name, the original seems to be in a different character encoding.
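To make the default behavior concrete, here is a small sketch. The two short sample names (`%E5%B9%B3.pdf` and `%FF.pdf`) are made up for illustration and do not come from the mailer. The sketch assumes the documented defaults of `unquote`, namely `encoding='utf-8'` and `errors='replace'`.

```python
import urllib.parse

# With no extra arguments, unquote() decodes the percent-escaped bytes as
# UTF-8 and replaces undecodable bytes with U+FFFD (errors='replace').
utf8_name = urllib.parse.unquote('%E5%B9%B3.pdf')  # UTF-8 bytes of '平'
invalid = urllib.parse.unquote('%FF.pdf')          # 0xFF is never valid UTF-8

# The attachment name from the code above contains only ASCII-range bytes,
# so the UTF-8 decode raises no error but leaves the ESC (0x1b) control
# characters in the string.
a = '%1B%24%42%4A%3F%40%2E%1B%28%42%32%37%1B%24%42%47%2F%1B%28%42%2E%70%64%66'
raw = urllib.parse.unquote(a)

print(repr(utf8_name))
print(repr(invalid))
print(repr(raw))
```

In other words, the unreadable result is not a decoding failure as such. The bytes are valid UTF-8, they just were never meant to be read that way.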
If you do not know which character encoding was used, you cannot restore the original string, so the encoding has to be identified first. Looking at the string, patterns such as `%1B%24%42` and `%1B%28%42` appear. These are the bytes ESC $ B and ESC ( B, the escape sequences that JIS code uses to switch between character sets. From this, we can expect the string to be JIS-encoded.
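The check described above can be sketched in code. The function name `looks_like_iso2022_jp` is hypothetical, and this is only a heuristic that looks for the two escape sequences seen in this file name. It is not a general encoding detector.

```python
import urllib.parse

# Escape sequences used by JIS code (ISO-2022-JP) to switch character sets.
ESC_TO_JIS_X_0208 = b'\x1b$B'  # appears percent-encoded as %1B%24%42
ESC_TO_ASCII = b'\x1b(B'       # appears percent-encoded as %1B%28%42


def looks_like_iso2022_jp(quoted: str) -> bool:
    """Hypothetical helper: True if the percent-decoded bytes contain
    one of the JIS escape sequences listed above."""
    raw = urllib.parse.unquote_to_bytes(quoted)
    return ESC_TO_JIS_X_0208 in raw or ESC_TO_ASCII in raw


name = '%1B%24%42%4A%3F%40%2E%1B%28%42%32%37%1B%24%42%47%2F%1B%28%42%2E%70%64%66'
print(looks_like_iso2022_jp(name))
```

`urllib.parse.unquote_to_bytes` is used here instead of `unquote` so that the raw bytes can be inspected before committing to any character encoding.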
JIS code is also called ISO-2022-JP, and Python seems to accept it under the name `iso2022-jp` (I am not sure whether this is the official name). I specified the character encoding with `encoding='iso2022-jp'` and tried decoding again.
import urllib.parse
s = '%1B%24%42%4A%3F%40%2E%1B%28%42%32%37%1B%24%42%47%2F%1B%28%42%2E%70%64%66'
urllib.parse.unquote(s, encoding='iso2022-jp')
I got the result `平成27年.pdf` (Heisei 27, that is, 2015), so I was able to find out the file name.
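Putting the steps together, a minimal sketch of a reusable helper might look like the following. The name `decode_attachment_name` and the fallback to UTF-8 are my own assumptions for illustration. The only case actually verified in this post is the ISO-2022-JP one.

```python
import urllib.parse


def decode_attachment_name(quoted: str) -> str:
    """Hypothetical helper: percent-decode a file name, treating it as
    ISO-2022-JP when it contains an ESC byte and as UTF-8 otherwise."""
    raw = urllib.parse.unquote_to_bytes(quoted)
    if b'\x1b' in raw:
        # ESC bytes suggest JIS escape sequences; 'iso2022_jp' is the
        # codec name Python lists for ISO-2022-JP.
        return raw.decode('iso2022_jp')
    # Assumption: anything else is UTF-8, mirroring unquote()'s defaults.
    return raw.decode('utf-8', errors='replace')


s = '%1B%24%42%4A%3F%40%2E%1B%28%42%32%37%1B%24%42%47%2F%1B%28%42%2E%70%64%66'
print(decode_attachment_name(s))
```

Decoding to bytes first and choosing the codec afterwards is equivalent to passing `encoding=` to `unquote`, but it leaves room to inspect the bytes before deciding. Note that `bytes.decode('iso2022_jp')` raises `UnicodeDecodeError` on malformed input, so a caller handling arbitrary names may want to catch that.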