How to process and replace the Japanese in a string: "http://hogefuga/qiita.com" -> replace "hogefuga" with the result of processing it
A proper (?) way to handle a URL containing Japanese with urllib's urlopen
Addendum: The method pointed out in the comments seems to be the correct way to handle this, so I added it at the end.
response = urllib.request.urlopen(url)
Nothing unusual here. It just accesses the URL and returns a response object. However, tragedy struck because this URL contained Japanese.
url = 'http://image.search.yahoo.co.jp/search?p=エヴァンゲリオン'
Something like this.
With that, you get dragged straight into the darkness of Python. (Added the error details.)
Traceback (most recent call last):
  ・ ・ ・
    response = urllib.request.urlopen(link)
  File "/Users/mix/.pyenv/versions/3.5.0/lib/python3.5/urllib/request.py", line 162, in urlopen
    return opener.open(url, data, timeout)
  File "/Users/mix/.pyenv/versions/3.5.0/lib/python3.5/urllib/request.py", line 465, in open
    response = self._open(req, data)
  File "/Users/mix/.pyenv/versions/3.5.0/lib/python3.5/urllib/request.py", line 483, in _open
    '_open', req)
  File "/Users/mix/.pyenv/versions/3.5.0/lib/python3.5/urllib/request.py", line 443, in _call_chain
    result = func(*args)
  File "/Users/mix/.pyenv/versions/3.5.0/lib/python3.5/urllib/request.py", line 1268, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/Users/mix/.pyenv/versions/3.5.0/lib/python3.5/urllib/request.py", line 1240, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "/Users/mix/.pyenv/versions/3.5.0/lib/python3.5/http/client.py", line 1083, in request
    self._send_request(method, url, body, headers)
  File "/Users/mix/.pyenv/versions/3.5.0/lib/python3.5/http/client.py", line 1118, in _send_request
    self.putrequest(method, url, **skips)
  File "/Users/mix/.pyenv/versions/3.5.0/lib/python3.5/http/client.py", line 960, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 14-21: ordinal not in range(128)
Judging from the error, ~~urllib is just trying to convert the URL to ASCII, right???~~ PS: it was http (http/client.py in the traceback) that was trying to encode the URL as ASCII!!
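The failing step can be reproduced without any network access. The sketch below is only an illustration: it assumes the request line has the form `GET <selector> HTTP/1.1` and that the query holds the eight katakana characters エヴァンゲリオン; the variable names are made up for this example.

```python
# Minimal reproduction of the failing step, without touching the network.
# Assumption: the request line has the form "GET <selector> HTTP/1.1".
selector = '/search?p=エヴァンゲリオン'
request_line = 'GET {} HTTP/1.1'.format(selector)

try:
    request_line.encode('ascii')
    error = None
except UnicodeEncodeError as exc:
    error = exc

# exc.start is the index of the first offending character,
# exc.end is one past the last one.
if error is not None:
    print(error.start, error.end - 1)  # prints: 14 21
```

The reported range lines up with "position 14-21" in the traceback, which suggests the Japanese query text is what the ASCII encoder is rejecting.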
Workaround
So I searched around. ~~The Japanese part should be parsed.~~ Postscript: the Japanese part should be URL-encoded (percent-encoded).
urllib.parse.quote_plus('エヴァンゲリオン', encoding='utf-8')
Something like that? But there is a problem with this...
url = 'http://image.search.yahoo.co.jp/search?p=' + urllib.parse.quote_plus('エヴァンゲリオン', encoding='utf-8')
Done naively, it ends up like this: the Japanese part has to be cut out and encoded by hand. Looking into it further, you can also specify characters to exclude from encoding! It seems you pass them as the second argument (safe).
urllib.parse.quote_plus(url, "/:?=&")
Something like that? Some characters that should be excluded may be missing from the list... It worked, but I was a little worried, so here is another method.
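To make that worry concrete, here is a small sketch of what quoting a whole URL with a hand-picked `safe` list does. The `example.com` URLs are hypothetical and exist only for illustration.

```python
from urllib.parse import quote_plus

SAFE = "/:?=&"  # characters left unencoded, as in the call above

# A hypothetical URL used only for illustration.
url = 'http://example.com/search?p=エ&q=a b'
encoded = quote_plus(url, safe=SAFE)
print(encoded)  # http://example.com/search?p=%E3%82%A8&q=a+b

# Caveat: anything outside SAFE is encoded too, so an existing
# percent escape or a fragment marker gets mangled.
mangled = quote_plus('http://example.com/a%20b#top', safe=SAFE)
print(mangled)  # http://example.com/a%2520b%23top
```

So the approach works for simple URLs, but `%` and `#` are examples of characters that a list like `"/:?=&"` does not cover.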
Conversely (?), I could just replace all the Japanese characters! So I tried that.
It is hard to read! But with this method, each substring that matches the regular expression can be replaced with "the result of passing it to a function".
I wanted to make it neater, but my head is too stiff to come up with anything... I don't know Python well, so it is not pretty at first glance... It seems a lambda cannot be used for side effects either. Please let me know if there is a better way. An iterator, maybe?
regex = r'[Ah-Gaa-熙]'
# Collect every matched character, then percent-encode each one in place
matchedList = re.findall(regex, url)
for m in matchedList:
    url = url.replace(m, urllib.parse.quote_plus(m, encoding="utf-8"))
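One possible answer to "replace each match with the result of passing it to a function" is `re.sub`, which accepts a callable as its replacement argument. The following is a sketch rather than the original author's code: the helper name `encode_non_ascii` is hypothetical, and it approximates "Japanese" as "any run of non-ASCII characters".

```python
import re
from urllib.parse import quote_plus

# Assumption: "Japanese" is approximated as "any run of non-ASCII characters".
NON_ASCII_RUN = re.compile(r'[^\x00-\x7F]+')

def encode_non_ascii(url):
    """Percent-encode every non-ASCII run in url, leaving ASCII untouched."""
    return NON_ASCII_RUN.sub(
        lambda match: quote_plus(match.group(0), encoding='utf-8'), url)

# A hypothetical URL used only for illustration.
print(encode_non_ascii('http://example.com/search?p=エ&page=2'))
# http://example.com/search?p=%E3%82%A8&page=2
```

Because the callable receives each match object and returns the replacement string, no explicit loop or repeated `str.replace` is needed.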
Many articles write [あ-ん] when they want to match all Japanese, but looking at the character code table, the range actually runs all the way to 熙!
Yes!! I wrote this article, even at the cost of exposing my messy code in a language I barely know, because I wanted to share that last surprise.
@KeisukeKudo suggested an improvement, so I will introduce it here as well! Strictly speaking, my pattern misses some characters, so if you want to use this approach, please use the following instead.
regex = r'[Ah-Gaa-熙]'
# Change the above as follows
regex = r'[^\x00-\x7F]'
How about trying [^\x00-\x7F]? [\x00-\x7F] is a regular expression that matches ASCII characters. By negating it as above, you get the characters that are Japanese (strictly speaking, every non-ASCII character). http://rubular.com/r/2dnoBUlKe9
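A quick check of what the negated class matches may help here. This is a small sketch with made-up sample strings; note that the pattern is not specific to Japanese.

```python
import re

non_ascii = re.compile(r'[^\x00-\x7F]')

# Each non-ASCII character is matched individually.
katakana = non_ascii.findall('p=エヴァ')
# An accented Latin letter (U+00E9) is matched as well.
accented = non_ascii.findall('caf\u00e9')
# A pure ASCII string yields no matches.
plain = non_ascii.findall('plain')

print(katakana, accented, plain)
```

For percent-encoding a URL this breadth is usually what you want, since every non-ASCII character has to be encoded anyway.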
@komeda-shinji also suggested an improvement, so I will introduce it here as well! Thinking concretely about what we want to do: when the URL query contains characters that cannot be converted to ASCII, we want to URL-encode them first. In that case, the following is better.
The URL is split into its components, only the query is URL-encoded, and then the URL is reassembled.
from urllib.parse import urlparse
import urllib.parse
import urllib.request

url = 'http://image.search.yahoo.co.jp/search?p=エヴァンゲリオン'
p = urlparse(url)
# Percent-encode only the query, keeping '=' and '&' as separators
query = urllib.parse.quote_plus(p.query, safe='=&')
# Reassemble the URL, adding each delimiter only when its component is present
url = '{}://{}{}{}{}{}{}{}{}'.format(
    p.scheme, p.netloc, p.path,
    ';' if p.params else '', p.params,
    '?' if p.query else '', query,
    '#' if p.fragment else '', p.fragment)
response = urllib.request.urlopen(url)
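The manual reassembly can probably be shortened with `urlsplit` and `urlunsplit` from the same module. This is a sketch of a variant, not the commenter's code: the helper name `encode_query` and the `example.com` URL are hypothetical, and unlike `urlparse`, `urlsplit` does not separate `;params` from the path.

```python
from urllib.parse import quote_plus, urlsplit, urlunsplit

def encode_query(url):
    """Return url with only its query component percent-encoded."""
    parts = urlsplit(url)
    # Keep '=' and '&' so the key/value structure of the query survives.
    query = quote_plus(parts.query, safe='=&')
    return urlunsplit(parts._replace(query=query))

# A hypothetical URL used only for illustration.
print(encode_query('http://example.com/search?p=エ#top'))
# http://example.com/search?p=%E3%82%A8#top
```

As with the version above, `%` is not in the `safe` list, so a query that is already percent-encoded would be encoded a second time.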