How to process and replace the Japanese in a string such as "http://hogefuga/qiita.com": replace "hogefuga" with the result of processing it
The proper (?) way to handle a URL containing Japanese with urllib's urlopen. Addendum: I added the method that was pointed out to me at the end, because it seems to be the correct one for this case.

background

I was trying to scrape "Evangelion" images with urllib and got stuck on a UnicodeEncodeError. I'm not good at Python...

Where I stumbled

response = urllib.request.urlopen(url)

This is perfectly ordinary: it just accesses the URL and returns a response object. However, tragedy struck because this URL contained Japanese.

url = 'http://image.search.yahoo.co.jp/search?p=エヴァンゲリオン'

It looked like this.

With a URL like that, you are promptly dragged into the darkness of Python. (Added: the error details.)

Traceback (most recent call last):
・ ・ ・
    response = urllib.request.urlopen(link)
  File "/Users/mix/.pyenv/versions/3.5.0/lib/python3.5/urllib/request.py", line 162, in urlopen
    return opener.open(url, data, timeout)
  File "/Users/mix/.pyenv/versions/3.5.0/lib/python3.5/urllib/request.py", line 465, in open
    response = self._open(req, data)
  File "/Users/mix/.pyenv/versions/3.5.0/lib/python3.5/urllib/request.py", line 483, in _open
    '_open', req)
  File "/Users/mix/.pyenv/versions/3.5.0/lib/python3.5/urllib/request.py", line 443, in _call_chain
    result = func(*args)
  File "/Users/mix/.pyenv/versions/3.5.0/lib/python3.5/urllib/request.py", line 1268, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/Users/mix/.pyenv/versions/3.5.0/lib/python3.5/urllib/request.py", line 1240, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "/Users/mix/.pyenv/versions/3.5.0/lib/python3.5/http/client.py", line 1083, in request
    self._send_request(method, url, body, headers)
  File "/Users/mix/.pyenv/versions/3.5.0/lib/python3.5/http/client.py", line 1118, in _send_request
    self.putrequest(method, url, **skips)
  File "/Users/mix/.pyenv/versions/3.5.0/lib/python3.5/http/client.py", line 960, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 14-21: ordinal not in range(128)

Judging from the error, ~~urllib is just trying to convert the URL to ASCII, right?~~ PS: it was http (http.client) that was trying to encode the request, URL included, as ASCII!
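The last frame of the traceback can be reproduced without any network access. This is only a sketch: it assumes the search term was the katakana エヴァンゲリオン (an assumption, although its 8 characters would line up with "position 14-21" in the error) and that the request line has the usual `GET <path> HTTP/1.1` shape.

```python
# Hypothetical reconstruction of the request line that http.client sends.
term = "エヴァンゲリオン"  # assumed search term (8 non-ASCII characters)
request_line = "GET /search?p=" + term + " HTTP/1.1"

error = None
try:
    # The same kind of call that fails at the bottom of the traceback.
    request_line.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc

# Prints the codec's error message, including the offending positions.
print(error)
```

Because the failure happens while encoding the request line, it occurs before anything is sent to the server.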

So where is the workaround? I searched, and ~~the Japanese part should be parsed~~. Postscript: the Japanese part should be URL-encoded (percent-encoded).

urllib.parse.quote_plus('エヴァンゲリオン', encoding='utf-8')

Something like that? But there is a problem with this...

url = 'http://image.search.yahoo.co.jp/search?p=' + urllib.parse.quote_plus('エヴァンゲリオン', encoding='utf-8')

Written naively, it ends up like this, with the Japanese part cut out and encoded by hand... Looking into it further, you can also specify characters to exclude from encoding! It seems you pass them as the second argument.

urllib.parse.quote_plus(url, "/:?=&")

Something like that? Some characters may be missing from the excluded set... It worked with this, but I was a little worried, so I also tried another method.
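Before moving on, the two quote_plus calls above can be compared directly. This is a rough sketch rather than a recommendation; the katakana search term and the example.com URL are assumptions made for illustration.

```python
import urllib.parse

term = "エヴァンゲリオン"  # assumed search term, not confirmed by the text
base = "http://image.search.yahoo.co.jp/search?p="

# 1) Encode only the Japanese part and append it to the ASCII part.
url_a = base + urllib.parse.quote_plus(term, encoding="utf-8")

# 2) Encode the whole URL, listing the delimiters to leave untouched.
url_b = urllib.parse.quote_plus(base + term, safe="/:?=&")

print(url_a)
print(url_a == url_b)

# The worry about omissions: anything not listed in safe is encoded too,
# including '#' and a '%' that was already part of an encoded sequence.
overquoted = urllib.parse.quote_plus("http://example.com/a%20b#top", safe="/:?=&")
print(overquoted)
```

For this particular URL both approaches give the same result, but the second one depends on the safe list covering every delimiter the URL happens to use.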

Conversely (?), why not just replace all of the Japanese? So I tried that.

What I did

Get the Japanese with a regular expression -> find all matches and replace them using the resulting list

It's roundabout! However, with this method you can replace the words that match a regular expression with "the result of passing them to a function".

I wanted to do something smarter, but my stiff head couldn't come up with anything... I don't know much about Python, so the code doesn't look good at first glance... It seems a lambda can't have side effects either. Please let me know if there is a better way. Maybe an iterator?

regex = r'[Ah-Gaa-熙]'
matchedList = re.findall(regex, url)
for m in matchedList:
    url = url.replace(m, urllib.parse.quote_plus(m, encoding="utf-8"))

When it comes to matching all Japanese, many articles write [A-n], but when I looked at the character code table, what a surprise!

So! Even at the cost of exposing dirty code in Python, which I am not at all familiar with, I wrote this article because I wanted to share that last surprise.
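On the question of whether there is a more direct way to replace each match with "the result of passing it to a function": one option is re.sub, which accepts a callable as the replacement. The sketch below is not the author's code. The helper names encode_match and encode_non_ascii are made up for illustration, and the pattern is an assumed one that matches any run of non-ASCII characters instead of the range used above.

```python
import re
import urllib.parse

# Assumed pattern: one or more consecutive non-ASCII characters.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")


def encode_match(match):
    """Return the percent-encoded form of the matched text."""
    return urllib.parse.quote_plus(match.group(0), encoding="utf-8")


def encode_non_ascii(url):
    """Percent-encode every non-ASCII run in url, leaving the rest as is."""
    # re.sub calls encode_match once per match and splices in its result.
    return NON_ASCII.sub(encode_match, url)


encoded = encode_non_ascii("http://example.com/search?p=あ&q=abc")
print(encoded)
```

This removes the need for a separate findall pass and a loop of str.replace calls, since each match is encoded exactly where it was found.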

Postscript: The correct way to write the regular expression

@KeisukeKudo-san suggested an improvement, so I will introduce it here as well! Strictly speaking, my notation misses some characters, so if you want to use this approach, please use the following.

regex = r'[Ah-Gaa-熙]'
# Changed the above to the following
regex = r'[^\x00-\x7F]'

How about trying [^\x00-\x7F]? [\x00-\x7F] is a regular expression that matches ASCII characters, so by using its negated form as above you can pick up the characters that are Japanese. http://rubular.com/r/2dnoBUlKe9
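To see what the negated class actually picks up, here is a small check. The sample string is made up for illustration. Note that the class matches every non-ASCII character, so an accented Latin letter is caught as well; for percent-encoding a URL that is arguably the behavior you want.

```python
import re

ascii_char = re.compile(r"[\x00-\x7F]")
non_ascii_char = re.compile(r"[^\x00-\x7F]")

# Hiragana, katakana, a kanji, and an accented Latin letter (U+00E9).
sample = "abc あ ア 亜 \u00e9"

matches = non_ascii_char.findall(sample)
ascii_matches = ascii_char.findall(sample)

# Print counts only, to avoid console encoding issues with the characters.
print(len(matches), len(ascii_matches))
```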

Postscript: The most correct method for this case

@komeda-shinji suggested an improvement, so I will introduce it here as well! Thinking concretely about what I wanted to do, the situation is that the URL query contains characters that cannot be converted to ASCII, so the following is better: URL-encode the query first.

The URL is split into its components, and only the query is URL-encoded before the URL is reassembled.

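import urllib.parse
import urllib.request
from urllib.parse import urlparse

url = 'http://image.search.yahoo.co.jp/search?p=エヴァンゲリオン'
# Split the URL into its components.
p = urlparse(url)
# Percent-encode only the query, keeping '=' and '&' as delimiters.
query = urllib.parse.quote_plus(p.query, safe='=&')
# Reassemble the URL, adding each delimiter only when its component is present.
url = '{}://{}{}{}{}{}{}{}{}'.format(
    p.scheme, p.netloc, p.path,
    ';' if p.params else '', p.params,
    '?' if p.query else '', query,
    '#' if p.fragment else '', p.fragment)
response = urllib.request.urlopen(url)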
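The reassembly step can also be left to the standard library. The following is a sketch of the same idea using urlunparse and the named tuple's _replace method. The function name encode_query and the example.com URL are assumptions for illustration, and no request is actually sent.

```python
from urllib.parse import quote_plus, urlparse, urlunparse


def encode_query(url):
    """Percent-encode only the query component of url (illustrative sketch)."""
    parts = urlparse(url)
    # Keep '=' and '&' so the key=value&key=value structure survives.
    query = quote_plus(parts.query, safe="=&")
    # _replace returns a copy with the query swapped; urlunparse rebuilds the URL.
    return urlunparse(parts._replace(query=query))


encoded = encode_query("http://example.com/search?p=あ#top")
print(encoded)
```

Like the code above, this touches only the query; non-ASCII characters in the path or the host name are left unchanged and would need separate handling.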

When you access a URL containing Japanese (a Japanese URL) with Python 3, the URL is encoded as ASCII behind your back and an error occurs, so this is a note on the workaround.
