Character encoding †darkness† encounter report, part 1

Status

Environment: Windows 7 64-bit, Python 2.7.5. I turned the crawler I have long used to collect Foursquare Venues into a tool. Only I use it, so the I/O is fairly rough. The input is the latitude and longitude of the upper-right (northeast) and lower-left (southwest) corners of a bbox. The output is ID, venue name, latitude / longitude, and genre, separated by commas, on the assumption that it will be written to a CSV file. Since the tool is meant to be run from the command line, the results go to standard output so that they can be written to a file by redirection. While waiting for input, a prompt saying what to enter (for example "北東端の緯度:", that is, "Latitude of the northeastern corner:") is written to standard error. The source code is saved as UTF-8.
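
As a rough illustration of this I/O layout, here is a minimal sketch, not the actual crawler. The names prompt_float and write_venues and the sample values in the note are hypothetical, and the sketch is written for Python 3 for brevity, whereas the tool in this article ran on Python 2.7. The point is only that prompts go to standard error, so redirecting standard output to a file captures nothing but the CSV rows.

```python
import csv
import sys


def prompt_float(message, infile=sys.stdin, errfile=sys.stderr):
    """Show a prompt on stderr, then read one number from stdin."""
    errfile.write(message)
    errfile.flush()
    return float(infile.readline())


def write_venues(rows, outfile=sys.stdout):
    """Write (id, name, lat, lng, genre) tuples as comma-separated lines."""
    writer = csv.writer(outfile, lineterminator="\n")
    for row in rows:
        writer.writerow(row)
```

Calling prompt_float("NE latitude: ") and then write_venues([("abc123", "Example Cafe", 38.9, -77.03, "Cafe")]) under "> venues.csv" would leave the prompt visible on the console while only the CSV line lands in the file.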

Symptoms of † darkness †

Execute as follows on the command line.

command1


$ python foursquare_crawler.py > washington_venues.csv
蛹玲擲遶ｯ縺ｮ邱ｯ蠎ｦ:

The prompt comes out garbled ... The source code for this part is as follows.

source1


# -*- coding: utf-8 -*-
import sys
sys.stderr.write("北東端の緯度:")  # byte string; "Latitude of the northeastern corner:"
first_ne_lat = float(sys.stdin.readline())

I simply assumed that the command prompt's character encoding was to blame, so I temporarily changed its code page from cp932 (Shift-JIS) to UTF-8. On Windows, UTF-8 is apparently code page 65001 ...
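
For reference, this kind of garbling can be reproduced without a console. The sketch below assumes that the prompt was stored as the UTF-8 bytes of "北東端の緯度:" and that the console interpreted those bytes as cp932. The variable names are only for illustration, and it should run the same way on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
prompt = u"北東端の緯度:"  # "Latitude of the northeastern corner:"

# The bytes a UTF-8 source file holds for this literal
utf8_bytes = prompt.encode("utf-8")

# What a cp932 console makes of those bytes
# ("replace" keeps the sketch from failing on any unmappable byte)
garbled = utf8_bytes.decode("cp932", "replace")

# What a reader that expects UTF-8 makes of the same bytes
restored = utf8_bytes.decode("utf-8")

print(garbled != prompt)   # True: the text no longer matches
print(restored == prompt)  # True: the bytes themselves were never damaged
```

The bytes are intact in both cases. Only the encoding used to interpret them differs.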

command2


$ chcp 65001
Active code page: 65001

Run again.

command3


$ python foursquare_crawler.py > washington_venues.csv
北東端の緯度:

… (´・ω・`) Next, I explicitly set the error output to UTF-8 in the program.

source2


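import codecs
import sys
# Wrap stderr so that text written to it is encoded as UTF-8
sys.stderr = codecs.getwriter('utf-8')(sys.stderr)
# Wrap stdin so that input is decoded as UTF-8 (reading needs getreader, not getwriter)
sys.stdin = codecs.getreader('utf-8')(sys.stdin)
sys.stderr.write("北東端の緯度:")
first_ne_lat = float(sys.stdin.readline())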

I also set standard input to be read as UTF-8. But nothing changed ...
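
For reference, here is a small sketch of what the codecs wrappers do. It uses in-memory io.BytesIO objects as stand-ins for the real streams, which is an assumption made purely for illustration. codecs.getwriter returns a wrapper that encodes text into bytes on write, and codecs.getreader returns one that decodes bytes into text on read, so both are meant to be used with Unicode text on the Python side.

```python
# -*- coding: utf-8 -*-
import codecs
import io

# Stand-in for a byte stream such as stderr
raw_out = io.BytesIO()
writer = codecs.getwriter("utf-8")(raw_out)  # text -> UTF-8 bytes on write
writer.write(u"北東端の緯度:")

# Stand-in for stdin carrying one line of input
raw_in = io.BytesIO(b"38.99\n")
reader = codecs.getreader("utf-8")(raw_in)   # UTF-8 bytes -> text on read
first_ne_lat = float(reader.readline())

print(raw_out.getvalue() == u"北東端の緯度:".encode("utf-8"))  # True
```

Note that the text handed to the writer here is a Unicode string. Which encoding the console then uses to display the resulting bytes is a separate question.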

Solution

I could not tell what was going on, so I tried turning the string into a Unicode literal.

source3


# -*- coding: utf-8 -*-
import sys
sys.stderr.write(u"北東端の緯度:")  # Unicode literal instead of a byte string
first_ne_lat = float(sys.stdin.readline())

Then I restored the command prompt's code page and ran it again.

command4


$ chcp 932
Active code page: 932
$ python foursquare_crawler.py > washington_venues.csv
北東端の緯度:

It works (゜∀゜)!!! I haven't investigated it in detail, but in Python 2 an ordinary string literal is a byte string, so the prompt was apparently being written out as the raw UTF-8 bytes of the source file, and the cp932 command prompt misread them. Multibyte (full-width) characters such as Japanese are affected, while ASCII (half-width) characters such as the alphabet are not, because their bytes are the same in both encodings. A Unicode string, on the other hand, is handled one character at a time, and it seems that Python encodes it to suit the console when it is written. So Unicode strings seem to be the better choice when dealing with Japanese in Python.
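
To make the difference between bytes and characters concrete, here is a minimal sketch. encode_for_console is a hypothetical helper that imitates, under my assumption, the encoding step that happens when a Unicode string is written to a console; it is not something from the crawler.

```python
# -*- coding: utf-8 -*-
text = u"北東端の緯度:"

as_utf8 = text.encode("utf-8")
as_cp932 = text.encode("cp932")

print(len(text))      # 7 characters
print(len(as_utf8))   # 19 bytes (3 per Japanese character, 1 for the colon)
print(len(as_cp932))  # 13 bytes (2 per Japanese character, 1 for the colon)


def encode_for_console(message, console_encoding):
    """Encode Unicode text into the bytes a console with this encoding expects."""
    return message.encode(console_encoding, "replace")


# A cp932 console given cp932 bytes shows the original text
shown = encode_for_console(text, "cp932").decode("cp932")
print(shown == text)  # True
```

The same seven characters become different byte sequences depending on the encoding, which is why keeping strings as Unicode inside the program and encoding only at output time avoids the mismatch.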

Now let's make all the strings Unicode ...
