tl;dr
-- GCS lets you specify the Content-Type of an object
-- Chrome tries to display text/plain as Shift-JIS
-- text/plain; charset=utf-8 is the kind choice
Create an object in Google Cloud Storage with a script like the following. It is a modified version of the googleapis/python-storage Example Usage code, changed to store a string containing Japanese. Bucket creation and authentication are not relevant here, so I will skip them.
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('bucket-id-here')
# bucket.blob() builds a blob handle locally; get_blob() returns None if the object does not exist yet
blob = bucket.blob('remote/path/to/file.json')
blob.upload_from_string('{"name": "日本語"}')
After running the script, I want to check in the browser that the object was created.
The object seems to have been created successfully. Let's take a look at the contents.
Google Cloud Storage can generate temporary links, and you can download an object by following such a link in your browser.
The contents of the object came out garbled like this. This is the problem I want to look into.
file.json
{"name": "譌 ・ 譛 ャ 隱."}
I often forget to encode and decode strings. `upload_from_string` was passed a str, so let's check whether it should have been encoded to UTF-8 or similar first.
Looking at the code, what `upload_from_string` does is simple.
Excerpt
def upload_from_string(...):  # parameters omitted
    data = _to_bytes(data, encoding="utf-8")
    string_buffer = BytesIO(data)
    self.upload_from_file(...)  # arguments omitted
From the above, the string is encoded as UTF-8 inside the library, so the encoding on the upload side seems to be fine.
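To see that the bytes themselves are fine and only the interpretation can go wrong, here is a minimal sketch. It assumes the uploaded string contained `日本語` (an assumption made for illustration) and that the reader of the bytes picks Shift_JIS instead of UTF-8. The function name `decode_both` is hypothetical.

```python
def decode_both(text):
    """Encode text as UTF-8, then decode it both as UTF-8 and as Shift_JIS."""
    # upload_from_string encodes str input as UTF-8 before uploading.
    data = text.encode("utf-8")

    # Decoding with the matching charset restores the original string.
    as_utf8 = data.decode("utf-8")

    # Decoding the same bytes as Shift_JIS produces garbled text.
    # errors="replace" avoids an exception on byte sequences that are
    # not valid Shift_JIS.
    as_sjis = data.decode("shift_jis", errors="replace")
    return as_utf8, as_sjis


if __name__ == "__main__":
    ok, garbled = decode_both('{"name": "日本語"}')
    print(ok)
    print(garbled)
```

The stored bytes are identical in both cases. Only the charset chosen by whoever reads them differs.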
By the way, when I was looking at the object information in the browser, I found a part that I was interested in.
type="text/plain"
GCS attaches metadata to each object. The Content-Type metadata apparently determines the Content-Type response header returned when the object is requested. https://cloud.google.com/storage/docs/metadata#content-type
By default it should be `application/octet-stream` or `application/x-www-form-urlencoded`, but here it is `text/plain`. Is this the cause?
I hypothesized that the cause of the garbled characters was `Content-Type: text/plain`, so I will set up a local server and check how it is displayed, in order to isolate the problem from GCS.
Set up a server with bottle that simply returns a string with `Content-Type: text/plain`.
server.py
from bottle import Bottle, HTTPResponse
import os
app = Bottle()
@app.route('/')
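def serve():
    # Non-ASCII body, so that a wrong charset guess by the browser is visible
    r = HTTPResponse(status=200, body='ほげ')
    # No charset parameter: the browser has to guess the encoding
    r.set_header('Content-Type', 'text/plain')
    return r

if __name__ == '__main__':
    # Use the PORT environment variable if set, otherwise 3000
    port = int(os.environ.get('PORT', '3000'))
    app.run(host='0.0.0.0', port=port)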
Open in browser
I reproduced it.
Next, open it in the browser with `application/json` as the Content-Type.
With `application/json`, it is displayed correctly.
From the above, it seems safe to conclude that `Content-Type: text/plain` causes the garbled characters regardless of GCS.
The question remains why the Content-Type, which should have defaulted to `application/octet-stream` or `application/x-www-form-urlencoded` in GCS, was `text/plain` this time.
The culprit is Blob.upload_from_string (lines L1650-L1660 of the source at the time of writing) in the google.cloud.storage module that was used for the upload.
Excerpt
def upload_from_string(
    self,
    data,
    content_type="text/plain",
    client=None,
    predefined_acl=None,
    if_generation_match=None,
    if_generation_not_match=None,
    if_metageneration_match=None,
    if_metageneration_not_match=None,
):
Since the default value of the content_type argument is `text/plain`, the object is stored with `Content-Type: text/plain` unless you specify otherwise.
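For objects that were already uploaded with this default, one possible fix is to update the Content-Type metadata in place. The sketch below assumes `blob` is a `google.cloud.storage` Blob whose metadata has been loaded (for example, one returned by `bucket.get_blob(...)`). The helper name `fix_content_type` is hypothetical. It relies on the `content_type` property and the `patch()` method of Blob.

```python
def fix_content_type(blob, content_type="text/plain; charset=utf-8"):
    """Set the Content-Type metadata of an existing object if it differs.

    Returns True when the metadata was changed, False otherwise.
    """
    if blob.content_type == content_type:
        # Nothing to do: the object already has the desired Content-Type.
        return False

    # Assign the new value locally, then send the change to GCS.
    blob.content_type = content_type
    blob.patch()
    return True
```

Only the metadata is updated. The object contents are not uploaded again.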
In the past, it seems the user could change the character encoding manually in Chrome, but now it seems to rely only on the browser's automatic detection.
> document.characterSet
"Shift_JIS"
So Chrome is trying to display the page as Shift_JIS.
Up to this point, we have found that the following two points cause the garbled characters.
-- The response header when GCS returns the object is `Content-Type: text/plain`
-- Chrome tries to display the `Content-Type: text/plain` response as Shift_JIS, which garbles it
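Both points come down to whether the header declares a charset. As a minimal sketch, the standard library can show what a given Content-Type value actually declares. The helper name `declared_charset` is hypothetical, and how a browser guesses when nothing is declared is outside what this code models.

```python
from email.message import Message


def declared_charset(content_type):
    """Return the charset declared in a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset parameter,
    # or None when the header carries no charset parameter.
    return msg.get_content_charset()


if __name__ == "__main__":
    for value in ("text/plain", "text/plain; charset=utf-8", "application/json"):
        print(value, "->", declared_charset(value))
```

With plain `text/plain` nothing is declared, so the browser has to infer the encoding on its own.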
The characters are garbled when viewed with a browser, but there is no problem when processing with a program as follows.
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('bucket-id-here')
blob = bucket.get_blob('remote/path/to/file.json')
# download_as_string() returns bytes, so decode them explicitly as UTF-8
print(blob.download_as_string().decode('utf-8'))
In my case, the saved object is read by a program anyway, so it was only a problem when I wanted to take a quick look at the contents. It can be a real problem if you use GCS for static file hosting.
When using upload_from_string, it is a good idea to specify the Content-Type.
For JSON you can set it to `application/json`, and for text you can specify the charset, as in `text/plain; charset=utf-8`, so that Chrome reads it as UTF-8.
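A minimal sketch of this recommendation follows. It assumes `blob` is a `google.cloud.storage` Blob (for example, from `bucket.blob('remote/path/to/file.json')`). The helper names `upload_text` and `upload_json` are hypothetical. The only library call used is `upload_from_string` with its `content_type` argument.

```python
import json


def upload_text(blob, text):
    """Upload a str as plain text with an explicit UTF-8 charset."""
    blob.upload_from_string(text, content_type="text/plain; charset=utf-8")


def upload_json(blob, payload):
    """Serialize payload to JSON and upload it as application/json."""
    # ensure_ascii=False keeps Japanese characters as they are instead of
    # turning them into \uXXXX escapes.
    data = json.dumps(payload, ensure_ascii=False)
    blob.upload_from_string(data, content_type="application/json")
    return data
```

For example, `upload_json(blob, {"name": "日本語"})` stores the object with `Content-Type: application/json` rather than the `text/plain` default.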
I had let my guard down because garbled text is rare these days. There is no big lesson this time, but be careful when using Blob.upload_from_string.