If you want to upload a small amount of local data to BigQuery without going through GCS, you would normally use the bq command from the Google Cloud SDK.
However, when I wanted to upload automatically from a daily batch running on an EC2 Spot Instance, automating the Cloud SDK setup and the authentication settings turned out to be surprisingly troublesome.
With Python's gcloud package, at least the setup automation and authentication settings were easy, so I'm leaving a note here.
Google Cloud Python Client https://github.com/GoogleCloudPlatform/gcloud-python
You need to set up a GCP service account and obtain its JSON key for authentication in advance.
For details, just search for "GCP service account" and the like.
The package installs in one shot with pip:
pip install gcloud
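Once installed, a quick way to check that the package and the service account key work together is something like the snippet below. This is only a minimal sketch, reusing the key path, project, and dataset names from the script in the next section, and it assumes Dataset.exists() is available (it mirrors the table.exists() call used later).

from gcloud import bigquery

# Sanity check: can this key see the target dataset?
client = bigquery.Client.from_service_account_json(
    '/my-folder/my-project-xxxxxxxxxxxx.json', project='my-project-1234')
dataset = client.dataset('my-dataset')
print dataset.exists()  # True if authentication and the dataset name are OK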
Code
Save the following as upload.py and run it with python -u upload.py. The input file is tab-delimited with the first row as a header (see the sample after the script), and if a table with the same name already exists it will be deleted, so change these settings as appropriate.
upload.py
from gcloud import bigquery
from gcloud.bigquery import SchemaField

json_key = '/my-folder/my-project-xxxxxxxxxxxx.json'
project_name = 'my-project-1234'
dataset_name = 'my-dataset'


def run():
    # Schema of the destination table (column name, type, NULL-ability)
    schema = [
        SchemaField('Col1', 'INTEGER', mode='required'),
        SchemaField('Col2', 'BOOLEAN', mode='required'),
        SchemaField('Col3', 'STRING', mode='nullable'),
        SchemaField('Col4', 'FLOAT', mode='nullable'),
    ]
    input_file_path = '/my-folder/input.txt'
    header_rows_to_skip = 1
    table_name = 'MyTable'
    csv_delimiter = '\t'
    delete_existing_table = True
    upload_one_csv_to_bq(input_file_path, csv_delimiter, header_rows_to_skip,
                         table_name, schema, delete_existing_table)


def upload_one_csv_to_bq(local_csv_file_path, delimiter, header_rows_to_skip,
                         table_name, table_schema, delete_existing_table):
    # Authenticate with the service account JSON key
    client = bigquery.Client.from_service_account_json(json_key, project=project_name)
    dataset = client.dataset(dataset_name)
    table = dataset.table(name=table_name)

    # Drop the existing table first if requested
    if delete_existing_table:
        if table.exists():
            print 'Delete existing table: ' + table_name
            table.delete()

    # Create the table with the given schema, then load the local file into it
    table.schema = table_schema
    table.create()
    with open(local_csv_file_path, 'rb') as file_obj:
        table.upload_from_file(file_obj, 'CSV', field_delimiter=delimiter,
                               skip_leading_rows=header_rows_to_skip,
                               ignore_unknown_values=False)
    print 'Done'


###########
# Entry Point
###########
if __name__ == "__main__":
    run()
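For reference, this is roughly what input.txt is expected to look like. The snippet below writes a tiny tab-delimited sample matching the schema above; the values are made up for illustration only.

# Writes an illustrative input.txt matching the upload.py schema.
rows = [
    ('Col1', 'Col2', 'Col3', 'Col4'),  # header row (skipped on load)
    ('1', 'true', 'foo', '1.5'),
    ('2', 'false', 'bar', ''),         # Col4 left empty (nullable)
]
with open('/my-folder/input.txt', 'w') as f:
    for row in rows:
        f.write('\t'.join(row) + '\n')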
This worked with Python 2.7.10 on Amazon Linux 2016.03 and Python 2.7.11 on my Windows machine. Data that includes Japanese is also fine as long as it is UTF-8.
At first I followed the "Inserting data (synchronous)" section of the page below, which came up in a quick search. It worked on Windows, but when I brought it to EC2, the table.insert_data(rows) call (the API request) failed with unpleasant errors like "broken pipe".
https://googlecloudplatform.github.io/gcloud-python/stable/bigquery-usage.html
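For reference, the streaming-insert approach from that page looks roughly like this. It is only a sketch reusing the table object from upload.py, with made-up rows in schema order; this is the variant that broke on my EC2 instance.

# Streaming insert (the approach I ended up abandoning).
rows = [
    (1, True, 'foo', 1.5),
    (2, False, 'bar', None),
]
table.reload()                    # API request: fetch the table schema
errors = table.insert_data(rows)  # API request: this is where "broken pipe" hit on EC2
if errors:
    print errors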
So, after various detours, I ended up using table.upload_from_file instead. If you run into trouble, I recommend giving up on the docs and reading the source:
https://github.com/GoogleCloudPlatform/gcloud-python/tree/master/gcloud/bigquery