Use gcloud-python to populate BigQuery without going through GCS

If you want to upload a small amount of local data to BigQuery without going through GCS, you would normally use the bq command from the Google Cloud SDK.

However, when I wanted to upload automatically from a daily batch job running on an EC2 Spot Instance, automating the Cloud SDK setup and the authentication settings turned out to be unexpectedly troublesome.

With Python's gcloud package, at least automating the setup and the authentication settings was easy, so I'm leaving a note here.

Google Cloud Python Client https://github.com/GoogleCloudPlatform/gcloud-python

Preparation

Service account settings

You need to set up a GCP service account and obtain a JSON key for authentication in advance.

For details, just search for "GCP service account" etc.

gcloud package installation

It installs in one shot with pip.

pip install gcloud
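
Once it is installed, you can do a quick connectivity check with the service account JSON key. This is a minimal sketch; the key path, project, and dataset names are the same placeholders used in the script below, so replace them with your own.

from gcloud import bigquery

# Minimal check that the package, the JSON key, and the dataset all line up.
# The path and names are placeholders -- replace them with your own.
client = bigquery.Client.from_service_account_json(
    '/my-folder/my-project-xxxxxxxxxxxx.json', project='my-project-1234')
dataset = client.dataset('my-dataset')
print(dataset.exists())  # True if authentication works and the dataset exists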

Code

Save the following as upload.py and run it with python -u upload.py. The input is tab-delimited, the first row is a header, and a table with the same name is deleted if it already exists, so change these settings as appropriate.

upload.py


from gcloud import bigquery
from gcloud.bigquery import SchemaField

json_key = '/my-folder/my-project-xxxxxxxxxxxx.json'
project_name = 'my-project-1234'
dataset_name = 'my-dataset'


def run():
    schema = [
        SchemaField('Col1', 'INTEGER', mode='required'),
        SchemaField('Col2', 'BOOLEAN', mode='required'),
        SchemaField('Col3', 'STRING', mode='nullable'),
        SchemaField('Col4', 'FLOAT', mode='nullable'),
    ]
    input_file_path = '/my-folder/input.txt'
    header_rows_to_skip = 1
    table_name = 'MyTable'
    csv_delimiter = '\t'
    delete_existing_table = True
    upload_one_csv_to_bq(input_file_path, csv_delimiter, header_rows_to_skip, table_name, schema, delete_existing_table)


def upload_one_csv_to_bq(local_csv_file_path, delimiter, header_rows_to_skip, 
                         table_name, table_schema, delete_existing_table):
    client = bigquery.Client.from_service_account_json(json_key, project=project_name)
    dataset = client.dataset(dataset_name)
    table = dataset.table(name=table_name)
    if delete_existing_table:
        if table.exists():
            # Drop the existing table so it can be recreated with the new schema
            print 'Delete existing table: ' + table_name
            table.delete()
    table.schema = table_schema
    table.create()
    # Upload the local file as CSV (tab-delimited here), skipping the header rows
    with open(local_csv_file_path, 'rb') as file_obj:
        table.upload_from_file(file_obj, 'CSV', field_delimiter=delimiter, skip_leading_rows=header_rows_to_skip, ignore_unknown_values=False)
    print 'Done'


###########
# Entry Point
###########
if __name__ == "__main__":
    run()
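
For reference, an input.txt that matches the schema above would look something like this (tab-delimited with one header row; the values are made up):

Col1	Col2	Col3	Col4
1	true	foo	1.5
2	false	bar	2.0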

It worked with Python 2.7.10 on Amazon Linux 2016.03 and Python 2.7.11 on my Windows machine. UTF-8 works fine even if the data includes Japanese.

Digression

At first I followed the "Inserting data (synchronous)" section of the page below, which turned up in a quick search. It worked on Windows, but when I brought it to EC2, the table.insert_data(rows) # API request line produced unpleasant errors such as "broken pipe".

https://googlecloudplatform.github.io/gcloud-python/stable/bigquery-usage.html
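
For reference, the approach from that page looks roughly like the sketch below (not the code I ended up using; the key path, names, and row values are placeholders matching the schema above).

from gcloud import bigquery

# Sketch of the docs' "Inserting data (synchronous)" pattern that I abandoned.
# Key path, project, dataset, table, and row values are placeholders.
client = bigquery.Client.from_service_account_json(
    '/my-folder/my-project-xxxxxxxxxxxx.json', project='my-project-1234')
table = client.dataset('my-dataset').table(name='MyTable')
table.reload()  # API request: fetch the existing schema
rows = [
    (1, True, 'foo', 1.5),
    (2, False, None, 2.0),
]
errors = table.insert_data(rows)  # API request -- the "broken pipe" errors appeared here on EC2
if errors:
    print(errors)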

So, after various detours, I ended up using table.upload_from_file. If you run into problems, I recommend giving up and reading the code:

https://github.com/GoogleCloudPlatform/gcloud-python/tree/master/gcloud/bigquery
