I wanted to provide a tool with which non-engineer planning and marketing staff can upload data to BigQuery by themselves. It was troublesome for me to upload their Excel data to BigQuery myself, so I made the process operable from a GUI: drop a file into a Cloud Storage bucket and it is loaded into BigQuery.
The dataset and table names are encoded in the object path:

.xlsx: gs://[bucket name (optional)]/[dataset name]/[table name].xlsx
.csv: gs://[bucket name (optional)]/[dataset name]/[table name].csv
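Under this convention the object path alone tells the function where to load the data. A minimal sketch of that mapping, using the same regex as main.py below (the file names are hypothetical):

```python
import re

# Same pattern as in main.py: dataset / table . extension
PATH_RE = re.compile(
    r'(?P<getDatasetId>.*)/(?P<getFileId>.*)\.(?P<getFileType>.*)')

def parse_object_path(path):
    """Split 'dataset/table.ext' into (dataset, table, ext)."""
    m = PATH_RE.match(path)
    if m is None:
        raise ValueError('unexpected object path: %s' % path)
    return (m.group('getDatasetId'),
            m.group('getFileId'),
            m.group('getFileType'))

# A hypothetical upload 'sales/monthly_report.xlsx' maps to
# dataset 'sales', table 'monthly_report', handled as xlsx.
print(parse_object_path('sales/monthly_report.xlsx'))
```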
Create a bucket for data upload
Make the following settings (details omitted)
Runtime: Python 3.7
Trigger type: Cloud Storage
Bucket: the GCS bucket created above
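The same settings can also be applied from the command line. A hedged sketch with `gcloud` (the bucket name is a placeholder, and flag values may need adjusting for your project and region):

```shell
# Deploy the function with a GCS object-finalize trigger (placeholder bucket name)
gcloud functions deploy gsc_to_bigquery_createtable \
  --runtime python37 \
  --trigger-resource YOUR_UPLOAD_BUCKET \
  --trigger-event google.storage.object.finalize \
  --source .
```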
GCF
requirements.txt
pandas
pandas-gbq
google-cloud-storage
google-cloud-bigquery
xlrd
main.py
from google.cloud import storage
from google.cloud import bigquery
import pandas as pd
import re


def gsc_to_bigquery_createtable(data, context):
    # log
    print(data)
    print(context)
    print('Folder Name : {}'.format(data['name']))

    GETPATH = data['name']
    m = re.match(
        r'(?P<getDatasetId>.*)/(?P<getFileId>.*)\.(?P<getFileType>.*)',
        GETPATH)

    # Specify the bucket name and project name
    BUCKET = 'Bucket name'
    PROJECT_ID = 'Project name'
    # Get the dataset name
    DATASET_ID = m.group('getDatasetId')
    # Get the file name
    FILE_ID = m.group('getFileId')
    # Get the extension
    FILE_TYPE = m.group('getFileType')
    TMP_PATH = '/tmp/' + FILE_ID + '.' + FILE_TYPE

    # Download the file from GCS to the function's local /tmp
    gcs = storage.Client(project=PROJECT_ID)
    bucket = gcs.get_bucket(BUCKET)
    blob = bucket.get_blob(GETPATH)
    blob.download_to_filename(TMP_PATH)

    # Branch on the extension
    if FILE_TYPE == 'xlsx':
        df = pd.read_excel(TMP_PATH)
    elif FILE_TYPE == 'csv':
        df = pd.read_csv(TMP_PATH)

    # Create the table in BigQuery from the DataFrame
    full_table_id = DATASET_ID + '.' + FILE_ID
    df.to_gbq(full_table_id, project_id=PROJECT_ID, if_exists='replace')

    # log
    print('Folder Name : {}'.format(data['name']))
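Before deploying, the read branch can be exercised locally by faking the /tmp download step with a throwaway CSV. A small sketch (the helper name and file contents are made up, and the BigQuery load is skipped):

```python
import os
import tempfile

import pandas as pd

def load_dataframe(tmp_path, file_type):
    # Mirrors the extension branching in main.py
    if file_type == 'xlsx':
        return pd.read_excel(tmp_path)
    elif file_type == 'csv':
        return pd.read_csv(tmp_path)
    raise ValueError('unsupported file type: %s' % file_type)

# Fake the downloaded file with a temporary CSV
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as f:
    f.write('id,name\n1,alpha\n2,beta\n')
    path = f.name

df = load_dataframe(path, 'csv')
print(df.shape)
os.remove(path)
```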