There are several ways to work with BigQuery data from Python, but the recently introduced **BigQuery Storage API** stands out for its loading speed. Official page: https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas?hl=ja
It's still a new API, and when I tried to do something slightly unusual I couldn't find sample code for it, so I'm leaving mine here. I'll leave verification of execution speed and the like to other articles.
The goal is to read BigQuery data directly from Python and store it in pandas. Since I wanted to process a large amount of data, I set conditions and extracted only part of it.
First, install the required libraries with pip. I recommend **preparing a separate virtual environment** with `conda create` or similar. (I installed them into the environment I normally use, and untangling the broken version dependencies between libraries was a real pain.)
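For reference, a minimal install command using the PyPI package names (`pyarrow` is needed for the Arrow format used below). Note that the `v1beta1` client this article uses was removed in later releases of `google-cloud-bigquery-storage`, so you may need to pin an older version such as `<2.0`:

```
pip install google-cloud-bigquery "google-cloud-bigquery-storage<2.0" pandas pyarrow
```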
Next, set up the authentication credentials and create the clients.
```python
import google.auth
from google.cloud import bigquery
from google.cloud import bigquery_storage_v1beta1
credentials, your_project_id = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)

# Standard BigQuery client
bqclient = bigquery.Client(
    credentials=credentials,
    project=your_project_id,
)

# BigQuery Storage client, used for the fast bulk read below
bqstorageclient = bigquery_storage_v1beta1.BigQueryStorageClient(
    credentials=credentials
)
```
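If `google.auth.default()` cannot find credentials, the standard `google-auth` mechanism is to point the `GOOGLE_APPLICATION_CREDENTIALS` environment variable at a service-account key file before creating the clients (the path below is just a placeholder):

```python
import os

# Hypothetical key-file path; replace with your own
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"
```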
Finally, read the data. Rewrite the project ID and other placeholders as appropriate.
```python
print('Loading data...')
table = bigquery_storage_v1beta1.types.TableReference()
table.project_id = "PROJECT_ID"  # Project that contains the table you want to read
table.dataset_id = "DATASET_ID"  # Dataset ID you want to read
table.table_id = "TABLE_ID"      # Table ID you want to read
read_options = bigquery_storage_v1beta1.types.TableReadOptions()
# Column names you want to read
cols = ['col1', 'col2', 'col3']
for col in cols:
    read_options.selected_fields.append(col)

# For example, to narrow the fetched rows to those where userid is "hogehoge":
read_options.row_restriction = 'userid = "hogehoge"'
parent = "projects/{}".format(your_project_id)
session = bqstorageclient.create_read_session(
    table,
    parent,
    read_options=read_options,
    # Receive the data in Apache Arrow format
    format_=bigquery_storage_v1beta1.enums.DataFormat.ARROW,
    # LIQUID sharding lets a single reader drain all rows from one stream
    sharding_strategy=(
        bigquery_storage_v1beta1.enums.ShardingStrategy.LIQUID
    ),
)
# With LIQUID sharding, reading the first stream is enough to get every row
stream = session.streams[0]
position = bigquery_storage_v1beta1.types.StreamPosition(stream=stream)
reader = bqstorageclient.read_rows(position)
dataframe = reader.to_dataframe(session)
# dataframe now holds the fetched data; process it however you like
```
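From here it's ordinary pandas, so the rest is up to you. A few quick sanity checks, using the placeholder column names from above:

```python
# Confirm the size, first rows, and inferred dtypes of the result
print(dataframe.shape)
print(dataframe.head())
print(dataframe.dtypes)
```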