In Basic machine learning procedure: ② Prepare data, I imported a table created in BigQuery into the Python environment as a Pandas DataFrame.
However, as the table grows, that import takes a long time; I suspect many people have run into the same problem. That is where a new service called the BigQuery Storage API comes in.
It is reportedly 7 to 8 times faster, so I decided to try it out and see.
Download BigQuery data to pandas using BigQuery Storage API
The table used is about 100 MB and is called myproject.mydataset.mytable.
I load it into a Pandas DataFrame with a simple query that fetches everything, as shown below.
query="SELECT * FROM `myproject.mydataset.mytable`
1. BigQuery API First, let's try the standard BigQuery API that we have been using all along.
import time
from google.cloud import bigquery

start = time.time()
# Standard BigQuery API: run the query and pull the result into a DataFrame
client = bigquery.Client(project="myproject")
df = client.query(query).to_dataframe()
elapsed_time = time.time() - start
print(elapsed_time)
It took about 120 seconds to process. Well, this is acceptable.
2. Pandas read_gbq You can also do this with a Pandas function, without using the BigQuery API directly, so I will try that too.
import time
import pandas as pd

start = time.time()
# pandas-gbq runs the query and returns the result as a DataFrame
df = pd.io.gbq.read_gbq(query, project_id="myproject", dialect="standard")
elapsed_time = time.time() - start
print(elapsed_time)
Processing completed in about 135 seconds, slightly slower than the BigQuery API. It seems that even the standard BigQuery client does some optimization compared to the Pandas function.
3. BigQuery Storage API And now for the BigQuery Storage API, the subject of this post. When I tried to import the library in Colab, I was told it was not installed, so first install it.
pip install --upgrade google-cloud-bigquery-storage
After installing, a message appears asking me to restart the runtime. The same thing happens with some other libraries, and it is a bit annoying.
WARNING: The following packages were previously imported in this runtime:
[google]
You must restart the runtime in order to use newly installed versions.
Now, restart the runtime, import the library again and run it.
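To make sure the restart actually picked up the newly installed package, a quick check first (a minimal sketch; pkg_resources ships with setuptools, which Colab includes by default):
# Confirm which version of google-cloud-bigquery-storage is now installed
import pkg_resources
print(pkg_resources.get_distribution("google-cloud-bigquery-storage").version)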
import time
from google.cloud import bigquery
from google.cloud import bigquery_storage

start = time.time()
client = bigquery.Client(project="myproject")
# Storage API client (newer versions of the library expose this as BigQueryReadClient)
bqstorageclient = bigquery_storage.BigQueryStorageClient()
# Run the query as before, but download the result via the Storage API
df3 = (
    client.query(query)
    .result()
    .to_dataframe(bqstorage_client=bqstorageclient)
)
elapsed_time = time.time() - start
print(elapsed_time)
The execution time was an amazing 12 seconds or so. That is about 10 times faster than the standard API, not just 7 to 8 times. Thinking it might be a fluke, I ran it several more times, and it consistently finished at about this speed, give or take one or two seconds.
I was surprised that the results were so much faster than expected. If it really is 10 times faster than usual, even several GB of data can be loaded in a short time. (Although the subsequent processing in Python will probably be the heavy part.)
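As a side note, the read_gbq route from section 2 can also be pointed at the Storage API. A minimal sketch, assuming a pandas / pandas-gbq version that supports the use_bqstorage_api flag (I have not benchmarked this variant here):
import pandas as pd

# Same query as before, but asking pandas-gbq to download via the Storage API
df = pd.read_gbq(
    query,
    project_id="myproject",
    dialect="standard",
    use_bqstorage_api=True,
)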
On top of the normal BigQuery query cost, the Storage API is billed at $1.10 per TB read, so it is not something to use casually, but it is a service I would like to reach for when a table is so large that loading it would otherwise take tens of minutes.
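For a rough sense of scale (my own back-of-the-envelope arithmetic, not an official price calculation), reading the 100 MB table used here would cost on the order of a hundredth of a cent at that rate:
# Back-of-the-envelope estimate at the quoted $1.10 per TB read.
# Illustrative only: actual billing rules (minimum read sizes, etc.) may differ.
table_size_tb = 100 / 1_000_000          # ~100 MB expressed in (decimal) TB
print(f"~${table_size_tb * 1.10:.5f} per full read")  # roughly $0.00011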