I tried using the BigQuery Storage API

Introduction

In Basic machine learning procedure: ② Prepare data, I imported a table created in BigQuery into the Python environment as a Pandas DataFrame.

However, as the table grows, this import takes a long time. I suspect many people have run into the same problem. That is where a new service called the BigQuery Storage API comes in.

It is reportedly 7 to 8 times faster, but is that really true? I decided to try it for myself.

Analytical environment

Google BigQuery
Google Colaboratory

Referenced site

Download BigQuery data to pandas using BigQuery Storage API

Target data

The table used is about 100 MB and is called myproject.mydataset.mytable. I import it as a Pandas DataFrame with a simple query that fetches every column, as shown below.

query="SELECT * FROM `myproject.mydataset.mytable`

1. BigQuery standard API

First, let's try using the standard BigQuery API that we have been using.

import time
from google.cloud import bigquery

start = time.time()

# Run the query with the standard BigQuery client and load the result into a DataFrame
client = bigquery.Client(project="myproject")
df = client.query(query).to_dataframe()

elapsed_time = time.time() - start

It took about 120 seconds to process. Well, this is acceptable.

2. Pandas read_gbq

You can also do this with Pandas features alone, without using the BigQuery client library, so I tried that as well.

import time
import pandas as pd

start = time.time()

# Let pandas-gbq run the query and return the result as a DataFrame
df = pd.io.gbq.read_gbq(query, project_id="myproject", dialect="standard")

elapsed_time = time.time() - start

Processing completed in about 135 seconds, a little slower than the BigQuery API. It seems the standard BigQuery client library has some optimizations that the plain Pandas function does not.

3. BigQuery Storage API

And here is the BigQuery Storage API, the theme of this post. When I tried to import the library in Colab, I got an error saying it was not installed, so first install it.

pip install --upgrade google-cloud-bigquery-storage

When I install it, a message appears telling me to restart the runtime. This also happens with other libraries, and it is a bit annoying.

WARNING: The following packages were previously imported in this runtime:
  [google]
You must restart the runtime in order to use newly installed versions.
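As an aside, to_dataframe() needs a fast decoding backend such as pyarrow to benefit from the Storage API streams. If it is not already present, one alternative is to install the BigQuery client with its optional extras, which should pull in both the Storage API client and pyarrow (the extras names assume the google-cloud-bigquery packaging):

pip install --upgrade google-cloud-bigquery[bqstorage,pandas]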

Now, restart the runtime, import the library again and run it.

import time
from google.cloud import bigquery
from google.cloud import bigquery_storage

start = time.time()

# The standard client issues the query; the Storage API client downloads the result rows
client = bigquery.Client(project="myproject")
bqstorageclient = bigquery_storage.BigQueryStorageClient()
df3 = (
    client.query(query)
    .result()  # wait for the query job to finish
    .to_dataframe(bqstorage_client=bqstorageclient)  # fetch rows via the Storage API
)

elapsed_time = time.time() - start

The execution time was an amazing 12 seconds: roughly 10 times faster than the standard API, not just 7 to 8 times. Thinking it might be a coincidence, I ran it several more times, but it consistently finished at about this speed, with a variation of only 1 to 2 seconds.
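Incidentally, the Pandas read_gbq from section 2 can also be pointed at the Storage API. A minimal sketch, assuming a recent pandas-gbq that supports the use_bqstorage_api flag (I have not timed this variant):

import pandas as pd

# Assumes pandas-gbq >= 0.10 and google-cloud-bigquery-storage are installed
df = pd.read_gbq(
    query,
    project_id="myproject",
    dialect="standard",
    use_bqstorage_api=True,  # download the result rows via the BigQuery Storage API
)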

In conclusion

I was surprised that the results were much faster than expected. If it is really 10 times faster than usual, even several GB of data can be loaded in a short time. (Although the subsequent processing in Python would probably be heavy.)

Note that, in addition to the normal BigQuery query charges, the Storage API costs $1.10 per TB read, so it is not something to use casually. Still, it is a service I would like to reach for when a table is so large that loading it would otherwise take tens of minutes.
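For a rough sense of the cost, a back-of-the-envelope sketch for the roughly 100 MB table above (assuming the flat $1.10 per TB read rate with no minimum billed bytes):

# ~100 MB expressed in TB, times the $1.10 per TB read rate
cost = (100 / 1_000_000) * 1.10
print(f"approx. ${cost:.6f} per full read")  # on the order of $0.0001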
