The combination of Python and BigQuery is a good match for data analysis.
Python on its own is not suited to handling really huge data, but if you leave that part to BigQuery and cut the data down into small pieces, you can do whatever you want with Python.
The question is how to integrate Python and BigQuery. There are two main ways to do this:

1. Query BigQuery from Python with pandas.io.gbq
2. Use Cloud Datalab

The second is especially recommended for those who are used to Jupyter Notebook.
There are several libraries for querying BigQuery from Python, for example BigQuery-Python and bigquery_py.
However, the easiest and most recommended one is actually pandas.io.gbq, a module that ships with pandas (see pandas.io.gbq.read_gbq). It works well with DataFrame objects and authentication is very easy, so you can use it without worrying about the difficult parts.
All you need to use pandas.io.gbq is your BigQuery project ID.
Usage is as follows. In this example, the result of `SELECT * FROM tablename` is stored in `data_frame`, which can then be used as a normal DataFrame object.
```python
import pandas as pd
query = 'SELECT * FROM tablename'
# A query string and a project ID are required
data_frame = pd.read_gbq(query, 'project-id')
```
When executed, it prints brief statistics about the query job (specify `verbose=False` as an argument if you want to hide them).
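For example, to suppress that output (a minimal sketch reusing the query and placeholder project ID from above):

```python
# verbose=False hides the job statistics that read_gbq normally prints
data_frame = pd.read_gbq(query, 'project-id', verbose=False)
```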
Execution example
Running pd.read_gbq opens the Google account authentication screen in your browser. If you log in with an account that has access to the project, authentication is linked and completed, and the query starts running.
At this time, a credential file in JSON format is written out to the working folder. As long as you have this file, you can run queries as many times as you like without re-authenticating.
Conversely, if this file gets into someone else's hands, BigQuery can be used without your permission and you may get a shocking bill, so be careful how you manage it.
http://qiita.com/itkr/items/745d54c781badc148bb9
You can also easily write a Python DataFrame object to a table on BigQuery (see the library's official documentation).
```python
from pandas.io import gbq

# Write the DataFrame df to the specified table in the given project
gbq.to_gbq(df, 'dataset.tablename', 'project-id')
```
With this alone, you can send the DataFrame object you were using in Python back to BigQuery.
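As a concrete sketch (the dataset and table names here are hypothetical), writing a small DataFrame back might look like this; the destination table is given in `dataset.tablename` form:

```python
import pandas as pd
from pandas.io import gbq

# A hypothetical DataFrame of user scores
df = pd.DataFrame({'user_id': [1, 2, 3], 'score': [0.1, 0.5, 0.9]})

# Write it to my_dataset.scores in the given project
gbq.to_gbq(df, 'my_dataset.scores', 'project-id')
```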
Once you can do all of that, you can, for example:

1. Fetch user feature data from BigQuery (read_gbq)
2. Join and process the data with Python (pandas)
3. Label users with a machine learning model (scikit-learn)
4. Return the labeled results to BigQuery (gbq.to_gbq)

and pull off flows like this relatively easily. Hmm, wonderful. Pandas really is convenient.
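As a rough sketch of that flow (the table names, column names, and the clustering model here are all hypothetical placeholders):

```python
import pandas as pd
from pandas.io import gbq
from sklearn.cluster import KMeans

PROJECT_ID = 'project-id'  # replace with your own project

# (1) Fetch user feature data from BigQuery
features = pd.read_gbq(
    'SELECT user_id, feature1, feature2 FROM my_dataset.user_features',
    PROJECT_ID)

# (2) Process the data with pandas
X = features[['feature1', 'feature2']].fillna(0)

# (3) Label users with a machine learning model (k-means as a stand-in)
features['label'] = KMeans(n_clusters=3).fit_predict(X)

# (4) Return the labeled result to BigQuery
gbq.to_gbq(features[['user_id', 'label']], 'my_dataset.user_labels', PROJECT_ID)
```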
Datalab is an interactive cloud analysis environment based on Jupyter Notebook (formerly IPython Notebook) that runs on Google Compute Engine. You operate it by writing code in an interface called a "Notebook", which you can view and edit in a browser.
Roughly, it has the following advantages.
- Interactive operation
- Queries, results, and charts can be kept in Notebook format
- Markdown text and code can coexist
- Code execution results (including charts) are displayed inline
- Seamless integration with Python (both Python and SQL can be used in the same console)
- The environment is built automatically on Google Compute Engine, so:
  - it doesn't weigh down your local environment
  - queries and notebooks are easy to share
Text alone probably doesn't make this easy to picture, so the quickest way to get a feel for it may be to watch the video or look at the images.
[Video] https://www.youtube.com/watch?v=RzIjz5HQIx4
[Image of usage example]
The text on the first line is Markdown.
The first cell is written in SQL and the second cell is written in Python.
Graphs and so on are also displayed inline
(and of course tables can be displayed as well, not just graphs).
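As a rough idea of what such a notebook cell might contain, here is a minimal sketch. This is only an assumption-laden illustration: the `gcp.bigquery` module and `Query(...).to_dataframe()` follow early Datalab samples and may differ by version, and `my_dataset.events` is a hypothetical table.

```python
# A single Python cell in a Datalab notebook
import gcp.bigquery as bq

# Run a BigQuery query and pull the result into a pandas DataFrame
df = bq.Query('SELECT date, COUNT(*) AS n '
              'FROM [my_dataset.events] GROUP BY date ORDER BY date').to_dataframe()

# The chart is rendered inline below the cell
df.plot(x='date', y='n')
```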
It is in beta (?), so you can't turn the feature on directly from the GCP console. To use it, select a project from this page and deploy Datalab to it.
Also, it seems that only users with owner or editor permissions can turn the feature on. Once it has been deployed to the project, it becomes available to all members of the project.
Cloud Datalab is deployed as a Google App Engine application module in the selected project. The Google Compute Engine and Google BigQuery APIs must be enabled for the project, and you must be authorized to use the project as an owner or editor.
Start screen image
An instance for Datalab is launched on Google Compute Engine, and the Datalab environment is built on it. The Datalab interface is operated from a browser (of course, you can also connect to this environment over SSH).
Notebooks (SQL and Python code) written in the browser are saved on this instance (which means every member of the project can see them).
The GCP console doesn't have an entry point to turn on the Datalab feature, but if you're using Datalab, you'll see "Datalab" in your instance list.
The GCE instance is charged for (on the order of a few thousand yen, depending on the instance specs?). Of course, you are also charged for the BigQuery queries you run.
From the official Datalab documentation:
You only pay for the resources you use to run Cloud Datalab, as follows:

- Compute Resources: Cloud Datalab uses Google App Engine and Google Compute Engine resources to run within your project. When you have Cloud Datalab instances deployed within your project, you incur compute charges (the charge for one VM per Cloud Datalab instance).
- Google BigQuery: You incur BigQuery charges when issuing SQL queries within Cloud Datalab.
- Other Resources: You incur charges for other API requests you make within the Cloud Datalab environment.
- Voyage Group: http://www.slideshare.net/hagino_3000/cloud-datalabbigquery
- Vasily: http://tech.vasily.jp/entry/cloud-datalab
- Wonderplanet: http://wonderpla.net/blog/engineer/Try_GoogleCloudDatalab/