The nice and regrettable parts of Cloud Datalab

This entry is a continuation of:

- Overview of Cloud Datalab

Cloud Datalab Basics

I wrote something similar in the previous entry, but to summarize again, Cloud Datalab is:

- An interactive analysis environment based on Jupyter
- An environment integrated with GCP
- A containerized package of Jupyter and Python libraries
- Containers that can easily be launched and torn down on GCE via the datalab command

Datalab assumptions

Datalab is designed to work closely with GCP projects.

By default, if nothing is specified, the following happens:

- A repository called datalab-notebooks is [created in the project's Cloud Source Repository](https://cloud.google.com/datalab/docs/how-to/datalab-team#use_the_automatically_created_git_repository_for_sharing_notebooks)
- A ${PROJECT_ID}.appspot.com/datalab_backups bucket is created on GCS and [backups are stored in it](https://cloud.google.com/datalab/docs/how-to/working-with-notebooks#cloud_datalab_backup)
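
If you want to confirm these defaults from a terminal, something like the following should work (standard gcloud/gsutil commands; the repository and bucket names are the defaults described in the docs above):

# List the auto-created notebook repository
$ gcloud source repos list --filter="name:datalab-notebooks"

# List notebook backups in the default backup bucket
$ gsutil ls gs://${PROJECT_ID}.appspot.com/datalab_backups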

Start-up

With those assumptions in mind, I'll try various things. First of all, starting Datalab:

$ datalab create --disk-size-gb 10 --no-create-repository datalab-test

- Specify the disk size with --disk-size-gb.
  - By default a 200GB disk is created, so I specified a smaller 10GB one.
- Skip creating a repository with --no-create-repository.
  - Since I had deleted just the repository, Datalab wouldn't start unless I added --no-create-repository... I wonder why. I'll look into it another time.

Integration with BigQuery

Datalab works very nicely with BigQuery. As a bit of an aside, Jupyter has a feature called magic commands, which start with %%. Magics for BigQuery and GCS are provided as well.

Running a query as a magic command

This follows the sample, but you can see how nice it is to be able to write a query directly in a cell.

%%bq query
SELECT id, title, num_characters
FROM `publicdata.samples.wikipedia`
WHERE wp_namespace = 0
ORDER BY num_characters DESC
LIMIT 10

Running queries through google.datalab.bigquery

Since the cell runs a query against BQ, I wanted to process the result as is. As in [the sample](https://github.com/googledatalab/notebooks/blob/master/tutorials/BigQuery/SQL%20and%20Pandas%20DataFrames.ipynb), you can pass the query result to pandas as a dataframe. Wonderful.

%%bq query -n requests
SELECT timestamp, latency, endpoint
FROM `cloud-datalab-samples.httplogs.logs_20140615`
WHERE endpoint = 'Popular' OR endpoint = 'Recent'

Then, in the next cell:

import google.datalab.bigquery as bq
import pandas as pd

df = requests.execute(output_options=bq.QueryOutput.dataframe()).result()

If you go through the API a little more directly, it looks like this:

import google.datalab.bigquery as bq
import pandas as pd

# The query to issue
query = """SELECT timestamp, latency, endpoint
           FROM `cloud-datalab-samples.httplogs.logs_20140615`
           WHERE endpoint = 'Popular' OR endpoint = 'Recent'"""

# Create a query object
qobj = bq.Query(query)

# Get the query results as a pandas dataframe
df2 = qobj.execute(output_options=bq.QueryOutput.dataframe()).result()

# Continue with normal pandas operations
df2.head()

If you think about it, the magic command is presumably built on top of this API. In fact, if you look at the source, you can see that %%bq is defined as a magic command.
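
To get a feel for how such a magic could be wired up, here is a minimal sketch using IPython's standard registration API. The bq_sketch magic below is hypothetical, not the real %%bq implementation:

from IPython.core.magic import register_cell_magic

import google.datalab.bigquery as bq

# Hypothetical illustration, not the actual %%bq implementation:
# treat the cell body as a BigQuery query and return a dataframe.
@register_cell_magic
def bq_sketch(line, cell):
    query = bq.Query(cell)
    return query.execute(output_options=bq.QueryOutput.dataframe()).result()

A cell starting with %%bq_sketch would then behave roughly like the %%bq query example above.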

Integration with GCS

As with BigQuery, you can manipulate objects on GCS from a cell, as in the sample. The point is that you can read and write files. Being able to use BigQuery results as a data source is helpful, but being able to handle GCS data transparently as a data source is also attractive.
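
For example, reading and writing an object through the google.datalab.storage API looks roughly like this. This is a sketch: the bucket and object names are hypothetical, and I'm assuming the read/write stream methods as described in the pydatalab docs:

import google.datalab.storage as storage

# Hypothetical bucket and object names
bucket = storage.Bucket('my-sample-bucket')
obj = bucket.object('data/sample.csv')

# Write a small CSV to GCS, then read it back
obj.write_stream('id,value\n1,10\n2,20\n', content_type='text/csv')
print(obj.read_stream())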

Integration with CloudML

I was able to confirm that things work via the API, but I'll skip it this time because there's still a lot of behavior I don't understand.

Changing the instance type

This is where the cloud really shines: unlike on-premises hardware, you can upgrade the specs whenever you need to. You can specify the instance type with the --machine-type option of datalab create. By default, an n1-standard-1 instance is started.

# Delete the instance with the delete command.
# The attached disk is left as is.
$ datalab delete datalab-test

# Start with the same machine name but a different instance type.
# The disk is named "<machine name>-pd", so if the machine name
# is the same, the existing disk is automatically re-attached.
$ datalab create --no-create-repository \
                 --machine-type n1-standard-4 \
                 datalab-test

Now you can raise or lower the specs of your machine as needed.
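
If you want to double-check the result, the instance type can be confirmed with plain gcloud (not a datalab subcommand; you may need to pass --zone):

$ gcloud compute instances describe datalab-test --format='value(machineType)'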

GPU analysis environment!

This was going to be the highlight.

With this!!! After specifying a GPU instance!!! You can easily get a GPU machine learning environment!!!

...or so I thought, but the world is not that easy. At present, GPU instances are not supported by Datalab.

Summary

Datalab has its regrettable parts, such as the areas around the Cloud Source Repository and Cloud ML Engine, though I hold a faint hope that GPU instances will somehow be supported. Even so, I think Datalab is an important piece for building a data analysis environment these days. Next time, I'd like to take a closer look at these areas.
