Google Cloud Dataflow doesn't get much attention, but it's quite convenient because you can easily switch the execution environment between local and remote. Moreover, you're not limited to the standard library as you might think: a list of pip packages comes pre-installed, and you can also [install your own](http://qiita.com/orfeon/items/78ff952052c4bde4bcd3) (a sketch of that follows the option settings below). Since a quick search of the documentation didn't turn up which libraries are pre-installed, I decided to find out for myself.
First of all, the option settings. This part is copied pretty much wholesale from @orfeon's article ...
Option settings

```python
import apache_beam as beam
import apache_beam.transforms.window as window

options = beam.utils.pipeline_options.PipelineOptions()

# GCP project and staging settings.
google_cloud_options = options.view_as(beam.utils.pipeline_options.GoogleCloudOptions)
google_cloud_options.project = '{PROJECTID}'
google_cloud_options.job_name = 'test'
google_cloud_options.staging_location = 'gs://{BUCKET_NAME}/binaries'
google_cloud_options.temp_location = 'gs://{BUCKET_NAME}/temp'

worker_options = options.view_as(beam.utils.pipeline_options.WorkerOptions)
worker_options.max_num_workers = 1

# Switching between local and remote execution is just a matter of the runner:
# options.view_as(beam.utils.pipeline_options.StandardOptions).runner = 'DirectRunner'
options.view_as(beam.utils.pipeline_options.StandardOptions).runner = 'DataflowRunner'

p = beam.Pipeline(options=options)
```
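Incidentally, if you do want to bring in packages that aren't pre-installed, the same options object is where you would point Beam at them. This is only a sketch: it assumes this SDK version exposes `SetupOptions` with a `requirements_file` attribute, and `requirements.txt` is a hypothetical file name.

```python
# Sketch, assuming SetupOptions is available in this SDK version:
# packages listed in the (hypothetical) requirements.txt are installed
# on each worker before the job starts.
setup_options = options.view_as(beam.utils.pipeline_options.SetupOptions)
setup_options.requirements_file = 'requirements.txt'
```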
Next, run `pip freeze` and write the Python package list to the log.
Package list output part

```python
def inspect_df(dat):
    # Imports live inside the function so they also run on the Dataflow workers.
    import subprocess
    import logging
    process = subprocess.Popen('pip freeze', shell=True,
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    for line in process.stdout:
        logging.info(line)
```
Run it on Dataflow. The hello world input is just a dummy, so you probably don't really need it ...
Pipeline execution

```python
(p | 'init' >> beam.Create(['hello', 'world'])
   | 'inspect' >> beam.Map(inspect_df))
p.run()
```
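Since the selling point is how easily the execution environment switches, the same check can be run locally first. This sketch just flips the runner to DirectRunner, as in the commented-out line in the option settings, and rebuilds the pipeline; everything else stays unchanged.

```python
# Minimal local sanity check: the same pipeline on DirectRunner.
# Note that locally, pip freeze reports your own environment, not the workers'.
options.view_as(beam.utils.pipeline_options.StandardOptions).runner = 'DirectRunner'
p_local = beam.Pipeline(options=options)
(p_local | 'init' >> beam.Create(['hello', 'world'])
         | 'inspect' >> beam.Map(inspect_df))
p_local.run()
```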
When the pipeline finishes running, the package list is output to the log, so check it in the Cloud Console.
According to the Dataflow documentation, you can check the logs from the Dataflow job details screen, but as of March 4, 2017 they have moved to Stackdriver Logging.
The log output looks like the following. Here is the list of packages dumped in that log. **As of March 4, 2017**
Package | Version |
---|---|
avro | 1.8.1 |
beautifulsoup4 | 4.5.1 |
bs4 | 0.0.1 |
crcmod | 1.7 |
Cython | 0.25.2 |
dataflow-worker | 0.5.5 |
dill | 0.2.5 |
enum34 | 1.1.6 |
funcsigs | 1.0.2 |
futures | 3.0.5 |
google-api-python-client | 1.6.2 |
google-apitools | 0.5.7 |
google-cloud-dataflow | 0.5.5 |
google-python-cloud-debugger | 1.9 |
googledatastore | 6.4.1 |
grpcio | 1.1.0 |
guppy | 0.1.10 |
httplib2 | 0.9.2 |
mock | 2.0.0 |
nltk | 3.2.1 |
nose | 1.3.7 |
numpy | 1.12.0 |
oauth2client | 2.2.0 |
pandas | 0.18.1 |
pbr | 1.10.0 |
Pillow | 3.4.1 |
proto-google-datastore-v1 | 1.3.1 |
protobuf | 3.0.0 |
protorpc | 0.11.1 |
pyasn1 | 0.2.2 |
pyasn1-modules | 0.0.8 |
python-dateutil | 2.6.0 |
python-gflags | 3.0.6 |
python-snappy | 0.5 |
pytz | 2016.10 |
PyYAML | 3.11 |
requests | 2.10.0 |
rsa | 3.4.2 |
scikit-learn | 0.17.1 |
scipy | 0.17.1 |
six | 1.10.0 |
tensorflow | 1.0.0 |
tensorflow-transform | 0.1.4 |
uritemplate | 3.0.0 |
Is it because tf.transform has arrived? On Cloud ML, ~~the TensorFlow version is 0.12~~ **(EDIT: the latest versions can be checked [here](https://cloud.google.com/ml-engine/docs/concepts/runtime-version-list))**, while on Dataflow it is 1.0.0. scikit-learn seems a bit old.
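If you only care about one or two of these libraries rather than the whole `pip freeze`, a small variation on `inspect_df` will log just their versions. This is a sketch along the same lines (both TensorFlow and scikit-learn expose `__version__`); plug it into the pipeline in place of `inspect_df`.

```python
# Sketch: log just the versions of interest from a worker,
# using the same beam.Map pattern as inspect_df above.
def log_versions(dat):
    import logging
    import tensorflow as tf
    import sklearn
    logging.info('tensorflow %s', tf.__version__)
    logging.info('scikit-learn %s', sklearn.__version__)
```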
Although staging is a little slow, Dataflow, which lets you easily switch between local and remote from a Jupyter Notebook and handles instance startup and shutdown in a fully managed way on its own, looks like a powerful tool for applications such as data analysis and machine learning.