Google Cloud Dataflow doesn't get much attention, but it's quite convenient because you can easily switch the execution environment between local and remote. Moreover, you're not limited to the standard library as you might think: a list of pip packages comes pre-installed, and you can also [install your own](http://qiita.com/orfeon/items/78ff952052c4bde4bcd3) (a sketch of that follows the option settings below). Since a quick search of the documentation didn't turn up which libraries are pre-installed, I decided to find out for myself.
First of all, the option settings. This part is copied pretty much wholesale from @orfeon's article ...
Option settings

```python
import apache_beam as beam
import apache_beam.transforms.window as window

options = beam.utils.pipeline_options.PipelineOptions()

# GCP project and staging settings.
google_cloud_options = options.view_as(beam.utils.pipeline_options.GoogleCloudOptions)
google_cloud_options.project = '{PROJECTID}'
google_cloud_options.job_name = 'test'
google_cloud_options.staging_location = 'gs://{BUCKET_NAME}/binaries'
google_cloud_options.temp_location = 'gs://{BUCKET_NAME}/temp'

worker_options = options.view_as(beam.utils.pipeline_options.WorkerOptions)
worker_options.max_num_workers = 1

# Switching between local and remote execution is just a matter of the runner:
# options.view_as(beam.utils.pipeline_options.StandardOptions).runner = 'DirectRunner'
options.view_as(beam.utils.pipeline_options.StandardOptions).runner = 'DataflowRunner'

p = beam.Pipeline(options=options)
```
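Incidentally, if you do want to bring in packages that aren't pre-installed, the same options object is where you would point Beam at them. This is only a sketch: it assumes this SDK version exposes `SetupOptions` with a `requirements_file` attribute, and `requirements.txt` is a hypothetical file name.

```python
# Sketch, assuming SetupOptions is available in this SDK version:
# packages listed in the (hypothetical) requirements.txt are installed
# on each worker before the job starts.
setup_options = options.view_as(beam.utils.pipeline_options.SetupOptions)
setup_options.requirements_file = 'requirements.txt'
```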
Next, run `pip freeze` and write the Python package list to the log.
Package list output part

```python
def inspect_df(dat):
    # Imports live inside the function so they also run on the Dataflow workers.
    import subprocess
    import logging
    process = subprocess.Popen('pip freeze', shell=True,
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    for line in process.stdout:
        logging.info(line)
```
Run it on Dataflow. The hello world input is just a dummy, so you probably don't really need it ...
Pipeline execution

```python
(p | 'init' >> beam.Create(['hello', 'world'])
   | 'inspect' >> beam.Map(inspect_df))
p.run()
```
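Since the selling point is how easily the execution environment switches, the same check can be run locally first. This sketch just flips the runner to DirectRunner, as in the commented-out line in the option settings, and rebuilds the pipeline; everything else stays unchanged.

```python
# Minimal local sanity check: the same pipeline on DirectRunner.
# Note that locally, pip freeze reports your own environment, not the workers'.
options.view_as(beam.utils.pipeline_options.StandardOptions).runner = 'DirectRunner'
p_local = beam.Pipeline(options=options)
(p_local | 'init' >> beam.Create(['hello', 'world'])
         | 'inspect' >> beam.Map(inspect_df))
p_local.run()
```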
When the pipeline finishes running, the package list is output to the log, so check it in the Cloud Console.
According to the Dataflow documentation, you can check the logs from the Dataflow job details screen, but as of March 4, 2017 they have moved to Stackdriver Logging.
The log output looks like the following. Here is the list of packages dumped in that log. **As of March 4, 2017**
Package | Version |
---|---|
avro | 1.8.1 |
beautifulsoup4 | 4.5.1 |
bs4 | 0.0.1 |
crcmod | 1.7 |
Cython | 0.25.2 |
dataflow-worker | 0.5.5 |
dill | 0.2.5 |
enum34 | 1.1.6 |
funcsigs | 1.0.2 |
futures | 3.0.5 |
google-api-python-client | 1.6.2 |
google-apitools | 0.5.7 |
google-cloud-dataflow | 0.5.5 |
google-python-cloud-debugger | 1.9 |
googledatastore | 6.4.1 |
grpcio | 1.1.0 |
guppy | 0.1.10 |
httplib2 | 0.9.2 |
mock | 2.0.0 |
nltk | 3.2.1 |
nose | 1.3.7 |
numpy | 1.12.0 |
oauth2client | 2.2.0 |
pandas | 0.18.1 |
pbr | 1.10.0 |
Pillow | 3.4.1 |
proto-google-datastore-v1 | 1.3.1 |
protobuf | 3.0.0 |
protorpc | 0.11.1 |
pyasn1 | 0.2.2 |
pyasn1-modules | 0.0.8 |
python-dateutil | 2.6.0 |
python-gflags | 3.0.6 |
python-snappy | 0.5 |
pytz | 2016.10 |
PyYAML | 3.11 |
requests | 2.10.0 |
rsa | 3.4.2 |
scikit-learn | 0.17.1 |
scipy | 0.17.1 |
six | 1.10.0 |
tensorflow | 1.0.0 |
tensorflow-transform | 0.1.4 |
uritemplate | 3.0.0 |
Is it because tf.transform has arrived? On Cloud ML, ~~the TensorFlow version is 0.12~~ **(EDIT: the latest versions can be checked [here](https://cloud.google.com/ml-engine/docs/concepts/runtime-version-list))**, while on Dataflow it is 1.0.0. scikit-learn seems a bit old.
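If you only care about one or two of these libraries rather than the whole `pip freeze`, a small variation on `inspect_df` will log just their versions. This is a sketch along the same lines (both TensorFlow and scikit-learn expose `__version__`); plug it into the pipeline in place of `inspect_df`.

```python
# Sketch: log just the versions of interest from a worker,
# using the same beam.Map pattern as inspect_df above.
def log_versions(dat):
    import logging
    import tensorflow as tf
    import sklearn
    logging.info('tensorflow %s', tf.__version__)
    logging.info('scikit-learn %s', sklearn.__version__)
```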
Although staging is a little slow, Dataflow, which lets you easily switch between local and remote from a Jupyter Notebook and handles instance startup and shutdown in a fully managed way on its own, looks like a powerful tool for applications such as data analysis and machine learning.