Apache Beam is the framework Google uses in its Dataflow cloud service. Its selling point is that batch processing and streaming processing can be written the same way. Here I wrote a minimal program in Python and ran it.
- The premise is Python 2.7.
- If you only have a Python 3 environment, or don't normally use Python, set up a Python environment first.
- I only had a Python 3 conda environment, so I created a Python 2.7 environment with conda. Reference: Switch between Python2 and Python3 and use jupyter notebook
- The installation method using Virtualenv is described in the official Beam document: Apache Beam Python SDK Quickstart
- If you use Google's GCP, see the GCP documentation: Quick Start with Python
- After activating the Python 2.7 environment, run the following.
- Incidentally, on Python 3.5 the install aborts with a cython error, at least in my environment.

```
pip install apache-beam
```
- I skimmed the Python programming guide and pasted together about three snippets from it to produce the following.
```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# First, make a pipeline
p = beam.Pipeline(options=PipelineOptions())

# 1. Set an array as the pipeline input (4 lines of input)
lines = (p
         | beam.Create([
             'To be, or not to be: that is the question: ',
             'Whether \'tis nobler in the mind to suffer ',
             'The slings and arrows of outrageous fortune, ',
             'Or to take arms against a sea of troubles, ']))

# 2. Set the character count of each line as the transform
word_lengths = lines | beam.Map(len)

# 3. Finally, print each count to standard output and finish
class ExtractWordsFn(beam.DoFn):
    def process(self, element):
        print(element)

p_end = word_lengths | beam.ParDo(ExtractWordsFn())
p.run()
```
- Results of running on Jupyter (also published to Gist):
```
<apache_beam.runners.direct.direct_runner.DirectPipelineResult at 0x7f6b69c92950>
43
42
45
43
```
- It's simple, but for now I was able to run Apache Beam batch processing from Python. It took about an hour from installation to implementation and execution.
- I want to try streaming processing next. It also seems that engines such as Spark Streaming can be used underneath, so I'd like to try that as well.
- I was happy that all of the above could be run from Bash on Windows plus Jupyter on Windows.