Machine learning typically involves large datasets and many model iterations. This presentation shows how to use GCP to speed up that process with ML Engine and Dataflow. The focus of the presentation is on tooling, not on models or business cases. The solutions for the workshop can be found at https://github.com/Juta/mlengine-boilerplate
2. About ML6
We are a team of data scientists, machine learning experts, software engineers and mathematicians. Our mission is to provide tailor-made systems to help your organization get smart, actionable insights from large data volumes.
+ Specialized Machine Learning partner of Google Cloud
Robbe Sneyders, ML engineer @ ML6
Juta Staes, ML engineer @ ML6
3. Outline
1. GCP tools overview
2. Workshop part 1: Dataflow
3. Workshop part 2: ML Engine
4. Machine Learning Pipeline
Collect data → Organize data → Create model → Train model with organized data → Deploy trained model, then iterate.
5. Mapping to GCP Products
● Collect data → Cloud Storage
● Organize data → Cloud Dataflow
● Create model → TensorFlow
● Train model with organized data → Cloud Machine Learning Engine
● Deploy trained model → Cloud Machine Learning Engine
6. Google Cloud Products
Product categories: Compute, Storage, Data & Analytics, Machine Learning, Cloud Functions
7. Google Cloud Platform: Open Cloud Philosophy
● Powerful open source frameworks that run everywhere
● Fully managed compute and storage services to run them more easily
● Free trial for 1 year with $300 in credits
Open source framework ↔ managed service:
● TensorFlow ↔ Cloud Machine Learning Engine
● Apache Beam ↔ Cloud Dataflow
8. Outline
1. GCP tools overview
2. Workshop part 1: Dataflow
3. Workshop part 2: ML Engine
10. Workshop: overview
Tools used: Cloud Dataflow, Cloud Machine Learning Engine, TensorFlow
● Part 1: Transform images from Cloud Storage into TFRecords and split them into train, test and validation sets
● Part 2: Build an ML model, deploy it and use it to make predictions
11. Google Cloud Storage (GCS)
● Object storage service
● Your data lives here:
○ Raw input data
○ Cleaned examples for TF models
○ Serialized TensorFlow models
● Single interface/API, multiple offerings:
○ Multi-Regional: frequent access, cross-regional
○ Regional: frequent access, single-region
○ Nearline: access less than once per month
○ Coldline: access less than once per year
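Reading and writing GCS objects from Python typically goes through the google-cloud-storage client library. A minimal sketch, where the bucket and object names are hypothetical:

# Minimal sketch of GCS access with the google-cloud-storage
# client library. Bucket and object names are hypothetical.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-flowers-bucket")  # hypothetical bucket

# Upload a local file as an object.
blob = bucket.blob("raw/flowers.csv")
blob.upload_from_filename("flowers.csv")

# Download it again.
blob.download_to_filename("flowers_copy.csv")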
12. Apache Beam running on Cloud Dataflow
● Open source, unified model for defining both batch and streaming data-parallel processing pipelines.
● Using one of the open source Beam SDKs, you build a program that defines the pipeline.
● The pipeline is then executed by one of Beam's supported distributed processing back-ends, which include Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow.
[Diagram: pipelines constructed with the Beam Java, Beam Python, or other language SDKs execute on Beam runners such as Apache Flink, Apache Spark, and Cloud Dataflow]
Source: https://beam.apache.org
13. Apache Beam key concepts
● Pipelines: a data processing job made of a series of computations, including input, processing, and output
● PCollections: bounded (or unbounded) datasets which represent the input, intermediate and output data in pipelines
● PTransforms: a data processing step in a pipeline which takes one or more PCollections as input and produces one or more PCollections as output
● I/O Sources and Sinks: APIs for reading and writing data, which are the roots and endpoints of the pipeline
Source: https://beam.apache.org
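To make these concepts concrete, here is a minimal Beam Python pipeline; the file paths and transform logic are illustrative:

# Minimal Apache Beam pipeline illustrating the key concepts.
# File paths are hypothetical.
import apache_beam as beam

with beam.Pipeline() as pipeline:              # Pipeline
    lines = (
        pipeline
        | beam.io.ReadFromText("input.txt")    # I/O source -> PCollection
        | beam.Map(lambda line: line.upper())  # PTransform
    )
    lines | beam.io.WriteToText("output")      # I/O sink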
14. Apache Beam running on Cloud Dataflow
● Fully managed data processing service to run Apache Beam pipelines:
○ Automated and optimized work partitioning, which can dynamically rebalance lagging work
○ Horizontal dynamic autoscaling of worker resources
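Switching a Beam pipeline from local execution to Dataflow is mostly a matter of pipeline options; the project, region and bucket below are placeholders:

# Running the same Beam pipeline on Cloud Dataflow by setting
# pipeline options. Project, region and bucket are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",          # use Dataflow instead of the local runner
    project="my-gcp-project",
    region="europe-west1",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as pipeline:
    ...  # same pipeline definition as before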
15. Input data: Flowers sample
Hosted publicly by Google
● CSV file on Google Cloud Storage
○ One line per sample
○ Format: image URI, label
● Text file on Google Cloud Storage
○ All labels
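For illustration, each CSV line pairs an image URI with its label; the exact paths below are hypothetical:

# Illustrative CSV lines: image URI, label (paths are hypothetical)
gs://cloud-ml-data/img/flower_photos/daisy/image1.jpg,daisy
gs://cloud-ml-data/img/flower_photos/roses/image2.jpg,roses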
16. Collect & organize data with Cloud Dataflow
Flowers sample steps
● ReadData:
○ Read metadata from one CSV file
○ Output one string per line
● Split:
○ Transform each string into a tuple
■ (uri, label)
● ReadDictionary:
○ Read labels from text file
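A sketch of these first steps in Beam Python; the transform names mirror the slide, while the file paths are illustrative:

# Sketch of the ReadData / Split / ReadDictionary steps.
# Paths are hypothetical.
import apache_beam as beam

with beam.Pipeline() as pipeline:
    # ReadData: one string per CSV line
    lines = pipeline | "ReadData" >> beam.io.ReadFromText(
        "gs://my-bucket/all_data.csv")

    # Split: "uri,label" string -> (uri, label) tuple
    samples = lines | "Split" >> beam.Map(
        lambda line: tuple(line.split(",")))

    # ReadDictionary: one label per line in the dictionary file
    labels = pipeline | "ReadDictionary" >> beam.io.ReadFromText(
        "gs://my-bucket/dict.txt")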
18. Collect & organize data with Cloud Dataflow
Flowers sample steps
● OneHotEncoding:
○ Main input: (uri, label)
○ Side input: labels
○ Output: (uri, one-hot encoding)
● ReadImage:
○ Read image from URI and convert to pixels
● BuildExamples:
○ Build a dictionary for each sample to store as a TFRecord
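Continuing the pipeline sketch above (samples and labels come from the previous steps), here is one illustrative way these steps could look; all function and variable names are hypothetical:

# Sketch of the OneHotEncoding, ReadImage and BuildExamples steps.
# Continues the previous sketch; all names are illustrative.
import apache_beam as beam
import tensorflow as tf

def one_hot_encode(sample, all_labels):
    """(uri, label) -> (uri, one-hot list), using the label
    dictionary passed in as a side input."""
    uri, label = sample
    return uri, [1 if label == l else 0 for l in all_labels]

def read_image(sample):
    """ReadImage: replace the URI with pixel values (sketch only)."""
    uri, one_hot = sample
    pixels = []  # ... read the image from GCS and decode to floats ...
    return pixels, one_hot

def build_example(sample):
    """BuildExamples: build a tf.train.Example per sample."""
    pixels, one_hot = sample
    return tf.train.Example(features=tf.train.Features(feature={
        "image": tf.train.Feature(
            float_list=tf.train.FloatList(value=pixels)),
        "label": tf.train.Feature(
            int64_list=tf.train.Int64List(value=one_hot)),
    }))

encoded = samples | "OneHotEncoding" >> beam.Map(
    one_hot_encode, all_labels=beam.pvalue.AsList(labels))
images = encoded | "ReadImage" >> beam.Map(read_image)
examples = images | "BuildExamples" >> beam.Map(build_example)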
19. Collect & organize data with Cloud Dataflow
Flowers sample steps
● Partition:
○ Partition data into train, validation and test sets
● WriteExamples:
○ Write TFRecords to Cloud Storage
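A sketch of these final steps, again continuing from the examples PCollection above; the random 80/10/10 split and the output paths are illustrative:

# Sketch of the Partition and WriteExamples steps.
# The split ratios and output paths are hypothetical.
import random
import apache_beam as beam

def partition_fn(example, num_partitions):
    """Assign each example to train (0), validation (1) or test (2)."""
    r = random.random()
    return 0 if r < 0.8 else (1 if r < 0.9 else 2)

train, validation, test = (
    examples | "Partition" >> beam.Partition(partition_fn, 3))

for name, split in [("train", train), ("validation", validation),
                    ("test", test)]:
    _ = (split
         | "Serialize_%s" % name >> beam.Map(
             lambda ex: ex.SerializeToString())
         | "WriteExamples_%s" % name >> beam.io.WriteToTFRecord(
             "gs://my-bucket/%s" % name,
             file_name_suffix=".tfrecord"))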
23. Outline
1. GCP tools overview
2. Workshop part 1: Dataflow
3. Workshop part 2: ML Engine
24. TensorFlow
● Open-source library for machine learning
● Single API for multiple platforms/devices: CPUs, GPUs, TPUs, mobile phones...
● Two-step approach:
○ Construct your model as a computational graph
○ Train your model by pushing data through the graph
● Big community with lots of state-of-the-art model implementations
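A minimal sketch of the two-step approach in TensorFlow 1.x style (matching the workshop era); the tiny linear model is purely illustrative:

# Two-step TensorFlow (1.x) approach: build a graph, then push
# data through it in a session. The model itself is illustrative.
import tensorflow as tf

# Step 1: construct the computational graph (nothing runs yet).
x = tf.placeholder(tf.float32, shape=[None, 2])
w = tf.Variable(tf.zeros([2, 1]))
y = tf.matmul(x, w)

# Step 2: execute the graph by feeding data through it.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(y, feed_dict={x: [[1.0, 2.0]]}))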
25. ML Engine Training
● TensorFlow training as a service
● Data needs to be available online
● No fancy interface (only logging + TensorBoard)
● The same code can run locally to test on small datasets
● Nice features:
○ Easy setup of (GPU) clusters for distributed TensorFlow models
○ Automatic parallel hyperparameter tuning with Hypertune
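ML Engine runs a Python trainer package you provide. A minimal sketch of an entry point (e.g. trainer/task.py); the argument names other than --job-dir, which ML Engine forwards to the trainer, are illustrative:

# Minimal sketch of a trainer entry point that ML Engine can run.
# Argument names besides --job-dir are illustrative.
import argparse

def train(data_dir, job_dir):
    # ... build the TensorFlow graph and run the training loop,
    # writing checkpoints and TensorBoard summaries to job_dir ...
    pass

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-dir", required=True)  # GCS path to TFRecords
    parser.add_argument("--job-dir", required=True)   # forwarded by ML Engine
    args = parser.parse_args()
    train(args.data_dir, args.job_dir)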
26. ML Engine Predictions
● Deploy trained model:
○ model (container)
○ version (actual code)
● Predictions:
○ batch
○ online
● Autoscaling
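Once a model version is deployed, online predictions can be requested through the Google API client library. A sketch, where the project, model, version and instance contents are hypothetical:

# Sketch of an online prediction request to a deployed model
# version. Project, model and instance contents are hypothetical.
from googleapiclient import discovery

service = discovery.build("ml", "v1")
name = "projects/my-gcp-project/models/flowers/versions/v1"

response = service.projects().predict(
    name=name,
    body={"instances": [{"image": [0.0] * 784}]},  # illustrative input
).execute()

print(response["predictions"])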