Google Cloud Dataflow
the next generation of managed big data service
based on the Apache Beam programming model
Sub Szabolcs Feczak, Cloud Solutions Engineer
Google
9th Cloud & Data Center World 2016 - 한국 IDG
Goals
You leave here understanding the fundamentals of the Apache Beam model
and the Google Cloud Dataflow managed service.
We have some fun.
Background and historical overview
The trade-off quadrant of Big Data
(Figure: a quadrant balancing Completeness, Speed, Cost and Complexity;
the optimization target is Time to Answer)
MapReduce
Hadoop
Flume
Storm
Spark
MillWheel
Flink
Apache Beam
(Figure: capability matrix of the engines above: Batch, Streaming,
Pipelines, Unified API, No Lambda, Iterative, Interactive, Exactly Once,
State, Timers, Auto-Awesome, Watermarks, Windowing, High-level API,
Managed Service, Triggers, Open Source, Unified Engine, Optimizer)
Deep dive, probing familiarity with the subject:
1M devices, 16.6K events/sec, 43B events/month, 518B events/year
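These rates are consistent, as a quick back-of-the-envelope check shows (assuming a 30-day month; the slide's yearly figure of 518B rounds from a slightly lower average rate):

```python
# Back-of-the-envelope check of the throughput figures above.
events_per_sec = 16_600
per_day = events_per_sec * 86_400        # seconds in a day
per_month = per_day * 30                 # ~43 billion events/month
per_year = per_day * 365                 # ~523 billion events/year

print(f"{per_month / 1e9:.1f}B events/month")
print(f"{per_year / 1e9:.1f}B events/year")
```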
Before Apache Beam:
Batch (Accuracy, Simplicity, Savings) OR Stream (Speed, Sophistication, Scalability)

After Apache Beam:
Batch (Accuracy, Simplicity, Savings) AND Stream (Speed, Sophistication, Scalability)
Balancing correctness, latency and cost with a unified batch
and streaming model
http://research.google.com/search.html?q=dataflow
Apache Beam (incubating)
Software Development Kits:
Java: https://github.com/GoogleCloudPlatform/DataflowJavaSDK
Python (ALPHA)
Scala: /darkjh/scalaflow, /jhlch/scala-dataflow-dsl

Runners:
Spark runner @ /cloudera/spark-dataflow
Flink runner @ /dataArtisans/flink-dataflow

http://incubator.apache.org/projects/beam.html
The Dataflow submission to the Apache Incubator was accepted on February 1, 2016,
and the resulting project is now called Apache Beam.
• Movement
• Filtering
• Enrichment
• Shaping
• Reduction
• Batch computation
• Continuous
computation
• Composition
• External
orchestration
• Simulation
Where might you use Apache Beam?
ETL, Analysis, Orchestration
Why would you go with a
managed service?
GCP
Managed Service
User Code & SDK
Work Manager
Deploy & Schedule
Monitoring UI
Job Manager
Cloud Dataflow Managed Service advantages
(GA since 2015 August)
Progress & Logs
Deploy, Schedule & Monitor, Tear Down
Worker Lifecycle Management
Cloud Dataflow Service
❯ Time & life never stop
❯ Data rates & schema are not static
❯ Scaling models are not static
❯ Non-elastic compute is wasteful and
can create lag
Challenge: cost optimization
Auto-scaling
800 QPS 1200 QPS 5000 QPS 50 QPS
10:00 11:00 12:00 13:00
Cloud Dataflow Service
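The auto-scaling idea on this slide can be sketched in a few lines (illustrative only: the real Dataflow autoscaler is more sophisticated, and `per_worker_qps` is an invented parameter):

```python
import math

def workers_needed(qps: float, per_worker_qps: float = 500.0,
                   min_workers: int = 1, max_workers: int = 100) -> int:
    """Toy autoscaler: size the worker pool to the current throughput.
    (Conceptual sketch, not the actual Dataflow scaling policy.)"""
    return max(min_workers, min(max_workers, math.ceil(qps / per_worker_qps)))

# The QPS timeline from the slide:
for hour, qps in [("10:00", 800), ("11:00", 1200), ("12:00", 5000), ("13:00", 50)]:
    print(hour, qps, "QPS ->", workers_needed(qps), "workers")
```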
100 mins. vs. 65 mins.
Dynamic Work Rebalancing
Cloud Dataflow Service
● ParDo fusion
○ Producer Consumer
○ Sibling
○ Intelligent fusion
boundaries
● Combiner lifting e.g. partial
aggregations before
reduction
● http://research.google.com/search.html?q=flume%20java
...
Graph Optimization
Cloud Dataflow Service
(Figure: fusion examples. Legend: GBK = GroupByKey, + = CombineValues.
Producer-consumer fusion merges ParallelDos C and D into C+D; sibling
fusion merges parallel consumers of the same input; combiner lifting
rewrites A, GBK, +, B into A, partial +, GBK, +, B.)
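Combiner lifting (partial aggregation before the shuffle) can be sketched in plain Python; the function names are illustrative, not the Beam API:

```python
from collections import defaultdict

def partial_combine(bundle):
    """Pre-aggregate (key, value) pairs inside one worker bundle,
    so only one pair per key crosses the GroupByKey (shuffle) boundary."""
    sums = defaultdict(int)
    for key, value in bundle:
        sums[key] += value
    return list(sums.items())

def final_combine(partials):
    """Merge the per-bundle partial sums after the shuffle."""
    totals = defaultdict(int)
    for bundle in partials:
        for key, value in bundle:
            totals[key] += value
    return dict(totals)

bundles = [[("a", 1), ("b", 2), ("a", 3)],   # bundle on worker 1
           [("a", 4), ("b", 5)]]             # bundle on worker 2
lifted = [partial_combine(b) for b in bundles]  # 4 pairs shuffled instead of 5
print(final_combine(lifted))                    # {'a': 8, 'b': 7}
```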
Deep dive into
the programming model
The Apache Beam Logical Model
What are you computing?
Where in event time?
When in processing time?
How do refinements relate?
What are you computing?
● A Pipeline represents a graph
● Nodes are data processing
transformations
● Edges are data sets flowing
through the pipeline
● Optimized and executed as a
unit for efficiency
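A toy model of this graph idea (nodes as transforms, edges as the datasets flowing between them) is shown below; this is a conceptual sketch, not the Beam API, and all names are invented:

```python
class Pipeline:
    """Toy pipeline: an ordered graph where nodes are transforms and
    edges are the datasets flowing between them."""
    def __init__(self, source):
        self.source = list(source)    # the initial edge (a dataset)
        self.transforms = []          # the graph's nodes, in order

    def apply(self, name, fn):
        self.transforms.append((name, fn))
        return self                   # allow chaining, Beam-style

    def run(self):
        data = self.source
        for name, fn in self.transforms:
            data = fn(data)           # each node maps one edge to the next
        return data

result = (Pipeline([1, 2, 3, 4])
          .apply("Square", lambda xs: [x * x for x in xs])
          .apply("KeepEven", lambda xs: [x for x in xs if x % 2 == 0])
          .run())
print(result)  # [4, 16]
```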
What are you computing? PCollections
● A PCollection is a collection of homogeneous data, all of the same type
● May be bounded or unbounded in size
● Each element has an implicit
timestamp
● Initially created from backing data
stores
Challenge: completeness when processing continuous data
(Figure: events with an 8:00 event time arriving throughout
8:00-14:00 processing time)
What are you computing? PTransforms
transform PCollections into other PCollections.
What Where When How
Element-Wise
(Map + Reduce = ParDo)
Aggregating
(Combine, Join, Group)
Composite
GroupByKey
Pair With Ones
Sum Values
Count
❯ Define new PTransforms by building up
subgraphs of existing transforms
❯ Some utilities are included in the SDK
• Count, RemoveDuplicates, Join,
Min, Max, Sum, ...
❯ You can define your own:
• DoSomething, DoSomethingElse,
etc.
❯ Why bother?
• Code reuse
• Better monitoring experience
Composite PTransforms
Apache Beam SDK
Example: Computing Integer Sums
What Where When How
What Where When How
Example: Computing Integer Sums
(Figure: the same keyed elements assigned to Fixed, Sliding,
and Session windows)
Where in Event Time?
● Windowing divides data into event-time-based finite chunks.
● Required when doing aggregations over unbounded data.
What Where When How
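Window assignment can be sketched as pure functions over timestamps (a conceptual sketch; Beam's actual WindowFn API differs, and the numbers are illustrative):

```python
def fixed_windows(timestamp, size):
    """Assign an event-time timestamp to its single fixed window."""
    start = timestamp - (timestamp % size)
    return [(start, start + size)]

def sliding_windows(timestamp, size, period):
    """Assign a timestamp to every sliding window that contains it."""
    last_start = timestamp - (timestamp % period)
    starts = range(last_start, last_start - size, -period)
    return [(s, s + size) for s in starts if s <= timestamp < s + size]

# An element at minute 5, with 2-minute fixed windows:
print(fixed_windows(5, 2))        # [(4, 6)]
# The same element with 4-minute windows sliding every 2 minutes:
print(sliding_windows(5, 4, 2))   # [(4, 8), (2, 6)]
```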
What Where When How
Example: Fixed 2-minute Windows
What Where When How
When in Processing Time?
● Triggers control
when results are
emitted.
● Triggers are often
relative to the
watermark.
(Figure: the watermark tracks event time against processing time;
the gap between the two is the skew)
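A toy model of a watermark-relative trigger for a single window (conceptual only, not Beam's trigger API):

```python
def run_trigger(stream, window_end):
    """stream: processing-time-ordered events, either ("element", value)
    or ("watermark", time). Fires an on-time pane once the watermark
    passes window_end, then one late pane per subsequent element."""
    panes, buffer, on_time_fired = [], [], False
    for kind, payload in stream:
        if kind == "element":
            if on_time_fired:
                panes.append(("late", payload))   # arrived behind the watermark
            else:
                buffer.append(payload)            # wait for the watermark
        elif kind == "watermark" and not on_time_fired and payload >= window_end:
            panes.append(("on-time", sum(buffer)))
            on_time_fired = True
    return panes

stream = [("element", 5), ("element", 1),
          ("watermark", 12),   # watermark passes the window end
          ("element", 2)]      # a late datum
print(run_trigger(stream, window_end=10))  # [('on-time', 6), ('late', 2)]
```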
What Where When How
Example: Triggering at the Watermark
What Where When How
Example: Triggering for Speculative & Late Data
What Where When How
How do Refinements Relate?
● How should multiple outputs per window
accumulate?
● Appropriate choice depends on consumer.
Firing          Elements   Discarding   Accumulating   Acc. & Retracting
Speculative     3          3            3              3
Watermark       5, 1       6            9              9, -3
Late            2          2            11             11, -9
Total Observed  11         11           23             11
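A small simulation reproduces the table's three accumulation modes (a conceptual sketch, not Beam's API; the firings match the Speculative / Watermark / Late rows above):

```python
def panes(firings, mode):
    """Emit one pane per trigger firing under a given accumulation mode.
    firings: one element list per firing, e.g. [[3], [5, 1], [2]]."""
    out, seen = [], []
    for elements in firings:
        prev_sum = sum(seen)          # what the previous pane reported
        seen += elements
        if mode == "discarding":
            out.append([sum(elements)])          # only the new elements
        elif mode == "accumulating":
            out.append([sum(seen)])              # everything so far
        elif mode == "retracting":
            pane = [sum(seen)]
            if prev_sum:
                pane.append(-prev_sum)           # retract the previous pane
            out.append(pane)
    return out

firings = [[3], [5, 1], [2]]             # speculative, watermark, late
print(panes(firings, "discarding"))      # [[3], [6], [2]]  -> downstream sum 11
print(panes(firings, "accumulating"))    # [[3], [9], [11]] -> naive sum 23
print(panes(firings, "retracting"))      # [[3], [9, -3], [11, -9]] -> net 11
```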
What Where When How
Example: Add Newest, Remove Previous
1. Classic Batch
2. Batch with Fixed Windows
3. Streaming
4. Streaming with Speculative + Late Data
5. Streaming with Retractions
Customizing What Where When How
What Where When How
The key takeaway
Optimizing Your Time To Answer
Typical Data Processing:
Programming, plus resource provisioning, performance tuning, monitoring,
reliability, deployment & configuration, handling growing scale, and
utilization improvements.

Data Processing with Cloud Dataflow:
Programming, with more time to dig into your data.
How much more
time?
You do not just save
on processing, but
code complexity
and size as well!
Source: https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison
What do customers have to say about
Google Cloud Dataflow
"We are utilizing Cloud Dataflow to overcome elasticity
challenges with our current Hadoop cluster. Starting with
some basic ETL workflow for BigQuery ingestion, we
transitioned into full blown clickstream processing and
analysis. This has helped us significantly improve
performance of our overall system and reduce cost."
Sudhir Hasbe, Director of Software Engineering, Zullily.com
“The current iteration of Qubit’s real-time data supply chain
was heavily inspired by the ground-breaking stream
processing concepts described in Google’s MillWheel paper.
Today we are happy to come full circle and build streaming
pipelines on top of Cloud Dataflow - which has delivered
on the promise of a highly-available and fault-tolerant
data processing system with an incredibly powerful and
expressive API.”
Jibran Saithi, Lead Architect, Qubit
"We are very excited about the productivity benefits offered by
Cloud Dataflow and Cloud Pub/Sub. It took half a day to
rewrite something that had previously taken over six
months to build using Spark"
Paul Clarke, Director of Technology, Ocado
“Boosting performance isn’t the only thing we want to get
from the new system. Our bet is that by using cloud-managed
products we will have a much lower operational overhead.
That in turn means we will have much more time to make
Spotify’s products better.”
Igor Maravić, Software Engineer working at Spotify
Demo Time!
Let’s build something - Demo!
1. Ingest the stream of Wikipedia edits
   (https://wikitech.wikimedia.org/wiki/Stream.wikimedia.org),
   create a pipeline and run a Dataflow job to extract the top 10 active
   editors and top 10 pages edited, then inspect the result set in our
   data warehouse (BigQuery)
2. Extract words from a Shakespeare corpus, count the occurrences of each
   word, and write sharded results as blobs into a key-value store
   (Cloud Storage)
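The core logic of the word-count demo, minus the pipeline plumbing and Cloud Storage I/O, looks roughly like this plain-Python sketch (the tokenizing regex is an assumption, not the demo's exact code):

```python
import re
from collections import Counter

def word_count(lines):
    """Split each line into lowercase words and count occurrences."""
    counts = Counter()
    for line in lines:
        counts.update(re.findall(r"[a-z']+", line.lower()))
    return counts

corpus = ["To be, or not to be: that is the question:"]
print(word_count(corpus).most_common(2))  # [('to', 2), ('be', 2)]
```

In the actual demo this logic runs inside a Dataflow pipeline, which also shards the output across multiple result blobs.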
Thank You!
cloud.google.com/dataflow
cloud.google.com/blog/big-data/
cloud.google.com/solutions/articles#bigdata
cloud.google.com/newsletter
research.google.com