Google Cloud Dataflow
the next generation of managed big data service
based on the Apache Beam programming model
Sub Szabolcs Feczak, Cloud Solutions Engineer
Google
9th Cloud & Data Center World 2016 - 한국 IDG
Goals
You leave here understanding the fundamentals of the Apache Beam model
and the Google Cloud Dataflow managed service.
We have some fun.
Background and historical overview
The trade-off quadrant of Big Data
(Figure: a quadrant balancing Completeness, Speed, Cost and Complexity;
the optimization target is Time to Answer)
MapReduce
Hadoop
Flume
Storm
Spark
MillWheel
Flink
Apache Beam
(Figure: capability matrix of the engines above: Batch, Streaming,
Pipelines, Unified API, No Lambda, Iterative, Interactive, Exactly Once,
State, Timers, Auto-Awesome, Watermarks, Windowing, High-level API,
Managed Service, Triggers, Open Source, Unified Engine, Optimizer)
Deep dive, probing familiarity with the subject:
1M devices, 16.6K events/sec, 43B events/month, 518B events/year
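These rates are consistent, as a quick back-of-the-envelope check shows (assuming a 30-day month; the slide's yearly figure of 518B rounds from a slightly lower average rate):

```python
# Back-of-the-envelope check of the throughput figures above.
events_per_sec = 16_600
per_day = events_per_sec * 86_400        # seconds in a day
per_month = per_day * 30                 # ~43 billion events/month
per_year = per_day * 365                 # ~523 billion events/year

print(f"{per_month / 1e9:.1f}B events/month")
print(f"{per_year / 1e9:.1f}B events/year")
```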
Before Apache Beam:
Batch (Accuracy, Simplicity, Savings) OR Stream (Speed, Sophistication, Scalability)

After Apache Beam:
Batch (Accuracy, Simplicity, Savings) AND Stream (Speed, Sophistication, Scalability)
Balancing correctness, latency and cost with a unified batch
and streaming model
http://research.google.com/search.html?q=dataflow
Apache Beam (incubating)
Software Development Kits:
Java: https://github.com/GoogleCloudPlatform/DataflowJavaSDK
Python (ALPHA)
Scala: /darkjh/scalaflow, /jhlch/scala-dataflow-dsl

Runners:
Spark runner @ /cloudera/spark-dataflow
Flink runner @ /dataArtisans/flink-dataflow

http://incubator.apache.org/projects/beam.html
The Dataflow submission to the Apache Incubator was accepted on February 1, 2016,
and the resulting project is now called Apache Beam.
• Movement
• Filtering
• Enrichment
• Shaping
• Reduction
• Batch computation
• Continuous
computation
• Composition
• External
orchestration
• Simulation
Where might you use Apache Beam?
ETL, Analysis, Orchestration
Why would you go with a
managed service?
GCP
Managed Service
User Code & SDK
Work Manager
Deploy & Schedule
Monitoring UI
Job Manager
Cloud Dataflow Managed Service advantages
(GA since 2015 August)
Progress & Logs
Deploy, Schedule & Monitor, Tear Down
Worker Lifecycle Management
Cloud Dataflow Service
❯ Time & life never stop
❯ Data rates & schema are not static
❯ Scaling models are not static
❯ Non-elastic compute is wasteful and
can create lag
Challenge: cost optimization
Auto-scaling
800 QPS 1200 QPS 5000 QPS 50 QPS
10:00 11:00 12:00 13:00
Cloud Dataflow Service
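The auto-scaling idea on this slide can be sketched in a few lines (illustrative only: the real Dataflow autoscaler is more sophisticated, and `per_worker_qps` is an invented parameter):

```python
import math

def workers_needed(qps: float, per_worker_qps: float = 500.0,
                   min_workers: int = 1, max_workers: int = 100) -> int:
    """Toy autoscaler: size the worker pool to the current throughput.
    (Conceptual sketch, not the actual Dataflow scaling policy.)"""
    return max(min_workers, min(max_workers, math.ceil(qps / per_worker_qps)))

# The QPS timeline from the slide:
for hour, qps in [("10:00", 800), ("11:00", 1200), ("12:00", 5000), ("13:00", 50)]:
    print(hour, qps, "QPS ->", workers_needed(qps), "workers")
```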
100 mins. vs. 65 mins.
Dynamic Work Rebalancing
Cloud Dataflow Service
● ParDo fusion
○ Producer Consumer
○ Sibling
○ Intelligent fusion
boundaries
● Combiner lifting e.g. partial
aggregations before
reduction
● http://research.google.com/search.html?q=flume%20java
...
Graph Optimization
Cloud Dataflow Service
(Figure: fusion examples. Legend: GBK = GroupByKey, + = CombineValues.
Producer-consumer fusion merges ParallelDos C and D into C+D; sibling
fusion merges parallel consumers of the same input; combiner lifting
rewrites A, GBK, +, B into A, partial +, GBK, +, B.)
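Combiner lifting (partial aggregation before the shuffle) can be sketched in plain Python; the function names are illustrative, not the Beam API:

```python
from collections import defaultdict

def partial_combine(bundle):
    """Pre-aggregate (key, value) pairs inside one worker bundle,
    so only one pair per key crosses the GroupByKey (shuffle) boundary."""
    sums = defaultdict(int)
    for key, value in bundle:
        sums[key] += value
    return list(sums.items())

def final_combine(partials):
    """Merge the per-bundle partial sums after the shuffle."""
    totals = defaultdict(int)
    for bundle in partials:
        for key, value in bundle:
            totals[key] += value
    return dict(totals)

bundles = [[("a", 1), ("b", 2), ("a", 3)],   # bundle on worker 1
           [("a", 4), ("b", 5)]]             # bundle on worker 2
lifted = [partial_combine(b) for b in bundles]  # 4 pairs shuffled instead of 5
print(final_combine(lifted))                    # {'a': 8, 'b': 7}
```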
Deep dive into
the programming model
The Apache Beam Logical Model
What are you computing?
Where in event time?
When in processing time?
How do refinements relate?
What are you computing?
● A Pipeline represents a graph
● Nodes are data processing
transformations
● Edges are data sets flowing
through the pipeline
● Optimized and executed as a
unit for efficiency
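A toy model of this graph idea (nodes as transforms, edges as the datasets flowing between them) is shown below; this is a conceptual sketch, not the Beam API, and all names are invented:

```python
class Pipeline:
    """Toy pipeline: an ordered graph where nodes are transforms and
    edges are the datasets flowing between them."""
    def __init__(self, source):
        self.source = list(source)    # the initial edge (a dataset)
        self.transforms = []          # the graph's nodes, in order

    def apply(self, name, fn):
        self.transforms.append((name, fn))
        return self                   # allow chaining, Beam-style

    def run(self):
        data = self.source
        for name, fn in self.transforms:
            data = fn(data)           # each node maps one edge to the next
        return data

result = (Pipeline([1, 2, 3, 4])
          .apply("Square", lambda xs: [x * x for x in xs])
          .apply("KeepEven", lambda xs: [x for x in xs if x % 2 == 0])
          .run())
print(result)  # [4, 16]
```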
What are you computing? PCollections
● A PCollection is a collection of homogeneous data, all of the same type
● May be bounded or unbounded in size
● Each element has an implicit
timestamp
● Initially created from backing data
stores
Challenge: completeness when processing continuous data
(Figure: events with an 8:00 event time arriving throughout
8:00-14:00 processing time)
What are you computing? PTransforms
transform PCollections into other PCollections.
What Where When How
Element-Wise
(Map + Reduce = ParDo)
Aggregating
(Combine, Join, Group)
Composite
GroupByKey
Pair With Ones
Sum Values
Count
❯ Define new PTransforms by building up
subgraphs of existing transforms
❯ Some utilities are included in the SDK
• Count, RemoveDuplicates, Join,
Min, Max, Sum, ...
❯ You can define your own:
• DoSomething, DoSomethingElse,
etc.
❯ Why bother?
• Code reuse
• Better monitoring experience
Composite PTransforms
Apache Beam SDK
Example: Computing Integer Sums
What Where When How
What Where When How
Example: Computing Integer Sums
(Figure: the same keyed elements assigned to Fixed, Sliding,
and Session windows)
Where in Event Time?
● Windowing divides data into event-time-based finite chunks.
● Required when doing aggregations over unbounded data.
What Where When How
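Window assignment can be sketched as pure functions over timestamps (a conceptual sketch; Beam's actual WindowFn API differs, and the numbers are illustrative):

```python
def fixed_windows(timestamp, size):
    """Assign an event-time timestamp to its single fixed window."""
    start = timestamp - (timestamp % size)
    return [(start, start + size)]

def sliding_windows(timestamp, size, period):
    """Assign a timestamp to every sliding window that contains it."""
    last_start = timestamp - (timestamp % period)
    starts = range(last_start, last_start - size, -period)
    return [(s, s + size) for s in starts if s <= timestamp < s + size]

# An element at minute 5, with 2-minute fixed windows:
print(fixed_windows(5, 2))        # [(4, 6)]
# The same element with 4-minute windows sliding every 2 minutes:
print(sliding_windows(5, 4, 2))   # [(4, 8), (2, 6)]
```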
What Where When How
Example: Fixed 2-minute Windows
What Where When How
When in Processing Time?
● Triggers control
when results are
emitted.
● Triggers are often
relative to the
watermark.
(Figure: the watermark tracks event time against processing time;
the gap between the two is the skew)
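A toy model of a watermark-relative trigger for a single window (conceptual only, not Beam's trigger API):

```python
def run_trigger(stream, window_end):
    """stream: processing-time-ordered events, either ("element", value)
    or ("watermark", time). Fires an on-time pane once the watermark
    passes window_end, then one late pane per subsequent element."""
    panes, buffer, on_time_fired = [], [], False
    for kind, payload in stream:
        if kind == "element":
            if on_time_fired:
                panes.append(("late", payload))   # arrived behind the watermark
            else:
                buffer.append(payload)            # wait for the watermark
        elif kind == "watermark" and not on_time_fired and payload >= window_end:
            panes.append(("on-time", sum(buffer)))
            on_time_fired = True
    return panes

stream = [("element", 5), ("element", 1),
          ("watermark", 12),   # watermark passes the window end
          ("element", 2)]      # a late datum
print(run_trigger(stream, window_end=10))  # [('on-time', 6), ('late', 2)]
```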
What Where When How
Example: Triggering at the Watermark
What Where When How
Example: Triggering for Speculative & Late Data
What Where When How
How do Refinements Relate?
● How should multiple outputs per window
accumulate?
● Appropriate choice depends on consumer.
Firing          Elements   Discarding   Accumulating   Acc. & Retracting
Speculative     3          3            3              3
Watermark       5, 1       6            9              9, -3
Late            2          2            11             11, -9
Total Observed  11         11           23             11
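A small simulation reproduces the table's three accumulation modes (a conceptual sketch, not Beam's API; the firings match the Speculative / Watermark / Late rows above):

```python
def panes(firings, mode):
    """Emit one pane per trigger firing under a given accumulation mode.
    firings: one element list per firing, e.g. [[3], [5, 1], [2]]."""
    out, seen = [], []
    for elements in firings:
        prev_sum = sum(seen)          # what the previous pane reported
        seen += elements
        if mode == "discarding":
            out.append([sum(elements)])          # only the new elements
        elif mode == "accumulating":
            out.append([sum(seen)])              # everything so far
        elif mode == "retracting":
            pane = [sum(seen)]
            if prev_sum:
                pane.append(-prev_sum)           # retract the previous pane
            out.append(pane)
    return out

firings = [[3], [5, 1], [2]]             # speculative, watermark, late
print(panes(firings, "discarding"))      # [[3], [6], [2]]  -> downstream sum 11
print(panes(firings, "accumulating"))    # [[3], [9], [11]] -> naive sum 23
print(panes(firings, "retracting"))      # [[3], [9, -3], [11, -9]] -> net 11
```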
What Where When How
Example: Add Newest, Remove Previous
1. Classic Batch
2. Batch with Fixed Windows
3. Streaming
4. Streaming with Speculative + Late Data
5. Streaming with Retractions
Customizing What Where When How
What Where When How
The key takeaway
Optimizing Your Time To Answer
Typical Data Processing:
Programming, plus resource provisioning, performance tuning, monitoring,
reliability, deployment & configuration, handling growing scale, and
utilization improvements.

Data Processing with Cloud Dataflow:
Programming, with more time to dig into your data.
How much more
time?
You do not just save
on processing, but
code complexity
and size as well!
Source: https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison
What do customers have to say about
Google Cloud Dataflow
"We are utilizing Cloud Dataflow to overcome elasticity
challenges with our current Hadoop cluster. Starting with
some basic ETL workflow for BigQuery ingestion, we
transitioned into full blown clickstream processing and
analysis. This has helped us significantly improve
performance of our overall system and reduce cost."
Sudhir Hasbe, Director of Software Engineering, Zullily.com
“The current iteration of Qubit’s real-time data supply chain
was heavily inspired by the ground-breaking stream
processing concepts described in Google’s MillWheel paper.
Today we are happy to come full circle and build streaming
pipelines on top of Cloud Dataflow - which has delivered
on the promise of a highly-available and fault-tolerant
data processing system with an incredibly powerful and
expressive API.”
Jibran Saithi, Lead Architect, Qubit
"We are very excited about the productivity benefits offered by
Cloud Dataflow and Cloud Pub/Sub. It took half a day to
rewrite something that had previously taken over six
months to build using Spark"
Paul Clarke, Director of Technology, Ocado
“Boosting performance isn’t the only thing we want to get
from the new system. Our bet is that by using cloud-managed
products we will have a much lower operational overhead.
That in turn means we will have much more time to make
Spotify’s products better.”
Igor Maravić, Software Engineer working at Spotify
Demo Time!
Let’s build something - Demo!
1. Ingest the stream of Wikipedia edits
   (https://wikitech.wikimedia.org/wiki/Stream.wikimedia.org),
   create a pipeline and run a Dataflow job to extract the top 10 active
   editors and top 10 pages edited, then inspect the result set in our
   data warehouse (BigQuery)
2. Extract words from a Shakespeare corpus, count the occurrences of each
   word, and write sharded results as blobs into a key-value store
   (Cloud Storage)
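The core logic of the word-count demo, minus the pipeline plumbing and Cloud Storage I/O, looks roughly like this plain-Python sketch (the tokenizing regex is an assumption, not the demo's exact code):

```python
import re
from collections import Counter

def word_count(lines):
    """Split each line into lowercase words and count occurrences."""
    counts = Counter()
    for line in lines:
        counts.update(re.findall(r"[a-z']+", line.lower()))
    return counts

corpus = ["To be, or not to be: that is the question:"]
print(word_count(corpus).most_common(2))  # [('to', 2), ('be', 2)]
```

In the actual demo this logic runs inside a Dataflow pipeline, which also shards the output across multiple result blobs.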
Thank You!
cloud.google.com/dataflow
cloud.google.com/blog/big-data/
cloud.google.com/solutions/articles#bigdata
cloud.google.com/newsletter
research.google.com