Unbounded, unordered, global-scale datasets are increasingly common in day-to-day business,
and consumers of these datasets have detailed requirements for latency, cost, and
completeness.
Apache Beam (incubating) defines a new data processing programming model that evolved
from more than a decade of experience building Big Data infrastructure within Google,
including MapReduce, FlumeJava, MillWheel, and Cloud Dataflow. Beam handles both batch
and streaming use cases and neatly separates properties of the data from runtime
characteristics, allowing pipelines to be portable across multiple runtime environments, both
open source (e.g., Apache Flink, Apache Spark, et al.) and proprietary (e.g., Google Cloud
Dataflow).
This talk will cover the basics of Apache Beam, touch on its evolution, and describe the main
concepts in the programming model. During the talk, we'll argue why Beam is unified, efficient,
and portable.
Abstract
Davor Bonaci
Apache Beam PPMC
Software Engineer, Google Inc.
Apache Beam:
A Unified Model for Batch and
Streaming Data Processing
Hadoop Summit, June 28-30, 2016, San Jose, CA
Apache Beam is
a unified programming model
designed to provide
efficient and portable
data processing pipelines
1. The Beam Model:
What / Where / When / How
2. SDKs for writing Beam pipelines:
• Java
• Python
3. Runners for Existing Distributed
Processing Backends
• Apache Flink
• Apache Spark
• Google Cloud Dataflow
• Local runner for testing
What is Apache Beam?
[Architecture diagram: Beam Model: Pipeline Construction (Beam Java, Beam Python, Other Languages) feeds the Beam Model: Fn Runners, which execute on Apache Flink, Apache Spark, and Google Cloud Dataflow]
The Evolution of Beam
[Timeline: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, MillWheel → Google Cloud Dataflow → Apache Beam]
Apache Beam is
a unified programming model
designed to provide
efficient and portable
data processing pipelines
Data...
...can be big...
...really, really big...
[Timeline: data accumulating across Tuesday, Wednesday, Thursday]
… maybe infinitely big...
[Timeline: 8:00 to 14:00]
… with unknown delays.
[Timeline: 8:00 to 14:00, with 8:00 events arriving at later processing times]
Reality
Formalizing Event-Time Skew
[Graph: processing time vs. event time, showing the ideal line and actual skew]
Formalizing Event-Time Skew
Watermarks describe event time progress: "No timestamp earlier than the watermark will be seen."
[Graph: processing time vs. event time, with the watermark approximating the ideal line]
Often heuristic-based.
Too Slow? Results are delayed.
Too Fast? Some data is late.
What are you computing?
Where in event time?
When in processing time?
How do refinements relate?
What are you computing?
Element-Wise Aggregating Composite
What: Computing Integer Sums
// Collection of raw log lines
PCollection<String> raw = IO.read(...);
// Element-wise transformation into team/score pairs
PCollection<KV<String, Integer>> input =
raw.apply(ParDo.of(new ParseFn()));
// Composite transformation containing an aggregation
PCollection<KV<String, Integer>> scores =
input.apply(Sum.integersPerKey());
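ParseFn is referenced above but never shown in the deck. A minimal sketch, in the deck's pseudocode style and assuming input lines like "blue,3", might be:

class ParseFn extends DoFn<String, KV<String, Integer>> {
  void processElement(...) {
    // hypothetical parsing of "team,score" lines; error handling omitted
    String[] parts = element().split(",");
    output(KV.of(parts[0], Integer.parseInt(parts[1])));
  }
}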
Windowing divides data into event-time-based finite chunks.
Often required when doing aggregations over unbounded data.
Where in event time?
[Figure: Fixed, Sliding, and Sessions windows laid out per key over time]
Where: Fixed 2-minute Windows
PCollection<KV<String, Integer>> scores = input
 .apply(Window.into(FixedWindows.of(Minutes(2))))
 .apply(Sum.integersPerKey());
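The figure above also shows sliding windows and sessions. A sliding-window variant of the same sum, sketched with hypothetical parameters (2-minute windows starting every 30 seconds; Seconds() follows the deck's pseudocode style for durations), would be:

PCollection<KV<String, Integer>> scores = input
 .apply(Window.into(SlidingWindows.of(Minutes(2)).every(Seconds(30))))
 .apply(Sum.integersPerKey());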
When in processing time?
• Triggers control
when results are
emitted.
• Triggers are often
relative to the
watermark.
[Graph: processing time vs. event time, with triggers firing relative to the watermark]
When: Triggering at the Watermark
PCollection<KV<String, Integer>> scores = input
 .apply(Window.into(FixedWindows.of(Minutes(2)))
 .triggering(AtWatermark()))
 .apply(Sum.integersPerKey());
When: Early and Late Firings
PCollection<KV<String, Integer>> scores = input
 .apply(Window.into(FixedWindows.of(Minutes(2)))
 .triggering(AtWatermark()
 .withEarlyFirings(AtPeriod(Minutes(1)))
 .withLateFirings(AtCount(1))))
 .apply(Sum.integersPerKey());
How do refinements relate?
• How should multiple outputs per window accumulate?
• Appropriate choice depends on consumer.
Firing (Elements)    Discarding    Accumulating    Acc. & Retracting
Speculative [3]           3              3               3
Watermark [5, 1]          6              9               9, -3
Late [2]                  2             11              11, -9
Last Observed             2             11              11
Total Observed           11             23              11
(Accumulating & Retracting not yet implemented.)
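In the API, the three columns correspond to pane accumulation modes set on the Window transform, sketched below (and, as noted above, the retracting mode was not yet implemented at the time of this talk):

.discardingFiredPanes()                 // each pane carries only elements since the last firing
.accumulatingFiredPanes()               // each pane carries the running total for the window
.accumulatingAndRetractingFiredPanes()  // running total plus retractions of earlier panes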
How: Add Newest, Remove Previous
PCollection<KV<String, Integer>> scores = input
 .apply(Window.into(FixedWindows.of(Minutes(2)))
 .triggering(AtWatermark()
 .withEarlyFirings(AtPeriod(Minutes(1)))
 .withLateFirings(AtCount(1)))
 .accumulatingAndRetractingFiredPanes())
 .apply(Sum.integersPerKey());
Correctness
Power
Composability
Flexibility
Modularity
What / Where / When / How
Distributed Systems are Distributed
Event Time Results are Stable
Correctness
Power
Composability
Flexibility
Modularity
What / Where / When / How
Sessions
PCollection<KV<String, Integer>> scores = input
 .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
 .triggering(AtWatermark()
 .withEarlyFirings(AtPeriod(Minutes(1)))
 .withLateFirings(AtCount(1)))
 .accumulatingAndRetractingFiredPanes())
 .apply(Sum.integersPerKey());
Identifying Bursts of User Activity
Correctness
Power
Composability
Flexibility
Modularity
What / Where / When / How
1. Classic Batch
2. Batch with Fixed Windows
3. Streaming
4. Streaming with Speculative + Late Data
5. Streaming with Retractions
6. Sessions
Correctness
Power
Composability
Flexibility
Modularity
What / Where / When / How
// 1. Classic Batch
PCollection<KV<String, Integer>> scores = input
 .apply(Sum.integersPerKey());

// 5. Streaming with Retractions
PCollection<KV<String, Integer>> scores = input
 .apply(Window.into(FixedWindows.of(Minutes(2)))
 .triggering(AtWatermark()
 .withEarlyFirings(AtPeriod(Minutes(1)))
 .withLateFirings(AtCount(1)))
 .accumulatingAndRetractingFiredPanes())
 .apply(Sum.integersPerKey());

// 4. Streaming with Speculative + Late Data
PCollection<KV<String, Integer>> scores = input
 .apply(Window.into(FixedWindows.of(Minutes(2)))
 .triggering(AtWatermark()
 .withEarlyFirings(AtPeriod(Minutes(1)))
 .withLateFirings(AtCount(1))))
 .apply(Sum.integersPerKey());

// 3. Streaming
PCollection<KV<String, Integer>> scores = input
 .apply(Window.into(FixedWindows.of(Minutes(2)))
 .triggering(AtWatermark()))
 .apply(Sum.integersPerKey());

// 2. Batch with Fixed Windows
PCollection<KV<String, Integer>> scores = input
 .apply(Window.into(FixedWindows.of(Minutes(2))))
 .apply(Sum.integersPerKey());

// 6. Sessions
PCollection<KV<String, Integer>> scores = input
 .apply(Window.into(Sessions.withGapDuration(Minutes(2)))
 .triggering(AtWatermark()
 .withEarlyFirings(AtPeriod(Minutes(1)))
 .withLateFirings(AtCount(1)))
 .accumulatingAndRetractingFiredPanes())
 .apply(Sum.integersPerKey());
1. Classic Batch
2. Batch with Fixed Windows
3. Streaming
4. Streaming with Speculative + Late Data
5. Streaming with Retractions
6. Sessions
Correctness
Power
Composability
Flexibility
Modularity
What / Where / When / How
Apache Beam is
a unified programming model
designed to provide
efficient and portable
data processing pipelines
Workloads in pipelines vary over time: a streaming pipeline's input varies, and batch pipelines go through stages.
[Charts: workload vs. time]
Perils of fixed decisions: over-provisioned for the worst case, or under-provisioned for the average case.
[Charts: workload vs. time under fixed provisioning]
Ideal case
[Chart: provisioning tracking workload over time]
Solution: bundles
class MyDoFn extends DoFn<String, String> {
  void startBundle(...) { }     // once per bundle, before any elements
  void processElement(...) { }  // once per element
  void finishBundle(...) { }    // once per bundle, after all elements
}
• User code operates on
bundles of elements.
• Easy parallelization.
• Dynamic sizing.
• Parallelism decisions in the
runner’s hands.
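For example, the bundle hooks let user code amortize expensive work across elements. A sketch in the deck's pseudocode style, with a hypothetical BatchingClient, issues one write per bundle instead of one per element:

class BatchingDoFn extends DoFn<String, String> {
  List<String> buffer;
  void startBundle(...) { buffer = new ArrayList<>(); }
  void processElement(...) { buffer.add(element()); }
  void finishBundle(...) { BatchingClient.writeAll(buffer); } // one RPC per bundle
}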
The Straggler Problem
• Work is unevenly
distributed across
tasks.
• Reasons:
• Underlying data.
• Processing.
• Effects multiplied per
stage.
[Chart: per-worker completion times]
“Standard” workarounds for stragglers
• Split files into equal sizes?
• Pre-emptively over-split?
• Detect slow workers and re-execute?
• Sample extensively and then split?
[Chart: per-worker completion times]
No amount of upfront heuristic tuning (be it manual or
automatic) is enough to guarantee good performance:
the system will always hit unpredictable situations at run-time.
A system that's able to dynamically adapt and
get out of a bad situation is much more powerful
than one that heuristically hopes to avoid getting into it.
Solution: Dynamic Work Rebalancing
[Diagram: done work, active work, and predicted completion per worker; without splitting, stragglers finish well after the average completion time; with splitting, their remaining work is handed off]
Solution: Dynamic work rebalancing
class MyReader extends BoundedReader<T> {
  [...]
  getFractionConsumed() { }  // how far along this bundle is
  splitAtFraction(...) { }   // hand the tail of the bundle back to the runner
}

class MySource extends BoundedSource<T> {
  [...]
  splitIntoBundles() { }       // initial split into bundles
  getEstimatedSizeBytes() { }  // size hint for splitting decisions
}
Real-world example
Dynamic bundles + work re-balancing + autoscaling
Apache Beam is
a unified programming model
designed to provide
efficient and portable
data processing pipelines
1. Write: Choose an SDK to write your
pipeline in.
2. Execute: Choose any runner at
execution time.
Apache Beam Architecture
[Architecture diagram: Beam Model: Pipeline Construction (Beam Java, Beam Python, Other Languages) feeds the Beam Model: Fn Runners, which execute on Apache Flink, Apache Spark, and Google Cloud Dataflow]
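Concretely, runner selection happens through pipeline options rather than pipeline code. A sketch assuming the incubating Beam Java SDK (flag names may differ by version):

PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
// e.g. launch with --runner=FlinkRunner, --runner=SparkRunner,
// or --runner=DataflowRunner; the pipeline code itself is unchanged
Pipeline p = Pipeline.create(options);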
Categorizing Runner Capabilities
http://beam.incubator.apache.org/capability-matrix/
1. End users: who want to write
pipelines or transform libraries in a
language that’s familiar.
2. SDK writers: who want to make
Beam concepts available in new
languages.
3. Runner writers: who have a
distributed processing
environment and want to support
Beam pipelines.
Multiple categories of users
[Architecture diagram: Beam Model: Pipeline Construction (Beam Java, Beam Python, Other Languages) feeds the Beam Model: Fn Runners, which execute on Apache Flink, Apache Spark, and Google Cloud Dataflow]
• If you have Big Data APIs, write a Beam
SDK or DSL or library of transformations.
• If you have a distributed processing
backend, write a Beam runner!
• If you have a data storage or messaging
system, write a Beam IO connector!
Growing the Open Source Community
Apache Beam is
a unified programming model
designed to provide
efficient and portable
data processing pipelines
Visions are a Journey
02/01/2016: Enter Apache Incubator
02/25/2016: 1st commit to ASF repository
Early 2016: Design for use cases, begin refactoring
Mid 2016: Additional refactoring, non-production uses
06/14/2016: 1st incubating release
June 2016: Python SDK moves to Beam
Late 2016: Multiple runners execute Beam pipelines
End 2016: Beam pipelines run on many runners in production uses
Learn More!
Apache Beam (incubating)
http://beam.incubator.apache.org
The World Beyond Batch 101 & 102
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
Join the Beam mailing lists!
user-subscribe@beam.incubator.apache.org
dev-subscribe@beam.incubator.apache.org
Follow @ApacheBeam on Twitter
Thank you!
Extra material
Element-wise transformations
[Figure: elements transformed one by one along processing time, 8:00 to 14:00]
Aggregating via Processing-Time Windows
[Figure: elements grouped by arrival time along processing time, 8:00 to 14:00]
Aggregating via Event-Time Windows
[Figure: input in processing time vs. output in event time, 10:00 to 15:00]
Processing Time Results Differ
Identifying Bursts of User Activity
Correctness
Power
Composability
Flexibility
Modularity
What / Where / When / How
Calculating Session Lengths
input
 .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
 .trigger(AtWatermark())
 .discardingFiredPanes())
 .apply(CalculateWindowLength());
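CalculateWindowLength is assumed but not shown. A plausible sketch in the deck's pseudocode style (hypothetical): read each session window's bounds and emit its duration.

class CalculateWindowLength extends DoFn<KV<String, Integer>, Long> {
  void processElement(...) {
    IntervalWindow w = (IntervalWindow) window();  // the session this element landed in
    output(w.end().getMillis() - w.start().getMillis());
  }
}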
Calculating the Average Session Length
input
 .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
 .trigger(AtWatermark())
 .discardingFiredPanes())
 .apply(CalculateWindowLength())
 .apply(Window.into(FixedWindows.of(Minutes(2)))
 .trigger(AtWatermark()
 .withEarlyFirings(AtPeriod(Minutes(1))))
 .accumulatingFiredPanes())
 .apply(Mean.globally());
Apache Beam: A unified model for batch and stream processing data

More Related Content

What's hot

A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internalsKostas Tzoumas
 
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity PlanningFrom Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planningconfluent
 
Bootstrapping state in Apache Flink
Bootstrapping state in Apache FlinkBootstrapping state in Apache Flink
Bootstrapping state in Apache FlinkDataWorks Summit
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming JobsDatabricks
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...HostedbyConfluent
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBill Liu
 
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...HostedbyConfluent
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022Flink Forward
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiFlink Forward
 
Real-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache FlinkReal-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache FlinkDataWorks Summit
 
Apache Kafka Introduction
Apache Kafka IntroductionApache Kafka Introduction
Apache Kafka IntroductionAmita Mirajkar
 
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache KafkaReal-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache KafkaKai Wähner
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkKazuaki Ishizaki
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsDatabricks
 
Fundamentals of Apache Kafka
Fundamentals of Apache KafkaFundamentals of Apache Kafka
Fundamentals of Apache KafkaChhavi Parasher
 
Kafka Streams State Stores Being Persistent
Kafka Streams State Stores Being PersistentKafka Streams State Stores Being Persistent
Kafka Streams State Stores Being Persistentconfluent
 

What's hot (20)

A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
 
kafka
kafkakafka
kafka
 
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity PlanningFrom Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
 
Bootstrapping state in Apache Flink
Bootstrapping state in Apache FlinkBootstrapping state in Apache Flink
Bootstrapping state in Apache Flink
 
Flink vs. Spark
Flink vs. SparkFlink vs. Spark
Flink vs. Spark
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
 
Kafka 101
Kafka 101Kafka 101
Kafka 101
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudi
 
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
Real-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache FlinkReal-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache Flink
 
Apache Kafka Introduction
Apache Kafka IntroductionApache Kafka Introduction
Apache Kafka Introduction
 
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache KafkaReal-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache Spark
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
 
Fundamentals of Apache Kafka
Fundamentals of Apache KafkaFundamentals of Apache Kafka
Fundamentals of Apache Kafka
 
Kafka Streams State Stores Being Persistent
Kafka Streams State Stores Being PersistentKafka Streams State Stores Being Persistent
Kafka Streams State Stores Being Persistent
 

Viewers also liked

Building large scale applications in yarn with apache twill
Building large scale applications in yarn with apache twillBuilding large scale applications in yarn with apache twill
Building large scale applications in yarn with apache twillHenry Saputra
 
Harnessing the power of YARN with Apache Twill
Harnessing the power of YARN with Apache TwillHarnessing the power of YARN with Apache Twill
Harnessing the power of YARN with Apache TwillTerence Yim
 
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry confluent
 
Aljoscha Krettek - Portable stateful big data processing in Apache Beam
Aljoscha Krettek - Portable stateful big data processing in Apache BeamAljoscha Krettek - Portable stateful big data processing in Apache Beam
Aljoscha Krettek - Portable stateful big data processing in Apache BeamVerverica
 
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim BaltagiHadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim BaltagiSlim Baltagi
 
Building a Modern Data Architecture with Enterprise Hadoop
Building a Modern Data Architecture with Enterprise HadoopBuilding a Modern Data Architecture with Enterprise Hadoop
Building a Modern Data Architecture with Enterprise HadoopSlim Baltagi
 
Analysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summit
Analysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summitAnalysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summit
Analysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summitSlim Baltagi
 
Kafka Streams for Java enthusiasts
Kafka Streams for Java enthusiastsKafka Streams for Java enthusiasts
Kafka Streams for Java enthusiastsSlim Baltagi
 
Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
Apache-Flink-What-How-Why-Who-Where-by-Slim-BaltagiApache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
Apache-Flink-What-How-Why-Who-Where-by-Slim-BaltagiSlim Baltagi
 
Apache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsApache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsSlim Baltagi
 
Building Streaming Data Applications Using Apache Kafka
Building Streaming Data Applications Using Apache KafkaBuilding Streaming Data Applications Using Apache Kafka
Building Streaming Data Applications Using Apache KafkaSlim Baltagi
 
Flink Case Study: Amadeus
Flink Case Study: AmadeusFlink Case Study: Amadeus
Flink Case Study: AmadeusFlink Forward
 
Flink Case Study: OKKAM
Flink Case Study: OKKAMFlink Case Study: OKKAM
Flink Case Study: OKKAMFlink Forward
 
Flink Case Study: Capital One
Flink Case Study: Capital OneFlink Case Study: Capital One
Flink Case Study: Capital OneFlink Forward
 
Making Great User Experiences, Pittsburgh Scrum MeetUp, Oct 17, 2017
Making Great User Experiences, Pittsburgh Scrum MeetUp, Oct 17, 2017Making Great User Experiences, Pittsburgh Scrum MeetUp, Oct 17, 2017
Making Great User Experiences, Pittsburgh Scrum MeetUp, Oct 17, 2017Carol Smith
 

Viewers also liked (15)

Building large scale applications in yarn with apache twill
Building large scale applications in yarn with apache twillBuilding large scale applications in yarn with apache twill
Building large scale applications in yarn with apache twill
 
Harnessing the power of YARN with Apache Twill
Harnessing the power of YARN with Apache TwillHarnessing the power of YARN with Apache Twill
Harnessing the power of YARN with Apache Twill
 
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
 
Aljoscha Krettek - Portable stateful big data processing in Apache Beam
Aljoscha Krettek - Portable stateful big data processing in Apache BeamAljoscha Krettek - Portable stateful big data processing in Apache Beam
Aljoscha Krettek - Portable stateful big data processing in Apache Beam
 
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim BaltagiHadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
 
Building a Modern Data Architecture with Enterprise Hadoop
Building a Modern Data Architecture with Enterprise HadoopBuilding a Modern Data Architecture with Enterprise Hadoop
Building a Modern Data Architecture with Enterprise Hadoop
 
Analysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summit
Analysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summitAnalysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summit
Analysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summit
 
Kafka Streams for Java enthusiasts
Kafka Streams for Java enthusiastsKafka Streams for Java enthusiasts
Kafka Streams for Java enthusiasts
 
Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
Apache-Flink-What-How-Why-Who-Where-by-Slim-BaltagiApache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
 
Apache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsApache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming Analytics
 
Building Streaming Data Applications Using Apache Kafka
Building Streaming Data Applications Using Apache KafkaBuilding Streaming Data Applications Using Apache Kafka
Building Streaming Data Applications Using Apache Kafka
 
Flink Case Study: Amadeus
Flink Case Study: AmadeusFlink Case Study: Amadeus
Flink Case Study: Amadeus
 
Flink Case Study: OKKAM
Flink Case Study: OKKAMFlink Case Study: OKKAM
Flink Case Study: OKKAM
 
Flink Case Study: Capital One
Flink Case Study: Capital OneFlink Case Study: Capital One
Flink Case Study: Capital One
 
Making Great User Experiences, Pittsburgh Scrum MeetUp, Oct 17, 2017
Making Great User Experiences, Pittsburgh Scrum MeetUp, Oct 17, 2017Making Great User Experiences, Pittsburgh Scrum MeetUp, Oct 17, 2017
Making Great User Experiences, Pittsburgh Scrum MeetUp, Oct 17, 2017
 

Similar to Apache Beam: A unified model for batch and stream processing data

Google Cloud Dataflow Two Worlds Become a Much Better One
Google Cloud Dataflow Two Worlds Become a Much Better OneGoogle Cloud Dataflow Two Worlds Become a Much Better One
Google Cloud Dataflow Two Worlds Become a Much Better OneDataWorks Summit
 
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache BeamMalo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache BeamFlink Forward
 
The Next Generation of Data Processing and Open Source
The Next Generation of Data Processing and Open SourceThe Next Generation of Data Processing and Open Source
The Next Generation of Data Processing and Open SourceDataWorks Summit/Hadoop Summit
 
Moving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureMoving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureGabriele Modena
 
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...Cisco DevNet
 
Cortana Analytics Workshop: Real-Time Data Processing -- How Do I Choose the ...
Cortana Analytics Workshop: Real-Time Data Processing -- How Do I Choose the ...Cortana Analytics Workshop: Real-Time Data Processing -- How Do I Choose the ...
Cortana Analytics Workshop: Real-Time Data Processing -- How Do I Choose the ...MSAdvAnalytics
 
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...GeeksLab Odessa
 
Onyx data processing the clojure way
Onyx   data processing  the clojure wayOnyx   data processing  the clojure way
Onyx data processing the clojure wayBahadir Cambel
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Data Con LA
 
Unified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamUnified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamDataWorks Summit/Hadoop Summit
 
Everything comes in 3's
Everything comes in 3'sEverything comes in 3's
Everything comes in 3'sdelagoya
 
ML on Big Data: Real-Time Analysis on Time Series
ML on Big Data: Real-Time Analysis on Time SeriesML on Big Data: Real-Time Analysis on Time Series
ML on Big Data: Real-Time Analysis on Time SeriesSigmoid
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Anton Nazaruk
 
Splunk Conf 2014 - Getting the message
Splunk Conf 2014 - Getting the messageSplunk Conf 2014 - Getting the message
Splunk Conf 2014 - Getting the messageDamien Dallimore
 
Realizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache BeamRealizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache BeamDataWorks Summit
 
Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21JDA Labs MTL
 
DSDT Meetup Nov 2017
DSDT Meetup Nov 2017DSDT Meetup Nov 2017
DSDT Meetup Nov 2017DSDT_MTL
 
Windows Azure - Uma Plataforma para o Desenvolvimento de Aplicações
Windows Azure - Uma Plataforma para o Desenvolvimento de AplicaçõesWindows Azure - Uma Plataforma para o Desenvolvimento de Aplicações
Windows Azure - Uma Plataforma para o Desenvolvimento de AplicaçõesComunidade NetPonto
 
Data Grids with Oracle Coherence
Data Grids with Oracle CoherenceData Grids with Oracle Coherence
Data Grids with Oracle CoherenceBen Stopford
 
Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...Databricks
 

Similar to Apache Beam: A unified model for batch and stream processing data (20)

Google Cloud Dataflow Two Worlds Become a Much Better One
Google Cloud Dataflow Two Worlds Become a Much Better OneGoogle Cloud Dataflow Two Worlds Become a Much Better One
Google Cloud Dataflow Two Worlds Become a Much Better One
 
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache BeamMalo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
 
The Next Generation of Data Processing and Open Source
The Next Generation of Data Processing and Open SourceThe Next Generation of Data Processing and Open Source
The Next Generation of Data Processing and Open Source
 
Moving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureMoving Towards a Streaming Architecture
Moving Towards a Streaming Architecture
 
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
 
Cortana Analytics Workshop: Real-Time Data Processing -- How Do I Choose the ...
Cortana Analytics Workshop: Real-Time Data Processing -- How Do I Choose the ...Cortana Analytics Workshop: Real-Time Data Processing -- How Do I Choose the ...
Cortana Analytics Workshop: Real-Time Data Processing -- How Do I Choose the ...
 
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
 
Onyx data processing the clojure way
Onyx   data processing  the clojure wayOnyx   data processing  the clojure way
Onyx data processing the clojure way
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
 
Unified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamUnified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache Beam
 
Everything comes in 3's
Everything comes in 3'sEverything comes in 3's
Everything comes in 3's
 
ML on Big Data: Real-Time Analysis on Time Series
ML on Big Data: Real-Time Analysis on Time SeriesML on Big Data: Real-Time Analysis on Time Series
ML on Big Data: Real-Time Analysis on Time Series
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?
 
Splunk Conf 2014 - Getting the message
Splunk Conf 2014 - Getting the messageSplunk Conf 2014 - Getting the message
Splunk Conf 2014 - Getting the message
 
Realizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache BeamRealizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache Beam
 
Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21
 
DSDT Meetup Nov 2017
DSDT Meetup Nov 2017DSDT Meetup Nov 2017
DSDT Meetup Nov 2017
 
Windows Azure - Uma Plataforma para o Desenvolvimento de Aplicações
Windows Azure - Uma Plataforma para o Desenvolvimento de AplicaçõesWindows Azure - Uma Plataforma para o Desenvolvimento de Aplicações
Windows Azure - Uma Plataforma para o Desenvolvimento de Aplicações
 
Data Grids with Oracle Coherence
Data Grids with Oracle CoherenceData Grids with Oracle Coherence
Data Grids with Oracle Coherence
 
Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...
 

More from DataWorks Summit/Hadoop Summit

Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerDataWorks Summit/Hadoop Summit
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformDataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDataWorks Summit/Hadoop Summit
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...DataWorks Summit/Hadoop Summit
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...DataWorks Summit/Hadoop Summit
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLDataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...DataWorks Summit/Hadoop Summit
 

More from DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 

Recently uploaded

CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 

Recently uploaded (20)

CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 

Apache Beam: A unified model for batch and stream processing data

  • 1. Unbounded, unordered, global scale datasets are increasingly common in day-today business, and consumers of these datasets have detailed requirements for latency, cost, and completeness. Apache Beam (incubating) defines a new data processing programming model that evolved from more than a decade of experience building Big Data infrastructure within Google, including MapReduce, FlumeJava, Millwheel, and Cloud Dataflow. Beam handles both batch and streaming use cases and neatly separates properties of the data from runtime characteristics, allowing pipelines to be portable across multiple runtime environments, both open source (e.g., Apache Flink, Apache Spark, et al.), and proprietary (e.g., Google Cloud Dataflow). This talk will cover the basics of Apache Beam, touch on its evolution, and describe main concepts in the programming model. During the talk, we’ll argue why Beam is unified, efficient and portable. Abstract
  • 2. Davor Bonaci Apache Beam PPMC Software Engineer, Google Inc. Apache Beam: A Unified Model for Batch and Streaming Data Processing Hadoop Summit, June 28-30, 2016, San Jose, CA
  • 3. Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines
  • 4. 1. The Beam Model: What / Where / When / How 2. SDKs for writing Beam pipelines: • Java • Python 3. Runners for Existing Distributed Processing Backends • Apache Flink • Apache Spark • Google Cloud Dataflow • Local runner for testing What is Apache Beam? Beam Model: Fn Runners Apache Flink Apache Spark Beam Model: Pipeline Construction Other LanguagesBeam Java Beam Python Execution Execution Google Cloud Dataflow Execution
  • 5. The Evolution of Beam MapReduce Google Cloud Dataflow Apache Beam BigTable DremelColossus FlumeMegastoreSpanner PubSub Millwheel
  • 6. Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines
  • 10. … maybe infinitely big... 9:008:00 14:0013:0012:0011:0010:00
  • 11. … with unknown delays. 9:008:00 14:0013:0012:0011:0010:00 8:00 8:008:00
  • 13. Formalizing Event-Time Skew Watermarks describe event time progress. "No timestamp earlier than the watermark will be seen" ProcessingTime Event Time ~Watermark Ideal Skew Often heuristic-based. Too Slow? Results are delayed. Too Fast? Some data is late.
  • 14. What are you computing? Where in event time? When in processing time? How do refinements relate?
  • 15. What are you computing? Element-Wise Aggregating Composite
  • 16. What: Computing Integer Sums // Collection of raw log lines PCollection<String> raw = IO.read(...); // Element-wise transformation into team/score pairs PCollection<KV<String, Integer>> input = raw.apply(ParDo.of(new ParseFn()); // Composite transformation containing an aggregation PCollection<KV<String, Integer>> scores = input.apply(Sum.integersPerKey());
  • 19. Windowing divides data into event-time-based finite chunks. Often required when doing aggregations over unbounded data. Where in event time? Fixed Sliding 1 2 3 54 Sessions 2 431 Key 2 Key 1 Key 3 Time 2 3 4
  • 20. Where: Fixed 2-minute Windows PCollection<KV<String, Integer>> scores = input .apply(Window.into(FixedWindows.of(Minutes(2))) .apply(Sum.integersPerKey());
  • 22. When in processing time? • Triggers control when results are emitted. • Triggers are often relative to the watermark. ProcessingTime Event Time ~Watermark Ideal Skew
  • 23. When: Triggering at the Watermark PCollection<KV<String, Integer>> scores = input .apply(Window.into(FixedWindows.of(Minutes(2)) .triggering(AtWatermark())) .apply(Sum.integersPerKey());
  • 24. When: Triggering at the Watermark
  • 25. When: Early and Late Firings PCollection<KV<String, Integer>> scores = input .apply(Window.into(FixedWindows.of(Minutes(2)) .triggering(AtWatermark() .withEarlyFirings(AtPeriod(Minutes(1))) .withLateFirings(AtCount(1)))) .apply(Sum.integersPerKey());
  • 26. When: Early and Late Firings
  • 27. How do refinements relate? • How should multiple outputs per window accumulate? • Appropriate choice depends on consumer. Firing Elements Speculative [3] Watermark [5, 1] Late [2] Last Observed Total Observed Discarding 3 6 2 2 11 Accumulating 3 9 11 11 23 Acc. & Retracting 3 9, -3 11, -9 11 11 (Accumulating & Retracting not yet implemented.)
  • 28. How: Add Newest, Remove Previous PCollection<KV<String, Integer>> scores = input .apply(Window.into(FixedWindows.of(Minutes(2)) .triggering(AtWatermark() .withEarlyFirings(AtPeriod(Minutes(1))) .withLateFirings(AtCount(1))) .accumulatingAndRetractingFiredPanes()) .apply(Sum.integersPerKey());
  • 29. How: Add Newest, Remove Previous
  • 32. Distributed Systems are Distributed
  • 33. Event Time Results are Stable
  • 35. Sessions PCollection<KV<String, Integer>> scores = input .apply(Window.into(Sessions.withGapDuration(Minutes(1)) .triggering(AtWatermark() .withEarlyFirings(AtPeriod(Minutes(1))) .withLateFirings(AtCount(1))) .accumulatingAndRetractingFiredPanes()) .apply(Sum.integersPerKey());
  • 36. Identifying Bursts of User Activity
  • 38. 1.Classic Batch 2. Batch with Fixed Windows 3. Streaming 5. Streaming With Retractions 4. Streaming with Speculative + Late Data 6. Sessions
  • 40. PCollection<KV<String, Integer>> scores = input .apply(Sum.integersPerKey()); PCollection<KV<String, Integer>> scores = input .apply(Window.into(FixedWindows.of(Minutes(2)) .triggering(AtWatermark() .withEarlyFirings(AtPeriod(Minutes(1))) .withLateFirings(AtCount(1))) .accumulatingAndRetractingFiredPanes()) .apply(Sum.integersPerKey()); PCollection<KV<String, Integer>> scores = input .apply(Window.into(FixedWindows.of(Minutes(2)) .triggering(AtWatermark() .withEarlyFirings(AtPeriod(Minutes(1))) .withLateFirings(AtCount(1))) .apply(Sum.integersPerKey()); PCollection<KV<String, Integer>> scores = input .apply(Window.into(FixedWindows.of(Minutes(2)) .triggering(AtWatermark())) .apply(Sum.integersPerKey()); PCollection<KV<String, Integer>> scores = input .apply(Window.into(FixedWindows.of(Minutes(2))) .apply(Sum.integersPerKey()); PCollection<KV<String, Integer>> scores = input .apply(Window.into(Sessions.withGapDuration(Minutes(2)) .triggering(AtWatermark() .withEarlyFirings(AtPeriod(Minutes(1))) .withLateFirings(AtCount(1))) .accumulatingAndRetractingFiredPanes()) .apply(Sum.integersPerKey()); 1.Classic Batch 2. Batch with Fixed Windows 3. Streaming 5. Streaming With Retractions 4. Streaming with Speculative + Late Data 6. Sessions
  • 42. Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines
  • 43. Workloads vary in pipelines over time: batch pipelines go through stages, and a streaming pipeline’s input varies. [Graphs: workload over time]
  • 44. Perils of fixed decisions: over-provisioned for the worst case, or under-provisioned for the average case. [Graphs: workload over time]
  • 46. Solution: bundles
    class MyDoFn extends DoFn<String, String> {
      void startBundle(...) { }      // once per bundle, before any elements
      void processElement(...) { }   // once per element
      void finishBundle(...) { }     // once per bundle, after all elements
    }
  • User code operates on bundles of elements. • Easy parallelization. • Dynamic sizing. • Parallelism decisions in the runner’s hands.
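  As a hedged sketch of why bundles matter in user code, here is a DoFn that batches an external write once per bundle. This uses the annotation-based DoFn API from later Beam releases (the slide shows the incubating-era method-override style), and flushToExternalService is a hypothetical helper:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.beam.sdk.transforms.DoFn;

    class BatchingFn extends DoFn<String, String> {
      private transient List<String> buffer;

      @StartBundle
      public void startBundle() {
        buffer = new ArrayList<>();          // fresh state for every bundle
      }

      @ProcessElement
      public void processElement(ProcessContext c) {
        buffer.add(c.element());             // accumulate within the bundle
        c.output(c.element());
      }

      @FinishBundle
      public void finishBundle() {
        flushToExternalService(buffer);      // hypothetical: one batched write per bundle
      }

      private void flushToExternalService(List<String> batch) {
        // e.g. a single RPC for the whole batch instead of one per element
      }
    }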
  • 47. The Straggler Problem • Work is unevenly distributed across tasks. • Reasons: underlying data; processing. • Effects are multiplied per stage. [Graph: per-worker completion time]
  • 48. “Standard” workarounds for stragglers • Split files into equal sizes? • Pre-emptively over-split? • Detect slow workers and re-execute? • Sample extensively and then split? [Graph: per-worker completion time]
  • 49. No amount of upfront heuristic tuning (be it manual or automatic) is enough to guarantee good performance: the system will always hit unpredictable situations at run-time. A system that's able to dynamically adapt and get out of a bad situation is much more powerful than one that heuristically hopes to avoid getting into it.
  • 50. Solution: Dynamic Work Rebalancing [Diagrams: done work vs. active work vs. predicted completion; splitting a straggler’s active work at “now” pulls in the average completion time]
  • 51. Solution: Dynamic work rebalancing
    class MyReader extends BoundedReader<T> {
      [...]
      getFractionConsumed() { }     // progress through this bundle
      splitAtFraction(...) { }      // give back the unread remainder
    }

    class MySource extends BoundedSource<T> {
      [...]
      splitIntoBundles() { }        // initial splitting
      getEstimatedSizeBytes() { }   // sizing hint for the runner
    }
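  A sketch of the runner-side contract: getFractionConsumed() and splitAtFraction() are the real BoundedReader hooks, while the surrounding driver logic (workerIsBehind, scheduleOnIdleWorker) is hypothetical:

    // If a worker falls behind, peel off the unread remainder of its bundle
    // and hand it to an idle worker.
    Double consumed = reader.getFractionConsumed();   // e.g. 0.3 after reading 30%
    if (consumed != null && workerIsBehind) {         // workerIsBehind: hypothetical signal
      BoundedSource<T> remainder = reader.splitAtFraction(0.6);  // keep [0, 0.6), give away the rest
      if (remainder != null) {                        // null means the split was refused
        scheduleOnIdleWorker(remainder);              // hypothetical scheduler call
      }
    }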
  • 53. Dynamic bundles + work re-balancing + autoscaling
  • 54. Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines
  • 55. Apache Beam Architecture 1. Write: Choose an SDK to write your pipeline in. 2. Execute: Choose any runner at execution time. [Diagram: Beam Java, Beam Python, and other-language SDKs construct pipelines against the Beam model; runners for Apache Flink, Apache Spark, and Google Cloud Dataflow execute them]
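  A sketch of what “choose any runner at execution time” looks like in the Java SDK (runner names as they appear in later Beam releases):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class RunAnywhere {
      public static void main(String[] args) {
        // The runner is picked from the command line at execution time, e.g.
        //   --runner=FlinkRunner | SparkRunner | DataflowRunner | DirectRunner
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);
        // ... apply the same transforms regardless of the chosen runner ...
        p.run();
      }
    }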
  • 57. Multiple categories of users 1. End users: who want to write pipelines or transform libraries in a language that’s familiar. 2. SDK writers: who want to make Beam concepts available in new languages. 3. Runner writers: who have a distributed processing environment and want to support Beam pipelines. [Diagram: same SDK/model/runner architecture as above]
  • 58. • If you have Big Data APIs, write a Beam SDK or DSL or library of transformations. • If you have a distributed processing backend, write a Beam runner! • If you have a data storage or messaging system, write a Beam IO connector! Growing the Open Source Community
  • 59. Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines
  • 60. Visions are a Journey
    02/01/2016  Enter Apache Incubator
    02/25/2016  1st commit to ASF repository
    Early 2016  Design for use cases, begin refactoring
    Mid 2016    Additional refactoring, non-production uses
    06/14/2016  1st incubating release
    June 2016   Python SDK moves to Beam
    Late 2016   Multiple runners execute Beam pipelines
    End 2016    Beam pipelines run on many runners in production uses
  • 61. Learn More! Apache Beam (incubating) http://beam.incubator.apache.org The World Beyond Batch 101 & 102 https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101 https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102 Join the Beam mailing lists! user-subscribe@beam.incubator.apache.org dev-subscribe@beam.incubator.apache.org Follow @ApacheBeam on Twitter
  • 64. Element-wise transformations [Diagram: elements flowing past along a processing-time axis, 8:00–14:00]
  • 65. Aggregating via Processing-Time Windows [Diagram: elements grouped by arrival time along a processing-time axis, 8:00–14:00]
  • 66. Aggregating via Event-Time Windows [Diagram: input in processing time mapped to output in event time, 10:00–15:00]
  • 68. Identifying Bursts of User Activity
  • 72. Calculating the Average Session Length
    input
      .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
                   .triggering(AtWatermark())
                   .discardingFiredPanes())
      .apply(CalculateWindowLength())
      .apply(Window.into(FixedWindows.of(Minutes(2)))
                   .triggering(AtWatermark()
                       .withEarlyFirings(AtPeriod(Minutes(1))))
                   .accumulatingFiredPanes())
      .apply(Mean.globally());
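  CalculateWindowLength() above is pseudo-code; one hedged way to realize it is a DoFn that reads the window it is invoked in (the element type is illustrative, and a real pipeline would also need a coder for the Duration output):

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
    import org.apache.beam.sdk.transforms.windowing.IntervalWindow;
    import org.joda.time.Duration;

    // Emits the length of the session window an element landed in.
    class WindowLengthFn extends DoFn<Integer, Duration> {
      @ProcessElement
      public void processElement(ProcessContext c, BoundedWindow window) {
        IntervalWindow session = (IntervalWindow) window;  // session windows are interval windows
        c.output(new Duration(session.start(), session.end()));
      }
    }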

Editor's Notes

  1. Google published the original paper on MapReduce in 2004 -- it fundamentally changed the way we do distributed processing. <animate> Inside Google, we kept innovating, but just published papers. <animate> Externally, the open source community created Hadoop. An entire ecosystem flourished, partially influenced by those Google papers. <animate> In 2014, Google launched Cloud Dataflow -- both a new programming model and a fully managed service. We wanted to share this model more broadly -- both because it is awesome and because users benefit from a larger ecosystem and portability across multiple runtimes. So Google, along with a handful of partners, donated this programming model to the Apache Software Foundation, as the incubating project Apache Beam...
  2. here are gaming logs; each square represents an event where a user scored some points for their team
  3. game gets popular
  4. start organizing it into a repeated structure
  5. the repetitive structure is just a cheap way of representing an infinite data source. game logs are continuous; distributed systems can cause ambiguity...
  6. Lets look at some points that were scored at 8am <animate> red score 8am, received quickly <animate> yellow score also happened at 8am, received at 8:30 due to network congestion <animate> green element was hours late. this was someone playing in airplane mode on the plane. had to wait for it to land. so now we’ve got an unordered, infinite data set, how do we process it...
  7. Blue axis is event, Green is processing. Ideally no delay -- elements processed when they occurred <animate> Reality looks more like that red squiggly line, where processing time is slightly delayed off event time. <animate> The variable distance between reality and ideal is called skew. need to track in order to reason about correctness.
  8. red line the watermark -- no event times earlier than this point are expected to appear in the future. often heuristic based too slow → unnecessary latency. too fast → some data comes in late, after we thought we were done for a given time period. how do we reason about these types of infinite, out-of-order datasets...
  9. not too hard if you know what kinds of questions to ask! What results are calculated? sums, joins, histograms, machine learning models? Where in event time are results calculated? Does the time each event originally occurred affect results? Are results aggregated for all time, in fixed windows, or as user activity sessions? When in processing time are results materialized? Does the time each element arrives in the system affect results? How do we know when to emit a result? What do we do about data that comes in late from those pesky users playing on transatlantic flights? And finally, how do refinements relate? If we choose to emit results multiple times, is each result independent and distinct, do they build upon one another? Let’s dive into how each question contributes when we build a pipeline...
  10. The first thing to figure out is what you are actually computing. transform each element independently, similar to the Map function in MapReduce, easy to parallelize Other transformations, like Grouping and Combining, require inspecting multiple elements at a time Some operations are really just subgraphs of other more primitive operations. Now let’s see a code snippet for our gaming example...
  11. Pseudo Java for compactness/clarity! start by reading a collection of raw events transform it into a more structured collection containing key value pairs with a team name and the number of points scored during the event. use a composite operation to sum up all the points per team. Let’s see how this code executes...
  12. Looking at points scored for a given team blue axis, green axis <animate> ideal <animate> This score of 3 from just before 12:07 arrives almost immediately. <animate> 7 minutes delayed. elevator or subway. graph not big enough to show offline mode from transatlantic flight
  13. time is thick white line. accumulate sum into the intermediate state produces output represented by the blue rectangle. all the data available, rectangle covers all events, no matter when in time they occurred. single final result emitted when it’s all complete pretty standard batch processing -> let’s see what happens if we tweak the other questions
  14. windowing lets us create individual results for different slices of event time. divides data into finite chunks based on the event time of each element. common patterns include fixed time (like hourly, daily, monthly), sliding windows (like the last 24 hours worth of data, every hour) -- a single element may be in multiple overlapping windows session-based windows that capture bursts of user activity -- unaligned per key very common when trying to do aggregations on infinite data also actually common pattern in batch, though historically done using composite keys.
  15. fixed windows that are 2 minutes long
  16. independent answer for every two minute period of event time. still waiting until the entire computation completes to emit any results. won’t work for infinite data! want to reduce latency...
  17. trigger define when in processing time to emit results often relative to the watermark, which is that heuristic about event time progress.
  18. request that results are emitted when we think we’ve roughly seen all the elements for a given window. actually default -- just written it for clarity.
  19. left graph shows a perfect watermark -- it tracks when all the data for a given event time has arrived, so we emit the result from each window as soon as the watermark passes. <animate> the watermark is usually just a heuristic, so it looks more like the graph on the right: now the 9 is missed, and if the watermark is delayed, like in the first graph, we need to wait a long time for anything. we’d like speculative results. let’s use a more advanced trigger...
  20. ask for early, speculative firings every minute get updates every time a late element comes in.
  21. in all cases, able to get speculative results before the watermark. now get results when watermark passes, but still handle late value 9 even with heuristic watermark in this case, we accumulate across the multiple results per window In the final window, we see and emit 3 but then still include that 3 in the next update of 12. but this behavior around multiple firings is configurable...
  22. fire three times for a window -- a speculative firing with 3, watermark with two more values 5 and 1, and finally a late value 2. one option is emit new elements that have come in since the last result. requires consumer to be able to do final sum could produce the running sum every time. consumer may overcount produce both the new running sum and retract the old one.
  23. use accumulating and retracting.
  24. speculative results, on time results, and retractions. now the final window emits 3, then retracts the 3 when emitting 12. So those are the four questions...
  25. those are the four key questions. are they the right questions? here are 5 reasons...
  26. the results we get are correct this is not something we’ve historically gotten with streaming systems.
  27. distributed systems are … distributed. if the winds had been blowing from the east instead of the west, elements might have arrived in a slightly different order.
  28. aggregating based on event time may have different intermediate results but the final results are identical across the two arrival scenarios.
  29. next, the abstractions can represent powerful and complex algorithms.
  30. earlier mentioned session windows -- burst of user activity simple code change...
  31. want to identify two groupings of points in other words, Tyler was playing the game, got distracted by a squirrel, and then resumed his play.
  32. ok… flexibility for covering all sorts of uses cases
  33. By tuning our what/where/when/how knobs, we’ve covered everything from classic batch… to sessions
  34. And not only that, we do so with lovely modular code
  35. all these uses cases -- and we never changed our core algorithm just integer summing here, but the same would apply with much more complex algorithms too
  36. so there you go -- 5 reasons that these 4 questions are awesome
  37. Data: 1 file per task & files of different sizes; Bigtable key ranges partitioned lexicographically, assuming uniform distribution. Processing: hot shuffle key ranges; data-dependent computation.
  38. Pre-job stage: chunk files into equal sizes -- choice of constant? does not handle runtime asymmetry. Pre-emptively over-split -- how much is enough? how much is too much? per-task overheads can dominate. Detect slow workers and re-execute -- does not handle processing asymmetry. Sample (maybe extensively) and then split -- overhead, and still does not handle runtime asymmetry.
  39. 400 workers. Pipeline: Read GCS → Parse → GroupByKey → Write
  40. The Beam model is attempting to generalize semantics -- will not align perfectly with all possible runtimes. Started categorizing the features in the model and the various levels of runner support. This will help users understand mismatches like using event time processing in Spark or exactly once processing with Samza.
  41. fully support three different categories of users: end users who want to write data processing pipelines. Includes adding value like additional connectors -- we’ve got Kafka! Additionally, support community-sourced SDKs and runners. Each community has very different sets of goals and needs. having a vision and reaching it are two different things...
  42. And one of the things we’re most excited about is the collaboration opportunities that Beam enables. Been doing this stuff for a while at Google -- very hermetic environment. Looking forward to incorporating new perspectives -- to build a truly generalizable solution. Growing the Beam development community over the next few months, whether they are looking to write transform libraries for end users, new SDKs, or provide new runners.
  43. Beam entered incubation in early February. Quickly did the code donations and began bootstrapping the infrastructure. initial focus is on stabilizing internal APIs and integrating the additional runners. Part of that is understanding what different runners can do...
  44. Credits?
  45. Credits?
  46. Element-wise transformations work on individual elements: parsing, translating, or filtering, applied as elements flow past. but other transforms like counting or joining require combining multiple elements together ...
  47. when doing aggregations, need to divide the infinite stream of elements into finite sized chunks that can be processed independently. simplest way using arrival time in fixed time periods can mean elements are being processed out of order, late elements may be aggregated with unrelated elements that arrived about the same time...
  48. reorganize data based on when it occurred, not when it arrived. the red element arrived relatively on time and stays in the noon window. the green one that arrived at 12:30 was actually created about 11:30, so it moves up to the 11am window. requires formalizing the difference between processing time and event time
  49. if we were aggregating based on processing time, this would result in different results for the two orderings.
  50. now you can see the sessions being built over time at first we see multiple components in the first session not until late element 9 comes in that we realize it’s one big session
  51. next -- we’ve seen what the four questions can do. what if we ask the questions twice?
  52. code to calculate the length of a user session
  53. Remember that these graphs are always shown per key. here are the graphs calculating session lengths for Frances and the ones for Tyler
  54. now lets take those session lengths per user ask the questions again this time using fixed windows to take the mean across the entire collection...
  55. Now calculating the average length of all sessions that ended in a given time period if we rolled out an update to our game, this would let us quickly understand if that resulted a change in user behavior if the change made the game less fun, we could see a sudden drop in how long users play