Apache Beam (formerly the Google Cloud Dataflow SDK) is a unified model and a set of language-specific SDKs for defining and executing data processing workflows. You design pipelines that simplify the mechanics of large-scale batch and streaming data processing, and the same pipelines can run on a number of runtimes such as Apache Flink, Apache Spark, and Google Cloud Dataflow (a cloud service).
This presentation introduces the Beam programming model and how you can use it to design your pipelines, transporting PCollections and applying PTransforms. You will see how the same code is "translated" to a target runtime thanks to a specific runner. You will also get an overview of the current roadmap, with the interesting new features.
2. Who am I?
● Talend
○ Software Architect
○ Apache team
● Apache
○ Member of the Apache Software Foundation
○ Champion/Mentor/PPMC/PMC/Committer for ~ 20 projects (Beam, Falcon, Lens, Brooklyn,
Slider, Karaf, Camel, ActiveMQ, ACE, Archiva, Aries, ServiceMix, Syncope, jClouds, Unomi,
Guacamole, BatchEE, Sirona, Incubator, …)
3. What is Apache Beam?
1. Agnostic (unified batch + stream) Beam programming model
2. Dataflow Java SDK (soon Python, DSLs)
3. Runners for Dataflow
a. Apache Flink (thanks to data Artisans)
b. Apache Spark (thanks to Cloudera)
c. Google Cloud Dataflow (fast, no-ops)
d. Local (in-process) runner for testing
e. OSGi/Karaf
4. Why Apache Beam?
1. Portable - You can use the same code with different runners (abstraction) and
backends on premise, in the cloud, or locally
2. Unified - Same unified model for batch and stream processing
3. Advanced features - Event windowing, triggering, watermarks, lateness, etc.
4. Extensible model and SDK - Extensible API; can define custom sources to
read and write in parallel
5. Beam Programming Model
Data processing pipeline
(executed via a Beam runner)
Input → PTransform/IO → PTransform → PTransform → Output
6. Beam Programming Model
1. Pipelines - a data processing job represented as a directed graph of steps
2. PCollection - the data inside a pipeline
3. Transform - a step in the pipeline (taking PCollections as input and producing
PCollections as output)
a. Core transforms - common transformations provided (ParDo, GroupByKey, …)
b. Composite transforms - combine multiple transforms
c. IO transforms - endpoints of a pipeline to create PCollections (consumer/root) or use
PCollections to “write” data outside of the pipeline (producer)
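The three concepts above can be mimicked in a few lines of plain Java (no Beam dependency) to show the idea: a "PCollection" is just an immutable list here, a "transform" is a function from one collection to the next. The names `parDo` and `groupByKey` are only nods to Beam's core transforms, not its actual API.

```java
import java.util.*;
import java.util.function.Function;
import java.util.stream.Collectors;

public class ToyModel {

    // A ParDo-like element-wise transform: apply fn to every element.
    static <I, O> List<O> parDo(List<I> input, Function<I, O> fn) {
        return input.stream().map(fn).collect(Collectors.toList());
    }

    // A GroupByKey-like transform over (key, value) pairs.
    static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> kvs) {
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (Map.Entry<String, Integer> kv : kvs) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a,1", "b,2", "a,3");
        // "ParDo": parse each line into a (key, value) pair.
        List<Map.Entry<String, Integer>> kvs = parDo(lines,
            line -> Map.entry(line.split(",")[0], Integer.parseInt(line.split(",")[1])));
        // "GroupByKey": collect all values for each key.
        Map<String, List<Integer>> grouped = groupByKey(kvs);
        System.out.println(grouped); // e.g. {a=[1, 3], b=[2]}
    }
}
```

In real Beam the same chain is expressed with `pipeline.apply(...)` calls, and the runner decides how to distribute the work.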
7. Beam Programming Model - PCollection
1. A PCollection is immutable, does not support random access to elements, and
belongs to a pipeline
2. Each element in a PCollection has a timestamp (set by the IO source)
3. Coders to support different data types
4. Bounded (batch) or unbounded (streaming) PCollection (depending on the IO
source)
5. Grouping of unbounded PCollections with windowing (thanks to the timestamp)
a. Fixed time window
b. Sliding time window
c. Session window
d. Global window (for bounded PCollection)
e. Can deal with time skew and data lag (late data) with triggers (time-based with watermarks,
data-based with counting, composite)
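As a rough illustration of the window shapes above, here is a plain-Java sketch (not Beam code; the method names are ours) of how a fixed window is derived from an element's timestamp, and how many sliding windows a single element can belong to:

```java
public class WindowSketch {

    // Fixed windows: each timestamp falls into exactly one window, whose
    // start is the timestamp rounded down to the window size.
    static long fixedWindowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - (timestampMillis % sizeMillis);
    }

    // Sliding windows: a timestamp belongs to every window of length
    // sizeMillis that starts at a multiple of periodMillis and covers it,
    // i.e. size/period windows at once.
    static int slidingWindowCount(long sizeMillis, long periodMillis) {
        return (int) (sizeMillis / periodMillis);
    }

    public static void main(String[] args) {
        long hour = 60 * 60 * 1000L;
        // An event at 01:30:00 falls in the fixed hourly window starting at 01:00:00.
        System.out.println(fixedWindowStart(90 * 60 * 1000L, hour)); // 3600000
        // 1-hour windows sliding every 15 minutes: each element is in 4 windows.
        System.out.println(slidingWindowCount(hour, 15 * 60 * 1000L)); // 4
    }
}
```

Session windows are different: they are data-driven, closing only after a gap of inactivity per key, so they cannot be computed from a single timestamp alone.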
8. Beam Programming Model - IO
1. IO Sources (read data as PCollections) and Sinks (write PCollections)
2. Support Bounded and/or Unbounded PCollections
3. Provided IO - File, BigQuery, BigTable, Avro, and more coming (Kafka, JMS, …)
4. Custom IO - extensible IO API to create custom sources & sinks
5. Should deal with timestamps, watermarks, deduplication, and parallelism (depending
on the needs)
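A toy, non-Beam sketch of the source/sink idea: a bounded "source" produces timestamped records, a "sink" consumes them. These interfaces are illustrative only; Beam's real IO API additionally covers splitting for parallel reads, watermarks, and deduplication.

```java
import java.util.*;

public class ToyIO {

    // A record paired with its event timestamp, as a Beam source would emit.
    static class Timestamped {
        final String value;
        final long timestampMillis;
        Timestamped(String value, long timestampMillis) {
            this.value = value;
            this.timestampMillis = timestampMillis;
        }
    }

    // A bounded "source" over an in-memory list; the timestamp here is just
    // the element index, standing in for a real event time.
    static List<Timestamped> readBounded(List<String> data) {
        List<Timestamped> out = new ArrayList<>();
        for (int i = 0; i < data.size(); i++) {
            out.add(new Timestamped(data.get(i), i));
        }
        return out;
    }

    // A "sink" that writes values out of the pipeline (here: to a list).
    static List<String> writeSink(List<Timestamped> records) {
        List<String> written = new ArrayList<>();
        for (Timestamped t : records) written.add(t.value);
        return written;
    }

    public static void main(String[] args) {
        List<Timestamped> read = readBounded(Arrays.asList("a", "b"));
        System.out.println(writeSink(read)); // [a, b]
    }
}
```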
9. Apache Beam SDKs
1. API for Beam Programming Model (design pipelines, transforms, …)
2. Current SDKs
a. Java - First SDK and primary focus for refactoring and improvement
b. Python - Dataflow SDK preview for batch processing, will be migrated to Apache Beam once
the Java SDK has been stabilized (and APIs/interfaces redefined)
3. Coming (possible) SDKs/languages - Scala, Go, Ruby, etc.
4. DSLs - domain specific languages on top of the SDKs (Java fluent DSL on top
of Java SDK, …)
10. Java SDK
public static void main(String[] args) {
// Create a pipeline parameterized by commandline flags.
Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
p.apply(TextIO.Read.from("/path/to...")) // Read input.
.apply(new CountWords()) // Do some processing.
.apply(TextIO.Write.to("/path/to...")); // Write output.
// Run the pipeline.
p.run();
}
11. Beam Programming Model
PCollection<KV<String, Integer>> scores = input
.apply(Window.into(Sessions.withGapDuration(Duration.standardMinutes(2)))
.triggering(AtWatermark()
.withEarlyFirings(
AtPeriod(Duration.standardMinutes(1)))
.withLateFirings(AtCount(1)))
.accumulatingFiredPanes())
.apply(Sum.integersPerKey());
The Apache Beam Model (by way of the Dataflow model) includes many primitives and features which
are powerful but hard to express in other models and languages.
12. Runners and Backends
● Runners “translate” the code to a target backend (the runner itself doesn’t
provide the backend)
● Many runners are tied to other top-level Apache projects, such as Apache Flink
and Apache Spark
● Because of this, runners can run on premise (e.g., on your local Flink cluster) or in a
public cloud (e.g., using Google Cloud Dataproc or Amazon EMR)
● Apache Beam is focused on treating runners as a top-level use case (with APIs,
support, etc.) so runners can be developed with minimal friction for maximum
pipeline portability
13. Beam Runners
● Google Cloud Dataflow
● Apache Flink*
● Apache Spark*
● Other runners* (local, OSGi, …)
[*] With varying levels of fidelity.
The Apache Beam (http://beam.incubator.apache.org) site will have more details soon.
14. Use Cases
Apache Beam is a great choice for both batch and stream processing, and it can
handle bounded and unbounded datasets.
Batch can focus on ETL/ELT, catch-up processing, daily aggregations, and so on.
Stream can focus on handling real-time processing on a record-by-record basis.
Real use cases
● Mobile gaming data processing, both batch and stream processing
(https://github.com/GoogleCloudPlatform/DataflowJavaSDK-examples/)
● Real-time event processing from IoT devices
15. Use Case - Gaming
● A game stores the gaming results in a CSV file:
○ player,team,score,timestamp
● Two pipelines:
○ UserScore (batch) sums the scores for each user
○ HourlyScore (batch) is similar to UserScore but with a window (hour): it calculates the sum of
scores per team on fixed windows.
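Before looking at the Beam pipelines on the following slides, the UserScore computation can be sketched in standalone plain Java (no Beam; helper names here are illustrative) over a handful of CSV lines:

```java
import java.util.*;

public class UserScoreSketch {

    // Parse "player,team,score,timestamp" lines and sum scores per player.
    static Map<String, Integer> sumScoresPerUser(List<String> csvLines) {
        Map<String, Integer> totals = new HashMap<>();
        for (String line : csvLines) {
            String[] f = line.split(",");
            if (f.length < 4) continue; // skip malformed records
            totals.merge(f[0].trim(), Integer.parseInt(f[2].trim()), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> events = Arrays.asList(
            "alice,red,12,1456000000000",
            "bob,blue,3,1456000001000",
            "alice,red,5,1456000002000");
        System.out.println(sumScoresPerUser(events)); // alice=17, bob=3 (in some order)
    }
}
```

The Beam version expresses the same logic as parallel transforms, so the runner can scale it to files far larger than memory.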
16. Use Case - Gaming - UserScore - Pipeline
Pipeline pipeline = Pipeline.create(options);
// Read events from a text file and parse them.
pipeline.apply(TextIO.Read.from(options.getInput()))
.apply(ParDo.named("ParseGameEvent").of(new ParseEventFn()))
// Extract and sum username/score pairs from the event data.
.apply("ExtractUserScore", new ExtractAndSumScore("user"))
.apply("WriteUserScoreSums",
new WriteToBigQuery<KV<String, Integer>>(options.getTableName(),
configureBigQueryWrite()));
// Run the batch pipeline.
pipeline.run();
17. Use Case - Gaming - UserScore - Avro Coder
@DefaultCoder(AvroCoder.class)
static class GameActionInfo {
@Nullable String user;
@Nullable String team;
@Nullable Integer score;
@Nullable Long timestamp;
public GameActionInfo(String user, String team, Integer score, Long timestamp) {
…
}
…}
18. Use Case - Gaming - UserScore - Parse Event Fn
static class ParseEventFn extends DoFn<String, GameActionInfo> {
// Log and count parse errors.
private static final Logger LOG = LoggerFactory.getLogger(ParseEventFn.class);
private final Aggregator<Long, Long> numParseErrors =
createAggregator("ParseErrors", new Sum.SumLongFn());
@Override
public void processElement(ProcessContext c) {
String[] components = c.element().split(",");
try {
String user = components[0].trim();
String team = components[1].trim();
Integer score = Integer.parseInt(components[2].trim());
Long timestamp = Long.parseLong(components[3].trim());
GameActionInfo gInfo = new GameActionInfo(user, team, score, timestamp);
c.output(gInfo);
} catch (ArrayIndexOutOfBoundsException | NumberFormatException e) {
numParseErrors.addValue(1L);
LOG.info("Parse error on " + c.element() + ", " + e.getMessage());
}
}
}
19. Use Case - Gaming - UserScore - Sum Score Transform
public static class ExtractAndSumScore
extends PTransform<PCollection<GameActionInfo>, PCollection<KV<String, Integer>>> {
private final String field;
ExtractAndSumScore(String field) {
this.field = field;
}
@Override
public PCollection<KV<String, Integer>> apply(
PCollection<GameActionInfo> gameInfo) {
return gameInfo
.apply(MapElements
.via((GameActionInfo gInfo) -> KV.of(gInfo.getKey(field), gInfo.getScore()))
.withOutputType(new TypeDescriptor<KV<String, Integer>>() {}))
.apply(Sum.<String>integersPerKey());
}
}
20. Use Case - Gaming - HourlyScore - Pipeline
pipeline.apply(TextIO.Read.from(options.getInput()))
.apply(ParDo.named("ParseGameEvent").of(new ParseEventFn()))
// filter with byPredicate to ignore some data
.apply("FilterStartTime", Filter.byPredicate((GameActionInfo gInfo)
-> gInfo.getTimestamp() > startMinTimestamp.getMillis()))
.apply("FilterEndTime", Filter.byPredicate((GameActionInfo gInfo)
-> gInfo.getTimestamp() < stopMinTimestamp.getMillis()))
// use fixed-time window
.apply("AddEventTimestamps", WithTimestamps.of((GameActionInfo i) -> new Instant(i.getTimestamp())))
.apply(Window.named("FixedWindowsTeam")
.<GameActionInfo>into(FixedWindows.of(Duration.standardMinutes(60))))
// extract and sum teamname/score pairs from the event data.
.apply("ExtractTeamScore", new ExtractAndSumScore("team"))
// write the result
.apply("WriteTeamScoreSums",
new WriteWindowedToBigQuery<KV<String, Integer>>(options.getTableName(),
configureWindowedTableWrite()));
pipeline.run();
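For comparison, the aggregation that HourlyScore expresses can be written as a standalone plain-Java sketch (no Beam; names are illustrative): sum scores per team within fixed one-hour windows derived from each event's timestamp, just as the `FixedWindows` transform does above.

```java
import java.util.*;

public class HourlyScoreSketch {

    static final long HOUR_MS = 60 * 60 * 1000L;

    // Parse "player,team,score,timestamp" lines and return totals keyed by
    // "team@windowStartMillis", where the window start is the event timestamp
    // rounded down to the hour (a fixed 1-hour window).
    static Map<String, Integer> sumPerTeamPerHour(List<String> csvLines) {
        Map<String, Integer> totals = new HashMap<>();
        for (String line : csvLines) {
            String[] f = line.split(",");
            String team = f[1].trim();
            int score = Integer.parseInt(f[2].trim());
            long ts = Long.parseLong(f[3].trim());
            long windowStart = ts - (ts % HOUR_MS); // fixed 1-hour window
            totals.merge(team + "@" + windowStart, score, Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> events = Arrays.asList(
            "alice,red,10,0",        // window starting at t=0
            "bob,red,5,1800000",     // same window (t = 30 min)
            "carol,red,7,3600000");  // next window (t = 1 h)
        System.out.println(sumPerTeamPerHour(events)); // red@0=15, red@3600000=7 (in some order)
    }
}
```

In the Beam pipeline the window is carried implicitly by the windowing strategy rather than baked into the key, which is what lets `WriteWindowedToBigQuery` emit one row per team per window.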
21. Roadmap
02/01/2016 - Enter Apache Incubator
02/25/2016 - 1st commit to ASF repository
Early 2016 - Design for use cases, begin refactoring
Mid 2016 - Slight chaos
Late 2016 - Multiple runners execute Beam pipelines
End 2016 - Cloud Dataflow should run Beam pipelines
22. More information and get involved!
1: Read about Apache Beam
Apache Beam website - http://beam.incubator.apache.org
2: See what the Apache Beam team is doing
Apache Beam JIRA - https://issues.apache.org/jira/browse/BEAM
Apache Beam mailing lists - http://beam.incubator.apache.org/mailing_lists/
3: Contribute!
Apache Beam git repo - https://github.com/apache/incubator-beam