Introduction to Apache Beam
JB Onofré - Talend
Apache Beam (formerly the Google Cloud Dataflow SDK) is a unified model and a set of language-specific SDKs for defining and executing data processing workflows. You design pipelines that simplify the mechanics of large-scale batch and streaming data processing, and the same pipelines can run on a number of runtimes, such as Apache Flink, Apache Spark, and Google Cloud Dataflow (a cloud service).

This presentation introduces the Beam programming model and how you can use it to design pipelines, transporting PCollections and applying PTransforms. You will see how the same code is "translated" to a target runtime thanks to a specific runner. You will also get an overview of the current roadmap and its upcoming features.
Who am I?
● Talend
○ Software Architect
○ Apache team
● Apache
○ Member of the Apache Software Foundation
○ Champion/Mentor/PPMC/PMC/Committer for ~20 projects (Beam, Falcon, Lens, Brooklyn, Slider, Karaf, Camel, ActiveMQ, ACE, Archiva, Aries, ServiceMix, Syncope, jClouds, Unomi, Guacamole, BatchEE, Sirona, Incubator, …)
What is Apache Beam?
1. Agnostic (unified batch + stream) Beam programming model
2. Dataflow Java SDK (soon Python, DSLs)
3. Runners for Dataflow
a. Apache Flink (thanks to data Artisans)
b. Apache Spark (thanks to Cloudera)
c. Google Cloud Dataflow (fast, no-ops)
d. Local (in-process) runner for testing
e. OSGi/Karaf
Why Apache Beam?
1. Portable - You can use the same code with different runners (abstraction) and backends on premise, in the cloud, or locally
2. Unified - Same unified model for batch and stream processing
3. Advanced features - Event windowing, triggering, watermarking, lateness, etc.
4. Extensible model and SDK - Extensible API; can define custom sources to read and write in parallel
Beam Programming Model
[Diagram: a data processing pipeline, executed via a Beam runner - Input → PTransform/IO → PTransform → PTransform → Output]
Beam Programming Model
1. Pipelines - a data processing job as a directed graph of steps
2. PCollection - the data inside a pipeline
3. Transform - a step in the pipeline (taking PCollections as input and producing PCollections)
a. Core transforms - common transformations provided (ParDo, GroupByKey, …)
b. Composite transforms - combine multiple transforms
c. IO transforms - endpoints of a pipeline that create PCollections (consumer/root) or use PCollections to “write” data outside of the pipeline (producer)
Beam Programming Model - PCollection
1. A PCollection is immutable, does not support random access to elements, and belongs to a pipeline
2. Each element in a PCollection has a timestamp (set by the IO Source)
3. Coders support different data types
4. Bounded (batch) or Unbounded (streaming) PCollection (depending on the IO Source)
5. Grouping of unbounded PCollections with Windowing (thanks to the timestamp)
a. Fixed time window
b. Sliding time window
c. Session window
d. Global window (for bounded PCollections)
e. Can deal with time skew and data lag (late data) with triggers (time-based with watermark, data-based with counting, composite)
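To make the fixed-time-window idea above concrete, here is a minimal plain-Java sketch (not the Beam API; the class and method names are hypothetical) of how an element's timestamp determines which fixed window it falls into - the same arithmetic that fixed windowing applies per element:

```java
// Illustrative only: computes the fixed window an event timestamp falls into,
// mirroring what fixed-time windowing does with element timestamps.
class FixedWindowDemo {
    // Start (in millis) of the fixed window containing the given timestamp.
    static long windowStart(long timestampMillis, long windowSizeMillis) {
        return timestampMillis - (timestampMillis % windowSizeMillis);
    }

    public static void main(String[] args) {
        long hour = 60 * 60 * 1000L;
        // An event at 01:30:00 falls into the [01:00:00, 02:00:00) window.
        System.out.println(windowStart(90 * 60 * 1000L, hour)); // prints 3600000
    }
}
```

Every element whose timestamp maps to the same window start is grouped together when the window fires.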
Beam Programming Model - IO
1. IO Sources (read data as PCollections) and Sinks (write PCollections)
2. Support Bounded and/or Unbounded PCollections
3. Provided IO - File, BigQuery, BigTable, Avro, and more coming (Kafka, JMS, …)
4. Custom IO - extensible IO API to create custom sources & sinks
5. Should deal with timestamps, watermarks, deduplication, and parallelism (depending on the needs)
Apache Beam SDKs
1. API for the Beam Programming Model (design pipelines, transforms, …)
2. Current SDKs
a. Java - First SDK and primary focus for refactoring and improvement
b. Python - Dataflow SDK preview for batch processing; will be migrated to Apache Beam once the Java SDK has been stabilized (and APIs/interfaces redefined)
3. Coming (possible) SDKs/languages - Scala, Go, Ruby, etc.
4. DSLs - domain-specific languages on top of the SDKs (Java fluent DSL on top of the Java SDK, …)
Java SDK

public static void main(String[] args) {
  // Create a pipeline parameterized by command-line flags.
  Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
  p.apply(TextIO.Read.from("/path/to..."))   // Read input.
   .apply(new CountWords())                  // Do some processing.
   .apply(TextIO.Write.to("/path/to..."));   // Write output.
  // Run the pipeline.
  p.run();
}
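The CountWords step above is a composite transform whose body the slide elides. As a hedged plain-Java sketch of what such a step computes (an in-memory stand-in, not Beam code; the helper name is hypothetical):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: the word counting a CountWords-style composite transform
// performs, expressed over an in-memory string instead of a PCollection.
class CountWordsDemo {
    static Map<String, Long> countWords(String text) {
        Map<String, Long> counts = new LinkedHashMap<>();
        // Split on non-word characters and tally each lowercase word.
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) counts.merge(word, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords("the cat and the hat"));
        // prints {the=2, cat=1, and=1, hat=1}
    }
}
```

In the real pipeline the same tally is expressed as parallel transforms (a ParDo to split lines into words, then a per-key count), so it scales beyond one machine's memory.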
Beam Programming Model

PCollection<KV<String, Integer>> scores = input
  .apply(Window.into(SessionWindows.of(Duration.standardMinutes(2)))
      .triggering(AtWatermark()
          .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
          .withLateFirings(AtCount(1)))
      .accumulatingFiredPanes())
  .apply(Sum.integersPerKey());

The Apache Beam Model (by way of the Dataflow model) includes many primitives and features which are powerful but hard to express in other models and languages.
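The session windows used above group events that arrive close together and start a new window after a gap of inactivity. A minimal plain-Java sketch of that grouping (not the Beam API; class and method names are hypothetical, and it assumes timestamps arrive sorted):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: groups sorted event timestamps into session windows,
// starting a new session whenever the gap since the previous event exceeds
// the gap duration - the grouping session windowing produces per key.
class SessionWindowDemo {
    static List<List<Long>> sessions(long[] sortedTimestamps, long gapMillis) {
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long ts : sortedTimestamps) {
            // A gap larger than gapMillis closes the current session.
            if (!current.isEmpty() && ts - current.get(current.size() - 1) > gapMillis) {
                result.add(current);
                current = new ArrayList<>();
            }
            current.add(ts);
        }
        if (!current.isEmpty()) result.add(current);
        return result;
    }

    public static void main(String[] args) {
        // With a 2-minute gap, the 10-minute jump starts a second session.
        long[] ts = {0L, 60_000L, 120_000L, 720_000L};
        System.out.println(sessions(ts, 120_000L).size()); // prints 2
    }
}
```

The triggering clause in the slide then controls when each such session emits results: speculatively every minute before the watermark, and once per late element after it, with fired panes accumulated.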
Runners and Backends
● Runners “translate” the code to a target backend (the runner itself doesn’t provide the backend)
● Many runners are tied to other top-level Apache projects, such as Apache Flink and Apache Spark
● Because of this, runners can run on premise (on your local Flink cluster) or in a public cloud (using Google Cloud Dataproc or Amazon EMR), for example
● Apache Beam is focused on treating runners as a top-level use case (with APIs, support, etc.) so runners can be developed with minimal friction for maximum pipeline portability
Beam Runners
● Google Cloud Dataflow
● Apache Flink*
● Apache Spark*
● Other runners* (local, OSGi, …)
[*] With varying levels of fidelity. The Apache Beam site (http://beam.incubator.apache.org) will have more details soon.
Use Cases
Apache Beam is a great choice for both batch and stream processing and can handle bounded and unbounded datasets.
● Batch can focus on ETL/ELT, catch-up processing, daily aggregations, and so on
● Stream can focus on handling real-time processing on a record-by-record basis
Real use cases:
● Mobile gaming data processing, both batch and stream processing (https://github.com/GoogleCloudPlatform/DataflowJavaSDK-examples/)
● Real-time event processing from IoT devices
Use Case - Gaming
● A game stores the gaming results in a CSV file:
○ Player,team,score,timestamp
● Two pipelines:
○ UserScore (batch) sums scores for each user
○ HourlyScore (batch) is similar to UserScore but with a Window (hour): it calculates summed scores per team on fixed windows
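As a standalone illustration of the CSV record format above, the following plain-Java sketch (a hypothetical helper, not from the deck) parses one event line the same way the ParseEventFn shown later does inside a Beam DoFn:

```java
// Illustrative only: parses one "player,team,score,timestamp" CSV line,
// mirroring the field extraction done by ParseEventFn in the pipeline.
class GameEventParser {
    // Returns {user, team, score, timestamp}; throws on malformed input.
    static String[] parse(String line) {
        String[] c = line.split(",");
        // Validate the numeric fields the same way ParseEventFn does.
        Integer.parseInt(c[2].trim());
        Long.parseLong(c[3].trim());
        return new String[] { c[0].trim(), c[1].trim(), c[2].trim(), c[3].trim() };
    }

    public static void main(String[] args) {
        String[] e = parse("alice, red, 12, 1456789000000");
        System.out.println(e[0] + " scored " + e[2]); // prints "alice scored 12"
    }
}
```

In the real pipeline, a malformed line does not throw out of the transform; the DoFn catches the exception, increments an error counter, and drops the record.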
User Game - Gaming - UserScore - Pipeline

Pipeline pipeline = Pipeline.create(options);
// Read events from a text file and parse them.
pipeline.apply(TextIO.Read.from(options.getInput()))
    .apply(ParDo.named("ParseGameEvent").of(new ParseEventFn()))
    // Extract and sum username/score pairs from the event data.
    .apply("ExtractUserScore", new ExtractAndSumScore("user"))
    .apply("WriteUserScoreSums",
        new WriteToBigQuery<KV<String, Integer>>(options.getTableName(),
            configureBigQueryWrite()));
// Run the batch pipeline.
pipeline.run();
User Game - Gaming - UserScore - Avro Coder

@DefaultCoder(AvroCoder.class)
static class GameActionInfo {
  @Nullable String user;
  @Nullable String team;
  @Nullable Integer score;
  @Nullable Long timestamp;

  public GameActionInfo(String user, String team, Integer score, Long timestamp) { … }
  …
}
User Game - Gaming - UserScore - Parse Event Fn

static class ParseEventFn extends DoFn<String, GameActionInfo> {
  // Log and count parse errors.
  private static final Logger LOG = LoggerFactory.getLogger(ParseEventFn.class);
  private final Aggregator<Long, Long> numParseErrors =
      createAggregator("ParseErrors", new Sum.SumLongFn());

  @Override
  public void processElement(ProcessContext c) {
    String[] components = c.element().split(",");
    try {
      String user = components[0].trim();
      String team = components[1].trim();
      Integer score = Integer.parseInt(components[2].trim());
      Long timestamp = Long.parseLong(components[3].trim());
      GameActionInfo gInfo = new GameActionInfo(user, team, score, timestamp);
      c.output(gInfo);
    } catch (ArrayIndexOutOfBoundsException | NumberFormatException e) {
      numParseErrors.addValue(1L);
      LOG.info("Parse error on " + c.element() + ", " + e.getMessage());
    }
  }
}
User Game - Gaming - UserScore - Sum Score Tr

public static class ExtractAndSumScore
    extends PTransform<PCollection<GameActionInfo>, PCollection<KV<String, Integer>>> {

  private final String field;

  ExtractAndSumScore(String field) {
    this.field = field;
  }

  @Override
  public PCollection<KV<String, Integer>> apply(PCollection<GameActionInfo> gameInfo) {
    return gameInfo
        .apply(MapElements
            .via((GameActionInfo gInfo) -> KV.of(gInfo.getKey(field), gInfo.getScore()))
            .withOutputType(new TypeDescriptor<KV<String, Integer>>() {}))
        .apply(Sum.<String>integersPerKey());
  }
}
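The Sum.integersPerKey() step at the end of this transform is easiest to understand in-memory. A hedged plain-Java sketch of what it computes (a stand-in over an array of key/score pairs, not Beam code; the class name is hypothetical):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: what Sum.integersPerKey() computes, expressed over a
// plain in-memory collection instead of a PCollection of KV pairs.
class SumPerKeyDemo {
    static Map<String, Integer> sumPerKey(String[][] kvPairs) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (String[] kv : kvPairs) {
            // kv[0] is the key (e.g. a user name), kv[1] an integer score.
            sums.merge(kv[0], Integer.parseInt(kv[1]), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        String[][] scores = { {"alice", "3"}, {"bob", "5"}, {"alice", "4"} };
        System.out.println(sumPerKey(scores)); // prints {alice=7, bob=5}
    }
}
```

Beam distributes the same per-key aggregation across workers (a GroupByKey followed by a combine), so the result matches this sequential version without ever materializing all pairs in one place.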
User Game - Gaming - HourlyScore - Pipeline

pipeline.apply(TextIO.Read.from(options.getInput()))
    .apply(ParDo.named("ParseGameEvent").of(new ParseEventFn()))
    // Filter with byPredicate to ignore some data.
    .apply("FilterStartTime", Filter.byPredicate((GameActionInfo gInfo)
        -> gInfo.getTimestamp() > startMinTimestamp.getMillis()))
    .apply("FilterEndTime", Filter.byPredicate((GameActionInfo gInfo)
        -> gInfo.getTimestamp() < stopMinTimestamp.getMillis()))
    // Use a fixed-time window.
    .apply("AddEventTimestamps", WithTimestamps.of((GameActionInfo i)
        -> new Instant(i.getTimestamp())))
    .apply(Window.named("FixedWindowsTeam")
        .<GameActionInfo>into(FixedWindows.of(Duration.standardMinutes(60))))
    // Extract and sum teamname/score pairs from the event data.
    .apply("ExtractTeamScore", new ExtractAndSumScore("team"))
    // Write the result.
    .apply("WriteTeamScoreSums",
        new WriteWindowedToBigQuery<KV<String, Integer>>(options.getTableName(),
            configureWindowedTableWrite()));
pipeline.run();
Roadmap
● 02/01/2016 - Enter Apache Incubator
● Early 2016 - Design for use cases, begin refactoring
● 02/25/2016 - 1st commit to ASF repository
● Mid 2016 - Slight chaos
● Late 2016 - Multiple runners execute Beam pipelines
● End 2016 - Cloud Dataflow should run Beam pipelines
More information and get involved!
1: Read about Apache Beam
Apache Beam website - http://beam.incubator.apache.org
2: See what the Apache Beam team is doing
Apache Beam JIRA - https://issues.apache.org/jira/browse/BEAM
Apache Beam mailing lists - http://beam.incubator.apache.org/mailing_lists/
3: Contribute!
Apache Beam git repo - https://github.com/apache/incubator-beam
Q&A
