Google Cloud Dataflow and lightweight Lambda Architecture for Big Data App

•

12 likes•6,821 views

Trieu Nguyen

Data & Analytics Technology Business

Google Cloud Dataflow
and lightweight Lambda
Architecture
for Big Data App
New innovative ideas and core concepts for
simple real-time analytic development
Compiled from Internet by
@tantrieuf31
http://nguyentantrieu.info

How it works:
an immutable sequence of records is captured and fed into
a batch system and a stream processing system in parallel.
You implement your transformation logic twice, once in the
batch system and once in the stream processing system.
You stitch together the results from both systems at query
time to produce a complete answer.
The Lambda Architecture

And the bad…
The problem with the Lambda Architecture is that
maintaining code that needs to produce the same
result in two complex distributed systems is exactly as
painful as it seems like it would be. I don’t think this
problem is fixable.
Programming in distributed frameworks like Storm and
Hadoop is complex. Inevitably, code ends up being
specifically engineered toward the framework it runs on.
The resulting operational complexity of systems
implementing the Lambda Architecture is the one thing that
seems to be universally agreed on by everyone doing it.

Too hard to build flexible analytics pipelines
Google is now making a huge point of the fact
that it abandoned MapReduce a long time ago
as it was too hard to build flexible analytics
pipelines. http://java.dzone.com/articles/google-cloud-dataflow-%E2%
80%93-game

http://youtu.be/TnLiEWglqHk - Google I/O 2014 - The dawn
of "Fast Data"

Google's new Dataflow
Google's new Dataflow architecture, which is
based on FlumeJava and MillWheel? They also
support code sharing.
Cloud Dataflow is a successor to MapReduce,
and is based on Google’s internal technologies
like Flume and MillWheel. This new project in
which Google placed their servers can be
considered the natural evolution of
MapReduce.

MillWheel: Fault-Tolerant Stream Processing at
Internet Scale

FlumeJava: Easy, Efﬁcient Data-Parallel
Pipelines

Lightweight Lambda Architecture
Stream processing system could be improved
to handle the full problem set in its target
domain.
→ Kappa Architecture + Micro-service

Ideas
● Use Kafka or some other system that will let you retain
the full log of the data you want to be able to reprocess
and that allows for multiple subscribers. For example, if
you want to reprocess up to 30 days of data, set your
retention in Kafka to 30 days.
● When you want to do the reprocessing, start a second
instance of your stream processing job that starts
processing from the beginning of the retained data,
but direct this output data to a new output table.
● When the second job has caught up, switch the
application to read from the new table.
● Stop the old version of the job, and delete the old output
table.

Background
Kafka maintains ordered logs like this:

A Kafka “topic” is a collection of these logs:

Refer links
1. http://cloudtimes.org/2014/07/07/mapreduce-successor-
google-cloud-dataflow-is-a-game-changer-for-hadoop-
thunder
2. http://martinfowler.com/articles/microservices.html
3. http://radar.oreilly.com/2014/07/questioning-the-lambda-
architecture.html
4. http://martinfowler.com/eaaDev/EventSourcing.html
5. http://martinfowler.com/bliki/CQRS.html
6. http://www.michael-noll.com/blog/2013/03/13/running-a-
multi-broker-apache-kafka-cluster-on-a-single-node/
7. http://kafka.apache.org
8. http://www.mc2ads.com

What's hot

Salesforce IntroRich Helton

Governance Strategies for Cloud Transformation | AWS Public Sector Summit 2016Amazon Web Services

Salesforce Community CloudJayant Jindal

The People Model and Cloud Transformation | AWS Public Sector Summit 2016Amazon Web Services

Salesforce data modelJean Brenda

Salesforce CRM 7 domains of SuccessKevin Sherman

Salesforce integration best practices columbus meetupMuleSoft Meetup

What is the Next Generation for Application Managed Services?Hexaware Technologies

Business cases for the need of cloud computingDr.Neeraj Kumar Pandey

What is Cloud Native Explained?jeetendra mandal

Cloud Strategy FirstAmazon Web Services

Netflix Development Patterns for Scale, Performance & Availability (DMG206) |...Amazon Web Services

Salesforce Tableau CRM - Quick OverviewHarshala Shewale ☁

Introduction to salesforce pptTania Yeasmin (Preity)

CPQ Solution StudyDemand Metric

Salesforce DevOps: Where Do You Start?Chandler Anderson

Cloudera - The Modern Platform for AnalyticsCloudera, Inc.

What Is Salesforce? | Salesforce Training - What Does Salesforce Do? | Salesf...Edureka!

Apache Flink Adoption at ShopifyYaroslav Tkachenko

Service oriented architecture characteristics of soasmithaps4

What's hot (20)

Salesforce Intro

Governance Strategies for Cloud Transformation | AWS Public Sector Summit 2016

Salesforce Community Cloud

The People Model and Cloud Transformation | AWS Public Sector Summit 2016

Salesforce data model

Salesforce CRM 7 domains of Success

Salesforce integration best practices columbus meetup

What is the Next Generation for Application Managed Services?

Business cases for the need of cloud computing

What is Cloud Native Explained?

Cloud Strategy First

Netflix Development Patterns for Scale, Performance & Availability (DMG206) |...

Salesforce Tableau CRM - Quick Overview

Introduction to salesforce ppt

CPQ Solution Study

Salesforce DevOps: Where Do You Start?

Cloudera - The Modern Platform for Analytics

What Is Salesforce? | Salesforce Training - What Does Salesforce Do? | Salesf...

Apache Flink Adoption at Shopify

Service oriented architecture characteristics of soa

Viewers also liked

Google Cloud Dataflow Two Worlds Become a Much Better OneDataWorks Summit

Google Cloud DataflowAlex Van Boxel

Strtio Spark Streaming + Siddhi CEP EngineMyung Ho Yun

Streaming Auto-scaling in Google Cloud DataflowC4Media

Stratio platform overview v4.1Stratio

[Strata] SparktaStratio

Google BigQueryMatthias Feys

Presto: Distributed sql query engine kiran palaka

Facebook Presto presentationCyanny LIANG

Real Time Analytics: Algorithms and SystemsArun Kejariwal

Big Data Architectural PatternsAmazon Web Services

Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Helena Edelson

Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaHelena Edelson

Présentation Google DataflowGeoffrey Garnotel

Viewers also liked (14)

Google Cloud Dataflow Two Worlds Become a Much Better One

Google Cloud Dataflow

Strtio Spark Streaming + Siddhi CEP Engine

Streaming Auto-scaling in Google Cloud Dataflow

Stratio platform overview v4.1

[Strata] Sparkta

Google BigQuery

Presto: Distributed sql query engine

Facebook Presto presentation

Real Time Analytics: Algorithms and Systems

Big Data Architectural Patterns

Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...

Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala

Présentation Google Dataflow

Similar to Google Cloud Dataflow and lightweight Lambda Architecture for Big Data App

Introduction to GCP Data Flow PresentationKnoldus Inc.

Introduction to GCP DataFlow PresentationKnoldus Inc.

E5 05 ijcite august 2014ijcite

International Journal of Engineering Inventions (IJEI)International Journal of Engineering Inventions www.ijeijournal.com

B04 06 0918International Journal of Engineering Inventions www.ijeijournal.com

Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...AboutYouGmbH

B04 06 0918International Journal of Engineering Inventions www.ijeijournal.com

Hourglass: a Library for Incremental Processing on HadoopMatthew Hayes

Research Inventy : International Journal of Engineering and Scienceinventy

Exploiting dynamic resource allocation foringenioustech

Apache spark y cómo lo usamos en nuestros proyectosOpenSistemas

Cloud batch a batch job queuing system on clouds with hadoop and h-baseJoão Gabriel Lima

Using Hazelcast in the Kappa architectureOliver Buckley-Salmon

Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020Mariano Gonzalez

Confluent & Attunity: Mainframe Data Modern Analyticsconfluent

MAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISONijcsit

Fundamental question and answer in cloud computing quiz by animesh chaturvediAnimesh Chaturvedi

GCP Slide.pptxUnknownPerson475333

Hadoop performance modeling for jobranjith kumar

D017212027IOSR Journals

Similar to Google Cloud Dataflow and lightweight Lambda Architecture for Big Data App (20)

Introduction to GCP Data Flow Presentation

Introduction to GCP DataFlow Presentation

E5 05 ijcite august 2014

International Journal of Engineering Inventions (IJEI)

B04 06 0918

Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...

B04 06 0918

Hourglass: a Library for Incremental Processing on Hadoop

Research Inventy : International Journal of Engineering and Science

Exploiting dynamic resource allocation for

Apache spark y cómo lo usamos en nuestros proyectos

Cloud batch a batch job queuing system on clouds with hadoop and h-base

Using Hazelcast in the Kappa architecture

Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020

Confluent & Attunity: Mainframe Data Modern Analytics

MAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISON

Fundamental question and answer in cloud computing quiz by animesh chaturvedi

GCP Slide.pptx

Hadoop performance modeling for job

D017212027

Recently uploaded

2023 Survey Shows Dip in High School E-Cigarette UseBisnar Chase Personal Injury Attorneys

Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics

Digital Marketing Plan, how digital marketing worksdeepakthakur548787

What To Do For World Nature Conservation Day by Slidesgo.pptxSimranPal17

Non Text Magic Studio Magic Design for Presentations L&P.pdfPratikPatil591646

why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...Jack Cole

DATA ANALYSIS using various data sets like shoping data set etclalithasri22

Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics

World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfsimulationsindia

Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen

Principles and Practices of Data VisualizationKianJazayeri1

Rithik Kumar Singh codealpha pythohn.pdfrahulyadav957181

knowledge representation in artificial intelligencePriyadharshiniG41

Data Analysis Project: Stroke PredictionBoston Institute of Analytics

Introduction to Mongo DB-open-‐source, high-‐performance, document-‐orient...boychatmate1

6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)

modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx

English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml

Statistics For Management by Richard I. Levin 8ed.pdfnikeshsingh56

Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics

Recently uploaded (20)

2023 Survey Shows Dip in High School E-Cigarette Use

Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...

Digital Marketing Plan, how digital marketing works

What To Do For World Nature Conservation Day by Slidesgo.pptx

Non Text Magic Studio Magic Design for Presentations L&P.pdf

why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...

DATA ANALYSIS using various data sets like shoping data set etc

Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...

World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf

Data Factory in Microsoft Fabric (MsBIP #82)

Principles and Practices of Data Visualization

Rithik Kumar Singh codealpha pythohn.pdf

knowledge representation in artificial intelligence

Data Analysis Project: Stroke Prediction

Introduction to Mongo DB-open-‐source, high-‐performance, document-‐orient...

6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...

modul pembelajaran robotic Workshop _ by Slidesgo.pptx

English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf

Statistics For Management by Richard I. Levin 8ed.pdf

Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model

Google Cloud Dataflow and lightweight Lambda Architecture for Big Data App

1. Google Cloud Dataflow and lightweight Lambda Architecture for Big Data App New innovative ideas and core concepts for simple real-time analytic development Compiled from Internet by @tantrieuf31 http://nguyentantrieu.info

2. How it works: an immutable sequence of records is captured and fed into a batch system and a stream processing system in parallel. You implement your transformation logic twice, once in the batch system and once in the stream processing system. You stitch together the results from both systems at query time to produce a complete answer. The Lambda Architecture

3. And the bad… The problem with the Lambda Architecture is that maintaining code that needs to produce the same result in two complex distributed systems is exactly as painful as it seems like it would be. I don’t think this problem is fixable. Programming in distributed frameworks like Storm and Hadoop is complex. Inevitably, code ends up being specifically engineered toward the framework it runs on. The resulting operational complexity of systems implementing the Lambda Architecture is the one thing that seems to be universally agreed on by everyone doing it.

4. Too hard to build flexible analytics pipelines Google is now making a huge point of the fact that it abandoned MapReduce a long time ago as it was too hard to build flexible analytics pipelines. http://java.dzone.com/articles/google-cloud-dataflow-%E2% 80%93-game

5. http://youtu.be/TnLiEWglqHk - Google I/O 2014 - The dawn of "Fast Data"

6. Google's new Dataflow Google's new Dataflow architecture, which is based on FlumeJava and MillWheel? They also support code sharing. Cloud Dataflow is a successor to MapReduce, and is based on Google’s internal technologies like Flume and MillWheel. This new project in which Google placed their servers can be considered the natural evolution of MapReduce.

7. MillWheel: Fault-Tolerant Stream Processing at Internet Scale

8. FlumeJava: Easy, Efﬁcient Data-Parallel Pipelines

9. Lightweight Lambda Architecture Stream processing system could be improved to handle the full problem set in its target domain. → Kappa Architecture + Micro-service

10. Ideas ● Use Kafka or some other system that will let you retain the full log of the data you want to be able to reprocess and that allows for multiple subscribers. For example, if you want to reprocess up to 30 days of data, set your retention in Kafka to 30 days. ● When you want to do the reprocessing, start a second instance of your stream processing job that starts processing from the beginning of the retained data, but direct this output data to a new output table. ● When the second job has caught up, switch the application to read from the new table. ● Stop the old version of the job, and delete the old output table.

11. Background Kafka maintains ordered logs like this:

12. A Kafka “topic” is a collection of these logs:

13. Big picture of Kafka

14. Refer links 1. http://cloudtimes.org/2014/07/07/mapreduce-successor- google-cloud-dataflow-is-a-game-changer-for-hadoop- thunder 2. http://martinfowler.com/articles/microservices.html 3. http://radar.oreilly.com/2014/07/questioning-the-lambda- architecture.html 4. http://martinfowler.com/eaaDev/EventSourcing.html 5. http://martinfowler.com/bliki/CQRS.html 6. http://www.michael-noll.com/blog/2013/03/13/running-a- multi-broker-apache-kafka-cluster-on-a-single-node/ 7. http://kafka.apache.org 8. http://www.mc2ads.com

Google Cloud Dataflow and lightweight Lambda Architecture for Big Data App

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (14)

Similar to Google Cloud Dataflow and lightweight Lambda Architecture for Big Data App

Similar to Google Cloud Dataflow and lightweight Lambda Architecture for Big Data App (20)

More from Trieu Nguyen

More from Trieu Nguyen (20)

Recently uploaded

Recently uploaded (20)

Google Cloud Dataflow and lightweight Lambda Architecture for Big Data App