SlideShare a Scribd company logo
1 of 49
Lambda Architecture
with Apache Spark
IMAGE
About Me
https://ua.linkedin.com/in/tarasmatyashovsky
Apache Hadoop: A Brief History
http://www.slideshare.net/fadicce/hadoop-user-group-uae-meeting
A lot of customers implemented
successful Hadoop-based M/R pipelines
which are operating today
Examples from Real Life
• Oozie workflow, operates daily and processes up to
150 TB to generate analytics
• bash managed workflow, operates daily and processes
up to 8 TB to generate analytics
It Is 2016 Now!
• Making decisions faster is more valuable
• Kafka, Storm, Trident, Samza, Spark, Flink, Parquet,
Avro, Cloud providers, etc.
Examples from Real Life
http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
Lambda Architecture
A data-processing architecture
designed to handle massive quantities of data
by taking advantage of both
batch and stream processing methods
http://lambda-architecture.net/
https://www.manning.com/books/big-data
https://www.manning.com/books/big-data
Layers of Lambda Architecture
Batch layer
• manages the master dataset (an immutable, append-only set of
raw data)
• pre-computes the batch views
Serving layer
• indexes the batch views so that they can be queried in ad-hoc with
low-latency
Speed layer
• deals with recent data only
http://lambda-architecture.net/
https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
Relevance of Data
http://www.slideshare.net/helenaedelson/lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala
query =
real time view =
batch view =
function(batch view, real time view)
function(real time view, new data)
function(all data)
Trade-offs
Full recomputation vs. partial recomputation
e.g. using Bloom filters
Recomputational algorithms vs. incremental algorithms
Additive algorithms vs. approximation algorithms
e.g. HyperLogLog for count-distinct problem
Implementation of Lambda Architecture
https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
Can be considered as an integrated solution
for processing on all lambda architecture layers
Apache Spark: a Brief History
Why Apache Spark?
As of mid 2014,
Spark is the most active Big Data project
http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east
Total Contributors
Core Concepts
automatically distribute data across cluster
and
parallelize operations performed on them
Components Stack
https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
Enables scalable, high-throughput, fault-tolerant
stream processing of live data streams
50% users consider most important part of Spark
Spark Streaming
http://spark.apache.org/docs/latest/streaming-programming-guide.html
Streaming Architecture
• micro-batch architecture
• series of batch computations on small chunks of data
• batch interval is configurable
• exactly once semantics
http://spark.apache.org/docs/latest/streaming-programming-guide.html
Streaming Architecture
http://spark.apache.org/docs/latest/streaming-programming-guide.html
https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
http://spark.apache.org/docs/latest/streaming-programming-guide.html#input-dstreams-and-receivers
http://spark.apache.org/docs/latest/streaming-programming-guide.html#discretized-streams-dstreams
DStream as a Continuous Series of RDDs
http://spark.apache.org/docs/latest/streaming-programming-guide.html#discretized-streams-dstreams
http://spark.apache.org/docs/latest/streaming-programming-guide.html#discretized-streams-dstreams
Provide hashtags statistics
used in a #morningatlohika tweets
All time till today + right now
Sample Application
https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv
Batch View
apache –
architecture –
aws –
java –
jeeconf –
lambda –
morningatlohika –
simpleworkflow –
spark –
6
12
3
4
7
6
15
14
5
https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv
Real-time View
“Cool presentation by @tmatyashovsky about
#lambda #architecture using #apache #spark
at #morningatlohika”
apache –
architecture –
morningatlohika –
lambda –
spark –
1
1
1
1
1
https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv
Batch View + Real-time View
apache –
architecture –
aws –
java –
jeeconf –
lambda –
morningatlohika –
simpleworkflow –
spark –
7
13
3
4
7
7
16
14
6
https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv
Simplified Steps
• Create batch view (.parquet) via Apache Spark
• Cache batch view in Apache Spark
• Start streaming application connected to Twitter
• Focus on real-time #morningatlohika tweets*
• Build incremental real-time views
• Query, i.e. merge batch and real-time views on a fly
* Stream from file system (used for testing) can be used as a backup
https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv
Demo Time
https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv
http://shop.oreilly.com/product/0636920028512.do
http://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics
Structured Streaming in Spark 2.0
The simplest way to perform streaming analytics
is not having to reason about streaming
Static DataFrame API = Infinite DataFrame API
http://www.slideshare.net/rxin/the-future-of-realtime-in-spark
Structured Streaming
• Introduces streaming API built on top of Spark SQL
• Unifies streaming, interactive and batch queries
logs = context.read.format("json")
.stream("s3://logs")
logs.groupBy(logs.user_id)
.agg(sum(logs.time))
.write.format("jdbc")
.stream("jdbc:mysql//...")
https://www.youtube.com/watch?v=oXkxXDG0gNk
Instead of Epilogue
http://milinda.pathirage.org/kappa-architecture.com/
http://milinda.pathirage.org/kappa-architecture.com/
Taras Matyashovsky
taras.matyashovsky@gmail.com
@tmatyashovsky
http://www.filevych.com/
Thank you!
References
http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
https://www.manning.com/books/big-data
Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia (early release ebook from O'Reilly
Media)
http://spark.apache.org/docs/latest/streaming-programming-guide.html
http://www.slideshare.net/helenaedelson/lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala
http://www.rittmanmead.com/2015/08/combining-spark-streaming-and-data-frames-for-near-real-time-log-analysis/
https://databricks.com/blog/2015/07/30/diving-into-spark-streamings-execution-model.html
https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Streaming%20mapWithState.html
http://spark.apache.org/docs/latest/cluster-overview.html
http://milinda.pathirage.org/kappa-architecture.com/
http://www.slideshare.net/databricks/2016-spark-summit-east-keynote-matei-zaharia
http://www.slideshare.net/rxin/the-future-of-realtime-in-spark
http://thenewstack.io/spark-2-0-will-offer-interactive-querying-live-data/
http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617
https://databricks.com/blog/2015/10/13/interactive-audience-analytics-with-spark-and-hyperloglog.html
https://www.youtube.com/watch?v=ZFBgY0PwUeY
https://www.youtube.com/watch?v=oXkxXDG0gN
http://milinda.pathirage.org/kappa-architecture.com/

More Related Content

More from Taras Matyashovsky

Influence. The Psychology of Persuasion (in IT)
Influence. The Psychology of Persuasion (in IT)Influence. The Psychology of Persuasion (in IT)
Influence. The Psychology of Persuasion (in IT)Taras Matyashovsky
 
JEEConf 2015 - Introduction to real-time big data with Apache Spark
JEEConf 2015 - Introduction to real-time big data with Apache SparkJEEConf 2015 - Introduction to real-time big data with Apache Spark
JEEConf 2015 - Introduction to real-time big data with Apache SparkTaras Matyashovsky
 
Morning at Lohika 1st anniversary
Morning at Lohika 1st anniversaryMorning at Lohika 1st anniversary
Morning at Lohika 1st anniversaryTaras Matyashovsky
 
Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkTaras Matyashovsky
 
New life inside monolithic application
New life inside monolithic applicationNew life inside monolithic application
New life inside monolithic applicationTaras Matyashovsky
 
Distributed applications using Hazelcast
Distributed applications using HazelcastDistributed applications using Hazelcast
Distributed applications using HazelcastTaras Matyashovsky
 
From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.Taras Matyashovsky
 

More from Taras Matyashovsky (9)

Confession of an Engineer
Confession of an EngineerConfession of an Engineer
Confession of an Engineer
 
Influence. The Psychology of Persuasion (in IT)
Influence. The Psychology of Persuasion (in IT)Influence. The Psychology of Persuasion (in IT)
Influence. The Psychology of Persuasion (in IT)
 
JEEConf 2015 - Introduction to real-time big data with Apache Spark
JEEConf 2015 - Introduction to real-time big data with Apache SparkJEEConf 2015 - Introduction to real-time big data with Apache Spark
JEEConf 2015 - Introduction to real-time big data with Apache Spark
 
Morning at Lohika 1st anniversary
Morning at Lohika 1st anniversaryMorning at Lohika 1st anniversary
Morning at Lohika 1st anniversary
 
Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache Spark
 
New life inside monolithic application
New life inside monolithic applicationNew life inside monolithic application
New life inside monolithic application
 
Distributed applications using Hazelcast
Distributed applications using HazelcastDistributed applications using Hazelcast
Distributed applications using Hazelcast
 
Morning at Lohika
Morning at LohikaMorning at Lohika
Morning at Lohika
 
From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.
 

Recently uploaded

Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 

Recently uploaded (20)

Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 

Lambda Architecture with Apache Spark

Editor's Notes

  1. Receiver: Task that collects data from the input source and represents it as RDDs Is launched automatically for each input source Replicates data to another executor for fault tolerance
  2. Cluster Manager: Standalone, Apache Mesos, Hadoop Yarn Cluster Manager should be chosen and configured properly Monitoring via web UI(s) and metrics Web UI: master web UI worker web UI driver web UI - available only during execution history server - spark.eventLog.enabled = true Metrics based on Coda Hale Metrics library. Can be reported via HTTP, JMX, and CSV files.
  3. Spark 2.0: Project Tungsten 2.0 Whole stage code generation Optimized input / output -> Parquet + built-in cache Spark Streaming DataFrame API unified with Dataset API