Analyzing Data at Scale with Apache Spark
Nicola Ferraro (@ni_ferraro)
Senior Software Engineer at Red Hat
Naples, November 24th 2017
Myself
Nicola Ferraro
Senior Software Engineer at Red Hat
Working on Apache Camel, JBoss Fuse,
Fuse Integration Services for Openshift,
Syndesis, Oshinko Radanalytics.
Follow me on Twitter
@ni_ferraro
Agenda
● A brief history of Big Data
● Data processing models
● Spark on Openshift
● Demo
Big Data Systems: why?
Systems capable of handling data with high:
● Volume
○ Terabytes/Petabytes of data collected over the years
● Velocity
○ High-speed streaming data to be analyzed in near real-time
● Variety
○ Not just tabular data or JSON/XML, also images, videos, free text
[Figure: Venn diagram of Volume, Velocity and Variety; "There!" marks the intersection]
Big Data Systems: why IoT?
Big Data Systems: which devices?
An Example?
Back to the Future II (Weather forecasting)
We can collect data from static sensors and moving cars to understand the exact moment when it will stop raining!
E.g. https://goo.gl/FDzfdx
Big Data Systems: how?
By scaling horizontally to 1000s of machines!
A single machine can be slow, but together they have huge processing power!
Evolution of Big Data Systems: Software
● 2006: Hadoop
● 2008: Pig (scripting)
● 2010: Hive (SQL)
● 2014+: ...
Evolution of Big Data Systems: Infrastructure
● 2006: Commodity Hardware
● 2011: Big Data Appliances
● 2014: Virtual Machines
● 2018: ?
Evolution of Big Data Systems: Architectures
● 2006: Batch, Data Lake
● 2011: Hybrid (Lambda), batch + streaming
● 2016+: Streaming (Kappa)
Batch Architecture
Hadoop v1
1. Ingest to HDFS
2. Input-output from HDFS with MapReduce
3. Export to external systems using HDFS tools
[Diagram: Ingest, then four HDFS nodes each running Map/Reduce, then To serving layer]
Lambda Architecture
[Diagram labels: Ingest, Messaging, Streaming, HDFS, Batch, NoSQL, Interactive Queries, To serving layer]
Batch processing every night or every n days...
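The hybrid idea can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: the class `LambdaViews` and its method names are hypothetical, the "views" are in-memory maps rather than HDFS or a NoSQL store, and the batch run is assumed to cover every event received so far.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a Lambda-style serving layer: a batch view that is
// rebuilt periodically, plus a real-time view fed by the streaming path.
final class LambdaViews {
    private Map<String, Long> batchView = new HashMap<>();           // rebuilt by the batch job
    private final Map<String, Long> realtimeView = new HashMap<>();  // updated per event

    // Streaming path: count each event as soon as it arrives.
    void onEvent(String sensorId) {
        realtimeView.merge(sensorId, 1L, Long::sum);
    }

    // Batch path: install freshly recomputed totals and drop the real-time
    // increments (assumption: the batch run covers all of them).
    void onBatchCompleted(Map<String, Long> recomputedTotals) {
        batchView = new HashMap<>(recomputedTotals);
        realtimeView.clear();
    }

    // Query path: the answer merges both views.
    long count(String sensorId) {
        return batchView.getOrDefault(sensorId, 0L)
             + realtimeView.getOrDefault(sensorId, 0L);
    }

    public static void main(String[] args) {
        LambdaViews views = new LambdaViews();
        views.onEvent("sensor-1");
        views.onEvent("sensor-1");
        System.out.println(views.count("sensor-1"));
    }
}
```

The cost of this design is that the same logic lives in two code paths (batch and streaming), which is the usual motivation for moving to a streaming-only architecture.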
Kappa Architecture
[Diagram: Distributed Event Log, then Streaming, then To serving layer]
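A minimal sketch of the streaming-only idea, assuming an in-memory list stands in for the distributed event log. `EventLog`, `replay` and `KappaDemo` are hypothetical names made up for this illustration; the point is that state is always derived by replaying the log, so a new version of a job can simply start again from offset 0.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical in-memory stand-in for a distributed, append-only event log.
final class EventLog<E> {
    private final List<E> events = new ArrayList<>();

    // Events are only ever appended, never updated in place.
    void append(E event) {
        events.add(event);
    }

    // Replays every event from the given offset into a streaming job,
    // which folds each event into its state.
    <S> S replay(int fromOffset, S state, BiConsumer<S, E> job) {
        for (int i = fromOffset; i < events.size(); i++) {
            job.accept(state, events.get(i));
        }
        return state;
    }
}

final class KappaDemo {
    public static void main(String[] args) {
        EventLog<String> log = new EventLog<>();
        log.append("rain");
        log.append("sun");
        log.append("rain");

        // Reprocessing means replaying the log from the beginning.
        Map<String, Integer> counts = log.replay(0, new HashMap<String, Integer>(),
            (state, event) -> state.merge(event, 1, Integer::sum));
        System.out.println(counts);
    }
}
```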
Agenda
● A brief history of Big Data
● Data processing models
● Spark on Openshift
● Demo
Map Reduce Example: Word Count
Users implemented two function classes (Map and Reduce) and one config file
Old Data Processing Model: Map Reduce
Hadoop: batch architecture
[Diagram: Machines 1 to 4 each run load → MAP → cache → shuffle → REDUCE → store]
Load and store: usually HDFS (replica factor 3)
Most of the work is done in parallel by all machines!
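To make the three phases concrete, here is a word count written as explicit map, shuffle and reduce steps. This is a single-process sketch in plain Java, not the Hadoop API: the class `WordCountMapReduce` and its methods are hypothetical, and in a real cluster the map and reduce calls would run on different machines.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class WordCountMapReduce {
    // MAP: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<String, Integer>(word, 1));
            }
        }
        return pairs;
    }

    // SHUFFLE: group all values emitted for the same key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // REDUCE: collapse the values of one key into a single count.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Wires the three phases together for a list of input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(pairs).entrySet()) {
            counts.put(entry.getKey(), reduce(entry.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("to be or");
        lines.add("not to be");
        System.out.println(wordCount(lines)); // {be=2, not=1, or=1, to=2}
    }
}
```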
Introducing Spark
Fast data processing platform.
● Batch processing
● Streaming (structured or micro-batching)
● Machine Learning
● Graph-based Algorithms
Multi-language: Scala, Java, Python, R
Apache Spark: RDD
The core Spark API is based on the concept of the Resilient Distributed Dataset (RDD).
RDD (set of all events received):
val events: RDD[Event] = …
Like a Scala collection (but lazy)
Sources: HDFS, JDBC, NoSQL, Kafka
Partitions: P1 P2 P3 P4 P5 P6
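The "but lazy" part can be observed with Java 8 streams, which are also lazy: declaring a pipeline does no work until a terminal operation runs. This is an analogy in plain Java, not Spark code; the class `LazyPipeline` and its method are hypothetical names for this sketch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class LazyPipeline {
    // Returns how many elements had been evaluated before and after the
    // terminal operation, as {before, after}.
    static int[] evaluationsBeforeAndAfter(List<Integer> events) {
        AtomicInteger evaluated = new AtomicInteger();

        // Declaring the pipeline does no work yet
        // (comparable to chaining transformations on an RDD).
        Stream<Integer> evens = events.stream()
            .peek(e -> evaluated.incrementAndGet())
            .filter(e -> e % 2 == 0);
        int before = evaluated.get();

        // The terminal operation triggers the computation
        // (comparable to calling an action such as collect on an RDD).
        evens.collect(Collectors.toList());
        int after = evaluated.get();

        return new int[] {before, after};
    }

    public static void main(String[] args) {
        int[] counts = evaluationsBeforeAndAfter(Arrays.asList(1, 2, 3, 4, 5, 6));
        System.out.println("before: " + counts[0] + ", after: " + counts[1]);
    }
}
```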
Apache Spark: Functional Programming Model
Java 8 streams:
import java.util.List;
import java.util.stream.Collectors;
// Keep people under 30, project to first name, drop duplicates.
List<String> firstnames = people.stream()
    .filter(p -> p.getAge() < 30)
    .map(p -> p.getFirstname())
    .distinct()
    .collect(Collectors.toList());
Get all distinct first names of people under 30 from a Java collection.
Apache Spark (Scala):
// Same pipeline, but people is an RDD.
val firstnames = people
    .filter(p => p.age < 30)
    .map(p => p.firstname)
    .distinct()
    .collect()
The only difference: people is a 20TB RDD and the computation is performed by several machines in parallel.
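For readers who want to run the Java side, here is a self-contained version of the same pipeline. The `Person` class and the sample data are assumptions added for this sketch; only the stream pipeline itself comes from the slide.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical data class; the slide only shows getAge() and getFirstname().
final class Person {
    private final String firstname;
    private final int age;

    Person(String firstname, int age) {
        this.firstname = firstname;
        this.age = age;
    }

    String getFirstname() { return firstname; }
    int getAge() { return age; }
}

final class FirstNames {
    // Distinct first names of people under 30, in encounter order.
    static List<String> distinctFirstnamesUnder30(List<Person> people) {
        return people.stream()
            .filter(p -> p.getAge() < 30)
            .map(Person::getFirstname)
            .distinct()
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Person> people = Arrays.asList(
            new Person("Anna", 25),
            new Person("Luca", 41),
            new Person("Anna", 19),
            new Person("Marco", 29));
        System.out.println(distinctFirstnamesUnder30(people)); // [Anna, Marco]
    }
}
```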
Apache Spark: Streaming (or micro-batching)
DStream = Discretized Stream
The size of each micro-batch is
specified by the user (in seconds)
Sliding window mode
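The discretization step can be sketched without Spark: events are bucketed by timestamp into fixed-size batches, and a sliding window is just the last few batches taken together. `MicroBatcher` and its methods are hypothetical names, timestamps are assumed to be non-negative seconds, and this is not the DStream API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MicroBatcher {
    // Groups event timestamps (seconds, assumed >= 0) into consecutive
    // batches of batchSeconds, keyed by the start time of each batch.
    static Map<Long, List<Long>> toBatches(List<Long> timestamps, long batchSeconds) {
        Map<Long, List<Long>> batches = new TreeMap<>();
        for (long ts : timestamps) {
            long batchStart = (ts / batchSeconds) * batchSeconds;
            batches.computeIfAbsent(batchStart, k -> new ArrayList<>()).add(ts);
        }
        return batches;
    }

    // Sliding window: counts the events in the last windowBatches batches,
    // ending with the batch that starts at lastBatchStart.
    static int windowCount(Map<Long, List<Long>> batches, long lastBatchStart,
                           int windowBatches, long batchSeconds) {
        int count = 0;
        for (int i = 0; i < windowBatches; i++) {
            List<Long> batch = batches.get(lastBatchStart - i * batchSeconds);
            if (batch != null) {
                count += batch.size();
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Map<Long, List<Long>> batches =
            toBatches(Arrays.asList(0L, 1L, 4L, 5L, 9L, 12L), 5L);
        System.out.println(batches);                         // {0=[0, 1, 4], 5=[5, 9], 10=[12]}
        System.out.println(windowCount(batches, 10L, 2, 5L)); // 3
    }
}
```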
Apache Spark 2.0: Dataframes/Datasets
RDD and DStream are the core APIs for processing data, but they are now considered too low-level.
Streaming → DStream[Temperature]
Batch → RDD[Temperature]
Spark 2.0 introduced Structured Streaming:
● Using the same API for streaming and still data
● Treating a stream of events as a growing append-only collection
The plan at the time of this talk was to remove the RDD/DStream API in Spark 3.0 (both APIs are still available in Spark 3.x).
For now: structured streaming is
not feature-complete (Spark 2.2.0)
[Diagram: a Stream mapped to an append-only Table with columns col1, col2, …]
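The "stream as an append-only table" idea can be illustrated in plain Java: each new event is a new row, and the same query can be answered either by scanning the table (batch style) or by keeping a running aggregate (streaming style). This is a conceptual sketch, not the Structured Streaming API; `AppendOnlyTable`, its columns (sensor, temperature) and its methods are assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class AppendOnlyTable {
    // One row of the hypothetical table: (sensor, temperature).
    static final class Row {
        final String sensor;
        final double temperature;

        Row(String sensor, double temperature) {
            this.sensor = sensor;
            this.temperature = temperature;
        }
    }

    private final List<Row> rows = new ArrayList<>();
    private final Map<String, double[]> sumAndCount = new HashMap<>();

    // A new stream event is just a new row at the end of the table.
    void append(String sensor, double temperature) {
        rows.add(new Row(sensor, temperature));
        double[] acc = sumAndCount.computeIfAbsent(sensor, k -> new double[2]);
        acc[0] += temperature; // running sum
        acc[1] += 1;           // running count
    }

    // Batch-style query: scan the whole table.
    double averageByScan(String sensor) {
        double sum = 0;
        int n = 0;
        for (Row row : rows) {
            if (row.sensor.equals(sensor)) {
                sum += row.temperature;
                n++;
            }
        }
        return n == 0 ? Double.NaN : sum / n;
    }

    // Streaming-style query: same answer, maintained incrementally.
    double averageIncremental(String sensor) {
        double[] acc = sumAndCount.get(sensor);
        return acc == null ? Double.NaN : acc[0] / acc[1];
    }

    public static void main(String[] args) {
        AppendOnlyTable table = new AppendOnlyTable();
        table.append("s1", 20.0);
        table.append("s1", 22.0);
        System.out.println(table.averageByScan("s1") + " " + table.averageIncremental("s1"));
    }
}
```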
Apache Spark: Machine Learning
Spark MLlib has built-in algorithms:
● Classification: logistic regression, decision trees, support vector machines, …
● Regression
● Clustering: K-Means, LDA, GMM, …
● Collaborative Filtering
● …
Available for RDD and Dataframe/Datasets (incomplete)
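As an idea of what one of these algorithms does, here is a tiny one-dimensional K-Means in plain Java. It is a teaching sketch under stated assumptions, not MLlib's implementation: the class `KMeans1D` is hypothetical, the initial centroids are simply the first k points, and the input is assumed to have at least k points.

```java
import java.util.Arrays;

final class KMeans1D {
    // Lloyd's algorithm on 1-D points: assign each point to the nearest
    // centroid, move each centroid to the mean of its points, repeat.
    static double[] fit(double[] points, int k, int maxIterations) {
        double[] centroids = Arrays.copyOf(points, k); // naive init: first k points
        for (int iter = 0; iter < maxIterations; iter++) {
            double[] sums = new double[k];
            int[] counts = new int[k];

            // Assignment step.
            for (double p : points) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) {
                        best = c;
                    }
                }
                sums[best] += p;
                counts[best]++;
            }

            // Update step; an empty cluster keeps its previous centroid.
            boolean changed = false;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue;
                }
                double updated = sums[c] / counts[c];
                if (updated != centroids[c]) {
                    changed = true;
                }
                centroids[c] = updated;
            }
            if (!changed) {
                break; // converged
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[] centroids = fit(new double[] {1, 2, 3, 10, 11, 12}, 2, 100);
        System.out.println(Arrays.toString(centroids)); // [2.0, 11.0]
    }
}
```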
Agenda
● A brief history of Big Data
● Data processing models
● Spark on Openshift
● Demo
Openshift
Built on Kubernetes, the container orchestration platform born at Google.
● Running Containers
● Virtual Namespaces
● Virtual Networks
● Service Discovery
● Load Balancing
● Auto-Scaling
● Health-checking and auto-recovery
● Monitoring and Logging
Creating Containers
Orchestrating Containers
Kubernetes Enterprise Edition
Spark Architecture
● Driver: executes the Driver App (Main.class) and sends tasks to executors. Task = “do something on a data partition”
● Cluster Manager: assigns executors to the App
● Workers: run the Executors, which run the Tasks
[Diagram label: Oshinko (Radanalytics)]
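The driver/executor split can be mimicked on one machine with a thread pool: the "driver" creates one task per data partition, the pool's threads play the executors, and the driver combines the partial results. `MiniDriver` is a hypothetical name and this is only an analogy in plain Java, not how Spark schedules work across a cluster.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class MiniDriver {
    // Sums all partitions by sending one task per partition to a pool of
    // "executors" and then combining the partial sums.
    static long sum(List<List<Integer>> partitions, int executors) {
        ExecutorService pool = Executors.newFixedThreadPool(executors);
        try {
            List<Future<Long>> pending = new ArrayList<>();
            for (List<Integer> partition : partitions) {
                // Task = "do something on a data partition" (here: a partial sum).
                Callable<Long> task = () -> {
                    long partial = 0;
                    for (int value : partition) {
                        partial += value;
                    }
                    return partial;
                };
                pending.add(pool.submit(task));
            }

            long total = 0;
            for (Future<Long> result : pending) {
                total += result.get(); // wait for each executor's answer
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for tasks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<Integer>> partitions = Arrays.asList(
            Arrays.asList(1, 2, 3),
            Arrays.asList(4, 5),
            Arrays.asList(6));
        System.out.println(sum(partitions, 2)); // 21
    }
}
```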
Agenda
● A brief history of Big Data
● Data processing models
● Spark on Openshift
● Demo
Demo
You’ll see:
● Apache Spark on Openshift with Oshinko
● Kafka on Openshift (EnMasse)
● Spring-Boot + Apache Camel simulator
Sources and instructions available here:
https://github.com/nicolaferraro/iot-day-napoli-2017-demo
Thanks!
Questions?
@ni_ferraro
