1
Introduction to Spark
Gwen Shapira, Solutions Architect
Spark is next-generation MapReduce
2
MapReduce has been around for a while
It made distributed compute easier
But, can we do better?
3
MapReduce Issues
• Launching mappers and reducers takes time
• One MR job can rarely do a full computation
• Writing to disk (in triplicate!) between each job
• Going back to queue between jobs
• No in-memory caching
• No iterations
• Very high latency
• Not the greatest APIs either
4
Spark:
Easy to Develop, Fast to Run
5
Spark Features
• In-memory cache
• General execution graphs
• APIs in Scala, Java and Python
• Integrates with Hadoop but does not depend on it
6
Why is it better?
• (Much) Faster than MR
• Iterative programming – Must have for ML
• Interactive – allows rapid exploratory analytics
• Flexible execution graph:
• Map, map, reduce, reduce, reduce, map
• High productivity compared to MapReduce
7
Word Count
file = spark.textFile("hdfs://…")
file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
Remember MapReduce WordCount?
8
Agenda
• Concepts
• Examples
• Streaming
• Summary
9
10
Concepts
AMP Lab BDAS
11
CDH5 (simplified)
12
[Stack diagram: HDFS with an in-memory cache at the bottom, YARN above it, Spark / MapReduce / Impala as execution engines, and Spark Streaming and MLlib running on Spark]
How Spark runs on a Cluster
13
[Diagram: a Driver program coordinating several Workers, each holding a partition of the data cached in RAM]
Workflow
• SparkContext in driver connects to Master
• Master allocates resources for app on cluster
• SC acquires executors on worker nodes
• SC sends the app code (JAR) to executors
• SC sends tasks to executors
14
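To make the workflow concrete, here is a minimal driver-side sketch (not from the original deck). SparkConf and SparkContext are the standard API; the master URL, app name and JAR path are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

// The driver creates the SparkContext; the context asks the master for
// executors on the workers and ships the application JAR to them.
val conf = new SparkConf()
  .setMaster("spark://master-host:7077")    // placeholder master URL
  .setAppName("MyApp")                      // placeholder app name
  .setJars(Seq("target/myapp.jar"))         // placeholder application JAR
val sc = new SparkContext(conf)

// Every subsequent RDD operation is split into tasks that the
// SparkContext schedules on those executors.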
RDD – Resilient Distributed Dataset
• Collection of elements
• Read-only
• Partitioned
• Fault-tolerant
• Supports parallel operations
15
RDD Types
• Parallelized Collection
• Parallelize(Seq)
• HDFS files
• Text, Sequence or any InputFormat
• Both support the same operations (see the sketch below)
16
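For illustration (assuming sc is a SparkContext, as in the earlier workflow sketch; the HDFS path is a placeholder), the two flavors are created like this and then accept the same operations:

// Parallelized collection: distribute a local Seq across the cluster
val nums  = sc.parallelize(Seq(1, 2, 3, 4, 5))

// HDFS file: one element per line, read through TextInputFormat
val lines = sc.textFile("hdfs://namenode:8020/data/input.txt")

// Both kinds of RDD support the same operations
nums.filter(_ % 2 == 0).count()
lines.filter(_.nonEmpty).count()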
Operations
Transformations
• Map
• Filter
• Sample
• Join
• ReduceByKey
• GroupByKey
• Distinct
Actions
• Reduce
• Collect
• Count
• First, Take
• SaveAs
• CountByKey
17
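A small sketch contrasting the two columns above (assuming sc is a SparkContext; the path is a placeholder): transformations only describe new RDDs, while the actions at the end trigger the work.

val lines = sc.textFile("hdfs://namenode:8020/data/input.txt")   // placeholder path

// Transformations: each returns a new RDD; nothing executes yet
val words  = lines.flatMap(_.split(" "))
val pairs  = words.map(w => (w, 1))
val counts = pairs.reduceByKey(_ + _)

// Actions: trigger the computation and return results to the driver
val top = counts.take(10)
val n   = counts.count()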
Transformations are lazy
18
Lazy transformation
19
Find all lines that mention "MySQL"
Keep only the timestamp portion of the line
Set the date and hour as the key, 1 as the value
Now reduce by key and sum the values
Return the result as an Array so I can print it
[Diagram callouts: the transformations just note "Find lines, get timestamp…"; only when the action arrives does the worker say "Aha! Finally something to do!" — sketched in code below]
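The pipeline narrated above might look like this (a sketch, not the deck's code; the log path, the position of the timestamp field and its ISO format are assumptions):

val logs = sc.textFile("hdfs://namenode:8020/logs/app.log")   // placeholder path

val mysqlPerHour = logs
  .filter(_.contains("MySQL"))          // lines that mention MySQL
  .map(_.split(" ")(0))                 // assume field 0 is an ISO timestamp
  .map(ts => (ts.take(13), 1))          // "yyyy-MM-ddTHH" = date + hour as the key
  .reduceByKey(_ + _)                   // sum the 1s per hour
  .collect()                            // Array[(String, Int)] back to the driver

mysqlPerHour.foreach(println)

Nothing executes until the collect() at the end; the transformations above it only extend the lineage.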
Persistence / Caching
• Store RDD in memory for later use
• Each node persists a partition
• persist() marks an RDD for caching
• It will be cached the first time an action is performed
• Use for iterative algorithms
20
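For example (a sketch; the path and record format are assumptions), caching pays off once more than one action touches the same RDD:

val points = sc.textFile("hdfs://namenode:8020/data/points.txt")   // placeholder path
  .map(_.split(",").map(_.toDouble))
  .cache()                        // same as persist() with the default MEMORY_ONLY level

points.count()                    // first action: reads HDFS and fills the cache
points.map(_.sum).reduce(_ + _)   // later actions: read the cached partitions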
Caching – Storage Levels
• MEMORY_ONLY
• MEMORY_AND_DISK
• MEMORY_ONLY_SER
• MEMORY_AND_DISK_SER
• DISK_ONLY
• MEMORY_ONLY_2, MEMORY_AND_DISK_2…
21
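Selecting a non-default level is an explicit persist() call; a sketch using the standard StorageLevel constants, with a placeholder path:

import org.apache.spark.storage.StorageLevel

val sessions = sc.textFile("hdfs://namenode:8020/data/sessions")   // placeholder path

// Spill partitions that don't fit in memory to disk,
// instead of recomputing them from the source on the next access.
sessions.persist(StorageLevel.MEMORY_AND_DISK)

// Serialized in memory, each partition replicated on two nodes:
// sessions.persist(StorageLevel.MEMORY_ONLY_SER_2)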
Fault Tolerance
• Lost partitions can be re-computed from source data
• Because we remember all transformations
22
msgs = (textFile.filter(lambda s: s.startswith("ERROR"))
                .map(lambda s: s.split("\t")[2]))
[Lineage diagram: HDFS File → filter(func = startsWith(…)) → Filtered RDD → map(func = split(...)) → Mapped RDD]
23
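The remembered lineage can be inspected directly with RDD.toDebugString; a sketch mirroring the example above (the path is a placeholder):

val errors = sc.textFile("hdfs://namenode:8020/logs/app.log")   // placeholder path
  .filter(_.startsWith("ERROR"))
  .map(_.split("\t")(2))

// Prints the chain HDFS file -> filtered RDD -> mapped RDD;
// a lost partition is rebuilt by replaying exactly this chain.
println(errors.toDebugString)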
Examples
Word Count
file = spark.textFile("hdfs://…")
file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
Remember MapReduce WordCount?
24
Log Mining
• Load error messages from a log into memory
• Interactively search for patterns
25
Log Mining
lines = spark.textFile("hdfs://…")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split("\t")(2))
cachedMsgs = messages.cache()
cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
…
26
[Annotations: lines is the base RDD; errors, messages and cachedMsgs are transformed RDDs; count is the action]
Logistic Regression
• Read two sets of points
• Look for a plane W that separates them
• Perform gradient descent:
• Start with random W
• On each iteration, sum a function of W over the data
• Move W in a direction that improves it
27
Intuition
28
Logistic Regression
val points = spark.textFile(…).map(parsePoint).cache()
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = points.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final separating plane: " + w)
29
Conviva Use-Case
• Monitor online video consumption
• Analyze trends
Need to run tens of queries like this a day:
SELECT videoName, COUNT(1)
FROM summaries
WHERE date='2011_12_12' AND customer='XYZ'
GROUP BY videoName;
30
Conviva With Spark
val sessions = sparkContext.sequenceFile[SessionSummary, NullWritable](pathToSessionSummaryOnHdfs)
val cachedSessions = sessions.filter(whereConditionToFilterSessions).cache
val mapFn : SessionSummary => (String, Long) = { s => (s.videoName, 1) }
val reduceFn : (Long, Long) => Long = { (a, b) => a + b }
val results = cachedSessions.map(mapFn).reduceByKey(reduceFn).collectAsMap
31
32
Streaming
What is it?
• Extension of Spark API
• For high-throughput, fault-tolerant processing of live data streams
33
Sources & Outputs
• Kafka
• Flume
• Twitter
• JMS Queues
• TCP sockets
• HDFS
• Databases
• Dashboards
34
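A hedged sketch wiring one source to one output: read lines from a TCP socket and save each batch's word counts to HDFS. It assumes an existing SparkContext sc; the host, port and output prefix are placeholders:

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations

val ssc = new StreamingContext(sc, Seconds(5))          // 5-second micro-batches

val lines  = ssc.socketTextStream("stream-host", 9999)  // placeholder TCP source
val counts = lines.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)

// One output directory of part files per batch interval
counts.saveAsTextFiles("hdfs://namenode:8020/out/counts")   // placeholder prefix

ssc.start()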
Architecture
35
[Diagram: live input feeds the StreamingContext, which slices it into micro-batches and hands them to the SparkContext for processing]
DStreams
• Stream is broken down into micro-batches
• Each micro-batch is an RDD
• This means any Spark function or library can be applied to a stream
• Including MLlib, graph processing, etc.
36
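Because each micro-batch is an ordinary RDD, regular RDD code can run on it inside foreachRDD; a sketch, assuming events is a DStream[String] defined elsewhere:

events.foreachRDD { rdd =>
  // `rdd` is a plain RDD holding one batch interval's data,
  // so the full core API is available here.
  val parsed = rdd.map(_.split(","))
  println("records in this batch: " + parsed.count())
}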
Processing DStreams
37
Processing DStreams - Stateless
38
Processing DStreams - Stateful
39
DStream Operators
• Transformations
  Produce a new DStream from one or more parent streams
  • Stateless (independent per interval)
    map, reduce
  • Stateful (share data across intervals)
    window, incremental aggregation, time-skewed join
• Output
  Write data to an external system (e.g. save an RDD to HDFS)
  save, foreach
40
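A sketch of a stateful (windowed) aggregation next to its stateless counterpart, assuming words is a DStream[String] of individual words and a 10-second batch interval; the window and slide durations are arbitrary:

import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations

// Stateless: each batch is aggregated independently
val perBatch = words.map(w => (w, 1)).reduceByKey(_ + _)

// Stateful: a sliding 60-second window that advances every 10 seconds
val windowed = words.map(w => (w, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(10))

windowed.print()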
Fault Recovery
• Input from TCP, Flume or Kafka is stored on 2 nodes
• In case of failure:
  missing RDDs are re-computed from the surviving nodes
• RDDs are deterministic
  • So any re-computation leads to the same result
• Transformations can therefore guarantee
  exactly-once semantics, even through failures
Key Question -
How fast can the system recover?
42
Example – Streaming WordCount
import org.apache.spark.streaming.{Seconds, StreamingContext}
import StreamingContext._
...
// Create the context and set up a network input stream
val ssc = new StreamingContext(args(0), "NetworkWordCount", Seconds(1))
val lines = ssc.socketTextStream(args(1), args(2).toInt)
// Split the lines into words, count them
// print some of the counts on the master
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
// Start the computation and keep it running
ssc.start()
ssc.awaitTermination()
43
44
Shark
Shark Architecture
• Identical to Hive
• Same CLI, JDBC, SQL Parser, Metastore
• Replaced the optimizer, plan generator and the execution engine
• Added a Cache Manager
• Generates Spark code instead of MapReduce
45
Hive Compatibility
• MetaStore
• HQL
• UDF / UDAF
• SerDes
• Scripts
46
Dynamic Query Plans
• Hive MetaData often lacks statistics
• Join types often require hinting
• Shark gathers statistics per partition
• While materializing map output
• Partition sizes, record count, skew, histograms
• Alters the plan accordingly
47
Columnar Memory Store
• Better compression
• CPU efficiency
• Cache Locality
48
Spark + Shark Integration
val users = sql2rdd("SELECT * FROM user u JOIN comment c ON c.uid=u.uid")
val features = users.mapRows { row =>
new Vector(extractFeature1(row.getInt("age")),
extractFeature2(row.getStr("country")), ...)}
val trainedVector = logRegress(features.cache())
49
50
Summary
Why Spark?
• Flexible
• High performance
• Machine learning, iterative algorithms
• Interactive data exploration
• Developer productivity
51
Why not Spark?
• Still immature
• Uses *lots* of memory
• Equivalent functionality in Impala, Storm, etc.
52
How Spark Works
• RDDs – Resilient Distributed Datasets
• Lazy transformations
• Fault tolerant caching
• Streams – micro-batches of RDDs
53
54
Editor's Notes

  1. Transformations create new datasets from existing ones. Actions return a value to the driver.
  2. When you apply transformations to an RDD, they don't happen right away. Instead they are remembered and computed only when an action requires a result.
  3. Spark's storage levels are meant to provide different tradeoffs between memory usage and CPU efficiency. We recommend going through the following process to select one: If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible. If not, try using MEMORY_ONLY_SER and selecting a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access. Don't spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data; otherwise, recomputing a partition is about as fast as reading it from disk. Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition. If you want to define your own storage level (say, with a replication factor of 3 instead of 2), use the apply() factory method of the StorageLevel singleton object.
  4. Count the number of words received from a network server every second. The socketTextStream returns a DStream of lines received from a TCP socket-based source. The lines DStream is transformed into a DStream of words using the flatMap operation, where each line is split into words. This words DStream is then mapped to a DStream of (word, 1) pairs, which is finally reduced to get the word counts. wordCounts.print() will print 10 of the counts generated every second.