Building Scalable Data Pipelines
2016 DataPalooza Seattle
Evan Chan
Who am I
Distinguished Engineer, Tuplejump
@evanfchan
http://velvia.github.io
User and contributor to Spark
since 0.9
Co-creator and maintainer of
Spark Job Server
TupleJump - Big Data Dev Partners
Instant Gratification
I want insights now
I want to act on news right away
I want stuff personalized for me (?)
Fast Data, not Big Data
How Fast do you Need to Act?
Financial trading - milliseconds
Dashboards - seconds to minutes
BI / Reports - hours to days?
What’s Your App?
Concurrent video viewers
Anomaly detection
Clickstream analysis
Live geospatial maps
Real-time trend detection & learning
Common Components
Diagram: Events -> Message Queue -> Stream Processing Layer -> State / Database -> Happy Users
Example: Real-time trend detection
Events: time, OS, location, asset/product ID
Analyze 1-5 second batches of new “hot” data in the stream processor
Combine with recent and historical top K feature vectors in the database
Update the recent feature vectors in the database
Serve to users
Example 2: Smart Cities
Smart City Streaming Data
City buses - regular telemetry (position, velocity, timestamp)
Street sweepers - regular telemetry
Transactions from rail, subway, buses, smart cards
311 info
911 info - new emergencies
Citizens want to know…
Where and for how long can I park my car?
Are transportation options affected by 311 and 911 events?
How long will it take the next bus to get here?
Where is the closest bus to where I am?
Cities want to know…
How can I maximize parking revenue?
More granular updates to parking spots that don't need sweeping
How does traffic affect waiting times in public transit, and revenue?
Patterns in subway train times - is a breakdown coming?
Population movement - where should new transit routes be placed?
Diagram: 311, 911, bus, and metro feeds -> Message Queue -> Stream Processing Layer -> event storage, short-term telemetry, and models -> ad-hoc queries and dashboards
The HARD Principle
Highly Available, Resilient, Distributed
Flexibility - do as many transformations as possible with as few components as possible
Real-time: “NoETL”
Community: best of breed OSS projects with huge adoption and commercial support
Message Queue
Diagram: Events -> Message Queue -> Stream Processing Layer -> State / Database -> Happy Users
Why a message queue?
Centralized publish-subscribe of events
Need more processing? Add another consumer
Buffer traffic spikes
Replay events in cases of failure
Message Queues help distribute data
Diagram: four inputs publish into key-range partitions (A-F, G-M, N-S, T-Z), each consumed by its own processing task
Intro to Apache Kafka
Kafka is a distributed publish subscribe system
It uses a commit log to track changes
Kafka was originally created at LinkedIn
Open sourced in 2011
Graduated to a top-level Apache project in 2012
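A minimal producer sketch using the standard Kafka Java client from Scala; the broker address, topic name, and message are made up for illustration:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")  // hypothetical broker
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
// Publish one event to a topic; every consumer group subscribed to it gets a copy
producer.send(new ProducerRecord[String, String]("bus-telemetry", "bus-42", "speed=31.5"))
producer.close()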
On being HARD
Many Big Data projects are open source implementations of closed source products
Unlike Hadoop, HBase or Cassandra, Kafka isn't actually a clone of an existing closed source product
The same codebase has been used for years at LinkedIn, which answers the questions:
Does it scale?
Is it robust?
Ad Hoc ETL
Decoupled ETL
Avro Schemas And Schema Registry
Keys and values in Kafka can be Strings or byte arrays
Avro is a serialization format used extensively with Kafka and Big Data
Kafka uses a Schema Registry to keep track of Avro schemas
Verifies that the correct schemas are being used
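A hedged producer-side sketch, assuming Confluent's KafkaAvroSerializer and its schema.registry.url config; the schema, topic, and endpoints are illustrative:

import java.util.Properties
import org.apache.avro.Schema
import org.apache.avro.generic.GenericData
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Illustrative Avro schema for a bus telemetry event
val schema = new Schema.Parser().parse(
  """{"type":"record","name":"BusEvent","fields":[
    |  {"name":"busUuid","type":"string"},
    |  {"name":"speed","type":"double"}]}""".stripMargin)

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
// Confluent's Avro serializer registers schemas with the registry and validates records (assumed config)
props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer")
props.put("schema.registry.url", "http://localhost:8081")

val event = new GenericData.Record(schema)
event.put("busUuid", "bus-42")
event.put("speed", 31.5)

val producer = new KafkaProducer[String, Object](props)
producer.send(new ProducerRecord[String, Object]("bus-telemetry", "bus-42", event))
producer.close()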
Consumer Groups
Commit Logs
Kafka Resources
Official docs - https://kafka.apache.org/documentation.html
The Design section is a really good read
http://www.confluent.io/product
Includes schema registry
Stream Processing
Diagram: Events -> Message Queue -> Stream Processing Layer -> State / Database -> Happy Users
Types of Stream Processors
Event by Event: Apache Storm, Apache Flink, Intel GearPump, Akka
Micro-batch: Apache Spark
Hybrid? Google Dataflow
Apache Storm and Flink
Transform one message at a time
Very low latency
State and more complex analytics are difficult
Akka and GearPump
Actor to actor messaging. Local state.
Used for extreme low latency (ad networks, etc)
Dynamically reconfigurable topology
Configurable fault tolerance and failure recovery
Cluster or local mode - you don’t always need distribution!
Spark Streaming
Data processed as a stream of micro-batches
Higher latency (seconds), higher throughput, more complex analysis / ML possible
Same programming model as batch
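A minimal micro-batch sketch to show the batch-like programming model; the socket source and 5-second interval are just for illustration:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// sc is an existing SparkContext; process the stream in 5-second micro-batches
val ssc = new StreamingContext(sc, Seconds(5))

val lines = ssc.socketTextStream("localhost", 9999)  // illustrative source
val counts = lines.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)                 // same API style as batch RDDs
counts.print()

ssc.start()
ssc.awaitTermination()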
Why Spark?
file = spark.textFile("hdfs://...")
 
file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }

}
Spark Production Deployments
Explosion of Specialized Systems
Spark and Berkeley AMP Lab
Benefits of Unified Libraries
Optimizations can be shared between libraries
Diagram: Core, MLlib, and Spark Streaming share optimizations such as Project Tungsten, shared statistics libraries, and GC / memory management
Mix and match modules
Easily go from DataFrames (SQL) to MLLib / statistics, for example:
scala> import org.apache.spark.mllib.stat.Statistics
scala> val numMentions = df.select("NumMentions").map(row => row.getInt(0).toDouble)
numMentions: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[100] at map at DataFrame.scala:848
scala> val numArticles = df.select("NumArticles").map(row => row.getInt(0).toDouble)
numArticles: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[104] at map at DataFrame.scala:848
scala> val correlation = Statistics.corr(numMentions, numArticles, "pearson")
Spark Worker Failure
Rebuild RDD Partitions on Worker from Lineage
Spark SQL & DataFrames
DataFrames & Catalyst Optimizer
Catalyst Optimizations
Column and partition pruning (Column filters)
Predicate pushdowns (Row filters)
Spark SQL Data Sources API
Enables custom data sources to participate in SparkSQL = DataFrames + Catalyst
Production implementations:
spark-csv (Databricks)
spark-avro (Databricks)
spark-cassandra-connector (DataStax)
elasticsearch-hadoop (Elastic.co)
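For example, a hedged read through the spark-csv data source with the Spark 1.x DataFrame API; the path and columns are made up. Catalyst handles the column pruning and filters, and sources such as the Cassandra connector can push them down:

// sqlContext is an existing SQLContext
val df = sqlContext.read
  .format("com.databricks.spark.csv")   // spark-csv data source
  .option("header", "true")
  .option("inferSchema", "true")
  .load("hdfs://.../bus_events.csv")    // hypothetical path

// Only the selected columns and matching rows are needed downstream
df.select("bus_uuid", "speed").filter(df("speed") > 50).show()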
Spark Streaming
Streaming Sources
Basic: Files, Akka actors, queues of RDDs, Socket
Advanced
Kafka
Kinesis
Flume
Twitter firehose
DStreams = micro-batches
Streaming Fault Tolerance
Incoming data is replicated to one other node
Write Ahead Log for sources that support ACKs
Checkpointing for recovery if the Driver fails
Direct Kafka Streaming: KafkaRDD
No single Receiver
Parallelizable
No Write Ahead Log
Kafka *is* the Write Ahead Log!
KafkaRDD stores Kafka offsets
KafkaRDD partitions recover from offsets
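A sketch of the direct (receiver-less) stream in Spark 1.x, assuming the spark-streaming-kafka integration; the broker list and topic are placeholders, and ssc is a StreamingContext:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")  // placeholder brokers
val topics = Set("bus-telemetry")                                   // placeholder topic

// One KafkaRDD partition per Kafka partition; recovery replays from stored offsets
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.map { case (_, value) => value }.count().print()  // values only, counted per micro-batch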

Spark MLlib & GraphX
Spark MLlib Common Algos
Classifiers: DecisionTree, RandomForest
Clustering: K-Means, Streaming K-Means
Collaborative Filtering: Alternating Least Squares (ALS)
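A minimal MLlib sketch of the clustering case (K-Means over an RDD of feature vectors); the input file is hypothetical:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse whitespace-separated feature vectors from a hypothetical file
val features = sc.textFile("hdfs://.../features.txt")
                 .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
                 .cache()

val model = KMeans.train(features, 10, 20)  // k = 10 clusters, 20 iterations
val wssse = model.computeCost(features)     // within-set sum of squared errors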
Spark Text Processing Algos
TF/IDF
LDA
Word2Vec
*Pro-Tip: Use Stanford CoreNLP!
Spark ML Pipelines
Modeled after scikit-learn
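A scikit-learn-style pipeline sketch with the spark.ml API; the training/test DataFrames and their "text" / "label" columns are assumptions for illustration:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Stages chain like scikit-learn transformers plus a final estimator
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr        = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

val model  = pipeline.fit(trainingDF)   // trainingDF: assumed DataFrame with "text" and "label"
val scored = model.transform(testDF)    // adds prediction columns to the assumed test set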
Spark GraphX
PageRank - top influencers
Connected Components - measure of clusters
Triangle Counting - measure of cluster density
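A PageRank sketch with GraphX; the edge-list file of follower pairs is hypothetical:

import org.apache.spark.graphx.GraphLoader

// Each line of the file is one edge: "srcVertexId dstVertexId"
val graph = GraphLoader.edgeListFile(sc, "hdfs://.../followers.txt")

val ranks = graph.pageRank(0.0001).vertices        // run to the given convergence tolerance
ranks.top(10)(Ordering.by(_._2)).foreach(println)  // top influencers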
Handling State
Diagram: Events -> Message Queue -> Stream Processing Layer -> State / Database -> Happy Users
What Kind of State?
Non-persistent / in-memory: concurrent viewers
Short term: latest trends
Longer term: raw event & aggregate storage
ML Models, predictions, scored data
Spark RDDs
Immutable, cache in memory and/or on disk
Spark Streaming: UpdateStateByKey
IndexedRDD - can update bits of data
Snapshotting for recovery
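A sketch of keeping running per-key totals across micro-batches with updateStateByKey; the source, key format, and checkpoint path are placeholders:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))
ssc.checkpoint("hdfs://.../checkpoints")                  // required for stateful operations

val events = ssc.socketTextStream("localhost", 9999)      // placeholder source
val pairs  = events.map(line => (line.split(",")(0), 1L)) // (key, 1) per event

// Merge this batch's counts into the running total for each key
val updateTotals: (Seq[Long], Option[Long]) => Option[Long] =
  (newCounts, state) => Some(newCounts.sum + state.getOrElse(0L))

val runningTotals = pairs.updateStateByKey(updateTotals)
runningTotals.print()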
• Massively Scalable
• High Performance
• Always On
• Masterless
Scale
Apache Cassandra
• Scales Linearly to as many nodes as you need
• Scales whenever you need
Performance
Apache Cassandra
• It’s Fast
• Built to sustain massive data insertion rates in irregular, spiky patterns
Fault Tolerance & Availability
Apache Cassandra
• Automatic Replication
• Multi Datacenter
• Decentralized - no single point of failure
• Survive regional outages
• New nodes automatically add themselves to the cluster
• DataStax drivers automatically discover new nodes
Architecture
Apache Cassandra
• Distributed, Masterless Ring Architecture
• Network Topology Aware
• Flexible, Schemaless - your data structure can evolve seamlessly over time
To download: https://cassandra.apache.org/download/
https://github.com/pcmanus/ccm - highly recommended for local testing/cluster setup
Cassandra Data Modeling
Primary key = (partition keys, clustering keys)
Fast queries = fetch single partition
Range scans by clustering key
Must model for query patterns
Diagram: a table laid out as Partition 1-3 rows by Clustering 1-3 columns
City Bus Data Modeling Example
Primary key = (Bus UUID, timestamp)
Easy queries: location and speed of a single bus for a range of time
Can also query most recent location + speed of all buses (slower)
Diagram: one wide row per bus (Bus A, B, C), with clustering columns at 1000 s, 1010 s, 1020 s holding speed and GPS
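A hedged sketch with the DataStax spark-cassandra-connector, assuming a city.bus_telemetry table with PRIMARY KEY (bus_uuid, ts) and an RDD of readings produced upstream:

import com.datastax.spark.connector._

case class BusReading(busUuid: String, ts: Long, speed: Double, lat: Double, lon: Double)

// readings: RDD[BusReading] from the streaming job (assumed)
readings.saveToCassandra("city", "bus_telemetry",
  SomeColumns("bus_uuid", "ts", "speed", "lat", "lon"))

// Fast query: a single partition (one bus) over a time range
val lastHour = sc.cassandraTable[BusReading]("city", "bus_telemetry")
  .where("bus_uuid = ? AND ts > ?", "bus-42", System.currentTimeMillis - 3600000L)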
Using Cassandra for Short Term Storage
The idea is to store and read small values
Idempotent writes + huge write capacity = ideal for streaming ingestion
For example, store the last few (latest + last N) snapshots of buses, taxi locations, recent traffic info
But Mommy! What about longer term data?
I need to read lots of data, fast!!
- Ad hoc analytics of events
- More specialized / geospatial
- Building ML models from large quantities of data
- Storing scored/classified data from models
- OLAP / Data Warehousing
Can Cassandra Handle Batch?
Cassandra tables are much better at lots of small reads than big data scans
You CAN store data efficiently in C*
Files seem easier for long term storage and analysis
But are files compatible with streaming?
Lambda Architecture
Lambda is Hard and Expensive
Very high TCO - Many moving parts - KV store, real time, batch
Lots of monitoring, operations, headache
Running similar code in two places
Lower performance - lots of shuffling data, network hops, translating domain objects
Reconcile queries against two different places
NoLambda
A unified system
Real-time processing and reprocessing
No ETLs
Fault tolerance
Everything is a stream
Can Cassandra do batch and ad-hoc?
Yes, it can be competitive with Hadoop actually….
If you know how to be creative with storing your data!
Tuplejump/SnackFS - HDFS for Cassandra
github.com/tuplejump/FiloDB - analytics database
Store your data using Protobuf / Avro / etc.
Introduction to FiloDB
Efficient columnar storage - 5-10x better
Scan speeds competitive with Parquet - 100x faster than regular Cassandra tables
Very fine grained filtering for sub-second concurrent queries
Easy BI and ad-hoc analysis via Spark SQL/Dataframes (JDBC etc.)
Uses Cassandra for robust, proven storage
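A hedged sketch of ad-hoc queries over a FiloDB dataset through Spark SQL; the "filodb.spark" format name and "dataset" option are assumed from the FiloDB docs of this era, and the dataset and columns are made up:

// sqlContext is an existing SQLContext; the dataset is assumed to be ingested already
val events = sqlContext.read
  .format("filodb.spark")             // assumed FiloDB data source name
  .option("dataset", "bus_events")    // hypothetical dataset
  .load()

events.registerTempTable("bus_events")
sqlContext.sql("SELECT bus_uuid, avg(speed) FROM bus_events GROUP BY bus_uuid").show()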
Combining FiloDB + Cassandra
Regular Cassandra tables for highly concurrent, aggregate / key-value lookups (dashboards)
FiloDB + C* + Spark for efficient long term event storage
Ad hoc / SQL / BI
Data source for MLLib / building models
Data storage for classified / predicted / scored data
Diagram: Events -> Message Queue -> Spark Streaming -> Cassandra (short term storage, K-V; FiloDB: events, ad-hoc, batch) -> Spark (ad hoc, SQL, ML) -> dashboards, maps
Diagram: Events -> Message Queue -> Spark Streaming (models) -> Cassandra with FiloDB for long term event storage -> Spark -> learned data
FiloDB + Cassandra
Robust, peer to peer, proven storage platform
Use for short term snapshots, dashboards
Use for efficient long term event storage & ad hoc querying
Use as a source to build detailed models
Thank you!
@evanfchan
http://tuplejump.com
