Streaming Algorithms
Eric Fu
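Streaming algorithms have to be approximate
• Space cost <= O(log N), where N is the stream length
• Time cost <= O(N), i.e. a single pass over the stream
• Online: each item is processed as it arrives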
Problem Statement
In limited space, in one pass over a sequence of items, compute the following:
• min, max, average
• standard deviation
• moving average
• Cardinality (count of distinct items in a stream)
• Heavy hitters (i.e. the most frequent items)
• Order statistics (rank of an item in the sorted sequence)
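The simplest items on this list (min, max, average, standard deviation) can be computed exactly in one pass with constant memory. A minimal sketch, assuming Welford's update rule for the variance; the class name `RunningStats` and its fields are illustrative and not from the slides:

```python
import math


class RunningStats:
    """One-pass min, max, mean and standard deviation (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.min = math.inf
        self.max = -math.inf

    def update(self, x):
        self.n += 1
        self.min = min(self.min, x)
        self.max = max(self.max, x)
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stddev(self):
        # Population standard deviation; 0.0 for an empty stream.
        return math.sqrt(self.m2 / self.n) if self.n else 0.0


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(value)
print(stats.min, stats.max, stats.mean, stats.stddev())
```

Only five numbers are stored regardless of the stream length, which is why these statistics are not usually treated as hard streaming problems.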
Key Tricks
• hashing
• sketching
HyperLogLog
Cardinality (count of distinct items in a stream)
Repeated Minimum
Bits emitted by a hash
Across the hashes of all items, count how often the hash ends in a '1' bit followed by a long run of zeros.
Bit patterns
For num in [1, 1000]:
    h = hash(num)

Hash ends in (binary)    Count (out of 1000)
0                        530
10                       281
100                      140
1000                     53
10000                    28
100000                   9
1000000                  12
10000000                 5
100000000                2
1000000000               0
10000000000              0
100000000000             0
A '1' bit followed by 9 or more zeros was never observed, because 1000 ~ 2^10.
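The experiment above can be reproduced roughly as follows. This is a sketch under assumptions: the slides do not say which hash was used, so `hash32` (the first 4 bytes of MD5) is a hypothetical stand-in, and the exact counts will differ from the table.

```python
import hashlib
from collections import Counter


def hash32(item):
    """Hypothetical stand-in hash: first 4 bytes of MD5 as an unsigned int."""
    digest = hashlib.md5(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def trailing_zeros(x, width=32):
    """Number of zero bits below the lowest set bit (width if x == 0)."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1


# counts[k] = number of hashes ending in a '1' followed by exactly k zeros
counts = Counter(trailing_zeros(hash32(n)) for n in range(1, 1001))
for k in sorted(counts):
    print(f"1 followed by {k} zeros: {counts[k]}")
```

With a well-mixed hash, each extra trailing zero should roughly halve the count, so the longest run seen is about log2 of the number of distinct items.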
Flajolet-Martin sketch algo
1. For each item
2.     index = position of the rightmost '1' bit in hash(item)
3.     bitmap[index] = 1
(at this point, bitmap = "000...00000101011111")
4. Estimated N ~ 2^R, where R = position of the rightmost '0' bit in bitmap
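The steps above can be sketched in a few lines. This is an illustrative single-bitmap version, not the author's implementation: `hash32` is a hypothetical MD5-based stand-in for the hash, and the estimate divides by 0.77351, the correction constant from Flajolet and Martin's analysis.

```python
import hashlib

PHI = 0.77351  # correction constant from the Flajolet-Martin analysis


def hash32(item):
    """Hypothetical stand-in hash: first 4 bytes of MD5 as an unsigned int."""
    digest = hashlib.md5(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def rho(x, width=32):
    """Zero-based position of the rightmost '1' bit (width - 1 if x == 0)."""
    if x == 0:
        return width - 1
    return (x & -x).bit_length() - 1


class FMSketch:
    def __init__(self, width=32):
        self.width = width
        self.bitmap = 0  # bit i is set once some hash had its rightmost '1' at i

    def add(self, item):
        self.bitmap |= 1 << rho(hash32(item), self.width)

    def estimate(self):
        # R = position of the rightmost '0' bit in the bitmap
        r = 0
        while self.bitmap & (1 << r):
            r += 1
        return (2 ** r) / PHI


sketch = FMSketch()
for n in range(1, 1001):
    sketch.add(n)
print(round(sketch.estimate()))
```

Duplicates never change the bitmap, so the sketch counts distinct items only. A single bitmap can only return estimates of the form 2^R / 0.77351, so the variance is high.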
Further improvements: split the stream into M substreams and take the harmonic mean of their counters, use a 64-bit hash instead of a 32-bit one, and apply correction factors to the estimate at the low and high ends of the range.
Comparison between 3 different versions
* My FM-sketch implementation is incomplete; the actual algorithm is not that bad.
[Figure: X axis = actual cardinality, Y axis = estimated cardinality]
What is a sketch?
• A sketch maintains one or more "random variables" which provide answers that are probabilistically accurate.
• In HyperLogLog-style sketches, this random variable is a bit position observed in the hashes (for the Flajolet-Martin bitmap, the position of the rightmost zero). From it, the sketch roughly estimates the actual cardinality of the set.
• A sketch uses a universal hash function to distribute data uniformly.
• To reduce variance, it may use many pairwise-independent hashes and average their results.
* Not all random variables are normally distributed. The picture above is only a visualization aid.
MJRTY
Find majority
How would you determine the majority element of this sequence?
A A A C C B B C C C B C C
https://www.cs.utexas.edu/~moore/best-ideas/mjrty/index.html
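MJRTY is the Boyer-Moore majority vote algorithm described at the link above. A minimal sketch on the example sequence; the function name `majority` is illustrative, and the explicit count at the end stands in for the verification pass that is needed when a majority is not guaranteed to exist:

```python
def majority(stream):
    """Return the only possible majority candidate, using O(1) memory."""
    candidate, count = None, 0
    for item in stream:
        if count == 0:
            candidate, count = item, 1
        elif item == candidate:
            count += 1
        else:
            count -= 1
    return candidate


sequence = "A A A C C B B C C C B C C".split()
candidate = majority(sequence)
# Second pass: confirm the candidate really occurs more than half the time.
is_majority = sequence.count(candidate) > len(sequence) // 2
print(candidate, is_majority)  # prints "C True"
```

Here C occurs 7 times out of 13, so it survives the pairwise cancellation of unequal items.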
Count-Min Sketch
Heavy hitters (find most frequent items)
[Figure: Bloom filter compared with Count-Min sketch, showing insert and probe]
For heavy hitters, an additional heap is needed to keep track of the items whose estimated counts are highest.
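A minimal Count-Min sketch showing the insert and probe operations. This is a sketch under assumptions: the `width` and `depth` defaults are arbitrary, and the per-row hashes are derived by salting MD5 with the row number, which is a convenience for the example rather than a pairwise-independent family.

```python
import hashlib


class CountMinSketch:
    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _slots(self, item):
        # One slot per row; the row number salts the hash (an assumption).
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def insert(self, item, count=1):
        for row, col in self._slots(item):
            self.table[row][col] += count

    def probe(self, item):
        # Collisions only inflate counters, so the minimum is the best guess.
        return min(self.table[row][col] for row, col in self._slots(item))


cms = CountMinSketch()
for word in ["a"] * 5 + ["b"] * 2:
    cms.insert(word)
print(cms.probe("a"), cms.probe("b"), cms.probe("never seen"))
```

The estimate never undercounts. To report heavy hitters, one option is to probe each item right after inserting it and keep the items with the largest estimates in a small heap.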
Leaky counters
Heavy hitters (find most frequent items)
Observation
Karp
1. Keep a frequency Map<item, count>
2. For each v in sequence
3.     increment Map[v].count
4.     if Map.size() > threshold
5.         for each element in Map
6.             decrement Map[element].count
7.             if count is zero, delete Map[element]
The algorithm has a second pass to adjust counts. The paper discusses additional optimizations.
Implemented in Apache Spark; see DataFrameStatFunctions.freqItems().
Maintain a truncated histogram
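The numbered steps translate almost line for line into Python. A minimal sketch of the first pass only; the function name `frequent_items` and the example stream are illustrative:

```python
def frequent_items(stream, threshold):
    """Keep at most `threshold` counters; counts are lower bounds."""
    counts = {}
    for v in stream:
        counts[v] = counts.get(v, 0) + 1
        if len(counts) > threshold:
            # Too many counters: leak one unit from every counter.
            for key in list(counts):
                counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]
    return counts


stream = ["a"] * 6 + ["b"] * 3 + ["c", "d"]
print(frequent_items(stream, threshold=2))  # prints {'a': 4, 'b': 1}
```

The surviving counts underestimate the true frequencies (6 and 3 here), which is why a second pass over the data is needed for exact counts.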
Frugal Streaming
Order statistics (rank of an item in sorted sequence)
Order statistics terminology
Given sorted sequence [1, 1, 1, 2, 3]
1. 0-quantile = minimum
2. 0.25-quantile = 1st quartile = 25th percentile
3. 0.50-quantile = 2nd quartile = 50th percentile = median
4. 0.75-quantile = 3rd quartile = 75th percentile
5. 1-quantile = maximum
Order statistics offline algorithm
• There exists an offline, exact algorithm to find the kth smallest item in a set
• Quickselect (Hoare), which is effectively a truncated quicksort
• Runs in linear time on average; choosing the pivot by median of medians (Blum et al.) makes the worst case linear too
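A minimal sketch of Quickselect with a random pivot, for comparison with the streaming approaches. It needs the whole input in memory, which is exactly what a streaming algorithm cannot afford. The function name and the zero-based `k` convention are choices made for this example:

```python
import random


def quickselect(items, k):
    """Return the k-th smallest element (k is zero-based)."""
    if not 0 <= k < len(items):
        raise IndexError("k out of range")
    items = list(items)
    while True:
        pivot = random.choice(items)
        lows = [x for x in items if x < pivot]
        pivots = [x for x in items if x == pivot]
        if k < len(lows):
            items = lows  # answer is among the smaller elements
        elif k < len(lows) + len(pivots):
            return pivot  # answer equals the pivot
        else:
            k -= len(lows) + len(pivots)
            items = [x for x in items if x > pivot]


print(quickselect([1, 1, 1, 2, 3], 2))  # median of the example: prints 1
```

Unlike quicksort, only the partition containing the kth item is processed further, which gives expected linear time.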
Frugal streaming
1. median_est = 0
2. For v in stream
3.     if (v > median_est)
4.         increment median_est
5.     else if (v < median_est)
6.         decrement median_est
Memory = log(N) bits, where N = cardinality
Caveat: the reported median may not be in the stream
Performs poorly on sorted data
Works best if stream items are independent and random
The estimate drifts in the direction of the true median.
The probability of drifting away after reaching the true median is low.
The paper discusses an extension to compute other quantiles.
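The frugal median estimator is short enough to run directly. A minimal sketch, assuming integer stream values and a step size of 1; the test stream (uniform integers below 100, fixed seed) is an assumption for the example:

```python
import random


def frugal_median(stream, start=0):
    """Track a median estimate with a single integer of state."""
    estimate = start
    for v in stream:
        if v > estimate:
            estimate += 1
        elif v < estimate:
            estimate -= 1
    return estimate


rng = random.Random(0)
stream = [rng.randrange(100) for _ in range(10_000)]
print(frugal_median(stream))       # should settle near 50
print(frugal_median(range(1000)))  # sorted input: prints 999
```

The second call illustrates the sorted-data caveat: on an ascending stream the estimate chases the latest value and ends at 999, far from the true median of about 500.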
[Figure: an example stream with the true median and the frugal estimate at each step]
Frugal streaming (75%)
1. quantile_75 = 0
2. For v in stream
3.     r = random()
4.     if (v > quantile_75 && r > 1 - 0.75)
5.         increment quantile_75
6.     else if (v < quantile_75 && r > 0.75)
7.         decrement quantile_75
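The same idea generalizes from 0.75 to any quantile q by replacing the constant. A minimal sketch; the parameter names and the seeded test stream are assumptions for the example:

```python
import random


def frugal_quantile(stream, q, rng, start=0):
    """Frugal estimate of the q-quantile (0 < q < 1) with one integer of state."""
    estimate = start
    for v in stream:
        r = rng.random()
        if v > estimate and r > 1 - q:
            estimate += 1  # move up with probability q
        elif v < estimate and r > q:
            estimate -= 1  # move down with probability 1 - q
    return estimate


rng = random.Random(1)
stream = [rng.randrange(100) for _ in range(20_000)]
print(frugal_quantile(stream, 0.75, rng))  # should settle near 75
```

The estimate is in balance when a fraction q of the items fall below it: upward moves (probability q each) then cancel downward moves (probability 1 - q each).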
Streaming k-means
Order statistics (rank of an item in sorted sequence)
K-means in ML
More Clusters
t-Digest (Dunning et al.)
Each centroid attracts the points nearest to it and keeps the "average" and "count" of those points.
The centroid nodes are maintained in a balanced binary tree.
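The centroid bookkeeping can be illustrated with a toy structure. This is not the real t-digest: it uses a sorted list instead of a balanced tree and a hypothetical fixed cap `max_count` per centroid instead of t-digest's quantile-dependent size limit. It only shows how a centroid's mean and count are updated and how quantiles are read off the sorted centroids.

```python
import bisect
import random


class ToyDigest:
    """Toy centroid summary; NOT the real t-digest size rule."""

    def __init__(self, max_count=50):
        self.max_count = max_count  # hypothetical fixed cap per centroid
        self.means = []             # centroid means, kept sorted
        self.counts = []            # counts[i] belongs to means[i]

    def add(self, x):
        i = bisect.bisect_left(self.means, x)
        # Only the centroids just below and just above x can be nearest.
        near = [j for j in (i - 1, i) if 0 <= j < len(self.means)]
        if near:
            j = min(near, key=lambda c: abs(self.means[c] - x))
            if self.counts[j] < self.max_count:
                self.counts[j] += 1
                self.means[j] += (x - self.means[j]) / self.counts[j]
                return
        # Nearest centroid is full (or none exists): start a new one.
        self.means.insert(i, x)
        self.counts.insert(i, 1)

    def quantile(self, q):
        if not self.means:
            raise ValueError("empty digest")
        target = q * sum(self.counts)
        seen = 0
        for mean, count in zip(self.means, self.counts):
            seen += count
            if seen >= target:
                return mean
        return self.means[-1]


values = list(range(1000))
random.Random(0).shuffle(values)
digest = ToyDigest()
for v in values:
    digest.add(v)
print(len(digest.means), digest.quantile(0.5))
```

The summary stores one (mean, count) pair per centroid rather than the 1000 raw values, at the price of an approximate answer.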
t-Digest
• Use the sorted structure to find quantiles.
• Centroids at both ends are deliberately kept small to increase accuracy at the extreme quantiles.
• Two t-digests can be merged.
• Performs poorly on an ascending/descending stream.
Thanks!

Data Streaming Algorithms