A Reliable Memory Centric Distributed Storage System
Haoyuan Li (@haoyuan)
November 20th, 2014; AMPCamp 5
tachyon-project.org ; meetup.com/tachyon
UC Berkeley
Outline
• Overview
– Feature 1: Memory Centric Storage Architecture
– Feature 2: Lineage in Storage
• Open Source
• Roadmap
UC Berkeley Projects
• Berkeley Data Analytics Stack (BDAS)
– Mesos: a cluster manager making it easy to write and deploy distributed applications.
– Spark: a parallel computing system supporting general and efficient in-memory execution.
– Tachyon: a reliable distributed memory-centric storage enabling memory-speed data sharing.
Why Tachyon?

Memory is King
• RAM throughput increasing exponentially
• Disk throughput increasing slowly
Memory-locality key to interactive response time

Realized by many…
• Frameworks already leverage memory

Problem solved?

Missing a Solution for Storage Layer
An Example: Spark
• Fast in-memory data processing framework
– Keep one in-memory copy inside JVM
– Track lineage of operations used to derive data
– Upon failure, use lineage to recompute data
[Figure: lineage tracking across a chain of map, filter, map, join, and reduce operations]
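
The lineage idea on this slide (keep one in-memory copy, remember the operations that derived it, recompute on loss) can be sketched in plain Scala. This is a minimal illustration under stated assumptions, not Spark's or Tachyon's implementation: `Dataset`, `Source`, `Mapped`, `Filtered`, and `Cache` are hypothetical names invented for the example, and the source data is assumed to be re-readable from durable storage.

```scala
// Illustrative sketch only: Dataset, Source, Mapped, Filtered, and Cache are
// hypothetical names, not Spark or Tachyon APIs.
object LineageSketch {
  // A dataset knows how to (re)compute itself from its parents.
  sealed trait Dataset {
    def compute(): Vector[Int]
  }

  // Root of the lineage: input assumed to be re-readable from durable storage.
  final case class Source(data: Vector[Int]) extends Dataset {
    def compute(): Vector[Int] = data
  }

  // Records the "map" step and its parent instead of storing a replica.
  final case class Mapped(parent: Dataset, f: Int => Int) extends Dataset {
    def compute(): Vector[Int] = parent.compute().map(f)
  }

  // Records the "filter" step and its parent.
  final case class Filtered(parent: Dataset, p: Int => Boolean) extends Dataset {
    def compute(): Vector[Int] = parent.compute().filter(p)
  }

  // Keeps a single in-memory copy per dataset; no replicas are written.
  final class Cache {
    private val store = scala.collection.mutable.Map.empty[Dataset, Vector[Int]]

    // Returns the cached copy, or rebuilds it by replaying the lineage.
    def get(ds: Dataset): Vector[Int] = store.getOrElseUpdate(ds, ds.compute())

    // Simulates losing the in-memory copy, for example after a crash.
    def evict(ds: Dataset): Unit = { store.remove(ds); () }

    def contains(ds: Dataset): Boolean = store.contains(ds)
  }

  def main(args: Array[String]): Unit = {
    val source = Source(Vector(1, 2, 3, 4, 5))
    val derived = Filtered(Mapped(source, _ * 10), _ > 20)
    val cache = new Cache
    val before = cache.get(derived) // computed once and cached
    cache.evict(derived)            // the only in-memory copy is lost
    val after = cache.get(derived)  // recomputed from lineage
    println(s"before=$before after=$after")
  }
}
```

In this sketch the value read after the simulated loss equals the value read before it, because the recorded operations are deterministic and the source is still available.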
Issue 1
Data sharing is the bottleneck in the analytics pipeline: slow writes to disk
[Figure: two Spark tasks, each with its own in-memory block manager holding blocks 1 and 3, exchange data through HDFS / Amazon S3 (blocks 1 to 4). Storage engine and execution engine run in the same process (slow writes).]
[Figure: a Spark task (in-memory block manager with blocks 1 and 3) and a Hadoop MR job on YARN exchange data through HDFS / Amazon S3 (blocks 1 to 4). Storage engine and execution engine run in the same process (slow writes).]
Issue 2
Cache loss when process crashes.
[Figure, three builds: (1) a Spark task and its in-memory block manager (blocks 1 and 3) run in one process, with blocks 1 to 4 stored in HDFS / Amazon S3; (2) the process crashes; (3) only the HDFS / Amazon S3 blocks remain. Execution engine and storage engine run in the same process.]
Issue 3
In-memory Data Duplication & Java Garbage Collection
[Figure: two Spark tasks, each with its own in-memory block manager holding copies of blocks 1 and 3, above HDFS / Amazon S3 (blocks 1 to 4). Execution engine and storage engine run in the same process (duplication & GC).]
Tachyon
Reliable data sharing at memory-speed within and across cluster frameworks/jobs

Solution Overview
Basic idea
• Feature 1: memory-centric storage architecture
• Feature 2: push lineage down to storage layer
Facts
• One data copy in memory
• Recomputation for fault-tolerance

Stack
[Figure: stack diagram listing Spark, MapReduce, Spark SQL, H2O, GraphX, Impala, …… and HDFS, S3, GlusterFS, OrangeFS, NFS, Ceph, ……]

Memory-Centric Storage Architecture

Lineage in Storage
Issue 1 revisited
Memory-speed data sharing among jobs in different frameworks
[Figure: a Spark task and a Hadoop MR job on YARN share blocks 1 to 4 held in Tachyon (in-memory), with HDFS / Amazon S3 on disk underneath. Execution engine and storage engine run in the same process (fast writes).]
Issue 2 revisited
Keep in-memory data safe, even when a job crashes.
[Figure, three builds: (1) a Spark task and its block manager run in one process while blocks are held in Tachyon (in-memory) above HDFS / Amazon S3; (2) the process crashes; (3) blocks 1 to 4 are still available in Tachyon (in-memory) and on HDFS disk.]
Issue 3 revisited
No in-memory data duplication, much less GC
[Figure: two Spark tasks read blocks 1 to 4 from a single copy in Tachyon (in-memory), with HDFS / Amazon S3 on disk underneath. Execution engine and storage engine run in the same process (no duplication & GC).]
Comparison with In-Memory HDFS

Workflow Improvement
Performance comparison for a realistic workflow. The workflow ran 4x faster on Tachyon than on MemHDFS. In case of node failure, applications in Tachyon still finish 3.8x faster.

Further Improve Spark's Performance
[Figure: Grep program]

How easy / hard to use Tachyon?
Spark/MapReduce/Shark without Tachyon
• Spark
– val file = sc.textFile("hdfs://ip:port/path")
• Hadoop MapReduce
– hadoop jar hadoop-examples-1.0.4.jar wordcount hdfs://localhost:19998/input hdfs://localhost:19998/output
• Shark
– CREATE TABLE orders_cached AS SELECT * FROM orders;

Spark/MapReduce/Shark with Tachyon
• Spark
– val file = sc.textFile("tachyon://ip:port/path")
• Hadoop MapReduce
– hadoop jar hadoop-examples-1.0.4.jar wordcount tachyon://localhost:19998/input tachyon://localhost:19998/output
• Shark
– CREATE TABLE orders_tachyon AS SELECT * FROM orders;
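
The two slides above differ only in the URI scheme and address passed to each framework. As a rough sketch of that one-line change, the helper below rewrites an existing path into a `tachyon://` URI using only `java.net.URI` from the JDK. The object `TachyonPaths`, the function `toTachyon`, and the `namenode:9000` address are hypothetical names chosen for illustration; they are not part of Tachyon or Spark.

```scala
import java.net.URI

// Hypothetical helper, not part of Tachyon: swaps the scheme and authority
// of a storage path so that a job reads and writes through Tachyon instead.
object TachyonPaths {
  def toTachyon(path: String, masterHost: String, masterPort: Int): String = {
    val original = new URI(path)
    // Keep only the file path; scheme, host, and port point at the Tachyon master.
    new URI("tachyon", null, masterHost, masterPort, original.getPath, null, null).toString
  }

  def main(args: Array[String]): Unit = {
    // "namenode:9000" is an assumed HDFS address used only for this example.
    println(toTachyon("hdfs://namenode:9000/input", "localhost", 19998))
  }
}
```

The resulting string can be passed wherever the slides pass a path, for example to `sc.textFile`, assuming the Tachyon client is configured as on the deck's Spark on Tachyon slide.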
Spark on Tachyon
./bin/spark-shell
sc.hadoopConfiguration.set("fs.tachyon.impl", "tachyon.hadoop.TFS")
// Load input from Tachyon
val file = sc.textFile("tachyon://localhost:19998/LICENSE")
file.count()
file.take(10)
// Store RDD OFF_HEAP in Tachyon
import org.apache.spark.storage.StorageLevel
file.persist(StorageLevel.OFF_HEAP)
file.count()
file.take(10)
// Save output to Tachyon
file.flatMap(line => line.split(" ")).map(s => (s, 1)).reduceByKey((a, b) => a + b).saveAsTextFile("tachyon://localhost:19998/LICENSE_WC")
History
Started at UC Berkeley AMPLab in the summer of 2012
• Reliable, Memory Speed Storage for Cluster Computing Frameworks (UC Berkeley EECS Tech Report)
• Haoyuan Li, Ali Ghodsi, Matei Zaharia, Ion Stoica, Scott Shenker
Open Source Status
• Apache License 2.0, Version 0.5.0 (July 2014)
• Deployed at > 50 companies (July 2014)
• 20+ Companies Contributing (? contributors)
• Spark and MapReduce applications can run without any code change
Release Growth
• Tachyon 0.1 (Dec '12): 1 contributor
• Tachyon 0.2 (Apr '13): 3 contributors
• Tachyon 0.3 (Oct '13): 15 contributors
• Tachyon 0.4 (Feb '14): 30 contributors
• Tachyon 0.5 (July '14): 46 contributors
Open Community
[Figure: Berkeley contributors and non-Berkeley contributors]
Thanks to our Code Contributors!
Aaron Davidson 
Achal Soni 
Ali Ghodsi 
Andrew Ash 
Ankur Dave 
Anurag Khandelwal 
Aslan Bekirov 
Bill Zhao 
Brad Childs 
Calvin Jia 
Chao Chen 
Cheng Chang 
Cheng Hao 
Colin Patrick McCabe 
David Capwell 
David Zhu 
Dan Crankshaw 
Darion Yaphet 
Du Li 
Fei Wang 
Gerald Zhang 
Grace Huang 
Haoyuan Li 
Henry Saputra 
Hobin Yoon 
Huamin Chen 
Jey Kottalam 
Joseph Tang 
Juan Zhou 
Jun Aoki 
Lin Xing 
Lukasz Jastrzebski 
Manu Goyal 
Mark Hamstra 
Mingfei Shi 
Mubarak Seyed 
Nick Lanham 
Orcun Simsek 
Pengfei Xuan 
Qianhao Dong 
Qifan Pu 
Raymond Liu 
Reynold Xin 
Robert Metzger 
Rong Gu 
Sean Zhong 
Seonghwan Moon 
Siyuan He 
Shaoshan Liu 
Shivaram Venkataraman 
Srinivas Parayya 
Tao Wang 
Timothy St. Clair 
Thu Kyaw 
Vamsi Chitters 
Vikram Sreekanti 
Xi Liu 
Xiang Zhong 
Xiaomin Zhang 
Zhao Zhang
Thanks to Red Hat!

Commercially supported by [company logo] and running in dozens of their customers' clusters

Tachyon is the Default Off-Heap Storage Solution for Spark.

Exchange Data Between Spark and H2O

More Support from Industry

Reaching wider communities: e.g. GlusterFS

Under Filesystem Choices (Big Data, Cloud, HPC, Enterprise)
Features
• Memory Centric Storage Architecture
• Lineage in Storage (alpha)
• Hierarchical Storage
• Data Serving
• Different hardware
• More…
• Your Requirements?

Short Term Roadmap (0.6 Release)
• Ceph Integration (Ceph Community)
• Hierarchical Local Storage (Intel)
• Performance Improvement (Yahoo)
• Easy Deployment through Vagrant (Red Hat)
• Multi-tenancy (AMPLab)
• Further Improve Spark Integration (Spark Community, Databricks)
• Mesos Integration (Mesos Community, Mesosphere)
• Network Sub-system Improvement (Pivotal)
• Many more from AMPLab and Industry Contributors
Goal?

Better Assist Other Components
[Figure: stack diagram listing Spark, MapReduce, Spark SQL, H2O, GraphX, Impala, …… and HDFS, S3, GlusterFS, OrangeFS, NFS, Ceph, ……]
Welcome Collaboration!

Thanks! Questions?
• More Information:
– Website: http://tachyon-project.org
– Github: https://github.com/amplab/tachyon
– Meetup: http://www.meetup.com/Tachyon
• Email: haoyuan@cs.berkeley.edu

Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Debunking the Myths of HDFS Erasure Coding Performance
Debunking the Myths of HDFS Erasure Coding Performance Debunking the Myths of HDFS Erasure Coding Performance
Debunking the Myths of HDFS Erasure Coding Performance
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
 

Recently uploaded

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 

Recently uploaded (20)

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 

Tachyon-2014-11-21-amp-camp5

  • 1. A Reliable Memory Centric Distributed Storage System. Haoyuan Li (@haoyuan); November 20th, 2014; AMPCamp 5; tachyon-project.org; meetup.com/tachyon; UC Berkeley
  • 2. Outline • Overview – Feature 1: Memory Centric Storage Architecture – Feature 2: Lineage in Storage • Open Source • Roadmap
  • 3. Outline • Overview – Feature 1: Memory Centric Storage Architecture – Feature 2: Lineage in Storage • Open Source • Roadmap
  • 4. Projects (UC Berkeley) • Berkeley Data Analytics Stack (BDAS): a cluster manager making it easy to write and deploy distributed applications; a parallel computing system supporting general and efficient in-memory execution; a reliable distributed memory-centric storage enabling memory-speed data sharing.
  • 6. Memory is King • RAM throughput increasing exponentially • Disk throughput increasing slowly • Memory-locality key to interactive response time
  • 7. Realized by many… • Frameworks already leverage memory
  • 9. Missing a Solution for Storage Layer
  • 10. An Example • Fast in-memory data processing framework – Keep one in-memory copy inside JVM – Track lineage of operations used to derive data – Upon failure, use lineage to recompute data [Diagram: Lineage Tracking over map, filter, map, join, reduce]
  • 11. Issue 1: Data sharing is the bottleneck in analytics pipeline: slow writes to disk. [Diagram: two Spark tasks, each with a Spark memory block manager (blocks 1 and 3); HDFS / Amazon S3 (blocks 1 to 4); storage engine & execution engine same process (slow writes)]
  • 12. Issue 1: Data sharing is the bottleneck in analytics pipeline: slow writes to disk. [Diagram: a Spark task with a Spark memory block manager (blocks 1 and 3); Hadoop MR on YARN; HDFS / Amazon S3 (blocks 1 to 4); storage engine & execution engine same process (slow writes)]
  • 13. Issue 2: Cache loss when process crashes. [Diagram: a Spark task with a Spark memory block manager (blocks 1 and 3); HDFS / Amazon S3 (blocks 1 to 4); execution engine & storage engine same process]
  • 14. Issue 2: Cache loss when process crashes. [Diagram: crash; Spark memory block manager (blocks 1 and 3); HDFS / Amazon S3 (blocks 1 to 4); execution engine & storage engine same process]
  • 15. Issue 2: Cache loss when process crashes. [Diagram: crash; HDFS / Amazon S3 (blocks 1 to 4); execution engine & storage engine same process]
  • 16. Issue 3: In-memory data duplication & Java garbage collection. [Diagram: two Spark tasks, each with a Spark memory block manager (blocks 1 and 3); HDFS / Amazon S3 (blocks 1 to 4); execution engine & storage engine same process (duplication & GC)]
  • 17. Tachyon: Reliable data sharing at memory-speed within and across cluster frameworks/jobs
  • 18. Solution Overview. Basic idea • Feature 1: memory-centric storage architecture • Feature 2: push lineage down to storage layer. Facts • One data copy in memory • Recomputation for fault-tolerance
  • 19. Stack. Frameworks: Spark, MapReduce, Spark SQL, H2O, GraphX, Impala, …… Storage systems: HDFS, S3, GlusterFS, OrangeFS, NFS, Ceph, ……
  • 22. Issue 1 revisited: Memory-speed data sharing among jobs in different frameworks. [Diagram: a Spark task (Spark memory) and Hadoop MR on YARN; Tachyon in-memory (blocks 1 to 4); HDFS / Amazon S3 (blocks 1 to 4); execution engine & storage engine same process (fast writes)]
  • 23. Issue 2 revisited: Keep in-memory data safe, even when a job crashes. [Diagram: a Spark task with a Spark memory block manager; Tachyon in-memory; HDFS / Amazon S3; blocks 1 to 4; execution engine & storage engine same process]
  • 24. Issue 2 revisited: Keep in-memory data safe, even when a job crashes. [Diagram: crash; Spark memory block manager; Tachyon in-memory (blocks 1 to 4); HDFS / Amazon S3 (blocks 1 to 4); execution engine & storage engine same process]
  • 25. Issue 2 revisited: Keep in-memory data safe, even when a job crashes. [Diagram: crash; Tachyon in-memory (blocks 1 to 4); HDFS / Amazon S3 (blocks 1 to 4); execution engine & storage engine same process]
  • 26. Issue 3 revisited: No in-memory data duplication, much less GC. [Diagram: two Spark tasks (Spark memory); Tachyon in-memory (blocks 1 to 4); HDFS / Amazon S3 (blocks 1 to 4); execution engine & storage engine same process (no duplication & GC)]
  • 27. Comparison with In-Memory HDFS
  • 28. Workflow Improvement: Performance comparison for a realistic workflow. The workflow ran 4x faster on Tachyon than on MemHDFS. In case of node failure, applications on Tachyon still finish 3.8x faster.
  • 29. Further Improve Spark’s Performance Grep Program
  • 30. How easy / hard to use Tachyon?
  • 31. Spark/MapReduce/Shark without Tachyon • Spark – val file = sc.textFile("hdfs://ip:port/path") • Hadoop MapReduce – hadoop jar hadoop-examples-1.0.4.jar wordcount hdfs://localhost:19998/input hdfs://localhost:19998/output • Shark – CREATE TABLE orders_cached AS SELECT * FROM orders;
  • 32. Spark/MapReduce/Shark with Tachyon • Spark – val file = sc.textFile("tachyon://ip:port/path") • Hadoop MapReduce – hadoop jar hadoop-examples-1.0.4.jar wordcount tachyon://localhost:19998/input tachyon://localhost:19998/output • Shark – CREATE TABLE orders_tachyon AS SELECT * FROM orders;
  • 33. Spark on Tachyon
      ./bin/spark-shell
      sc.hadoopConfiguration.set("fs.tachyon.impl", "tachyon.hadoop.TFS")
      // Load input from Tachyon
      val file = sc.textFile("tachyon://localhost:19998/LICENSE")
      file.count(); file.take(10)
      // Store RDD OFF_HEAP in Tachyon
      import org.apache.spark.storage.StorageLevel
      file.persist(StorageLevel.OFF_HEAP)
      file.count(); file.take(10)
      // Save output to Tachyon
      file.flatMap(line => line.split(" ")).map(s => (s, 1)).reduceByKey((a, b) => a + b).saveAsTextFile("tachyon://localhost:19998/LICENSE_WC")
  • 34. Outline • Overview – Feature 1: Memory Centric Storage Architecture – Feature 2: Lineage in Storage • Open Source • Roadmap
  • 35. History: Started at UC Berkeley AMPLab in the summer of 2012 • Reliable, Memory Speed Storage for Cluster Computing Frameworks (UC Berkeley EECS Tech Report) • Haoyuan Li, Ali Ghodsi, Matei Zaharia, Ion Stoica, Scott Shenker
  • 36. Open Source Status • Apache License 2.0, Version 0.5.0 (July 2014) • Deployed at > 50 companies (July 2014) • 20+ Companies Contributing (? contributors) • Spark and MapReduce applications can run without any code change
  • 37. Release Growth: Tachyon 0.1 (Dec '12): 1 contributor
  • 38. Release Growth: Tachyon 0.1 (Dec '12): 1 contributor; Tachyon 0.2 (Apr '13): 3 contributors
  • 39. Release Growth: Tachyon 0.1 (Dec '12): 1 contributor; Tachyon 0.2 (Apr '13): 3 contributors; Tachyon 0.3 (Oct '13): 15 contributors
  • 40. Release Growth: Tachyon 0.1 (Dec '12): 1 contributor; Tachyon 0.2 (Apr '13): 3 contributors; Tachyon 0.3 (Oct '13): 15 contributors; Tachyon 0.4 (Feb '14): 30 contributors
  • 41. Release Growth: Tachyon 0.1 (Dec '12): 1 contributor; Tachyon 0.2 (Apr '13): 3 contributors; Tachyon 0.3 (Oct '13): 15 contributors; Tachyon 0.4 (Feb '14): 30 contributors; Tachyon 0.5 (July '14): 46 contributors
  • 42. Open Community [Chart: Berkeley Contributors vs. Non-Berkeley Contributors]
  • 43. Thanks to our Code Contributors! Aaron Davidson, Achal Soni, Ali Ghodsi, Andrew Ash, Ankur Dave, Anurag Khandelwal, Aslan Bekirov, Bill Zhao, Brad Childs, Calvin Jia, Chao Chen, Cheng Chang, Cheng Hao, Colin Patrick McCabe, David Capwell, David Zhu, Dan Crankshaw, Darion Yaphet, Du Li, Fei Wang, Gerald Zhang, Grace Huang, Haoyuan Li, Henry Saputra, Hobin Yoon, Huamin Chen, Jey Kottalam, Joseph Tang, Juan Zhou, Jun Aoki, Lin Xing, Lukasz Jastrzebski, Manu Goyal, Mark Hamstra, Mingfei Shi, Mubarak Seyed, Nick Lanham, Orcun Simsek, Pengfei Xuan, Qianhao Dong, Qifan Pu, Raymond Liu, Reynold Xin, Robert Metzger, Rong Gu, Sean Zhong, Seonghwan Moon, Siyuan He, Shaoshan Liu, Shivaram Venkataraman, Srinivas Parayya, Tao Wang, Timothy St. Clair, Thu Kyaw, Vamsi Chitters, Vikram Sreekanti, Xi Liu, Xiang Zhong, Xiaomin Zhang, Zhao Zhang
  • 45. Commercially supported by x and running in dozens of their customers’ clusters Thanks to Redhat!
  • 46. Tachyon is the Default Off-Heap Storage Solution for . Thanks to Redhat!
  • 47. Exchange Data Between Spark and H2O
  • 48. More Support from Industry
  • 49. Reaching wider communities: e.g. GlusterFS
  • 50. Under Filesystem Choices (Big Data, Cloud, HPC, Enterprise)
  • 51. Outline • Overview – Feature 1: Memory Centric Storage Architecture – Feature 2: Lineage in Storage • Open Source • Roadmap
  • 52. Features • Memory Centric Storage Architecture • Lineage in Storage (alpha) • Hierarchical Storage • Data Serving • Different hardware • More… • Your Requirements?
  • 53. Short Term Roadmap (0.6 Release) • Ceph Integration (Ceph Community) • Hierarchical Local Storage (Intel) • Performance Improvement (Yahoo) • Easy Deployment through Vagrant (Redhat) • Multi-tenancy (AMPLab) • Further Improve Spark Integration (Spark Community, Databricks) • Mesos Integration (Mesos Community, Mesosphere) • Network Sub-system Improvement (Pivotal) • Many more from AMPLab and Industry Contributors
  • 55. Better Assist Other Components. Frameworks: Spark, MapReduce, Spark SQL, H2O, GraphX, Impala, …… Storage systems: HDFS, S3, GlusterFS, OrangeFS, NFS, Ceph, …… Welcome Collaboration!
  • 56. Thanks! Questions? • More Information: – Website: http://tachyon-project.org – Github: https://github.com/amplab/tachyon – Meetup: http://www.meetup.com/Tachyon • Email: haoyuan@cs.berkeley.edu
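
The lineage idea from slides 10 and 18 (keep one copy in memory, record the operations that derived it, and recompute on loss) can be illustrated with a small, self-contained Scala sketch. This is not Tachyon's or Spark's actual API: the names `Dataset`, `Source`, `Mapped`, `Filtered`, and `LineageStore` are hypothetical and exist only for illustration, and the sketch assumes every source can be re-read on demand.

```scala
// Hypothetical illustration of lineage-based recomputation.
// None of these types come from Tachyon or Spark.

// A dataset is either a re-readable source or is derived from a parent
// dataset by a recorded operation (its lineage).
sealed trait Dataset {
  def compute(): Seq[String]
}

final case class Source(load: () => Seq[String]) extends Dataset {
  def compute(): Seq[String] = load()
}

final case class Mapped(parent: Dataset, f: String => String) extends Dataset {
  def compute(): Seq[String] = parent.compute().map(f)
}

final case class Filtered(parent: Dataset, p: String => Boolean) extends Dataset {
  def compute(): Seq[String] = parent.compute().filter(p)
}

// Keeps one in-memory copy per dataset name, plus the lineage needed to
// rebuild that copy if it is lost.
final class LineageStore {
  private val lineage = scala.collection.mutable.Map.empty[String, Dataset]
  private val memory  = scala.collection.mutable.Map.empty[String, Seq[String]]

  // Records the lineage and materializes a single in-memory copy.
  def put(name: String, ds: Dataset): Unit = {
    lineage(name) = ds
    memory(name) = ds.compute()
  }

  // Simulates losing the in-memory copy, for example after a crash.
  def evict(name: String): Unit = {
    memory.remove(name)
    ()
  }

  def isCached(name: String): Boolean = memory.contains(name)

  // Returns the cached copy, or recomputes it from lineage if it was lost.
  // Throws NoSuchElementException for a name that was never stored.
  def get(name: String): Seq[String] =
    memory.getOrElseUpdate(name, lineage(name).compute())
}
```

A caller would `put` a derived dataset once, and a later `get` after an `evict` transparently replays the recorded operations instead of relying on a second replica. A real system would also have to bound recomputation cost, which this sketch ignores.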