Getting Started with Alluxio + Spark + S3


Alluxio, Inc.

Bay Area Meetup presentation (6/15/16)


Alluxio (formerly Tachyon):
Getting Started with Alluxio + Spark + S3
Calvin Jia
June 15, 2016 @ Alluxio Meetup (hosted by Intel)
Related Blog Post: http://goo.gl/MUpL0O

Who Am I?
• Calvin Jia
• SWE @ Alluxio, Inc.
• Alluxio PMC Member
• Twitter: @JiaCalvin

Outline
• Technology Overview
• Alluxio + Spark + S3
• Demo

Alluxio Ecosystem
[Figure: Alluxio ecosystem diagram]

Why Alluxio?
• Data sharing between jobs
• Data resilience during application crashes
• Consolidate memory usage and alleviate GC issues

Data Sharing Between Jobs
[Figure: two in-memory stores, each holding block 1 and block 3, alongside a separate set of blocks 1 to 4]
• Storage engine & execution engine in the same process
• Inter-process sharing slowed down by network I/O

Data Sharing Between Jobs
[Figure: HDFS / disk holding blocks 1 to 4, and a separate in-memory layer holding blocks 1, 3 and 4]
• Storage & execution engine separated
• Inter-process sharing can happen at memory speed

Data Resilience during Crashes
[Figure, three frames: in-memory storage holding block 1 and block 3 next to a separate set of blocks 1 to 4; the process crashes; the in-memory blocks are gone and only blocks 1 to 4 remain]
• Storage engine & execution engine in the same process
• Process crash requires network I/O to re-read the data

Data Resilience during Crashes
[Figure, two frames: HDFS / disk holding blocks 1 to 4 and an in-memory layer holding blocks 1, 3 and 4; after the process crashes, the in-memory blocks are still present]
• Storage & execution engine separated
• Process crash only needs memory I/O to re-read the data

Consolidating Memory
[Figure: two in-memory stores, each holding block 1 and block 3, alongside a separate set of blocks 1 to 4]
• Storage engine & execution engine in the same process
• Data duplicated at memory level

Consolidating Memory
[Figure: HDFS / disk holding blocks 1 to 4, and a single in-memory layer holding blocks 1, 3 and 4]
• Storage & execution engine separated
• Data not duplicated at memory level

Outline
• Technology Overview
• Alluxio + Spark + S3
• Demo

Visualizing the Stack
• Mem: FAST, 10^4 - 10^5 MB/s, use often
• SSD: MODERATE, 10^3 - 10^4 MB/s, limited use
• HDD: SLOW, 10^2 - 10^3 MB/s, use only when necessary
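
To make the gap between tiers concrete, here is a small back-of-the-envelope sketch. It reads the slide's throughput figures as powers of ten (10^2 to 10^5 MB/s), which is an assumption about the original slide, and takes the lower bound of each range. The 100 GB dataset size and the object and function names are hypothetical, chosen only for illustration.

```scala
object TierReadTime {
  // Order-of-magnitude throughput per tier in MB/s. These are the lower bounds
  // of the ranges on the slide, read as powers of ten (an assumption).
  val throughputMBps: Seq[(String, Double)] = Seq(
    "Mem" -> 1e4,
    "SSD" -> 1e3,
    "HDD" -> 1e2
  )

  /** Seconds needed to scan `sizeMB` megabytes at `mbPerSec` megabytes per second. */
  def readSeconds(sizeMB: Double, mbPerSec: Double): Double = sizeMB / mbPerSec

  def main(args: Array[String]): Unit = {
    val datasetMB = 100000.0 // hypothetical 100 GB dataset, decimal units
    for ((tier, rate) <- throughputMBps)
      println(f"$tier%-4s ${readSeconds(datasetMB, rate)}%.0f s")
  }
}
```

Under these assumptions a full scan takes roughly 10 seconds from memory against roughly 1,000 seconds from HDD, which is the motivation for serving repeated reads from the memory tier.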

When to use Alluxio
• Two or more jobs access the same dataset
• Job(s) may not always succeed
• The dataset is larger than the Spark JVM's memory
• Jobs are pipelined
• The resulting data does not need to be persisted immediately

Version Selection
• Alluxio 1.1.0
– Latest released version
– Many improvements, upgrade recommended
• Spark 1.6.1
– Latest released version
– Remember to use the Spark-specific Alluxio client, i.e. one built with -Pspark
– Spark 2.0 is coming out soon; we will recommend the best way to integrate it with Alluxio

API Selection
• Access data directly through the FileSystem API, but change the scheme to alluxio://
– Minimal code change
– Application logic stays the same, so there is nothing new to reason about
• Example:
– Before: val file = sc.textFile("s3n://my-bucket/myFile")
– After: val file = sc.textFile("alluxio://master:19998/myFile")
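
The scheme change on this slide can be sketched as a small path-rewriting helper. This is an illustrative sketch, not part of the talk: it assumes the S3 bucket is mounted at the Alluxio root (so the path inside the bucket is also the Alluxio path), and the names `AlluxioPaths` and `toAlluxio` are hypothetical. The default host `master` and port `19998` are taken from the slide's example.

```scala
import java.net.URI

object AlluxioPaths {
  private val s3Schemes = Set("s3", "s3n", "s3a")

  /**
   * Rewrites an S3 URI into an Alluxio URI by dropping the bucket and keeping the path.
   * Assumes the bucket is mounted at the Alluxio root; other mount layouts
   * would need a different mapping.
   */
  def toAlluxio(s3Uri: String, master: String = "master", port: Int = 19998): String = {
    val uri = new URI(s3Uri)
    require(s3Schemes.contains(uri.getScheme), s"not an S3 URI: $s3Uri")
    s"alluxio://$master:$port${uri.getPath}"
  }
}
```

In a Spark job this would be used as `sc.textFile(AlluxioPaths.toAlluxio("s3n://my-bucket/myFile"))`, leaving the rest of the job untouched.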

Outline
• Technology Overview
• Alluxio + Spark + S3
• Demo

