Getting Started with Alluxio + Spark + S3
1. Alluxio (formerly Tachyon): Getting Started with Alluxio + Spark + S3
Calvin Jia
June 15, 2016 @ Alluxio Meetup (hosted by Intel)
Related Blog Post: http://goo.gl/MUpL0O
2. Who Am I?
• Calvin Jia
• SWE @ Alluxio, Inc.
• Alluxio PMC Member
• Twitter: @JiaCalvin
5. Why Alluxio?
• Data sharing between jobs
• Data resilience during application crashes
• Consolidate memory usage and alleviate GC issues
6. Data Sharing Between Jobs
[Diagram: processes that each hold blocks in their own in-memory storage]
• Storage engine & execution engine in the same process
• Inter-process sharing slowed down by network I/O
7. Data Sharing Between Jobs
[Diagram: blocks 1 to 4 on HDFS / disk, with blocks also held in a separate in-memory layer]
• Storage & execution engine separated
• Inter-process sharing can happen at memory speed
8. Data Resilience during Crashes
[Diagram: processes that each hold blocks in their own in-memory storage]
• Storage engine & execution engine in the same process
• Process crash requires network I/O to re-read the data
9. Data Resilience during Crashes
[Diagram: processes that each hold blocks in their own in-memory storage; one process is marked "Crash"]
• Storage engine & execution engine in the same process
• Process crash requires network I/O to re-read the data
10. Data Resilience during Crashes
[Diagram: blocks 1 to 4 after the process marked "Crash" has gone down]
• Storage engine & execution engine in the same process
• Process crash requires network I/O to re-read the data
11. Data Resilience during Crashes
[Diagram: blocks 1 to 4 on HDFS / disk, with blocks also held in a separate in-memory layer]
• Storage & execution engine separated
• Process crash only needs memory I/O to re-read the data
12. Data Resilience during Crashes
[Diagram: blocks 1 to 4 on HDFS / disk, with blocks also held in a separate in-memory layer; the execution process is marked "Crash"]
• Storage & execution engine separated
• Process crash only needs memory I/O to re-read the data
16. Visualizing the Stack
• FAST: 10^4 - 10^5 MB/s
• MODERATE: 10^3 - 10^4 MB/s
• SLOW: 10^2 - 10^3 MB/s
[Diagram labels: Mem, SSD, HDD; Often, Limited, Only when necessary]
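To make the gap between the speed bands concrete, here is a rough back-of-the-envelope sketch. It reads the slide's figures as powers of ten (10^4 to 10^5 MB/s for FAST, and so on). The 100 GB dataset size and the names `TierReadTime` and `secondsToRead` are illustrative assumptions, not something from the talk, and 1 GB is taken as 1000 MB.

```scala
object TierReadTime {
  // Lower and upper bandwidth bounds in MB/s, read off the slide as powers of ten.
  val tiers: Seq[(String, Double, Double)] = Seq(
    ("FAST", 1e4, 1e5),
    ("MODERATE", 1e3, 1e4),
    ("SLOW", 1e2, 1e3)
  )

  // Time in seconds to stream sizeMB megabytes at a sustained mbPerSec.
  def secondsToRead(sizeMB: Double, mbPerSec: Double): Double = sizeMB / mbPerSec

  def main(args: Array[String]): Unit = {
    val sizeMB = 100000.0 // hypothetical 100 GB dataset (1 GB = 1000 MB)
    for ((name, low, high) <- tiers) {
      val best = secondsToRead(sizeMB, high)  // at the band's upper bound
      val worst = secondsToRead(sizeMB, low)  // at the band's lower bound
      println(f"$name%-8s $best%.0f s to $worst%.0f s")
    }
  }
}
```

Under these assumptions the same scan takes 1 to 10 seconds in the FAST band and 100 to 1000 seconds in the SLOW band, ignoring latency, parallelism, and network effects.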
17. When to use Alluxio
• Two or more jobs access the same dataset
• Job(s) may not always succeed
• Dataset larger than Spark JVM
• Jobs are pipelined
• Resulting data does not need to be immediately persisted
18. Version Selection
• Alluxio 1.1.0
–Latest released version
–Many improvements, upgrade recommended
• Spark 1.6.1
–Latest released version
–Remember to use the Spark Alluxio client, i.e. the -Pspark build profile
–Spark 2.0 is coming out soon; we will recommend the best way to integrate it with Alluxio
19. API Selection
• Access data directly through the FileSystem API, but change the scheme to alluxio://
–Minimal code change
–Do not need to reason about logic
• Example:
–Before: val file = sc.textFile("s3n://my-bucket/myFile")
–After: val file = sc.textFile("alluxio://master:19998/myFile")
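If many paths need the same change, the scheme swap can be kept in one place. The following is a minimal sketch, not part of the Alluxio or Spark APIs: `AlluxioPaths` and `toAlluxio` are hypothetical names, and the code assumes the S3 bucket is mounted at the Alluxio root, so that only the path portion carries over, as in the example above. The default host `master` and port `19998` are taken from that example.

```scala
import java.net.URI

object AlluxioPaths {
  // Hypothetical helper: keeps the path portion of an under-storage URI and
  // replaces the scheme and authority with the Alluxio master address.
  // Assumes the bucket is mounted at the Alluxio root.
  def toAlluxio(path: String, master: String = "master", port: Int = 19998): String = {
    val uri = new URI(path)
    s"alluxio://$master:$port${uri.getPath}"
  }
}

// Usage with an existing SparkContext `sc` (not created here):
//   val file = sc.textFile(AlluxioPaths.toAlluxio("s3n://my-bucket/myFile"))
```

The rest of the Spark job stays unchanged, which is the "minimal code change" point of this slide.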