1. Best Practices for Using
Alluxio with Spark
Gene Pang, Alluxio, Inc.
Spark Summit EU - October 2017
2. About Me
• Gene Pang
• Software engineer @ Alluxio, Inc.
• Alluxio open source PMC member
• Ph.D. from AMPLab @ UC Berkeley
• Worked at Google before UC Berkeley
• Twitter: @unityxx
• Github: @gpang
©2017 Alluxio, Inc. All Rights Reserved
5. Data Ecosystem Today
• Many Compute Frameworks
• Multiple Storage Systems
• Most not co-located
6. Data Ecosystem Issues
• Each application manages multiple data sources
• Adding/removing data sources requires application changes
• Storage optimizations require application changes
• Lower performance due to lack of locality
7. Data Ecosystem with Alluxio
• Apps only talk to Alluxio
• Simple Add/Remove
• No App Changes
• Memory Performance
[Diagram: application-facing interfaces: Native File System, Hadoop Compatible File System, Native Key-Value Interface, FUSE Compatible File System]
[Diagram: storage-facing interfaces: HDFS Interface, Amazon S3 Interface, Swift Interface, GlusterFS Interface]
8. Next Gen Analytics with Alluxio
Apps, Data & Storage at Memory Speed
✓ Big Data/IoT
✓ AI/ML
✓ Deep Learning
✓ Cloud Migration
✓ Multi Platform
✓ Autonomous
[Diagram: application-facing interfaces: Native File System, Hadoop Compatible File System, Native Key-Value Interface, FUSE Compatible File System]
[Diagram: storage-facing interfaces: HDFS Interface, Amazon S3 Interface, Swift Interface, GlusterFS Interface]
9. Fastest Growing Big Data Open Source Projects
Fastest growing open-source project in the big data ecosystem
Running in large production clusters
600+ contributors from 100+ organizations
[Chart: GitHub Open Source Contributors by Month; y-axis: Number of Contributors (0 to 500); x-axis: 0 to 45; series: Alluxio, Spark, Kafka, Redis, HDFS, Cassandra, Hive]
11. Big Data Case Study –
Challenge:
Gain end-to-end view of business with large volume of data
Queries were slow / not interactive, resulting in operational inefficiency
Solution:
ETL data from Teradata to Alluxio
Impact:
Faster time to market: “Now we don’t have to work Sundays”
[Diagram labels: Spark, Teradata]
http://bit.ly/2oMx95W
12. Big Data Case Study –
Challenge:
Gain end-to-end view of business with large volume of data
Queries were slow / not interactive, resulting in operational inefficiency
Solution:
With Alluxio, data queries are 30x faster
Impact:
Higher operational efficiency
[Diagram labels: Spark, Baidu File System]
http://bit.ly/2pDHS3O
13. Big Data Case Study –
Challenge:
Gain end-to-end view of business with large volume of data for $5B travel site
Queries were slow / not interactive, resulting in operational inefficiency
Solution:
With Alluxio, 300x improvement in performance
Impact:
Increased revenue from immediate response to user behavior
[Diagram labels: Spark, Flink, HDFS, Ceph]
Use case: http://bit.ly/2pDJdrq
14. Machine Learning Case Study –
Challenge:
Disparate data, both on-prem and in the cloud. Heterogeneous types of data.
Scaling to exabyte-size data.
Slow due to disk-based approach.
Solution:
Using Alluxio to prevent I/O bottlenecks
Impact:
Orders of magnitude higher performance than before.
[Diagram labels: Spark, HDFS, Minio, Mesos]
http://bit.ly/2p18ds3
16. Alluxio Architecture
[Diagram: Application with Alluxio Client; Alluxio Master; Alluxio Workers; Storage systems]
17. Alluxio Client
Applications interact with Alluxio via the Alluxio client
• Java Native Alluxio Filesystem Client
    • Alluxio-specific operations like [un]pin, [un]mount, [un]set TTL
• HDFS-Compatible Filesystem Client
    • No code change necessary
• S3 API
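With the HDFS-compatible client, "no code change" usually means the application keeps calling the same file APIs and only the path it is given changes to the `alluxio://` scheme. The sketch below builds such a path. The helper `alluxio_uri`, the host names, and the example paths are hypothetical; port 19998 is assumed as the default Alluxio master port and should be checked against your cluster. The Alluxio client jar is also assumed to be on the application's classpath.

```python
# Sketch only: `alluxio_uri` is a hypothetical helper, not part of any Alluxio client.
DEFAULT_MASTER_PORT = 19998  # assumed default Alluxio master port; adjust for your cluster


def alluxio_uri(master_host, path, port=DEFAULT_MASTER_PORT):
    """Build an alluxio:// URI for a path in the Alluxio namespace."""
    if not path.startswith("/"):
        path = "/" + path
    return "alluxio://{}:{}{}".format(master_host, port, path)


# Hypothetical paths: the same dataset addressed directly and through Alluxio.
hdfs_style = "hdfs://namenode:8020/data/events"
alluxio_style = alluxio_uri("alluxio-master", "/data/events")
print(alluxio_style)  # alluxio://alluxio-master:19998/data/events
```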
18. Alluxio Master
Master is responsible for managing metadata
• Filesystem namespace metadata
• Blocks / workers metadata
Primary master writes journal for durable operations
• Secondary masters replay journal entries
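The journal idea can be illustrated with a toy model: the primary appends every metadata change to a log before applying it, and a secondary rebuilds the same state by replaying that log. This is only a sketch of the general technique. The class names, the entry format, and the in-memory list standing in for a durable journal are all assumptions, not Alluxio's actual journal implementation.

```python
# Toy model of journal-based replication; not Alluxio's actual journal format.
class Master:
    def __init__(self):
        self.namespace = {}  # path -> metadata

    def apply(self, entry):
        op, path, meta = entry
        if op == "create":
            self.namespace[path] = meta
        elif op == "delete":
            self.namespace.pop(path, None)


class PrimaryMaster(Master):
    def __init__(self, journal):
        super().__init__()
        self.journal = journal  # a list stands in for a durable, shared log

    def execute(self, op, path, meta=None):
        entry = (op, path, meta)
        self.journal.append(entry)  # journal first, so the change survives a crash
        self.apply(entry)


class SecondaryMaster(Master):
    def __init__(self, journal):
        super().__init__()
        self.journal = journal
        self.applied = 0  # number of journal entries already replayed

    def catch_up(self):
        for entry in self.journal[self.applied:]:
            self.apply(entry)
        self.applied = len(self.journal)


journal = []
primary = PrimaryMaster(journal)
secondary = SecondaryMaster(journal)
primary.execute("create", "/data/a", {"blocks": [1, 3]})
primary.execute("create", "/data/b", {"blocks": [2, 4]})
primary.execute("delete", "/data/b")
secondary.catch_up()
print(secondary.namespace == primary.namespace)  # True
```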
19. Alluxio Worker
Worker is responsible for managing block data
Worker stores block data on various storage media
• HDD, SSD, Memory
Reads and writes data to underlying storage systems
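To make the idea of multiple storage media concrete, here is a toy worker that places each new block in the fastest tier that still has room. The tier names follow the slide, but the capacities, the class `TieredWorker`, and the first-fit placement rule are illustrative assumptions. A real worker's allocation and eviction policies are configurable and more involved.

```python
# Toy model of tiered block placement; the policy shown is a simplification.
class TieredWorker:
    def __init__(self, capacities):
        # capacities: ordered list of (tier_name, max_blocks), fastest tier first
        self.tiers = [(name, cap, {}) for name, cap in capacities]

    def put(self, block_id, data):
        """Store a block in the first tier with free space; return the tier name."""
        for name, cap, blocks in self.tiers:
            if len(blocks) < cap:
                blocks[block_id] = data
                return name
        raise RuntimeError("worker is full")

    def locate(self, block_id):
        """Return the tier holding the block, or None if it is not stored."""
        for name, _cap, blocks in self.tiers:
            if block_id in blocks:
                return name
        return None


worker = TieredWorker([("MEM", 2), ("SSD", 2), ("HDD", 4)])
placements = [worker.put(i, b"x") for i in range(1, 6)]
print(placements)  # ['MEM', 'MEM', 'SSD', 'SSD', 'HDD']
```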
21. Sharing Data via Memory
Storage Engine & Execution Engine: Same Process
• Two copies of data in memory: double the memory used
• Sharing Slowed Down by Network / Disk I/O
[Diagram: two Spark jobs, each with Spark Compute and Spark Storage caching block 1 and block 3 in-process; HDFS / Amazon S3 holds blocks 1 to 4]
22. Sharing Data via Memory
Storage Engine & Execution Engine: Different process
• Half the memory used
• Sharing Data at Memory Speed
[Diagram: two Spark jobs (Spark Compute, Spark Storage) share Alluxio, which holds block 1, block 3, and block 4; HDFS / Amazon S3 (HDFS disk) holds blocks 1 to 4]
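The "half the memory" claim is simple arithmetic, sketched below for two jobs that both cache block 1 and block 3. The job names and the two helper functions are hypothetical and only count cached block copies. They do not measure real memory.

```python
# Toy accounting of cached block copies for jobs that read the same blocks.
def in_process_copies(jobs):
    """Each job caches its own copy of every block it reads."""
    return sum(len(blocks) for blocks in jobs.values())


def shared_copies(jobs):
    """A shared storage layer keeps one copy of each distinct block."""
    return len(set().union(*jobs.values()))


jobs = {"job_a": {1, 3}, "job_b": {1, 3}}
print(in_process_copies(jobs))  # 4
print(shared_copies(jobs))      # 2
```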
23. Data Resilience During Crash
Storage Engine & Execution Engine: Same Process
[Diagram: Spark Compute and Spark Storage in one process, caching block 1 and block 3; HDFS / Amazon S3 holds blocks 1 to 4]
24. Data Resilience During Crash
Storage Engine & Execution Engine: Same Process
• Process Crash Requires Network and/or Disk I/O to Re-read Data
[Diagram: CRASH of Spark Compute; Spark Storage holds block 1 and block 3; HDFS / Amazon S3 holds blocks 1 to 4]
25. Data Resilience During Crash
Storage Engine & Execution Engine: Same Process
• Process Crash Requires Network and/or Disk I/O to Re-read Data
[Diagram: CRASH; only HDFS / Amazon S3, holding blocks 1 to 4, remains]
26. Data Resilience During Crash
Storage Engine & Execution Engine: Different process
[Diagram: Spark Compute and Spark Storage; Alluxio holds block 1, block 3, and block 4; HDFS / Amazon S3 (HDFS disk) holds blocks 1 to 4]
27. Data Resilience During Crash
Storage Engine & Execution Engine: Different process
• Process Crash: Data is Re-read at Memory Speed
[Diagram: CRASH of the Spark process; Alluxio still holds block 1, block 3, and block 4; HDFS / Amazon S3 (HDFS disk) holds blocks 1 to 4]
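The difference between the two crash scenarios can be sketched with a toy simulation that counts how many blocks must be re-read from remote storage after the compute process restarts. All class and function names here are hypothetical. `ExternalCache` only stands in for a separate caching process such as Alluxio, and `RemoteStore` for HDFS or S3.

```python
# Toy simulation: remote re-reads after a compute process crash.
class RemoteStore:
    """Stands in for HDFS / Amazon S3; counts how often it is read."""
    def __init__(self, blocks):
        self.blocks = dict(blocks)
        self.reads = 0

    def read(self, block_id):
        self.reads += 1
        return self.blocks[block_id]


class ExternalCache:
    """Stands in for a cache in a separate process; survives compute crashes."""
    def __init__(self, store):
        self.store = store
        self.blocks = {}

    def read(self, block_id):
        if block_id not in self.blocks:
            self.blocks[block_id] = self.store.read(block_id)
        return self.blocks[block_id]


class ComputeProcess:
    """Compute process; its optional local cache is lost when it crashes."""
    def __init__(self, source, keep_local_cache):
        self.source = source
        self.local = {} if keep_local_cache else None

    def read(self, block_id):
        if self.local is None:
            return self.source.read(block_id)
        if block_id not in self.local:
            self.local[block_id] = self.source.read(block_id)
        return self.local[block_id]


def remote_reads_after_crash(use_external_cache):
    store = RemoteStore({1: b"block 1", 3: b"block 3"})
    source = ExternalCache(store) if use_external_cache else store
    proc = ComputeProcess(source, keep_local_cache=not use_external_cache)
    for block_id in (1, 3):
        proc.read(block_id)
    before = store.reads
    # Crash and restart: the new process starts with no in-process state.
    proc = ComputeProcess(source, keep_local_cache=not use_external_cache)
    for block_id in (1, 3):
        proc.read(block_id)
    return store.reads - before


print(remote_reads_after_crash(use_external_cache=False))  # 2
print(remote_reads_after_crash(use_external_cache=True))   # 0
```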
28. Accessing Alluxio Data From Spark
Writing Data: Write to an Alluxio file
Reading Data: Read from an Alluxio file
29. Code Example for Spark RDDs
Writing RDD to Alluxio
rdd.saveAsTextFile(alluxioPath)
rdd.saveAsObjectFile(alluxioPath)
Reading RDD from Alluxio
rdd = sc.textFile(alluxioPath)
rdd = sc.objectFile(alluxioPath)
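A slightly fuller PySpark sketch of the text-file round trip is shown below. It assumes a running Alluxio cluster, the Alluxio client jar on Spark's classpath, and an existing `SparkContext` passed in as `sc`. The master address, the path, and the helper names `save_lines` and `load_lines` are hypothetical.

```python
# Sketch only: needs a live SparkContext and a reachable Alluxio master to do real I/O.
ALLUXIO_PATH = "alluxio://localhost:19998/tmp/lines"  # hypothetical master address and path


def save_lines(sc, lines, path=ALLUXIO_PATH):
    """Write a list of strings to Alluxio as a text file, via an RDD."""
    sc.parallelize(lines).saveAsTextFile(path)


def load_lines(sc, path=ALLUXIO_PATH):
    """Read the text file back from Alluxio as a list of strings."""
    return sc.textFile(path).collect()
```

With a live context, `save_lines(sc, ["a", "b", "c"])` followed by `load_lines(sc)` should return the same lines, though not necessarily in the same order. `saveAsTextFile` normally fails if the target path already exists, so use a fresh path for each run.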
30. Code Example for Spark DataFrames
Writing to Alluxio
df.write.parquet(alluxioPath)
Reading from Alluxio
df = spark.read.parquet(alluxioPath)
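The DataFrame case can be sketched the same way in PySpark. This assumes an existing `SparkSession` passed in as `spark`, a running Alluxio cluster, and the Alluxio client jar on Spark's classpath. The path and the helper names are hypothetical, and the `overwrite` mode is a choice made for this sketch so that reruns do not fail on an existing path.

```python
# Sketch only: needs a live SparkSession and a reachable Alluxio master to do real I/O.
ALLUXIO_PARQUET_PATH = "alluxio://localhost:19998/tmp/df.parquet"  # hypothetical path


def write_parquet(df, path=ALLUXIO_PARQUET_PATH):
    """Write a DataFrame to Alluxio as Parquet, replacing any existing data."""
    df.write.mode("overwrite").parquet(path)


def read_parquet(spark, path=ALLUXIO_PARQUET_PATH):
    """Read a Parquet dataset from Alluxio into a DataFrame."""
    return spark.read.parquet(path)
```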
31. Deploying Alluxio with Spark
Deploy Alluxio between Spark and Storage
Colocate Alluxio Workers with Spark for optimal I/O performance
[Diagram: Spark on top of Alluxio on top of Storage]
33. Experiments
Spark 2.2.0 + Alluxio 1.6.0
Single worker: Amazon r3.2xlarge
Compare reading cached Parquet files
35. New Context: 50 GB DataFrame (S3)
6x – 8x speedup
36. Conclusion
Easy to use Alluxio with Spark
Alluxio enables improved I/O performance
Easily interact with various storage systems with Alluxio