Alluxio Use Cases and Future Directions

DATA ORCHESTRATION SUMMIT
Alluxio Use Cases and Future Directions
Bin Fan - Founding Engineer, VP of Open Source @ Alluxio
Calvin Jia - Founding Engineer @ Alluxio

Data Orchestration for
Analytics & AI in the Cloud
A DATA ORCHESTRATION APPROACH

Agenda
• Alluxio Use Cases
• Future Directions
• Community Collaborations

DATA ORCHESTRATION SUMMIT 2020
Alluxio Use Cases

Companies Using Alluxio

Single Cloud & On-Prem Use Cases
USE CASE 01: CLOUD
Consistent SLAs, Performance, and
Cost Savings on cloud storage
[Diagram: PUBLIC CLOUD, TensorFlow on Alluxio]
USE CASE 02: ON PREM
Speed-up analytics on on-prem
object stores
[Diagram: ON PREMISE, Spark on Alluxio]

CHALLENGES WITH CLOUD STORAGE
USE CASE 01: CLOUD
Inefficient access to cloud storage
• Performance is variable and consistent SLAs are hard to achieve
• Metadata operations are expensive & slow down workloads
• Embedded caching solutions are ineffective for ephemeral
workloads & clusters
[Diagram: TensorFlow on Alluxio]

SOLUTION
USE CASE 01: CLOUD
Consistent SLAs, Performance &
Cost Savings on cloud storage
• 40%+ reduction in AI training time & cost
• 2-8x performance with Analytics engines
• Eliminate storage access cost to cut total cost by up to 50%
• Reduce latency spikes by up to 6x using data pre-loading &
consistent performance guarantees
• Optional off-cluster caching for ephemeral workloads
[Diagram: TensorFlow on Alluxio]

CHALLENGES WITH ON-PREM OBJECT STORES
USE CASE 02: ON PREM
Slow transition to object storage
• Performance for analytics & AI workloads can be very poor
• No native support for popular frameworks
• Expensive metadata operations further reduce performance
[Diagram: Spark on Alluxio]

SOLUTION
USE CASE 02: ON PREM
Speed-up analytics & AI on
on-prem object stores
• Improved performance over co-located HDFS with the
flexibility of segregated storage
• Support for multiple APIs
• No changes to the end-user experience
• Enable cheap storage at a fraction of the cost
[Diagram: Spark on Alluxio, SAME REGION]

Hybrid Cloud & Multi-Datacenter
USE CASE 03: HYBRID
Burst compute to a public cloud
and gradually migrate
[Diagram: Hive on Alluxio, PUBLIC CLOUD and ON PREMISE]
USE CASE 04: HYBRID
Hybrid Cloud Gateway to utilize
on-prem compute for data in the cloud
[Diagram: PyTorch on Alluxio, PUBLIC CLOUD and ON PREMISE]
USE CASE 05: MULTI-DATACENTER
Cross Datacenter Access without
changing Ingest Pipeline across regions
[Diagram: Presto on Alluxio, DATACENTER 1 and DATACENTER 2, INGESTION]

CHALLENGES WITH HYBRID CLOUD BURSTING
USE CASE 03: HYBRID
Migrating Analytics or AI to the
Cloud is Hard
• Repeated data access across the corporate network to a public
cloud is not feasible
• Copying data to cloud storage is time-consuming and complex
• Using a cloud storage system like S3 means expensive
application changes and low performance
[Diagram: Hive on Alluxio]

SOLUTION
USE CASE 03: HYBRID
Burst Compute to a Public Cloud
and Gradually Migrate
• Performance as if data is on the cloud compute cluster
• 100% of I/O is offloaded from on-premises
• No changes to end-user experience and security model
• Common data fabric with only logical data copies
• Utilization of elastic cloud compute for up to 4x cost savings
[Diagram: Hive on Alluxio, SAME REGION]

Alluxio @ Walmart
• Zero-Copy
○ No new copies of data in the cloud
• High Performance
○ Data caching accelerates queries
• Lower Costs
○ One source of truth for data avoids
additional storage

CHALLENGES WITH HYBRID CLOUD STORAGE
USE CASE 04: HYBRID
Accessing Cloud Storage from a
Private Datacenter
• No unified view for cloud and on-prem storage
• Prohibitively high network egress costs
• Inability to utilize compute on-premises for data generated
in the cloud
• Inadequate performance for analytics and AI
[Diagram: PyTorch, ON PREMISE and PUBLIC CLOUD]

SOLUTION
USE CASE 04: HYBRID
Hybrid Cloud Storage Gateway for
data in the cloud
• Performance as if data is on the on-prem compute cluster
• Intelligent distributed caching for reads & writes
• Network cost savings of up to 80% by eliminating replication
• No changes to the end-user experience with flexible APIs and
security model on cloud storage
[Diagram: PyTorch on Alluxio, ON PREMISE and PUBLIC CLOUD]

CHALLENGES WITH SUPPORTING SATELLITE CLUSTERS
ACROSS DATA CENTERS
USE CASE 05: MULTI-DATACENTER
Utilization of compute resources
across datacenters
• Orchestrating data to compute clusters in another data center is
manual and time-consuming
• Storing and managing multiple copies of the data is expensive
with unnecessary network traffic for replication
• Running replication frameworks on an overloaded storage
cluster dramatically impacts performance of existing workloads
[Diagram: Presto, Alluxio, DATACENTER 1, DATACENTER 2, Hive]

SOLUTION
USE CASE 05: MULTI-DATACENTER
Cross Datacenter Access without
changing Ingest Pipeline
• No redundant data copies across datacenters
• Elimination of complex data synchronization
• 3-6x performance compared to remote data access across regions
• Self-service data infrastructure across business units
[Diagram: Presto, Alluxio, DATACENTER 1, DATACENTER 2, Hive]

Alluxio @ Adobe
PROBLEM
Primary DC with large Hadoop Cluster out of
space; ad hoc SQL workloads growing exponentially
as analyst headcount has reached 1,800 people
SOLUTION
Leverage compute resources outside of
primary on-prem DC for multiple analytical
frameworks.
RESULTS
● 80% less network usage
● More stable infrastructure
● Lower costs
● Results come in faster
● Easier to scale
● Ability to handle new analysts with no impact, plus better response times
● Self-service for end-users
[Diagram label: REMOTE DATA]

Alluxio & Data Analytics
• Data Analytics runs on Data Lakes
• Data Lakes are designed for data storage, not access
• Alluxio is the Data Orchestration layer which bridges the
compute and data layers
○ If the Data Lake is remote
○ If the Data Lake is overloaded
○ If the Data Lake has variable latency
○ If the Data Lake has low performance
○ If the Data Lake doesn’t support the same semantics
○ ...

Growing Workloads

Alluxio & AI w/ K8s
• Machine Learning & AI run on Data Lakes
• Compared to Data Analytics, AI workloads have different
characteristics, but a similar mismatch between compute
and storage

Alluxio & AI - Better Together
• Access Pattern - Repeated access on a dataset
• Dataset - Many small files
• Preferred API - POSIX Filesystem
• Workload Regularity - Predictable, bulk access
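
The access pattern on this slide can be sketched roughly as follows: a training job re-reads the same directory of many small files once per epoch through ordinary POSIX file calls. This is only an illustration, not Alluxio code. The helper names (read_epoch, run_epochs, make_demo_dataset) and the mount path mentioned in the comments are hypothetical, and a temporary directory stands in for a FUSE mount so the sketch runs anywhere.

```python
import os
import tempfile


def read_epoch(dataset_dir):
    """Read every file under dataset_dir once using plain POSIX file calls.

    Returns (file_count, byte_count) for the pass.
    """
    file_count = 0
    byte_count = 0
    for root, _dirs, names in os.walk(dataset_dir):
        for name in sorted(names):
            with open(os.path.join(root, name), "rb") as f:
                byte_count += len(f.read())
            file_count += 1
    return file_count, byte_count


def run_epochs(dataset_dir, epochs):
    """Repeat the same bulk read once per epoch, as a training loop would."""
    return [read_epoch(dataset_dir) for _ in range(epochs)]


def make_demo_dataset(n_files=5, size=16):
    """Create a throwaway directory of small files so the sketch runs anywhere."""
    d = tempfile.mkdtemp()
    for i in range(n_files):
        with open(os.path.join(d, "sample_%d.bin" % i), "wb") as f:
            f.write(bytes(size))
    return d


if __name__ == "__main__":
    # Assumption: in a real deployment this would be a FUSE mount point such as
    # the hypothetical "/mnt/alluxio-fuse/train"; a temp directory stands in here.
    dataset_dir = make_demo_dataset()
    for epoch, (files, size) in enumerate(run_epochs(dataset_dir, epochs=3)):
        print("epoch", epoch, "files", files, "bytes", size)
```

Because every epoch touches the same files in a predictable bulk pass, a cache between compute and the data lake only has to fetch each file from remote storage once.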

Powered by the Community
• Future directions and growing workloads for Alluxio are
greatly influenced by our community! Thank you!

Community Collaborations

Alluxio Open Source Project Stats
Latest stable release: 2.4.1
Total number of contributors: 1092
+1013 more commits since v2.1.0 (Nov 2019, 1st Summit)
5100+ Slack users (alluxio.io/slack)

Fast Growing User Slack Channel
alluxio.io/slack

Production Deployments at Scale
● Top-tier cell phone provider
○ 3000+ Alluxio servers in a single cluster
● Top-tier social network company
○ 10,000+ concurrent Alluxio clients
○ 10+ PB data managed

Special Interest Groups in Ecosystem
● SIG in Machine Learning/K8s on Alluxio
■ Regular Community R&D meetings
■ Re-implemented JNI-based FUSE integration
■ Performance optimizations for small files, RPCs
● A new SIG kicked off for Presto on Alluxio

Experimental Two-week Release Cycle
● Previous release cadence: quarterly
● New experimental release schedule:
○ every two weeks
○ starting early December!
● What does it bring to the Alluxio community?
○ deliver features and bug fixes faster

Welcome to Join the Alluxio Community!
alluxio.io/slack Alluxio-Global-Online-Meetup/
