Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Data Orchestration for Analytics and AI in the Cloud Era
Calvin Jia, Founding Engineer (Alluxio)
Bin Fan, Founding Engineer, VP of Open Source (Alluxio)
About Alluxio: alluxio.io
Engage with the open source community on Slack: alluxio.io/slack
6. DATA ORCHESTRATION SUMMIT
Single Cloud & On-Prem Use Cases
USE CASE 01: CLOUD
Consistent SLAs, Performance, and Cost Savings on cloud storage
[Diagram labels: public cloud, TensorFlow, Alluxio]
USE CASE 02: ON PREM
Speed up analytics on on-prem object stores
[Diagram labels: on premise, Spark, Alluxio]
7. CHALLENGES WITH CLOUD STORAGE
USE CASE 01: CLOUD
Inefficient access to cloud storage
• Performance is variable and consistent SLAs are hard to achieve
• Metadata operations are expensive & slow down workloads
• Embedded caching solutions are ineffective for ephemeral workloads & clusters
[Diagram labels: TensorFlow, Alluxio]
8. SOLUTION
Consistent SLAs, Performance & Cost Savings on cloud storage
USE CASE 01: CLOUD
• 40%+ reduction in AI training time & cost
• 2-8x performance with analytics engines
• Eliminate storage access cost to cut total cost by up to 50%
• Reduce latency spikes by up to 6x using data pre-loading & consistent performance guarantees
• Optional off-cluster caching for ephemeral workloads
[Diagram labels: TensorFlow, Alluxio]
9. CHALLENGES WITH ON-PREM OBJECT STORES
USE CASE 02: ON PREM
Slow transition to object storage
• Performance for analytics & AI workloads can be very poor
• No native support for popular frameworks
• Expensive metadata operations further reduce performance
[Diagram labels: Spark, Alluxio]
10. SOLUTION
Speed up analytics & AI on on-prem object stores
USE CASE 02: ON PREM
• Improved performance over co-located HDFS with the flexibility of segregated storage
• Support for multiple APIs
• No changes to the end-user experience
• Enable cheap storage at a fraction of the cost
[Diagram labels: Spark, Alluxio, same region]
11. Hybrid Cloud & Multi-Datacenter
USE CASE 03: HYBRID
Burst compute to a public cloud and gradually migrate
[Diagram labels: Hive, Alluxio, public cloud, on premise]
USE CASE 04: HYBRID
Hybrid Cloud Gateway to utilize on-prem compute for data in the cloud
[Diagram labels: Alluxio, PyTorch, public cloud, on premise]
USE CASE 05: MULTI-DATACENTER
Cross Datacenter Access without changing Ingest Pipeline across regions
[Diagram labels: Presto, Alluxio, datacenter 1, datacenter 2, ingestion]
12. CHALLENGES WITH HYBRID CLOUD BURSTING
USE CASE 03: HYBRID
Migrating Analytics or AI to the Cloud is Hard
• Repeated data access across the corporate network to a public cloud is not feasible
• Copying data to cloud storage is time consuming and complex
• Using a cloud storage system like S3 means expensive application changes and low performance
[Diagram labels: Hive, Alluxio]
13. SOLUTION
Burst Compute to a Public Cloud and Gradually Migrate
USE CASE 03: HYBRID
• Performance as if data is on the cloud compute cluster
• 100% of I/O is offloaded from on-premises
• No changes to end-user experience and security model
• Common data fabric with only logical data copies
• Utilization of elastic cloud compute for up to 4x cost savings
[Diagram labels: Hive, Alluxio, same region]
14. Alluxio @ Walmart
• Zero-Copy
○ No new copies of data in the cloud
• High Performance
○ Data caching accelerates queries
• Lower Costs
○ One source of truth for data avoids additional storage
15. CHALLENGES WITH HYBRID CLOUD STORAGE
USE CASE 04: HYBRID
Accessing Cloud Storage from a Private Datacenter
• No unified view for cloud and on-prem storage
• Prohibitively high network egress costs
• Inability to utilize compute on-premises for data generated in the cloud
• Inadequate performance for analytics and AI
[Diagram labels: PyTorch, on premise, public cloud]
16. SOLUTION
Hybrid Cloud Storage Gateway for data in the cloud
USE CASE 04: HYBRID
• Performance as if data is on the on-prem compute cluster
• Intelligent distributed caching for reads & writes
• Network cost savings of up to 80% by eliminating replication
• No changes to the end-user experience with flexible APIs and security model on cloud storage
[Diagram labels: Alluxio, PyTorch, on premise, public cloud]
17. CHALLENGES WITH SUPPORTING SATELLITE CLUSTERS ACROSS DATA CENTERS
USE CASE 05: MULTI DATACENTER
Utilization of compute resources across datacenters
• Orchestrating data to compute clusters in another data center is manual and time consuming
• Storing and managing multiple copies of the data is expensive with unnecessary network traffic for replication
• Running replication frameworks on an overloaded storage cluster dramatically impacts performance of existing workloads
[Diagram labels: Presto, Alluxio, datacenter 1, datacenter 2, Hive]
18. SOLUTION
Cross Datacenter Access without changing Ingest Pipeline
USE CASE 05: MULTI DATACENTER
• No redundant data copies across datacenters
• Elimination of complex data synchronization
• 3-6x performance compared to remote data access across regions
• Self-service data infrastructure across business units
[Diagram labels: Presto, Alluxio, datacenter 1, datacenter 2, Hive]
19. Alluxio @ Adobe
PROBLEM
Primary DC with a large Hadoop cluster is out of space; ad hoc SQL workloads are growing exponentially as analyst headcount has reached 1800 people
SOLUTION
Leverage compute resources outside of the primary on-prem DC for multiple analytical frameworks.
RESULTS
● 80% less network usage
● More stable infrastructure
● Lower costs
● Results come in faster
● Easier to scale
● Ability to handle new analysts with no impact and improve response times
● Self-service for end-users
[Diagram label: remote data]
20. Alluxio & Data Analytics
• Data Analytics runs on Data Lakes
• Data Lakes are designed for data storage, not access
• Alluxio is the Data Orchestration layer which bridges the compute and data layers
○ If the Data Lake is remote
○ If the Data Lake is overloaded
○ If the Data Lake has variable latency
○ If the Data Lake has low performance
○ If the Data Lake doesn’t support the same semantics
○ ...
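To make the "bridging layer" idea above concrete, here is a minimal sketch of a read-through cache sitting between compute and a slow or remote data lake. It is a toy illustration only, not Alluxio's API or implementation: the class name `ReadThroughCache`, the function `fake_data_lake`, and the path used are all hypothetical.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Toy stand-in for a caching layer between compute and a data lake."""

    def __init__(self, fetch_remote: Callable[[str], bytes]) -> None:
        self._fetch_remote = fetch_remote  # hypothetical slow or remote read
        self._cache: Dict[str, bytes] = {}
        self.remote_reads = 0  # how many times the data lake was actually hit

    def read(self, path: str) -> bytes:
        # Only the first read of a path goes to the data lake;
        # later reads are served from the local copy.
        if path not in self._cache:
            self.remote_reads += 1
            self._cache[path] = self._fetch_remote(path)
        return self._cache[path]


def fake_data_lake(path: str) -> bytes:
    """Hypothetical remote store: returns placeholder bytes for any path."""
    return f"contents of {path}".encode()


cache = ReadThroughCache(fake_data_lake)
for _ in range(3):
    data = cache.read("/warehouse/table/part-0000")
```

In this sketch, three reads of the same path result in a single access to the underlying store, which is the effect that matters when the data lake is remote, overloaded, or slow.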
22. Alluxio & AI w/ K8s
• Machine Learning & AI run on Data Lakes
• Compared to Data Analytics, AI workloads have different
characteristics, but a similar mismatch between compute
and storage
23. Alluxio & AI - Better Together
• Access Pattern - Repeated access on a dataset
• Dataset - Many small files
• Preferred API - POSIX filesystem
• Workload Regularity - Predictable, bulk access
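The access pattern listed above (many small files, read repeatedly through a POSIX filesystem path in predictable bulk passes) can be sketched as follows. This is an illustrative assumption about what such a training-style read loop looks like, not code from the talk; the function `read_epoch` and the file names are hypothetical, and a temporary directory stands in for a real mount point.

```python
import tempfile
from pathlib import Path


def read_epoch(dataset_dir: Path) -> int:
    """Read every file under dataset_dir once and return total bytes read."""
    total = 0
    for path in sorted(dataset_dir.rglob("*")):
        if path.is_file():
            total += len(path.read_bytes())
    return total


# Stand-in dataset: a temporary directory holding many small files.
# In a real deployment this would be a filesystem mount point instead.
with tempfile.TemporaryDirectory() as tmp:
    dataset_dir = Path(tmp)
    for i in range(100):
        (dataset_dir / f"sample_{i:04d}.bin").write_bytes(b"x" * 16)

    # Each "epoch" re-reads the whole dataset with ordinary file calls.
    bytes_per_epoch = [read_epoch(dataset_dir) for _ in range(3)]
```

Because every epoch touches the same files through plain file I/O, a caching layer below the mount point can serve all passes after the first without going back to remote storage.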
28. Production Deployments at Scale
● Top-tier cell phone provider
○ 3000+ Alluxio servers in a single cluster
● Top-tier social network company
○ 10,000+ concurrent Alluxio clients
○ 10+ PB data managed
29. Special Interest Groups in Ecosystem
● SIG in Machine Learning/K8s on Alluxio
■ Regular Community R&D meetings
■ Re-implemented JNI-based FUSE integration
■ Performance optimizations for small files, RPCs
● A new SIG kicked off for Presto on Alluxio
30. Experimental Two-week Release Cycle
● Previous release cadence: quarterly
● New experimental release schedule:
○ every two weeks
○ starting early December!
● What does it bring to the Alluxio community?
○ deliver features and bug fixes faster