HDFS Tiered Storage
Chris Douglas, Virajith Jalaparti
Microsoft CISL
>id
Microsoft Cloud and Information Services Lab (CISL)
Applied research group in large-scale systems and machine learning
Contributions to Apache Hadoop YARN
Preemption, reservations/planning, federation, distributed scheduling
Apache REEF: control-plane for big data systems
Chris Douglas (cdoug@microsoft.com)
Contributor to Apache Hadoop since 2007, member of its PMC
Virajith Jalaparti (vijala@microsoft.com)
Data in Hadoop
All data in one place
Tools written against abstractions
Compatible FileSystems (Azure/S3/etc.)
Multi-tenant
Management APIs
Quotas, auth, encryption, media
Works well if all data is in one cluster
In most cases, we have multiple clusters…
Multiple storage clusters
Production/research partitioning
Compliance and regulatory restrictions
Datasets can be shared
Geographically distributed clusters
Disaster recovery
Cloud backup/Hybrid clouds
Heterogeneous storage tiers in a cluster
[Diagram: two compute + storage clusters, hdfs://a/ and hdfs://b/, alongside a cloud store wasb://…]
Managing multiple clusters: Today
Using the framework
Copy data (distcp) between clusters
(+) Clients process local copies; no visible partial copies
(-) Uses compute resources; requires capacity planning
Using the application
Directly access data in multiple clusters
(+) Consistency managed at client
(-) Requires auth to all data sources; consistency is hard; no opportunities for transparent caching
[Diagram: left, clients read/write local copies of a dataset replicated between hdfs://a/ and hdfs://b/; right, clients read/write the dataset directly in both clusters.]
Managing multiple clusters: Our proposal
Tiering: Using the platform
Synchronize storage with remote namespace
(+) Transparent to users; caching/prefetching; unified namespace
Use HDFS to coordinate external storage
No capability or performance gap
Support for heterogeneous media (RAM/SSD/DISK), rebalancing, security, quotas, etc.
[Diagram: hdfs://b/ mounted into hdfs://a/; clients read/write dataset A through hdfs://a/, which synchronizes with hdfs://b/ via the mount.]
Challenges
Synchronize metadata without copying data
Dynamically page in “blocks” on demand
Define policies to prefetch and evict local replicas
Mirror changes in remote namespace
Handle out-of-band churn in remote storage
Avoid dropping valid, cached data (e.g., rename)
Handle writes consistently
Writes committed to the backing store must “make sense”
Proposal: Provided Storage Type
Peer to RAM, SSD, DISK in HDFS (HDFS-2832)
Data in external store mapped to HDFS blocks
Each block associated with an Alias = (REF, nonce)
Used to map blocks to external data
Nonce used to detect changes on backing store
E.g.: REF = (file URI, offset, length); nonce = GUID
Mapping stored in a BlockMap
KV store accessible by NN and all DNs
A ProvidedVolume on each Datanode reads/writes data from/to the external store
[Diagram: NN (FSNamesystem + BlockManager) with the BlockMap, datanodes DN1/DN2, and the external store.
FSNamesystem: /a/foo → b_i, …, b_j; /adl/bar → b_k, …, b_l
BlockManager: b_i → {s1, s2, s3}; b_k → {s_PROVIDED}
BlockMap: b_k → Alias_k, …
Datanode storage types: RAM_DISK, SSD, DISK, PROVIDED]
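The Alias/BlockMap structure above can be sketched as a toy Java model. Names like BlockMapSketch and Alias are illustrative, not the actual HDFS-9806 classes; the real BlockMap is a KV store shared by the NN and all DNs.

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch: each PROVIDED block id maps to an Alias = (REF, nonce), where
// REF = (file URI, offset, length) in the external store and the nonce
// (here a GUID string) is used to detect out-of-band changes.
public class BlockMapSketch {
    static final class Alias {
        final String uri; final long offset; final long length; final String nonce;
        Alias(String uri, long offset, long length, String nonce) {
            this.uri = uri; this.offset = offset; this.length = length; this.nonce = nonce;
        }
    }

    // KV store keyed by HDFS block id, accessible by the NN and all DNs.
    private final Map<Long, Alias> blockMap = new HashMap<>();

    void put(long blockId, Alias alias) { blockMap.put(blockId, alias); }

    // A DN serving a PROVIDED block consults the BlockMap to find where to read.
    Alias lookup(long blockId) { return blockMap.get(blockId); }

    public static void main(String[] args) {
        BlockMapSketch bm = new BlockMapSketch();
        bm.put(42L, new Alias("ext://nn/c/d/f/z1", 0L, 128L * 1024 * 1024, "GUID1"));
        Alias a = bm.lookup(42L);
        System.out.println(a.uri + " @" + a.offset + "+" + a.length);
    }
}
```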
Example: Using an immutable cloud store
[Diagram: external namespace ext://nn (/ → a, b, c; /c → d; /c/d → e, f, g) mounted into an HDFS cluster (NN, DN1, DN2). A client issues read(/d/e); through the mount this resolves to read(/c/d/e) against the external store, and the file data is returned to the client via a DN.]
Example: Using an immutable cloud store
FSImage:
/d/e → {b_1, b_2, …}
/d/f/z1 → {b_i, b_i+1, …}
b_i → {rep = 1, PROVIDED}
BlockMap:
b_i → {(ext://nn/c/d/f/z1, 0, L), inodeId1}
b_i+1 → {(ext://nn/c/d/f/z1, L, 2L), inodeId1}
Create FSImage and BlockMap
Block StoragePolicy can be set as required
E.g., {rep = 2, PROVIDED, DISK}
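The block split used when generating the FSImage and BlockMap entries above can be sketched as follows. This is a toy sketch: ImageGen and splitIntoBlocks are made-up names, and the real image-generation tool does considerably more (inode ids, replication, storage policies).

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of FSImage/BlockMap creation for one external file: split the
// remote file into fixed-size PROVIDED blocks, each aliased to a byte range
// (offset, length) of the external URI.
public class ImageGen {
    static List<long[]> splitIntoBlocks(long fileLen, long blockLen) {
        List<long[]> blocks = new ArrayList<>();
        for (long off = 0; off < fileLen; off += blockLen) {
            // Last block may be shorter than the configured block length.
            blocks.add(new long[]{off, Math.min(blockLen, fileLen - off)});
        }
        return blocks;
    }

    public static void main(String[] args) {
        long L = 128L << 20; // 128 MB block size
        for (long[] b : splitIntoBlocks(3 * L + 5, L)) {
            System.out.println("ext://nn/c/d/f/z1 offset=" + b[0] + " len=" + b[1]);
        }
    }
}
```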
Example: Using an immutable cloud store
Start the NN with the FSImage
Replication > 1 starts copying blocks to local media
All blocks reachable from the NN once a DN with PROVIDED storage heartbeats in
In contrast to READ_ONLY_SHARED (HDFS-5318)
Example: Using an immutable cloud store
Block locations are stored as a composite DN that contains all DNs with PROVIDED storage configured
getBlockLocations() resolves the composite DN to a single DN
The DN looks up the block in the BlockMap and uses its Alias to read from the external store
Data can be cached locally as it is read (read-through cache)
[Diagram: DFSClient calls getBlockLocations("/d/f/z1", 0, L); the NN's BlockManager returns LocatedBlocks {{DN2, b_i, PROVIDED}}; DN2 looks up b_i in the BlockMap, obtains ("/c/d/f/z1", 0, L, GUID1), and reads that range from the external store.]
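Putting the read path together, here is a minimal sketch with the external store faked as a Map. All names are illustrative, and the composite-DN resolution policy shown (hashing on the block id) is just one plausible choice, not the actual implementation.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of the PROVIDED read path: the NN resolves the composite
// datanode to one concrete DN; that DN uses the block's Alias to read the
// right byte range from the external store.
public class ProvidedReadSketch {
    static String pickDatanode(List<String> dnsWithProvidedStorage, long blockId) {
        // getBlockLocations(): composite DN -> single DN, e.g. by block id.
        return dnsWithProvidedStorage.get((int) (blockId % dnsWithProvidedStorage.size()));
    }

    static byte[] readProvidedBlock(Map<String, byte[]> externalStore,
                                    String uri, long offset, long length) {
        // The DN reads only the aliased (offset, length) range of the file.
        byte[] file = externalStore.get(uri);
        return Arrays.copyOfRange(file, (int) offset, (int) (offset + length));
    }

    public static void main(String[] args) {
        Map<String, byte[]> store = new HashMap<>();
        store.put("ext://nn/c/d/f/z1", "hello provided world".getBytes());
        String dn = pickDatanode(List.of("DN1", "DN2"), 7L);
        byte[] data = readProvidedBlock(store, "ext://nn/c/d/f/z1", 6, 8);
        System.out.println(dn + " read: " + new String(data)); // DN2 read: provided
    }
}
```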
Benefits of the PROVIDED design
Use existing HDFS features to enforce quotas, limits on storage tiers
Simpler implementation, no mismatch between HDFS invariants and framework
Supports different types of back-end stores
org.apache.hadoop.fs.FileSystem implementations, blob stores, etc.
Enables several policies to improve performance
Set replication in FSImage to pre-fetch
Read-through cache
Actively pre-fetch while cluster is running
Set StoragePolicy for the file to prefetch
Credentials hidden from client
Only NN and DNs require credentials of external store
HDFS can be used to enforce access controls for remote store
Handling out-of-band changes
Nonce for correctness
Asynchronously poll external store
Integrate detected changes to the NN
Update BlockMap on file creation/deletion
Consensus, shared log, etc.
Tighter NS integration complements provided store abstraction
Operations like rename can cause unnecessary evictions
Heuristics based on common rename scenarios (e.g., output promotion) to assign block ids
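The nonce check at the heart of this can be sketched as below. Names are illustrative; depending on the store, the nonce could be an object-store etag, a version id, or a GUID.

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of nonce-based change detection: when polling the external
// store, a cached PROVIDED replica stays valid only while the store's
// current nonce still matches the one recorded in the block's Alias.
public class NonceCheckSketch {
    static boolean cachedReplicaStillValid(String aliasNonce, String currentStoreNonce) {
        // A mismatch means the backing file changed out of band:
        // the locally cached replica must be evicted.
        return aliasNonce.equals(currentStoreNonce);
    }

    public static void main(String[] args) {
        Map<String, String> storeNonces = new HashMap<>();
        storeNonces.put("ext://nn/c/d/f/z1", "GUID2"); // file was rewritten
        System.out.println(cachedReplicaStillValid("GUID1",
                storeNonces.get("ext://nn/c/d/f/z1"))); // false -> evict
    }
}
```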
Assumptions
Churn is rare and relatively predictable
Analytic workloads, ETL into external/cloud storage, compute in cluster
Clusters are either consumers/producers for a subtree/region
FileSystem has too little information to resolve conflicts
Clients can recognize/ignore inconsistent states
External stores can tighten these semantics
Independent of PROVIDED storage
Implementation roadmap
Read-only image (with periodic, naive refresh)
ViewFS-based: NN configured to refresh from root
Mount within an existing NN
Refresh view of remote cluster and sync
Write-through
Cloud backup: no namespace in external store, replication only
Return to writer only when data are committed to external store
Write-back
Lazily replicate to external store
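The write-through vs. write-back distinction in the roadmap can be sketched with two queues. This is a toy model with made-up names; the real mechanism operates on HDFS blocks and replication state.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of the two write modes: write-through acknowledges the writer
// only after the external store has the data; write-back acknowledges on the
// local copy and replicates to the external store lazily.
public class WriteModeSketch {
    final List<String> localBlocks = new ArrayList<>();
    final List<String> externalBlocks = new ArrayList<>();
    final List<String> pendingBackup = new ArrayList<>();

    // Write-through: commit remotely before returning to the writer.
    void writeThrough(String block) {
        localBlocks.add(block);
        externalBlocks.add(block); // must succeed before the writer sees an ack
    }

    // Write-back: ack on the local write; replication is deferred.
    void writeBack(String block) {
        localBlocks.add(block);
        pendingBackup.add(block);
    }

    // Background task lazily drains the queue to the external store.
    void drainBackupQueue() {
        externalBlocks.addAll(pendingBackup);
        pendingBackup.clear();
    }
}
```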
Resources
Tiered Storage HDFS-9806 [issues.apache.org]
Design documentation
List of subtasks – take one!
Discussion of scope, implementation, and feedback
Read-only replicas HDFS-5318 [issues.apache.org]
Related READ_ONLY_SHARED work; excellent design doc
{cdoug,vijala}@microsoft.com
Alternative approaches: Client-driven tiering
Existing solutions: ViewFS/HADOOP-12077
Challenges
Maintain synchronized client views
Enforcing storage quotas, rate-limiting reads, etc. falls on the client
Clients need sufficient privileges to read/write data
Client is responsible for maintaining the system in a consistent state
Must recover partially completed operations left by other clients
What is DBT - The Ultimate Data Build Tool.pdf
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 

HDFS Tiered Storage Provides Access to Remote Data Stores

  • 1. HDFS Tiered Storage Chris Douglas, Virajith Jalaparti Microsoft CISL
  • 2. >id Microsoft Cloud and Information Services Lab (CISL) Applied research group in large-scale systems and machine learning Contributions to Apache Hadoop YARN Preemption, reservations/planning, federation, distributed sched. Apache REEF: control-plane for big data systems Chris Douglas (cdoug@microsoft.com) Contributor to Apache Hadoop since 2007, member of its PMC Virajith Jalaparti (vijala@microsoft.com)
  • 3. Data in Hadoop All data in one place Tools written against abstractions Compatible FileSystems (Azure/S3/etc.) Multi-tenant Management APIs Quotas, auth, encryption, media Works well if all data is in one cluster
  • 4. In most cases, we have multiple clusters… Multiple storage clusters Production/research partitioning Compliance and regulatory restrictions Datasets can be shared Geographically distributed clusters Disaster recovery Cloud backup/Hybrid clouds Heterogeneous storage tiers in a cluster Compute + Storage Compute + Storage wasb://… hdfs://b/ hdfs://a/
  • 5. Managing multiple clusters: Today Using the framework Copy data (distcp) between clusters (+) Clients process local copies, no visible partial copies (-) Uses compute resources, requires capacity planning Using the application Directly access data in multiple clusters (+) Consistency managed at client (-) Auth to all data sources, consistency is hard, no opportunities for transparent caching D A hdfs://a/ hdfs://b/ A r/w hdfs://a/ hdfs://b/ r/w
  • 6. Managing multiple clusters: Our proposal Tiering: Using the platform Synchronize storage with remote namespace (+) Transparent to users, caching/prefetching, unified namespace (-) Conflicts may be unresolvable Use HDFS to coordinate external storage No capability or performance gap Support for heterogeneous media (RAM/SSD/DISK), rebalancing, security, quotas, etc. A hdfs://a/ hdfs://b/ r/w mount
  • 7. Challenges Synchronize metadata without copying data Dynamically page in “blocks” on demand Define policies to prefetch and evict local replicas Mirror changes in remote namespace Handle out-of-band churn in remote storage Avoid dropping valid, cached data (e.g., rename) Handle writes consistently Writes committed to the backing store must “make sense”
  • 8. Proposal: Provided Storage Type Peer to RAM, SSD, DISK in HDFS (HDFS-2832) Data in external store mapped to HDFS blocks Each block associated with an Alias = (REF, nonce) Used to map blocks to external data Nonce used to detect changes on backing store E.g.: REF = (file URI, offset, length); nonce = GUID Mapping stored in a BlockMap KV store accessible by NN and all DNs ProvidedVolume on Datanodes reads/writes data from/to external store [diagram: the NN’s FSNamesystem maps /a/foo → b_i, …, b_j and /adl/bar → b_k, …, b_l; its BlockManager maps b_i → {s1, s2, s3} and b_k → {s_PROVIDED}; the BlockMap maps b_k → Alias_k, …; DN1 and DN2 expose RAM_DISK, SSD, DISK, and PROVIDED volumes backed by the external store]
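The Alias/BlockMap split above can be sketched in a few lines. This is a toy Python model, not the actual HDFS-9806 code; the names `BlockAlias` and `BlockMap` simply mirror the slide's terms. It shows how a nonce mismatch on the backing store is surfaced at lookup time:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BlockAlias:
    """Maps an HDFS block to a region of an external store."""
    uri: str      # REF: path resolvable in the external namespace
    offset: int   # REF: byte offset of this block within the file
    length: int   # REF: number of bytes in this block
    nonce: str    # e.g. a GUID; detects out-of-band changes

class BlockMap:
    """KV store from block ID to alias, readable by the NN and all DNs."""
    def __init__(self):
        self._aliases = {}

    def put(self, block_id, alias):
        self._aliases[block_id] = alias

    def resolve(self, block_id, current_nonce):
        """Return the alias only if the backing data is unchanged."""
        alias = self._aliases[block_id]
        if alias.nonce != current_nonce:
            raise IOError(f"block {block_id}: backing store changed")
        return alias
```

A DN serving a PROVIDED replica would `resolve` the block before reading, refusing to return data whose nonce no longer matches what the store reports.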
  • 9. Example: Using an immutable cloud store [diagram: the subtree /c/d of an external namespace ext://nn is mounted into an HDFS cluster; a client issues read(/d/e) to the NN, and DN1/DN2 serve it by issuing read(/c/d/e) against the external store and streaming the file data back]
  • 10. Example: Using an immutable cloud store Create FSImage and BlockMap Block StoragePolicy can be set as required E.g. {rep=2, PROVIDED, DISK} [FSImage: /d/e → {b1, b2, …}; /d/f/z1 → {b_i, b_i+1, …}; b_i → {rep=1, PROVIDED}. BlockMap: b_i → {(ext://nn/c/d/f/z1, 0, L), inodeId1}; b_i+1 → {(ext://nn/c/d/f/z1, L, 2L), inodeId1}. External namespace: ext://nn with mounted subtree /c/d]
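Generating the image and block map for one external file amounts to splitting its byte range into fixed-size blocks and recording an alias per block. A sketch under stated assumptions (function and variable names are illustrative, not the real FSImage tooling; the inode ID stands in for the nonce, as the slide suggests):

```python
def partition_file(ext_uri, file_length, block_size, inode_id, next_block_id):
    """Split one external file into PROVIDED block entries.

    Returns (block_ids, blockmap): the FSImage side keeps only the
    block IDs and storage policy, while the blockmap keeps each
    block's alias as an (uri, offset, length, nonce) tuple.
    """
    block_ids, blockmap = [], {}
    offset, bid = 0, next_block_id
    while offset < file_length:
        length = min(block_size, file_length - offset)
        block_ids.append(bid)
        blockmap[bid] = (ext_uri, offset, length, inode_id)
        offset += length
        bid += 1
    return block_ids, blockmap
```

For example, a 1 GiB file with 256 MiB blocks yields four aliases at offsets 0, L, 2L, 3L, matching the b_i, b_i+1, … entries on the slide.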
  • 11. Example: Using an immutable cloud store Start NN with the FSImage Replication > 1 starts copying to local media All blocks reachable from NN when a DN with PROVIDED storage heartbeats in In contrast to READ_ONLY_SHARED (HDFS-5318) [diagram: NN BlockManager with DN1 and DN2, mirroring the external namespace subtree d/{e, f, g}]
  • 12. Example: Using an immutable cloud store Block locations stored as a composite DN Contains all DNs with the storage configured Resolved in getBlockLocation() to a single DN DN looks up block in BlockMap, uses Alias to read from external store Data can be cached locally as it is read (read-through cache) [flow: DFSClient calls getBlockLocation(“/d/f/z1”, 0, L); the NN returns LocatedBlocks {{DN2, b_i, PROVIDED}}; DN2 does lookup(b_i), gets (“/c/d/f/z1”, 0, L, GUID1), and reads from the external store]
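The read path on this slide, with its read-through cache, can be modeled in a few lines. This is a toy sketch, not the DN's actual `ProvidedVolume` implementation; `ExternalStore` and `ProvidedDatanode` are hypothetical names, and the blockmap entries follow the (uri, offset, length, nonce) alias shape from earlier slides:

```python
class ExternalStore:
    """Toy immutable object store: path -> bytes."""
    def __init__(self, objects):
        self.objects = objects

    def read(self, uri, offset, length):
        return self.objects[uri][offset:offset + length]

class ProvidedDatanode:
    """DN with a PROVIDED volume and a local read-through cache."""
    def __init__(self, blockmap, store):
        self.blockmap = blockmap
        self.store = store
        self.cache = {}   # block_id -> bytes (the cached local replica)

    def read_block(self, block_id):
        if block_id in self.cache:            # serve the local replica
            return self.cache[block_id]
        uri, off, length, _nonce = self.blockmap[block_id]
        data = self.store.read(uri, off, length)
        self.cache[block_id] = data           # read-through caching
        return data
```

Because the client reads through the DN, the DN can keep a local copy as a side effect, which is what lets the NN combine placement policies with caching.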
  • 13. Benefits of the PROVIDED design Use existing HDFS features to enforce quotas, limits on storage tiers Simpler implementation, no mismatch between HDFS invariants and framework Supports different types of back-end stores org.apache.hadoop.FileSystem, blob stores, etc. Enables several policies to improve performance Set replication in FSImage to pre-fetch Read-through cache Actively pre-fetch while cluster is running Set StoragePolicy for the file to prefetch Credentials hidden from client Only NN and DNs require credentials of external store HDFS can be used to enforce access controls for remote store
  • 14. Handling out-of-band changes Nonce for correctness Asynchronously poll external store Integrate detected changes to the NN Update BlockMap on file creation/deletion Consensus, shared log, etc. Tighter NS integration complements provided store abstraction Operations like rename can cause unnecessary evictions Heuristics based on common rename scenarios (e.g., output promotion) to assign block ids
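One illustrative way to keep renames from evicting valid cached replicas: if the external store exposes a stable identity per object (the nonce, e.g. an inode ID), a periodic scan can match surviving nonces back to their existing block IDs, so a renamed file keeps its blocks. This is a hedged sketch of the idea, not the heuristics proposed in HDFS-9806; `refresh` and its argument shapes are invented for illustration:

```python
def refresh(blockmap, scanned, next_block_id):
    """Reconcile a blockmap {bid: (uri, off, len, nonce)} against a
    fresh scan [(uri, off, len, nonce), ...] of the external store."""
    by_nonce = {alias[3]: bid for bid, alias in blockmap.items()}
    seen = set()
    for uri, off, length, nonce in scanned:
        if nonce in by_nonce:                 # same data, maybe renamed:
            bid = by_nonce[nonce]             # keep the block ID, so any
        else:                                 # cached replica stays valid
            bid, next_block_id = next_block_id, next_block_id + 1
        blockmap[bid] = (uri, off, length, nonce)
        seen.add(bid)
    for bid in [b for b in blockmap if b not in seen]:
        del blockmap[bid]                     # data deleted out of band
    return next_block_id
```

A naive scan that keyed only on paths would treat output promotion (rename) as a delete plus create and recopy blocks at exactly the moment they are consumed; matching on the nonce avoids that.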
  • 15. Assumptions Churn is rare and relatively predictable Analytic workloads, ETL into external/cloud storage, compute in cluster Clusters are either consumers/producers for a subtree/region FileSystem has too little information to resolve conflicts Clients can recognize/ignore inconsistent states External stores can tighten these semantics Independent of PROVIDED storage
  • 16. Implementation roadmap Read-only image (with periodic, naive refresh) ViewFS-based: NN configured to refresh from root Mount within an existing NN Refresh view of remote cluster and sync Write-through Cloud backup: no namespace in external store, replication only Return to writer only when data are committed to external store Write-back Lazily replicate to external store
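The write-through step of the roadmap can be summarized as: acknowledge the writer only once the external store has durably committed the data. A toy sketch of that contract (class and method names are hypothetical; a write-back variant would instead enqueue the upload and acknowledge immediately):

```python
class ToyStore:
    """Stand-in external store; `fail` simulates a commit failure."""
    def __init__(self):
        self.objects = {}
        self.fail = False

    def put(self, uri, data):
        if self.fail:
            return False
        self.objects[uri] = bytes(data)
        return True

class WriteThroughVolume:
    """Cloud-backup sketch: a write completes only after the external
    store commit succeeds, so readers never see uncommitted state."""
    def __init__(self, store):
        self.store = store
        self.local = {}   # block_id -> locally replicated bytes

    def write_block(self, block_id, uri, data):
        self.local[block_id] = data           # local replica first
        if not self.store.put(uri, data):     # then commit remotely
            del self.local[block_id]          # don't expose failed writes
            raise IOError("external commit failed")
        return block_id
```

Placing this logic inside HDFS, rather than in each client, is what lets bandwidth limits or cost/priority models be applied in one place.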
  • 17. Resources Tiered Storage HDFS-9806 [issues.apache.org] Design documentation List of subtasks – take one! Discussion of scope, implementation, and feedback Read-only replicas HDFS-5318 [issues.apache.org] Related READ_ONLY_SHARED work; excellent design doc {cdoug,vijala}@microsoft.com
  • 18. Alternative approaches: Client-driven tiering Existing solutions: ViewFS/HADOOP-12077 Challenges Maintain synchronized client views Enforcing storage quotas, rate-limiting reads, etc. falls upon the client Clients need sufficient privileges to read/write data Client is responsible for maintaining the system in a consistent state Need to recover partially completed operations from other clients

Editor's Notes

  1. Welcome. Thanks for coming. We’re discussing a proposal for implementing tiering in HDFS, building on its support for heterogeneous storage.
  2. We’re members of the Microsoft C.I.S.L., an applied research lab that publishes papers, builds prototypes, and writes production code for Microsoft clusters... but who cares about that? We work in open source, particularly Apache projects, particularly Apache Hadoop. REEF, which came out of CISL, is like a stdlib for resource management frameworks, including YARN and Mesos. [CD] intro [VJ] intro
  3. Hadoop gained traction by putting all of an org’s data in one place, in common formats, to be processed by common tools. Different applications get a consistent view of their data from HDFS. Data is protected and managed by a set of user and operator invariants that assign quotas, authenticate users, encrypt data, and distribute it across heterogeneous media. If you have only one source of data to process using that abstraction, then you get to enjoy nice things and the rest of us will sullenly resent you.
  4. However, reality is far removed from this. In most companies that deal with data, big or small, multiple clusters store that data. You typically have multiple production clusters, either owned by different groups or kept separate due to compliance, privacy, or regulatory restrictions, and some datasets are shared across them. Also, for scenarios like BCP or backing up to the cloud, we have to deal with geographically separated storage systems, which might be different systems altogether -- for example, you might be running HDFS locally but backing up to Azure Blob store. Further, many clusters today have different storage devices or tiers like RAM/SSD/Disk within a single cluster. In such cases, we would like to make efficient and performant use of these storage tiers, for example, by placing the hottest data in RAM and the cold data on DISK or tape.
  5. In most cases, these multiple clusters and different tiers of storage are managed today using two main techniques. The first is to use the framework: for example, people run distcp jobs to copy data from one storage cluster to another. While this lets clients process local copies of data and leaves no visible intermediate state, it needs compute resources and manual capacity planning. The second is to use the application to handle multiple clusters: the application can be made aware that data lives in multiple clusters and can read from each one separately while reasoning about the data’s consistency. However, each application must then implement techniques to coordinate these reads and authenticate to the different sources, and this leaves no opportunity for transparent caching or prefetching to improve performance.
  6. Our proposal is to use the platform to manage multiple storage clusters. So, we propose to use the storage layer to manage the multiple external storages. This allows us to use different storages for multiple applications and users in a transparent manner, we can use local storage to cache data from remote storage and have a single uniform namespace across multiple storage systems, which can be in the same building or on the other side of the world, in the cloud. In this talk, we are going to describe how we can enable HDFS to do this – how we can mount external storage systems in HDFS. This allows us to exploit all the capabilities and features that HDFS supports such as quotas, and security in accessing the different storage systems.
  7. XXX CUT XXX We introduce a new provided storage type that will be a peer to existing storage types. In HDFS today, the NN is partitioned into a namespace (FSNamesystem) that maps files to block IDs, and block lifecycle management in the BlockManager. Each file is a sequence of block IDs in the namespace. Each block ID is mapped to a list of replicas resident on a storage attached to a Datanode. A storage is a device (DISK, SSD, memory region) attached to a Datanode. Because HDFS understands blocks, files in provided storage have a similar mapping, but we also need a mapping from these blocks to how the data is laid out in the provided store. For this, replicas of a block in “provided” storage are mapped to an alias. An alias is simply a tuple: a reference resolvable in the namespace of the external store, and a nonce to verify that the reference still locates the data matching that block. If my external store is another FileSystem, then my reference may be a (URI, offset, length) tuple, and the nonce includes an inode/fileID, modification time, etc. Finally, we have provided volumes in Datanodes, which are used to read and write data from the external store. The provided volume essentially implements a client capable of talking to the external store.
  8. To understand how this would work in practice, let’s look at a simple example where we want to access external cloud storage through HDFS. Let’s ignore writes for now. -> Now suppose this is the part of the namespace we want to -> mount in HDFS. -> If the mount is successful, we should be able to access data in the cloud through HDFS. That is, -> if a client requests a particular file, say /d/e, from HDFS, then HDFS should be -> able to read the file from the external store, -> get the data back from the external store, and -> stream the data back to the client. Now I will hand it over to Chris to explain how we make all of this happen using the PROVIDED abstraction I just introduced.
  9. Let’s drill down into an example. Assume we want to mount this external namespace into HDFS. Rather, this subtree. [] We can generate a mirror of the metadata as an FSImage (checkpoint of NN state). For every file, we also partition it into blocks, and store the reference in the blockmap with the corresponding nonce. [] Note that the image contains only the block IDs and storage policy, while the blockmap stores the block alias. So if file /c/d/e were 1GB, the image could record 4 logical blocks. For each block, the blockmap would record the reference (URI,offset,len) and a nonce (inodeId, LMT) sufficient to detect inconsistency.
  10. A quick note on block reports, if those are unfamiliar. By the way: if any of this is unfamiliar, please speak up The NN persists metadata about blocks, but their location in the cluster is reported by DNs. Each DN reports the volumes (HDD, SSD) attached to it, and a list of block IDs stored in each. At startup, the NN comes out of safe mode (starts accepting writes) when some fraction of its namespace is available. [] When a DN reports its provided storage, it does not send a full block report for the provided storage (which is, recall, a peer of its local media). It only reports that any block stored therein is reachable through it. As long as the NN has at least one DN reporting that provided storage, it considers all the blocks in the block map as reachable. The NN scans the block map to discover DN blocks in that provided storage. This is in contrast to some existing work supporting read-only replicas, where every DN sends a block report of the shared data, as when multiple DNs mount a filer.
  11. ORIG Inside the NN, blocks are stored as a composite.
  12. Inside the NN, we relax the invariant that a storage (an HDD/SSD) belongs to only one DN. So when a client requests the block locations for a file (here z1) [] the NN will report all the local replicas and select a single PROV replica, say the one closest to the client. This avoids reporting every DN as a valid target, which is accurate but not helpful for applications. [] When the client requests the PROV block from the DN, the DN will look up the block in the blockmap, [] find the block alias, resolve the reference, [] request the block data from the external store, [] and return the data to the client, having verified the nonce. [] Because the block is read through the DN, we can also cache the data as a local block.
  13. There are a few points worth calling out, here. * First, this is a relatively small change to HDFS. The only client-visible change adds a new storage type. As a user, this is simpler than coordinating with copying jobs. In our cloud example, all the cluster’s data is immediately available once it’s in the namespace, even if the replication policy hasn’t prefetched data into local media. * Second, particularly for read-only mounts, this is a narrow API to implement. For cloud backup scenarios- where the NN is the only writer to the namespace- then we only need the block to object ID map and NN metadata to mount a prefix/snapshot of the cluster. * Third, because the client reads through the DN, it can cache a copy of the block on read. Pointedly, the NN can direct the client to any DN that should cache a copy on read, opening some interesting combinations of placement policies and read-through caching. The DN isn’t necessarily the closest to the client, but it may follow another objective function or replication policy. * Finally, in our example the cloud credentials are hidden from the client. S3/WAS both authenticate clients to containers using a single key. Because HDFS owns and protects the external store’s credentials, the client only accesses data permitted by HDFS. Generally, we can use features of HDFS that aren’t directly supported by the backing store if we can define the mapping.
  14. It’s imperative that we never return the wrong data. If a file were overwritten in the backing store, we must never return part of the first file and part of the second. The nonce is what we use to protect ourselves from that. But there needs to be some way to ingest new data into HDFS. If our external store has a namespace compatible with FS, then we can always scan it, but while refresh is limited to scans, the view to the client can be inconsistent. A client may see some unpromoted output, some promoted output, and a sentinel file declaring it completely promoted. Better cooperation with external stores can tighten the namespace integration, to expose meaningful states. For example, if the external store could expose meaningful snapshots, then HDFS could move from one to the next, maintaining a read-only version while it updates. If the path remains valid while the NN updates itself, we have clearer guarantees. For anyone familiar with CORFU and Tango (MSR, Dahlia Malkhi, Mahesh Balakrishnan, Ted Wobber), or with WANdisco’s integration of their Paxos engine with the NN, we can make the metadata sync tight and meaningful. We still need the logic at the block layer we’re adding as provided storage. After correctness, we also need to be mindful of efficiency. Output is often promoted by renaming it, and if the NN were to interpret that as a deletion and creation, our HDFS cluster would discard blocks just to recopy them, right at the moment they are consumed. One of our goals is to conservatively identify these cases based on actual workloads.
  15. Since I mentioned strong consensus engines, this isn’t a “real” shared namespace. Even the read-only case is eventually consistent; in the base case we’re scanning the entire subtree in the external store. That’s obviously not workable, but most bigdata workloads don’t hit pathological cases. The typical year/month/day/hour layouts common to analytics clusters are mostly additive, and this is sufficient for that case. * When writes conflict, there is only so much the FS can do to merge conflicts. Set aside really complex cases like compactions; even simple cases may not merge cleanly. If a user creates a directory that is also present in the external store, can that be merged? Maybe not; successful creation might be gating access; many frameworks in the Hadoop ecosystem follow conventions that rely on atomicity of operations in HDFS. * The permissions, timestamps, or storage policy may not match, and there isn’t a “correct” answer for the merged result (absent application semantics). * So we assume that, generally (or by construction), clusters will be either producers or consumers for some part of the shared namespace. Fundamentally: no magic, here. We haven’t made any breakthroughs in consensus, but provided storage is a tractable solution that happens to cover some common cases/deployments in its early incarnations, and from a R&D perspective, some very interesting problems in the policy space. Please find us after the talk, we love to talk about this.
  16. The implementation will be staged. The read-only case is relatively straightforward; we implemented a proof-of-concept spread over a few weeks. A link is posted to JIRA. We will start with a NN managing an external store, merged using federation (ViewFS). This lets us defer the mounting logic, which would otherwise interfere with NN operation. We will then explore strategies for creating and maintaining mounts in the primary NN, alongside other data. For those familiar with the NN locking and the formidable challenge of relaxing it, note that most of the invariants we’d enforce don’t apply inside the mount. Quotas aren’t enforced, renames outside can be disallowed, etc. So it may be possible to embed this in the NN. Refresh will start as naive scans, then improve. Identifying subtrees that change and/or are accessed more frequently could improve the general case, but polling is fundamentally limited. Given some experience, we can recognize the common abstractions when tiering over stores that expose version information, snapshots, etc. and write some tighter integrations. Writes are complex, so we will move from working system to working system. We’re wiring the PROV type into the DN state machines, so the write-through case should be tractable, particularly when the external store behind the provided abstraction is an object store. Ultimately, we’d like to use local storage to batch- or even prioritize- writes to the external store. Because HDFS sits between the client and the external store: if we have limited bandwidth, want to apply cost or priority models, etc. these can be embedded in HDFS.
  17. Please join us. We have a design document posted to JIRA, an active discussion of the implementation choices, and we’ll be starting a branch to host these changes. The existing work on READ_ONLY_SHARED replicas has a superlative design doc, if you want to contribute but need some orientation in the internal details. We have a few minutes for questions, but please find us after the talk. There are far more details than we can possibly cover in a single presentation and we’re still setting the design, so we’re very open to collaboration. Thanks, and... let’s take a couple questions.