This document proposes a design for tiered storage in HDFS that allows data to be stored in heterogeneous storage tiers including an external storage system. It describes challenges in synchronizing metadata and data across clusters and proposes using HDFS to coordinate an external storage system in a transparent way to users. The "PROVIDED" storage type would allow blocks to be retrieved directly from the external store via aliases, handling data consistency and security while leveraging HDFS features like quotas and replication policies. Implementation would start with read-only support and progress to full read-write capabilities.
2. Speakers
Microsoft Cloud and Information Services Lab (CISL)
Applied research group in large-scale systems and machine learning
Contributions to Apache Hadoop YARN
Preemption, reservations/planning, federation, distributed scheduling
Apache REEF: control-plane for big data systems
Chris Douglas (cdoug@microsoft.com)
Contributor to Apache Hadoop since 2007, member of its PMC
Virajith Jalaparti (vijala@microsoft.com)
3. Data in Hadoop
All data in one place
Tools written against abstractions
Compatible FileSystems (Azure/S3/etc.)
Multi-tenant
Management APIs
Quotas, auth, encryption, media
Works well if all data is in one cluster
4. In most cases, we have multiple clusters…
Multiple storage clusters
Production/research partitioning
Compliance and regulatory restrictions
Datasets can be shared
Geographically distributed clusters
Disaster recovery
Cloud backup/Hybrid clouds
Heterogeneous storage tiers in a cluster
[Diagram: two compute + storage clusters, hdfs://a/ and hdfs://b/, alongside a cloud store wasb://…]
5. Managing multiple clusters: Today
Using the framework
Copy data (distcp) between clusters
(+) Clients process local copies, no visible partial copies
(-) Uses compute resources, requires capacity planning
Using the application
Directly access data in multiple clusters
(+) Consistency managed at client
(-) Auth to all data sources, consistency is hard, no opportunities for transparent caching
[Diagram: left, distcp copies dataset A between hdfs://a/ and hdfs://b/; right, a client reads/writes both hdfs://a/ and hdfs://b/ directly]
6. Managing multiple clusters: Our proposal
Tiering: Using the platform
Synchronize storage with remote namespace
(+) Transparent to users, caching/prefetching, unified namespace
(-) Conflicts may be unresolvable
Use HDFS to coordinate external storage
No capability or performance gap
Support for heterogeneous media (RAM/SSD/DISK), rebalancing, security, quotas, etc.
[Diagram: hdfs://b/ mounted into hdfs://a/; clients read/write dataset A through the mount]
7. Challenges
Synchronize metadata without copying data
Dynamically page in “blocks” on demand
Define policies to prefetch and evict local replicas
Mirror changes in remote namespace
Handle out-of-band churn in remote storage
Avoid dropping valid, cached data (e.g., rename)
Handle writes consistently
Writes committed to the backing store must “make sense”
8. Proposal: Provided Storage Type
Peer to RAM, SSD, DISK in HDFS (HDFS-2832)
Data in external store mapped to HDFS blocks
Each block associated with an Alias = (REF, nonce)
Used to map blocks to external data
Nonce used to detect changes on backing store
E.g.: REF = (file URI, offset, length); nonce = GUID
Mapping stored in a BlockMap
KV store accessible by NN and all DNs
ProvidedVolume on Datanodes reads/writes data from/to external store
[Diagram: NN (FSNamesystem + BlockManager) coordinating DN1, DN2, and the external store via the BlockMap:
/a/foo → b_i, …, b_j; b_i → {s1, s2, s3}
/adl/bar → b_k, …, b_l; b_k → {s_PROVIDED}
BlockMap: b_k → Alias_k, …]
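The Alias and BlockMap described above can be sketched in a few lines. This is a hypothetical Python illustration, not HDFS code; the names `Alias`, `put_block`, and `resolve` are invented for this sketch, and a plain dict stands in for the KV store shared by the NN and DNs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Alias:
    """Alias = (REF, nonce): REF locates the block's bytes in the
    external store; the nonce detects out-of-band changes."""
    uri: str      # REF: file URI in the external store
    offset: int   # REF: byte offset of this block's data
    length: int   # REF: block length
    nonce: str    # e.g. a GUID, or (inodeId, mtime) from the external store

# KV store from HDFS block ID -> Alias, accessible by NN and all DNs.
block_map = {}

def put_block(block_id, alias):
    block_map[block_id] = alias

def resolve(block_id, observed_nonce):
    """Return the external reference for a block, or None if the
    backing data changed out from under us (nonce mismatch)."""
    alias = block_map.get(block_id)
    if alias is None or alias.nonce != observed_nonce:
        return None  # never serve stale or mismatched data
    return (alias.uri, alias.offset, alias.length)
```

The nonce check is the key invariant: a stale alias resolves to nothing rather than to the wrong bytes.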
RAM_DISK SSD DISK PROVIDED
9. Example: Using an immutable cloud store
[Diagram: external namespace ext://nn (/ → {a, b, c}, c → {d}, d → {e, f, g}) mounted into an HDFS cluster (NN, DN1, DN2); a client issues read(/d/e), a DN reads /c/d/e from the external store and streams the file data back to the client]
10. Example: Using an immutable cloud store
FSImage:
/d/e → {b1, b2, …}
/d/f/z1 → {b_i, b_i+1, …}
b_i → {rep = 1, PROVIDED}
BlockMap:
b_i → {(ext://nn/c/d/f/z1, 0, L), inodeId1}
b_i+1 → {(ext://nn/c/d/f/z1, L, 2L), inodeId1}
Create FSImage and BlockMap
Block StoragePolicy can be set as required
E.g. {rep=2, PROVIDED, DISK }
[Diagram: external namespace ext://nn (/ → {a, b, c}, c → {d}, d → {e, f, g}); the external store holds the data]
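Generating the FSImage and BlockMap amounts to walking the external namespace and partitioning each file into fixed-size logical blocks. A minimal sketch, with invented names (`build_image`) and plain dicts standing in for the FSImage and BlockMap:

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed logical block size L

def build_image(files, block_size=BLOCK_SIZE):
    """Hypothetical FSImage/BlockMap generation for a mounted subtree.
    `files` maps external path -> (length, nonce)."""
    fsimage, blockmap = {}, {}
    next_id = 1
    for path, (length, nonce) in sorted(files.items()):
        ids = []
        for off in range(0, length, block_size):
            # One BlockMap entry per logical block: (REF, nonce).
            blockmap[next_id] = (path, off, min(block_size, length - off), nonce)
            ids.append(next_id)
            next_id += 1
        fsimage[path] = ids  # the image records only block IDs (+ policy)
    return fsimage, blockmap
```

As the slide notes, the image carries only block IDs and storage policy; the per-block reference and nonce live in the BlockMap.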
11. Example: Using an immutable cloud store
Start NN with the FSImage
Replication > 1 starts copying to local media
All blocks reachable from NN when a DN with PROVIDED storage heartbeats in
In contrast to READ_ONLY_SHARED (HDFS-5318)
[Diagram: NN (BlockManager) loaded with the FSImage and BlockMap; DN1 and DN2; the mounted subtree d → {e, f, g} of the external namespace]
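The block-report behavior above can be sketched as follows: a DN never enumerates provided blocks; it only reports that the provided storage is attached, and the NN then treats every block in the BlockMap as reachable. A hypothetical illustration (invented function name, dicts for the heartbeat state):

```python
def provided_blocks_reachable(dn_storages, block_map):
    """dn_storages: dn -> set of storage types it heartbeated with.
    All BlockMap blocks become reachable once any DN reports a
    PROVIDED storage; otherwise none are."""
    if any("PROVIDED" in types for types in dn_storages.values()):
        return set(block_map)
    return set()
```

This is the contrast with READ_ONLY_SHARED (HDFS-5318), where every DN sends a full block report for the shared data.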
12. Example: Using an immutable cloud store
Block locations stored as a composite DN
Contains all DNs with the storage configured
Resolved in getBlockLocation() to a single DN
DN looks up block in BlockMap, uses Alias to read from external store
Data can be cached locally as it is read (read-through cache)
[Diagram: DFSClient → NN: getBlockLocation("/d/f/z1", 0, L); NN returns LocatedBlocks {{DN2, b_i, PROVIDED}}; DN2 → BlockMap: lookup(b_i) → ("/c/d/f/z1", 0, L, GUID1); DN2 reads from the external store]
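The DN-side read path above (BlockMap lookup, nonce check, read-through cache) can be sketched as a small simulation. All names here are invented for illustration; dicts stand in for the external store and the DN's local media:

```python
external_store = {}   # path -> (nonce, bytes); stands in for ext://nn
local_cache = {}      # block_id -> bytes cached on DN local media

def dn_read(block_id, block_map):
    """Serve a PROVIDED block: prefer the local replica, otherwise
    resolve the Alias, verify the nonce, and cache on the way through."""
    if block_id in local_cache:
        return local_cache[block_id]          # local replica hit
    path, off, length, nonce = block_map[block_id]
    actual_nonce, data = external_store[path]
    if actual_nonce != nonce:
        raise IOError("stale alias for block %d" % block_id)
    chunk = data[off:off + length]
    local_cache[block_id] = chunk             # read-through caching
    return chunk
```

Because the read goes through the DN, caching, placement policy, and the nonce check all happen server-side, invisible to the client.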
13. Benefits of the PROVIDED design
Use existing HDFS features to enforce quotas, limits on storage tiers
Simpler implementation, no mismatch between HDFS invariants and framework
Supports different types of back-end stores
org.apache.hadoop.FileSystem, blob stores, etc.
Enables several policies to improve performance
Set replication in FSImage to pre-fetch
Read-through cache
Actively pre-fetch while cluster is running
Set StoragePolicy for the file to prefetch
Credentials hidden from client
Only NN and DNs require credentials of external store
HDFS can be used to enforce access controls for remote store
14. Handling out-of-band changes
Nonce for correctness
Asynchronously poll external store
Integrate detected changes to the NN
Update BlockMap on file creation/deletion
Consensus, shared log, etc.
Tighter NS integration complements provided store abstraction
Operations like rename can cause unnecessary evictions
Heuristics based on common rename scenarios (e.g., output promotion) to assign block ids
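One such heuristic can be sketched as follows: during a refresh, if a path disappeared and a path with the same (length, nonce) identity appeared, treat it as a rename and carry the existing block IDs over instead of evicting and recopying. This is a hypothetical sketch (invented `refresh` function and tuple layout), not the proposed implementation:

```python
def refresh(mount, snapshot):
    """mount: path -> (length, nonce, block_ids) as currently mounted;
    snapshot: path -> (length, nonce) freshly scanned from the store."""
    # Files that vanished, indexed by identity: rename candidates.
    moved = {(l, n): ids for p, (l, n, ids) in mount.items()
             if p not in snapshot}
    new_mount = {}
    for path, (length, nonce) in snapshot.items():
        if path in mount and mount[path][1] == nonce:
            new_mount[path] = mount[path]                    # unchanged
        elif (length, nonce) in moved:
            new_mount[path] = (length, nonce, moved[(length, nonce)])  # rename: keep blocks
        else:
            new_mount[path] = (length, nonce, ())            # new file: assign blocks later
    return new_mount
```

Matching on (length, nonce) is conservative: a false negative only costs a recopy, never wrong data.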
15. Assumptions
Churn is rare and relatively predictable
Analytic workloads, ETL into external/cloud storage, compute in cluster
Clusters are either consumers/producers for a subtree/region
FileSystem has too little information to resolve conflicts
Clients can recognize/ignore inconsistent states
External stores can tighten these semantics
Independent of PROVIDED storage
16. Implementation roadmap
Read-only image (with periodic, naive refresh)
ViewFS-based: NN configured to refresh from root
Mount within an existing NN
Refresh view of remote cluster and sync
Write-through
Cloud backup: no namespace in external store, replication only
Return to writer only when data are committed to external store
Write-back
Lazily replicate to external store
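The write-through/write-back distinction above can be sketched in a few lines. A hypothetical illustration (invented names; dicts stand in for local media and the external store):

```python
from collections import deque

pending = deque()  # write-back queue of (block_id, data)

def write(block_id, data, local, external, mode="write-through"):
    local[block_id] = data
    if mode == "write-through":
        external[block_id] = data          # durable before the writer is acked
    else:
        pending.append((block_id, data))   # write-back: replicate lazily
    return "ack"

def flush(external):
    """Drain the write-back queue to the external store."""
    while pending:
        block_id, data = pending.popleft()
        external[block_id] = data
```

Write-through only acknowledges the writer once the external store has the data; write-back trades that guarantee for latency, which is where batching and prioritization policies could plug in.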
17. Resources
Tiered Storage HDFS-9806 [issues.apache.org]
Design documentation
List of subtasks – take one!
Discussion of scope, implementation, and feedback
Read-only replicas HDFS-5318 [issues.apache.org]
Related READ_ONLY_SHARED work; excellent design doc
{cdoug,vijala}@microsoft.com
18. Alternative approaches: Client-driven tiering
Existing solutions: ViewFS/HADOOP-12077
Challenges
Maintain synchronized client views
Enforcing storage quotas, rate-limiting reads, etc. falls on the client
Clients need sufficient privileges to read/write data
Client is responsible for maintaining the system in a consistent state
Need to recover partially completed operations from other clients
Editor's Notes
Welcome. Thanks for coming. We’re discussing a proposal for implementing tiering in HDFS, building on its support for heterogeneous storage.
We’re members of the Microsoft C.I.S.L., an applied research lab that publishes papers, builds prototypes, and writes production code for Microsoft clusters... but who cares about that? We work in open source, particularly Apache projects, particularly Apache Hadoop. REEF, which came out of CISL, is like a stdlib for resource management frameworks, including YARN and Mesos.
[CD] intro
[VJ] intro
Hadoop gained traction by putting all of an org’s data in one place, in common formats, to be processed by common tools. Different applications get a consistent view of their data from HDFS. Data is protected and managed by a set of user and operator invariants that assign quotas, authenticate users, encrypt data, and distribute it across heterogeneous media.
If you have only one source of data to process using that abstraction, then you get to enjoy nice things and the rest of us will sullenly resent you.
However, reality is far removed from this.
In most companies that deal with data, big or small, there are multiple clusters storing it. You typically have multiple production clusters, either owned by different groups or kept separate due to compliance, privacy, or regulatory restrictions; and some datasets are shared across them.
Also, for scenarios like BCP or backing up to the cloud, we have to deal with geographically separate storage systems, which might be different systems altogether -- for example, you might be running HDFS locally but backing up to Azure Blob storage.
Further, many clusters today have different storage devices or tiers like RAM/SSD/Disk within a single cluster. In such cases, we would like to make efficient and performant use of these storage tiers, for example, by placing the hottest data in RAM and the cold data on DISK or tapes.
In most cases, these multiple clusters and different tiers of storage are managed today using two main techniques.
The first one is to use the framework: for example, people run distcp jobs to copy data over from one storage cluster to another. While this allows clients to process local copies of data and leaves no visible intermediate state, it needs compute resources and manual capacity planning.
The second one is to use the application to handle multiple clusters: the application can be made aware that data is in multiple clusters and can read from each one separately while reasoning about the data’s consistency. However, now each application must implement techniques to coordinate these reads and authenticate to different sources, and this leaves us with no opportunities for transparent caching or prefetching to improve performance.
Our proposal is to use the platform -- the storage layer -- to manage the multiple external storage systems. This lets us expose different stores to multiple applications and users transparently: we can use local storage to cache data from remote storage, and present a single uniform namespace across storage systems that may be in the same building or on the other side of the world, in the cloud.
In this talk, we are going to describe how we can enable HDFS to do this – how we can mount external storage systems in HDFS. This allows us to exploit all the capabilities and features that HDFS supports such as quotas, and security in accessing the different storage systems.
XXX CUT XXX
We introduce a new provided storage type which will be a peer to existing storage types.
So, in HDFS today, The NN is partitioned into a namespace (FSNamesystem) that maps files to block IDs, and the block lifecycle management in the BlockManager. Each file is a sequence of block IDs in the namespace. Each block ID is mapped to a list of replicas resident on a storage attached to a datanode. A storage is a device (DISK, SSD, memory region) attached to a Datanode.
Because HDFS understands blocks, even for files in the provided storage, we have a similar mapping. However, we also need to have some mapping of these blocks and how data is laid out in the provided store. For this, replicas of a block in “provided” storage are mapped to an alias. An alias is simply a tuple: a reference resolvable in the namespace of the external store, and a nonce to verify that the reference still locates the data matching that block.
If my external store is another FileSystem, then my reference may be a URI, offset, length. and the nonce includes an inode/fileID, modification time, etc.
Finally, we have provided volumes in Datanodes, which are used to read and write data from the external store. The provided volume essentially implements a client capable of talking to the external store.
To understand how this would work in practice, let’s look at a simple example where we want to access an external cloud storage through HDFS. Let’s ignore writes for now.
-> Now suppose, this is the part of the namespace we want to
-> mount in HDFS.
-> if the mount is successful, we should be able to access data in the cloud through HDFS. That is
-> if a client comes and requests for a particular file, say /d/e, from HDFS, then HDFS should be
-> able to read the file from the external store,
-> get back the data from the external store and
-> stream the data back to the client.
Now, I will hand it over to Chris to explain how we make all of this happen using the PROVIDED abstraction I just introduced.
Let’s drill down into an example. Assume we want to mount this external namespace into HDFS. Rather, this subtree.
[] We can generate a mirror of the metadata as an FSImage (checkpoint of NN state). For every file, we also partition it into blocks, and store the reference in the blockmap with the corresponding nonce.
[] Note that the image contains only the block IDs and storage policy, while the blockmap stores the block alias. So if file /c/d/e were 1GB, the image could record 4 logical blocks. For each block, the blockmap would record the reference (URI,offset,len) and a nonce (inodeId, LMT) sufficient to detect inconsistency.
A quick note on block reports, if those are unfamiliar. (By the way: if any of this is unfamiliar, please speak up.) The NN persists metadata about blocks, but their location in the cluster is reported by DNs. Each DN reports the volumes (HDD, SSD) attached to it, and a list of block IDs stored in each. At startup, the NN comes out of safe mode (starts accepting writes) when some fraction of its namespace is available.
[] When a DN reports its provided storage, it does not send a full block report for the provided storage (which is, recall, a peer of its local media). It only reports that any block stored therein is reachable through it. As long as the NN has at least one DN reporting that provided storage, it considers all the blocks in the block map as reachable. The NN scans the block map to discover DN blocks in that provided storage.
This is in contrast to some existing work supporting read-only replicas, where every DN sends a block report of the shared data, as when multiple DNs mount a filer.
Inside the NN, we relax the invariant that a storage (an HDD/SSD) belongs to only one DN. So when a client requests the block locations for a file (here z1)
[] the NN will report all the local replicas, and NN will select a single PROV replica, say closest to the client. This avoids reporting every DN as a valid target, which is accurate, but not helpful for applications.
[] When the client requests the PROV block from the DN, the DN will lookup the block in the blockmap
[] find the block alias, resolve the reference
[] request the block data from the external store
[] and return the data to the client, having verified the nonce
[] because the block is read through the DN, we can also cache the data as a local block.
There are a few points worth calling out, here.
* First, this is a relatively small change to HDFS. The only client-visible change adds a new storage type. As a user, this is simpler than coordinating with copying jobs. In our cloud example, all the cluster’s data is immediately available once it’s in the namespace, even if the replication policy hasn’t prefetched data into local media.
* Second, particularly for read-only mounts, this is a narrow API to implement. For cloud backup scenarios- where the NN is the only writer to the namespace- then we only need the block to object ID map and NN metadata to mount a prefix/snapshot of the cluster.
* Third, because the client reads through the DN, it can cache a copy of the block on read. Pointedly, the NN can direct the client to any DN that should cache a copy on read, opening some interesting combinations of placement policies and read-through caching. The DN isn’t necessarily the closest to the client, but it may follow another objective function or replication policy.
* Finally, in our example the cloud credentials are hidden from the client. S3/WAS both authenticate clients to containers using a single key. Because HDFS owns and protects the external store’s credentials, the client only accesses data permitted by HDFS. Generally, we can use features of HDFS that aren’t directly supported by the backing store if we can define the mapping.
It’s imperative that we never return the wrong data. If a file were overwritten in the backing store, we will never return part of the first file, and part of the second. The nonce is what we use to protect ourselves from that.
But there needs to be some way to ingest new data into HDFS. If our external store has a namespace compatible with FS, then we can always scan it, but...
while refresh is limited to scans, the view to the client can be inconsistent. A client may see some unpromoted output, some promoted output, and a sentinel file declaring it completely promoted. Better cooperation with external stores can tighten the namespace integration, to expose meaningful states. For example, if the external store could expose meaningful snapshots, then HDFS could move from one to the next, maintaining a read-only version while it updates. If the path remains valid while the NN updates itself, we have clearer guarantees.
For anyone familiar with CORFU and Tango (MSR, Dahlia Malkhi, Mahesh Balakrishnan, Ted Wobber), or with WANdisco’s integration of their Paxos engine with the NN, we can make the metadata sync tight and meaningful. We still need the logic at the block layer we’re adding as provided storage.
After correctness, we also need to be mindful of efficiency. Output is often promoted by renaming it, and if the NN were to interpret that as a deletion and creation, our HDFS cluster would discard blocks just to recopy them, right at the moment they are consumed. One of our goals is to conservatively identify these cases based on actual workloads.
Since I mentioned strong consensus engines: this isn’t a “real” shared namespace. Even the read-only case is eventually consistent; in the base case we’re scanning the entire subtree in the external store. That’s obviously not workable in general, but most big data workloads don’t hit pathological cases. The typical year/month/day/hour layouts common to analytics clusters are mostly additive, and this is sufficient for that case.
* When writes conflict, there is only so much the FS can do to merge conflicts. Set aside really complex cases like compactions; even simple cases may not merge cleanly. If a user creates a directory that is also present in the external store, can that be merged? Maybe not; successful creation might be gating access; many frameworks in the Hadoop ecosystem follow conventions that rely on atomicity of operations in HDFS.
* The permissions, timestamps, or storage policy may not match, and there isn’t a “correct” answer for the merged result (absent application semantics).
* So we assume that, generally (or by construction), clusters will be either producers or consumers for some part of the shared namespace.
Fundamentally: no magic, here. We haven’t made any breakthroughs in consensus, but provided storage is a tractable solution that happens to cover some common cases/deployments in its early incarnations, and from a R&D perspective, some very interesting problems in the policy space. Please find us after the talk, we love to talk about this.
The implementation will be staged. The read-only case is relatively straightforward; we implemented a proof-of-concept spread over a few weeks. A link is posted to JIRA.
We will start with a NN managing an external store, merged using federation (ViewFS). This lets us defer the mounting logic, which would otherwise interfere with NN operation. We will then explore strategies for creating and maintaining mounts in the primary NN, alongside other data. For those familiar with the NN locking and the formidable challenge of relaxing it, note that most of the invariants we’d enforce don’t apply inside the mount. Quotas aren’t enforced, renames outside can be disallowed, etc. So it may be possible to embed this in the NN.
Refresh will start as naive scans, then improve. Identifying subtrees that change and/or are accessed more frequently could improve the general case, but polling is fundamentally limited. Given some experience, we can recognize the common abstractions when tiering over stores that expose version information, snapshots, etc. and write some tighter integrations.
Writes are complex, so we will move from working system to working system. We’re wiring the PROV type into the DN state machines, so the write-through case should be tractable, particularly when the external store behind the provided abstraction is an object store.
Ultimately, we’d like to use local storage to batch- or even prioritize- writes to the external store. Because HDFS sits between the client and the external store: if we have limited bandwidth, want to apply cost or priority models, etc. these can be embedded in HDFS.
Please join us. We have a design document posted to JIRA, an active discussion of the implementation choices, and we’ll be starting a branch to host these changes. The existing work on READ_ONLY_SHARED replicas has a superlative design doc, if you want to contribute but need some orientation in the internal details.
We have a few minutes for questions, but please find us after the talk. There are far more details than we can possibly cover in a single presentation and we’re still setting the design, so we’re very open to collaboration. Thanks, and... let’s take a couple questions.