1. © Hortonworks Inc. 2011 - 2015
Ozone: An Object Store in HDFS
Jitendra Nath Pandey
jitendra@hortonworks.com
jitendra@apache.org
@jnathp
Page 1
2. About me
• Engineering Manager @Hortonworks
– Manager / Architect for HDFS at Hortonworks
• ASF Member
– PMC Member at Apache Hadoop
– PMC Member at Apache Ambari
– Committer in Apache Hive
3. Outline
• Introduction
• How Ozone fits in HDFS
• Ozone architecture
• Notes on implementation
• Q & A
5. Storage in Hadoop Ecosystem
• File system
– HDFS
• SQL Database
– Hive on HDFS
• NoSQL
– HBase on HDFS
• Object Store
– We need Ozone!
6. Object Store vs File System
• Object stores offer far more scale
– Trillions of objects are common
– Simpler semantics make it possible
• Wide range of object sizes
– A few KB to several GB
7. Ozone: Introduction
• Ozone: An object store in Hadoop
– Durable
– Reliable
– Highly Scalable
– Trillions of objects
– Wide range of object sizes
– Secure
– Highly Available
– REST API as the primary access interface
8. Ozone Introduction
• An Ozone URL
– http://hostname/myvolume/mybucket/mykey
• An S3 URL
– http://hostname/mybucket/mykey
• A Windows Azure URL
– http://hostname/myaccount/mybucket/mykey
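To make the path layouts above concrete, here is a tiny illustrative parser (the helper name is made up and is not part of any Ozone client) that splits an Ozone URL into its volume, bucket, and key components:

```python
from urllib.parse import urlparse

def parse_ozone_url(url):
    """Split an Ozone URL of the form http://host/volume/bucket/key
    into its three components (illustrative helper, not an Ozone API)."""
    path = urlparse(url).path.lstrip("/")
    volume, bucket, key = path.split("/", 2)
    return volume, bucket, key

print(parse_ozone_url("http://hostname/myvolume/mybucket/mykey"))
# → ('myvolume', 'mybucket', 'mykey')
```

An S3-style URL differs only in omitting the leading volume segment.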
9. Definitions
• Storage Volume
– A notion similar to an account
– Allows admin controls on usage of the object store, e.g. storage quotas
– Different from an account because HDFS has no user management
– In private clouds often a ‘user’ is managed outside the cluster
– Created and managed by admins only
• Bucket
– Consists of keys and objects
– Similar to a bucket in S3 or a container in Azure
– ACLs
10. Definitions
• Key
– Unique in a bucket.
• Object
– Values in a bucket
– Each corresponds to a unique key within a bucket
11. REST API
• POST – Creates Volumes and Buckets
– Only Admin creates volumes
– Buckets can be created by the owner of the volume
• PUT – Updates Volumes and Buckets
– Only admin can change some volume settings
– Buckets have ACLs
• GET
– Lists Volumes
– Lists Buckets
12. REST API
• DELETE
– Delete Volumes
– Delete Buckets
• Keys
– PUT : Creates Keys
– GET : Gets the data back
– Streaming read and write
– DELETE : Removes the Key
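The verb mapping on these two slides can be sketched as a small helper; the function and argument names are invented for illustration and are not part of any real Ozone client:

```python
def ozone_request(op, volume, bucket=None, key=None):
    """Map an Ozone operation to the HTTP verb and URL path it would use,
    per the REST scheme above. Illustrative only, not the actual client API."""
    path = "/" + "/".join(p for p in (volume, bucket, key) if p)
    if key is not None:
        # Keys: PUT creates, GET streams the data back, DELETE removes.
        verbs = {"create": "PUT", "read": "GET", "delete": "DELETE"}
    else:
        # Volumes and buckets: POST creates, PUT updates, GET lists, DELETE removes.
        verbs = {"create": "POST", "update": "PUT", "read": "GET", "delete": "DELETE"}
    return verbs[op], path

print(ozone_request("create", "myvolume", "mybucket", "mykey"))
# → ('PUT', '/myvolume/mybucket/mykey')
```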
13. Storing Buckets
• Buckets can grow to millions of objects and several terabytes
– Don’t fit in a single node
– Split into partitions or shards
• Bucket partitions and metadata are distributed and replicated
• Storage Container
– Store multiple objects
– The unit of replication
– Consistent Replicas
14. Ozone in HDFS
Where does it fit?
15. HDFS Federation Extended
[Diagram: DataNodes (DN 1 … DN m) provide common block storage divided into block pools (Pool 1 … Pool n); HDFS namespaces manage their own block pools, and Ozone manages a separate block pool of its own.]
16. Impact on HDFS
• Ozone will reuse the DN storage
– Ozone uses its own block pools so that HDFS and Ozone can share DataNodes
• Ozone will reuse Block Pool Management part of the namenode
– Includes heartbeats, block reports
• Storage Container abstraction is added to DNs
– Co-exists with HDFS blocks on the DNs
– New data pipeline
17. HDFS Scalability
• Scalability of the File System
– Support a billion files
– Namespace scalability
– Block-space scalability
• Namespace scalability is independent of Ozone
– Partial namespace on disk
– Parallel Effort (HDFS-8286)
• Block-space scalability
– Block space constitutes a big part of namenode metadata
– Block map on disk doesn’t work
– We hope to reuse Ozone’s lessons on storing many small objects in a storage container to allow multiple blocks per storage container
18. Architecture
19. How it works
• URL
– http://hostname/myvolume/mybucket/mykey
• Simple Steps
– Full bucket name : ‘myvolume/mybucket’
– Find where bucket metadata is stored
– Fetch bucket metadata
– Check ACLs
– Find where the key is stored
– Read the data
20. How it works
• All data and metadata are stored in Storage Containers
– Each storage container is identified by a unique id (Think of a block id in HDFS)
– A bucket name is mapped to a container id
– A key is mapped to a container id
• Container Id is mapped to Datanodes
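The two-level lookup described above can be sketched with plain dicts standing in for the Storage Container Manager’s state; all names and sample values here are hypothetical:

```python
# Hypothetical state: bucket name -> metadata container id,
# (bucket, key) -> data container id, container id -> DataNode replicas.
bucket_to_container = {"myvolume/mybucket": 101}
key_to_container = {("myvolume/mybucket", "mykey"): 205}
container_locations = {101: ["dn1", "dn2", "dn3"], 205: ["dn2", "dn4", "dn5"]}

def locate_key(volume, bucket, key):
    """Resolve a key to the DataNodes holding its container (illustrative)."""
    full_bucket = f"{volume}/{bucket}"
    meta_container = bucket_to_container[full_bucket]      # find bucket metadata
    data_container = key_to_container[(full_bucket, key)]  # key -> container id
    return container_locations[data_container]             # container id -> DataNodes

print(locate_key("myvolume", "mybucket", "mykey"))  # → ['dn2', 'dn4', 'dn5']
```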
21. Components
[Diagram: three DataNodes, each hosting an Ozone Handler, all coordinating with a central Storage Container Manager.]
22. New Components
• Storage Container Manager
– Maintains locations of each container (Container Map)
– Collects heartbeats and container reports from data-nodes
– Serves the location of container upon request
– Stores key partitioning metadata
• Ozone Handler
– A module hosted by Datanodes
– Implements Ozone REST API
– Connects to Storage Container Manager for key partitioning and container lookup
– Connects to local or remote Datanodes to read/write from/to containers
– Enforces authorization checks and administrative limits
23. Call Flow
[Diagram: a client issues a REST call to an Ozone Handler on one DataNode; the handler consults the Storage Container Manager and reads the metadata container.]
24. Call Flow (contd.)
[Diagram: the Ozone Handler redirects the client to the DataNode holding the data container, from which the data is read.]
25. Implementation
26. Mapping a Key to a Container
• Keys need to be mapped to Container IDs
– Horizontal partitioning of the key space
• Partition function
– Hash Partitioning
– Minimal state to be stored
– Better distribution, no hotspots
– Range Partitioning
– Sorted keys
– Provides ordered listing
27. Hash Partitioning
• Key is hashed
– The hash value is mapped to a container id
• Prefix matching
– The container id is the longest matching prefix of the key
– Storage Container Manager implements a prefix tree
• Need extendible hashing
– Minimizes the number of keys to be re-hashed when a new container is added
– New containers are added by splitting an existing container
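A minimal sketch of the scheme above, assuming a dict from bit-prefix strings to container ids in place of the Storage Container Manager’s trie: the container is the longest matching bit-prefix of the key’s hash, and splitting a container extends its prefix by one bit. All class and helper names are illustrative, not Ozone’s actual structures.

```python
import hashlib

def key_bits(key, n=16):
    """First n bits of a stable hash of the key (illustrative choice of hash)."""
    h = int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")
    return format(h, "032b")[:n]

class PrefixIndex:
    def __init__(self):
        self.prefixes = {"": 0}  # start with one container covering everything
        self.next_id = 1

    def lookup(self, key):
        bits = key_bits(key)
        # Longest matching prefix wins.
        best = max((p for p in self.prefixes if bits.startswith(p)), key=len)
        return self.prefixes[best]

    def split(self, prefix):
        # Replace one container by two, extending the prefix by one bit;
        # only keys whose next bit is 1 move to the new container.
        cid = self.prefixes.pop(prefix)
        self.prefixes[prefix + "0"] = cid
        self.prefixes[prefix + "1"] = self.next_id
        self.next_id += 1

idx = PrefixIndex()
idx.split("")  # first split: containers for prefixes "0" and "1"
print(idx.lookup("mykey"), sorted(idx.prefixes))
```

Because the prefixes always partition the hash space, every key resolves to exactly one container, and a split re-homes only the keys under the split prefix.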
28. Prefix Matching for Hashes
[Diagram: a bitwise trie for bucket 0xab; internal trie nodes branch on bits 0 and 1, with containers (0xab000 … 0xab005) at the leaves. A key hash such as 0xab125 is resolved by walking the trie bit by bit.]
• The Storage Container Manager stores one trie for each bucket.
• The containers are at the leaves.
• Size = Θ(#containers)
29. Range Partitioning
• Range Partitioning
– The container map maintains a range index tree for each bucket.
– Each node of the tree corresponds to a key range
– Children nodes split the range of their parent nodes
– The lookup is performed by traversing down the tree to more granular ranges for a key until we reach a leaf
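The descent described above can be sketched as follows; the node class, the sample ranges, and the container ids are illustrative (keys are zero-padded so plain string comparison matches numeric order):

```python
class RangeNode:
    """One node of a range index tree: a key range, plus either child
    ranges that refine it or, at a leaf, a container id."""
    def __init__(self, lo, hi, container=None, children=()):
        self.lo, self.hi = lo, hi
        self.container = container  # set only at leaves
        self.children = list(children)

def lookup(node, key):
    assert node.lo <= key <= node.hi, "key outside the bucket's range"
    while node.children:  # descend to more granular ranges
        node = next(c for c in node.children if c.lo <= key <= c.hi)
    return node.container

tree = RangeNode("K01", "K20", children=[
    RangeNode("K01", "K10", container="0xab000"),
    RangeNode("K11", "K20", children=[
        RangeNode("K11", "K15", container="0xab003"),
        RangeNode("K16", "K20", container="0xab005"),
    ]),
])
print(lookup(tree, "K15"))  # → 0xab003
```

Unlike hash partitioning, the sorted ranges make ordered key listing cheap.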
30. Range Index Tree
[Diagram: a range index tree for bucket 0xab; the root covers K1–K20, split into K1–K10 and K11–K20, which are refined further (e.g. K11–K13, K14–K15), with containers (0xab000 … 0xab005) at the leaves. A key such as K15 is resolved by descending the tree.]
• The storage container map consists of one such tree for each bucket.
• The containers are at the leaves.
• Size = Θ(#containers)
31. Storage Container
• A storage unit in the datanode
– Generalization of the HDFS Block
– Id, Generation Stamp, Size
– Unit of replication
– Consistent replicas
• Container size
– 1G - 10G
– Container size affects the scale of Storage Container Manager
– Large containers take longer to replicate than an individual block
– A system property and not a data property
32. Storage Container Requirements
• Stores a variety of data, which results in different requirements
• Metadata
– Individual units of data are very small - kilobytes.
– An atomic update is important.
– get/put API is sufficient.
• Object Data
– The storage container needs to store object data with a wide range of sizes
– Must support streaming APIs to read/write individual objects
33. Storage Container Implementation
• Storage container prototype using RocksDB
– An embeddable key-value store
• Replication
– Need ability to replicate while data is being written
– RocksDB supports snapshots and incremental backups for replication
• A hybrid use of RocksDB
– Small objects : Keys and Objects stored in RocksDB
– Large objects : Object stored in an individual file, RocksDB contains keys and file path
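The hybrid layout above can be sketched with a dict standing in for RocksDB; the class name, the 1 MB cutoff, and the file naming are all made up for illustration:

```python
import os
import tempfile

SMALL_LIMIT = 1 << 20  # 1 MB, a hypothetical small/large cutoff

class HybridContainer:
    """Toy sketch: small objects live in the key-value store directly,
    large ones go to a separate file with only the path recorded."""
    def __init__(self, directory):
        self.db = {}  # stands in for RocksDB
        self.directory = directory

    def put(self, key, value: bytes):
        if len(value) <= SMALL_LIMIT:
            self.db[key] = ("inline", value)
        else:
            path = os.path.join(self.directory, key)
            with open(path, "wb") as f:
                f.write(value)
            self.db[key] = ("file", path)  # key-value store keeps key -> file path

    def get(self, key) -> bytes:
        kind, payload = self.db[key]
        if kind == "inline":
            return payload
        with open(payload, "rb") as f:
            return f.read()

c = HybridContainer(tempfile.mkdtemp())
c.put("small", b"x" * 10)
c.put("large", b"y" * (SMALL_LIMIT + 1))
print(c.db["small"][0], c.db["large"][0])  # → inline file
```

This keeps the key-value store compact while still letting large objects stream from ordinary files.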
34. Storage Container Implementation
• Transactions for consistency and reliability
– The storage containers implement a few atomic and persistent operations, i.e. transactions. The container provides reliability guarantees for these operations.
– Commit : This operation promotes an object being written to a finalized object. Once it succeeds, the container guarantees that the object is available for reading.
– Put : This operation is useful for small writes such as metadata writes.
– Delete : deletes the object
• A new data pipeline for storage containers
35. Data Pipeline Consistency
• HDFS Consistency Mechanism uses two pieces of block state
– Generation Stamp
– Block length
• Storage containers use the following two
– Generation stamp
– Transaction id
• A storage container must persist the last executed transaction.
• The transaction id is allocated by the leader of the pipeline.
36. Data Pipeline Consistency
• Upon a restart, a DataNode discards all uncommitted data for a storage container
– State synchronized to last committed transaction
• When comparing two replicas
– The replica with the latest generation stamp is honored
– If the generation stamps are equal, the replica with the latest transaction id is honored
– Correctness argument: replicas with the same generation stamp and the same transaction id must have been together in the same pipeline
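The comparison rule above is just a lexicographic ordering on (generation stamp, transaction id); a minimal sketch, with field names invented for illustration:

```python
from collections import namedtuple

Replica = namedtuple("Replica", ["node", "gen_stamp", "txn_id"])

def best_replica(replicas):
    # Latest generation stamp wins; transaction id breaks ties.
    return max(replicas, key=lambda r: (r.gen_stamp, r.txn_id))

replicas = [
    Replica("dn1", gen_stamp=7, txn_id=41),
    Replica("dn2", gen_stamp=7, txn_id=43),
    Replica("dn3", gen_stamp=6, txn_id=99),  # stale stamp loses despite higher txn_id
]
print(best_replica(replicas).node)  # → dn2
```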
37. Phased Development
• Phase 1
– Basic API
– Storage container machinery, reliability, replication.
• Phase 2
– High availability
– Security
– Multipart upload
• Phase 3
– Caching to improve latency.
– Object versioning
– Cross geo replication.
38. Team
• Anu Engineer
– aengineer@hortonworks.com
• Arpit Agarwal
– aagarwal@hortonworks.com
• Chris Nauroth
– cnauroth@hortonworks.com
• Jitendra Pandey
– jitendra@hortonworks.com
39. Special Thanks
• Sanjay Radia
• Enis Soztutar
• Suresh Srinivas
Editor's Notes
HDFS as a storage system
Great file system
Works great for MapReduce
Great adoption in enterprises
Example : Need to store all my customer documents
A few million customers each with a few thousand documents
Don’t need a directory structure
Need REST API as the primary access mechanism
Simple access semantics
Very large scale (billions of documents)
Wide range of object sizes
A file system forces you to think in terms of files and directories.
Two important questions
What is the partitioning scheme?
How does a storage container look like?