© Hortonworks Inc. 2011 - 2015
Ozone: An Object Store in HDFS
Jitendra Nath Pandey
jitendra@hortonworks.com
jitendra@apache.org
@jnathp
Page 1
About me
• Engineering Manager @Hortonworks
– Manager / Architect for HDFS at Hortonworks
• ASF Member
– PMC Member at Apache Hadoop
– PMC Member at Apache Ambari
– Committer in Apache Hive
Outline
• Introduction
• How Ozone fits in HDFS
• Ozone architecture
• Notes on implementation
• Q & A
Introduction
Storage in Hadoop Ecosystem
• File system
– HDFS
• SQL Database
– Hive on HDFS
• NoSQL
– HBase on HDFS
• Object Store
– We need Ozone!
Object Store vs File System
• Object stores offer a lot more scale
– Trillions of objects are common
– Simpler semantics make this possible
• Wide range of object sizes
– A few KB to several GB
Ozone: Introduction
• Ozone: An object store in Hadoop
– Durable
– Reliable
– Highly Scalable
– Trillions of objects
– Wide range of object sizes
– Secure
– Highly Available
– REST API as the primary access interface
Ozone: Introduction
• An Ozone URL
– http://hostname/myvolume/mybucket/mykey
• An S3 URL
– http://hostname/mybucket/mykey
• A Windows Azure URL
– http://hostname/myaccount/mybucket/mykey
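The volume/bucket/key layout above can be sketched in a few lines. This is purely illustrative, not part of any Ozone client library; it just splits the URL path from the slide:

```python
from urllib.parse import urlparse

def parse_ozone_url(url):
    """Split an Ozone-style URL into (volume, bucket, key).

    Illustrative only: the /volume/bucket/key layout is taken from
    the slide above; real client libraries may differ.
    """
    path = urlparse(url).path.lstrip("/")
    volume, bucket, key = path.split("/", 2)
    return volume, bucket, key

# e.g. parse_ozone_url("http://hostname/myvolume/mybucket/mykey")
```

Note the extra leading component relative to S3: Ozone, like Azure, carries an account-like segment (the volume) before the bucket.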
Definitions
• Storage Volume
– A notion similar to an account
– Allows admin controls on usage of the object store, e.g. storage quotas
– Different from an account, because HDFS has no user management
– In private clouds, a ‘user’ is often managed outside the cluster
– Created and managed by admins only
• Bucket
– Consists of keys and objects
– Similar to a bucket in S3 or a container in Azure
– ACLs
Definitions
• Key
– Unique in a bucket.
• Object
– Values in a bucket
– Each corresponds to a unique key within a bucket
REST API
• POST – Creates Volumes and Buckets
– Only admins create volumes
– Buckets can be created by the owner of the volume
• PUT – Updates Volumes and Buckets
– Only admin can change some volume settings
– Buckets have ACLs
• GET
– Lists Volumes
– Lists Buckets
REST API
• DELETE
– Delete Volumes
– Delete Buckets
• Keys
– PUT : Creates Keys
– GET : Gets the data back
– Streaming read and write
– DELETE : Removes the Key
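The verb-to-operation mapping on these two slides can be summarized as a small helper. This is a sketch only: it builds (method, path) pairs following the volume/bucket/key layout and does not send anything; the operation names are assumptions, not Ozone API names:

```python
def rest_request(op, volume, bucket=None, key=None):
    """Map an Ozone-style operation onto an HTTP verb and path.

    Hypothetical helper: op names ("create", "write", ...) are made up
    for illustration; only the verb/path pairing follows the slides
    (POST creates volumes/buckets, PUT writes keys, DELETE removes).
    """
    path = "/" + "/".join(p for p in (volume, bucket, key) if p)
    verbs = {"create": "POST", "update": "PUT", "list": "GET",
             "read": "GET", "write": "PUT", "delete": "DELETE"}
    return verbs[op], path

# e.g. creating a bucket, then writing a key into it:
#   rest_request("create", "myvolume", "mybucket")
#   rest_request("write", "myvolume", "mybucket", "mykey")
```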
Storing Buckets
• Buckets grow to millions of objects and several terabytes
– Don’t fit in a single node
– Split into partitions or shards
• Bucket partitions and metadata are distributed and replicated
• Storage Container
– Store multiple objects
– The unit of replication
– Consistent Replicas
Ozone in HDFS
Where does it fit?
HDFS Federation Extended
[Diagram: Datanodes DN 1 … DN m provide common block storage shared by multiple block pools (Pool 1 … Pool n); HDFS namespaces manage their own block pools, and Ozone manages a separate block pool of its own.]
Impact on HDFS
• Ozone will reuse the DN storage
– Ozone uses its own block pools, so both HDFS and Ozone can share DNs
• Ozone will reuse Block Pool Management part of the namenode
– Includes heartbeats, block reports
• Storage Container abstraction is added to DNs
– Co-exists with HDFS blocks on the DNs
– New data pipeline
HDFS Scalability
• Scalability of the File System
– Support a billion files
– Namespace scalability
– Block-space scalability
• Namespace scalability is independent of Ozone
– Partial namespace on disk
– Parallel Effort (HDFS-8286)
• Block-space scalability
– Block space constitutes a big part of namenode metadata
– Block map on disk doesn’t work
– We hope to reuse some of the lessons of Ozone’s “many small objects in a storage container” to allow multiple blocks in a storage container
Architecture
How it works
• URL
– http://hostname/myvolume/mybucket/mykey
• Simple Steps
– Full bucket name : ‘myvolume/mybucket’
– Find where bucket metadata is stored
– Fetch bucket metadata
– Check ACLs
– Find where the key is stored
– Read the data
How it works
• All data and metadata are stored in Storage Containers
– Each storage container is identified by a unique id (like a block id in HDFS)
– A bucket name is mapped to a container id
– A key is mapped to a container id
• Container Id is mapped to Datanodes
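The two-level mapping just described can be sketched with in-memory dictionaries standing in for the Storage Container Manager's maps. Every name and id below is made up for illustration:

```python
# Hypothetical stand-ins for the Storage Container Manager's state:
# a key partitioning map and a container location map.
key_to_container = {("myvolume/mybucket", "mykey"): "0xab003"}
container_to_datanodes = {"0xab003": ["dn1", "dn2", "dn3"]}

def locate(bucket, key):
    """Resolve a (bucket, key) pair to its container and datanodes.

    Step 1: map the key to a container id (key partitioning).
    Step 2: map the container id to the datanodes holding replicas.
    """
    container_id = key_to_container[(bucket, key)]
    return container_id, container_to_datanodes[container_id]
```

The indirection mirrors HDFS blocks: clients never address datanodes directly, only containers, so containers can be re-replicated or moved without rewriting key metadata.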
Components
[Diagram: a Storage Container Manager coordinating several Datanodes, each hosting an Ozone Handler.]
New Components
• Storage Container Manager
– Maintains locations of each container (Container Map)
– Collects heartbeats and container reports from datanodes
– Serves container locations upon request
– Stores key partitioning metadata
• Ozone Handler
– A module hosted by Datanodes
– Implements Ozone REST API
– Connects to Storage Container Manager for key partitioning and container lookup
– Connects to local or remote Datanodes to read/write from/to containers
– Enforces authorization checks and administrative limits
Call Flow
[Diagram: a client issues a REST call to an Ozone Handler on a Datanode; the handler consults the Storage Container Manager and reads the metadata container.]
Call Flow (continued)
[Diagram: the Ozone Handler redirects the client to the Datanode holding the data container, where the data is read.]
Implementation
Mapping a Key to a Container
• Keys need to be mapped to Container IDs
– Horizontal partitioning of the key space
• Partition function options
– Hash partitioning: minimal state to be stored; better distribution, no hotspots
– Range partitioning: keys stay sorted, which provides ordered listing
Hash Partitioning
• Key is hashed
– The hash value is mapped to the container id
• Prefix matching
– The container id is the longest matching prefix of the key
– Storage Container Manager implements a prefix tree
• Need extendible hashing
– Minimize the number of keys to be re-hashed when a new container is added
– New containers are added by splitting an existing container
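A toy version of this scheme is sketched below. It assumes a SHA-256 hash and keeps a flat table of bit prefixes rather than a real trie; both choices are illustrative, not Ozone's:

```python
import hashlib

def hash_bits(key, n=16):
    """First n bits of the key's hash as a bit string.

    The hash function (SHA-256 here) is an assumption; the slides
    only say the key is hashed.
    """
    digest = hashlib.sha256(key.encode()).digest()
    bits = "".join(f"{b:08b}" for b in digest)
    return bits[:n]

class PrefixTable:
    """Toy extendible-hash lookup: container ids keyed by bit prefixes."""

    def __init__(self):
        self.containers = {"": "c0"}  # one container covers the whole hash space

    def lookup(self, key):
        bits = hash_bits(key)
        # the container id is the longest matching prefix of the key's hash
        best = max((p for p in self.containers if bits.startswith(p)), key=len)
        return self.containers[best]

    def split(self, prefix, new_ids):
        """Split one container into two by extending its prefix one bit.

        Only keys under this prefix ever need re-homing, which is the
        extendible-hashing property the slide asks for.
        """
        del self.containers[prefix]
        self.containers[prefix + "0"], self.containers[prefix + "1"] = new_ids
```

Splitting `""` into `"0"`/`"1"` leaves every other mapping untouched, so growth is incremental.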
Prefix Matching for Hashes
[Diagram: a bitwise trie for bucket 0xab. Trie nodes branch on successive hash bits (0/1), with containers such as 0xab000 – 0xab005 at the leaves; a key hashing to 0xab125 is routed down the trie to its container.]
• The Storage Container Manager stores one tree for each bucket.
• The containers are at the leaves.
• Size = Θ(#containers)
Range Partitioning
• Range Partitioning
– The container map maintains a range index tree for each bucket.
– Each node of the tree corresponds to a key range
– Children nodes split the range of their parent nodes
– The lookup is performed by traversing down the tree to more granular ranges for a
key until we reach a leaf
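The leaf level of such a tree can be flattened into a sorted list and searched with binary search, which is sketched below. The ranges and container ids are invented, and the key labels are zero-padded (K01 rather than K1) so that lexicographic order matches numeric order; the real structure is a tree of nested ranges, not a flat list:

```python
import bisect

# Hypothetical leaf ranges for one bucket, sorted by range start:
# each (start_key, end_key, container_id) covers start <= key <= end.
ranges = [("K01", "K05", "0xab000"), ("K06", "K10", "0xab002"),
          ("K11", "K13", "0xab001"), ("K14", "K15", "0xab005"),
          ("K16", "K20", "0xab003")]

def lookup(key):
    """Find the container whose leaf range contains the key."""
    starts = [r[0] for r in ranges]
    i = bisect.bisect_right(starts, key) - 1  # rightmost range starting <= key
    start, end, container_id = ranges[i]
    assert start <= key <= end, "key falls in a gap between ranges"
    return container_id
```

Unlike hash partitioning, adjacent keys land in the same or neighboring containers, which is what makes ordered listing cheap.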
Range Index Tree
[Diagram: a range index tree for bucket 0xab covering keys K1 – K20. Internal nodes split the range (K1 – K10 and K11 – K20, then K11 – K13, K14 – K15, K16 – K20), with containers 0xab000 – 0xab005 at the leaves; a lookup for key K15 descends K1 – K20, K11 – K20, K14 – K15 down to its container.]
• The Storage Container map consists of one such tree per bucket.
• The containers are at the leaves.
• Size = Θ(#containers)
Storage Container
• A storage unit in the datanode
– Generalization of the HDFS Block
– Id, Generation Stamp, Size
– Unit of replication
– Consistent replicas
• Container size
– 1 GB to 10 GB
– Container size affects the scale of the Storage Container Manager
– Larger containers take longer to replicate
– A system property, not a data property
Storage Container Requirements
• Stores a variety of data, which results in different requirements
• Metadata
– Individual units of data are very small - kilobytes.
– An atomic update is important.
– get/put API is sufficient.
• Object Data
– The storage container needs to store object data with a wide range of sizes
– Must support streaming APIs to read/write individual objects
Storage Container Implementation
• Storage container prototype using RocksDB
– An embeddable key-value store
• Replication
– Need ability to replicate while data is being written
– RocksDB supports snapshots and incremental backups for replication
• A hybrid use of RocksDB
– Small objects: keys and objects stored in RocksDB
– Large objects: object stored in an individual file; RocksDB maps the key to the file path
Storage Container Implementation
• Transactions for consistency and reliability
– The storage containers implement a few atomic and persistent operations i.e.
transactions. The container provides reliability guarantees for these
operations.
– Commit : This operation promotes an object being written to a finalized object.
Once this operation succeeds, the container guarantees that the object is
available for reading.
– Put : This operation is useful for small writes such as metadata writes.
– Delete : deletes the object
• A new data pipeline for storage containers
Data Pipeline Consistency
• HDFS Consistency Mechanism uses two pieces of block state
– Generation Stamp
– Block length
• Storage containers use the following two
– Generation stamp
– Transaction id
• Storage Container must persist last executed transaction.
• Transaction id is allocated by leader of the pipeline.
Data Pipeline Consistency
• Upon a restart, a datanode discards all uncommitted data for a storage container
– State is synchronized to the last committed transaction
• When comparing two replicas
– The replica with the latest generation stamp is honored
– With the same generation stamp, the replica with the latest transaction id is honored
– Correctness argument: replicas with the same generation stamp and the same transaction id must have been written in the same pipeline
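The comparison rule above amounts to ordering replicas by the pair (generation stamp, transaction id). A minimal sketch, with field names invented for illustration:

```python
def best_replica(replicas):
    """Pick the authoritative replica: the latest generation stamp wins,
    and the transaction id breaks ties. Field names ("gen_stamp",
    "txn_id") are illustrative, not actual Ozone structures."""
    return max(replicas, key=lambda r: (r["gen_stamp"], r["txn_id"]))
```

Python's tuple comparison gives exactly the two-level rule: generation stamp first, transaction id only on ties.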
Phased Development
• Phase 1
– Basic API
– Storage container machinery, reliability, replication.
• Phase 2
– High availability
– Security
– Multipart upload
• Phase 3
– Caching to improve latency.
– Object versioning
– Cross geo replication.
Team
• Anu Engineer
– aengineer@hortonworks.com
• Arpit Agarwal
– aagarwal@hortonworks.com
• Chris Nauroth
– cnauroth@hortonworks.com
• Jitendra Pandey
– jitendra@hortonworks.com
Special Thanks
• Sanjay Radia
• Enis Soztutar
• Suresh Srinivas
Thanks!

More Related Content

What's hot

Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
DataWorks Summit
 
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! ScaleHive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
DataWorks Summit
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Simplilearn
 
Zero-Copy Event-Driven Servers with Netty
Zero-Copy Event-Driven Servers with NettyZero-Copy Event-Driven Servers with Netty
Zero-Copy Event-Driven Servers with Netty
Daniel Bimschas
 

What's hot (20)

Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Sqoop
SqoopSqoop
Sqoop
 
Introduction to sqoop
Introduction to sqoopIntroduction to sqoop
Introduction to sqoop
 
Hive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmarkHive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmark
 
Hadoop Meetup Jan 2019 - Overview of Ozone
Hadoop Meetup Jan 2019 - Overview of OzoneHadoop Meetup Jan 2019 - Overview of Ozone
Hadoop Meetup Jan 2019 - Overview of Ozone
 
Sqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionSqoop on Spark for Data Ingestion
Sqoop on Spark for Data Ingestion
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! ScaleHive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
 
File Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & ParquetFile Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & Parquet
 
Hive partitioning best practices
Hive partitioning  best practicesHive partitioning  best practices
Hive partitioning best practices
 
Apache NiFi Crash Course Intro
Apache NiFi Crash Course IntroApache NiFi Crash Course Intro
Apache NiFi Crash Course Intro
 
Introduction to Apache NiFi 1.11.4
Introduction to Apache NiFi 1.11.4Introduction to Apache NiFi 1.11.4
Introduction to Apache NiFi 1.11.4
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
 
Apache Nifi Crash Course
Apache Nifi Crash CourseApache Nifi Crash Course
Apache Nifi Crash Course
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Hive: Loading Data
Hive: Loading DataHive: Loading Data
Hive: Loading Data
 
Zero-Copy Event-Driven Servers with Netty
Zero-Copy Event-Driven Servers with NettyZero-Copy Event-Driven Servers with Netty
Zero-Copy Event-Driven Servers with Netty
 
NiFi Developer Guide
NiFi Developer GuideNiFi Developer Guide
NiFi Developer Guide
 
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
 

Similar to Ozone: An Object Store in HDFS

HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
enissoz
 

Similar to Ozone: An Object Store in HDFS (20)

Ozone and HDFS’s evolution
Ozone and HDFS’s evolutionOzone and HDFS’s evolution
Ozone and HDFS’s evolution
 
Ozone and HDFS's Evolution
Ozone and HDFS's EvolutionOzone and HDFS's Evolution
Ozone and HDFS's Evolution
 
Ozone: scaling HDFS to trillions of objects
Ozone: scaling HDFS to trillions of objectsOzone: scaling HDFS to trillions of objects
Ozone: scaling HDFS to trillions of objects
 
Evolving HDFS to Generalized Storage Subsystem
Evolving HDFS to Generalized Storage SubsystemEvolving HDFS to Generalized Storage Subsystem
Evolving HDFS to Generalized Storage Subsystem
 
Evolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage SubsystemEvolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage Subsystem
 
Evolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage SubsystemEvolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage Subsystem
 
Moving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloudMoving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloud
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 
Hadoop & cloud storage object store integration in production (final)
Hadoop & cloud storage  object store integration in production (final)Hadoop & cloud storage  object store integration in production (final)
Hadoop & cloud storage object store integration in production (final)
 
Hadoop Present - Open Enterprise Hadoop
Hadoop Present - Open Enterprise HadoopHadoop Present - Open Enterprise Hadoop
Hadoop Present - Open Enterprise Hadoop
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
 
Big data spain keynote nov 2016
Big data spain keynote nov 2016Big data spain keynote nov 2016
Big data spain keynote nov 2016
 
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
 
SWIB14 Weaving repository contents into the Semantic Web
SWIB14 Weaving repository contents into the Semantic WebSWIB14 Weaving repository contents into the Semantic Web
SWIB14 Weaving repository contents into the Semantic Web
 
HBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBase
HBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBaseHBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBase
HBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBase
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
 
Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...
 

More from DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 

Recently uploaded (20)

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 

Ozone: An Object Store in HDFS

  • 1. © Hortonworks Inc. 2011 - 2015 Ozone: An Object Store in HDFS Jitendra Nath Pandey jitendra@hortonworks.com jitendra@apache.org @jnathp Page 1
  • 2. © Hortonworks Inc. 2011 - 2015 About me • Engineering Manager @Hortonworks – Manager / Architect for HDFS at Hortonworks • ASF Member – PMC Member at Apache Hadoop – PMC Member at Apache Ambari – Committer in Apache Hive Page 2Architecting the Future of Big Data
  • 3. © Hortonworks Inc. 2011 - 2015 Outline • Introduction • How ozone fits in HDFS • Ozone architecture • Notes on implementation • Q & A Page 3Architecting the Future of Big Data
  • 4. © Hortonworks Inc. 2011 - 2015 Introduction Architecting the Future of Big Data Page 4
  • 5. © Hortonworks Inc. 2011 - 2015 Storage in Hadoop Ecosystem • File system – The HDFS • SQL Database – Hive on HDFS • NoSQL – Hbase on HDFS • Object Store – We need Ozone! Page 5Architecting the Future of Big Data
  • 6. Object Store vs File System
       • Object stores offer a lot more scale
         – Trillions of objects is common
         – Simpler semantics make it possible
       • Wide range of object sizes
         – A few KB to several GB
  • 7. Ozone: Introduction
       • Ozone: an object store in Hadoop
         – Durable
         – Reliable
         – Highly scalable
         – Trillions of objects
         – Wide range of object sizes
         – Secure
         – Highly available
         – REST API as the primary access interface
  • 8. Ozone Introduction
       • An Ozone URL
         – http://hostname/myvolume/mybucket/mykey
       • An S3 URL
         – http://hostname/mybucket/mykey
       • A Windows Azure URL
         – http://hostname/myaccount/mybucket/mykey
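The three-level volume/bucket/key layout can be seen by splitting the URL path. This is a minimal illustrative sketch of the naming scheme, not Ozone's own parser:

```python
from urllib.parse import urlparse

def parse_ozone_url(url):
    """Split an Ozone-style URL into (volume, bucket, key).

    The volume/bucket/key layout is taken from the slide; this helper
    itself is a hypothetical illustration, not Ozone code.
    """
    path = urlparse(url).path.strip("/")
    volume, bucket, key = path.split("/", 2)
    return volume, bucket, key

print(parse_ozone_url("http://hostname/myvolume/mybucket/mykey"))
# ('myvolume', 'mybucket', 'mykey')
```

Unlike S3's two-level bucket/key layout, Ozone adds a leading volume component, which is where admin-level controls such as quotas attach.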
  • 9. Definitions
       • Storage Volume
         – A notion similar to an account
         – Allows admin controls on usage of the object store, e.g. storage quota
         – Different from an account because there is no user management in HDFS
         – In private clouds a ‘user’ is often managed outside the cluster
         – Created and managed by admins only
       • Bucket
         – Consists of keys and objects
         – Similar to a bucket in S3 or a container in Azure
         – ACLs
  • 10. Definitions
       • Key
         – Unique within a bucket
       • Object
         – A value in a bucket
         – Each corresponds to a unique key within a bucket
  • 11. REST API
       • POST
         – Creates volumes and buckets
         – Only admins create volumes
         – A bucket can be created by the owner of the volume
       • PUT
         – Updates volumes and buckets
         – Only admins can change some volume settings
         – Buckets have ACLs
       • GET
         – Lists volumes
         – Lists buckets
  • 12. REST API
       • DELETE
         – Deletes volumes
         – Deletes buckets
       • Keys
         – PUT: creates keys
         – GET: gets the data back
         – Streaming read and write
         – DELETE: removes the key
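The verb-to-resource mapping on these two slides can be summarized in a small sketch. The paths follow the volume/bucket/key scheme from the deck, but the operation names and exact wire protocol here are assumptions, not Ozone's actual API surface:

```python
def ozone_request(op, volume, bucket=None, key=None):
    """Return a hypothetical (HTTP method, path) pair for an operation."""
    path = "/" + "/".join(p for p in (volume, bucket, key) if p)
    if key is not None:
        # Keys: PUT creates, GET streams the data back, DELETE removes.
        method = {"put": "PUT", "get": "GET", "delete": "DELETE"}[op]
    else:
        # Volumes and buckets: POST creates, PUT updates, GET lists.
        method = {"create": "POST", "update": "PUT",
                  "list": "GET", "delete": "DELETE"}[op]
    return method, path

print(ozone_request("create", "myvolume", "mybucket"))  # ('POST', '/myvolume/mybucket')
print(ozone_request("put", "myvolume", "mybucket", "mykey"))
```

Note the asymmetry the slides call out: volumes and buckets are created with POST, while keys are created with PUT.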
  • 13. Storing Buckets
       • Buckets grow to millions of objects and several terabytes
         – Don’t fit on a single node
         – Split into partitions or shards
       • Bucket partitions and metadata are distributed and replicated
       • Storage Container
         – Stores multiple objects
         – The unit of replication
         – Consistent replicas
  • 14. Ozone in HDFS: Where does it fit?
  • 15. HDFS Federation Extended
       [Diagram: Datanodes DN 1 … DN m provide common storage for multiple
       block pools (Pool 1 … Pool n); HDFS namespaces and Ozone each manage
       their own block pools.]
  • 16. Impact on HDFS
       • Ozone will reuse the DN storage
         – Uses its own block pools so that HDFS and Ozone can share DNs
       • Ozone will reuse the Block Pool Management part of the namenode
         – Includes heartbeats and block reports
       • The Storage Container abstraction is added to DNs
         – Co-exists with HDFS blocks on the DNs
         – New data pipeline
  • 17. HDFS Scalability
       • Scalability of the file system
         – Support a billion files
         – Namespace scalability
         – Block-space scalability
       • Namespace scalability is independent of Ozone
         – Partial namespace on disk
         – Parallel effort (HDFS-8286)
       • Block-space scalability
         – Block space constitutes a big part of namenode metadata
         – A block map on disk doesn’t work
         – We hope to reuse the lessons of Ozone’s “many small objects in a storage container” to allow multiple blocks per storage container
  • 18. Architecture
  • 19. How it works
       • URL
         – http://hostname/myvolume/mybucket/mykey
       • Simple steps
         – Full bucket name: ‘myvolume/mybucket’
         – Find where the bucket metadata is stored
         – Fetch the bucket metadata
         – Check ACLs
         – Find where the key is stored
         – Read the data
  • 20. How it works
       • All data and metadata are stored in Storage Containers
         – Each storage container is identified by a unique id (think of a block id in HDFS)
         – A bucket name is mapped to a container id
         – A key is mapped to a container id
       • A container id is mapped to datanodes
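The two-level lookup described above (name → container id → datanodes) can be sketched with in-memory maps. The dicts stand in for the Storage Container Manager's state, and all ids and datanode names here are hypothetical:

```python
# Stand-ins for Storage Container Manager state; contents are made up.
bucket_to_container = {"myvolume/mybucket": 0xAB}
key_to_container = {("myvolume/mybucket", "mykey"): 0xAB125}
container_to_datanodes = {0xAB: ["dn1", "dn2", "dn3"],
                          0xAB125: ["dn2", "dn4", "dn5"]}

def locate_key(volume, bucket, key):
    """Resolve a key to the datanodes holding its container."""
    full_bucket = f"{volume}/{bucket}"
    bucket_to_container[full_bucket]          # fetch bucket metadata (ACL check would go here)
    container = key_to_container[(full_bucket, key)]
    return container_to_datanodes[container]

print(locate_key("myvolume", "mybucket", "mykey"))
# ['dn2', 'dn4', 'dn5']
```

The indirection through container ids is what keeps the Storage Container Manager small: it tracks containers, not individual keys.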
  • 21. Components
       [Diagram: A Storage Container Manager coordinating multiple DNs, each
       hosting an Ozone Handler.]
  • 22. New Components
       • Storage Container Manager
         – Maintains the location of each container (container map)
         – Collects heartbeats and container reports from datanodes
         – Serves the location of a container upon request
         – Stores key-partitioning metadata
       • Ozone Handler
         – A module hosted by datanodes
         – Implements the Ozone REST API
         – Connects to the Storage Container Manager for key partitioning and container lookup
         – Connects to local or remote datanodes to read/write from/to containers
         – Enforces authorization checks and administrative limits
  • 23. Call Flow
       [Diagram: A client issues a REST call to an Ozone Handler on a DN,
       which reads the metadata container located via the Storage Container
       Manager.]
  • 24. Call Flow (continued)
       [Diagram: The Ozone Handler redirects the client to the DN holding
       the data, which serves the read.]
  • 25. Implementation
  • 26. Mapping a Key to a Container
       • Keys need to be mapped to container ids
         – Horizontal partitioning of the key space
       • Partition function
         – Hash partitioning: minimal state to be stored; better distribution, no hotspots
         – Range partitioning: sorted keys; provides ordered listing
  • 27. Hash Partitioning
       • The key is hashed; the hash value is mapped to the container id
       • Prefix matching
         – The container id is the longest matching prefix of the key’s hash
         – The Storage Container Manager implements a prefix tree
       • Need extendible hashing
         – Minimizes the number of keys to be re-hashed when a new container is added
         – New containers are added by splitting an existing container
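The longest-prefix scheme above can be sketched as a toy: container ids are bit-string prefixes of the key's hash, and a container is split by giving its two children one extra bit each, so only that container's keys re-hash. Everything here (hash choice, bit width, container set) is an assumed illustration, not Ozone internals:

```python
import hashlib

def key_bits(key, n=16):
    """First n bits of the key's hash, as a '0'/'1' string."""
    digest = int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")
    return format(digest, "032b")[:n]

# Container "ids" are hash prefixes forming a complete binary partition,
# so every key matches exactly one container.
containers = {"0", "10", "11"}

def lookup(key):
    """Map a key to the container whose prefix matches the longest."""
    matches = [c for c in containers if key_bits(key).startswith(c)]
    return max(matches, key=len)          # longest matching prefix wins

def split(container):
    """Add capacity by splitting one container into two children."""
    containers.remove(container)
    containers.update({container + "0", container + "1"})

print(lookup("mykey"))
```

Splitting "10" into "100" and "101" leaves keys under "0" and "11" untouched, which is exactly the extendible-hashing property the slide asks for.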
  • 28. Prefix Matching for Hashes
       [Diagram: For bucket 0xab, a bitwise trie with containers 0xab000 …
       0xab005 at the leaves; the key 0xab125 is resolved by walking the trie
       bit by bit.]
       • The Storage Container Manager stores one tree for each bucket
       • The containers are at the leaves
       • Size = Θ(#containers)
  • 29. Range Partitioning
       • Range partitioning
         – The container map maintains a range index tree for each bucket
         – Each node of the tree corresponds to a key range
         – Child nodes split the range of their parent node
         – A lookup traverses down the tree to more granular ranges until it reaches a leaf
  • 30. Range Index Tree
       [Diagram: For bucket 0xab, a tree over the range K1 – K20; internal
       nodes split into sub-ranges (K1 – K10, K11 – K20, …) with containers
       0xab000 … 0xab005 at the leaves; the key K15 is found by descending
       into narrower ranges.]
       • The Storage Container map consists of one such tree per bucket
       • The containers are at the leaves
       • Size = Θ(#containers)
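The range-index lookup can be sketched as a small tree whose internal nodes hold sub-ranges and whose leaves hold container ids. Keys are integers for simplicity, and the particular split points and container ids below are illustrative, not a faithful copy of the slide's figure:

```python
def make_leaf(lo, hi, container):
    return {"lo": lo, "hi": hi, "container": container, "children": []}

def make_node(lo, hi, children):
    return {"lo": lo, "hi": hi, "container": None, "children": children}

# Hypothetical tree for one bucket: the range K1-K20 split into sub-ranges.
tree = make_node(1, 20, [
    make_node(1, 10, [make_leaf(1, 5, 0xAB000), make_leaf(6, 10, 0xAB004)]),
    make_node(11, 20, [
        make_node(11, 15, [make_leaf(11, 13, 0xAB002),
                           make_leaf(14, 15, 0xAB001)]),
        make_leaf(16, 20, 0xAB005),
    ]),
])

def lookup(node, key):
    """Descend into ever narrower ranges until reaching a leaf."""
    while node["children"]:
        node = next(c for c in node["children"] if c["lo"] <= key <= c["hi"])
    return node["container"]

print(hex(lookup(tree, 15)))
# 0xab001
```

Because only the leaves carry containers, the tree's size stays Θ(#containers), matching the slide's bound.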
  • 31. Storage Container
       • A storage unit in the datanode
         – A generalization of the HDFS block
         – Id, generation stamp, size
         – The unit of replication
         – Consistent replicas
       • Container size
         – 1G - 10G
         – Container size affects the scale of the Storage Container Manager
         – Large containers take longer to replicate than an individual block
         – A system property, not a data property
  • 32. Storage Container Requirements
       • Stores a variety of data, resulting in different requirements
       • Metadata
         – Individual units of data are very small: kilobytes
         – Atomic updates are important
         – A get/put API is sufficient
       • Object Data
         – The storage container needs to store object data with a wide range of sizes
         – Must support streaming APIs to read/write individual objects
  • 33. Storage Container Implementation
       • Storage container prototype using RocksDB
         – An embeddable key-value store
       • Replication
         – Need the ability to replicate while data is being written
         – RocksDB supports snapshots and incremental backups for replication
       • A hybrid use of RocksDB
         – Small objects: keys and objects stored in RocksDB
         – Large objects: the object is stored in an individual file; RocksDB contains the key and file path
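The hybrid layout can be sketched as follows: small values live inline in the key-value store, large values live in their own file with only the path recorded. A plain dict stands in for the RocksDB instance, and the size threshold is invented for illustration:

```python
import os, tempfile

SMALL_LIMIT = 1024 * 1024            # 1 MB cutoff, an assumed threshold
kv = {}                              # stand-in for the RocksDB instance
blob_dir = tempfile.mkdtemp()

def put(key, value: bytes):
    if len(value) <= SMALL_LIMIT:
        kv[key] = ("inline", value)              # small object: stored inline
    else:
        path = os.path.join(blob_dir, key)
        with open(path, "wb") as f:              # large object: its own file
            f.write(value)
        kv[key] = ("file", path)                 # KV store keeps only the path

def get(key):
    kind, payload = kv[key]
    if kind == "inline":
        return payload
    with open(payload, "rb") as f:               # follow the recorded path
        return f.read()

put("small", b"hello")
put("big", b"x" * (2 * SMALL_LIMIT))
print(get("small"), len(get("big")))
```

Keeping large objects out of the KV store avoids write amplification from compaction while still giving every object a single lookup path through its key.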
  • 34. Storage Container Implementation
       • Transactions for consistency and reliability
         – The storage containers implement a few atomic and persistent operations, i.e. transactions, with reliability guarantees
         – Commit: promotes an object being written to a finalized object; once this operation succeeds, the container guarantees the object is available for reading
         – Put: useful for small writes such as metadata writes
         – Delete: deletes the object
       • A new data pipeline for storage containers
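The commit semantics described above can be sketched with a staging area: data is written where readers cannot see it, and "commit" atomically promotes it to a finalized, readable object. The directory layout and names are illustrative, not Ozone's on-disk format:

```python
import os, tempfile

root = tempfile.mkdtemp()
staging = os.path.join(root, "staging"); os.makedirs(staging)
final = os.path.join(root, "final"); os.makedirs(final)

def write(key, data: bytes):
    """Stage an object; it is not yet visible to readers."""
    with open(os.path.join(staging, key), "wb") as f:
        f.write(data)

def commit(key):
    """Atomically promote a staged object to a finalized one."""
    os.rename(os.path.join(staging, key), os.path.join(final, key))

def read(key):
    """Only committed objects are readable."""
    with open(os.path.join(final, key), "rb") as f:
        return f.read()

write("obj", b"payload")
commit("obj")                 # after this succeeds, the read is guaranteed
print(read("obj"))
```

The atomic rename is what makes commit all-or-nothing: a crash before the rename leaves only uncommitted data, which the recovery rules on the next slides discard.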
  • 35. Data Pipeline Consistency
       • The HDFS consistency mechanism uses two pieces of block state
         – Generation stamp
         – Block length
       • Storage containers use the following two
         – Generation stamp
         – Transaction id
       • A storage container must persist the last executed transaction
       • The transaction id is allocated by the leader of the pipeline
  • 36. Data Pipeline Consistency
       • Upon a restart, a datanode discards all uncommitted data for a storage container
         – State is synchronized to the last committed transaction
       • When comparing two replicas
         – The replica with the latest generation stamp is honored
         – With the same generation stamp, the replica with the latest transaction id is honored
         – Correctness argument: replicas with the same generation stamp and the same transaction id must have been in the same pipeline together
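The replica-selection rule above is a lexicographic comparison: generation stamp first, transaction id as the tiebreaker. A minimal sketch, with hypothetical replica records:

```python
def best_replica(replicas):
    """Pick the authoritative replica: latest (gen_stamp, txn_id) wins."""
    return max(replicas, key=lambda r: (r["gen_stamp"], r["txn_id"]))

replicas = [
    {"dn": "dn1", "gen_stamp": 7, "txn_id": 41},   # stale generation, ignored
    {"dn": "dn2", "gen_stamp": 8, "txn_id": 12},
    {"dn": "dn3", "gen_stamp": 8, "txn_id": 15},   # same generation, higher txn
]
print(best_replica(replicas)["dn"])
# dn3
```

Note that dn1's higher transaction id does not matter: the generation stamp dominates, because transaction ids are only comparable within a generation.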
  • 37. Phased Development
       • Phase 1
         – Basic API
         – Storage container machinery, reliability, replication
       • Phase 2
         – High availability
         – Security
         – Multipart upload
       • Phase 3
         – Caching to improve latency
         – Object versioning
         – Cross-geo replication
  • 38. Team
       • Anu Engineer – aengineer@hortonworks.com
       • Arpit Agarwal – aagarwal@hortonworks.com
       • Chris Nauroth – cnauroth@hortonworks.com
       • Jitendra Pandey – jitendra@hortonworks.com
  • 39. Special Thanks
       • Sanjay Radia
       • Enis Soztutar
       • Suresh Srinivas
  • 40. Thanks!

Editor's Notes

  1. HDFS as a storage system: a great file system that works fantastically for MapReduce, with great adoption in enterprises.
  2. Example: I need to store all my customer documents. A few million customers, each with a few thousand documents. No directory structure needed; a REST API as the primary access mechanism; simple access semantics; very large scale (billions of documents); a wide range of object sizes. A file system forces you to think in terms of files and directories.
  3. Two important questions: What is the partitioning scheme? What does a storage container look like?