Danny Chen presented on Uber's use of HBase for global indexing to support large-scale data ingestion. Uber uses HBase to provide a global view of datasets ingested from Kafka and other sources. To generate indexes, Spark jobs transform data into HFiles, which are bulk-loaded into HBase tables. Given the large data volumes, techniques such as throttling HBase access and explicit Kryo serialization are used. The global indexing solution meets requirements for high throughput, strong consistency, and horizontal scalability across Uber's data lake.
2. Danny Chen
dannyc@uber.com
● Engineering Manager on Hadoop
Data Platform team
● Leading Data Ingestion team
● Previously worked @ Twitter on
the storage team (Manhattan)
● Enjoy playing basketball, biking,
and spending time w/my kids.
3. Uber Apache Hadoop
Platform Team Mission
Build products to support reliable,
scalable, easy-to-use, compliant, and
efficient data transfer (both ingestion
& dispersal) as well as data storage
leveraging the Apache Hadoop
ecosystem.
4. Overview
● High-Level Ingestion & Dispersal
introduction
● Different types of workloads
● Need for Global Index
● How Global Index Works
● Generating Global Indexes with
HFiles
● Throttling HBase Access
● Next Steps
6. Hadoop Data Ecosystem at Uber
[Diagram: Data Ingestion into the Apache Hadoop Data Lake and Data Dispersal out of it, connecting sources and sinks such as Apache Kafka, Cassandra, and Schemaless with Analytical Processing.]
9. Bootstrap
● One time only at beginning of lifecycle
● Large amounts of data
● Millions of QPS throughput
● Need to finish in a matter of hours
● NoSQL stores cannot keep up
10. Incremental
● Dominates lifecycle of Hive table ingestion
● Incremental upstream changes from Kafka
or other data sources.
● 1,000s of QPS per dataset
● Reasonable throughput requirements for
NoSQL stores
13. Requirements for Global Index
● Large amounts of historical data ingested in short
amount of time
● Append only vs Append-plus-update
● Data layout and partitioning
● Bookkeeping for data layout
● Strong consistency
● High Throughput
● Horizontally scalable
● Required a NoSQL store
14. ● Decision was to use HBase
● Trade Availability for Consistency
● Automatic Rebalancing of HBase tables via region splitting
● Global view of dataset via master/slave architecture
23. HFile Index Job Tuning
● Explicitly register classes with Kryo serialization (see the sketch after this list)
● Reduce 3 shuffle stages to one
● Proper HFile size
● Proper partition count and size
● 13 TB of index data, 54 billion index entries
○ 2 hours to generate indexes
○ 10 min to load
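As a concrete illustration of the Kryo bullet above, here is a minimal sketch of explicit class registration in the index job's Spark configuration. The app name and class list are assumptions for illustration (whatever the job actually shuffles), not Uber's code.

```java
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.spark.SparkConf;
import org.apache.spark.serializer.KryoSerializer;

public class IndexJobConf {
    // Builds the SparkConf for the index job with explicit Kryo registration,
    // so Kryo writes a small class ID instead of the fully qualified class
    // name with every shuffled record.
    public static SparkConf build() {
        SparkConf conf = new SparkConf()
            .setAppName("hfile-index-job")  // illustrative name
            .set("spark.serializer", KryoSerializer.class.getName())
            // Fail fast if a shuffled class was forgotten, rather than
            // silently falling back to writing full class names.
            .set("spark.kryo.registrationRequired", "true");
        conf.registerKryoClasses(new Class<?>[] {
            ImmutableBytesWritable.class,  // HFile row key wrapper
            KeyValue.class,                // HFile cell
            byte[].class
        });
        return conf;
    }
}
```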
Lots of effort went into making a completely self-serve onboarding process.
Analytical users with little technical knowledge of Spark, Hadoop, Hive, etc. will still be able to take advantage of our platform.
Our assertion is that when relevant data is discoverable in the appropriate data stores for analytical purposes, there can be substantial gains in efficiency and value for your business. Marmaray is critical for ensuring data lands in the appropriate data store.
Familiarity with the suite of tools in our Hadoop ecosystem covers many potential use cases for extracting insights out of raw data.
This completes the Hadoop ecosystem of tools at Uber and the original vision of the Data Processing Platform.
Heatpipe/Watchtower produce quality schematized data
Ingest the data via Marmaray
Orchestrate jobs via Workflow Management System to run analytics and generate derived datasets, or build models using Michelangelo
Disperse the data using Marmaray to stores with low latency semantics
What sets it apart
Generic ingestion framework
Not tightly coupled to any specific source or sink; product teams focus on source and sink connectors
Dividing bootstrap and incremental phases allows us to choose a KV store that can scale for incremental-phase indexing but not necessarily for bootstrapping the data.
HBase automatically rebalances tables within a cluster by splitting up key ranges when a region gets too large.
It can also load balance by moving new regions to other servers.
The master-slave architecture enables getting a global view of the spread of a dataset across the cluster, which we utilize in customizing dataset specific throughputs to our HBase cluster.
During incremental ingestion
We work in mini batches. It is the job of the work unit calculator to provide the required level of throttling.
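A hedged sketch of that idea follows; the class and method names are hypothetical illustrations, not Marmaray's actual API. The calculator caps each mini batch so the index lookups it triggers stay within the dataset's HBase QPS budget.

```java
// Hypothetical sketch of a work unit calculator: incremental ingestion runs
// in mini batches, and each batch is sized so the resulting HBase index
// traffic stays under the dataset's QPS budget. All names are illustrative.
public final class WorkUnitCalculator {
    private final long maxQps;        // per-dataset QPS budget for the index table
    private final long batchSeconds;  // how long one mini batch may run

    public WorkUnitCalculator(long maxQps, long batchSeconds) {
        this.maxQps = maxQps;
        this.batchSeconds = batchSeconds;
    }

    /** Caps the next mini batch at the number of index lookups the budget allows. */
    public long nextBatchSize(long pendingRecords) {
        long budget = maxQps * batchSeconds;  // lookups we may issue this batch
        return Math.min(pendingRecords, budget);
    }
}
```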
Our Big Data ecosystem’s model of indexes stored in HBase contains entities that help identify the files that need to be updated for a given record in an append-plus-update dataset.
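To make that model concrete, here is a minimal sketch of the lookup it implies, assuming the record key is the HBase row key and an index column points at the data file holding the record's latest copy. The table, column family, and qualifier names are assumptions.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class IndexLookup {
    // Returns the data file that holds the latest copy of a record, or null
    // if the record has never been indexed. Names ("dataset_index", "m",
    // "file") are hypothetical, for illustration only.
    public static String fileForRecord(Connection conn, String recordKey) throws IOException {
        try (Table index = conn.getTable(TableName.valueOf("dataset_index"))) {
            Get get = new Get(Bytes.toBytes(recordKey));
            get.addColumn(Bytes.toBytes("m"), Bytes.toBytes("file"));
            Result r = index.get(get);
            byte[] file = r.getValue(Bytes.toBytes("m"), Bytes.toBytes("file"));
            return file == null ? null : Bytes.toString(file);
        }
    }
}
```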
The layout of index entries in HFiles lets us sort based on key value and column.
This is for the one time upload case
The flatMapToPair transformation in Apache Spark does not preserve the ordering of entries, so a partition-isolated sort is performed. The partitioning is left unchanged to ensure each partition still corresponds to a non-overlapping key range.
HFiles are written to the cluster where HBase is hosted to ensure HBase region servers have access to them during the upload process.
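A minimal sketch of those two steps in Spark's Java API, assuming `indexEntries` is the pair RDD produced by the flatMapToPair stage and `partitioner` is the same range partitioner used upstream; `hbaseConf` is assumed to be already configured for the target table (for example via HFileOutputFormat2.configureIncrementalLoad).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.spark.Partitioner;
import org.apache.spark.api.java.JavaPairRDD;

public class HFileWriter {
    // Sorts index entries within each partition and writes them as HFiles.
    // Reusing the upstream range partitioner means every partition keeps its
    // non-overlapping key range; only the intra-partition order changes,
    // which is what HFiles require (cells in ascending key order).
    public static void writeSortedHFiles(
            JavaPairRDD<ImmutableBytesWritable, KeyValue> indexEntries,
            Partitioner partitioner,
            Configuration hbaseConf,
            String outputPath) {
        indexEntries
            .repartitionAndSortWithinPartitions(partitioner)
            .saveAsNewAPIHadoopFile(
                outputPath,  // on the HBase cluster's HDFS, per the note above
                ImmutableBytesWritable.class,
                KeyValue.class,
                HFileOutputFormat2.class,
                hbaseConf);
    }
}
```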
- HFile upload can be severely affected by region splitting
- Pre-split the HBase table into as many regions as there are HFiles, so each HFile fits within a separate region with non-overlapping keys (sketched below)
- Avoiding size-based HFile splits keeps upload time short: about 10 min even for tens of TB
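A sketch of the pre-split step using the HBase 2.x admin API, assuming the split keys are the boundary row keys of the HFiles as computed by the index job; the column family name is an assumption.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class PresplitTable {
    // Creates the index table with one split point per HFile boundary, so
    // each HFile lands in its own region and no region has to split (and
    // rewrite the HFile) during the bulk load.
    public static void createPresplit(Connection conn, String table,
                                      byte[][] splitKeys) throws IOException {
        try (Admin admin = conn.getAdmin()) {
            admin.createTable(
                TableDescriptorBuilder.newBuilder(TableName.valueOf(table))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("m"))
                    .build(),
                splitKeys);  // one region per non-overlapping HFile key range
        }
    }
}
```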
Three Apache Spark jobs corresponding to three different datasets access their respective HBase index tables, creating load on the HBase region servers hosting those tables.
Adding more servers to the HBase cluster for a single dataset using the global index correlates linearly with a QPS increase, while the dataset’s QPSFraction remains constant.
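A hypothetical sketch of that relationship: if each dataset holds a fixed QPSFraction of the cluster's sustainable QPS, adding region servers raises the dataset's absolute QPS budget while its fraction stays constant. The names and the linear capacity model are illustrative assumptions, not Uber's code.

```java
public final class QpsBudget {
    // Derives a per-dataset QPS budget from its QPSFraction, assuming cluster
    // capacity scales linearly with the number of region servers.
    public static long budgetFor(double qpsFraction, int regionServers, long qpsPerServer) {
        long clusterCapacity = (long) regionServers * qpsPerServer;  // linear scaling assumption
        return (long) (qpsFraction * clusterCapacity);
    }

    public static void main(String[] args) {
        // Example: 0.2 QPSFraction on a 50-server cluster at ~2,000 QPS/server.
        System.out.println(budgetFor(0.2, 50, 2_000));  // prints 20000
    }
}
```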
Explore other indexing solutions to possibly merge bootstrap and incremental indexing solutions for easier maintenance.