Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
1.
2. FASTER, FASTER, FASTER:
THE TRUE STORY OF A
MOBILE ANALYTICS DATA
MART ON HIVE
Mithun Radhakrishnan
Josh Walters
3. 3
• Mithun Radhakrishnan
• Hive Engineer at Yahoo
• Hive Committer
• Has an irrational fear of spider monkeys
• mithun@apache.org
• @mithunrk
About myself
6. 6
From: The [REDACTED] ETL team
To: The Yahoo Hive Team
Subject: A small matter of size...
Dear YHive team,
We have partitioned our table using the following 6 partition keys:
{hourly-timestamp, name, property, geo-location, shoe-size, and so on…}.
For a given timestamp, the combined cardinality of the remaining
partition-keys is about 10000/hr.
If queries on partitioned tables are supposed to be faster, how come
queries on our table take forever just to get off the ground?
Yours gigantically,
Project [REDACTED]
7. 7
ABOUT ME
• Josh Walters
• Data Engineer at Yahoo
• I build lots of data pipelines
• Can eat a whole plate of deep fried cookie dough
• http://joshwalters.com
• @joshwalters
8. 8
WHAT IS THE CUSTOMER NEED?
• Faster ETL
• Faster queries
• Faster ramp up
9. 9
CASE STUDY: MOBILE DATA MART
• Mobile app usage data
• Optimize performance
• Interactive analytics
10. 10
LOW HANGING FRUIT
• Tez Tez Tez!
• Vectorized query execution
• Map-side aggregations
• Auto-convert map join
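The four optimizations above each map to a Hive setting. A hedged sketch of the standard configuration names (defaults vary by Hive version, so verify against yours):

```sql
SET hive.execution.engine=tez;               -- run on Tez instead of MapReduce
SET hive.vectorized.execution.enabled=true;  -- process rows in batches
SET hive.map.aggr=true;                      -- partial aggregation on the map side
SET hive.auto.convert.join=true;             -- auto-broadcast small join tables
```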
14. 14
ORC!
• Used in largest data systems
• 90% boost on sorted columns
• 30x compression versus raw text
• Fits well with our tech stack
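Adopting ORC is a one-line change at table-creation time. An illustrative example (table and column names are hypothetical; ZLIB is a common compression choice):

```sql
CREATE TABLE app_events_orc (
  user_id    STRING,
  event_time TIMESTAMP,
  event_type STRING
)
PARTITIONED BY (dt STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "ZLIB");
```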
15. 15
SKETCH ALL THE THINGS
• Very accurate
• Can store sketches in Hive
• Union, intersection, difference
• 75% boost on relevant queries
16. 16
SKETCH ALL THE THINGS
SELECT COUNT(DISTINCT id)
FROM DB.TABLE
WHERE ...; -- ~100 seconds
SELECT estimate(sketch(id))
FROM DB.TABLE
WHERE ...; -- ~25 seconds
17. 17
SKETCH ALL THE THINGS
Standard Deviation      1      2      3
Confidence Interval    68%    95%    99%
K = 16                 25%    51%    77%
K = 512                 4%     8%    13%
K = 4096                1%     3%     4%
K = 16384             < 1%     1%     2%
18. 18
MORE SKETCH INFO
• Summarization, Approx. and Sampling: Tradeoffs for Improving Query, Hadoop Summit, 2015
• http://datasketches.github.io
20. 20
FUNNEL ANALYSIS
• Complex to write, difficult to reuse
• Slow, requires multiple joins
• Using UDFs, now runs in seconds, not hours
• https://github.com/yahoo/hive-funnel-udf
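To see why the UDF helps, here is an illustrative baseline (all table and column names are hypothetical): without a UDF, a three-stage funnel needs a subquery and join per stage, plus temporal checks so users are only counted after completing earlier stages.

```sql
SELECT
  COUNT(DISTINCT s1.user_id) AS visited_signup,
  COUNT(DISTINCT CASE WHEN s2.ts >= s1.ts
                      THEN s2.user_id END) AS entered_info,
  COUNT(DISTINCT CASE WHEN s2.ts >= s1.ts AND s3.ts >= s2.ts
                      THEN s3.user_id END) AS submitted
FROM      (SELECT user_id, MIN(ts) AS ts FROM events
           WHERE action = 'visit_signup' GROUP BY user_id) s1
LEFT JOIN (SELECT user_id, MIN(ts) AS ts FROM events
           WHERE action = 'enter_info' GROUP BY user_id) s2
  ON s1.user_id = s2.user_id
LEFT JOIN (SELECT user_id, MIN(ts) AS ts FROM events
           WHERE action = 'submit' GROUP BY user_id) s3
  ON s1.user_id = s3.user_id;
```

Each added funnel stage adds another scan and join, which is what makes the hand-written version large, unwieldy, and slow.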
21. 21
REALLY FAST OLAP
• OLAP type queries are the most common
• Aggregate only queries: group, count, sum, …
• Can we optimize for such queries?
22. 22
OLAP WITH DRUID
• Interactive, sub-second latency
• Ingest raw records, then aggregate
• Open source, actively developed
• http://druid.io
23. 23
BI TOOL
• Many options
• Don’t cover all needs
• Need graphs and dashboards
24. 24
CARAVEL
• Hive, Druid, Redshift, MySQL, …
• Simple query construction
• Open source, actively developed
• https://github.com/airbnb/caravel
25. 25
WHAT WE LEARNED
• Product teams need custom data marts
• Complex to build and run
• Just want to focus on business logic
26. 26
DATA MART IN A BOX!
• Generalized ETL pipeline
• Easy to spin-up
• Automatic continuous delivery
• Just give us a query!
33. 33
• Out of the box:
• Tez container reuse
• set tez.am.container.reuse.enabled=true;
• Tez speculative execution
• set tez.am.speculation.enabled=true;
• Reduce-side vectorization
• set hive.vectorized.execution.reduce.enabled=true;
• set hive.vectorized.execution.reduce.groupby.enabled=true;
Performance Tuning
34. 34
• Understand your data:
• Use ORC’s index-based filtering:
• set hive.optimize.index.filter=true;
• Bloom filters
• ALTER TABLE my_orc SET TBLPROPERTIES("orc.bloom.filter.columns"="foo,bar");
• Cardinality?
• Sort on filter-column
• Trade-offs: Parallelism vs. filtering
Performance Tuning
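An illustrative way to apply the sort advice (hypothetical table and column names): rewrite the data sorted on the most-filtered column so that similar values are contiguous, letting ORC min/max statistics and bloom filters skip whole row groups.

```sql
INSERT OVERWRITE TABLE my_orc
SELECT id, foo, bar
FROM staging_events
SORT BY foo;  -- sorts within each reducer; fewer reducers means
              -- tighter clustering, but also less parallelism
```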
35. 35
• Understand your queries:
• Prefer LIKE and INSTR over REGEXP*
• Compile-time date/time functions:
• current_date()
• current_timestamp()
• Queries generated from UI tools
Performance Tuning
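An illustrative comparison (hypothetical table): in Hive 1.2+, current_date and current_timestamp are folded to constants at query-planning time, while unix_timestamp() with no arguments is evaluated per row.

```sql
-- Slower: evaluated for every row
SELECT COUNT(*) FROM events
WHERE ts > unix_timestamp() - 86400;

-- Faster: the right-hand side becomes a compile-time constant
SELECT COUNT(*) FROM events
WHERE ts > unix_timestamp(current_timestamp) - 86400;

-- Likewise, prefer LIKE or INSTR over the generic REGEXP:
-- WHERE url LIKE '%checkout%'    -- instead of: url REGEXP 'checkout'
```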
36. 36
• Index-based filtering available to Pig / MR users
• HCatLoader, HCatInputFormat
• Split-calculation improvements
• Block-based BI
• Parallel ETL
• Disabled dictionaries for Complex data types
• OOMs
Performance Improvements - ORC
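A hedged sketch of how dictionary encoding can be curbed when ORC writers hit OOMs on complex types: lowering the dictionary threshold to zero effectively disables dictionary encoding (verify the property name against your Hive/ORC version).

```sql
SET hive.exec.orc.dictionary.key.size.threshold=0.0;
```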
42. 42
• AvroSerDe needs read-schema at job-runtime (i.e. map-side)
• Stored on HDFS
• ETL Jobs need 10-20K maps
• Replication factor
• Data-node outage
• It gets steadily worse
• Block-replication on node-loss
• Task attempt retry
• More nodes lost
• Rinse and repeat
The Problem
44. 44
• Reconcile metastore-schema against read-schema?
• toAvroSchema( fromAvroSchema( avroSchema )) != avroSchema
• Store schema in TBLPROPERTIES?
• Cache read-schema during SerDe::initialize()
• Once per map-task
• Prefetch read-schema at query-planning phase
• Once per job
• Separate optimizer
The Solution
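An illustrative example of keeping the read-schema out of HDFS (table name and schema are hypothetical): embedding it in the metastore via 'avro.schema.literal' means map tasks no longer all fetch a schema file, as they would with 'avro.schema.url'.

```sql
ALTER TABLE my_avro_table SET TBLPROPERTIES (
  'avro.schema.literal' = '{
    "type": "record",
    "name": "Event",
    "fields": [
      {"name": "id", "type": "string"},
      {"name": "ts", "type": "long"}
    ]
  }'
);
```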
This table is our largest. We use this to test and break our system.
Customers always want data faster
Everyone wants data ETL’ed faster
Analysts and product owners want faster queries
Users need to be able to ramp up quickly and start using the data
Mobile app data: swipes, clicks, usage time, etc
Query at the speed of thought
Analysts need results now, not in 3 hours
Tez provided huge benefits to our jobs, with massive performance improvements, though it is not yet used for all jobs at Yahoo.
Vectorized execution is easy to enable. Will perform transformations on batches of records, greatly increasing performance.
Map-side aggregations can help to limit the work done in the reduce stage by performing some of the transformations in the map stage.
With auto-convert join, you don’t have to provide hints in the query (which few users do). Can help speed up lots of join queries.
Makes things faster, but still not good enough.
More partitions, more control over reads, smaller data read, faster queries!
Really deep partitions, multiple nested levels of partitions.
Too many partitions have other problems, too many part files cluttering up your namespace. Reducers have to have handles open to many part files, causing a slowdown.
Can cause a lot of problems. HCatalog can’t handle that many partitions. The time it takes HCatalog to lookup the metadata can greatly reduce the gains from partitioning the data.
We would like thousands of partitions, have to settle for hundreds.
Deep partitions group similar data together, helping with compression
We gave a talk on this at Stanford’s XLDB conference, if you want more info watch the video!
Next we wanted to see if we swapped our cluster to use SSDs, would that help with Hive performance?
Not much improvement, our jobs were mostly CPU bound.
Does it make sense for intermediate task attempt output to be stored on solid state drives?
Our audience data pipeline processes about 200 billion events a day. This comes out to roughly 400 Terabytes of uncompressed data a day.
This compresses down to 15 Terabytes of data a day with ORC.
We have to store this data for 18 months, so you can see where compression can be really important to us.
Our users may also want to run queries over that whole time period, so our file format must be efficient enough to handle that
Sketches, or streaming algorithms, provide some useful features for very large datasets
Queries like distinct count are very common for analysts, and can be quite slow
Sketches can perform these queries in a single pass, with minimal memory usage
These sketches can be used to do distinct counts, but they can also be used in unions, intersections, and differences
We observed a 75% speed boost on relevant queries
Information about Sketches has been presented before
The code is open source, and there are UDFs for Hive and Pig
Users occasionally want to run very complex queries that would be too difficult to write in Pig or Hive
One of the most common for our users was funnel analysis.
In these instances, UDFs can provide a lot of help to our data users
Funnel analysis is used to measure how users are flowing through a series of actions
For example: How many people go to the signup page? How many of those people complete the information? How many of them then submit the information?
Each stage should have the same or fewer users
Usually you would have to do multiple selects and joins to get this to work
The query can become very large and unwieldy
We came up with a simple UDF to perform this whole process in a single map reduce job, greatly simplifying and speeding up the process
This UDF is open source, feel free to contribute!
Analyst queries can commonly be answered by an OLAP system
Can these queries run with sub-second latencies?
Aggregate only, no single record results
Really fast, useable, interactive queries
Don’t have to do anything special to the data: Druid ingests the records raw and then aggregates
Open source system, lots of contributors, very actively developed
We began a search for a user interface to sit on top of these data marts
There are many options, but they don’t cover all our needs: support for many database systems, open source, actively developed, and so on
Dashboards were one of the most important features we were looking for
Caravel, out of AirBnb, was what we decided to go with
Has support for Druid and any system that has a SQLAlchemy connector (which is just about everything)
The project is very active, and we are contributing to it
This mobile data mart wasn’t the only one of its kind at Yahoo
We had many different teams trying to build similar systems
We decided it would be a good idea to build a data mart framework for other teams to use
Data marts are a slice of a data warehouse, a small projection and transformation for a specific business unit
These data marts cover the use case of the business unit
Analysts, marketing, and sales teams may not know Oozie, how to setup continuous delivery for data pipelines, or other data pipeline best practices
Could they just provide some ETL logic, and magically build a data mart pipeline?
A data pipeline framework
Fast to spin up (less than an hour)
Only need a Hive ETL query
Comes with continuous delivery, windowed aggregates, low latency OLAP processing, and a business intelligence UI
Low latency OLAP? How?
Simple architecture, by keeping it general it is able to cover many different use cases
Features such as windowed aggregates and Druid can be easily removed if not needed
Can be made even more real-time by using a lower time granularity in the initial ETL step
We have successfully used 10 minute granularity, resulting in an almost real-time data system
The example project runs on a busy 4,000-node cluster. A dedicated queue is configured to control query concurrency.
For the example use case, 50% of queries run in under 0.5 min, and 75% run in under 5 mins.
Y! does not have the luxury of dedicated, underutilized clusters purely for interactive use.
HDFS bandwidth, disk bandwidth, network bandwidth are all shared, even if the Yarn queue is different.
Used for batch and interactive queries, and also as an ETL tool
Supports Looker/Tableau/MicroStrategy for dashboards and ad-hoc queries
HiveConf.java is huge now. Configuration can be tricky.
Here are some settings that you should be enabling out of the box.
Container reuse: Useful not just to amortize the cost of container spin-up, but also to place task output closer to the next stage.
Speculative execution: Same as in MR. Slow task-attempts can be worked around.
Reduce-side vectorization: “Only” 10-30% improvements.
Explain index-based filtering. aka PPD.
ORC Files are split into Stripes, with several row-groups per stripe.
Each stripe has rows stored in columnar fashion, and column-statistics, including max/min values per column.
Index-based filtering skips a row-group based on your query predicate if the value doesn’t fall within the min/max limits for the row-group. Simple, right?
1.2 now has Bloom-filters. You can choose your columns. Greater likelihood of false positives if cardinality is large? Confirm!
Sorting on a column has tradeoffs.
Similar column-values being contiguous helps compression/encoding, and skipping more rows together.
But you could end up with a few tasks holding all of the data to be processed.
REGEXP is generic, and will perform worse than LIKE and INSTR. Prefer the latter, if you don’t absolutely require REGEXP.
1.2 has compile-time date/time functions. At query-build! As opposed to once per row.
Using BI/UI tools? Look closely at the generated queries. Might be using REGEXP, unix_timestamp(), etc.
Tableau used to use "SELECT * FROM your_table LIMIT 0;" to discover metadata.
Column-projection pushdown was available in Pig through HCat for some time. Now, PPD as well.
We’ve improved split-calculation:
Block-based BI: 1 split per block! (Checked in independently in Apache.)
ETL: Not usable at large scale, in current form.
Better memory-usage when writing complex types in ORC, by disabling dictionaries (just for complex types).
Skew-joins are available in Pig. The need for it is apparently specific to Y!. Current approach in Hive is a little clunky. We have a fix coming.
Better memory usage with SpillableRowContainers, especially for wide-tables.
The loyalty to a data-format can approach fundamentalist proportions, as illustrated by this Y!Hive user, who was asked to consider the ORC format once his column-schema matures.
Whatchusay?
At scale, reading from a single schema-file on HDFS can be detrimental.
This has gotten entirely too silly.
Eugene O’Neill: “There is no present or future… Only the past happening over and over again, now. “
Schema stored on disk. Statistics/histograms stored alongside data.
Bucky Lasek at the X-Games in 2001. Notice where he’s looking… Not at the camera, but setting up his next trick.