1. Hive at Yahoo: Letters from the trenches
Presented by Mithun Radhakrishnan, Chris Drome | June 10, 2015
2015 Hadoop Summit, San Jose, California
2. About myself
Mithun Radhakrishnan
Hive Engineer at Yahoo!
Hive Committer and long-time contributor
› Metastore-scaling
› Integration
› HCatalog
mithun@apache.org
@mithunrk
3. About myself
Chris Drome
Hive Engineer at Yahoo!
Hive contributor
cdrome@yahoo-inc.com
7. 1 TB benchmark
› 6.2x speedup over Hive 0.10 (RCFile)
• Between 2.5-17x
› Average query time: 172 seconds
• Between 5-947 seconds
• Down from 729 seconds (Hive 0.10 RCFile)
› 61% of queries completed in under 2 minutes
› 81% of queries completed in under 4 minutes
8. Explaining the speed-ups
Hadoop 2.x, et al.
Apache Tez
› (Arbitrary DAG)-based Execution Engine
› “Playing the gaps” between M/R stages
• Intermediate data need not hit HDFS
› Smart scheduling
› Container re-use
› Pipelined job start-up
Hive
› Statistics
› Vectorized Execution
ORC
› PPD (Predicate Push-Down)
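A rough sketch of the stock Apache Hive switches behind the features above (the names are standard; the slide doesn't list Yahoo's exact values):
set hive.execution.engine=tez; -- run on Tez instead of M/R
set hive.vectorized.execution.enabled=true; -- vectorized operator pipeline
set hive.stats.autogather=true; -- gather table/partition statistics on write
set hive.optimize.index.filter=true; -- push predicates down into ORC indexes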
9. Expectations with Hive 0.13 in production
Tez would outperform M/R by miles
Tez would enable better cluster utilization
› Use less resources
Tez (and dependencies) would be “production ready”
› GUI for task logs, DAG overviews, swim-lanes
› Speculative execution
Similarly, ORC and Vectorization
› Support evolving schemas
10. The Y!Grid
18 Hadoop Clusters in YGrid
› 41,565 Nodes
› Biggest cluster: 5,728 Nodes
› 1M jobs a day
Hadoop 2.6+
Large Datasets
› Daily, hourly, minute-level frequencies
› Thousands of partitions, 100s of 1000s of files, TBs of data per partition
› 580 PB of data, total
Pig 0.14 on Tez, Pig 0.11
Hive 0.13 on Tez
HCatalog for interoperability
Oozie for scheduling
GDM for data-loading
Spark, HBase, Storm, etc…
11. Data processing use cases
Grid usage
› 30+ million jobs per month
› 12+ million Oozie launcher jobs
Pig usage
› Handles majority of data pipelines/ETL (~43% of jobs)
Hive usage
› Relatively smaller niche
› 632,000 queries per month (35% Tez)
HCatalog for Inter-operability
› Metadata storage for all Hadoop data
› Yahoo-scale
› Pig pipelines with Hive analytics
12. Business Intelligence Tools
Tableau, MicroStrategy
Power users
› Tableau Server for scheduled reports
Challenges:
› Security
• ACLs, Authentication, Encryption over the wire
› Bandwidth
• Transporting results over ODBC
• Limit result-set to 1000s-10000s of rows
• Aggregations (see the query sketch after this list)
› Query Latency
• Metadata queries
• Partition/Table scans
• Materialized views
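The bandwidth advice above, made concrete (query and table are invented for illustration): aggregate server-side and cap the rows before they cross ODBC:
SELECT region, SUM(revenue) AS total_revenue
FROM daily_sales
GROUP BY region
ORDER BY total_revenue DESC
LIMIT 10000; -- keep result sets in the 1000s-10000s range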
13. Non-negotiables for Hive upgrade at Yahoo!
Data producer owns the data
› Unlike traditional DBs
Multi-paradigm data access/generation
› Pig/Hive/MapReduce using HCatalog
Highly available metadata service
UI for tracking/debugging jobs
Execution engine should ideally support speculative execution
14. Yahoo! Hive-0.13
Based on Apache Hive-0.13.1
Internal Yahoo! Patches (admin web-services, data discovery, etc.)
Community patches to stabilize Apache Hive-0.13.1
› Tez
• HIVE-7544, HIVE-6748, HIVE-7112, …
› Vectorization
• HIVE-8163, HIVE-8092, HIVE-7188, HIVE-7105, HIVE-7514, …
› Failures
• HIVE-7851, HIVE-7459, HIVE-7771, HIVE-7396, …
› Optimizations
• HIVE-7231, HIVE-7219, HIVE-7203, HIVE-7052, …
› Data integrity
• HIVE-7694, HIVE-7494, HIVE-7045, HIVE-7346, HIVE-7232, …
Phased upgrades
› Phase 1: 285 JIRAs
› Phase 2: 23 JIRAs (HIVE-8781 and related dependencies)
› Phase 3: 46 JIRAs (HIVE-10114 and related dependencies)
15. Hive deployment (per cluster)
One remote Hive Metastore “instance”
› 4 HCatalog Servers behind a hardware VIP
• L3DSR load balancer
• 96GB-128GB RAM, 16 core boxes
› Backed by Oracle RAC
About 10 Gateways
› Interactive use of Hive (and Pig, Oozie, M/R)
› hive.metastore.uris -> HCatalog
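For illustration, pointing a gateway's clients at the shared metastore is one setting (the host is invented; 9083 is the conventional metastore port):
set hive.metastore.uris=thrift://hcat-vip.example.com:9083;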
About 4 HiveServer2 instances
› Ad Hoc queries, aggregation
16. Evolution of grid services at Yahoo!
[Architecture diagram: Browser and BI tools, HUE, HiveServer2, Gateway machines, HCatalog servers, Oracle RAC, and the Grid]
17. Challenges experienced with Hive on Tez
Query performance on very large data sets
› HIVE-8292: Reading … has high overhead in MapOperator.cleanUpInputFileChangedOp
Split-generation on very large data sets
› Tends to generate more splits (map tasks) compared to M/R
› Long split generation times
› Hogging the Hadoop queues
• Wave factor vs multi-tenancy requirements
› HIVE-10114: Split strategies for ORC
Scaling problems with ATS
› More of a problem with Pig workflows
› 10K+ tasks/job are routine
› AM progress reporting, heart-beating, memory usage
› Hadoop 2.6.0.10+
19. Fast execution engines aren’t the whole picture
At Yahoo! scale:
› 100s of Databases per cluster
› 100s of Tables per database
› 100s of columns per Table
› 1000s of Partitions per Table
• Larger tables: Thousands of partitions, per hour
• Millions of partitions every few days
• 10s of millions of partitions, over dataset retention period
Problems:
› Metadata volume
• Database/Table/Partition IO Formats
• Record serialization details
• HDFS paths
• Statistics
– Per partition
– Per column
21. From: Another ETL pipeline.
To: The Yahoo Hive Team
Subject: Slow queries
YHive team,
My query fails with OutOfMemoryError. I tried increasing
container size, but it still fails. Please help!
Here are my settings:
set mapreduce.input.fileinputformat.split.maxsize=16777216;
set mapreduce.map.memory.mb=4096;
set mapreduce.reduce.memory.mb=4096;
set mapred.child.java.opts="-Xmx1024m";
...
INSERT OVERWRITE TABLE my_table PARTITION( foo, bar, goo )
SELECT * FROM {
...
}
...
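Our reading of this mail (the slide leaves the diagnosis unstated): the 1GB -Xmx caps the heap far below the 4GB container, and the 16MB max split size multiplies task counts. A hedged correction:
set mapreduce.input.fileinputformat.split.maxsize=268435456; -- 256MB; don't shrink splits needlessly
set mapreduce.map.memory.mb=4096;
set mapreduce.reduce.memory.mb=4096;
set mapreduce.map.java.opts=-Xmx3276m; -- heap at ~80% of the container
set mapreduce.reduce.java.opts=-Xmx3276m;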
22. From: YET another ETL pipeline.
To: The Yahoo Hive Team
Subject: Slow UDF performance
YHive team,
Why does using a simple custom UDF cause queries to
time out?
SELECT foo, bar, my_function( goo )
FROM my_large_table
WHERE ...
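(The punchline, from the speakers' notes: my_function() was a web-service call, invoked once per row; the UDF, not Hive, set the pace.)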
24. From: The ETL team
To: The Yahoo Hive Team
Subject: A small matter of size...
Dear YHive team,
We have partitioned our table using the following
6 partition keys: {hourly-timestamp, name, property,
geo-location, shoe-size, and so on…}.
For a given timestamp, the combined cardinality of the
remaining partition-keys is about 10000/hr.
If queries on partitioned tables are supposed to
be faster, how come queries on our table take forever
just to get off the ground?
Yours gigantically,
Project Grape Ape
26. Metadata volume and Query Execution time
Anatomy of a Hive query
1. Compile query to AST
2. Thrift-call to Metastore, for partition list
3. Examine partitions, data-paths, etc. Construct physical query plan.
4. Run optimizers on the plan
5. Execute plan. (M/R, Tez).
Partition pruner:
› Removes partitions that shouldn’t participate in the query.
› In effect, removes input directories from the Hadoop job.
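A concrete example (table invented): with dt as the partition key, the pruner keeps only one day's input directories, so the job never touches the rest:
SELECT COUNT(*) FROM page_views WHERE dt = '20150610';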
27. The problems of large-scale metadata
Partition pruner is single-threaded
› Query spans a day
› Query spanning a week? 2 million partitions
Partition objects are huge:
› HDFS Paths
› IO Formats
› Record Deserializer info
› Data column schema
Datanucleus:
› Materializing 1 Partition joins 6 Oracle tables in the backend.
Thrift serialization/deserialization takes minutes.
› *Minutes*.
28. Immediate workarounds
“Hive wasn’t originally designed for more than 10000s of partitions, total…”
Throw hardware at it
› 4 HCatalog servers behind a hardware VIP
› High-RAM boxes:
• 96GB-128GB RAM for metastore processes
• Tune each to use 100 connections to the Oracle RAC
Client-side tuning
› Increase hive.metastore.client.socket.timeout
› Increase heap size as needed (container size)
› Multi-threaded fstat operations
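A sketch of those client-side knobs (values illustrative, not Yahoo's actuals):
set hive.metastore.client.socket.timeout=600; -- seconds; generous, for very large partition listings
-- plus a larger heap for the client JVM, e.g.:
-- export HADOOP_CLIENT_OPTS="-Xmx4g"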
29. Fix the leaky/noisy bits
Metastore frequently ran out of memory:
› Disable Hadoop FileSystem cache (see the property at the end of this slide)
• HIVE-3098, HDFS-3545
• FileSystem.CACHE used UGI.hashcode()
– Compared Subjects for equality, not equivalence.
› Fixed Thrift 0.9
• TSaslServerTransport had circular references
• JVM couldn’t detect these for cleanup
– WeakReferences are your friend
• Fix incompatibility with L3DSR pings
Data discovery from Oozie:
› Use JMS notifications, on publication
› Oozie Coordinators wake up on ActiveMQ notification, kick off dependent workflows
› Reduced polling frequency
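Of the fixes above, disabling the FileSystem cache comes down to one standard Hadoop property; placing it in the metastore's core-site.xml/startup opts is our assumption:
fs.hdfs.impl.disable.cache=true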
30. More fixes
Metadata-only queries:
› SELECT DISTINCT tstamp FROM my_purple_table ORDER BY tstamp DESC LIMIT 1000;
› Replace HiveMetaStoreClient::getPartitions() with getPartitionNames().
› Local job, versus cluster.
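Stock Hive gates this rewrite behind a flag; a minimal sketch (the setting is standard Apache Hive, not quoted from the slide):
set hive.optimize.metadataonly=true;
-- answered from partition names as a local job, instead of a cluster scan:
SELECT DISTINCT tstamp FROM my_purple_table ORDER BY tstamp DESC LIMIT 1000;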
Optimize the optimizer:
› The first step in some optimizers:
• List<Partition> partitions = hiveMetaStoreClient.getPartitions( db, table, (short)-1 );
• Pray that the client and/or the metastore don’t run out of memory.
• Take a nap.
› Fixed PartitionPruner, MetadataOnlyOptimizer.
31. Long-term fixes:
DirectSQL short-circuits:
› Datanucleus problems at scale
• (Yes, we are aware of the irony that might result from extrapolation.)
› Specific to the backing DB.
Compaction of Partition info:
› HIVE-7223, HIVE-7576, HIVE-9845, etc.
› Schema evolves infrequently
› Partition-info rarely differs from table-info
– Except HDFS paths (which are superstrings of the table path)
› List<Partition> vs Iterator<Partition>
• PartitionSet abstraction
– The delight of Inheritance in Thrift
• Reduced memory footprints
› Similar compaction in Pig/HCatalog: ~26x smaller split metadata, ~10x faster query start-up
32. “The finest trick of The Devil was to persuade you that he does not exist.”
-- Charles Baudelaire
36. From: A major reporting team
To: The Yahoo Hive Team
Subject: Urgent! Customer reports are borking.
Dear YHive team,
When we connect Tableau Server 8.3 to Y!Hive
0.12/0.13, it is unusably slow. Queries take too long
to run, and time out.
We’d prefer not to change our query-code too
much. How soon can Hive accommodate our simple queries?
Yours hysterically,
Project Zodiac
37. Analysis: The query
Non-const partition key predicates:
› E.g.
WHERE utc_time <= from_unixtime(unix_timestamp() - 2*24*60*60, 'yyyyMMdd')
AND utc_time >= from_unixtime(unix_timestamp() - 32*24*60*60, 'yyyyMMdd')
› Solution: Use constant expressions where possible.
› Fix: Hive 1.x supports dynamic partition pruning, and constant folding.
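One way to act on the “constant expressions” advice (a sketch; the variable names are ours): compute the window in the calling script and pass literals in via hiveconf substitution:
-- hive --hiveconf start_dt=20150509 --hiveconf end_dt=20150608 -f query.hql
SELECT COUNT(*) FROM my_table
WHERE utc_time >= '${hiveconf:start_dt}'
AND utc_time <= '${hiveconf:end_dt}';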
Costly joins with partitioned dimension tables:
› E.g.
› SELECT … FROM fact_table JOIN (SELECT * FROM dimension_table WHERE dt IN (SELECT MAX(dt) FROM dimension_table));
› Workaround: External “pointer” tables.
› Fix: Dynamic partition pruning.
38. Analysis: The data
Data stored in TEXTFILE
› Solution: Switch to columnar storage
• ORC, dictionary encoding, vectorization, predicate pushdown
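A minimal DDL sketch of that switch (table and columns invented for illustration):
CREATE TABLE foo_stats_orc (user_id BIGINT, metric STRING, val DOUBLE)
PARTITIONED BY (dt STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB');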
Over-partitioning:
› Too many partition keys
› Diminishing returns with partition pruning
› Solution: Eliminate partition keys, consider sorting
Small Part files
› Hard-coded nReducers
› E.g.
hive> dfs -count /projects/foo.db/foo_stats;
9081 682735 1876847648672 /projects/foo.db/foo_stats
› Solution:
• set hive.merge.mapfiles=true;
• set hive.merge.mapredfiles=true;
• set hive.merge.tezfiles=true;
39. We’re not done yet
Tez/ATS scaling
Speed up split calculation
Auto/Offline compaction
Abuse detection
Better handling of schema evolution
Skew Joins in Hive
UDFs with JNI and configuring LD_LIBRARY_PATH
42. YHive configuration settings:
set hive.merge.mapfiles=false; -- Except when producing data.
set hive.merge.mapredfiles=false; -- Except when producing data.
set hive.merge.tezfiles=false; -- Except when producing data.
-- For ORC files.
-- dfs.blocksize=134217728; -- hdfs-site.xml
set orc.stripe.size=67108864; -- 64MB stripes.
set orc.compress.size=262144; -- 256KB compress buffer.
set orc.compress=ZLIB; -- Override to NONE, per table.
set orc.create.index=true; -- ORC indexes.
set orc.optimize.index.filter=true; -- Predicate pushdown with ORC index
set orc.row.index.stride=10000;
43. YHive configuration settings: (contd)
-- Delegation Token Store settings:
set hive.cluster.delegation.token.store.class=ZooKeeperTokenStore;
set hive.cluster.delegation.token.renew-interval=172800000;
(Start HCat Server with -Djute.maxbuffer=24MB -> 190K+ tokens.)
-- Data Nucleus settings:
set datanucleus.connectionPoolingType=DBCP; -- i.e., not BoneCP.
set datanucleus.cache.level1.type=none;
set datanucleus.cache.level2.type=none;
set datanucleus.connectionPool.maxWait=200000;
set datanucleus.connectionPool.minIdle=0;
-- Misc.
set hive.metastore.event.listeners=com.yahoo.custom.JMSListener;
44. Zookeeper Token Storage performance
Jute buffer size    Max delegation token count
4MB                 30K
8MB                 60K
12MB                90K
16MB                130K
20MB                160K
24MB                190K
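(The scaling is close to linear: each extra MB of jute buffer buys roughly 7.5K-8K more tokens.)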
46. Why Hive on Tez?
Versus Shark and Impala:
› Pre-emption for in-memory systems
› Multi-tenant, shared clusters
› Heterogeneous nodes
› Existing ecosystem
› Community-driven development
Shark
› Good proof of concept, but was not production ready
› Shuffle performance
› Hive on Spark – under active development
47. Analysis: Tableau/ODBC driver
Tableau has come a long way, but
› Schema discovery
• SELECT * FROM my_large_table LIMIT 0;
• SELECT DISTINCT part_key FROM my_large_table;
› SQL dialect
• Depends on vendor-specific driver-name
› Schema metadata-scans
• 3 partition listings per query
› Miscellaneous problems:
• “Custom SQL” rewrites
• Trouble with quoting
tl;dr: Try to transition to Simba’s 2.0.x drivers with Tableau 8.3.x