10 compute nodes, with 512GB RAM per node, running HDP 2.3
Scale 10K (10TB), interactive queries
Single query runs – via Hive CLI
Concurrency runs – via HiveServer2 and jmeter
Hive1: Hive 1.2 + Tez 0.7
Pre-warm and container reuse enabled
LLAP: close to the Hive 2.0 branch; Tez close to the current master branch
Caching Enabled (as of November 2015)
1. DPP (dynamic partition pruning): implemented as two sequential jobs. The first processes the pruning side and saves the dynamic values to HDFS; the second uses those values to filter out unneeded partitions. Not fully tested yet.
2. Spark RDD persistence is used to store temporary results from repeated subqueries and avoid re-computation. This is similar to a materialized view and happens automatically; it is especially useful for self-joins, self-unions, and CTEs.
3. Vectorized map-join and an optimized hash table for map-join. These are very similar to the Tez implementations.
4. Spark's parallel order-by is used for global sorting without being limited to a single reducer; internally, Spark handles the sampling.
5. Wait a few seconds after the SparkContext is created before submitting the job, to ensure enough executors have launched. SparkContext allows a job to be submitted right away, even while executors are still starting up, and reducer parallelism is partially determined by the number of executors available at submission time. This matters for short-lived sessions, such as those launched by Oozie.
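Item 1 above can be sketched as a toy two-pass program, assuming an illustrative in-memory layout (the names dim_rows, fact_partitions, and pruning_values are hypothetical, not Hive internals):

```python
# Toy sketch of dynamic partition pruning (DPP) as two sequential jobs.
# All data structures here are illustrative stand-ins for Hive tables.

# "Dimension" table: (region_id, region_name); the query filters it first.
dim_rows = [(1, "NA"), (2, "EU"), (3, "APAC")]

# "Fact" table partitioned by region_id; each partition is a list of rows.
fact_partitions = {
    1: [("order-a", 1)],
    2: [("order-b", 2)],
    3: [("order-c", 3)],
    4: [("order-d", 4)],
}

# Job 1: evaluate the dimension-side predicate and save the surviving
# partition keys (in Hive on Spark these dynamic values go to HDFS).
pruning_values = {rid for rid, name in dim_rows if name in ("NA", "EU")}

# Job 2: scan only the fact partitions whose key survived job 1,
# instead of reading every partition.
scanned = [row for rid, rows in fact_partitions.items()
           if rid in pruning_values
           for row in rows]

print(sorted(scanned))  # partitions 3 and 4 are never read
```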
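The sample-based global sort behind item 4 can be illustrated in plain Python; this is a minimal sketch of range partitioning via sampling, not Spark's actual sortByKey internals, and the function name and parameters are made up for the example:

```python
import random

def parallel_sort(data, num_partitions=4, sample_size=20, seed=42):
    """Globally sort `data` without funneling it through one "reducer"."""
    rng = random.Random(seed)
    # Step 1: sample the input to estimate the key distribution.
    sample = sorted(rng.sample(data, min(sample_size, len(data))))
    # Step 2: derive num_partitions - 1 range boundaries from the sample.
    step = max(1, len(sample) // num_partitions)
    bounds = [sample[i * step] for i in range(1, num_partitions)]
    # Step 3: route each value to the range partition it falls into.
    partitions = [[] for _ in range(num_partitions)]
    for v in data:
        idx = sum(v >= b for b in bounds)  # count of boundaries v passes
        partitions[idx].append(v)
    # Step 4: sort each range independently (in Spark, in parallel);
    # ranges are disjoint and ordered, so concatenation is globally sorted.
    return [v for part in partitions for v in sorted(part)]

rng = random.Random(7)
data = [rng.randrange(1000) for _ in range(100)]
assert parallel_sort(data) == sorted(data)
```

The sampling step is why no single task has to see the whole dataset: each partition sorts only its own key range.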