The document proposes a new benchmark called the Elasticity Test (ET) to evaluate elastic cloud big data systems under service-level agreement (SLA) constraints. The ET generates realistic workloads based on production job arrival patterns and data sizes. It measures SLA compliance by computing the distance between actual query completion times and the specified SLAs, providing a more meaningful metric than the current TPCx-BB metric. Experimental results on Apache Hive and Spark using the new ET and metric show significant differences from the current metric, highlighting weaknesses in elasticity and isolation. Future work includes testing database-as-a-service platforms and further study of how to specify and incorporate SLAs into benchmarks.
Benchmarking Elastic Cloud Big Data Services under SLA Constraints
1. Benchmarking Elastic Cloud Big Data
Services under SLA Constraints
Nicolas Poggi, Victor Cuevas-Vicenttín, David Carrera, Josep Lluis Berral, Thomas Fenech,
Gonzalo Gomez, Davide Brini, Alejandro Montero
Umar Farooq Minhas, Jose A. Blakeley, Donald Kossmann, Raghu Ramakrishnan and
Clemens Szyperski.
TPCTC - August 2019
2. Outline
1. Intro to TPCx-BB
a. Limitations for cloud systems
b. Contributions
2. Realistic workload generation
a. Production datasets
b. Job arrival rates
3. Elasticity Test
a. Current metric
b. SLA-based addition
4. Experimental evaluation
a. Elasticity Test
b. Load, Power, Throughput tests
c. Metric evaluation
5. Conclusions
a. Future directions
3. Benchmarking and TPCx-BB
• Benchmarks capture the solution to a problem and guide decisions.
• Widely used in development, configuration, and testing.
• TPCx-BB (BigBench) is the first standardized big data benchmark
• Collaboration between industry and academia
• Follows the retailer model of TPC-DS
• Adds:
• Semi and unstructured data
• SQL, UDF, ML, and NLP queries
Retailer data model
4. TPCx-BB benchmark workflow
• Similar to previous TPC database benchmarks:
• Load Test (TLD):
• Generates the database
• Imports raw data, populates the metastore, computes statistics, converts to columnar format
• Power Test (TPT):
• Runs the queries sequentially
• Throughput Test (TTT):
• Runs the queries concurrently
• Includes a data refresh stage
• Produces a final performance metric:
• BigBench queries per minute (BBQpm)
[Diagram: the Load Test builds the DB at a scale factor (SF); the Power Test runs q1 … q30 sequentially; the Throughput Test runs user streams (User1 … UserN) concurrently; the results combine into the final metric.]
5. Limitations of the concurrency test
Drawback 1:
• Constant-concurrency workloads at the same scale
Drawback 2:
• Does not consider QoS (isolation)
• Query time degradation is not obvious from the final metric
• We found poor scalability under concurrency in BB [1]
[Diagram: concurrent query streams, e.g. Stream1: q15, q21, …, q16; Stream2: q12, q18, …, q2; Stream3: q16, q30, …, q19]
[Chart: Q4 from 10 to 100 GB runs over 15x slower under concurrency]
[1] Characterizing BigBench queries, Hive, and Spark in multi-cloud environments. TPCTC'17
6. Proposal and contributions
1. Build a realistic big data workload generator
• Based on production workloads
2. Measure QoS in the form of per-query SLAs
• Apply the results in a new metric
• With minimal parameters
3. Extend TPCx-BB with a new concurrency test and metric
• Implement a driver and evaluate differences
8. Analyzing production big data workloads
• Cosmos cluster operated within Microsoft
• Sample of 350,000 job submissions
• Over a month of data in 2017
• Objectives:
1. Model job submission patterns
2. Workload characterization
[Chart: job submissions over time, showing peaks and valleys]
10. Modeling arrival rates
• Use a Hidden Markov Model (HMM) to model the temporal pattern in the workload
• Transition probabilities between a finite number of states
• The HMM allows scaling the workload
• Fluctuations are captured by 4 states and the transitions between them
[Chart: job arrival rate over time, with peaks and valleys captured by the fitted states]
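A minimal sketch of such a state-based arrival generator. The 4 states, all transition probabilities, and the per-state rates below are illustrative assumptions, not values fitted to the Cosmos trace; a real model would learn them from the production data.

```python
import random

# Hypothetical 4-state model capturing valleys, moderate load, and peaks.
# Transition probabilities and mean arrival rates are made up for
# illustration; a real HMM would be fitted to the production trace.
STATES = ["valley", "low", "high", "peak"]
TRANSITIONS = {
    "valley": [0.70, 0.20, 0.08, 0.02],
    "low":    [0.20, 0.50, 0.25, 0.05],
    "high":   [0.05, 0.25, 0.50, 0.20],
    "peak":   [0.02, 0.10, 0.38, 0.50],
}
RATE = {"valley": 1, "low": 3, "high": 6, "peak": 10}  # mean jobs/interval

def simulate_arrivals(n_intervals, seed=42):
    """Return a list of job-submission counts, one per time interval."""
    rng = random.Random(seed)
    state = "valley"
    counts = []
    for _ in range(n_intervals):
        # Noisy draw around the state's mean rate (simplified).
        counts.append(max(0, round(rng.gauss(RATE[state], RATE[state] ** 0.5))))
        # Move to the next hidden state via the transition matrix.
        state = rng.choices(STATES, weights=TRANSITIONS[state])[0]
    return counts
```

Scaling the workload then amounts to multiplying the per-state rates, while the state transitions preserve the peak/valley shape of the trace.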
11. Job input data size
• As no general temporal pattern was found, a cumulative distribution suffices for modeling the scale factor (SF)
• The CDF is used to generate random variates mapped to an SF
• 1, 10, 100, 1000 GB
• Studied further in [2]
• Findings:
• 55% of jobs read < 1 GB
• 90% of jobs read < 1 TB
CDF of the job's input data size
[2] Big Data Management Systems performance analysis using Aloja and BigBench. Master's thesis
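The CDF-to-SF mapping can be sketched as inverse-CDF sampling. Only the 55% (< 1 GB) quantile and the top bucket follow the trace findings above; the intermediate 10 GB and 100 GB quantiles below are illustrative assumptions.

```python
import random

# Empirical-CDF breakpoints mapped to TPCx-BB scale factors (in GB).
# The 0.55 quantile comes from the trace (55% of jobs read < 1 GB);
# the 0.75 and 0.90 splits are assumed for illustration, and the
# remaining tail maps to the largest SF (90% of jobs read < 1 TB).
CDF_TO_SF = [
    (0.55, 1),     # from the trace
    (0.75, 10),    # assumed split
    (0.90, 100),   # assumed split
    (1.00, 1000),  # remaining tail
]

def draw_scale_factor(rng):
    """Inverse-CDF sampling: map a uniform draw to a scale factor."""
    u = rng.random()
    for p, sf in CDF_TO_SF:
        if u <= p:
            return sf
    return CDF_TO_SF[-1][1]
```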
14. Methodology for generating workloads
1. Set the scale (max concurrent submissions)
• Defaults to n
• Total queries = n × the number of benchmark queries
2. Generate the model (queries per interval)
1. Assign queries to each batch randomly
• Query repetition is avoided within a batch
2. Multiple scale factors can be set
• Includes all standard smaller SFs
3. Define the granularity
1. Set the time between batches
2. Defaults to 60 s
Elasticity Test sequence (time intervals × queries per batch):
t1: q17
t2: q7
t3: q15, q21
t4: q6, q9, q14
t5: q9, q14
t6: q11, q22, q21
t7: q16, q15
t8: q24
…
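The steps above (per-interval counts, random batch assignment without repetition, 60 s granularity) can be sketched as follows. The query pool q1 … q30 and the default granularity follow the slides; the function shape is our assumption.

```python
import random

QUERY_POOL = [f"q{i}" for i in range(1, 31)]  # TPCx-BB queries q1..q30
GRANULARITY_S = 60  # default time between batches

def generate_schedule(counts_per_interval, rng=None):
    """Build (start_time, batch) pairs; no query is repeated within a batch."""
    rng = rng or random.Random()
    schedule = []
    for t, n in enumerate(counts_per_interval):
        n = min(n, len(QUERY_POOL))        # cannot exceed the pool without repeats
        batch = rng.sample(QUERY_POOL, n)  # sampling without replacement
        schedule.append((t * GRANULARITY_S, batch))
    return schedule
```

The counts per interval would come from the arrival-rate model, so the schedule inherits its peaks and valleys.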
17. New SLA-aware benchmark metric
• Query-specific SLAs under concurrency
• Set a limit for each query's completion time
• Measures
• Number of misses
• Distance to the SLA
• Indirectly, isolation and dependencies
• Currently defined ad hoc
• Uses the Power Test times for the SUT(s)
• Adds a 25% tolerance margin
• Benefits
• Works at all SFs and is future-proof to technology changes
Example:
q1 took 38 s in isolation
SLA for q1 = 38 s × 1.25 = 47.5 s
[Figure: Elasticity Test sequence over time (queries per batch), annotated with each query's SLA distance]
31. SLA distance
• Distance between the actual execution time and the specified SLA
Queries that complete within their SLA
do not contribute to the sum
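The definition above admits a simple formalization (the symbol names are ours; the paper may use different notation):

```latex
% SLA distance: sum of overshoots past each query's SLA.
% t_i = actual completion time of query i
% s_i = SLA (time limit) for query i
% N   = number of queries in the test
D_{\mathrm{SLA}} \;=\; \sum_{i=1}^{N} \max\bigl(0,\; t_i - s_i\bigr)
```

A query with t_i ≤ s_i contributes max(0, t_i − s_i) = 0, matching the note that queries completing within their SLA do not add to the sum.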
33. SLA factor
• < 1 when less than 25% of the queries fail their SLA
• > 1 when more than 25% of the queries fail their SLA
• Based on the number of queries that fail to meet their SLA
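A minimal sketch tying the pieces together. The 25% SLA margin over Power Test times and the 25% miss threshold come from the slides; the exact SLA-factor formula below (miss fraction normalized by the threshold, so it crosses 1 at exactly 25% misses) is our assumption, chosen only to be consistent with the stated < 1 / > 1 behavior.

```python
def sla_metrics(power_times, actual_times, margin=0.25, threshold=0.25):
    """Compute per-query SLAs, misses, SLA distance, and SLA factor."""
    # SLA per query: Power Test (isolated) time plus a 25% margin,
    # e.g. q1 ran in 38 s in isolation -> SLA = 47.5 s.
    slas = {q: t * (1 + margin) for q, t in power_times.items()}
    misses = sum(1 for q, t in actual_times.items() if t > slas[q])
    # Queries that finish within their SLA contribute nothing.
    distance = sum(max(0.0, t - slas[q]) for q, t in actual_times.items())
    # Assumed normalization: < 1 if under 25% of queries miss their SLA,
    # > 1 if over 25% miss.
    factor = (misses / len(actual_times)) / threshold
    return slas, misses, distance, factor
```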
37. Experimental evaluation
• Experiments performed on Apache Hive (2.2/2.3) and Spark (2.1/2.2)
• Benchmark runs limited to the 14 SQL queries of TPCx-BB
• Due to errors and scalability limitations
• Using a fixed scale factor
• 512 cores and 2 TB of RAM in total
• 32 workers, each with 16 vCPUs and 64 GB of RAM
• Ran on 3 major cloud providers using block storage
• Results anonymized
• (Only results for Provider A at 10 TB presented)
39. Elasticity Test at 10TB and 2 streams
[Charts: Elasticity Test query timelines for Provider A: Hive and Provider A: Spark]
40. Complete TPCx-BB test times at 10TB

Phase                 Provider A: Hive   Provider A: Spark
Elasticity Time (s)   7,084              6,603
Throughput Time (s)   12,878             6,496
Power Time (s)        5,036              5,520
Load Time (s)         5,124              5,124
Total Time (s)        30,122             23,743
41. BB++Qpm (new) vs BBQpm (old)

Comparison of the two scores at 10 TB:

                    BBQpm (old)   BB++Qpm (new)
Provider A: Hive    1,352         295
Provider A: Spark   1,767         1,286

• Hive gets a 4.3x lower score in the new metric
• Spark also gets a lower score (30% diff)
44. Summary
• The throughput test in TPC database benchmarks provides limited signal
• Closed-loop system (constant load)
• Does not consider temporal patterns
• Limited test of load balancers and schedulers (no queueing)
• By modeling a real-world big data cluster we have produced:
• A workload generator with realistic job arrival rates
• A multi-data-scale test
• Extended TPCx-BB with the Elasticity Test
• Incorporating SLAs and proposing a new metric
• Evaluated its applicability to cloud big data systems
• And how its score differs from the current metric
45. Conclusions and future work
• The Elasticity Test considers aspects crucial for the cloud
• Dynamic workloads in accordance with real-world behavior
• Query-level QoS and isolation
• The ET can improve the development of elastic cloud systems
• By rewarding systems that keep QoS under concurrency
• While saving costs in periods of low intensity
Future directions
• Test elastic DBaaS / QaaS offerings under concurrency
• The specification of SLAs needs to be studied further
• Work with this community to gather feedback and define next steps
46. Thanks, questions?
Follow up / feedback : Npoggi@ac.upc.edu
Benchmarking Elastic Cloud Big Data Services
under SLA Constraints
TPCTC - August 2019