Developing big data analytics often involves trial-and-error debugging, because datasets are unclean or assumptions about the data are wrong. Data scientists typically write code that implements a data processing pipeline and test it on their local workstation with a small data sample downloaded from a TB-scale data warehouse. They cross their fingers and hope that the program works in the expensive production cloud. When a job fails or produces a suspicious result, they spend hours guessing at the source of the error and digging through post-mortem logs.
In such cases, the data scientists may want to pinpoint the root cause of errors by investigating a subset of corresponding input records. In this talk, we present BigSift, an automated debugger for Apache Spark that data engineers and scientists can use. It takes an Apache Spark program, a user-defined test oracle function, and a dataset as input and outputs a minimum set of input records that reproduces the same test failure.
BigSift combines insights from automated fault isolation in software engineering and data provenance in database systems to find a minimum set of failure-inducing inputs. It redefines data provenance for the purpose of debugging using a test oracle function and implements several optimizations geared specifically towards the iterative nature of automated debugging workloads. BigSift exposes an interactive web interface where a user can monitor a big data analytics job running remotely on the cloud, write a user-defined test oracle function, and then trigger the automated debugging process. BigSift also provides a set of predefined test oracle functions, which can be used to explain common types of anomalies in big data analytics. This debugging effort is led by UCLA Professors Miryung Kim and Tyson Condie and has produced several research papers at top software engineering and database conferences.
Automated Debugging of Big Data Analytics in Apache Spark Using BigSift with Muhammad Ali Gulzar and Miryung Kim
1. Muhammad Ali Gulzar, Miryung Kim
University of California, Los Angeles
Automated Debugging of Big Data
Analytics in Apache Spark Using BigSift
#Res3SAIS
2. Big Data Debugging in the Dark
Map Reduce
Develop locally Hope it works Run in cloud Bug!
Guesswork
3. BigDebug: Debugging Primitives for
Interactive Big Data Processing
ICSE 2016
Muhammad Ali Gulzar, Matteo Interlandi, Seunghyun Yoo, Sai
Deep Tetali, Tyson Condie, Todd Millstein, Miryung Kim
Presented at Spark Summit 2017
4. Interactive Debugging with BigDebug
   1. Simulated Breakpoint
   2. On Demand Guarded Watchpoint
   3. Crash Culprit Identification
   4. Backward and Forward Tracing
5. Titian: Data Provenance support in Spark
VLDB 2015
Matteo Interlandi, Kshitij Shah, Sai Deep Tetali, Muhammad Ali
Gulzar, Seunghyun Yoo, Miryung Kim, Todd Millstein, Tyson Condie
6. Data Provenance – in SQL
Sensors
Tuple-ID   Time   Sensor-ID   Temperature
T1         11AM   1           34
T2         11AM   2           35
T3         11AM   3           35
T4         12PM   1           35
T5         12PM   2           35
T6         12PM   3           100
T7         1PM    1           35
T8         1PM    2           35
T9         1PM    3           80
SELECT time, AVG(temp)
FROM sensors
GROUP BY time
Result
Result-ID   Time   AVG(temp)
ID-1        11AM   34.6
ID-2        12PM   56.6   (outlier)
ID-3        1PM    50     (outlier)
Why do ID-2 and ID-3 have such high values?
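The question on this slide can be made concrete with a small sketch in plain Scala collections (no Spark). It evaluates the same grouped average and keeps, for every result row, the tuple IDs that contributed to it. The names `Reading`, `SensorProvenance`, and `avgWithLineage` are made up for this illustration and are not part of Titian or BigSift.

```scala
object SensorProvenance {
  // One row of the Sensors table from the slide.
  final case class Reading(tupleId: String, time: String, sensorId: Int, temp: Double)

  val sensors: Seq[Reading] = Seq(
    Reading("T1", "11AM", 1, 34), Reading("T2", "11AM", 2, 35), Reading("T3", "11AM", 3, 35),
    Reading("T4", "12PM", 1, 35), Reading("T5", "12PM", 2, 35), Reading("T6", "12PM", 3, 100),
    Reading("T7", "1PM", 1, 35), Reading("T8", "1PM", 2, 35), Reading("T9", "1PM", 3, 80)
  )

  // SELECT time, AVG(temp) FROM sensors GROUP BY time,
  // returning each average together with the IDs of its contributing tuples.
  def avgWithLineage(rows: Seq[Reading]): Map[String, (Double, Seq[String])] =
    rows.groupBy(_.time).map { case (time, group) =>
      (time, (group.map(_.temp).sum / group.size, group.map(_.tupleId)))
    }
}
```

Looking up `"12PM"` in the result returns the average together with T4, T5, and T6, which is the set a provenance query would hand back for the outlier ID-2. Note that it includes the two normal readings as well as the suspicious T6.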
7. Step 1: Instrument Workflow in Spark
[Diagram: workflow RDDs (lines, errors, codes, pairs, counts, reports) across Stage 1 and Stage 2, with a LineageRDD attached at the Hadoop, Combiner, Reducer, and Stage boundaries]
Hadoop LineageRDD
Input ID    Output ID
offset1     id1
offset2     id2
offset3     id3
Combiner LineageRDD
Input ID        Output ID
{ id1, id3 }    400
{ id2 }         4
Reducer LineageRDD
Input ID    Output ID
[p1, p2]    400
[ p1 ]      4
Stage LineageRDD
Input ID    Output ID
400         id1
4           id2
8. Step 2: Example Backward Tracing
[Diagram: lineage tables on Worker1, Worker2, and Worker3]
Reducer
Input ID    Output ID
[p1, p2]    400
[ p1 ]      4
Stage
Input ID    Output ID
400         id1
4           id2
Hadoop
Input ID    Output ID
offset1     id1
offset2     id2
offset3     id3
Combiner
Input ID        Output ID
{ id1, id3 }    400
{ id2 }         4
Hadoop (second copy, elided on the slide)
Input ID    Output ID
offset1     id1
…           …
Combiner (second copy, elided on the slide)
Input ID      Output ID
{ id1, … }    400
Join: Stage.Input ID = Reducer.Output ID
Traced Reducer rows
Input ID    Output ID
p1          400
9. Step 2: Example Backward Tracing
[Diagram: lineage tables on Worker1 and Worker2]
Hadoop
Input ID    Output ID
offset1     id1
offset2     id2
offset3     id3
Combiner
Input ID        Output ID
{ id1, id3 }    400
{ id2 }         4
Hadoop (second copy, elided on the slide)
Input ID    Output ID
offset1     id1
…           …
Combiner (second copy, elided on the slide)
Input ID      Output ID
{ id1, … }    400
Traced Reducer rows
Input ID    Output ID
p1          400
Join: Combiner.Output ID = Reducer.Output ID
10. Step 2: Example Backward Tracing
[Diagram: lineage tables on Worker1 and Worker2]
Hadoop
Input ID    Output ID
offset1     id1
offset2     id2
offset3     id3
Combiner
Input ID        Output ID
{ id1, id3 }    400
{ id2 }         4
Hadoop (second copy, elided on the slide)
Input ID    Output ID
offset1     id1
…           …
Combiner (second copy, elided on the slide)
Input ID      Output ID
{ id1, … }    400
Join: Combiner.Input ID = Hadoop.Output ID
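The three join steps on the backward tracing slides can be sketched for a single worker as a chain of lookups over the lineage tables. This is only an illustration in plain Scala collections: the table contents are copied from the slides, while the object and method names are hypothetical, and the sketch ignores how partition IDs (p1, p2) route a trace to other workers.

```scala
object BackwardTrace {
  // Lineage tables for one worker; each row maps input ID(s) to an output ID.
  val hadoop: Seq[(String, String)] =
    Seq(("offset1", "id1"), ("offset2", "id2"), ("offset3", "id3"))
  val combiner: Seq[(Set[String], Int)] = Seq((Set("id1", "id3"), 400), (Set("id2"), 4))
  val reducer: Seq[(Set[String], Int)] = Seq((Set("p1", "p2"), 400), (Set("p1"), 4))
  val stage: Seq[(Int, String)] = Seq((400, "id1"), (4, "id2"))

  // Follows one final output ID back to the input offsets it was derived from.
  def traceBack(stageOutputId: String): Set[String] = {
    // Stage.Input ID = Reducer.Output ID
    val stageInputs = stage.collect { case (in, out) if out == stageOutputId => in }.toSet
    val reducerOutputs = reducer.collect { case (_, out) if stageInputs(out) => out }.toSet
    // Combiner.Output ID = Reducer.Output ID
    val combinerInputs =
      combiner.collect { case (ins, out) if reducerOutputs(out) => ins }.flatten.toSet
    // Combiner.Input ID = Hadoop.Output ID
    hadoop.collect { case (offset, out) if combinerInputs(out) => offset }.toSet
  }
}
```

With these tables, tracing the stage output `id1` yields `offset1` and `offset3`, and tracing `id2` yields `offset2`.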
11. BigSift: Automated Debugging of Big
Data Analytics in DISC Applications
SoCC 2017
Muhammad Ali Gulzar, Matteo Interlandi, Xueyuan Han, Tyson
Condie, Miryung Kim
12. Motivating Example for Automated
Debugging with BigSift
• Alice writes a Spark program that identifies, for each state in the US,
the delta between the minimum and the maximum snowfall reading
for each day of any year and for any particular year.
• An input data record that measures 1 foot of snowfall on January 1st of
Year 1992, in the 99504 zip code (Anchorage, AK) area, appears as
99504 , 01/01/1992 , 1ft
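A program of the kind Alice writes might look roughly like the following sketch, written with plain Scala collections in place of Spark RDDs. Everything here is an assumption made for illustration: the `zipToState` lookup, the `toMm` helper, and the standard conversion factors (304.8 mm per foot, 25.4 mm per inch). The slides do not show Alice's actual conversion code, and with these factors a reading of 70in becomes 1778 mm.

```scala
object SnowfallDelta {
  // Hypothetical zip-to-state lookup covering only the zip codes used in the example.
  val zipToState: Map[String, String] = Map("99504" -> "AK", "90031" -> "CA")

  // Converts a reading such as "1ft" or "145mm" to millimetres.
  def toMm(reading: String): Double =
    if (reading.endsWith("mm")) reading.dropRight(2).toDouble
    else if (reading.endsWith("ft")) reading.dropRight(2).toDouble * 304.8
    else if (reading.endsWith("in")) reading.dropRight(2).toDouble * 25.4
    else throw new IllegalArgumentException(s"unknown unit in $reading")

  // For each (state, day) and (state, year) key, the max reading minus the min reading.
  def deltas(lines: Seq[String]): Map[(String, String), Double] =
    lines.flatMap { line =>
      val fields = line.split(",").map(_.trim)
      val dateParts = fields(1).split("/")
      val state = zipToState(fields(0))
      val mm = toMm(fields(2))
      // Emit one pair keyed by day and one keyed by year, like a flatMap stage.
      Seq(((state, s"${dateParts(0)}/${dateParts(1)}"), mm), ((state, dateParts(2)), mm))
    }.groupBy(_._1).map { case (key, pairs) =>
      val readings = pairs.map(_._2)
      (key, readings.max - readings.min)
    }
}
```

Each input line contributes to two keys, which is why a single bad record can surface in more than one output row.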
13. Debugging Incorrect Output
Pipeline stages: TextFile FlatMap GroupByKey Map Output
TextFile (input records)
99504 , 01/01/1992 , 1ft
99504 , 03/01/1992 , 0.1ft
99504 , 01/01/1993 , 70in
99504 , 03/01/1993 , 145mm
99504 , 01/01/1994 , 245mm
99504 , 01/01/1993 , 85mm
90031 , 02/01/1991 , 0mm
FlatMap
AK , 01/01 , 304.8
AK , 1992 , 304.8
AK , 03/01 , 30.5
AK , 1992 , 30.5
AK , 01/01 , 21336
AK , 1993 , 21336
AK , 03/01 , 145
AK , 1993 , 145
AK , 01/01 , 245
AK , 1994 , 245
…. ….
GroupByKey
AK , 01/01 , [304.8, 21336, 245, 85]
AK , 03/01 , [30.5 , 145]
AK , 1992 , [304.8 , 30.5]
AK , 1993 , [21336, 145, 85]
AK , 1994 , [245]
CA , 02/01 , [0]
CA , 1991 , [0]
Map (output)
AK , 01/01 , 21251
AK , 03/01 , 114.5
AK , 1992 , 274.3
AK , 1993 , 21251
AK , 1994 , 0
CA , 02/01 , 0
CA , 1991 , 0
Debugging is challenging because the set of input records is very large and the program
involves computing min and max and a unit conversion.
14. Data Provenance with Titian [VLDB ‘15]
Pipeline stages: TextFile FlatMap GroupByKey Map Output
TextFile (input records)
99504 , 01/01/1992 , 1ft
99504 , 03/01/1992 , 0.1ft
99504 , 01/01/1993 , 70in
99504 , 03/01/1993 , 145mm
99504 , 01/01/1994 , 245mm
99504 , 01/01/1993 , 85mm
90031 , 02/01/1991 , 0mm
FlatMap
AK , 01/01 , 304.8
AK , 1992 , 304.8
AK , 03/01 , 30.5
AK , 1992 , 30.5
AK , 01/01 , 21336
AK , 1993 , 21336
AK , 03/01 , 145
AK , 1993 , 145
AK , 01/01 , 245
AK , 1994 , 245
…. ….
GroupByKey
AK , 01/01 , [304.8, 21336, 245, 85]
AK , 03/01 , [30.5 , 145]
AK , 1992 , [304.8 , 30.5]
AK , 1993 , [21336, 145, 85]
AK , 1994 , [245]
CA , 02/01 , [0]
CA , 1991 , [0]
Map (output)
AK , 01/01 , 21251
AK , 03/01 , 114.5
AK , 1992 , 274.3
AK , 1993 , 21251
AK , 1994 , 0
CA , 02/01 , 0
CA , 1991 , 0
Data provenance over-approximates the scope of failure-inducing inputs, i.e., all records in
the faulty key-group are marked as faulty.
15. Delta Debugging
• Alice tests the output from different input configurations using a test function
Run 1 fails the test
Pipeline stages: TextFile FlatMap GroupByKey Map Output
TextFile (input records)
99504 , 01/01/1992 , 1ft
99504 , 03/01/1992 , 0.1ft
99504 , 01/01/1993 , 70in
99504 , 03/01/1993 , 145mm
99504 , 01/01/1994 , 245mm
99504 , 01/01/1993 , 85mm
90031 , 02/01/1991 , 0mm
FlatMap
AK , 01/01 , 304.8
AK , 1992 , 304.8
AK , 03/01 , 30.5
AK , 1992 , 30.5
AK , 01/01 , 21336
AK , 1993 , 21336
AK , 03/01 , 145
AK , 1993 , 145
AK , 01/01 , 245
AK , 1994 , 245
…. ….
GroupByKey
AK , 01/01 , [304.8, 21336, 245, 85]
AK , 03/01 , [30.5 , 145]
AK , 1992 , [304.8 , 30.5]
AK , 1993 , [21336, 145, 85]
AK , 1994 , [245]
CA , 02/01 , [0]
CA , 1991 , [0]
Map (output)
AK , 01/01 , 21251
AK , 03/01 , 114.5
AK , 1992 , 274.3
AK , 1993 , 21251
AK , 1994 , 0
CA , 02/01 , 0
CA , 1991 , 0
Test function
def test(key: String, delta: Float): Boolean = {
  delta < 6000
}
16. Delta Debugging
• DD is a systematic debugging procedure guided by a test function.
DD is inefficient because it does not eliminate irrelevant input records
upfront.
TextFile FlatMap GroupByKey Map Output
99504 , 01/01/1992 , 1ft
99504 , 03/01/1992 , 0.1ft
99504 , 01/01/1993 , 70in
AK , 01/01 , 304.8
AK , 1992 , 304.8
AK , 03/01 , 30.5
AK , 1992 , 30.5
AK , 01/01 , 21336
AK , 1993 , 21336
AK , 01/01 , [304.8, 21336]
AK , 03/01 , [30.5]
AK , 1992 , [304.8 , 30.5]
AK , 1993 , [21336]
AK , 01/01 , 21031.2
AK , 03/01 , 0
AK , 1992 , 274.3
AK , 1993 , 0
Run 2 fails the test
1
2
17. Delta Debugging
• DD is a systematic debugging procedure guided by a test function.
DD is inefficient because it does not eliminate irrelevant input records
upfront.
TextFile FlatMap GroupByKey Map Output
99504 , 01/01/1992 , 1ft
99504 , 03/01/1992 , 0.1ft
99504 , 01/01/1993 , 70in
AK , 01/01 , 304.8
AK , 1992 , 304.8
AK , 03/01 , 30.5
AK , 1992 , 30.5
AK , 01/01 , [304.8]
AK , 03/01 , [30.5]
AK , 1992 , [304.8 , 30.5]
AK , 01/01 , 0
AK , 03/01 , 0
AK , 1992 , 274.3
Run 3 passes the test
18. Delta Debugging
• DD is a systematic debugging procedure guided by a test function.
DD is inefficient because it does not eliminate irrelevant input records
upfront.
TextFile FlatMap GroupByKey Map Output
99504 , 01/01/1992 , 1ft
99504 , 03/01/1992 , 0.1ft
99504 , 01/01/1993 , 70in
AK , 01/01 , 21336
AK , 1993 , 21336
AK , 01/01 , [21336]
AK , 1993 , [21336]
AK , 01/01 , 0
AK , 1993 , 0
Run 4 passes the test
19. Delta Debugging
• DD is a systematic debugging procedure guided by a test function.
DD is inefficient because it does not eliminate irrelevant input records
upfront.
TextFile FlatMap GroupByKey Map Output
99504 , 01/01/1992 , 1ft
99504 , 03/01/1992 , 0.1ft
99504 , 01/01/1993 , 70in
AK , 01/01 , 304.8
AK , 1992 , 304.8
AK , 01/01 , [304.8]
AK , 1992 , [304.8]
AK , 01/01 , 0
AK , 1992 , 0
Run 5 passes the test
20. Delta Debugging
• DD is a systematic debugging procedure guided by a test function.
DD is inefficient because it does not eliminate irrelevant input records
upfront.
TextFile FlatMap GroupByKey Map Output
99504 , 01/01/1992 , 1ft
99504 , 03/01/1992 , 0.1ft
99504 , 01/01/1993 , 70in
AK , 03/01 , 30.5
AK , 1992 , 30.5
AK , 03/01 , [30.5]
AK , 1992 , [30.5]
AK , 03/01 , 0
AK , 1992 , 0
Run 6 passes the test
21. Delta Debugging
• DD is a systematic debugging procedure guided by a test function.
DD is inefficient because it does not eliminate irrelevant input records
upfront.
TextFile FlatMap GroupByKey Map Output
99504 , 01/01/1992 , 1ft
99504 , 03/01/1992 , 0.1ft
99504 , 01/01/1993 , 70in
AK , 01/01 , 21336
AK , 1993 , 21336
AK , 01/01 , [21336]
AK , 1993 , [21336]
AK , 01/01 , 0
AK , 1993 , 0
Run 7 passes the test
22. Delta Debugging
• DD is a systematic debugging procedure guided by a test function.
DD is inefficient because it does not eliminate irrelevant input records
upfront.
TextFile FlatMap GroupByKey Map Output
99504 , 01/01/1992 , 1ft
99504 , 03/01/1992 , 0.1ft
99504 , 01/01/1993 , 70in
AK , 03/01 , 30.5
AK , 1992 , 30.5
AK , 01/01 , 21336
AK , 1993 , 21336
AK , 03/01 , [30.5]
AK , 1992 , [30.5]
AK , 01/01 , [21336]
AK , 1993 , [21336]
AK , 03/01 , 0
AK , 1992 , 0
AK , 01/01 , 0
AK , 1993 , 0
Run 8 passes the test
23. Delta Debugging
• DD is a systematic debugging procedure guided by a test function.
DD is inefficient because it does not eliminate irrelevant input records
upfront.
TextFile FlatMap GroupByKey Map Output
99504 , 01/01/1992 , 1ft
99504 , 03/01/1992 , 0.1ft
99504 , 01/01/1993 , 70in
AK , 01/01 , 304.8
AK , 1992 , 304.8
AK , 01/01 , 21336
AK , 1993 , 21336
AK , 01/01 , [304.8 , 21336]
AK , 1992 , [304.8]
AK , 1993 , [21336]
AK , 01/01 , 21031.2
AK , 1992 , 0
AK , 1993 , 0
Run 9 fails the test
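The sequence of runs above follows the general shape of delta debugging: split the input, re-test the pieces, and narrow down to a failing subset. Below is a compact sketch of a ddmin-style search over an in-memory input. It is a simplified textbook variant written for this transcript, not BigSift's implementation, and the `passes` function stands in for running the whole job on a subset and applying the test to its output.

```scala
object DeltaDebugging {
  // Returns a subset of `input` that still fails the test and from which
  // no single chunk of size one can be removed without making the test pass.
  // `passes(subset)` must return true when the output for `subset` satisfies the test.
  def ddmin[A](input: Vector[A], passes: Vector[A] => Boolean): Vector[A] = {
    require(!passes(input), "the full input must fail the test")

    def loop(current: Vector[A], granularity: Int): Vector[A] =
      if (current.size < 2) current
      else {
        val chunkSize = math.ceil(current.size.toDouble / granularity).toInt
        val chunks = current.grouped(chunkSize).toVector
        // Complement i is the current input with chunk i removed.
        val complements = chunks.indices.toVector.map(i => chunks.patch(i, Nil, 1).flatten)
        chunks.find(c => !passes(c)) match {
          case Some(failing) => loop(failing, 2) // a single chunk already fails
          case None =>
            complements.find(c => !passes(c)) match {
              case Some(failing) => loop(failing, math.max(granularity - 1, 2))
              case None if granularity < current.size =>
                loop(current, math.min(current.size, granularity * 2)) // split finer
              case None => current // cannot be reduced further
            }
        }
      }

    loop(input, 2)
  }
}
```

Applied to the readings for the key (AK, 01/01), namely 304.8, 21336, 245, and 85, with a test that passes when max minus min is below 6000, the search stops at the pair 304.8 and 21336. Each of those readings passes on its own, but together they fail, as in Run 9. Every call to `passes` corresponds to one full job run, which is the cost the slides describe as inefficient.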
24. Automated Debugging in DISC with BigSift
Input: A Spark Program, A Test Function
Approach: Data Provenance + Delta Debugging
Optimizations: Test Predicate Pushdown, Prioritizing Backward Traces, Bitmap based Test Memoization
Output: Minimum Fault-Inducing Input Records
25. Optimization 1: Test Predicate Pushdown
• Observation: During backward tracing, data provenance traces
through all the partitions even though only a few partitions are faulty
If applicable, BigSift pushes down the test function to test the
output of combiners in order to isolate the faulty partitions.
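One way to picture the pushdown is shown below, under the assumption that each partition's combiner emits a partial max-minus-min for a key and that the user's test can be evaluated on that partial value. The partition names and the `combine` helper are hypothetical; the sketch only shows the idea of testing partial results and is not BigSift's code.

```scala
object TestPushdown {
  // The user-defined test from the example: a delta is acceptable below 6000.
  def test(key: String, delta: Double): Boolean = delta < 6000

  // Partial result a combiner would emit for one key on one partition.
  def combine(readings: Seq[Double]): Double = readings.max - readings.min

  // Partitions whose partial result already fails the test.
  def faultyPartitions(key: String, partitions: Map[String, Seq[Double]]): Set[String] =
    partitions.collect {
      case (id, readings) if readings.nonEmpty && !test(key, combine(readings)) => id
    }.toSet
}
```

If the readings 304.8 and 21336 sit on one partition and 245 and 85 on another, only the first partition fails, so backward tracing can skip the second. When no single partition fails on its own, this shortcut does not apply and the trace has to cover all partitions.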
26. Optimization 2: Prioritizing Backward Traces
• Observation: The same faulty input record may contribute to multiple
output records failing the test.
Pipeline stages: TextFile FlatMap GroupByKey Map Output
TextFile (input records)
99504 , 01/01/1992 , 1ft
99504 , 03/01/1992 , 0.1ft
99504 , 01/01/1993 , 70in
99504 , 03/01/1993 , 145mm
99504 , 01/01/1994 , 245mm
99504 , 01/01/1993 , 85mm
90031 , 02/01/1991 , 0mm
FlatMap
AK , 01/01 , 304.8
AK , 1992 , 304.8
AK , 03/01 , 30.5
AK , 1992 , 30.5
AK , 01/01 , 21336
AK , 1993 , 21336
AK , 03/01 , 145
AK , 1993 , 145
AK , 01/01 , 245
AK , 1994 , 245
…. ….
GroupByKey
AK , 01/01 , [304.8, 21336, 245, 85]
AK , 03/01 , [30.5 , 145]
AK , 1992 , [304.8 , 30.5]
AK , 1993 , [21336, 145, 85]
AK , 1994 , [245]
CA , 02/01 , [0]
CA , 1991 , [0]
Map (output)
AK , 01/01 , 21251
AK , 03/01 , 114.5
AK , 1992 , 274.3
AK , 1993 , 21251
AK , 1994 , 0
CA , 02/01 , 0
CA , 1991 , 0
In case of multiple faulty outputs, BigSift overlaps two backward
traces to minimize the scope of fault-inducing input records.
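Overlapping two traces amounts to a set intersection over input record offsets. The sketch below assigns hypothetical 0-based offsets to the seven input records in the order they are listed and derives the two traces by matching dates and years against the failing keys (AK, 01/01) and (AK, 1993). It assumes both failures share a common faulty record.

```scala
object TraceOverlap {
  // Offsets of the input records that feed each failing output key.
  val traceDay: Set[Int] = Set(0, 2, 4, 5) // AK, 01/01: 1ft, 70in, 245mm, 85mm
  val traceYear: Set[Int] = Set(2, 3, 5)   // AK, 1993: 70in, 145mm, 85mm

  // Records common to every backward trace; the traces must be non-empty.
  def overlap(traces: Seq[Set[Int]]): Set[Int] = traces.reduce(_ intersect _)
}
```

The intersection leaves offsets 2 and 5 (the 70in and 85mm readings from 01/01/1993), so the follow-up search starts from two records instead of four.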
27. Optimization 3: Bitmap based Test Memoization
• Observation: Delta debugging may
run a program on the same
subset of input redundantly.
• BigSift leverages a bitmap to compactly
encode the offsets of the original input that
make up an input subset.
[Diagram: Input Data Bitmap (0 1 0 1 0 0 0 0 1 1) mapped to a Test Outcome]
We use a bitmap based test memoization technique to avoid
redundant testing of the same input dataset.
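A minimal sketch of the idea, using Scala's immutable `BitSet` as the bitmap and a mutable map as the memo table. The `Memo` class and its `testRuns` counter are invented for this illustration; how BigSift stores and looks up bitmaps internally is not described on the slide.

```scala
import scala.collection.immutable.BitSet
import scala.collection.mutable

object MemoizedTest {
  // Caches test outcomes keyed by the set of original input offsets in a subset.
  final class Memo[A](original: Vector[A], test: Vector[A] => Boolean) {
    private val cache = mutable.Map.empty[BitSet, Boolean]
    var testRuns = 0 // number of times the underlying test actually ran

    def apply(offsets: BitSet): Boolean =
      cache.getOrElseUpdate(offsets, {
        testRuns += 1
        // Materialise the subset in ascending offset order and run the real test.
        test(offsets.toVector.map(original))
      })
  }
}
```

The bitmap on the slide (0 1 0 1 0 0 0 0 1 1) would correspond to `BitSet(1, 3, 8, 9)`. Asking for the same offsets a second time returns the cached outcome without rerunning the job.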
29. Debugging Time
                            Running Time (sec)   Debugging Time (sec)
Subject Program     Fault   Original Job         DD         BigSift   Improvement
Movie Histogram     Code    56.2                 232.8      17.3      13.5X
Inverted Index      Code    107.7                584.2      13.4      43.6X
Rating Histogram    Code    40.3                 263.4      16.6      15.9X
Sequence Count      Code    356.0                13772.1    208.8     66.0X
Rating Frequency    Code    77.5                 437.9      14.9      29.5X
College Student     Data    53.1                 235.3      31.8      7.4X
Weather Analysis    Data    238.5                999.1      89.9      11.1X
Transit Analysis    Code    45.5                 375.8      20.2      18.6X
BigSift provides up to a 66X speedup in isolating the precise
fault-inducing input records, in comparison to the baseline DD.
30. Debugging Time
On average, BigSift takes 62% less time to debug a single faulty
output than the time taken for a single run on the entire data.
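The two headline numbers can be checked against the table with a few lines of arithmetic. The sketch assumes that the 62% figure is the mean, over the eight subject programs, of the fraction of the original job time that BigSift saves; the slides do not state how the average was taken.

```scala
object DebuggingTimeSummary {
  // (subject program, original job time, DD debugging time, BigSift debugging time), in seconds.
  val rows: Seq[(String, Double, Double, Double)] = Seq(
    ("Movie Histogram", 56.2, 232.8, 17.3),
    ("Inverted Index", 107.7, 584.2, 13.4),
    ("Rating Histogram", 40.3, 263.4, 16.6),
    ("Sequence Count", 356.0, 13772.1, 208.8),
    ("Rating Frequency", 77.5, 437.9, 14.9),
    ("College Student", 53.1, 235.3, 31.8),
    ("Weather Analysis", 238.5, 999.1, 89.9),
    ("Transit Analysis", 45.5, 375.8, 20.2)
  )

  // Speedup of BigSift over baseline delta debugging, per program.
  val speedups: Seq[Double] = rows.map { case (_, _, dd, bigSift) => dd / bigSift }
  val maxSpeedup: Double = speedups.max

  // Mean fraction of the original job time that BigSift saves.
  val meanReduction: Double =
    rows.map { case (_, original, _, bigSift) => 1.0 - bigSift / original }.sum / rows.size
}
```

Under that assumption, `maxSpeedup` comes out just under 66 (the Sequence Count row) and `meanReduction` is about 0.62, consistent with both captions.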
31. Debugging Time
[Figure: Sequence Count. x-axis: Fault Localization Time (s), 0 to 14000; y-axis: # of fault-inducing input records, log scale from 1 to 1E+09; series: Delta Debugging, BigSift, Test Driven Data Provenance, Data Provenance]
On average, BigSift takes 62% less time to debug a single faulty
output than the time taken for a single run on the entire data.
32. Fault Localizability over Data Provenance
[Figure: bar chart. y-axis: # of fault-inducing input records, log scale from 1 to 100000000; x-axis: Movie Histogram, Inverted Index, Rating Histogram, Sequence Count, Rating Frequency, College Students, Weather Analysis; series: Data Provenance, Test Driven Data Provenance, BigSift & DD]
Bar labels as extracted: 143796, 6487290, 520904, 23411, 5800, 15003060, 2554788; 350, 2, 1350, 15, 13; 1, 1, 1, 1, 1, 1, 2
BigSift leverages DD after DP to continue fault isolation, achieving
several orders of magnitude (10^3 to 10^7) better precision.
33. Summary
• BigSift is the first piece of work on automated debugging of big data
analytics in DISC.
• BigSift provides 10^3X – 10^7X more precision than data provenance
in terms of fault localizability.
• It provides up to a 66X speedup in debugging time over baseline
Delta Debugging.
• In our evaluation, we observed that, on average, BigSift finds
the faulty input in 62% less time than the original job execution.