ICSE2013
1. Assisting Developers of Big Data Analytics Applications When Deploying on Hadoop Clouds
Hadi Hemmati, Bram Adams, Weiyi Shang, Zhen Ming Jiang, Ahmed E. Hassan, Patrick Martin
2. What are Big Data Analytics Applications (BDA Apps)?
BDA Apps
3. Many fields today rely on BDA Apps to make decisions
Software engineering research, especially Mining Software Repositories.
And…
4. Under the hood of BDA Apps
Hardware Infrastructure
Software Platform
BDA Apps
5. Discrepancy between scale of development and deployment
Small sample data and pseudo cloud
Big data and real-life cloud
6. ACM Interactions 2012
“Analysts moved back and forth from local machines to cloud-based systems.”
7. Many things can go wrong when scaling
BDA App: Step 1, Step 2, …, Step n
Large-scale intermediate data generated by each step can fill up the disk space!
8. How to verify the deployment of BDA Apps?
Small sample data and pseudo cloud
Big data and real-life cloud
How to verify
10. Limitations of traditional approach
Many false positives!
Large results, too much effort to manually examine
11. Not all kills are bad: “speculative execution”
Slow task identified
Duplicate the task to other machines
The results of the first finished task are saved, the other tasks are killed!
13. Execution sequences provide context information of log lines
Kill task t on node A.
Assign task t on node A.
Assign task t on node B.
Task t finished on node B.
14. Log abstraction reduces the amount of data to examine
Kill task t1 on node A.
Kill task t2 on node B.
Kill task t3 on node C.
Kill task t4 on node A.
Kill task t5 on node D.
Kill task t6 on node B.
Kill task t7 on node A.
Kill task t8 on node C.
Large results, too much effort to manually examine
Kill task $t on node $n.
15. Overview of our approach
Small sample data and pseudo cloud
Big data and real-life cloud
Underlying platform (one per setting)
Execution sequences (one set per setting)
Execution sequence delta
Log abstraction
Log linking
Sequence simplification
16. Step 1: Log abstraction reduces the size of logs
Log abstraction, Log linking, Simplifying sequences
Example of log lines
Table 1: Example of log lines
#   Log line
1   time=1, Task=Trying to launch, TaskID=01A
2   time=2, Task=Trying to launch, TaskID=077
3   time=3, Task=JVM, TaskID=01A
4   time=4, Task=Reduce, TaskID=01A
5   time=5, Task=JVM, TaskID=077
6   time=6, Task=Reduce, TaskID=01A
7   time=7, Task=Reduce, TaskID=01A
8   time=8, Task=Progress, TaskID=077
9   time=9, Task=Done, TaskID=077
10  time=10, Task=Commit Pending, TaskID=01A
11  time=11, Task=Done, TaskID=01A
Execution events
Table 2: Execution events
Event  Event template                               Log line #
E1     time=$t, Task=Trying to launch, TaskID=$id   1,2
E2     time=$t, Task=JVM, TaskID=$id                3,5
E3     time=$t, Task=Reduce, TaskID=$id             4,6,7
E4     time=$t, Task=Progress, TaskID=$id           8
E5     time=$t, Task=Commit Pending, TaskID=$id     10
E6     time=$t, Task=Done, TaskID=$id               9,11
Jiang et al. JSME 2008
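The abstraction step can be illustrated with a minimal sketch. It is not the technique of Jiang et al., only a regular-expression stand-in that assumes the `time` and `TaskID` values are the sole dynamic parts of a line. The names `PLACEHOLDERS`, `abstract_line` and `abstract_log` are hypothetical.

```python
import re

# Log lines copied from Table 1.
LOG_LINES = [
    "time=1, Task=Trying to launch, TaskID=01A",
    "time=2, Task=Trying to launch, TaskID=077",
    "time=3, Task=JVM, TaskID=01A",
    "time=4, Task=Reduce, TaskID=01A",
    "time=5, Task=JVM, TaskID=077",
    "time=6, Task=Reduce, TaskID=01A",
    "time=7, Task=Reduce, TaskID=01A",
    "time=8, Task=Progress, TaskID=077",
    "time=9, Task=Done, TaskID=077",
    "time=10, Task=Commit Pending, TaskID=01A",
    "time=11, Task=Done, TaskID=01A",
]

# Assumption: only the values of these two fields vary between runs.
PLACEHOLDERS = {"time": "$t", "TaskID": "$id"}


def abstract_line(line):
    """Replace dynamic values with placeholders, yielding an event template."""
    for field, placeholder in PLACEHOLDERS.items():
        line = re.sub(rf"\b{field}=[^,]+", f"{field}={placeholder}", line)
    return line


def abstract_log(lines):
    """Map each event template to the 1-based numbers of the lines it covers."""
    templates = {}
    for number, line in enumerate(lines, start=1):
        templates.setdefault(abstract_line(line), []).append(number)
    return templates


templates = abstract_log(LOG_LINES)
for template, numbers in templates.items():
    print(template, numbers)
```

On the Table 1 input, the eleven log lines collapse into six templates, which is the reduction the slide describes. The sketch lists templates in order of first appearance and does not assign the E1 to E6 labels of Table 2.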
17. Step 2: Log linking provides context for logs
Log abstraction, Log linking, Simplifying sequences
Table 2: Execution events
Event  Event template                               Log line #
E1     time=$t, Task=Trying to launch, TaskID=$id   1,2
E2     time=$t, Task=JVM, TaskID=$id                3,5
E3     time=$t, Task=Reduce, TaskID=$id             4,6,7
E4     time=$t, Task=Progress, TaskID=$id           8
E5     time=$t, Task=Commit Pending, TaskID=$id     10
E6     time=$t, Task=Done, TaskID=$id               9,11
Table 3: Execution sequence
TaskID  Event sequence
01A     E1, E2, E3, E3, E3, E5, E6
077     E1, E2, E4, E6
Paper excerpt shown on the slide (left edge cropped):
[…] contain the same TaskID. Figure 2-c shows the result sequence after abstracting the […] and linking them into sequences using the TaskID values. In the event linking result in Figure 2-c, Events E1, E2, E3, E5 and E6 are linked together (note that event E3 has been executed twice) and Events E1, E2, E4, E6 are linked together since the same TaskID values are shared.
[…].2 Eliminating repetitions
There can be event repetitions in the existing sequences caused by loops. For example, for sequences about reading data from a remote node, there would be repeated events about keeping fetching the data. Similar log sequences that include different times of the same events are considered different sequences, although they indicate the same system behaviour in essence. These repeated events need to be suppressed to ease the analysis. We use regular expression techniques to detect and suppress the repetitions. For the example shown in Figure 2, the sequence “E1 E2 E3 E3 E5 E6”, our technique would detect the repetition of E3 and suppress this sequence into “E1 E2 E3 E5 E6”.
[…] the p-value, the higher probability that the new run has failure. Therefore, every new run will be tested with the previous failure-free run to calculate the p-value. A p-value […]
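Log linking can be sketched as grouping abstracted events by their TaskID in time order. The snippet below is an illustration under assumptions rather than the authors' implementation: it assumes every line follows the `time=…, Task=…, TaskID=…` layout of Table 1, and it takes the event labels directly from Table 2. The names `EVENT_IDS`, `LINE_PATTERN` and `link_events` are hypothetical.

```python
import re
from collections import defaultdict

# Log lines copied from Table 1.
LOG_LINES = [
    "time=1, Task=Trying to launch, TaskID=01A",
    "time=2, Task=Trying to launch, TaskID=077",
    "time=3, Task=JVM, TaskID=01A",
    "time=4, Task=Reduce, TaskID=01A",
    "time=5, Task=JVM, TaskID=077",
    "time=6, Task=Reduce, TaskID=01A",
    "time=7, Task=Reduce, TaskID=01A",
    "time=8, Task=Progress, TaskID=077",
    "time=9, Task=Done, TaskID=077",
    "time=10, Task=Commit Pending, TaskID=01A",
    "time=11, Task=Done, TaskID=01A",
]

# Event labels as listed in Table 2, keyed by the static part of the template.
EVENT_IDS = {
    "Trying to launch": "E1",
    "JVM": "E2",
    "Reduce": "E3",
    "Progress": "E4",
    "Commit Pending": "E5",
    "Done": "E6",
}

# Assumed line layout; real platform logs would need their own pattern.
LINE_PATTERN = re.compile(r"time=(\d+), Task=(.+), TaskID=(\w+)")


def link_events(lines):
    """Group event labels by TaskID, ordered by timestamp."""
    parsed = []
    for line in lines:
        match = LINE_PATTERN.fullmatch(line)
        if match is None:
            continue  # skip lines that do not follow the assumed layout
        time, task, task_id = match.groups()
        parsed.append((int(time), task_id, EVENT_IDS[task]))

    sequences = defaultdict(list)
    for _, task_id, event in sorted(parsed):
        sequences[task_id].append(event)
    return dict(sequences)


sequences = link_events(LOG_LINES)
for task_id, events in sequences.items():
    print(task_id, ", ".join(events))
```

Each resulting sequence shows what happened to one task from launch to completion, which is the context a single log line lacks.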
18. Step 3: Sequence simplification deals with repeated logs
Log abstraction, Log linking, Simplifying sequences
Table 3: Execution sequence
TaskID  Event sequence
01A     E1, E2, E3, E3, E3, E5, E6
077     E1, E2, E4, E6
After eliminating looping, the final log sequences are shown in Figure 2-d.
Table 4: Execution sequence after eliminating looping
TaskID  Event sequence
01A     E1, E2, E3, E5, E6
077     E1, E2, E4, E6
Paper excerpt shown on the slide (right edge cropped):
3.4 Failure detection
Intuitively, if any failure exists, the cloud computing platform would generate extra logs. The extra logs con[…] event sequences indicating the process of error message […] fault recovery. Therefore, different event sequences, which reflect different system behaviours, should be recovered between different runs of an application with and without failures. Several approaches that identify the different e[…]
Repeated logs:
task t1 read file A.
task t1 read file A.
task t1 read file A.
Remove repetition and order of events
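A minimal sketch of the simplification step follows. It uses a regular expression to collapse consecutive repeats of a single event, which covers the Table 3 to Table 4 example but not loops that span several events. The `ignore_order` helper is only one possible reading of "remove ... order of events" and is an assumption, as are both function names.

```python
import re


def collapse_repeats(sequence):
    """Collapse consecutive repetitions of the same event into one occurrence.

    Only single-event repeats (E3 E3 E3) are handled; multi-event loops
    such as E1 E2 E1 E2 are left unchanged in this sketch.
    """
    text = " ".join(sequence)
    collapsed = re.sub(r"\b(\S+)(?: \1\b)+", r"\1", text)
    return collapsed.split()


def ignore_order(sequence):
    """Assumed interpretation: treat a sequence as its sorted set of events."""
    return tuple(sorted(set(sequence)))


print(collapse_repeats(["E1", "E2", "E3", "E3", "E3", "E5", "E6"]))
print(collapse_repeats(["E1", "E2", "E4", "E6"]))
```

The word boundary after the back-reference keeps labels that share a prefix, such as E1 and E11, from being merged.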
19. Comparing small and large runs
Logs from testing run with small data
Event sequence: E1, E2, E3, E5, E6
Logs from run with large data
Event sequence: E1, E2, E3, E5, E6
Event sequence: E1, E2, E3, E7, E5, E6
Event sequence delta
E1, E2, E3, E7, E5, E6
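The comparison on this slide amounts to a set difference over simplified sequences. The sketch below assumes each sequence is represented as a tuple of event labels; `sequence_delta` is a hypothetical name.

```python
def sequence_delta(small_run, large_run):
    """Return the sequences that appear in the large run but not in the small run."""
    return sorted(set(large_run) - set(small_run))


# Sequences taken from the slide.
small_run = [("E1", "E2", "E3", "E5", "E6")]
large_run = [
    ("E1", "E2", "E3", "E5", "E6"),
    ("E1", "E2", "E3", "E7", "E5", "E6"),
]

delta = sequence_delta(small_run, large_run)
for sequence in delta:
    print(", ".join(sequence))
```

Only the sequence containing E7 remains, so a developer inspects one sequence rather than the full logs of the large run.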
20. Case study: subject systems
System     Source                  Domain
WordCount  official example        File processing
Page Rank  developed from scratch  Social network
JACK       migrated from Perl      Log analysis
21. Precision: How precise is our approach?
Effort Reduction: How much effort reduction does our approach provide?
22. Our approach reduces the logs for manual inspection by over 86%
[Bar chart: # log lines, # unique log events and # log sequences for WordCount, JACK and PageRank; y-axis from 0 to 2000; labels "Our approach" and "Keyword search"]
Reductions annotated on the chart: 86%, 91%, 95%
23. Precision: How precise is our approach?
Effort Reduction: How much effort reduction does our approach provide?
Reduce logs for manual inspection by over 86%
24. We manually inject 3 common failures
Machine failure
Missing supporting library
Lack of disk space
We measure the number of log lines and log sequences caused by injected failures.
WordCount, Page Rank, JACK
Cola et al. Euro-Par 2005
25. Our approach generates fewer false positives than the traditional approach
[Bar chart: false positive ratio between keyword search and our approach for WordCount, JACK and PageRank; y-axis from 0 to 40]
Ratios annotated on the chart: 1:29, 1:8, 1:36
26. Precision: How precise is our approach?
Fewer false positives and additional context information to assist in manual inspection
Effort Reduction: How much effort reduction does our approach provide?
Reduce logs for manual inspection by over 86%
28. Under the hood of BDA Apps
Physical Infrastructure
Underlying Platform
BDA Apps
29. Our approach can be used in migration of BDA Apps
One of the common migrations
Hadoop generates more job sequences and task sequences.
PIG automatically optimizes the application by grouping jobs and reducing tasks.
Manually browsing logs to find the differences can be time-consuming.
We use our approach to compare the execution sequences of PageRank on both platforms