The document discusses Mehmet Balman's work on network-aware data management for large-scale distributed applications. It provides background on Balman, including his employment at VMware and affiliations. The presentation outline discusses VSAN and VVOL storage performance in virtualized environments, data streaming in high-bandwidth networks, the Climate100 100Gbps networking demo, and other topics related to network-aware data management.
Network-Aware Data Management for Large-Scale Distributed Applications
1. Network-aware Data Management for Large-scale Distributed Applications
Sept 28, 2015
Mehmet Balman, http://balman.info
Senior Performance Engineer at VMware Inc.
Guest/Affiliate at Berkeley Lab
2. About me:
• 2013: Performance, OCTO, VMware, Palo Alto, CA
• 2009: Computational Research Division (CRD) at Lawrence Berkeley National Laboratory (LBNL)
• 2005: Center for Computation & Technology (CCT), Baton Rouge, LA
• Computer Science, Louisiana State University (2010, 2008)
• Bogazici University, Istanbul, Turkey (2006, 2000)
Data Transfer Scheduling with Advance Reservation and Provisioning, Ph.D.
Failure-Awareness and Dynamic Adaptation in Data Scheduling, M.S.
Parallel Tetrahedral Mesh Refinement, M.S.
3. Why Network-aware?
Networking is one of the major components in many of the solutions today:
• Distributed data and compute resources
• Collaboration: data to be shared between remote sites
• Data centers are complex network infrastructures
Questions:
• What further steps are necessary to take full advantage of future networking infrastructure?
• How are we going to deal with performance problems?
• How can we enhance data management services and make them network-aware?
New collaborations between the data management and networking communities.
4. Two major players:
• Abstraction and programmability
• Rapid development, intelligent services
• Orchestrating compute, storage, and network resources together
• Integration and deployment of complex systems
• Performance gap:
• Limitation in current system software vs. foreseen speed: hardware is fast, software is slow
• The latency vs. throughput mismatch will lead to new innovations
5. Outline
• VSAN + VVOL Storage Performance in Virtualized Environments
• Data Streaming in High-bandwidth Networks
• Climate100: Advanced Networking Initiative (ANI) and 100Gbps Demo
• MemzNet: Memory-Mapped Network Zero-copy Channels
• Core Affinity and End System Tuning in High-Throughput Flows
• Network Reservation and Online Scheduling (QoS)
• FlexRes: A Flexible Network Reservation Algorithm
• SchedSim: Online Scheduling with Advance Provisioning
7. VSAN performance work in a nutshell
(Observer image: blog.vmware.com)
• Every write operation needs to go over the network (and the network is not free)
• Each layer (cache, disk, object management, etc.) needs resources (CPU, memory)
• Resource limitations vs. latency effects
• Needs to support thousands of VMs
Placement of objects:
• Which host?
• Which disk/SSD in the host?
What if there are failures or migrations, and if we need to rebalance?
8. VVOL: virtual volumes
(VVOL image: blog.vmware.com)
Offloading control operations to the storage array:
• powerOn
• powerOff
• delete
• clone
9. VVOL performance work
• Effect of the latency in the control path
• Linked clone vs. VVOL clones
(Diagram: vSphere host and storage with the VASA provider (VP); the control path is separate from the data path.)
• Optimize service latencies
• Batching (disklib)
• Use concurrent operations
10. Internet Modeling
• My first real paper was on Internet topology
• Collecting data from traceroute gateways
• Analyzing: outdegree, indegree, diameter, reachable set
11. Outline
• VSAN + VVOL Storage Performance in Virtualized Environments
• Data Streaming in High-bandwidth Networks
• Climate100: Advanced Networking Initiative (ANI) and 100Gbps Demo
• MemzNet: Memory-Mapped Network Zero-copy Channels
• Core Affinity and End System Tuning in High-Throughput Flows
• Network Reservation and Online Scheduling (QoS)
• FlexRes: A Flexible Network Reservation Algorithm
• SchedSim: Online Scheduling with Advance Provisioning
12. 100Gbps networking has finally arrived! (Applications' perspective)
Increasing the bandwidth is not sufficient by itself; we need careful evaluation of high-bandwidth networks from the applications' perspective.
In the 1Gbps to 10Gbps transition (10 years ago), applications did not run 10 times faster just because there was more bandwidth available.
13. ANI 100Gbps Demo
• 100Gbps demo by ESnet and Internet2
• Application design issues and host tuning strategies to scale to 100Gbps rates
• Visualization of remotely located data (cosmology)
• Data movement of large datasets with many files (climate analysis)
14. Earth System Grid Federation (ESGF)
• Over 2,700 sites
• 25,000 users
• IPCC Fifth Assessment Report (AR5): 2PB
• IPCC Fourth Assessment Report (AR4): 35TB
• Remote data analysis
• Bulk data movement
16. The lots-of-small-files problem! File-centric tools?
(Diagram: FTP and RPC both operate per file: request a file, send the file, repeat; or request data, send data.)
• Keep the network pipe full
• We want out-of-order and asynchronous send/receive
17. Many Concurrent Streams
(a) Total throughput vs. the number of concurrent memory-to-memory transfers; (b) interface traffic, packets per second (blue) and bytes per second, over a single NIC with different numbers of concurrent transfers. Three hosts, each with 4 available NICs, for a total of 10 10Gbps NIC pairs, were used to saturate the 100Gbps pipe in the ANI Testbed. 10 data movement jobs, each corresponding to a NIC pair, started simultaneously at source and destination. Each peak represents a different test; 1, 2, 4, 8, 16, 32, and 64 concurrent streams per job were initiated for 5-minute intervals (e.g., at concurrency level 4 there are 40 streams in total).
18. Effects of many concurrent streams
ANI Testbed, 100Gbps (10x10Gbps NICs, three hosts): interrupts/CPU vs. the number of concurrent transfers [1, 2, 4, 8, 16, 32, 64 concurrent jobs, 5-minute intervals]; TCP buffer size is 50MB.
19. Analysis of Core Affinities (NUMA Effect)
Nathan Hanford et al., NDM'13
(Diagram: Sandy Bridge architecture; receive-process placement.)
20. Analysis of Core Affinities (NUMA Effect)
Nathan Hanford et al., NDM'14
24. Advantages
• Decoupling I/O and network operations: a front-end (I/O processing) and a back-end (networking layer)
• Not limited by the characteristics of the file sizes: an on-the-fly tar approach, bundling and sending many files together
• Dynamic data channel management: we can increase/decrease the parallelism level both in the network communication and in the I/O read/write operations, without closing and reopening the data channel connection (as is done in regular FTP variants)
MemzNet is not file-centric. Bookkeeping information is embedded inside each block.
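Since each block carries its own bookkeeping, blocks can be sent and received out of order. A minimal sketch of the idea in Python; the header fields and packing are illustrative assumptions, not the actual MemzNet wire format:

```python
import struct

# Hypothetical MemzNet-style block header: bookkeeping travels with the data,
# so blocks can be placed at the receiver regardless of arrival order.
HEADER = struct.Struct("!IQI")  # file_id, offset_in_file, payload_length

def pack_block(file_id: int, offset: int, payload: bytes) -> bytes:
    """Prefix the payload with the bookkeeping needed to place it at the receiver."""
    return HEADER.pack(file_id, offset, len(payload)) + payload

def unpack_block(block: bytes):
    """Recover (file_id, offset, payload); the receiver writes payload at that offset."""
    file_id, offset, length = HEADER.unpack_from(block)
    payload = block[HEADER.size:HEADER.size + length]
    return file_id, offset, payload
```

Because placement information is self-contained, the front-end and back-end threads never need to agree on file boundaries or ordering.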
26. 100Gbps Demo
• CMIP3 data (35TB) from the GPFS filesystem at NERSC
• Block size 4MB; each block's data section was aligned according to the system pagesize
• 1GB cache both at the client and the server
• At NERSC, 8 front-end threads on each host for reading data files in parallel
• At ANL/ORNL, 4 front-end threads for processing received data blocks
• 4 parallel TCP streams (four back-end threads) were used for each host-to-host connection
27. MemzNet's Performance
TCP buffer size is set to 50MB.
(Charts: MemzNet vs. GridFTP throughput in the 100Gbps demo and on the ANI Testbed.)
28. Challenge?
• High bandwidth brings new challenges!
• We need a substantial amount of processing power and the involvement of multiple cores to fill a 40Gbps or 100Gbps network
• Fine-tuning, both in the network and application layers, to take advantage of the higher network capacity
• Incremental improvement in current tools?
• We cannot expect every application to tune and improve every time we change the link technology or speed.
29. MemzNet
• MemzNet: Memory-mapped Network Channel
• High-performance data movement
MemzNet is an initial effort to put a new layer between the application and the transport layer. The main goal is to define a network channel so applications can directly use it without the burden of managing/tuning the network communication.
Tech report: LBNL-6177E
30. MemzNet = New Execution Model
• Luigi Rizzo's netmap: proposes a new API to send/receive data over the network
• RDMA programming model: MemzNet as a memory-management component
• IX: Data Plane OS (Adam Belay et al. @ Stanford; similar to MemzNet's model)
• mTCP (event based; replaces send/receive in user level)
• Tanenbaum et al., minimizing context switches: proposing to use MONITOR/MWAIT for synchronization
31. Outline
• VSAN + VVOL Storage Performance in Virtualized Environments
• Data Streaming in High-bandwidth Networks
• Climate100: Advanced Networking Initiative (ANI) and 100Gbps Demo
• MemzNet: Memory-Mapped Network Zero-copy Channels
• Core Affinity and End System Tuning in High-Throughput Flows
• Network Reservation and Online Scheduling (QoS)
• FlexRes: A Flexible Network Reservation Algorithm
• SchedSim: Online Scheduling with Advance Provisioning
32. Problem Domain: ESnet's OSCARS
(ESnet map: backbone hubs including Seattle, Sunnyvale, Sacramento, Boise, Denver, Albuquerque, El Paso, Houston, Kansas City, Chicago, Nashville, Atlanta, Washington DC, New York, and Boston; DOE sites PNNL, SLAC, AMES, PPPL, BNL, ORNL, JLAB, FNAL, ANL, and LBNL; peerings with US R&E networks, CANARIE, GÉANT/NORDUnet, Asia-Pacific and Latin American networks, GLORIAD, CERN/USLHCNet, and LHCONE.)
• Connecting experimental facilities and supercomputing centers
• On-demand secure circuits and advance reservation system
• Guaranteed bandwidth between collaborating institutions by delivering network-as-a-service
• Co-allocation of storage and network resources (SRM: Storage Resource Manager)
OSCARS provides yes/no answers to a reservation request for (bandwidth, start_time, end_time).
End-to-end reservation: storage + network
33. Reservation Request
• Between edge routers
We need to ensure availability of the requested bandwidth from source to destination for the requested time interval:
R = {n_source, n_destination, M_bandwidth, t_start, t_end}
• source/destination end-points
• requested bandwidth
• start/end times
Committed reservations between t_start and t_end are examined. The shortest path from source to destination is calculated based on the engineering metric on each link, and a bandwidth-guaranteed path is set up to commit and eventually complete the reservation request for the given time period.
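The yes/no admission check can be sketched as follows, assuming a fixed candidate path and a list of committed reservations; this is a simplified illustration of the idea, not the OSCARS implementation:

```python
def available_on_path(path_links, capacity, committed, bandwidth, t_start, t_end):
    """Return True if `bandwidth` fits on every link of the path for the whole
    interval [t_start, t_end), given committed reservations. Each committed
    reservation is a dict with its links, bandwidth, and start/end times
    (field names are illustrative)."""
    for link in path_links:
        # Committed load on this link from reservations overlapping the interval.
        load = sum(r["bw"] for r in committed
                   if link in r["links"] and r["start"] < t_end and r["end"] > t_start)
        if load + bandwidth > capacity[link]:
            return False
    return True
```

With the slide's topology, a 600Mbps request over the fully reserved A-B-D path is rejected, while 500Mbps over A-C-D fits.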
34. Reservation
• Components (graph):
  • node (router), port, link (connecting two ports)
  • engineering metric (~latency)
  • maximum bandwidth (capacity)
• Reservation: source, destination, path, time
  • (time t1, t3) A -> B -> D (900Mbps)
  • (time t2, t3) A -> C -> D (400Mbps)
  • (time t4, t5) A -> B -> D (800Mbps)
(Topology: A-B 900Mbps, B-D 1000Mbps, A-C 500Mbps, C-D 800Mbps, B-C 300Mbps; timeline t1..t5 with Reservations 1-3.)
35. Example
(time t1, t2): A to D (600Mbps)? NO. A to D (500Mbps)? YES.
Link state during (t1, t2), shown as available / reserved (capacity):
• A-B: 0 Mbps / 900 Mbps (900Mbps)
• B-D: 100 Mbps / 900 Mbps (1000Mbps)
• C-D: 800 Mbps / 0 Mbps (800Mbps)
• A-C: 500 Mbps / 0 Mbps (500Mbps)
• B-C: 300 Mbps / 0 Mbps (300Mbps)
Active reservations:
• reservation 1: (time t1, t3) A -> B -> D (900Mbps)
• reservation 2: (time t2, t3) A -> C -> D (400Mbps)
• reservation 3: (time t4, t5) A -> B -> D (800Mbps)
36. Example
Link state for the window (t1, t3), shown as available / reserved (capacity):
• A-B: 0 Mbps / 900 Mbps (900Mbps)
• B-D: 100 Mbps / 900 Mbps (1000Mbps)
• C-D: 400 Mbps / 400 Mbps (800Mbps)
• A-C: 100 Mbps / 400 Mbps (500Mbps)
• B-C: 300 Mbps / 0 Mbps (300Mbps)
(time t1, t3): A to D (500Mbps)? NO. A to C (500Mbps)? NO (not max-flow!)
Active reservations:
• reservation 1: (time t1, t3) A -> B -> D (900Mbps)
• reservation 2: (time t2, t3) A -> C -> D (400Mbps)
• reservation 3: (time t4, t5) A -> B -> D (800Mbps)
37. Alternative Approach: Flexible Reservations
• If the requested bandwidth cannot be guaranteed:
  • trial and error until an available reservation is found
  • the client is not given other possible options
• How can we enhance the OSCARS reservation system?
• Be flexible: submit constraints, and the system suggests possible reservation options satisfying the given requirements
R' = {n_source, n_destination, M_MAXbandwidth, D_dataSize, t_EarliestStart, t_LatestEnd}
The reservation engine finds the reservation R = {n_source, n_destination, M_bandwidth, t_start, t_end} for the earliest completion or for the shortest duration, where M_bandwidth ≤ M_MAXbandwidth and t_EarliestStart ≤ t_start < t_end ≤ t_LatestEnd.
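The earliest-completion variant of the flexible search can be sketched as a scan over candidate time windows, assuming a helper that reports the maximum allocatable path bandwidth over a window (all names here are illustrative, not the FlexRes implementation):

```python
def earliest_completion(windows, max_bw_fn, data_size, max_bw_cap):
    """Scan candidate time windows and return the (start, bandwidth, finish)
    option that completes the transfer of `data_size` earliest.
    max_bw_fn(start, end) is assumed to give the max allocatable path
    bandwidth over that window; max_bw_cap is the client's M_MAXbandwidth."""
    best = None
    for start, end in windows:
        bw = min(max_bw_fn(start, end), max_bw_cap)
        if bw <= 0:
            continue
        finish = start + data_size / bw  # fixed-rate transfer over the window
        if finish <= end and (best is None or finish < best[2]):
            best = (start, bw, finish)
    return best
```

The shortest-duration variant is the same scan with `data_size / bw` as the ranking key instead of the finish time.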
38. Bandwidth Allocation (time-dependent)
Modified Dijkstra's algorithm (max available bandwidth):
• Bottleneck constraint (not additive)
• (typical QoS constraints are additive, as in shortest-path cost, etc.)
Finds the maximum bandwidth available for allocation from a source node to a destination node.
(Timeline: t1..t6)
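For one static graph, the modified Dijkstra search can be sketched as a widest-path computation: instead of minimizing an additive cost, it maximizes the minimum (bottleneck) available bandwidth along the path. A sketch, not the FlexRes code:

```python
import heapq

def max_bottleneck_bandwidth(graph, src, dst):
    """Widest-path (modified Dijkstra) search. graph[u] is a list of
    (neighbor, available_bw) pairs; returns the maximum bandwidth allocatable
    from src to dst, i.e. the best achievable bottleneck over all paths."""
    best = {src: float("inf")}
    heap = [(-best[src], src)]            # max-heap via negated bandwidth
    while heap:
        bw, u = heapq.heappop(heap)
        bw = -bw
        if u == dst:
            return bw
        if bw < best.get(u, 0):           # stale heap entry
            continue
        for v, cap in graph.get(u, ()):
            cand = min(bw, cap)           # bottleneck, not additive cost
            if cand > best.get(v, 0):
                best[v] = cand
                heapq.heappush(heap, (-cand, v))
    return 0.0
```

On the slide's topology with no reservations active, the widest A-to-D path is A-B-D at 900Mbps.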
39. Analogous Example
• A vehicle travelling from city A to city B
• There are multiple cities between A and B connected with separate highways
• Each highway has a specific speed limit (maximum bandwidth)
• But we need to reduce our speed if there is a high traffic load on the road
• We know the load on each highway for every time period (active reservations)
• The first question is which path the vehicle should follow in order to reach city B from city A as early as possible (earliest completion)
• Or, we can delay our journey and start later if the total travel time would be reduced. The second question is to find the route, along with the starting time, for the shortest travel duration (shortest duration)
Advance bandwidth reservation: we have to set the speed limit before starting and cannot change it during the journey.
40. Time steps
• Time steps between t1 and t13
(Timeline: Reservations 1-3 over t1..t13. The boundaries t1, t4, t6, t7, t9, t12, t13 divide the interval into time steps ts1, ts2, ts3, ts4, ...; the active sets are Res 1, then Res 1,2, then Res 2, then Res 3.)
Max (2r+1) time steps, where r is the number of reservations.
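The time steps can be derived directly from reservation boundaries: each reservation contributes at most two boundary points (its start and end), so at most 2r+1 steps result. A sketch of this bookkeeping:

```python
def time_steps(reservations, search_start, search_end):
    """Divide the search interval into time steps at reservation boundaries.
    reservations is a list of (start, end) pairs; returns consecutive
    (step_start, step_end) intervals covering [search_start, search_end]."""
    points = {search_start, search_end}
    for start, end in reservations:
        if search_start < start < search_end:
            points.add(start)
        if search_start < end < search_end:
            points.add(end)
    ordered = sorted(points)
    return list(zip(ordered, ordered[1:]))  # consecutive boundary pairs
```

Using the reservations from the later example (t1-t6, t4-t7, t9-t12 in a t1..t13 search interval) yields the steps bounded by t1, t4, t6, t7, t9, t12, t13.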
41. Static Graphs
One static graph per time step, labeled with the available bandwidth on each link:
• G(ts1), t1-t4 (Res 1 active): A-B 0 Mbps, B-D 100 Mbps, C-D 800 Mbps, A-C 500 Mbps, B-C 300 Mbps
• G(ts2), t4-t6 (Res 1, 2 active): A-B 0 Mbps, B-D 100 Mbps, C-D 400 Mbps, A-C 100 Mbps, B-C 300 Mbps
• G(ts3), t6-t7 (Res 2 active): A-B 900 Mbps, B-D 1000 Mbps, C-D 400 Mbps, A-C 100 Mbps, B-C 300 Mbps
• G(ts4), t7-t9 (no reservations active): A-B 900 Mbps, B-D 1000 Mbps, C-D 800 Mbps, A-C 500 Mbps, B-C 300 Mbps
42. Time Windows
Max (s × (s + 1))/2 time windows, where s is the number of time steps.
A time-window graph is the per-link minimum of the static graphs it spans (the bottleneck constraint):
• tw = ts1+ts2: G(tw) = G(ts1) x G(ts2): A-B 0 Mbps, B-D 100 Mbps, C-D 400 Mbps, A-C 100 Mbps, B-C 300 Mbps
• tw = ts3+ts4: G(tw) = G(ts3) x G(ts4): A-B 900 Mbps, B-D 1000 Mbps, C-D 400 Mbps, A-C 100 Mbps, B-C 300 Mbps
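Combining per-step static graphs into a time-window graph is a per-link minimum, since a fixed-rate reservation can only use the bandwidth that is free in every step the window spans. A sketch, with graphs represented as link-to-available-bandwidth dicts:

```python
def combine_windows(step_graphs):
    """Combine per-time-step static graphs into one time-window graph.
    Each graph maps a link name to its available bandwidth; the window's
    availability on a link is the minimum over all steps (bottleneck
    constraint), with 0 for a link missing from any step."""
    links = set().union(*(g.keys() for g in step_graphs))
    return {link: min(g.get(link, 0) for g in step_graphs) for link in links}
```

For example, combining G(ts1) and G(ts2) from the slide reproduces the tw = ts1+ts2 graph.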
43. Time Window List (special data structures)
Time windows list: [now, infinite]
• New reservation: reservation 1, start t1, end t10 -> [now, t1] [t1, t10: Res 1] [t10, infinite]
• New reservation: reservation 2, start t12, end t20 -> [now, t1] [t1, t10: Res 1] [t10, t12] [t12, t20: Res 2] [t20, infinite]
Careful software design makes the implementation fast and efficient.
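The interval-splitting behind the time-window list can be sketched as follows; the representation (windows as (start, end, active-reservation set) tuples, with `inf` for the open end) is an illustrative assumption:

```python
def insert_reservation(windows, start, end, label):
    """Split the time-window list so the new reservation occupies its own
    window(s). Each window is (start, end, frozenset_of_reservations)."""
    out = []
    for w_start, w_end, labels in windows:
        if w_end <= start or w_start >= end:       # no overlap: keep as-is
            out.append((w_start, w_end, labels))
            continue
        if w_start < start:                         # piece before the reservation
            out.append((w_start, start, labels))
        out.append((max(w_start, start), min(w_end, end), labels | {label}))
        if w_end > end:                             # piece after the reservation
            out.append((end, w_end, labels))
    return out
```

Inserting reservation 1 (t1-t10) and then reservation 2 (t12-t20) into the initial [now, infinite] list produces exactly the five windows shown on the slide.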
44. Performance
• max-bandwidth path ~ O(n^2), where n is the number of nodes in the topology graph
• In the worst case, we may need to search all time windows: (s × (s + 1))/2, where s is the number of time steps
• If there are r committed reservations in the search period, there can be a maximum of 2r + 1 different time steps in the worst case
• Overall, the worst-case complexity is bounded by O(r^2 n^2)
Note: r is relatively very small compared to the number of nodes n.
45. Example
• Reservation 1: (time t1, t6) A -> B -> D (900Mbps)
• Reservation 2: (time t4, t7) A -> C -> D (400Mbps)
• Reservation 3: (time t9, t12) A -> B -> D (700Mbps)
(Topology: A-B 900Mbps, B-D 1000Mbps, A-C 500Mbps, C-D 800Mbps, B-C 300Mbps; timeline t1..t13 with Reservations 1-3.)
Request from A to D (earliest completion): max bandwidth = 200Mbps, volume = 200Mbps x 4 time slots, earliest start = t1, latest finish = t13.
46. Search Order - Time Windows
Time steps from the reservations: t1, t4, t6, t7, t9, t12, t13 (Res 1; Res 1,2; Res 2; Res 3 active in turn).
Candidate time windows: t1-t6, t4-t6, t1-t4, t6-t7, t4-t7, t1-t7, t7-t9, t6-t9, t4-t9, t1-t9.
Max bandwidth from A to D per window, in search order:
1. 900Mbps (3)
2. 100Mbps (2)
3. 100Mbps (5)
4. 900Mbps (1)
5. 100Mbps (3)
6. 100Mbps (6)
7. 900Mbps (2)
8. 900Mbps (3)
9. 100Mbps (5)
10. 100Mbps (8)
Reservation: (A to D) (100Mbps) start=t1 end=t9
47. Search Order - Time Windows
Shortest duration?
Candidate time windows: t9-t13, t12-t13, t9-t12 (Res 3 active during t9-t12).
Max bandwidth from A to D per window, in search order:
1. 200Mbps (3)
2. 900Mbps (1)
3. 200Mbps (4)
Reservation: (A to D) (200Mbps) start=t9 end=t13
• From A to D: max bandwidth = 200Mbps, volume = 175Mbps x 4 time slots, earliest start = t1, latest finish = t13
• earliest completion: (A to D) (100Mbps) start=t1 end=t8
• shortest duration: (A to D) (200Mbps) start=t9 end=t12.5
48. Source > Network > Destination
(Topology diagram: end hosts n1 and n2 attached to the A/B/C/D network with its 800/900/500/1000/300 Mbps links.)
Now we have multiple requests.
49. With start/end times
• Each transfer request has start and end times
• n transfer requests are given (each request has a specific amount of profit)
• The objective is to maximize the profit
• If the profit is the same for each request, then the objective is to maximize the number of jobs in a given time period
• Unsplittable Flow Problem: given an undirected graph, route demand from source(s) to destination(s) and maximize/minimize the total profit/cost
The online scheduling method here is inspired by the Gale-Shapley algorithm (also known as the stable marriage problem).
50. Methodology
• Displace other jobs to open space for the new request (we can shift at most n jobs)
• Never accept a job if it causes other committed jobs to break their criteria
• Planning ahead (gives an opportunity for co-allocation)
• Gives a polynomial approximation algorithm
• The preference converts the UFP problem into a Dijkstra path search
• Utilizes time windows/time steps for ranking (better than earliest deadline first)
• Earliest completion + shortest duration
• Minimize concurrency
• Even random ranking would work (relaxation in an NP-hard problem)
52. Recall: Time Windows
(Repeats the search-order example from slide 46: candidate time windows over t1..t13, the max bandwidth from A to D in each, and the resulting reservation (A to D) (100Mbps) start=t1 end=t9.)
53. Test
In real life, the number of nodes and the number of reservations in a given search interval are limited.
See the AINA'13 paper for results and a comparison with different preference metrics.
54. Autonomic Provisioning System
• Generate constraints automatically (without user input):
  • Volume (elephant flow?)
  • True deadline, if applicable
  • End-host resource availability
  • Burst rate (fixed bandwidth, variable bandwidth)
• Update constraints according to feedback and monitoring
• Minimize operational cost
• An alternative to manual traffic engineering
What is the incentive to make correct reservations?
55. (Scenario diagram: Experimental facility A, Data Center 1, Data Center 2, and Data node B (web access), connected over a wide-area SDN.)
• (1) Experimental facility A generates 30T of data every day, and it needs to be stored in data center 2 before the next run, since local disk space is limited
• (2) There is a reservation made between data centers 1 and 2. It is used to replicate data files, 1P total size, when new data is available in data center 2
• (3) New results are published at data node B; we expect high traffic to download new simulation files for the next couple of months
56. Example
• The experimental facility periodically transfers data (i.e., every night)
• Data replication happens occasionally, and it will take a week to move 1P of data. It could get delayed a couple of hours with no harm
• Wide-area download traffic will increase gradually; most of the traffic will be during the day
• We can dynamically increase the preference for download traffic in the mornings, give high priority to transferring data from the facility at night, and use the rest of the bandwidth for data replication (while allocating enough bandwidth to confirm that it would finish within a week as usual)
57. Virtual Circuit Reservation Engine
Autonomic provisioning system + monitoring.
Reservation Engine:
– Select optimal path/time/bandwidth
– Maximize the number of admitted requests
– Increase overall system utilization and network efficiency
– Dynamically update the selected routing path for network efficiency
– Modify existing reservations dynamically to open space/time for new requests
58. THANK YOU
Any questions/comments?
Mehmet Balman
mehmet@balman.info
http://balman.info
59. PetaShare + Stork Data Scheduler
Aggregation in the data path: an advance buffer cache in the Petafs and Petashell clients aggregates I/O requests to minimize the number of network messages.
60. Adaptive Tuning + Advanced Buffer
• Adaptive tuning for bulk transfer
• Buffer cache for remote I/O