The document discusses Mehmet Balman's work on network-aware data management for large-scale distributed applications. It provides background on Balman, including his employment at VMware and affiliations. The presentation outline discusses VSAN and VVOL storage performance in virtualized environments, data streaming in high-bandwidth networks, the Climate100 100Gbps networking demo, and other topics related to network-aware data management.
Network-Aware Data Management for Large-Scale Distributed Applications
1. Network-aware Data Management for Large-scale Distributed Applications
Sept 28, 2015
Mehmet Balman, http://balman.info
Senior Performance Engineer at VMware Inc.
Guest/Affiliate at Berkeley Lab
2. About me:
• 2013: Performance, OCTO, VMware, Palo Alto, CA
• 2009: Computational Research Division (CRD) at Lawrence Berkeley National Laboratory (LBNL)
• 2005: Center for Computation & Technology (CCT), Baton Rouge, LA
• Computer Science, Louisiana State University (2010, 2008)
• Bogazici University, Istanbul, Turkey (2006, 2000)
Data Transfer Scheduling with Advance Reservation and Provisioning, Ph.D.
Failure-Awareness and Dynamic Adaptation in Data Scheduling, M.S.
Parallel Tetrahedral Mesh Refinement, M.S.
3. Why Network-aware?
Networking is one of the major components in many of the solutions today:
• Distributed data and compute resources
• Collaboration: data to be shared between remote sites
• Data centers are complex network infrastructures
Questions:
• What further steps are necessary to take full advantage of future networking infrastructure?
• How are we going to deal with performance problems?
• How can we enhance data management services and make them network-aware?
New collaborations between the data management and networking communities.
4. Two major players:
• Abstraction and programmability
• Rapid development, intelligent services
• Orchestrating compute, storage, and network resources together
• Integration and deployment of complex systems
• Performance gap:
• Limitation in current system software vs. foreseen speed: hardware is fast, software is slow
• The latency vs. throughput mismatch will lead to new innovations
5. Outline
• VSAN + VVOL Storage Performance in Virtualized Environments
• Data Streaming in High-bandwidth Networks
• Climate100: Advanced Networking Initiative (ANI) and 100Gbps Demo
• MemzNet: Memory-Mapped Network Zero-copy Channels
• Core Affinity and End System Tuning in High-Throughput Flows
• Network Reservation and Online Scheduling (QoS)
• FlexRes: A Flexible Network Reservation Algorithm
• SchedSim: Online Scheduling with Advance Provisioning
7. VSAN performance work in a nutshell
(Observer image: blog.vmware.com)
• Every write operation needs to go over the network (and the network is not free)
• Each layer (cache, disk, object management, etc.) needs resources (CPU, memory)
• Resource limitations vs. latency effects
• Needs to support thousands of VMs
Placement of objects:
• Which host?
• Which disk/SSD in the host?
What if there are failures or migrations, and if we need to rebalance?
8. VVOL: virtual volumes
(VVOL image: blog.vmware.com)
Offloading control operations to the storage array:
• powerOn
• powerOff
• delete
• clone
9. VVOL performance work
• Effect of the latency in the control path
• Linked clone vs. VVOL clones
(Diagram: vSphere host and storage with the VASA provider (VP); the control path is separate from the data path.)
• Optimize service latencies
• Batching (disklib)
• Use concurrent operations
10. Internet Modeling
• My first real paper was on Internet topology
• Collecting data from traceroute gateways
• Analyzing: outdegree, indegree, diameter, reachable set
11. Outline
• VSAN + VVOL Storage Performance in Virtualized Environments
• Data Streaming in High-bandwidth Networks
• Climate100: Advanced Networking Initiative (ANI) and 100Gbps Demo
• MemzNet: Memory-Mapped Network Zero-copy Channels
• Core Affinity and End System Tuning in High-Throughput Flows
• Network Reservation and Online Scheduling (QoS)
• FlexRes: A Flexible Network Reservation Algorithm
• SchedSim: Online Scheduling with Advance Provisioning
12. 100Gbps networking has finally arrived! (Applications' perspective)
Increasing the bandwidth is not sufficient by itself; we need careful evaluation of high-bandwidth networks from the applications' perspective.
In the 1Gbps to 10Gbps transition (10 years ago), applications did not run 10 times faster just because there was more bandwidth available.
13. ANI 100Gbps Demo
• 100Gbps demo by ESnet and Internet2
• Application design issues and host tuning strategies to scale to 100Gbps rates
• Visualization of remotely located data (cosmology)
• Data movement of large datasets with many files (climate analysis)
14. Earth System Grid Federation (ESGF)
• Over 2,700 sites
• 25,000 users
• IPCC Fifth Assessment Report (AR5): 2PB
• IPCC Fourth Assessment Report (AR4): 35TB
• Remote data analysis
• Bulk data movement
16. The lots-of-small-files problem! File-centric tools?
(Diagram: FTP and RPC both operate per file: request a file, send the file, repeat; or request data, send data.)
• Keep the network pipe full
• We want out-of-order and asynchronous send/receive
17. Many Concurrent Streams
(a) Total throughput vs. the number of concurrent memory-to-memory transfers; (b) interface traffic, packets per second (blue) and bytes per second, over a single NIC with different numbers of concurrent transfers. Three hosts, each with 4 available NICs, for a total of 10 10Gbps NIC pairs, were used to saturate the 100Gbps pipe in the ANI Testbed. 10 data movement jobs, each corresponding to a NIC pair, started simultaneously at source and destination. Each peak represents a different test; 1, 2, 4, 8, 16, 32, and 64 concurrent streams per job were initiated for 5-minute intervals (e.g., at concurrency level 4 there are 40 streams in total).
18. Effects of many concurrent streams
ANI Testbed, 100Gbps (10x10Gbps NICs, three hosts): interrupts/CPU vs. the number of concurrent transfers [1, 2, 4, 8, 16, 32, 64 concurrent jobs, 5-minute intervals]; TCP buffer size is 50MB.
19. Analysis of Core Affinities (NUMA Effect)
Nathan Hanford et al., NDM'13
(Diagram: Sandy Bridge architecture; receive-process placement.)
20. Analysis of Core Affinities (NUMA Effect)
Nathan Hanford et al., NDM'14
24. Advantages
• Decoupling I/O and network operations: a front-end (I/O processing) and a back-end (networking layer)
• Not limited by the characteristics of the file sizes: an on-the-fly tar approach, bundling and sending many files together
• Dynamic data channel management: we can increase/decrease the parallelism level both in the network communication and in the I/O read/write operations, without closing and reopening the data channel connection (as is done in regular FTP variants)
MemzNet is not file-centric. Bookkeeping information is embedded inside each block.
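Since each block carries its own bookkeeping, blocks can be sent and received out of order. A minimal sketch of the idea in Python; the header fields and packing are illustrative assumptions, not the actual MemzNet wire format:

```python
import struct

# Hypothetical MemzNet-style block header: bookkeeping travels with the data,
# so blocks can be placed at the receiver regardless of arrival order.
HEADER = struct.Struct("!IQI")  # file_id, offset_in_file, payload_length

def pack_block(file_id: int, offset: int, payload: bytes) -> bytes:
    """Prefix the payload with the bookkeeping needed to place it at the receiver."""
    return HEADER.pack(file_id, offset, len(payload)) + payload

def unpack_block(block: bytes):
    """Recover (file_id, offset, payload); the receiver writes payload at that offset."""
    file_id, offset, length = HEADER.unpack_from(block)
    payload = block[HEADER.size:HEADER.size + length]
    return file_id, offset, payload
```

Because placement information is self-contained, the front-end and back-end threads never need to agree on file boundaries or ordering.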
26. 100Gbps Demo
• CMIP3 data (35TB) from the GPFS filesystem at NERSC
• Block size 4MB; each block's data section was aligned according to the system pagesize
• 1GB cache both at the client and the server
• At NERSC, 8 front-end threads on each host for reading data files in parallel
• At ANL/ORNL, 4 front-end threads for processing received data blocks
• 4 parallel TCP streams (four back-end threads) were used for each host-to-host connection
27. MemzNet's Performance
TCP buffer size is set to 50MB.
(Charts: MemzNet vs. GridFTP throughput in the 100Gbps demo and on the ANI Testbed.)
28. Challenge?
• High bandwidth brings new challenges!
• We need a substantial amount of processing power and the involvement of multiple cores to fill a 40Gbps or 100Gbps network
• Fine-tuning, both in the network and application layers, to take advantage of the higher network capacity
• Incremental improvement in current tools?
• We cannot expect every application to tune and improve every time we change the link technology or speed.
29. MemzNet
• MemzNet: Memory-mapped Network Channel
• High-performance data movement
MemzNet is an initial effort to put a new layer between the application and the transport layer. The main goal is to define a network channel so applications can directly use it without the burden of managing/tuning the network communication.
Tech report: LBNL-6177E
30. MemzNet = New Execution Model
• Luigi Rizzo's netmap: proposes a new API to send/receive data over the network
• RDMA programming model: MemzNet as a memory-management component
• IX: Data Plane OS (Adam Belay et al. @ Stanford; similar to MemzNet's model)
• mTCP (event based; replaces send/receive in user level)
• Tanenbaum et al., minimizing context switches: proposing to use MONITOR/MWAIT for synchronization
31. Outline
• VSAN + VVOL Storage Performance in Virtualized Environments
• Data Streaming in High-bandwidth Networks
• Climate100: Advanced Networking Initiative (ANI) and 100Gbps Demo
• MemzNet: Memory-Mapped Network Zero-copy Channels
• Core Affinity and End System Tuning in High-Throughput Flows
• Network Reservation and Online Scheduling (QoS)
• FlexRes: A Flexible Network Reservation Algorithm
• SchedSim: Online Scheduling with Advance Provisioning
32. Problem Domain: ESnet's OSCARS
(ESnet map: backbone hubs including Seattle, Sunnyvale, Sacramento, Boise, Denver, Albuquerque, El Paso, Houston, Kansas City, Chicago, Nashville, Atlanta, Washington DC, New York, and Boston; DOE sites PNNL, SLAC, AMES, PPPL, BNL, ORNL, JLAB, FNAL, ANL, and LBNL; peerings with US R&E networks, CANARIE, GÉANT/NORDUnet, Asia-Pacific and Latin American networks, GLORIAD, CERN/USLHCNet, and LHCONE.)
• Connecting experimental facilities and supercomputing centers
• On-demand secure circuits and advance reservation system
• Guaranteed bandwidth between collaborating institutions by delivering network-as-a-service
• Co-allocation of storage and network resources (SRM: Storage Resource Manager)
OSCARS provides yes/no answers to a reservation request for (bandwidth, start_time, end_time).
End-to-end reservation: storage + network
33. Reservation Request
• Between edge routers
We need to ensure availability of the requested bandwidth from source to destination for the requested time interval:
R = {n_source, n_destination, M_bandwidth, t_start, t_end}
• source/destination end-points
• requested bandwidth
• start/end times
Committed reservations between t_start and t_end are examined. The shortest path from source to destination is calculated based on the engineering metric on each link, and a bandwidth-guaranteed path is set up to commit and eventually complete the reservation request for the given time period.
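The yes/no admission check can be sketched as follows, assuming a fixed candidate path and a list of committed reservations; this is a simplified illustration of the idea, not the OSCARS implementation:

```python
def available_on_path(path_links, capacity, committed, bandwidth, t_start, t_end):
    """Return True if `bandwidth` fits on every link of the path for the whole
    interval [t_start, t_end), given committed reservations. Each committed
    reservation is a dict with its links, bandwidth, and start/end times
    (field names are illustrative)."""
    for link in path_links:
        # Committed load on this link from reservations overlapping the interval.
        load = sum(r["bw"] for r in committed
                   if link in r["links"] and r["start"] < t_end and r["end"] > t_start)
        if load + bandwidth > capacity[link]:
            return False
    return True
```

With the slide's topology, a 600Mbps request over the fully reserved A-B-D path is rejected, while 500Mbps over A-C-D fits.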
34. Reservation
• Components (graph):
  • node (router), port, link (connecting two ports)
  • engineering metric (~latency)
  • maximum bandwidth (capacity)
• Reservation: source, destination, path, time
  • (time t1, t3) A -> B -> D (900Mbps)
  • (time t2, t3) A -> C -> D (400Mbps)
  • (time t4, t5) A -> B -> D (800Mbps)
(Topology: A-B 900Mbps, B-D 1000Mbps, A-C 500Mbps, C-D 800Mbps, B-C 300Mbps; timeline t1..t5 with Reservations 1-3.)
35. Example
(time t1, t2): A to D (600Mbps)? NO. A to D (500Mbps)? YES.
Link state during (t1, t2), shown as available / reserved (capacity):
• A-B: 0 Mbps / 900 Mbps (900Mbps)
• B-D: 100 Mbps / 900 Mbps (1000Mbps)
• C-D: 800 Mbps / 0 Mbps (800Mbps)
• A-C: 500 Mbps / 0 Mbps (500Mbps)
• B-C: 300 Mbps / 0 Mbps (300Mbps)
Active reservations:
• reservation 1: (time t1, t3) A -> B -> D (900Mbps)
• reservation 2: (time t2, t3) A -> C -> D (400Mbps)
• reservation 3: (time t4, t5) A -> B -> D (800Mbps)
36. Example
Link state for the window (t1, t3), shown as available / reserved (capacity):
• A-B: 0 Mbps / 900 Mbps (900Mbps)
• B-D: 100 Mbps / 900 Mbps (1000Mbps)
• C-D: 400 Mbps / 400 Mbps (800Mbps)
• A-C: 100 Mbps / 400 Mbps (500Mbps)
• B-C: 300 Mbps / 0 Mbps (300Mbps)
(time t1, t3): A to D (500Mbps)? NO. A to C (500Mbps)? NO (not max-flow!)
Active reservations:
• reservation 1: (time t1, t3) A -> B -> D (900Mbps)
• reservation 2: (time t2, t3) A -> C -> D (400Mbps)
• reservation 3: (time t4, t5) A -> B -> D (800Mbps)
37. Alternative Approach: Flexible Reservations
• If the requested bandwidth cannot be guaranteed:
  • trial and error until an available reservation is found
  • the client is not given other possible options
• How can we enhance the OSCARS reservation system?
• Be flexible: submit constraints, and the system suggests possible reservation options satisfying the given requirements
R' = {n_source, n_destination, M_MAXbandwidth, D_dataSize, t_EarliestStart, t_LatestEnd}
The reservation engine finds the reservation R = {n_source, n_destination, M_bandwidth, t_start, t_end} for the earliest completion or for the shortest duration, where M_bandwidth ≤ M_MAXbandwidth and t_EarliestStart ≤ t_start < t_end ≤ t_LatestEnd.
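The earliest-completion variant of the flexible search can be sketched as a scan over candidate time windows, assuming a helper that reports the maximum allocatable path bandwidth over a window (all names here are illustrative, not the FlexRes implementation):

```python
def earliest_completion(windows, max_bw_fn, data_size, max_bw_cap):
    """Scan candidate time windows and return the (start, bandwidth, finish)
    option that completes the transfer of `data_size` earliest.
    max_bw_fn(start, end) is assumed to give the max allocatable path
    bandwidth over that window; max_bw_cap is the client's M_MAXbandwidth."""
    best = None
    for start, end in windows:
        bw = min(max_bw_fn(start, end), max_bw_cap)
        if bw <= 0:
            continue
        finish = start + data_size / bw  # fixed-rate transfer over the window
        if finish <= end and (best is None or finish < best[2]):
            best = (start, bw, finish)
    return best
```

The shortest-duration variant is the same scan with `data_size / bw` as the ranking key instead of the finish time.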
38. Bandwidth Allocation (time-dependent)
Modified Dijkstra's algorithm (max available bandwidth):
• Bottleneck constraint (not additive)
• (typical QoS constraints are additive, as in shortest-path cost, etc.)
Finds the maximum bandwidth available for allocation from a source node to a destination node.
(Timeline: t1..t6)
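For one static graph, the modified Dijkstra search can be sketched as a widest-path computation: instead of minimizing an additive cost, it maximizes the minimum (bottleneck) available bandwidth along the path. A sketch, not the FlexRes code:

```python
import heapq

def max_bottleneck_bandwidth(graph, src, dst):
    """Widest-path (modified Dijkstra) search. graph[u] is a list of
    (neighbor, available_bw) pairs; returns the maximum bandwidth allocatable
    from src to dst, i.e. the best achievable bottleneck over all paths."""
    best = {src: float("inf")}
    heap = [(-best[src], src)]            # max-heap via negated bandwidth
    while heap:
        bw, u = heapq.heappop(heap)
        bw = -bw
        if u == dst:
            return bw
        if bw < best.get(u, 0):           # stale heap entry
            continue
        for v, cap in graph.get(u, ()):
            cand = min(bw, cap)           # bottleneck, not additive cost
            if cand > best.get(v, 0):
                best[v] = cand
                heapq.heappush(heap, (-cand, v))
    return 0.0
```

On the slide's topology with no reservations active, the widest A-to-D path is A-B-D at 900Mbps.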
39. Analogous Example
• A vehicle travelling from city A to city B
• There are multiple cities between A and B connected with separate highways
• Each highway has a specific speed limit (maximum bandwidth)
• But we need to reduce our speed if there is a high traffic load on the road
• We know the load on each highway for every time period (active reservations)
• The first question is which path the vehicle should follow in order to reach city B from city A as early as possible (earliest completion)
• Or, we can delay our journey and start later if the total travel time would be reduced. The second question is to find the route, along with the starting time, for the shortest travel duration (shortest duration)
Advance bandwidth reservation: we have to set the speed limit before starting and cannot change it during the journey.
40. Time steps
• Time steps between t1 and t13
(Timeline: Reservations 1-3 over t1..t13. The boundaries t1, t4, t6, t7, t9, t12, t13 divide the interval into time steps ts1, ts2, ts3, ts4, ...; the active sets are Res 1, then Res 1,2, then Res 2, then Res 3.)
Max (2r+1) time steps, where r is the number of reservations.
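The time steps can be derived directly from reservation boundaries: each reservation contributes at most two boundary points (its start and end), so at most 2r+1 steps result. A sketch of this bookkeeping:

```python
def time_steps(reservations, search_start, search_end):
    """Divide the search interval into time steps at reservation boundaries.
    reservations is a list of (start, end) pairs; returns consecutive
    (step_start, step_end) intervals covering [search_start, search_end]."""
    points = {search_start, search_end}
    for start, end in reservations:
        if search_start < start < search_end:
            points.add(start)
        if search_start < end < search_end:
            points.add(end)
    ordered = sorted(points)
    return list(zip(ordered, ordered[1:]))  # consecutive boundary pairs
```

Using the reservations from the later example (t1-t6, t4-t7, t9-t12 in a t1..t13 search interval) yields the steps bounded by t1, t4, t6, t7, t9, t12, t13.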
41. Static Graphs
One static graph per time step, labeled with the available bandwidth on each link:
• G(ts1), t1-t4 (Res 1 active): A-B 0 Mbps, B-D 100 Mbps, C-D 800 Mbps, A-C 500 Mbps, B-C 300 Mbps
• G(ts2), t4-t6 (Res 1, 2 active): A-B 0 Mbps, B-D 100 Mbps, C-D 400 Mbps, A-C 100 Mbps, B-C 300 Mbps
• G(ts3), t6-t7 (Res 2 active): A-B 900 Mbps, B-D 1000 Mbps, C-D 400 Mbps, A-C 100 Mbps, B-C 300 Mbps
• G(ts4), t7-t9 (no reservations active): A-B 900 Mbps, B-D 1000 Mbps, C-D 800 Mbps, A-C 500 Mbps, B-C 300 Mbps
42. Time Windows
Max (s × (s + 1))/2 time windows, where s is the number of time steps.
A time-window graph is the per-link minimum of the static graphs it spans (the bottleneck constraint):
• tw = ts1+ts2: G(tw) = G(ts1) x G(ts2): A-B 0 Mbps, B-D 100 Mbps, C-D 400 Mbps, A-C 100 Mbps, B-C 300 Mbps
• tw = ts3+ts4: G(tw) = G(ts3) x G(ts4): A-B 900 Mbps, B-D 1000 Mbps, C-D 400 Mbps, A-C 100 Mbps, B-C 300 Mbps
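Combining per-step static graphs into a time-window graph is a per-link minimum, since a fixed-rate reservation can only use the bandwidth that is free in every step the window spans. A sketch, with graphs represented as link-to-available-bandwidth dicts:

```python
def combine_windows(step_graphs):
    """Combine per-time-step static graphs into one time-window graph.
    Each graph maps a link name to its available bandwidth; the window's
    availability on a link is the minimum over all steps (bottleneck
    constraint), with 0 for a link missing from any step."""
    links = set().union(*(g.keys() for g in step_graphs))
    return {link: min(g.get(link, 0) for g in step_graphs) for link in links}
```

For example, combining G(ts1) and G(ts2) from the slide reproduces the tw = ts1+ts2 graph.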
43. Time Window List (special data structures)
Time windows list: [now, infinite]
• New reservation: reservation 1, start t1, end t10 -> [now, t1] [t1, t10: Res 1] [t10, infinite]
• New reservation: reservation 2, start t12, end t20 -> [now, t1] [t1, t10: Res 1] [t10, t12] [t12, t20: Res 2] [t20, infinite]
Careful software design makes the implementation fast and efficient.
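The interval-splitting behind the time-window list can be sketched as follows; the representation (windows as (start, end, active-reservation set) tuples, with `inf` for the open end) is an illustrative assumption:

```python
def insert_reservation(windows, start, end, label):
    """Split the time-window list so the new reservation occupies its own
    window(s). Each window is (start, end, frozenset_of_reservations)."""
    out = []
    for w_start, w_end, labels in windows:
        if w_end <= start or w_start >= end:       # no overlap: keep as-is
            out.append((w_start, w_end, labels))
            continue
        if w_start < start:                         # piece before the reservation
            out.append((w_start, start, labels))
        out.append((max(w_start, start), min(w_end, end), labels | {label}))
        if w_end > end:                             # piece after the reservation
            out.append((end, w_end, labels))
    return out
```

Inserting reservation 1 (t1-t10) and then reservation 2 (t12-t20) into the initial [now, infinite] list produces exactly the five windows shown on the slide.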
44. Performance
• max-bandwidth path ~ O(n^2), where n is the number of nodes in the topology graph
• In the worst case, we may need to search all time windows: (s × (s + 1))/2, where s is the number of time steps
• If there are r committed reservations in the search period, there can be a maximum of 2r + 1 different time steps in the worst case
• Overall, the worst-case complexity is bounded by O(r^2 n^2)
Note: r is relatively very small compared to the number of nodes n.
45. Example
• Reservation 1: (time t1, t6) A -> B -> D (900Mbps)
• Reservation 2: (time t4, t7) A -> C -> D (400Mbps)
• Reservation 3: (time t9, t12) A -> B -> D (700Mbps)
(Topology: A-B 900Mbps, B-D 1000Mbps, A-C 500Mbps, C-D 800Mbps, B-C 300Mbps; timeline t1..t13 with Reservations 1-3.)
Request from A to D (earliest completion): max bandwidth = 200Mbps, volume = 200Mbps x 4 time slots, earliest start = t1, latest finish = t13.
46. Search Order - Time Windows
Time steps from the reservations: t1, t4, t6, t7, t9, t12, t13 (Res 1; Res 1,2; Res 2; Res 3 active in turn).
Candidate time windows: t1-t6, t4-t6, t1-t4, t6-t7, t4-t7, t1-t7, t7-t9, t6-t9, t4-t9, t1-t9.
Max bandwidth from A to D per window, in search order:
1. 900Mbps (3)
2. 100Mbps (2)
3. 100Mbps (5)
4. 900Mbps (1)
5. 100Mbps (3)
6. 100Mbps (6)
7. 900Mbps (2)
8. 900Mbps (3)
9. 100Mbps (5)
10. 100Mbps (8)
Reservation: (A to D) (100Mbps) start=t1 end=t9
47. Search Order - Time Windows
Shortest duration?
Candidate time windows: t9-t13, t12-t13, t9-t12 (Res 3 active during t9-t12).
Max bandwidth from A to D per window, in search order:
1. 200Mbps (3)
2. 900Mbps (1)
3. 200Mbps (4)
Reservation: (A to D) (200Mbps) start=t9 end=t13
• From A to D: max bandwidth = 200Mbps, volume = 175Mbps x 4 time slots, earliest start = t1, latest finish = t13
• earliest completion: (A to D) (100Mbps) start=t1 end=t8
• shortest duration: (A to D) (200Mbps) start=t9 end=t12.5
48. Source > Network > Destination
(Topology diagram: end hosts n1 and n2 attached to the A/B/C/D network with its 800/900/500/1000/300 Mbps links.)
Now we have multiple requests.
49. With start/end times
• Each transfer request has start and end times
• n transfer requests are given (each request has a specific amount of profit)
• The objective is to maximize the profit
• If the profit is the same for each request, then the objective is to maximize the number of jobs in a given time period
• Unsplittable Flow Problem: given an undirected graph, route demand from source(s) to destination(s) and maximize/minimize the total profit/cost
The online scheduling method here is inspired by the Gale-Shapley algorithm (also known as the stable marriage problem).
50. Methodology
• Displace other jobs to open space for the new request (we can shift at most n jobs)
• Never accept a job if it causes other committed jobs to break their criteria
• Planning ahead (gives an opportunity for co-allocation)
• Gives a polynomial approximation algorithm
• The preference converts the UFP problem into a Dijkstra path search
• Utilizes time windows/time steps for ranking (better than earliest deadline first)
• Earliest completion + shortest duration
• Minimize concurrency
• Even random ranking would work (relaxation in an NP-hard problem)
52. Recall: Time Windows
(Repeats the search-order example from slide 46: candidate time windows over t1..t13, the max bandwidth from A to D in each, and the resulting reservation (A to D) (100Mbps) start=t1 end=t9.)
53. Test
In real life, the number of nodes and the number of reservations in a given search interval are limited.
See the AINA'13 paper for results and a comparison with different preference metrics.
54. Autonomic Provisioning System
• Generate constraints automatically (without user input):
  • Volume (elephant flow?)
  • True deadline, if applicable
  • End-host resource availability
  • Burst rate (fixed bandwidth, variable bandwidth)
• Update constraints according to feedback and monitoring
• Minimize operational cost
• An alternative to manual traffic engineering
What is the incentive to make correct reservations?
55. (Scenario diagram: Experimental facility A, Data Center 1, Data Center 2, and Data node B (web access), connected over a wide-area SDN.)
• (1) Experimental facility A generates 30T of data every day, and it needs to be stored in data center 2 before the next run, since local disk space is limited
• (2) There is a reservation made between data centers 1 and 2. It is used to replicate data files, 1P total size, when new data is available in data center 2
• (3) New results are published at data node B; we expect high traffic to download new simulation files for the next couple of months
56. Example
• The experimental facility periodically transfers data (i.e., every night)
• Data replication happens occasionally, and it will take a week to move 1P of data. It could get delayed a couple of hours with no harm
• Wide-area download traffic will increase gradually; most of the traffic will be during the day
• We can dynamically increase the preference for download traffic in the mornings, give high priority to transferring data from the facility at night, and use the rest of the bandwidth for data replication (while allocating enough bandwidth to confirm that it would finish within a week as usual)
57. Virtual Circuit Reservation Engine
Autonomic provisioning system + monitoring.
Reservation Engine:
– Select optimal path/time/bandwidth
– Maximize the number of admitted requests
– Increase overall system utilization and network efficiency
– Dynamically update the selected routing path for network efficiency
– Modify existing reservations dynamically to open space/time for new requests
58. THANK YOU
Any questions/comments?
Mehmet Balman
mehmet@balman.info
http://balman.info
59. PetaShare + Stork Data Scheduler
Aggregation in the data path: an advance buffer cache in the Petafs and Petashell clients aggregates I/O requests to minimize the number of network messages.
60. Adaptive Tuning + Advanced Buffer
• Adaptive tuning for bulk transfer
• Buffer cache for remote I/O