This document provides an overview of a presentation on cloud architecture and anti-architecture patterns. The presentation discusses moving a company's primary data store from a centralized SQL database to a distributed Cassandra database in the cloud. An initial prototype backup solution was overengineered, becoming complex and taking too long to implement fully. This highlighted the importance of defining anti-architecture constraints upfront to guide development in a simpler direction. The presentation concludes with a discussion of differences between the company's existing datacenter architecture and goals for a cloud architecture, focusing on replacing centralized components with distributed and decoupled alternatives.
1. Cloud Architecture Tutorial: Platform Component Architecture (Part 2 of 3)
QCon London, March 5th, 2012
Adrian Cockcroft, @adrianco, #netflixcloud
http://www.linkedin.com/in/adriancockcroft
2. Don’t Do That! A Discussion of Anti-Architecture (written as an Ignite talk)
8. What I Wanted
• Moving to Cassandra as primary data store
• We need backups!
• We are running on AWS…
“I want Cassandra backups to S3.
Start with full backup, incremental later.
Restore to a different Cassandra cluster.”
9. Additional Goals
“I would like it next week.”
- Keep it simple
- No single point of failure
- Get a once-a-day full backup working first
10. Prototype
• Created S3 bucket
• Carefully figured out a good S3 path hierarchy
• Wrote a simple backup script (see the sketch below)
• Added it to cron
• ….
• Profit! (total time: half a day)
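As a rough illustration of what such a prototype backup step could look like, here is a minimal Java sketch: it asks Cassandra for a snapshot via nodetool and uploads the snapshot files into an S3 path hierarchy. The bucket name, cluster/node names, and directory layout are assumptions for illustration, not the actual Netflix script.

  import com.amazonaws.services.s3.AmazonS3;
  import com.amazonaws.services.s3.AmazonS3ClientBuilder;
  import java.io.File;
  import java.time.LocalDate;
  import java.util.ArrayList;
  import java.util.List;

  // Sketch only: bucket, cluster and node identifiers below are hypothetical.
  public class SnapshotBackup {

      public static void main(String[] args) throws Exception {
          // 1. Ask Cassandra for a point-in-time snapshot (hard links to current SSTables).
          new ProcessBuilder("nodetool", "snapshot", "-t", "daily")
                  .inheritIO().start().waitFor();

          // 2. Upload the snapshot files into a deliberate S3 path hierarchy:
          //    <cluster>/<node>/<date>/<filename>
          AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
          String prefix = "mycluster/node1/" + LocalDate.now() + "/";
          for (File f : findSnapshotFiles(new File("/var/lib/cassandra/data"))) {
              s3.putObject("example-cassandra-backups", prefix + f.getName(), f);
          }
      }

      // Recursively collect files that live under a "snapshots" directory.
      private static List<File> findSnapshotFiles(File dir) {
          List<File> out = new ArrayList<>();
          File[] children = dir.listFiles();
          if (children == null) return out;
          for (File c : children) {
              if (c.isDirectory()) {
                  out.addAll(findSnapshotFiles(c));
              } else if (c.getPath().contains("/snapshots/")) {
                  out.add(c);
              }
          }
          return out;
      }
  }

Run from cron once a day, this is roughly the half-day prototype the slide describes: snapshot, then copy to S3 under a predictable path.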
11. Now comes the hard part!
Restore is trickier, Cassandra is written in Java, a programmer from another team takes over…
“Here’s the S3 bucket, backups are being collected already, please figure out how to restore it. Done by next week perhaps?”
12. Days Pass…
• Programmer is re-writing the backup in Python
• Installs Python 2.7 on CentOS, breaks yum
• Backup remotely invoked from a central point
• Cassandra patched to do incremental backups
13. Weeks Pass…
• Python-based full backup & restore works!
• But only to the Cassandra cluster it came from
• Incremental backup works!
• Restore not done yet…
14. Cassandra in Production
“We do have backups running now, right?” “We’ll get right on it…”
“I want the production backup restored in test.” “Oh, didn’t implement that feature yet…”
15. Whoops!
Production data trashed while setting up backup.
Luckily, it was recoverable from elsewhere.
16. Months Pass
• Python prototype re-written in Java (Priam)
• Integrated with other management functions
• Decentralized backups again (yay!)
• Reliable backups
• Restore to test
• Not simple
• Took too long…
17. Anti-Architecture
• Define the things you don’t want
• Constrain the outcome
• Check that the constraints are being met
• …
• Profit!
21. Goals
• Faster
  – Lower latency than the equivalent datacenter web pages and API calls
  – Measured as mean and 99th percentile
  – For both first hit (e.g. home page) and in-session hits for the same user
• Scalable
  – Avoid needing any more datacenter capacity as subscriber count increases
  – No central vertically scaled databases
  – Leverage AWS elastic capacity effectively
• Available
  – Substantially higher robustness and availability than datacenter services
  – Leverage multiple AWS availability zones
  – No scheduled downtime, no central database schema to change
• Productive
  – Optimize agility of a large development team with automation and tools
  – Leave behind the complex tangled datacenter code base (~8 year old architecture)
  – Enforce clean layered interfaces and re-usable components
22. Datacenter Anti-Patterns
What do we currently do in the datacenter that prevents us from meeting our goals?
23. Architecture
• Software Architecture
  – The abstractions and interfaces that developers build against
• Systems Architecture
  – The service instances that define availability, scalability
• Compose-ability
  – Software architecture that is independent of the systems architecture
  – Decoupled, flexible building-block components
24. Rewrite from Scratch
Not everything is cloud specific
Pay down technical debt
Robust patterns
25. Netflix Datacenter vs. Cloud Arch
  Central SQL Database         ->  Distributed Key/Value NoSQL
  Sticky In-Memory Session     ->  Shared Memcached Session
  Chatty Protocols             ->  Latency Tolerant Protocols
  Tangled Service Interfaces   ->  Layered Service Interfaces
  Instrumented Code            ->  Instrumented Service Patterns
  Fat Complex Objects          ->  Lightweight Serializable Objects
  Components as Jar Files      ->  Components as Services
26. The Central SQL Database
• Datacenter has a central database
  – Everything in one place is convenient until it fails
  – Customers, movies, history, configuration
• Schema changes require downtime
Anti-pattern impacts scalability, availability
27. The Distributed Key-Value Store
• Cloud has many key-value data stores
  – More complex to keep track of, do backups etc.
  – Each store is much simpler to administer (no DBA)
  – Joins take place in java code
  – No schema to change, no scheduled downtime
• Mean Latency for Simple Key Lookup Queries
  – Memcached is dominated by network latency, <1ms
  – Cassandra around one millisecond
  – Oracle for simple queries is a few milliseconds
  – DynamoDB around 5ms
  – SimpleDB replication and REST overheads >10ms
28. The Sticky Session
• Datacenter Sticky Load Balancing
  – Efficient caching for low latency
  – Tricky session handling code
• Encourages concentrated functionality
  – One service that does everything
  – Middle tier load balancer had issues in practice
Anti-pattern impacts productivity, availability
29. Shared Session State
• Elastic Load Balancer
  – We don’t use the cookie-based routing option
  – External “session caching” with memcached
• More flexible fine-grain services
  – Any instance can serve any request
  – Works better with auto-scaled instance counts
30. Chatty Opaque and Brittle Protocols
• Datacenter service protocols
  – Assumed low latency for many simple requests
• Based on serializing existing java objects
  – Inefficient formats
  – Incompatible when definitions change
Anti-pattern causes productivity, latency and availability issues
31. Robust and Flexible Protocols
• Cloud service protocols
  – JSR311/Jersey is used for REST/HTTP service calls
  – Custom client code includes service discovery
  – Support complex data types in a single request
• Apache Avro (illustrated below)
  – Evolved from Protocol Buffers and Thrift
  – Includes a JSON header defining the key/value protocol
  – Avro serialization is half the size and several times faster than Java serialization, but more work to code
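As a rough illustration of the Avro approach (not the Netflix protocol itself), a record schema can be declared in JSON and used to serialize a generic record with the standard Avro library; the record name and fields below are made up for the example.

  import org.apache.avro.Schema;
  import org.apache.avro.generic.GenericData;
  import org.apache.avro.generic.GenericDatumWriter;
  import org.apache.avro.generic.GenericRecord;
  import org.apache.avro.io.BinaryEncoder;
  import org.apache.avro.io.EncoderFactory;
  import java.io.ByteArrayOutputStream;

  public class AvroSketch {
      public static void main(String[] args) throws Exception {
          // Schema defined in JSON; the fields here are illustrative only.
          Schema schema = new Schema.Parser().parse(
              "{\"type\":\"record\",\"name\":\"Visitor\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"country\",\"type\":\"string\"}]}");

          GenericRecord visitor = new GenericData.Record(schema);
          visitor.put("id", 42L);
          visitor.put("country", "GB");

          // Binary encoding uses zigzag/varint packing, so the payload stays compact.
          ByteArrayOutputStream out = new ByteArrayOutputStream();
          BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
          new GenericDatumWriter<GenericRecord>(schema).write(visitor, encoder);
          encoder.flush();
          System.out.println("Serialized " + out.size() + " bytes");
      }
  }

Because the schema travels (or is agreed) separately from the compact binary payload, adding a field does not break old readers, which is the “less brittle across versions” point made on the next slide.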
32. Persisted Protocols
• Persist Avro in Memcached
  – Save space/latency (zigzag encoding, half the size)
  – New keys are ignored
  – Missing keys are handled cleanly
• Avro protocol definitions
  – Less brittle across versions
  – Can be written in JSON or generated from POJOs
  – It’s hard, needs better tooling
33. Tangled Service Interfaces
• Datacenter implementation is exposed
  – Oracle SQL queries mixed into business logic
• Tangled code
  – Deep dependencies, false sharing
• Data providers with sideways dependencies
  – Everything depends on everything else
Anti-pattern affects productivity, availability
34. Untangled Service Interfaces
• New cloud code with strict layering
  – Compile against the interface jar (see the sketch below)
  – Can use Spring runtime binding to enforce it
  – Fine-grain services as components
• The service interface is the service
  – Implementation is completely hidden
  – Can be implemented locally or remotely
  – Implementation can evolve independently
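A minimal sketch of the layering idea, with hypothetical names: consumers compile only against a service interface jar, and Spring binds whichever concrete (local or remote) implementation is deployed at runtime.

  // In the published interface jar: the only thing consumers compile against.
  public interface VideoMetadataService {
      String getTitle(long videoId);
  }

  // In the provider's implementation jar, swappable without recompiling consumers.
  @org.springframework.stereotype.Service
  class LocalVideoMetadataService implements VideoMetadataService {
      public String getTitle(long videoId) {
          return "title-" + videoId; // stand-in for a real lookup
      }
  }

  // A consumer sees only the interface; Spring injects whichever binding is deployed.
  @org.springframework.stereotype.Component
  class HomePageBuilder {
      private final VideoMetadataService videos;

      @org.springframework.beans.factory.annotation.Autowired
      HomePageBuilder(VideoMetadataService videos) {
          this.videos = videos;
      }

      String render(long videoId) {
          return "<h1>" + videos.getTitle(videoId) + "</h1>";
      }
  }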
35. Untangled Service Interfaces
Two layers:
• SAL - Service Access Library
  – Basic serialization and error handling
  – REST or POJOs defined by the data provider
• ESL - Extended Service Library
  – Caching, conveniences, can combine several SALs
  – Exposes the faceted type system (described later)
  – Interface defined by the data consumer in many cases
37. Service Architecture Patterns
• Internal Interfaces Between Services
  – Common patterns as templates
  – Highly instrumented, observable, analytics
  – Service Level Agreements – SLAs
• Library templates for generic features
  – Instrumented Netflix Base Servlet template
  – Instrumented generic client interface template
  – Instrumented S3, SimpleDB, Memcached clients
38. [Diagram: instruments every step in the call. On the client side each request records a request start timestamp, outbound serialize start/end timestamps, a network send timestamp, then on the response path a network receive timestamp and inbound deserialize start/end timestamps. On the service side the matching network receive, inbound deserialize, execute request start/end, outbound serialize start/end, and network send timestamps are recorded.]
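A hedged sketch of the kind of per-step timestamping the diagram implies; the class, step names, and payload are invented for illustration and are not the Netflix instrumentation library.

  import java.util.LinkedHashMap;
  import java.util.Map;

  // Records one timestamp per step of a client call: serialize, send, receive, deserialize.
  public class CallTimer {
      private final Map<String, Long> stamps = new LinkedHashMap<>();

      public void mark(String step) {
          stamps.put(step, System.nanoTime());
      }

      public long elapsedMicros(String from, String to) {
          return (stamps.get(to) - stamps.get(from)) / 1_000;
      }

      public static void main(String[] args) throws Exception {
          CallTimer t = new CallTimer();
          t.mark("requestStart");
          t.mark("serializeStart");
          byte[] payload = "{\"videoId\":42}".getBytes("UTF-8");
          t.mark("serializeEnd");
          t.mark("networkSend");
          Thread.sleep(5);              // stand-in for wire time plus service time
          t.mark("networkReceive");
          t.mark("deserializeEnd");
          System.out.println("serialize: "
                  + t.elapsedMicros("serializeStart", "serializeEnd") + "us, network+service: "
                  + t.elapsedMicros("networkSend", "networkReceive") + "us, payload "
                  + payload.length + " bytes");
      }
  }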
39. Boundary Interfaces
• Isolate teams from external dependencies
  – Fake SAL built by the cloud team
  – Real SAL provided by the data provider team later
  – ESL built by the cloud team using faceted objects
• Fake data sources allow development to start
  – e.g. Fake Identity SAL for a test set of customers
  – Development solidifies dependencies early
  – Helps the external team provide the right interface
40. One Object That Does Everything
• Datacenter uses a few big complex objects
  – Movie and Customer objects are the foundation
  – Good choice for a small team and one instance
  – Problematic for large teams and many instances
• False sharing causes tangled dependencies
  – Unproductive re-integration work
Anti-pattern impacting productivity and availability
41. An Interface For Each Component
• Cloud uses faceted Video and Visitor
  – Basic types hold only the identifier
  – Facets scope the interface you actually need
  – Each component can define its own facets
• No false-sharing and dependency chains
  – The type manager converts between facets as needed
  – video.asA(PresentationVideo) for www
  – video.asA(MerchableVideo) for the middle tier
43. Avoiding “Level Confusion” [Catataxis]
• Business Level Objects (BLO?)
  – Customers, Movies, etc.
  – Conceptual: exist only between the ears
• Abstract Types
  – Abstractions that try to model aspects of the business level objects
  – Often captured by Java interfaces
• Implementations
  – Specific coded implementations of the abstract types
  – A Java class, or a collection of rows in a database…
44. Facets
• No single Abstract Type captures everything about a BLO
  – Different teams see different “facets”
    • Customer: Account status; Billing history; Viewing history; A/B test assignments
    • Movie: Availability; Popularity; Synopsis; Cast
  – Loosely coupled, tightly aligned(!)
• All facets for a BLO should inherit from one “basic” type that has minimal behavior
45. Basic Types
• Module external interfaces deal in basic types; internal calls are free to use more complex facets (see the sketch below)
• Generic machinery to switch between facets

  Business Level Object   Java Basic Type
  Movie (TV show…)        Video
  Customer                Visitor
  Category                VTag
  Country                 ISOCountry
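A rough Java sketch of how a basic type plus facets might be declared. Video and PresentationVideo/MerchableVideo are named in the slides; the methods on them are assumptions made up for illustration.

  // Basic type: holds only the identity of the BLO and minimal behaviour.
  public interface Video {
      long getId();
      <T extends Video> T asA(Class<T> facet); // delegates to the type manager
  }

  // One facet per consumer need; each component can define its own.
  public interface PresentationVideo extends Video {
      String getTitle();
      String getBoxArtUrl();
  }

  public interface MerchableVideo extends Video {
      double getPopularityScore();
  }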
46. Type Manager
• Holds the “factory” objects that manage instances of facets
  – Typically one factory per facet
  – Factories are free to implement any instance management policy they want
• Factories register with the Type Manager (see the sketch below)
  – Callers never interact directly with the factories
  – Mock managers?
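A minimal sketch of the registration idea, with hypothetical factory and lookup signatures; only findObject(Class, id) echoes the usage shown on slide 49, everything else is an assumption.

  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;

  // One factory per facet; each factory decides how its instances are created or cached.
  interface FacetFactory<T> {
      T findById(long id);
  }

  // Callers ask the type manager for a facet; they never touch the factories directly.
  public final class SimpleTypeManager {
      private static final Map<Class<?>, FacetFactory<?>> factories = new ConcurrentHashMap<>();

      public static <T> void register(Class<T> facet, FacetFactory<T> factory) {
          factories.put(facet, factory);
      }

      @SuppressWarnings("unchecked")
      public static <T> T findObject(Class<T> facet, long id) {
          FacetFactory<T> f = (FacetFactory<T>) factories.get(facet);
          if (f == null) throw new IllegalArgumentException("No factory registered for " + facet);
          return f.findById(id);
      }
  }

A mock manager for tests would simply register fake factories for the facets a module needs.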
47. Switching Facets
• Each Basic Type B implements a method that uses the Type Manager to find facet implementations of the same BLO:

  <T extends B> T asA(Class<T> c)

• Example:

  Visitor visitor = xxx;
  ABClient abClient = visitor.asA(ABClient.class);
  assert(visitor.equals(abClient));

• Look Ma, no cast!
  – Facets are equal, but not necessarily ==.
48. IDs! (huh) What are they good for?
• IDs exist because implementations need to externalize objects and maintain their identity
  – Persist in a DB, or talk to a remote service
  – Different implementations of a type of BLO model the same object iff they have the same ID
  – Basic Types use IDs to manage facets, determine equality, etc.
49. Converting IDs <-> Objects

  Long id = xx;
  MyVisitor visitor =
      TypeManager.findObject(Visitor.class, id)
                 .asA(MyVisitor.class);
  assert(id.equals(visitor.getId()));

  // Or more efficiently…
  MyVisitor visitor2 =
      TypeManager.findObject(Visitor.class, id, MyVisitor.class);

  // There are also efficient bulk conversion methods
  Collection<Long> ids = xxx;
  List<MyVisitor> visitors =
      TypeManager.findObjects(Visitor.class, ids, MyVisitor.class);
50. Stan’s Soap Box
• Don’t pass around IDs when you mean to refer to the BLO; that is Level Confusion
• Using Basic Types helps the compiler help you; compile time problems are better than run time problems
• More readable by people, but beware that asA operations may be a lot of work
• (Is this a way to approximate multiple-inheritance in Java?)
51. Software Architecture Patterns
• Object Models
  – Basic and derived types, facets, serializable
  – Pass by reference within a service
  – Pass by value between services
• Computation and I/O Models
  – Service execution using Best Effort / Futures
  – Common thread pool management
  – Circuit breakers to manage and contain failures (see the sketch below)
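A hedged, minimal circuit-breaker sketch (not the Netflix implementation): after a threshold of consecutive failures the breaker opens and fails fast to a fallback until a cool-down period has passed.

  import java.util.concurrent.Callable;

  // Minimal circuit breaker: opens after N consecutive failures, retries after a cool-down.
  public class CircuitBreaker {
      private final int failureThreshold;
      private final long coolDownMillis;
      private int consecutiveFailures = 0;
      private long openedAt = 0;

      public CircuitBreaker(int failureThreshold, long coolDownMillis) {
          this.failureThreshold = failureThreshold;
          this.coolDownMillis = coolDownMillis;
      }

      public synchronized <T> T call(Callable<T> action, T fallback) {
          boolean open = consecutiveFailures >= failureThreshold
                  && System.currentTimeMillis() - openedAt < coolDownMillis;
          if (open) {
              return fallback;                 // fail fast, contain the failure
          }
          try {
              T result = action.call();
              consecutiveFailures = 0;         // success closes the breaker
              return result;
          } catch (Exception e) {
              if (++consecutiveFailures >= failureThreshold) {
                  openedAt = System.currentTimeMillis();
              }
              return fallback;
          }
      }
  }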
52. Model Driven Architecture
• Traditional Datacenter Practices
  – Lots of unique hand-tweaked systems
  – Hard to enforce patterns
  – Some use of Puppet to automate changes
• Model Driven Cloud Architecture
  – Perforce/Ivy/Jenkins based builds for everything
  – Every production instance is a pre-baked AMI
  – Every application is managed by an Autoscaler
Every change is a new AMI
54. Netflix PaaS Principles
• Maximum Functionality
  – Developer productivity and agility
• Leverage as much of AWS as possible
  – AWS is making huge investments in features/scale
• Interfaces that isolate Apps from AWS
  – Avoid lock-in to specific AWS API details
• Portability is a long term goal
  – Gets easier as other vendors catch up with AWS
55. Netflix Global PaaS
• Architecture Features and Overview
• Portals and Explorers
• Platform Services
• Platform APIs
• Platform Frameworks
• Persistence
• Scalability Benchmark
56. Global PaaS?
Toys are nice, but this is the real thing…
• Supports all AWS Availability Zones and Regions
• Supports multiple AWS accounts {test, prod, etc.}
• Cross Region/Acct Data Replication and Archiving
• Internationalized, Localized and GeoIP routing
• Security is fine grain, dynamic AWS keys
• Autoscaling to thousands of instances
• Monitoring for millions of metrics
• Productive for 100s of developers on one product
• 23M+ users: USA, Canada, Latin America, UK, Eire
57. Basic PaaS Entities
• AWS Based Entities
  – Instances and Machine Images, Elastic IP Addresses
  – Security Groups, Load Balancers, Autoscale Groups
  – Availability Zones and Geographic Regions
• Netflix PaaS Entities
  – Applications (registered services)
  – Clusters (versioned Autoscale Groups for an App)
  – Properties (dynamic hierarchical configuration)
58. Core PaaS Services
• AWS Based Services
  – S3 storage, up to 5TB files, parallel multipart writes
  – SQS – Simple Queue Service. Messaging layer.
• Netflix Based Services
  – EVCache – memcached based ephemeral cache
  – Cassandra – distributed data store
• External Services
  – GeoIP Lookup interfaced to a vendor
  – Keystore HSM in the Netflix Datacenter
59. Instance Architecture
[Diagram: the instance software stack. Linux Base AMI (CentOS or Ubuntu); optional Apache frontend, memcached and non-java apps; Java (JDK 6 or 7); Tomcat; the application servlet plus base server, platform and interface jars for dependent services; healthcheck and status servlets, JMX interface, Servo autoscale; monitoring via AppDynamics appagent and machineagent, Epic, log rotation to S3, GC and thread dump logging.]
60. Security Architecture
• Instance Level Security baked into the base AMI
  – Login: ssh only allowed via portal (not between instances)
  – Each app type runs as its own userid app{test|prod}
• AWS Security, Identity and Access Management
  – Each app has its own security group (firewall ports)
  – Fine grain user roles and resource ACLs
• Key Management
  – AWS Keys dynamically provisioned, easy updates
  – High grade app specific key management support
62. Portals and Explorers
• Netflix Application Console (NAC)
  – Primary AWS provisioning/config interface
• AWS Usage Analyzer
  – Breaks down costs by application and resource
• Cassandra Explorer
  – Browse clusters, keyspaces, column families
• Base Server Explorer
  – Browse service endpoints, configuration, perf
68. Platform Services
• Discovery – service registry for “Applications”
• Introspection – Entrypoints
• Cryptex – Dynamic security key management
• Geo – Geographic IP lookup
• Platformservice – Dynamic property configuration
• Localization – manage and lookup local translations
• Evcache – ephemeral volatile cache
• Cassandra – Cross zone/region distributed data store
• Zookeeper – Distributed Coordination (Curator)
• Various proxies – access to old datacenter stuff
69. Introspection - Entrypoints
• REST API for tools, apps, explorers, monkeys… (example call below)
  – E.g. GET /REST/v1/instance/$INSTANCE_ID
• AWS Resources
  – Autoscaling Groups, EIP Groups, Instances
• Netflix PaaS Resources
  – Discovery Applications, Clusters of ASGs, History
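As a simple illustration, such an internal REST endpoint can be called with plain java.net classes. Only the GET /REST/v1/instance/... path comes from the slide; the host name is a placeholder and the instance id is the example used on the next slide.

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.net.HttpURLConnection;
  import java.net.URL;

  public class EntrypointsClient {
      public static void main(String[] args) throws Exception {
          String instanceId = "i-4e108521";   // example instance id, as on slide 70
          URL url = new URL("http://entrypoints.example.internal/REST/v1/instance/" + instanceId);
          HttpURLConnection conn = (HttpURLConnection) url.openConnection();
          conn.setRequestMethod("GET");
          conn.setRequestProperty("Accept", "application/json");
          try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
              String line;
              while ((line = in.readLine()) != null) {
                  System.out.println(line);   // JSON description of the instance
              }
          }
      }
  }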
70. Entrypoints Queries
MongoDB used for low traffic complex queries against complex objects

  Description                                                              Range expression
  Find all active instances.                                               all()
  Find all instances associated with a group name.                         %(cloudmonkey)
  Find all instances associated with a discovery group.                    /^cloudmonkey$/discovery()
  Find all auto scale groups with no instances.                            asg(),-has(INSTANCES;asg())
  How many instances are not in an auto scale group?                       count(all(),-info(eval(INSTANCES;asg())))
  What groups include an instance?                                         *(i-4e108521)
  What auto scale groups and elastic load balancers include an instance?   filter(TYPE;asg,elb;*(i-4e108521))
  What instance has a given public ip?                                     filter(PUBLIC_IP;174.129.188.{0..255};all())
71. Metrics Framework
• System and Application
  – Collection, Aggregation, Querying and Reporting
  – Non-blocking logging, avoids log4j lock contention
  – Honu-Streaming -> S3 -> EMR -> Hive
• Performance, Robustness, Monitoring, Analysis
  – Tracers, Counters – explicit code instrumentation log (see the sketch below)
  – Real Time Tracers/Counters
  – SLA – service level response time percentiles
  – Servo annotated JMX extract to Cloudwatch
• Latency Monkey Infrastructure
  – Inject random delays into service responses
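A generic sketch of the explicit counter/tracer style of instrumentation mentioned above; this is not the Netflix or Servo API, just the shape of the idea.

  import java.util.concurrent.ConcurrentHashMap;
  import java.util.concurrent.ConcurrentMap;
  import java.util.concurrent.atomic.AtomicLong;

  // Explicit code instrumentation: named counters plus a simple tracer for timing a block.
  public final class Metrics {
      private static final ConcurrentMap<String, AtomicLong> counters = new ConcurrentHashMap<>();

      public static void increment(String name) {
          counters.computeIfAbsent(name, k -> new AtomicLong()).incrementAndGet();
      }

      // Usage: long t = Metrics.traceStart(); ... Metrics.traceEnd("s3.put", t);
      public static long traceStart() {
          return System.nanoTime();
      }

      public static void traceEnd(String name, long startNanos) {
          long micros = (System.nanoTime() - startNanos) / 1_000;
          increment(name + ".count");
          // A real framework would feed a non-blocking log and percentile estimator here.
          System.out.println(name + " took " + micros + "us");
      }
  }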
73. Interprocess Communication
• Discovery Service registry for “applications”
  – “Here I am” call every 30s, drop after 3 missed (see the sketch below)
  – “Where is everyone” call
  – Redundant, distributed, moving to Zookeeper
• NIWS – Netflix Internal Web Service client
  – Software Middle Tier Load Balancer
  – Failure retry moves to the next instance
  – Many options for encoding, etc.
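A toy sketch of the heartbeat pattern described above, assuming a hypothetical registry interface: each instance announces itself every 30 seconds and the registry drops it after roughly three missed heartbeats.

  import java.util.concurrent.Executors;
  import java.util.concurrent.ScheduledExecutorService;
  import java.util.concurrent.TimeUnit;

  public class HeartbeatClient {

      // Hypothetical registry call; in practice this would be an HTTP request to Discovery.
      interface Registry {
          void hereIAm(String appName, String instanceId);
      }

      public static void start(final Registry registry, final String app, final String instance) {
          ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
          // "Here I am" every 30s; the server side expires the registration
          // if about 3 consecutive heartbeats (90s) are missed.
          scheduler.scheduleAtFixedRate(
                  () -> registry.hereIAm(app, instance),
                  0, 30, TimeUnit.SECONDS);
      }

      public static void main(String[] args) {
          start((app, id) -> System.out.println("register " + app + "/" + id), "api", "i-12345678");
      }
  }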
74. Security Key Management
• AKMS
  – Dynamic Key Management interface
  – Update AWS keys at runtime, no restart
  – All keys stored securely, none on disk or in the AMI
• Cryptex - Flexible key store
  – Low grade keys processed in the client
  – Medium grade keys processed by the Cryptex service
  – High grade keys processed by hardware (Ingrian)
75. AWS Persistence Services
• SimpleDB
  – Got us started, migrated to Cassandra now
  – NFSDB - Instrumented wrapper library
  – Domain and Item sharding (workarounds)
• S3
  – Upgraded/Instrumented JetS3t based interface
  – Supports multipart upload and 5TB files
  – Global S3 endpoint management
76. Aside: Adrian’s Rant on CAP Theorem
Choose Consistency or Availability when Partitioned
• Instances and Networks will fail
• Network failure = Partition – “P” is a given
• Distributed Systems: two choices – CP or AP
• “Vendor claims CA”
  – Usually they mean available when instances fail
• Master-Slave = Consistent when Partitioned
  – You can’t write unless you can see the master
• No-Master = Available when Partitioned
  – Writes proceed, conflicts will be patched up later
78. Basic Requirements
• Supports running on Amazon EC2
• Supports Amazon Availability Zones
• Low latency, low latency variance
• High and scalable read and write throughput
• Large and scalable capacity, no external sharding
• “AP” Eventually Consistent
• Data integrity checks and repairs
• Online Snapshot Backup, Restore/Rollback
79. Scenario – Immediate Read after Write
Q1: Is routing and replication zone aware?
[Diagram: a TV device appends a new favorite via a round-robin load balancer to the API in zone A, then reads the new favorites list via the API in zone B; the Favorites stores in zones A and B are linked by replication.]
80. Network Partition
Q2: What happens next?
[Diagram: the same scenario, but with no replication between the Favorites stores in zones A and B.]
81. Network Partition
Q3: Supports Append vs. Read/Modify/Write?
[Diagram: the same scenario with a read-modify-write (RMW) client; one zone still holds the old favorites list while the other has the new one, and replication must reconcile them.]
82. Silent Data Corruption
Q4: How is it detected and corrected?
[Diagram: the same append scenario, but the replicated favorites list is corrupted on disk or via the network.]
83. Netflix Platform Persistence
• Ephemeral Volatile Cache – evcache
  – Discovery-aware memcached based backend
  – Client abstractions for zone aware replication
  – Option to write to all zones, fast read from local
• Cassandra
  – Highly available and scalable (more later…)
• MongoDB
  – Complex object/query model for small scale use
• MySQL
  – Hard to scale, legacy and small relational models
84. Why Cassandra?
• We value Availability over Consistency – AP
  – Cassandra is a Java distributed systems toolkit
• We have a building full of Java engineers
  – Riak is in Erlang – a blessing and a curse…
• We want FOSS + Support
  – Voldemort doesn’t have a support model
• Writes are intrinsically harder than reads
  – HBase is CP, optimized for reads, & single namenode issues
• Cassandra works, running ~55 clusters
  – Step by step into full production over the last year
85. Priam – Cassandra Automation
Available at http://github.com/netflix
• Netflix Platform Tomcat Code
• Zero touch auto-configuration
• State management for the Cassandra JVM
• Token allocation and assignment
• Broken node auto-replacement
• Full and incremental backup to S3
• Restore sequencing from S3
• Grow/Shrink the Cassandra “ring”
86. Astyanax
Available at http://github.com/netflix
• Cassandra java client
• API abstraction on top of the Thrift protocol
• “Fixed” Connection Pool abstraction (vs. Hector)
  – Round robin with Failover
  – Retry-able operations not tied to a connection
  – Netflix PaaS Discovery service integration
  – Host reconnect (fixed interval or exponential backoff)
  – Token aware to save a network hop – lower latency
  – Latency aware to avoid compacting/repairing nodes – lower variance
• Batch mutation: set, put, delete, increment
• Simplified use of serializers via method overloading (vs. Hector)
• ConnectionPoolMonitor interface for counters and tracers
• Composite Column Names replacing deprecated SuperColumns
87. Initializing Astyanax

  // Configuration either set in code or nfastyanax.properties
  platform.ListOfComponentsToInit=LOGGING,APPINFO,DISCOVERY
  netflix.environment=test
  default.astyanax.readConsistency=CL_QUORUM
  default.astyanax.writeConsistency=CL_QUORUM
  MyCluster.MyKeyspace.astyanax.servers=127.0.0.1

  // Must initialize platform for discovery to work
  NFLibraryManager.initLibrary(PlatformManager.class, props, false, true);
  NFLibraryManager.initLibrary(NFAstyanaxManager.class, props, true, false);

  // Open a keyspace instance
  Keyspace keyspace = KeyspaceFactory.openKeyspace("MyCluster", "MyKeyspace");
88. Astyanax Query Example
Paginate through all columns in a row

  ColumnList<String> columns;
  int pagesize = 10;
  try {
      RowQuery<String, String> query = keyspace
          .prepareQuery(CF_STANDARD1)
          .getKey("A")
          .setIsPaginating()
          .withColumnRange(new RangeBuilder().setMaxSize(pagesize).build());
      while (!(columns = query.execute().getResult()).isEmpty()) {
          for (Column<String> c : columns) {
              // process each column in this page
          }
      }
  } catch (ConnectionException e) {
      // handle connection failure
  }
90. Distributed Key-Value Stores
• Cloud has many key-value data stores
  – More complex to keep track of, do backups etc.
  – Each store is much simpler to administer (no DBA)
  – Joins take place in java code
• No schema to change, no scheduled downtime
• Latency for typical queries
  – Memcached is dominated by network latency, <1ms
  – Cassandra takes a few milliseconds
  – SimpleDB replication and REST auth overheads >10ms
91. Multi-Regional Data Replication
• IR Framework – Datacenter Item Replicator
  – Built in 2009, first step to the cloud
  – Oracle to SimpleDB or Cassandra via poll and push
  – Return updates to Oracle via an SQS message queue
• SimpleDB or S3 to Cassandra
  – Data migration tool for global Netflix
• Global SimpleDB and S3 Replication
  – Cross region async updates USA to Europe
92. Transitional Steps
• Bidirectional Replication
  – Oracle to SimpleDB
  – Queued reverse path using SQS
  – Backups remain in the Datacenter via Oracle
• New Cloud-Only Data Sources
  – Cassandra based
  – No replication to the Datacenter
  – Backups performed in the cloud
93. [Diagram: transitional architecture. AWS EC2 front end: Load Balancer, API Proxy, API and Discovery Service; component services backed by Cassandra, memcached, SimpleDB and S3 on EC2 internal disks; replication to and from the Netflix Data Center (Oracle plus memcached) via SQS.]
94. Cutting the Umbilical
• Transition Oracle Data Sources to Cassandra
  – Offload Datacenter Oracle hardware
  – Free up capacity for growth of the remaining services
• Transition SimpleDB+Memcached to Cassandra
  – Primary data sources that need backup
  – Keep the simplest small use cases for now
• New challenges
  – Backup, restore, archive, business continuity
  – Business Intelligence integration
95. [Diagram: after the transition. AWS EC2 front end: Load Balancer, API Proxy, API and Discovery Service; component services backed by memcached and Cassandra on EC2 internal disks, with backup to S3 and SimpleDB.]
96. High Availability
• Cassandra stores 3 local copies, 1 per zone
  – Synchronous access, durable, highly available
  – Read/Write One: fastest, least consistent - ~1ms
  – Read/Write Quorum: 2 of 3, consistent - ~3ms (see the sketch below)
• AWS Availability Zones
  – Separate buildings
  – Separate power etc.
  – Fairly close together
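As a hedged sketch of how this choice surfaces in client code: with Astyanax (introduced earlier) the consistency level can be chosen per operation. The method and enum names below are recalled from the Astyanax API and may differ by version; the column family, row key and values are made up.

  import com.netflix.astyanax.Keyspace;
  import com.netflix.astyanax.MutationBatch;
  import com.netflix.astyanax.model.ColumnFamily;
  import com.netflix.astyanax.model.ConsistencyLevel;

  public class ConsistencyChoices {
      static void writeFavorite(Keyspace keyspace, ColumnFamily<String, String> cf) throws Exception {
          // Write at ONE: fastest (~1ms), least consistent.
          MutationBatch fast = keyspace.prepareMutationBatch()
                  .setConsistencyLevel(ConsistencyLevel.CL_ONE);
          fast.withRow(cf, "user-123").putColumn("favorite", "video-42", null);
          fast.execute();

          // Write at QUORUM: 2 of 3 replicas must ack (~3ms); with quorum reads this
          // gives read-your-writes behaviour across the three zone-local copies.
          MutationBatch safe = keyspace.prepareMutationBatch()
                  .setConsistencyLevel(ConsistencyLevel.CL_QUORUM);
          safe.withRow(cf, "user-123").putColumn("favorite", "video-42", null);
          safe.execute();
      }
  }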
97. “Traditional” Cassandra Write Data Flows
Single Region, Multiple Availability Zone, Not Token Aware
[Diagram: non token aware clients writing to Cassandra nodes spread across zones A, B and C.]
1. Client writes to any Cassandra node
2. Coordinator node replicates to nodes and zones
3. Nodes return ack to the coordinator
4. Coordinator returns ack to the client
5. Data written to the internal commit log disk (no more than 10 seconds later)
If a node goes offline, hinted handoff completes the write when the node comes back up.
Requests can choose to wait for one node, a quorum, or all nodes to ack the write.
SSTable disk writes and compactions occur asynchronously.
98. Astyanax - Cassandra Write Data Flows
Single Region, Multiple Availability Zone, Token Aware
[Diagram: token aware clients writing directly to the replica nodes across zones A, B and C.]
1. Client writes to nodes and zones
2. Nodes return ack to the client
3. Data written to the internal commit log disks (no more than 10 seconds later)
If a node goes offline, hinted handoff completes the write when the node comes back up.
Requests can choose to wait for one node, a quorum, or all nodes to ack the write.
SSTable disk writes and compactions occur asynchronously.
99. Data Flows for Multi-Region Writes
Token Aware, Consistency Level = Local Quorum
[Diagram: US and EU clusters, each spread across zones A, B and C, connected by 100+ms latency links.]
1. Client writes to local replicas
2. Local write acks returned to the client, which continues when 2 of 3 local nodes are committed
3. Local coordinator writes to the remote coordinator
4. When the data arrives, the remote coordinator node acks and copies to the other remote zones
5. Remote nodes ack to the local coordinator
6. Data flushed to the internal commit log disks (no more than 10 seconds later)
If a node or region goes offline, hinted handoff completes the write when the node comes back up.
Nightly global compare and repair jobs ensure everything stays consistent.
100. Remote Copies
• Cassandra duplicates across AWS regions
  – Asynchronous write, replicates at the destination
  – Doesn’t directly affect local read/write latency
• Global Coverage
  – Business agility
  – Follow AWS…
• Local Access
  – Better latency
  – Fault Isolation
[Diagram: replicas placed in multiple AWS regions.]
101. Cassandra Backup
• Full Backup
  – Time based snapshot
  – SSTable compress -> S3
• Incremental Backup
  – SSTable write triggers a compressed copy to S3 (see the sketch below)
• Archive
  – Copy cross region
[Diagram: a Cassandra ring backing up to S3.]
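A hedged sketch of the incremental step: compress a newly written SSTable and copy it to S3. In practice this flow is handled by Priam; the bucket name and key layout below are hypothetical.

  import com.amazonaws.services.s3.AmazonS3;
  import com.amazonaws.services.s3.AmazonS3ClientBuilder;
  import java.io.*;
  import java.util.zip.GZIPOutputStream;

  public class IncrementalBackup {
      // Compress one newly flushed SSTable file and copy it to S3.
      public static void backupSSTable(File sstable, String cluster, String node) throws IOException {
          File gz = new File(sstable.getPath() + ".gz");
          try (InputStream in = new FileInputStream(sstable);
               OutputStream out = new GZIPOutputStream(new FileOutputStream(gz))) {
              byte[] buf = new byte[64 * 1024];
              int n;
              while ((n = in.read(buf)) > 0) {
                  out.write(buf, 0, n);
              }
          }
          AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
          String key = cluster + "/" + node + "/incremental/" + gz.getName();
          s3.putObject("example-cassandra-backups", key, gz);
      }
  }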
102. Cassandra Restore
• Full Restore
  – Replace previous data
• New Ring from Backup
  – New name, old data
• Scripted
  – Create new instances
  – Parallel load - fast
[Diagram: a new Cassandra ring loading data from the S3 backup.]
103. Cassandra Online Analytics
• Brisk = Hadoop + Cassandra
  – “Cassandra Enterprise”
  – Use a split Brisk ring
  – Size each part separately
• Direct Access
  – Keyspaces
  – Hive/Pig/Map-Reduce
  – HDFS as a keyspace
  – Distributed namenode
[Diagram: a Cassandra ring with a Brisk section, backed up to S3.]
104. ETL for Cassandra
• Data is de-normalized over many clusters!
• Too many to restore from backups for ETL
• Solution – read the backup files using Hadoop
• Aegisthus
  – http://techblog.netflix.com/2012/02/aegisthus-bulk-data-pipeline-out-of.html
  – High throughput raw SSTable processing
  – Re-normalizes many clusters to a consistent view
  – Extract, Transform, then Load into Teradata
105. Cassandra Archive
Appropriate level of paranoia needed…
• Archive could be un-readable
  – Restore S3 backups weekly from prod to test, plus the daily ETL
• Archive could be stolen
  – PGP encrypt the archive
• AWS East Region could have a problem
  – Copy data to AWS West
• Production AWS Account could have an issue
  – Separate Archive account with a no-delete S3 ACL
• AWS S3 could have a global problem
  – Create an extra copy on a different cloud vendor….
106. Extending to Multi-Region
In production for UK/Eire support.
Take a Boeing 737 on a domestic flight, upgrade it to a 747 by adding more engines, fuel and bigger wings, and fly it to Europe without landing it on the way…
1. Create cluster in EU
2. Backup US cluster to S3
3. Restore backup in EU
4. Local repair EU cluster
5. Global repair/join
[Diagram: US and EU clusters across zones A, B and C with 100+ms latency between them; the backup flows through S3 from the US ring to the new EU ring.]
107. Tools and Automation
• Developer and Build Tools
  – Jira, Perforce, Eclipse, Jenkins, Ivy, Artifactory
  – Builds create the .war file and .rpm, bake the AMI and launch it
• Custom Netflix Application Console
  – AWS features at enterprise scale (hide the AWS security keys!)
  – The Auto Scaler Group is the unit of deployment to production
• Open Source + Support
  – Apache, Tomcat, Cassandra, Hadoop
  – Datastax support for Cassandra, AWS support for Hadoop via EMR
• Monitoring Tools
  – Alert processing gateway into Pagerduty
  – AppDynamics – Developer focus for cloud http://appdynamics.com
108. NoSQL Developer Migration
• Jason Brown @jasobrown
  – “Cassandra from the Trenches” – slideshare.net/netflix
• Mark Atwood, “Guide to NoSQL, redux”
  – YouTube http://youtu.be/zAbFRiyT3LU
110. Open Source Strategy
• Release PaaS components git-by-git
  – Source at github.com/netflix
  – Intros and techniques at techblog.netflix.com
  – Blog post or new code every week or so
• Motivations
  – Give back to the Apache licensed OSS community
  – Motivate, retain, hire top engineers
  – Create a community that adds features and fixes
111. Current OSS Projects and Posts
[Table flattened in extraction: projects listed under the headings Github / Techblog, Apache Project, and Techblog Post.]
Projects: Priam, Exhibitor, Servo, Astyanax, Curator, Autoscaling scripts, CassJMeter, Zookeeper, Honu, Cassandra, EVCache, Circuit Breaker, Aegisthus
112. Takeaway
Netflix has built and deployed a scalable global Platform as a Service.
Key components of the Netflix PaaS are being released as Open Source projects so you can build your own custom PaaS.
http://github.com/Netflix
http://techblog.netflix.com
http://slideshare.net/Netflix
http://www.linkedin.com/in/adriancockcroft
@adrianco #netflixcloud
End of Part 2 of 3