This document provides an overview of a presentation on cloud architecture and anti-architecture patterns. The presentation discusses moving a company's primary data store from a centralized SQL database to a distributed Cassandra database in the cloud. An initial prototype backup solution was overengineered, becoming complex and taking too long to implement fully. This highlighted the importance of defining anti-architecture constraints upfront to guide development in a simpler direction. The presentation concludes with a discussion of differences between the company's existing datacenter architecture and goals for a cloud architecture, focusing on replacing centralized components with distributed and decoupled alternatives.
1. Cloud Architecture Tutorial: Platform Component Architecture (Part 2 of 3)
QCon London, March 5th, 2012
Adrian Cockcroft, @adrianco, #netflixcloud
http://www.linkedin.com/in/adriancockcroft
2. Don’t Do That! A Discussion of Anti-Architecture (written as an Ignite talk)
8. What I Wanted
• Moving to Cassandra as primary data store
• We need backups!
• We are running on AWS…
“I want Cassandra backups to S3.
Start with full backup, incremental later.
Restore to a different Cassandra cluster.”
9. Additional Goals
“I would like it next week.”
- Keep it simple
- No single point of failure
- Get a once-a-day full backup working first
10. Prototype
• Created S3 bucket
• Carefully figured out a good S3 path hierarchy
• Wrote a simple backup script (see the sketch below)
• Added it to cron
• ….
• Profit! (total time: half a day)
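As a rough illustration of what such a prototype backup step could look like, here is a minimal Java sketch: it asks Cassandra for a snapshot via nodetool and uploads the snapshot files into an S3 path hierarchy. The bucket name, cluster/node names, and directory layout are assumptions for illustration, not the actual Netflix script.

  import com.amazonaws.services.s3.AmazonS3;
  import com.amazonaws.services.s3.AmazonS3ClientBuilder;
  import java.io.File;
  import java.time.LocalDate;
  import java.util.ArrayList;
  import java.util.List;

  // Sketch only: bucket, cluster and node identifiers below are hypothetical.
  public class SnapshotBackup {

      public static void main(String[] args) throws Exception {
          // 1. Ask Cassandra for a point-in-time snapshot (hard links to current SSTables).
          new ProcessBuilder("nodetool", "snapshot", "-t", "daily")
                  .inheritIO().start().waitFor();

          // 2. Upload the snapshot files into a deliberate S3 path hierarchy:
          //    <cluster>/<node>/<date>/<filename>
          AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
          String prefix = "mycluster/node1/" + LocalDate.now() + "/";
          for (File f : findSnapshotFiles(new File("/var/lib/cassandra/data"))) {
              s3.putObject("example-cassandra-backups", prefix + f.getName(), f);
          }
      }

      // Recursively collect files that live under a "snapshots" directory.
      private static List<File> findSnapshotFiles(File dir) {
          List<File> out = new ArrayList<>();
          File[] children = dir.listFiles();
          if (children == null) return out;
          for (File c : children) {
              if (c.isDirectory()) {
                  out.addAll(findSnapshotFiles(c));
              } else if (c.getPath().contains("/snapshots/")) {
                  out.add(c);
              }
          }
          return out;
      }
  }

Run from cron once a day, this is roughly the half-day prototype the slide describes: snapshot, then copy to S3 under a predictable path.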
11. Now comes the hard part!
Restore is trickier, Cassandra is written in Java, a programmer from another team takes over…
“Here’s the S3 bucket, backups are being collected already, please figure out how to restore it. Done by next week perhaps?”
12. Days Pass…
• Programmer is re-writing the backup in Python
• Installs Python 2.7 on CentOS, breaks yum
• Backup remotely invoked from a central point
• Cassandra patched to do incremental backups
13. Weeks Pass…
• Python-based full backup & restore works!
• But only to the Cassandra cluster it came from
• Incremental backup works!
• Restore not done yet…
14. Cassandra in Production
“We do have backups running now, right?” “We’ll get right on it…”
“I want the production backup restored in test.” “Oh, didn’t implement that feature yet…”
15. Whoops!
Production data trashed while setting up backup.
Luckily, it was recoverable from elsewhere.
16. Months Pass
• Python prototype re-written in Java (Priam)
• Integrated with other management functions
• Decentralized backups again (yay!)
• Reliable backups
• Restore to test
• Not simple
• Took too long…
17. Anti-Architecture
• Define the things you don’t want
• Constrain the outcome
• Check that the constraints are being met
• …
• Profit!
21. Goals
• Faster
  – Lower latency than the equivalent datacenter web pages and API calls
  – Measured as mean and 99th percentile
  – For both first hit (e.g. home page) and in-session hits for the same user
• Scalable
  – Avoid needing any more datacenter capacity as subscriber count increases
  – No central vertically scaled databases
  – Leverage AWS elastic capacity effectively
• Available
  – Substantially higher robustness and availability than datacenter services
  – Leverage multiple AWS availability zones
  – No scheduled downtime, no central database schema to change
• Productive
  – Optimize agility of a large development team with automation and tools
  – Leave behind the complex tangled datacenter code base (~8 year old architecture)
  – Enforce clean layered interfaces and re-usable components
22. Datacenter Anti-Patterns
What do we currently do in the datacenter that prevents us from meeting our goals?
23. Architecture
• Software Architecture
  – The abstractions and interfaces that developers build against
• Systems Architecture
  – The service instances that define availability, scalability
• Compose-ability
  – Software architecture that is independent of the systems architecture
  – Decoupled, flexible building-block components
24. Rewrite from Scratch
Not everything is cloud specific
Pay down technical debt
Robust patterns
25. Netflix Datacenter vs. Cloud Arch
  Central SQL Database         ->  Distributed Key/Value NoSQL
  Sticky In-Memory Session     ->  Shared Memcached Session
  Chatty Protocols             ->  Latency Tolerant Protocols
  Tangled Service Interfaces   ->  Layered Service Interfaces
  Instrumented Code            ->  Instrumented Service Patterns
  Fat Complex Objects          ->  Lightweight Serializable Objects
  Components as Jar Files      ->  Components as Services
26. The Central SQL Database
• Datacenter has a central database
  – Everything in one place is convenient until it fails
  – Customers, movies, history, configuration
• Schema changes require downtime
Anti-pattern impacts scalability, availability
27. The Distributed Key-Value Store
• Cloud has many key-value data stores
  – More complex to keep track of, do backups etc.
  – Each store is much simpler to administer (no DBA)
  – Joins take place in java code
  – No schema to change, no scheduled downtime
• Mean Latency for Simple Key Lookup Queries
  – Memcached is dominated by network latency, <1ms
  – Cassandra around one millisecond
  – Oracle for simple queries is a few milliseconds
  – DynamoDB around 5ms
  – SimpleDB replication and REST overheads >10ms
28. The Sticky Session
• Datacenter Sticky Load Balancing
  – Efficient caching for low latency
  – Tricky session handling code
• Encourages concentrated functionality
  – One service that does everything
  – Middle tier load balancer had issues in practice
Anti-pattern impacts productivity, availability
29. Shared Session State
• Elastic Load Balancer
  – We don’t use the cookie-based routing option
  – External “session caching” with memcached
• More flexible fine-grain services
  – Any instance can serve any request
  – Works better with auto-scaled instance counts
30. Chatty Opaque and Brittle Protocols
• Datacenter service protocols
  – Assumed low latency for many simple requests
• Based on serializing existing java objects
  – Inefficient formats
  – Incompatible when definitions change
Anti-pattern causes productivity, latency and availability issues
31. Robust and Flexible Protocols
• Cloud service protocols
  – JSR311/Jersey is used for REST/HTTP service calls
  – Custom client code includes service discovery
  – Support complex data types in a single request
• Apache Avro (illustrated below)
  – Evolved from Protocol Buffers and Thrift
  – Includes a JSON header defining the key/value protocol
  – Avro serialization is half the size and several times faster than Java serialization, but more work to code
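As a rough illustration of the Avro approach (not the Netflix protocol itself), a record schema can be declared in JSON and used to serialize a generic record with the standard Avro library; the record name and fields below are made up for the example.

  import org.apache.avro.Schema;
  import org.apache.avro.generic.GenericData;
  import org.apache.avro.generic.GenericDatumWriter;
  import org.apache.avro.generic.GenericRecord;
  import org.apache.avro.io.BinaryEncoder;
  import org.apache.avro.io.EncoderFactory;
  import java.io.ByteArrayOutputStream;

  public class AvroSketch {
      public static void main(String[] args) throws Exception {
          // Schema defined in JSON; the fields here are illustrative only.
          Schema schema = new Schema.Parser().parse(
              "{\"type\":\"record\",\"name\":\"Visitor\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"country\",\"type\":\"string\"}]}");

          GenericRecord visitor = new GenericData.Record(schema);
          visitor.put("id", 42L);
          visitor.put("country", "GB");

          // Binary encoding uses zigzag/varint packing, so the payload stays compact.
          ByteArrayOutputStream out = new ByteArrayOutputStream();
          BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
          new GenericDatumWriter<GenericRecord>(schema).write(visitor, encoder);
          encoder.flush();
          System.out.println("Serialized " + out.size() + " bytes");
      }
  }

Because the schema travels (or is agreed) separately from the compact binary payload, adding a field does not break old readers, which is the “less brittle across versions” point made on the next slide.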
32. Persisted Protocols
• Persist Avro in Memcached
  – Save space/latency (zigzag encoding, half the size)
  – New keys are ignored
  – Missing keys are handled cleanly
• Avro protocol definitions
  – Less brittle across versions
  – Can be written in JSON or generated from POJOs
  – It’s hard, needs better tooling
33. Tangled Service Interfaces
• Datacenter implementation is exposed
  – Oracle SQL queries mixed into business logic
• Tangled code
  – Deep dependencies, false sharing
• Data providers with sideways dependencies
  – Everything depends on everything else
Anti-pattern affects productivity, availability
34. Untangled Service Interfaces
• New cloud code with strict layering
  – Compile against the interface jar (see the sketch below)
  – Can use Spring runtime binding to enforce it
  – Fine-grain services as components
• The service interface is the service
  – Implementation is completely hidden
  – Can be implemented locally or remotely
  – Implementation can evolve independently
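A minimal sketch of the layering idea, with hypothetical names: consumers compile only against a service interface jar, and Spring binds whichever concrete (local or remote) implementation is deployed at runtime.

  // In the published interface jar: the only thing consumers compile against.
  public interface VideoMetadataService {
      String getTitle(long videoId);
  }

  // In the provider's implementation jar, swappable without recompiling consumers.
  @org.springframework.stereotype.Service
  class LocalVideoMetadataService implements VideoMetadataService {
      public String getTitle(long videoId) {
          return "title-" + videoId; // stand-in for a real lookup
      }
  }

  // A consumer sees only the interface; Spring injects whichever binding is deployed.
  @org.springframework.stereotype.Component
  class HomePageBuilder {
      private final VideoMetadataService videos;

      @org.springframework.beans.factory.annotation.Autowired
      HomePageBuilder(VideoMetadataService videos) {
          this.videos = videos;
      }

      String render(long videoId) {
          return "<h1>" + videos.getTitle(videoId) + "</h1>";
      }
  }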
35. Untangled Service Interfaces
Two layers:
• SAL - Service Access Library
  – Basic serialization and error handling
  – REST or POJOs defined by the data provider
• ESL - Extended Service Library
  – Caching, conveniences, can combine several SALs
  – Exposes the faceted type system (described later)
  – Interface defined by the data consumer in many cases
37. Service Architecture Patterns
• Internal Interfaces Between Services
  – Common patterns as templates
  – Highly instrumented, observable, analytics
  – Service Level Agreements – SLAs
• Library templates for generic features
  – Instrumented Netflix Base Servlet template
  – Instrumented generic client interface template
  – Instrumented S3, SimpleDB, Memcached clients
38. [Diagram: instruments every step in the call. On the client side each request records a request start timestamp, outbound serialize start/end timestamps, a network send timestamp, then on the response path a network receive timestamp and inbound deserialize start/end timestamps. On the service side the matching network receive, inbound deserialize, execute request start/end, outbound serialize start/end, and network send timestamps are recorded.]
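A hedged sketch of the kind of per-step timestamping the diagram implies; the class, step names, and payload are invented for illustration and are not the Netflix instrumentation library.

  import java.util.LinkedHashMap;
  import java.util.Map;

  // Records one timestamp per step of a client call: serialize, send, receive, deserialize.
  public class CallTimer {
      private final Map<String, Long> stamps = new LinkedHashMap<>();

      public void mark(String step) {
          stamps.put(step, System.nanoTime());
      }

      public long elapsedMicros(String from, String to) {
          return (stamps.get(to) - stamps.get(from)) / 1_000;
      }

      public static void main(String[] args) throws Exception {
          CallTimer t = new CallTimer();
          t.mark("requestStart");
          t.mark("serializeStart");
          byte[] payload = "{\"videoId\":42}".getBytes("UTF-8");
          t.mark("serializeEnd");
          t.mark("networkSend");
          Thread.sleep(5);              // stand-in for wire time plus service time
          t.mark("networkReceive");
          t.mark("deserializeEnd");
          System.out.println("serialize: "
                  + t.elapsedMicros("serializeStart", "serializeEnd") + "us, network+service: "
                  + t.elapsedMicros("networkSend", "networkReceive") + "us, payload "
                  + payload.length + " bytes");
      }
  }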
39. Boundary Interfaces
• Isolate teams from external dependencies
  – Fake SAL built by the cloud team
  – Real SAL provided by the data provider team later
  – ESL built by the cloud team using faceted objects
• Fake data sources allow development to start
  – e.g. Fake Identity SAL for a test set of customers
  – Development solidifies dependencies early
  – Helps the external team provide the right interface
40. One Object That Does Everything
• Datacenter uses a few big complex objects
  – Movie and Customer objects are the foundation
  – Good choice for a small team and one instance
  – Problematic for large teams and many instances
• False sharing causes tangled dependencies
  – Unproductive re-integration work
Anti-pattern impacting productivity and availability
41. An Interface For Each Component
• Cloud uses faceted Video and Visitor
  – Basic types hold only the identifier
  – Facets scope the interface you actually need
  – Each component can define its own facets
• No false-sharing and dependency chains
  – The type manager converts between facets as needed
  – video.asA(PresentationVideo) for www
  – video.asA(MerchableVideo) for the middle tier
43. Avoiding “Level Confusion” [Catataxis]
• Business Level Objects (BLO?)
  – Customers, Movies, etc.
  – Conceptual: exist only between the ears
• Abstract Types
  – Abstractions that try to model aspects of the business level objects
  – Often captured by Java interfaces
• Implementations
  – Specific coded implementations of the abstract types
  – A Java class, or a collection of rows in a database…
44. Facets
• No single Abstract Type captures everything about a BLO
  – Different teams see different “facets”
    • Customer: Account status; Billing history; Viewing history; A/B test assignments
    • Movie: Availability; Popularity; Synopsis; Cast
  – Loosely coupled, tightly aligned(!)
• All facets for a BLO should inherit from one “basic” type that has minimal behavior
45. Basic Types
• Module external interfaces deal in basic types; internal calls are free to use more complex facets (see the sketch below)
• Generic machinery to switch between facets

  Business Level Object   Java Basic Type
  Movie (TV show…)        Video
  Customer                Visitor
  Category                VTag
  Country                 ISOCountry
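A rough Java sketch of how a basic type plus facets might be declared. Video and PresentationVideo/MerchableVideo are named in the slides; the methods on them are assumptions made up for illustration.

  // Basic type: holds only the identity of the BLO and minimal behaviour.
  public interface Video {
      long getId();
      <T extends Video> T asA(Class<T> facet); // delegates to the type manager
  }

  // One facet per consumer need; each component can define its own.
  public interface PresentationVideo extends Video {
      String getTitle();
      String getBoxArtUrl();
  }

  public interface MerchableVideo extends Video {
      double getPopularityScore();
  }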
46. Type Manager
• Holds the “factory” objects that manage instances of facets
  – Typically one factory per facet
  – Factories are free to implement any instance management policy they want
• Factories register with the Type Manager (see the sketch below)
  – Callers never interact directly with the factories
  – Mock managers?
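A minimal sketch of the registration idea, with hypothetical factory and lookup signatures; only findObject(Class, id) echoes the usage shown on slide 49, everything else is an assumption.

  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;

  // One factory per facet; each factory decides how its instances are created or cached.
  interface FacetFactory<T> {
      T findById(long id);
  }

  // Callers ask the type manager for a facet; they never touch the factories directly.
  public final class SimpleTypeManager {
      private static final Map<Class<?>, FacetFactory<?>> factories = new ConcurrentHashMap<>();

      public static <T> void register(Class<T> facet, FacetFactory<T> factory) {
          factories.put(facet, factory);
      }

      @SuppressWarnings("unchecked")
      public static <T> T findObject(Class<T> facet, long id) {
          FacetFactory<T> f = (FacetFactory<T>) factories.get(facet);
          if (f == null) throw new IllegalArgumentException("No factory registered for " + facet);
          return f.findById(id);
      }
  }

A mock manager for tests would simply register fake factories for the facets a module needs.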
47. Switching Facets
• Each Basic Type B implements a method that uses the Type Manager to find facet implementations of the same BLO:

  <T extends B> T asA(Class<T> c)

• Example:

  Visitor visitor = xxx;
  ABClient abClient = visitor.asA(ABClient.class);
  assert(visitor.equals(abClient));

• Look Ma, no cast!
  – Facets are equal, but not necessarily ==.
48. IDs! (huh) What are they good for?
• IDs exist because implementations need to externalize objects and maintain their identity
  – Persist in a DB, or talk to a remote service
  – Different implementations of a type of BLO model the same object iff they have the same ID
  – Basic Types use IDs to manage facets, determine equality, etc.
49. Converting IDs <-> Objects

  Long id = xx;
  MyVisitor visitor =
      TypeManager.findObject(Visitor.class, id)
                 .asA(MyVisitor.class);
  assert(id.equals(visitor.getId()));

  // Or more efficiently…
  MyVisitor visitor2 =
      TypeManager.findObject(Visitor.class, id, MyVisitor.class);

  // There are also efficient bulk conversion methods
  Collection<Long> ids = xxx;
  List<MyVisitor> visitors =
      TypeManager.findObjects(Visitor.class, ids, MyVisitor.class);
50. Stan’s Soap Box
• Don’t pass around IDs when you mean to refer to the BLO; that is Level Confusion
• Using Basic Types helps the compiler help you; compile time problems are better than run time problems
• More readable by people, but beware that asA operations may be a lot of work
• (Is this a way to approximate multiple-inheritance in Java?)
51. Software Architecture Patterns
• Object Models
  – Basic and derived types, facets, serializable
  – Pass by reference within a service
  – Pass by value between services
• Computation and I/O Models
  – Service execution using Best Effort / Futures
  – Common thread pool management
  – Circuit breakers to manage and contain failures (see the sketch below)
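A hedged, minimal circuit-breaker sketch (not the Netflix implementation): after a threshold of consecutive failures the breaker opens and fails fast to a fallback until a cool-down period has passed.

  import java.util.concurrent.Callable;

  // Minimal circuit breaker: opens after N consecutive failures, retries after a cool-down.
  public class CircuitBreaker {
      private final int failureThreshold;
      private final long coolDownMillis;
      private int consecutiveFailures = 0;
      private long openedAt = 0;

      public CircuitBreaker(int failureThreshold, long coolDownMillis) {
          this.failureThreshold = failureThreshold;
          this.coolDownMillis = coolDownMillis;
      }

      public synchronized <T> T call(Callable<T> action, T fallback) {
          boolean open = consecutiveFailures >= failureThreshold
                  && System.currentTimeMillis() - openedAt < coolDownMillis;
          if (open) {
              return fallback;                 // fail fast, contain the failure
          }
          try {
              T result = action.call();
              consecutiveFailures = 0;         // success closes the breaker
              return result;
          } catch (Exception e) {
              if (++consecutiveFailures >= failureThreshold) {
                  openedAt = System.currentTimeMillis();
              }
              return fallback;
          }
      }
  }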
52. Model Driven Architecture
• Traditional Datacenter Practices
  – Lots of unique hand-tweaked systems
  – Hard to enforce patterns
  – Some use of Puppet to automate changes
• Model Driven Cloud Architecture
  – Perforce/Ivy/Jenkins based builds for everything
  – Every production instance is a pre-baked AMI
  – Every application is managed by an Autoscaler
Every change is a new AMI
54. Netflix PaaS Principles
• Maximum Functionality
  – Developer productivity and agility
• Leverage as much of AWS as possible
  – AWS is making huge investments in features/scale
• Interfaces that isolate Apps from AWS
  – Avoid lock-in to specific AWS API details
• Portability is a long term goal
  – Gets easier as other vendors catch up with AWS
55. Netflix Global PaaS
• Architecture Features and Overview
• Portals and Explorers
• Platform Services
• Platform APIs
• Platform Frameworks
• Persistence
• Scalability Benchmark
56. Global PaaS?
Toys are nice, but this is the real thing…
• Supports all AWS Availability Zones and Regions
• Supports multiple AWS accounts {test, prod, etc.}
• Cross Region/Acct Data Replication and Archiving
• Internationalized, Localized and GeoIP routing
• Security is fine grain, dynamic AWS keys
• Autoscaling to thousands of instances
• Monitoring for millions of metrics
• Productive for 100s of developers on one product
• 23M+ users: USA, Canada, Latin America, UK, Eire
57. Basic PaaS Entities
• AWS Based Entities
  – Instances and Machine Images, Elastic IP Addresses
  – Security Groups, Load Balancers, Autoscale Groups
  – Availability Zones and Geographic Regions
• Netflix PaaS Entities
  – Applications (registered services)
  – Clusters (versioned Autoscale Groups for an App)
  – Properties (dynamic hierarchical configuration)
58. Core PaaS Services
• AWS Based Services
  – S3 storage, up to 5TB files, parallel multipart writes
  – SQS – Simple Queue Service. Messaging layer.
• Netflix Based Services
  – EVCache – memcached based ephemeral cache
  – Cassandra – distributed data store
• External Services
  – GeoIP Lookup interfaced to a vendor
  – Keystore HSM in the Netflix Datacenter
59. Instance Architecture
[Diagram: the instance software stack. Linux Base AMI (CentOS or Ubuntu); optional Apache frontend, memcached and non-java apps; Java (JDK 6 or 7); Tomcat; the application servlet plus base server, platform and interface jars for dependent services; healthcheck and status servlets, JMX interface, Servo autoscale; monitoring via AppDynamics appagent and machineagent, Epic, log rotation to S3, GC and thread dump logging.]
60. Security Architecture
• Instance Level Security baked into the base AMI
  – Login: ssh only allowed via portal (not between instances)
  – Each app type runs as its own userid app{test|prod}
• AWS Security, Identity and Access Management
  – Each app has its own security group (firewall ports)
  – Fine grain user roles and resource ACLs
• Key Management
  – AWS Keys dynamically provisioned, easy updates
  – High grade app specific key management support
62. Portals and Explorers
• Netflix Application Console (NAC)
  – Primary AWS provisioning/config interface
• AWS Usage Analyzer
  – Breaks down costs by application and resource
• Cassandra Explorer
  – Browse clusters, keyspaces, column families
• Base Server Explorer
  – Browse service endpoints, configuration, perf
68. Platform Services
• Discovery – service registry for “Applications”
• Introspection – Entrypoints
• Cryptex – Dynamic security key management
• Geo – Geographic IP lookup
• Platformservice – Dynamic property configuration
• Localization – manage and lookup local translations
• Evcache – ephemeral volatile cache
• Cassandra – Cross zone/region distributed data store
• Zookeeper – Distributed Coordination (Curator)
• Various proxies – access to old datacenter stuff
69. Introspection - Entrypoints
• REST API for tools, apps, explorers, monkeys… (example call below)
  – E.g. GET /REST/v1/instance/$INSTANCE_ID
• AWS Resources
  – Autoscaling Groups, EIP Groups, Instances
• Netflix PaaS Resources
  – Discovery Applications, Clusters of ASGs, History
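As a simple illustration, such an internal REST endpoint can be called with plain java.net classes. Only the GET /REST/v1/instance/... path comes from the slide; the host name is a placeholder and the instance id is the example used on the next slide.

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.net.HttpURLConnection;
  import java.net.URL;

  public class EntrypointsClient {
      public static void main(String[] args) throws Exception {
          String instanceId = "i-4e108521";   // example instance id, as on slide 70
          URL url = new URL("http://entrypoints.example.internal/REST/v1/instance/" + instanceId);
          HttpURLConnection conn = (HttpURLConnection) url.openConnection();
          conn.setRequestMethod("GET");
          conn.setRequestProperty("Accept", "application/json");
          try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
              String line;
              while ((line = in.readLine()) != null) {
                  System.out.println(line);   // JSON description of the instance
              }
          }
      }
  }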
70. Entrypoints Queries
MongoDB used for low traffic complex queries against complex objects

  Description                                                              Range expression
  Find all active instances.                                               all()
  Find all instances associated with a group name.                         %(cloudmonkey)
  Find all instances associated with a discovery group.                    /^cloudmonkey$/discovery()
  Find all auto scale groups with no instances.                            asg(),-has(INSTANCES;asg())
  How many instances are not in an auto scale group?                       count(all(),-info(eval(INSTANCES;asg())))
  What groups include an instance?                                         *(i-4e108521)
  What auto scale groups and elastic load balancers include an instance?   filter(TYPE;asg,elb;*(i-4e108521))
  What instance has a given public ip?                                     filter(PUBLIC_IP;174.129.188.{0..255};all())
71. Metrics Framework
• System and Application
  – Collection, Aggregation, Querying and Reporting
  – Non-blocking logging, avoids log4j lock contention
  – Honu-Streaming -> S3 -> EMR -> Hive
• Performance, Robustness, Monitoring, Analysis
  – Tracers, Counters – explicit code instrumentation log (see the sketch below)
  – Real Time Tracers/Counters
  – SLA – service level response time percentiles
  – Servo annotated JMX extract to Cloudwatch
• Latency Monkey Infrastructure
  – Inject random delays into service responses
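A generic sketch of the explicit counter/tracer style of instrumentation mentioned above; this is not the Netflix or Servo API, just the shape of the idea.

  import java.util.concurrent.ConcurrentHashMap;
  import java.util.concurrent.ConcurrentMap;
  import java.util.concurrent.atomic.AtomicLong;

  // Explicit code instrumentation: named counters plus a simple tracer for timing a block.
  public final class Metrics {
      private static final ConcurrentMap<String, AtomicLong> counters = new ConcurrentHashMap<>();

      public static void increment(String name) {
          counters.computeIfAbsent(name, k -> new AtomicLong()).incrementAndGet();
      }

      // Usage: long t = Metrics.traceStart(); ... Metrics.traceEnd("s3.put", t);
      public static long traceStart() {
          return System.nanoTime();
      }

      public static void traceEnd(String name, long startNanos) {
          long micros = (System.nanoTime() - startNanos) / 1_000;
          increment(name + ".count");
          // A real framework would feed a non-blocking log and percentile estimator here.
          System.out.println(name + " took " + micros + "us");
      }
  }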
73. Interprocess Communication
• Discovery Service registry for “applications”
  – “Here I am” call every 30s, drop after 3 missed (see the sketch below)
  – “Where is everyone” call
  – Redundant, distributed, moving to Zookeeper
• NIWS – Netflix Internal Web Service client
  – Software Middle Tier Load Balancer
  – Failure retry moves to the next instance
  – Many options for encoding, etc.
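A toy sketch of the heartbeat pattern described above, assuming a hypothetical registry interface: each instance announces itself every 30 seconds and the registry drops it after roughly three missed heartbeats.

  import java.util.concurrent.Executors;
  import java.util.concurrent.ScheduledExecutorService;
  import java.util.concurrent.TimeUnit;

  public class HeartbeatClient {

      // Hypothetical registry call; in practice this would be an HTTP request to Discovery.
      interface Registry {
          void hereIAm(String appName, String instanceId);
      }

      public static void start(final Registry registry, final String app, final String instance) {
          ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
          // "Here I am" every 30s; the server side expires the registration
          // if about 3 consecutive heartbeats (90s) are missed.
          scheduler.scheduleAtFixedRate(
                  () -> registry.hereIAm(app, instance),
                  0, 30, TimeUnit.SECONDS);
      }

      public static void main(String[] args) {
          start((app, id) -> System.out.println("register " + app + "/" + id), "api", "i-12345678");
      }
  }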
74. Security Key Management
• AKMS
  – Dynamic Key Management interface
  – Update AWS keys at runtime, no restart
  – All keys stored securely, none on disk or in the AMI
• Cryptex - Flexible key store
  – Low grade keys processed in the client
  – Medium grade keys processed by the Cryptex service
  – High grade keys processed by hardware (Ingrian)
75. AWS Persistence Services
• SimpleDB
  – Got us started, migrated to Cassandra now
  – NFSDB - Instrumented wrapper library
  – Domain and Item sharding (workarounds)
• S3
  – Upgraded/Instrumented JetS3t based interface
  – Supports multipart upload and 5TB files
  – Global S3 endpoint management
76. Aside: Adrian’s Rant on CAP Theorem
Choose Consistency or Availability when Partitioned
• Instances and Networks will fail
• Network failure = Partition – “P” is a given
• Distributed Systems: two choices – CP or AP
• “Vendor claims CA”
  – Usually they mean available when instances fail
• Master-Slave = Consistent when Partitioned
  – You can’t write unless you can see the master
• No-Master = Available when Partitioned
  – Writes proceed, conflicts will be patched up later
78. Basic Requirements
• Supports running on Amazon EC2
• Supports Amazon Availability Zones
• Low latency, low latency variance
• High and scalable read and write throughput
• Large and scalable capacity, no external sharding
• “AP” Eventually Consistent
• Data integrity checks and repairs
• Online Snapshot Backup, Restore/Rollback
79. Scenario – Immediate Read after Write
Q1: Is routing and replication zone aware?
[Diagram: a TV device appends a new favorite via a round-robin load balancer to the API in zone A, then reads the new favorites list via the API in zone B; the Favorites stores in zones A and B are linked by replication.]
80. Network Partition
Q2: What happens next?
[Diagram: the same scenario, but with no replication between the Favorites stores in zones A and B.]
81. Network Partition
Q3: Supports Append vs. Read/Modify/Write?
[Diagram: the same scenario with a read-modify-write (RMW) client; one zone still holds the old favorites list while the other has the new one, and replication must reconcile them.]
82. Silent Data Corruption
Q4: How is it detected and corrected?
[Diagram: the same append scenario, but the replicated favorites list is corrupted on disk or via the network.]
83. Netflix Platform Persistence
• Ephemeral Volatile Cache – evcache
  – Discovery-aware memcached based backend
  – Client abstractions for zone aware replication
  – Option to write to all zones, fast read from local
• Cassandra
  – Highly available and scalable (more later…)
• MongoDB
  – Complex object/query model for small scale use
• MySQL
  – Hard to scale, legacy and small relational models
84. Why Cassandra?
• We value Availability over Consistency – AP
  – Cassandra is a Java distributed systems toolkit
• We have a building full of Java engineers
  – Riak is in Erlang – a blessing and a curse…
• We want FOSS + Support
  – Voldemort doesn’t have a support model
• Writes are intrinsically harder than reads
  – HBase is CP, optimized for reads, & single namenode issues
• Cassandra works, running ~55 clusters
  – Step by step into full production over the last year
85. Priam – Cassandra Automation
Available at http://github.com/netflix
• Netflix Platform Tomcat Code
• Zero touch auto-configuration
• State management for the Cassandra JVM
• Token allocation and assignment
• Broken node auto-replacement
• Full and incremental backup to S3
• Restore sequencing from S3
• Grow/Shrink the Cassandra “ring”
86. Astyanax
Available at http://github.com/netflix
• Cassandra java client
• API abstraction on top of the Thrift protocol
• “Fixed” Connection Pool abstraction (vs. Hector)
  – Round robin with Failover
  – Retry-able operations not tied to a connection
  – Netflix PaaS Discovery service integration
  – Host reconnect (fixed interval or exponential backoff)
  – Token aware to save a network hop – lower latency
  – Latency aware to avoid compacting/repairing nodes – lower variance
• Batch mutation: set, put, delete, increment
• Simplified use of serializers via method overloading (vs. Hector)
• ConnectionPoolMonitor interface for counters and tracers
• Composite Column Names replacing deprecated SuperColumns
87. Initializing Astyanax

  // Configuration either set in code or nfastyanax.properties
  platform.ListOfComponentsToInit=LOGGING,APPINFO,DISCOVERY
  netflix.environment=test
  default.astyanax.readConsistency=CL_QUORUM
  default.astyanax.writeConsistency=CL_QUORUM
  MyCluster.MyKeyspace.astyanax.servers=127.0.0.1

  // Must initialize platform for discovery to work
  NFLibraryManager.initLibrary(PlatformManager.class, props, false, true);
  NFLibraryManager.initLibrary(NFAstyanaxManager.class, props, true, false);

  // Open a keyspace instance
  Keyspace keyspace = KeyspaceFactory.openKeyspace("MyCluster", "MyKeyspace");
88. Astyanax Query Example
Paginate through all columns in a row

  ColumnList<String> columns;
  int pagesize = 10;
  try {
      RowQuery<String, String> query = keyspace
          .prepareQuery(CF_STANDARD1)
          .getKey("A")
          .setIsPaginating()
          .withColumnRange(new RangeBuilder().setMaxSize(pagesize).build());
      while (!(columns = query.execute().getResult()).isEmpty()) {
          for (Column<String> c : columns) {
              // process each column in this page
          }
      }
  } catch (ConnectionException e) {
      // handle connection failure
  }
90. Distributed Key-Value Stores
• Cloud has many key-value data stores
  – More complex to keep track of, do backups etc.
  – Each store is much simpler to administer (no DBA)
  – Joins take place in java code
• No schema to change, no scheduled downtime
• Latency for typical queries
  – Memcached is dominated by network latency, <1ms
  – Cassandra takes a few milliseconds
  – SimpleDB replication and REST auth overheads >10ms
91. Multi-Regional Data Replication
• IR Framework – Datacenter Item Replicator
  – Built in 2009, first step to the cloud
  – Oracle to SimpleDB or Cassandra via poll and push
  – Return updates to Oracle via an SQS message queue
• SimpleDB or S3 to Cassandra
  – Data migration tool for global Netflix
• Global SimpleDB and S3 Replication
  – Cross region async updates USA to Europe
92. Transitional Steps
• Bidirectional Replication
  – Oracle to SimpleDB
  – Queued reverse path using SQS
  – Backups remain in the Datacenter via Oracle
• New Cloud-Only Data Sources
  – Cassandra based
  – No replication to the Datacenter
  – Backups performed in the cloud
93. [Diagram: transitional architecture. AWS EC2 front end: Load Balancer, API Proxy, API and Discovery Service; component services backed by Cassandra, memcached, SimpleDB and S3 on EC2 internal disks; replication to and from the Netflix Data Center (Oracle plus memcached) via SQS.]
94. Cutting the Umbilical
• Transition Oracle Data Sources to Cassandra
  – Offload Datacenter Oracle hardware
  – Free up capacity for growth of the remaining services
• Transition SimpleDB+Memcached to Cassandra
  – Primary data sources that need backup
  – Keep the simplest small use cases for now
• New challenges
  – Backup, restore, archive, business continuity
  – Business Intelligence integration
95. [Diagram: after the transition. AWS EC2 front end: Load Balancer, API Proxy, API and Discovery Service; component services backed by memcached and Cassandra on EC2 internal disks, with backup to S3 and SimpleDB.]
96. High Availability
• Cassandra stores 3 local copies, 1 per zone
  – Synchronous access, durable, highly available
  – Read/Write One: fastest, least consistent - ~1ms
  – Read/Write Quorum: 2 of 3, consistent - ~3ms (see the sketch below)
• AWS Availability Zones
  – Separate buildings
  – Separate power etc.
  – Fairly close together
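As a hedged sketch of how this choice surfaces in client code: with Astyanax (introduced earlier) the consistency level can be chosen per operation. The method and enum names below are recalled from the Astyanax API and may differ by version; the column family, row key and values are made up.

  import com.netflix.astyanax.Keyspace;
  import com.netflix.astyanax.MutationBatch;
  import com.netflix.astyanax.model.ColumnFamily;
  import com.netflix.astyanax.model.ConsistencyLevel;

  public class ConsistencyChoices {
      static void writeFavorite(Keyspace keyspace, ColumnFamily<String, String> cf) throws Exception {
          // Write at ONE: fastest (~1ms), least consistent.
          MutationBatch fast = keyspace.prepareMutationBatch()
                  .setConsistencyLevel(ConsistencyLevel.CL_ONE);
          fast.withRow(cf, "user-123").putColumn("favorite", "video-42", null);
          fast.execute();

          // Write at QUORUM: 2 of 3 replicas must ack (~3ms); with quorum reads this
          // gives read-your-writes behaviour across the three zone-local copies.
          MutationBatch safe = keyspace.prepareMutationBatch()
                  .setConsistencyLevel(ConsistencyLevel.CL_QUORUM);
          safe.withRow(cf, "user-123").putColumn("favorite", "video-42", null);
          safe.execute();
      }
  }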
97. “Traditional” Cassandra Write Data Flows
Single Region, Multiple Availability Zone, Not Token Aware
[Diagram: non token aware clients writing to Cassandra nodes spread across zones A, B and C.]
1. Client writes to any Cassandra node
2. Coordinator node replicates to nodes and zones
3. Nodes return ack to the coordinator
4. Coordinator returns ack to the client
5. Data written to the internal commit log disk (no more than 10 seconds later)
If a node goes offline, hinted handoff completes the write when the node comes back up.
Requests can choose to wait for one node, a quorum, or all nodes to ack the write.
SSTable disk writes and compactions occur asynchronously.
98. Astyanax - Cassandra Write Data Flows
Single Region, Multiple Availability Zone, Token Aware
[Diagram: token aware clients writing directly to the replica nodes across zones A, B and C.]
1. Client writes to nodes and zones
2. Nodes return ack to the client
3. Data written to the internal commit log disks (no more than 10 seconds later)
If a node goes offline, hinted handoff completes the write when the node comes back up.
Requests can choose to wait for one node, a quorum, or all nodes to ack the write.
SSTable disk writes and compactions occur asynchronously.
99. Data Flows for Multi-Region Writes
Token Aware, Consistency Level = Local Quorum
[Diagram: US and EU clusters, each spread across zones A, B and C, connected by 100+ms latency links.]
1. Client writes to local replicas
2. Local write acks returned to the client, which continues when 2 of 3 local nodes are committed
3. Local coordinator writes to the remote coordinator
4. When the data arrives, the remote coordinator node acks and copies to the other remote zones
5. Remote nodes ack to the local coordinator
6. Data flushed to the internal commit log disks (no more than 10 seconds later)
If a node or region goes offline, hinted handoff completes the write when the node comes back up.
Nightly global compare and repair jobs ensure everything stays consistent.
100. Remote Copies
• Cassandra duplicates across AWS regions
  – Asynchronous write, replicates at the destination
  – Doesn’t directly affect local read/write latency
• Global Coverage
  – Business agility
  – Follow AWS…
• Local Access
  – Better latency
  – Fault Isolation
[Diagram: replicas placed in multiple AWS regions.]
101. Cassandra Backup
• Full Backup
  – Time based snapshot
  – SSTable compress -> S3
• Incremental Backup
  – SSTable write triggers a compressed copy to S3 (see the sketch below)
• Archive
  – Copy cross region
[Diagram: a Cassandra ring backing up to S3.]
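A hedged sketch of the incremental step: compress a newly written SSTable and copy it to S3. In practice this flow is handled by Priam; the bucket name and key layout below are hypothetical.

  import com.amazonaws.services.s3.AmazonS3;
  import com.amazonaws.services.s3.AmazonS3ClientBuilder;
  import java.io.*;
  import java.util.zip.GZIPOutputStream;

  public class IncrementalBackup {
      // Compress one newly flushed SSTable file and copy it to S3.
      public static void backupSSTable(File sstable, String cluster, String node) throws IOException {
          File gz = new File(sstable.getPath() + ".gz");
          try (InputStream in = new FileInputStream(sstable);
               OutputStream out = new GZIPOutputStream(new FileOutputStream(gz))) {
              byte[] buf = new byte[64 * 1024];
              int n;
              while ((n = in.read(buf)) > 0) {
                  out.write(buf, 0, n);
              }
          }
          AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
          String key = cluster + "/" + node + "/incremental/" + gz.getName();
          s3.putObject("example-cassandra-backups", key, gz);
      }
  }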
102. Cassandra Restore
• Full Restore
  – Replace previous data
• New Ring from Backup
  – New name, old data
• Scripted
  – Create new instances
  – Parallel load - fast
[Diagram: a new Cassandra ring loading data from the S3 backup.]
103. Cassandra Online Analytics
• Brisk = Hadoop + Cassandra
  – “Cassandra Enterprise”
  – Use a split Brisk ring
  – Size each part separately
• Direct Access
  – Keyspaces
  – Hive/Pig/Map-Reduce
  – HDFS as a keyspace
  – Distributed namenode
[Diagram: a Cassandra ring with a Brisk section, backed up to S3.]
104. ETL for Cassandra
• Data is de-normalized over many clusters!
• Too many to restore from backups for ETL
• Solution – read the backup files using Hadoop
• Aegisthus
  – http://techblog.netflix.com/2012/02/aegisthus-bulk-data-pipeline-out-of.html
  – High throughput raw SSTable processing
  – Re-normalizes many clusters to a consistent view
  – Extract, Transform, then Load into Teradata
105. Cassandra Archive
Appropriate level of paranoia needed…
• Archive could be un-readable
  – Restore S3 backups weekly from prod to test, plus the daily ETL
• Archive could be stolen
  – PGP encrypt the archive
• AWS East Region could have a problem
  – Copy data to AWS West
• Production AWS Account could have an issue
  – Separate Archive account with a no-delete S3 ACL
• AWS S3 could have a global problem
  – Create an extra copy on a different cloud vendor….
106. Extending to Multi-Region
In production for UK/Eire support.
Take a Boeing 737 on a domestic flight, upgrade it to a 747 by adding more engines, fuel and bigger wings, and fly it to Europe without landing it on the way…
1. Create cluster in EU
2. Backup US cluster to S3
3. Restore backup in EU
4. Local repair EU cluster
5. Global repair/join
[Diagram: US and EU clusters across zones A, B and C with 100+ms latency between them; the backup flows through S3 from the US ring to the new EU ring.]
107. Tools and Automation
• Developer and Build Tools
  – Jira, Perforce, Eclipse, Jenkins, Ivy, Artifactory
  – Builds create the .war file and .rpm, bake the AMI and launch it
• Custom Netflix Application Console
  – AWS features at enterprise scale (hide the AWS security keys!)
  – The Auto Scaler Group is the unit of deployment to production
• Open Source + Support
  – Apache, Tomcat, Cassandra, Hadoop
  – Datastax support for Cassandra, AWS support for Hadoop via EMR
• Monitoring Tools
  – Alert processing gateway into Pagerduty
  – AppDynamics – Developer focus for cloud http://appdynamics.com
108. NoSQL Developer Migration
• Jason Brown @jasobrown
  – “Cassandra from the Trenches” – slideshare.net/netflix
• Mark Atwood, “Guide to NoSQL, redux”
  – YouTube http://youtu.be/zAbFRiyT3LU
110. Open Source Strategy
• Release PaaS components git-by-git
  – Source at github.com/netflix
  – Intros and techniques at techblog.netflix.com
  – Blog post or new code every week or so
• Motivations
  – Give back to the Apache licensed OSS community
  – Motivate, retain, hire top engineers
  – Create a community that adds features and fixes
111. Current OSS Projects and Posts
[Table flattened in extraction: projects listed under the headings Github / Techblog, Apache Project, and Techblog Post.]
Projects: Priam, Exhibitor, Servo, Astyanax, Curator, Autoscaling scripts, CassJMeter, Zookeeper, Honu, Cassandra, EVCache, Circuit Breaker, Aegisthus
112. Takeaway
Netflix has built and deployed a scalable global Platform as a Service.
Key components of the Netflix PaaS are being released as Open Source projects so you can build your own custom PaaS.
http://github.com/Netflix
http://techblog.netflix.com
http://slideshare.net/Netflix
http://www.linkedin.com/in/adriancockcroft
@adrianco #netflixcloud
End of Part 2 of 3