Accumulo design

APACHE ACCUMULO
From a design perspective

SCALABLE KEY-VALUE STORE
BASED ON GOOGLE'S
BIGTABLE

BIGTABLE FEATURES
• Distributes data across many commodity servers

• Sorts data by key for fast lookup of values by key

• Scans across multiple key-value pairs

• Highly consistent writes to a single row

• Support for MapReduce jobs

DATA MODEL
Key: Row ID | Column (Family, Qualifier) | Timestamp
Value

Row ID  Col Fam    Col Qual    Timestamp  Value
Bob     Email      id0023      20120301   Hey joe, can you send ...
Bob     Email      id0024      20120302   Re: next Thursday ...
Bob     UserPrefs  Background  20130101   Grey
Fred    Email      id0001      20080302   Welcome to gmail ...
Sarah   Email      id0004      20130201   Hi again ...
Sara    Videos     ytid009     20100303   nsu736:)jdudjdk$:)378;'$$)
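
The sorted-key model above can be sketched in a few lines of Python. This is an illustrative in-memory stand-in, not Accumulo's API: the `Key` tuple, the `entries` list, and `scan_row` are hypothetical names, and the sketch uses plain ascending tuple order (Accumulo itself orders timestamps newest-first within a column).

```python
import bisect
from typing import NamedTuple


class Key(NamedTuple):
    row: str
    family: str
    qualifier: str
    timestamp: int


# Hypothetical in-memory stand-in for a tablet: entries are kept sorted by key.
entries = sorted([
    (Key("Fred", "Email", "id0001", 20080302), "Welcome to gmail ..."),
    (Key("Bob", "UserPrefs", "Background", 20130101), "Grey"),
    (Key("Bob", "Email", "id0024", 20120302), "Re: next Thursday ..."),
    (Key("Bob", "Email", "id0023", 20120301), "Hey joe, can you send ..."),
    (Key("Sarah", "Email", "id0004", 20130201), "Hi again ..."),
])
keys = [k for k, _ in entries]


def scan_row(row: str):
    """Return all entries for one row by binary search on the sorted keys."""
    lo = bisect.bisect_left(keys, (row,))           # first key in this row
    hi = bisect.bisect_left(keys, (row + "\x00",))  # first key past this row
    return entries[lo:hi]


for key, value in scan_row("Bob"):
    print(key.family, key.qualifier, value)
```

Because the entries are sorted, a lookup or a scan over one row touches only a contiguous slice of the data, which is the property the Bigtable design relies on.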

[Diagram: tablet servers form the commit layer; HDFS DataNodes form the replication layer]

SINCE 2006
• Several BigTable implementations

• Apache HBase

• Apache Cassandra

• Apache Accumulo

• others …

HBASE
• Open-source Apache project started by developers at
Powerset, which was later bought by Microsoft

• Now used at Facebook, StumbleUpon, other big web sites

• Fast reads

• Row-oriented API

• Each column family has its own set of files

CASSANDRA
• Apache project started at Facebook

• Combines elements of BigTable and Amazon's Dynamo
into one system

• Used at Netflix, other web sites

• Fast writes

• Tunable consistency

[Diagram: tablet servers act as a combined commit and replication layer]

CONSISTENCY
• Highly consistent: each write goes to one place

• Eventually consistent: each write goes to more than one place

• Writing to more than one place buys tolerance of network partitions

• Partition tolerance allows geographically distributed servers

• *Google uses Spanner to synchronize multiple databases

[Diagrams: tablet servers deployed in Data Center A and Data Center B]

OVERVIEW
• HBase and Cassandra are both highly scalable

• Used to build web applications that can serve millions of
users at once

• Serve as a low-latency persistence layer for answering
requests in real time

• Available in single data center or cross data center options

USE CASE
• Most data comes from users

• Schema defined by the application

• Data builds up over time

ACCUMULO
• Can support the web application use-case

• But what are those other extra features for?

ACCUMULO ‘EXTRAS’
• Dynamic Column Families

• ColumnVisibility

• Key-value oriented API

• Iterators

• Batch Scanners

BIG ORGANIZATIONS
• Missions other than internet services

• Various disparate operational systems that
generate data

• Desire to look across and analyze that data

• Desire to deliver results to their own population

USE CASE IS DISCOVERING
AND ANALYZING ALL DATA

ISSUES
• Scale

• Unknown / multiple schemas

• Support for analysis without data movement

• Varying levels of sensitivity in the same system

• Support a high number of low-latency user requests

[Diagram labels: Data sets, Db, Analyze, Many Users]

NO CONTROL OVER THE SCHEMA,
OR MANY DIFFERENT SCHEMAS?

MAP EXISTING FIELDS TO
COLUMNS DYNAMICALLY
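
As a rough illustration of mapping fields to columns dynamically, the Python sketch below turns each field of an incoming record into its own column qualifier, so no schema has to be declared up front. The function `record_to_entries`, the family names, and the sample records are all hypothetical; this is not Accumulo client code.

```python
def record_to_entries(row_id: str, family: str, record: dict) -> list:
    """Turn each non-empty field of a source record into one
    (row, family, qualifier, value) entry; the field name becomes the qualifier."""
    return [
        (row_id, family, field, str(value))
        for field, value in sorted(record.items())
        if value is not None
    ]


# Two hypothetical source systems with unrelated field sets.
hr_record = {"name": "bob", "age": 54}
sensor_record = {"device": "d-17", "temp_c": 21.5, "site": None}

entries = (
    record_to_entries("RID00001", "hr", hr_record)
    + record_to_entries("RID00002", "sensor", sensor_record)
)
for entry in entries:
    print(entry)
```

Records with different fields land in the same table without any coordination, because the set of columns is just whatever qualifiers the data happens to carry.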

VARYING LEVELS OF DATA
SENSITIVITY?

DATA MODEL
Key: Row ID | Column (Family, Qualifier, Visibility) | Timestamp
Value

Row ID  Col Fam    Col Qual    Col Vis         Timestamp  Value
Bob     Email      id0023      personal comms  20120301   Hey joe, can you send ...
Bob     Email      id0024      personal comms  20120302   Re: next Thursday ...
Bob     UserPrefs  Background  prefs           20130101   Grey
Fred    Email      id0001      personal comms  20080302   Welcome to gmail ...
Sarah   Email      id0004      personal comms  20130201   Hi again ...
Sara    Videos     ytid009     public post     20100303   nsu736:)jdudjdk$:)378;'$$)

DATA OF VARYING
SENSITIVITY LEVELS CAN BE
PHYSICALLY CO-LOCATED
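
One way to picture cell-level visibility is a filter applied while scanning: every entry carries a visibility expression, and a reader sees only the entries their authorizations satisfy. The Python sketch below is a simplified evaluator, not Accumulo's implementation. Accumulo visibility expressions combine labels with `&`, `|` and parentheses; the label names and the way they are joined here are assumptions, since the slide shows only label text. The sketch evaluates unparenthesized operators left to right and does not validate malformed input.

```python
import re


def visible(expression: str, authorizations: set) -> bool:
    """Evaluate a visibility expression (&, |, parentheses) against a set of labels.
    An empty expression is treated as visible to everyone."""
    tokens = re.findall(r"[A-Za-z0-9_]+|[&|()]", expression)
    pos = 0

    def parse_term() -> bool:
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = parse_expr()
            pos += 1  # skip the closing ")"
            return result
        return token in authorizations

    def parse_expr() -> bool:
        nonlocal pos
        result = parse_term()
        while pos < len(tokens) and tokens[pos] in "&|":
            op = tokens[pos]
            pos += 1
            right = parse_term()
            result = (result and right) if op == "&" else (result or right)
        return result

    return parse_expr() if tokens else True


# Hypothetical entries: (row, family, qualifier, visibility, value).
entries = [
    ("Bob", "Email", "id0023", "personal&comms", "Hey joe, can you send ..."),
    ("Bob", "UserPrefs", "Background", "prefs", "Grey"),
    ("Sara", "Videos", "ytid009", "public", "..."),
]


def scan(authorizations: set) -> list:
    """Return only the entries the given authorizations are allowed to see."""
    return [e for e in entries if visible(e[3], authorizations)]


print(scan({"public", "prefs"}))
```

Since the check happens per entry at read time, data with different sensitivity levels can sit in the same table and the same files.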

FRAMEWORKS LIKE HADOOP
MAPREDUCE LOVE IT WHEN
DATA IS ALL TOGETHER

SECONDARY INDICES
• Application-created data: known

• Pre-existing data? unknown

SECONDARY INDICES
Original table:

RowID     Col Qual  Value
RID00001  age       54
RID00001  name      bob
RID00002  name      fred
RID00003  age       43
RID00003  height    5’9”
RID00003  name      harry
RID00004  name      carl
RID00005  name      evan

Index table:

RowID  Col Fam  Col Qual
43     age      RID00003
54     age      RID00001
5’9”   height   RID00003
bob    name     RID00001
carl   name     RID00004
evan   name     RID00005
fred   name     RID00002
harry  name     RID00003

Original table, read by row ID:

RowID     Col Qual  Value
RID00001  age       54
RID00001  name      bob
RID00002  name      fred
RID00003  age       43
RID00003  height    5’9”
RID00003  name      harry
RID00004  name      carl
RID00005  name      evan

Batch Scanner
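
The index pattern in these tables can be sketched as follows: the value becomes the row of the index table, the field name the column family, and the original row ID the column qualifier. A query first consults the index, then fetches all matching rows from the original table in one pass. This Python sketch uses plain dictionaries and lists; `lookup` and the in-memory loop that stands in for a batch scan are illustrative assumptions, not Accumulo calls.

```python
from collections import defaultdict

# Original table from the slide: (row ID, column qualifier) -> value.
records = {
    ("RID00001", "age"): "54",
    ("RID00001", "name"): "bob",
    ("RID00002", "name"): "fred",
    ("RID00003", "age"): "43",
    ("RID00003", "height"): "5’9”",
    ("RID00003", "name"): "harry",
    ("RID00004", "name"): "carl",
    ("RID00005", "name"): "evan",
}

# Index table: (value, field name, original row ID), kept sorted by value.
index = sorted((value, field, rid) for (rid, field), value in records.items())


def lookup(field: str, value: str) -> dict:
    """Find row IDs through the index, then fetch every matching row in one pass."""
    rids = {rid for v, f, rid in index if v == value and f == field}
    rows = defaultdict(dict)
    for (rid, qual), val in records.items():  # stands in for a batch scan over many rows
        if rid in rids:
            rows[rid][qual] = val
    return dict(rows)


print(lookup("name", "harry"))
```

The second step is where a batch scanner fits: the row IDs returned by the index are scattered across the original table, so they are fetched as a set of ranges rather than as one contiguous scan.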

COLUMNVISIBILITY APPLIES
TO INDEXES TOO

SHUFFLE-SORTED?
• Between Map and Reduce phases is shuffle-sort

• Sorting by key is necessary so all the values for a
given key end up next to each other …

• BigTable also sorts keys …

Value combine(Iterator<Value> values)
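
The idea behind that signature can be sketched in Python: because a sorted table keeps all values for a given key next to each other, a function that reduces an iterator of values to a single value can be applied in one pass. Here `combine` sums integers, which is only one possible choice, and `apply_combiner` and the sample entries are hypothetical names for illustration.

```python
from itertools import groupby
from operator import itemgetter
from typing import Iterator


def combine(values: Iterator[int]) -> int:
    """Reduce all values that share a key to one value; here, by summing."""
    return sum(values)


def apply_combiner(sorted_entries):
    """Walk key-sorted (key, value) pairs and emit one combined value per key."""
    for key, group in groupby(sorted_entries, key=itemgetter(0)):
        yield key, combine(value for _, value in group)


# Hypothetical count-style entries, sorted by key as a table would keep them.
entries = sorted([("fred", 1), ("bob", 2), ("bob", 3), ("carl", 1), ("bob", 1)])

for key, total in apply_combiner(entries):
    print(key, total)
```

`groupby` only merges adjacent equal keys, so the sketch depends on the input being sorted, which is the same guarantee the shuffle-sort phase gives a reducer.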
