3. BIGTABLE FEATURES
• Distributes data across many commodity servers
• Sorts data by key for fast lookup of values by key
• Scans across multiple key-value pairs
• Highly consistent writes to a single row
• Support for MapReduce jobs
5. Row ID  Col Fam    Col Qual    Timestamp  Value
   Bob     Email      id0023      20120301   Hey joe, can you send ...
   Bob     Email      id0024      20120302   Re: next Thursday ...
   Bob     UserPrefs  Background  20130101   Grey
   Fred    Email      id0001      20080302   Welcome to gmail ...
   Sarah   Email      id0004      20130201   Hi again ...
   Sara    Videos     ytid009     20100303   nsu736:)jdudjdk$:)378;'$$)
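The sorted layout shown in this table can be sketched in a few lines. This is a minimal in-memory illustration, not how any BigTable-style system is implemented: the `cells` list and the `scan_row` helper are hypothetical names, and the values are shortened copies of the rows above.

```python
import bisect

# Cells keyed by (row_id, column_family, column_qualifier, timestamp),
# mirroring the columns of the table. Inserted in arbitrary order on purpose.
cells = [
    (("Fred", "Email", "id0001", 20080302), "Welcome to gmail ..."),
    (("Bob", "UserPrefs", "Background", 20130101), "Grey"),
    (("Bob", "Email", "id0024", 20120302), "Re: next Thursday ..."),
    (("Sarah", "Email", "id0004", 20130201), "Hi again ..."),
    (("Bob", "Email", "id0023", 20120301), "Hey joe, can you send ..."),
]

# Keeping cells sorted by key is what makes lookups and scans cheap.
cells.sort(key=lambda cell: cell[0])
keys = [key for key, _ in cells]


def scan_row(row_id):
    """Return every (key, value) pair whose row ID equals row_id."""
    # Binary search for the first key at or after (row_id,).
    start = bisect.bisect_left(keys, (row_id,))
    result = []
    for key, value in cells[start:]:
        if key[0] != row_id:
            break  # sorted order: the first mismatch ends this row
        result.append((key, value))
    return result


bob_cells = scan_row("Bob")
```

Because the keys are sorted, all of Bob's cells sit next to each other, so a scan is one binary search followed by a sequential read.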
10. HBASE
• Open source Apache project started by developers at
Powerset (later bought by Microsoft)
• Now used at Facebook, StumbleUpon, other big web sites
• Fast reads
• Row-oriented API
• Each column family has its own set of files
12. CASSANDRA
• Apache project started at Facebook
• Combines elements of BigTable and Amazon's Dynamo
into one system
• Used at Netflix, other web sites
• Fast writes
• Tunable consistency
14. CONSISTENCY
• Highly consistent means: writes in one place
• Eventually consistent: writes in more than one place
• Writes in more than one place: network partition tolerance
• Partition tolerance: geographically distributed servers
• *Google uses Spanner to synchronize multiple databases
17. OVERVIEW
• Both highly scalable
• Used to build web applications that can serve millions of
users at once
• Serve as a low-latency persistence layer for real-time
service of requests
• Available in single data center or cross data center options
18. USE CASE
• Most data comes from users
• Schema defined by the application
• Data builds up over time
23. BIG ORGANIZATIONS
• Missions other than internet services
• Various disparate operational systems that
generate data
• Desire to look across and analyze that data
• Desire to deliver results to their own population
24. USE CASE IS DISCOVERING AND ANALYZING ALL DATA
25. ISSUES
• Scale
• Unknown / multiple schema
• Support for analysis without data movement
• Varying levels of sensitivity in the same system
• Support a high number of low-latency user requests
35. Row ID  Col Fam    Col Qual    Col Vis         Timestamp  Value
    Bob     Email      id0023      personal comms  20120301   Hey joe, can you send ...
    Bob     Email      id0024      personal comms  20120302   Re: next Thursday ...
    Bob     UserPrefs  Background  prefs           20130101   Grey
    Fred    Email      id0001      personal comms  20080302   Welcome to gmail ...
    Sarah   Email      id0004      personal comms  20130201   Hi again ...
    Sara    Videos     ytid009     public post     20100303   nsu736:)jdudjdk$:)378;'$$)
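One way to picture the Col Vis column is as a per-cell label that is checked against what a reader is allowed to see. The sketch below assumes each label is a single opaque string matched exactly; that matching rule, the `visible_cells` helper, and the `authorizations` parameter are all illustrative assumptions, and a real system may support richer visibility expressions.

```python
# (row_id, col_fam, col_qual, col_vis, value), copied from rows of the table.
cells = [
    ("Bob", "Email", "id0023", "personal comms", "Hey joe, can you send ..."),
    ("Bob", "UserPrefs", "Background", "prefs", "Grey"),
    ("Sara", "Videos", "ytid009", "public post", "nsu736:)jdudjdk$:)378;'$$)"),
]


def visible_cells(cells, authorizations):
    """Keep only cells whose visibility label is in the reader's authorizations.

    Assumption: labels are opaque strings compared by exact match.
    """
    return [cell for cell in cells if cell[3] in authorizations]


# A reader holding only the "prefs" label sees a single cell.
prefs_only = visible_cells(cells, {"prefs"})
```

Filtering at read time is what lets data with varying levels of sensitivity share one table.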
42. SECONDARY INDICES
RowID     Col Qual  Value
RID00001  age       54
RID00001  name      bob
RID00002  name      fred
RID00003  age       43
RID00003  height    5’9”
RID00003  name      harry
RID00004  name      carl
RID00005  name      evan

RowID  Col Fam  Col Qual
43     age      RID00003
54     age      RID00001
5’9”   height   RID00003
bob    name     RID00001
carl   name     RID00004
evan   name     RID00005
fred   name     RID00002
harry  name     RID00003
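The second table is the first one inverted: the value becomes the row ID, the field name becomes the column family, and the original row ID becomes the column qualifier. A minimal sketch of building and querying such an index, with `records`, `index`, and `lookup` as hypothetical names chosen for illustration:

```python
import bisect

# Main table entries: (row_id, col_qual, value).
records = [
    ("RID00001", "age", "54"),
    ("RID00001", "name", "bob"),
    ("RID00002", "name", "fred"),
    ("RID00003", "age", "43"),
    ("RID00003", "height", "5’9”"),
    ("RID00003", "name", "harry"),
    ("RID00004", "name", "carl"),
    ("RID00005", "name", "evan"),
]

# Index entries: (value, field, row_id), kept sorted by value like any other table.
index = sorted((value, field, row_id) for row_id, field, value in records)


def lookup(index, field, value):
    """Return the row IDs whose given field holds the given value."""
    # Binary search for the first index entry at or after (value, field).
    start = bisect.bisect_left(index, (value, field))
    row_ids = []
    for entry_value, entry_field, row_id in index[start:]:
        if (entry_value, entry_field) != (value, field):
            break  # sorted order: the first mismatch ends the run
        row_ids.append(row_id)
    return row_ids


harry_rows = lookup(index, "name", "harry")
```

Since the index is itself sorted by key, finding rows by value uses the same fast lookup as finding rows by row ID.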
45. RowID     Col Qual  Value
    RID00001  age       54
    RID00001  name      bob
    RID00002  name      fred
    RID00003  age       43
    RID00003  height    5’9”
    RID00003  name      harry
    RID00004  name      carl
    RID00005  name      evan
Batch Scanner
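Row IDs found through an index are usually scattered across the main table, so they are fetched as a batch of separate lookups rather than as one contiguous scan. The sketch below illustrates that idea only; `batch_scan` is a hypothetical helper, not the API of any real client library.

```python
import bisect

# Main table, sorted by (row_id, col_qual).
table = sorted([
    ("RID00001", "age", "54"),
    ("RID00001", "name", "bob"),
    ("RID00002", "name", "fred"),
    ("RID00003", "age", "43"),
    ("RID00003", "height", "5’9”"),
    ("RID00003", "name", "harry"),
    ("RID00004", "name", "carl"),
    ("RID00005", "name", "evan"),
])


def batch_scan(table, row_ids):
    """Fetch every entry for each requested row ID, one seek per row."""
    results = []
    for row_id in sorted(set(row_ids)):
        # Seek to the first entry of this row, then read until the row ends.
        start = bisect.bisect_left(table, (row_id,))
        for entry in table[start:]:
            if entry[0] != row_id:
                break
            results.append(entry)
    return results


# Row IDs as they might come back from an index lookup, in no particular order.
rows = batch_scan(table, ["RID00003", "RID00001"])
```

In this sketch the seeks run one after another; they are independent of each other, which is what would allow them to be issued in parallel.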
49. SHUFFLE-SORTED?
• Between Map and Reduce phases is shuffle-sort
• Sorting by key is necessary so all the values for a
given key end up next to each other …
• BigTable also sorts keys …
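The role of the shuffle-sort step can be shown with a tiny in-memory example. This is a sketch of the idea rather than of any MapReduce framework: the `mapped` pairs are made-up map output, and the "reduce" is a simple sum per key.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical map output: (key, value) pairs in arbitrary order.
mapped = [("bob", 1), ("fred", 1), ("bob", 1), ("carl", 1), ("fred", 1), ("bob", 1)]

# Shuffle-sort: sorting by key puts all values for a key next to each other.
shuffled = sorted(mapped, key=itemgetter(0))

# Reduce: groupby only merges adjacent equal keys, so it relies on the sort.
reduced = {
    key: sum(value for _, value in group)
    for key, group in groupby(shuffled, key=itemgetter(0))
}
```

Without the sort, `groupby` would see the same key in several separate runs, which is the problem the shuffle-sort phase solves.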