Session presented at Big Data Spain 2012 Conference
16th Nov 2012
ETSI Telecomunicacion UPM Madrid
www.bigdataspain.org
More info: http://www.bigdataspain.org/es-2012/conference/the-promise-and-peril-of-abundance-making-big-data-small/brendan-mcadams
The promise and peril of abundance: Making Big Data small. BRENDAN MCADAMS at Big Data Spain 2012
1. A Modest Proposal
for Taming and Clarifying the Promises of Big Data
and the Software Driven Future
Brendan McAdams
10gen, Inc.
brendan@10gen.com
@rit
2. "In short, software is eating the world."
- Marc Andreessen
Wall Street Journal, Aug. 2011
http://on.wsj.com/XLwnmo
3. Software is Eating the World
• Amazon.com (and .uk, .es, etc.) started as a bookstore
• Today, they sell just about everything: bicycles, appliances, computers, TVs, etc.
• In some cities in America, they even do home grocery delivery
• No longer just a physical-goods company; increasingly defined and surrounded by software
• Pioneering the eBook revolution with the Kindle
• EC2 runs a huge percentage of the public internet
5. Software is Eating the World
• Netflix started as a company to deliver DVDs to the home...
• But as they’ve grown, the business has shifted to an online streaming service
• They are now rolling out rapidly in many countries, including Ireland, the UK, Canada and the Nordics
• No need for physical inventory or postal distribution ... just servers and digital copies
6. Disney Found Itself Forced To Transform...
From This...
7. Disney Found Itself Forced To Transform...
... To This
8. But What Does All This Software Do?
• Software always eats data – be it text files, user form input, emails, etc.
• All things that eat must eventually excrete...
9. Ingestion = Excretion
Yeast Ingests Sugars, and Excretes Ethanol
10. Ingestion = Excretion
Cows, er... well, you get the point.
11. So What Does Software Eat?
• Software always eats data – be it text files, user form input, emails, etc.
• But what does software excrete?
• More Data, of course...
• This data gets bigger and bigger
• The solutions for storing & processing this data become narrower
• Data Fertilizes Software, in an endless cycle...
12. There’s a Big Market Here...
• Lots of Solutions for Big Data
• Data Warehouse Software
• Operational Databases
• Old-style systems being upgraded to scale storage + processing
• NoSQL - Cassandra, MongoDB, etc
• Platforms
• Hadoop
14. Don’t Tilt At Windmills...
• It is easy to get distracted by all of these solutions
• Keep it simple
• Use tools you (and your team) can understand
• Use tools and techniques that can scale
• Try not to reinvent the wheel
15. ... And Don’t Bite Off More Than You Can Chew
• Break it into smaller pieces
• You can’t fit a whole pig into your mouth...
• ... slice it into small parts that you can consume.
16. Big Data at a Glance
[Diagram: a large dataset with “username” as the primary key]
• Big Data can be gigabytes, terabytes, petabytes or exabytes
• An ideal big data system scales up and down around various data sizes – while providing a uniform view
• Major concerns:
• Can I read & write this data efficiently at different scales?
• Can I run calculations on large portions of this data?
17. Big Data at a Glance
[Diagram: the large dataset, keyed by “username”, split into chunks]
• Systems like the Google File System (which inspired Hadoop’s HDFS) and MongoDB’s sharding handle the scale problem by chunking (sketched below)
• Break up pieces of data into smaller chunks, spread across many data nodes
• Each data node contains many chunks
• If a chunk gets too large or a node becomes overloaded, data can be rebalanced
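As a rough illustration of that routing idea, here is a minimal Python sketch. The Chunk class and the covers/find_chunk helpers are invented for illustration only; this is not MongoDB's actual router or balancer API. The point is simply that each chunk owns a half-open range of shard-key values, and a document lands in the chunk whose range contains its key.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Chunk:
    lower: Optional[str]                      # None stands in for -infinity
    upper: Optional[str]                      # None stands in for +infinity
    docs: list = field(default_factory=list)  # documents held by this chunk

def covers(chunk: Chunk, key: str) -> bool:
    """True if key falls inside the chunk's half-open range [lower, upper)."""
    above_lower = chunk.lower is None or key >= chunk.lower
    below_upper = chunk.upper is None or key < chunk.upper
    return above_lower and below_upper

def find_chunk(chunks: list, key: str) -> Chunk:
    """Route a shard-key value to the single chunk whose range contains it."""
    return next(c for c in chunks if covers(c, key))

# An empty collection starts as one chunk covering -infinity to +infinity.
chunks = [Chunk(lower=None, upper=None)]
find_chunk(chunks, "Bill").docs.append({"username": "Bill"})
```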
18. Chunks Represent Ranges of Values
Initially, an empty collection has a single chunk, running the range of minimum (-∞) to maximum (+∞).
INSERT {USERNAME: “Bill”}
As we add data, more chunks are created with new ranges, e.g. -∞ → “B”, “B” → “C”, “C” → +∞.
INSERT {USERNAME: “Becky”}
INSERT {USERNAME: “Brendan”}
Individual or partial letter ranges (e.g. -∞ → “Ba”, “Ba” → “Be”, “Be” → “Br”) are one possible chunk value... but they can get smaller!
INSERT {USERNAME: “Brad”}
The smallest possible chunk value is not a range, but a single possible value (e.g. “Brad”, “Brendan”).
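To illustrate how chunks “get smaller”, here is a toy split under stated assumptions: the document-count threshold and the median split point are invented for this sketch; real systems split on chunk size in bytes and pick split points from the shard-key index.

```python
def split_chunk(lower, upper, docs, max_docs=2):
    """Illustrative split: when a chunk holds too many documents,
    cut its key range at the median key, yielding two smaller chunks."""
    if len(docs) <= max_docs:
        return [(lower, upper, docs)]
    docs = sorted(docs, key=lambda d: d["username"])
    mid = docs[len(docs) // 2]["username"]
    left = (lower, mid, [d for d in docs if d["username"] < mid])
    right = (mid, upper, [d for d in docs if d["username"] >= mid])
    return [left, right]

# A chunk covering ("Ba", "Bz") grows too large and splits at "Brad",
# leaving "Brad" and "Brendan" together in a very small chunk.
docs = [{"username": u} for u in ("Becky", "Bill", "Brad", "Brendan")]
print(split_chunk("Ba", "Bz", docs))
```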
19. Big Data at a Glance
[Diagram: the dataset, keyed by “username”, split into letter-labeled chunks a–h ... s–z]
• To simplify things, let’s look at our dataset split into chunks by letter
• Each chunk is represented by a single letter marking its contents
• You could think of “B” as really being “Ba” → “Bz”
21. Big Data at a Glance
[Diagram: the large dataset broken into letter-labeled chunks]
MongoDB sharding (as well as HDFS) breaks data into chunks (~64 MB)
22. Big Data at a Glance
[Diagram: the letter-labeled chunks spread across Data Nodes 1–4, 25% of the chunks on each]
Representing data as chunks allows many levels of scale across n data nodes (see the sketch below)
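A rough sketch of spreading chunks evenly across n data nodes. The assign_chunks helper is hypothetical and simply deals chunks out round-robin; MongoDB's balancer and HDFS block placement are considerably more sophisticated.

```python
from collections import defaultdict

def assign_chunks(chunk_labels, n_nodes):
    """Deal chunks out round-robin so each of the n nodes holds ~1/n of them."""
    assignment = defaultdict(list)
    for i, label in enumerate(chunk_labels):
        assignment[i % n_nodes].append(label)
    return dict(assignment)

chunks = list("abcdefgh") + list("stuvwxyz")   # the letter-labeled chunks from the diagram
print(assign_chunks(chunks, 4))                # 4 nodes -> 4 chunks each (25% of the chunks)
```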
23. Scaling
[Diagram: the chunks spread across Data Nodes 1–5]
The set of chunks can be evenly distributed across n data nodes
24. Add Nodes: Chunk Rebalancing
[Diagram: Data Nodes 1–5 after rebalancing, each holding an equal share of chunks]
The goal is equilibrium: an equal distribution. As nodes are added (or even removed), chunks can be redistributed for balance (sketched below).
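A toy rebalancing pass under the same assumptions: the rebalance function below is invented for illustration and just moves chunks from the fullest node to the emptiest until counts differ by at most one, whereas real balancers also weigh chunk size and migration cost.

```python
def rebalance(assignment):
    """Move chunks one at a time from the fullest node to the emptiest node
    until no node holds more than one chunk above any other node."""
    nodes = {node: list(chunks) for node, chunks in assignment.items()}
    while True:
        fullest = max(nodes, key=lambda n: len(nodes[n]))
        emptiest = min(nodes, key=lambda n: len(nodes[n]))
        if len(nodes[fullest]) - len(nodes[emptiest]) <= 1:
            return nodes
        nodes[emptiest].append(nodes[fullest].pop())

# A fifth, empty node joins the four-node layout; rebalancing evens things out.
layout = {1: list("abcd"), 2: list("efgh"), 3: list("stuv"), 4: list("wxyz"), 5: []}
print(rebalance(layout))   # 16 chunks over 5 nodes -> sizes 4, 3, 3, 3, 3
```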
25. Don’t Bite Off More Than You Can Chew...
• The answer to calculating big data is much the same as storing it
• We need to break our data into bite-sized pieces
• Build functions which can be composed together repeatedly on partitions of our data (see the sketch after this list)
• Process portions of the data across multiple calculation nodes
• Aggregate the results into a final set of results
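A minimal Python sketch of that pattern; the function names are invented, and this shows the shape of the idea rather than any particular framework's API. The same small function runs on every partition, and a second function folds the partial results together.

```python
from functools import reduce

def process_partition(values):
    """The same small function is applied to every partition of the data."""
    return sum(values)

def combine(a, b):
    """Partial results from the partitions aggregate into one final result."""
    return a + b

partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]         # bite-sized pieces of a big dataset
partials = [process_partition(p) for p in partitions]  # each piece could run on its own node
print(reduce(combine, partials))                       # 45
```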
26. Bite Sized Pieces Are Easier to Swallow
• These pieces are not chunks – rather, the individual data points that make up each chunk
• Chunks also make useful data-transfer units for processing
• Transfer chunks as “Input Splits” to calculation nodes, allowing for scalable parallel processing
27. MapReduce the Pieces
• The most common application of these techniques is MapReduce
• Based on a Google whitepaper; it works with two primary functions – map and reduce – to calculate against large datasets
28. MapReduce to Calculate Big Data
• MapReduce is designed to effectively process data at varying scales
• Composable function units can be reused repeatedly for scaled results
29. MapReduce to Calculate Big Data
• In addition to the HDFS storage component, Hadoop is built around MapReduce for calculation
• MongoDB can be integrated with Hadoop to MapReduce data stored in MongoDB
• No HDFS storage needed – data moves directly between MongoDB and Hadoop’s MapReduce engine
30. What is MapReduce?
• MapReduce is made up of a series of phases, the primary of which are:
• Map
• Shuffle
• Reduce
• Let’s look at a typical MapReduce job:
• Email records
• Count the # of times a particular user has received email
31. MapReducing Email
to: tyler
from: brendan
subject: Ruby Support
to: brendan
from: tyler
subject: Re: Ruby Support
to: mike
from: brendan
subject: Node Support
to: brendan
from: mike
subject: Re: Node Support
to: mike
from: tyler
subject: COBOL Support
to: tyler
from: mike
subject: Re: COBOL Support (WTF?)
32. Map Step
Each of the six email documents from the previous slide is passed to the map function, which breaks it into a key (grouping) & value and emits the pair with emit(k, v). Here the key is the recipient and the value is {count: 1}:
key: tyler, value: {count: 1}
key: brendan, value: {count: 1}
key: mike, value: {count: 1}
key: brendan, value: {count: 1}
key: mike, value: {count: 1}
key: tyler, value: {count: 1}
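A minimal Python sketch of that map step, for illustration only: MongoDB's native map functions are written in JavaScript and Hadoop mappers in Java, so map_email and the emit callback here are invented to mirror the slide.

```python
def map_email(doc, emit):
    """Emit one (recipient, {count: 1}) pair per email document."""
    emit(doc["to"], {"count": 1})

emails = [
    {"to": "tyler",   "from": "brendan", "subject": "Ruby Support"},
    {"to": "brendan", "from": "tyler",   "subject": "Re: Ruby Support"},
    {"to": "mike",    "from": "brendan", "subject": "Node Support"},
    {"to": "brendan", "from": "mike",    "subject": "Re: Node Support"},
    {"to": "mike",    "from": "tyler",   "subject": "COBOL Support"},
    {"to": "tyler",   "from": "mike",    "subject": "Re: COBOL Support"},
]

emitted = []
for email in emails:
    map_email(email, lambda k, v: emitted.append((k, v)))
print(emitted)   # six (recipient, {count: 1}) pairs, one per email
```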
33. Group/Shuffle Step
Group like keys together, creating an array of their distinct values. (This is done automatically by M/R frameworks.) Before grouping, the emitted pairs are:
key: tyler, value: {count: 1}
key: brendan, value: {count: 1}
key: tyler, value: {count: 1}
key: mike, value: {count: 1}
key: brendan, value: {count: 1}
key: mike, value: {count: 1}
34. Group/Shuffle Step
Group like keys together, creating an array of their distinct values. (This is done automatically by M/R frameworks.) After grouping:
key: tyler, values: [{count: 1}, {count: 1}]
key: mike, values: [{count: 1}, {count: 1}]
key: brendan, values: [{count: 1}, {count: 1}]
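A sketch of the group/shuffle step under the same assumptions; frameworks perform this automatically, and here it is just a dictionary of lists built from the emitted pairs.

```python
from collections import defaultdict

def shuffle(pairs):
    """Group emitted (key, value) pairs so each key maps to the list of its values."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

pairs = [("tyler", {"count": 1}), ("brendan", {"count": 1}), ("tyler", {"count": 1}),
         ("mike", {"count": 1}), ("brendan", {"count": 1}), ("mike", {"count": 1})]
print(shuffle(pairs))
# {'tyler': [{'count': 1}, {'count': 1}], 'brendan': [...], 'mike': [...]}
```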
35. Reduce Step
For each key, the reduce function flattens the list of values to a single result (aggregate the values, return the result):
key: tyler, values: [{count: 1}, {count: 1}] → value: {count: 2}
key: mike, values: [{count: 1}, {count: 1}] → value: {count: 2}
key: brendan, values: [{count: 1}, {count: 1}] → value: {count: 2}
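And a matching sketch of the reduce step (reduce_counts is invented for illustration): each key's list of {count: 1} values is folded into a single total, giving the per-user email counts.

```python
def reduce_counts(key, values):
    """Flatten the list of {count: n} values for one key into a single total."""
    return {"count": sum(v["count"] for v in values)}

grouped = {
    "tyler":   [{"count": 1}, {"count": 1}],
    "mike":    [{"count": 1}, {"count": 1}],
    "brendan": [{"count": 1}, {"count": 1}],
}
results = {key: reduce_counts(key, values) for key, values in grouped.items()}
print(results)   # {'tyler': {'count': 2}, 'mike': {'count': 2}, 'brendan': {'count': 2}}
```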
36. Processing Scalable Big Data
• MapReduce provides an effective system for calculating and processing our large datasets (from gigabytes through exabytes and beyond)
• MapReduce is supported in many places, including MongoDB & Hadoop
• We have effective answers for both of our concerns:
• Can I read & write this data efficiently at different scales?
• Can I run calculations on large portions of this data?
37. Batch Isn’t a Sustainable Answer
• There are downsides here – fundamentally, MapReduce is a batch process
• Batch systems like Hadoop give us a “Catch-22”:
• You can get answers to questions from petabytes of data
• But you can’t guarantee you’ll get them quickly
• In some ways, this is a step backwards in our industry
• Business stakeholders tend to want answers now
• We must evolve
38. Moving Away from Batch
• The Big Data world is moving rapidly away from slow, batch-based processing solutions
• Google moved forward from batch into more realtime over the last few years
• Hadoop is replacing “MapReduce as Assembly Language” with more flexible resource management in YARN
• Now MapReduce is just a feature implemented on top of YARN
• Build anything we want
• Newer systems like Spark & Storm provide platforms for realtime processing
39. In Closing
• The World IS Being Eaten By Software
• All that software is leaving behind an awful lot of data
• We must be careful not to “step in it”
• More Data Means More Software Means More Data Means...
• Practical solutions for processing & storing data will save us
• We as data scientists & technologists must always evolve our strategies, thinking and tools
40. [Download the Hadoop Connector]
http://github.com/mongodb/mongo-hadoop
[Docs]
http://api.mongodb.org/hadoop/
¿QUESTIONS?
*Contact Me*
brendan@10gen.com
(twitter: @rit)