Session presented at Big Data Spain 2012 Conference
16th Nov 2012
ETSI Telecomunicacion UPM Madrid
www.bigdataspain.org
More info: http://www.bigdataspain.org/es-2012/conference/the-promise-and-peril-of-abundance-making-big-data-small/brendan-mcadams
The promise and peril of abundance: Making Big Data small. BRENDAN MCADAMS at Big Data Spain 2012
1. A Modest Proposal
for Taming and Clarifying the Promises of Big Data
and the Software Driven Future
Brendan McAdams
10gen, Inc.
brendan@10gen.com
@rit
2. "In short, software is eating the world."
- Marc Andreessen
Wall Street Journal, Aug. 2011
http://on.wsj.com/XLwnmo
3. Software is Eating the World
• Amazon.com (and .uk, .es, etc.) started as a bookstore
• Today, they sell just about everything: bicycles, appliances, computers, TVs, etc.
• In some cities in America, they even do home grocery delivery
• No longer just a physical-goods company; increasingly defined and surrounded by software
• Pioneering the eBook revolution with the Kindle
• EC2 runs a huge percentage of the public internet
5. Software is Eating the World
• Netflix started as a company to deliver DVDs to the home...
• But as they’ve grown, the business has shifted to an online streaming service
• They are now rolling out rapidly in many countries, including Ireland, the UK, Canada and the Nordics
• No need for physical inventory or postal distribution ... just servers and digital copies
6. Disney Found Itself Forced To Transform...
From This...
7. Disney Found Itself Forced To Transform...
... To This
8. But What Does All This Software Do?
• Software always eats data – be it text files, user form input, emails, etc.
• All things that eat must eventually excrete...
9. Ingestion = Excretion
Yeast Ingests Sugars, and Excretes Ethanol
10. Ingestion = Excretion
Cows, er... well, you get the point.
11. So What Does Software Eat?
• Software always eats data – be it text files, user form input, emails, etc.
• But what does software excrete?
• More Data, of course...
• This data gets bigger and bigger
• The solutions for storing & processing this data become narrower
• Data Fertilizes Software, in an endless cycle...
12. There’s a Big Market Here...
• Lots of Solutions for Big Data
• Data Warehouse Software
• Operational Databases
• Old-style systems being upgraded to scale storage + processing
• NoSQL - Cassandra, MongoDB, etc
• Platforms
• Hadoop
14. Don’t Tilt At Windmills...
• It is easy to get distracted by all of these solutions
• Keep it simple
• Use tools you (and your team) can understand
• Use tools and techniques that can scale
• Try not to reinvent the wheel
15. ... And Don’t Bite Off More Than You Can Chew
• Break it into smaller pieces
• You can’t fit a whole pig into your mouth...
• ... slice it into small parts that you can consume.
16. Big Data at a Glance
[Diagram: a large dataset with “username” as the primary key]
• Big Data can be gigabytes, terabytes, petabytes or exabytes
• An ideal big data system scales up and down around various data sizes – while providing a uniform view
• Major concerns:
• Can I read & write this data efficiently at different scales?
• Can I run calculations on large portions of this data?
17. Big Data at a Glance
[Diagram: the large dataset, keyed by “username”, split into chunks]
• Systems like the Google File System (which inspired Hadoop’s HDFS) and MongoDB’s sharding handle the scale problem by chunking (sketched below)
• Break up pieces of data into smaller chunks, spread across many data nodes
• Each data node contains many chunks
• If a chunk gets too large or a node becomes overloaded, data can be rebalanced
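As a rough illustration of that routing idea, here is a minimal Python sketch. The Chunk class and the covers/find_chunk helpers are invented for illustration only; this is not MongoDB's actual router or balancer API. The point is simply that each chunk owns a half-open range of shard-key values, and a document lands in the chunk whose range contains its key.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Chunk:
    lower: Optional[str]                      # None stands in for -infinity
    upper: Optional[str]                      # None stands in for +infinity
    docs: list = field(default_factory=list)  # documents held by this chunk

def covers(chunk: Chunk, key: str) -> bool:
    """True if key falls inside the chunk's half-open range [lower, upper)."""
    above_lower = chunk.lower is None or key >= chunk.lower
    below_upper = chunk.upper is None or key < chunk.upper
    return above_lower and below_upper

def find_chunk(chunks: list, key: str) -> Chunk:
    """Route a shard-key value to the single chunk whose range contains it."""
    return next(c for c in chunks if covers(c, key))

# An empty collection starts as one chunk covering -infinity to +infinity.
chunks = [Chunk(lower=None, upper=None)]
find_chunk(chunks, "Bill").docs.append({"username": "Bill"})
```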
18. Chunks Represent Ranges of Values
Initially, an empty collection has a single chunk, running the range of minimum (-∞) to maximum (+∞).
INSERT {USERNAME: “Bill”}
As we add data, more chunks are created with new ranges, e.g. -∞ → “B”, “B” → “C”, “C” → +∞.
INSERT {USERNAME: “Becky”}
INSERT {USERNAME: “Brendan”}
Individual or partial letter ranges (e.g. -∞ → “Ba”, “Ba” → “Be”, “Be” → “Br”) are one possible chunk value... but they can get smaller!
INSERT {USERNAME: “Brad”}
The smallest possible chunk value is not a range, but a single possible value (e.g. “Brad”, “Brendan”).
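To illustrate how chunks “get smaller”, here is a toy split under stated assumptions: the document-count threshold and the median split point are invented for this sketch; real systems split on chunk size in bytes and pick split points from the shard-key index.

```python
def split_chunk(lower, upper, docs, max_docs=2):
    """Illustrative split: when a chunk holds too many documents,
    cut its key range at the median key, yielding two smaller chunks."""
    if len(docs) <= max_docs:
        return [(lower, upper, docs)]
    docs = sorted(docs, key=lambda d: d["username"])
    mid = docs[len(docs) // 2]["username"]
    left = (lower, mid, [d for d in docs if d["username"] < mid])
    right = (mid, upper, [d for d in docs if d["username"] >= mid])
    return [left, right]

# A chunk covering ("Ba", "Bz") grows too large and splits at "Brad",
# leaving "Brad" and "Brendan" together in a very small chunk.
docs = [{"username": u} for u in ("Becky", "Bill", "Brad", "Brendan")]
print(split_chunk("Ba", "Bz", docs))
```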
19. Big Data at a Glance
[Diagram: the dataset, keyed by “username”, split into letter-labeled chunks a–h ... s–z]
• To simplify things, let’s look at our dataset split into chunks by letter
• Each chunk is represented by a single letter marking its contents
• You could think of “B” as really being “Ba” → “Bz”
21. Big Data at a Glance
[Diagram: the large dataset broken into letter-labeled chunks]
MongoDB sharding (as well as HDFS) breaks data into chunks (~64 MB)
22. Big Data at a Glance
[Diagram: the letter-labeled chunks spread across Data Nodes 1–4, 25% of the chunks on each]
Representing data as chunks allows many levels of scale across n data nodes (see the sketch below)
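A rough sketch of spreading chunks evenly across n data nodes. The assign_chunks helper is hypothetical and simply deals chunks out round-robin; MongoDB's balancer and HDFS block placement are considerably more sophisticated.

```python
from collections import defaultdict

def assign_chunks(chunk_labels, n_nodes):
    """Deal chunks out round-robin so each of the n nodes holds ~1/n of them."""
    assignment = defaultdict(list)
    for i, label in enumerate(chunk_labels):
        assignment[i % n_nodes].append(label)
    return dict(assignment)

chunks = list("abcdefgh") + list("stuvwxyz")   # the letter-labeled chunks from the diagram
print(assign_chunks(chunks, 4))                # 4 nodes -> 4 chunks each (25% of the chunks)
```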
23. Scaling
[Diagram: the chunks spread across Data Nodes 1–5]
The set of chunks can be evenly distributed across n data nodes
24. Add Nodes: Chunk Rebalancing
[Diagram: Data Nodes 1–5 after rebalancing, each holding an equal share of chunks]
The goal is equilibrium: an equal distribution. As nodes are added (or even removed), chunks can be redistributed for balance (sketched below).
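A toy rebalancing pass under the same assumptions: the rebalance function below is invented for illustration and just moves chunks from the fullest node to the emptiest until counts differ by at most one, whereas real balancers also weigh chunk size and migration cost.

```python
def rebalance(assignment):
    """Move chunks one at a time from the fullest node to the emptiest node
    until no node holds more than one chunk above any other node."""
    nodes = {node: list(chunks) for node, chunks in assignment.items()}
    while True:
        fullest = max(nodes, key=lambda n: len(nodes[n]))
        emptiest = min(nodes, key=lambda n: len(nodes[n]))
        if len(nodes[fullest]) - len(nodes[emptiest]) <= 1:
            return nodes
        nodes[emptiest].append(nodes[fullest].pop())

# A fifth, empty node joins the four-node layout; rebalancing evens things out.
layout = {1: list("abcd"), 2: list("efgh"), 3: list("stuv"), 4: list("wxyz"), 5: []}
print(rebalance(layout))   # 16 chunks over 5 nodes -> sizes 4, 3, 3, 3, 3
```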
25. Don’t Bite Off More Than You Can Chew...
• The answer to calculating big data is much the same as storing it
• We need to break our data into bite-sized pieces
• Build functions which can be composed together repeatedly on partitions of our data (see the sketch after this list)
• Process portions of the data across multiple calculation nodes
• Aggregate the results into a final set of results
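A minimal Python sketch of that pattern; the function names are invented, and this shows the shape of the idea rather than any particular framework's API. The same small function runs on every partition, and a second function folds the partial results together.

```python
from functools import reduce

def process_partition(values):
    """The same small function is applied to every partition of the data."""
    return sum(values)

def combine(a, b):
    """Partial results from the partitions aggregate into one final result."""
    return a + b

partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]         # bite-sized pieces of a big dataset
partials = [process_partition(p) for p in partitions]  # each piece could run on its own node
print(reduce(combine, partials))                       # 45
```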
26. Bite Sized Pieces Are Easier to Swallow
• These pieces are not chunks – rather, the individual data points that make up each chunk
• Chunks also make useful data-transfer units for processing
• Transfer chunks as “Input Splits” to calculation nodes, allowing for scalable parallel processing
27. MapReduce the Pieces
• The most common application of these techniques is MapReduce
• Based on a Google whitepaper; it works with two primary functions – map and reduce – to calculate against large datasets
28. MapReduce to Calculate Big Data
• MapReduce is designed to effectively process data at varying scales
• Composable function units can be reused repeatedly for scaled results
29. MapReduce to Calculate Big Data
• In addition to the HDFS storage component, Hadoop is built around MapReduce for calculation
• MongoDB can be integrated with Hadoop to MapReduce data stored in MongoDB
• No HDFS storage needed – data moves directly between MongoDB and Hadoop’s MapReduce engine
30. What is MapReduce?
• MapReduce is made up of a series of phases, the primary of which are:
• Map
• Shuffle
• Reduce
• Let’s look at a typical MapReduce job:
• Email records
• Count the # of times a particular user has received email
31. MapReducing Email
to: tyler
from: brendan
subject: Ruby Support
to: brendan
from: tyler
subject: Re: Ruby Support
to: mike
from: brendan
subject: Node Support
to: brendan
from: mike
subject: Re: Node Support
to: mike
from: tyler
subject: COBOL Support
to: tyler
from: mike
subject: Re: COBOL Support (WTF?)
32. Map Step
Each of the six email documents from the previous slide is passed to the map function, which breaks it into a key (grouping) & value and emits the pair with emit(k, v). Here the key is the recipient and the value is {count: 1}:
key: tyler, value: {count: 1}
key: brendan, value: {count: 1}
key: mike, value: {count: 1}
key: brendan, value: {count: 1}
key: mike, value: {count: 1}
key: tyler, value: {count: 1}
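A minimal Python sketch of that map step, for illustration only: MongoDB's native map functions are written in JavaScript and Hadoop mappers in Java, so map_email and the emit callback here are invented to mirror the slide.

```python
def map_email(doc, emit):
    """Emit one (recipient, {count: 1}) pair per email document."""
    emit(doc["to"], {"count": 1})

emails = [
    {"to": "tyler",   "from": "brendan", "subject": "Ruby Support"},
    {"to": "brendan", "from": "tyler",   "subject": "Re: Ruby Support"},
    {"to": "mike",    "from": "brendan", "subject": "Node Support"},
    {"to": "brendan", "from": "mike",    "subject": "Re: Node Support"},
    {"to": "mike",    "from": "tyler",   "subject": "COBOL Support"},
    {"to": "tyler",   "from": "mike",    "subject": "Re: COBOL Support"},
]

emitted = []
for email in emails:
    map_email(email, lambda k, v: emitted.append((k, v)))
print(emitted)   # six (recipient, {count: 1}) pairs, one per email
```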
33. Group/Shuffle Step
Group like keys together, creating an array of their distinct values. (This is done automatically by M/R frameworks.) Before grouping, the emitted pairs are:
key: tyler, value: {count: 1}
key: brendan, value: {count: 1}
key: tyler, value: {count: 1}
key: mike, value: {count: 1}
key: brendan, value: {count: 1}
key: mike, value: {count: 1}
34. Group/Shuffle Step
Group like keys together, creating an array of their distinct values. (This is done automatically by M/R frameworks.) After grouping:
key: tyler, values: [{count: 1}, {count: 1}]
key: mike, values: [{count: 1}, {count: 1}]
key: brendan, values: [{count: 1}, {count: 1}]
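A sketch of the group/shuffle step under the same assumptions; frameworks perform this automatically, and here it is just a dictionary of lists built from the emitted pairs.

```python
from collections import defaultdict

def shuffle(pairs):
    """Group emitted (key, value) pairs so each key maps to the list of its values."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

pairs = [("tyler", {"count": 1}), ("brendan", {"count": 1}), ("tyler", {"count": 1}),
         ("mike", {"count": 1}), ("brendan", {"count": 1}), ("mike", {"count": 1})]
print(shuffle(pairs))
# {'tyler': [{'count': 1}, {'count': 1}], 'brendan': [...], 'mike': [...]}
```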
35. Reduce Step
For each key, the reduce function flattens the list of values to a single result (aggregate the values, return the result):
key: tyler, values: [{count: 1}, {count: 1}] → value: {count: 2}
key: mike, values: [{count: 1}, {count: 1}] → value: {count: 2}
key: brendan, values: [{count: 1}, {count: 1}] → value: {count: 2}
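And a matching sketch of the reduce step (reduce_counts is invented for illustration): each key's list of {count: 1} values is folded into a single total, giving the per-user email counts.

```python
def reduce_counts(key, values):
    """Flatten the list of {count: n} values for one key into a single total."""
    return {"count": sum(v["count"] for v in values)}

grouped = {
    "tyler":   [{"count": 1}, {"count": 1}],
    "mike":    [{"count": 1}, {"count": 1}],
    "brendan": [{"count": 1}, {"count": 1}],
}
results = {key: reduce_counts(key, values) for key, values in grouped.items()}
print(results)   # {'tyler': {'count': 2}, 'mike': {'count': 2}, 'brendan': {'count': 2}}
```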
36. Processing Scalable Big Data
• MapReduce provides an effective system for calculating and processing our large datasets (from gigabytes through exabytes and beyond)
• MapReduce is supported in many places, including MongoDB & Hadoop
• We have effective answers for both of our concerns:
• Can I read & write this data efficiently at different scales?
• Can I run calculations on large portions of this data?
37. Batch Isn’t a Sustainable Answer
• There are downsides here – fundamentally, MapReduce is a batch process
• Batch systems like Hadoop give us a “Catch-22”:
• You can get answers to questions from petabytes of data
• But you can’t guarantee you’ll get them quickly
• In some ways, this is a step backwards in our industry
• Business stakeholders tend to want answers now
• We must evolve
38. Moving Away from Batch
• The Big Data world is moving rapidly away from slow, batch-based processing solutions
• Google moved forward from batch into more realtime over the last few years
• Hadoop is replacing “MapReduce as Assembly Language” with more flexible resource management in YARN
• Now MapReduce is just a feature implemented on top of YARN
• Build anything we want
• Newer systems like Spark & Storm provide platforms for realtime processing
39. In Closing
• The World IS Being Eaten By Software
• All that software is leaving behind an awful lot of data
• We must be careful not to “step in it”
• More Data Means More Software Means More Data Means...
• Practical solutions for processing & storing data will save us
• We as data scientists & technologists must always evolve our strategies, thinking and tools
40. [Download the Hadoop Connector]
http://github.com/mongodb/mongo-hadoop
[Docs]
http://api.mongodb.org/hadoop/
¿QUESTIONS?
*Contact Me*
brendan@10gen.com
(twitter: @rit)