Big Data, MapReduce and beyond
Iván de Prado Alonso // @ivanprado
Pere Ferrera Bertran // @ferrerabertran
@datasalt

Outline

Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent.
1. Big Data. What and why
2. MapReduce & Hadoop
3. MapReduce Design Patterns
4. Real-life MapReduce
    1. Data analytics
    2. Crawling
    3. Full-text indexing
    4. Reputation systems
    5. Data Mining
5. Tuple MapReduce & Pangool

2 / 112

In the past...

● Data and computation fit on one monolithic machine
● Monolithic databases: RDBMS
● Scalability:
– Vertical: buy better hardware
● Distributed systems
– Not very common
– Logic centric: data moves to where the logic is
● Distributed storage: SAN

4 / 112

Distributed systems are hard

● Building distributed systems is hard
– If you can scale vertically at a reasonable cost, why deal with the
complexity of distributed systems?
● But circumstances are changing:
– Big Data
● Big data refers to the massive amounts of data that are difficult
to analyze and handle using common database management
tools

5 / 112

BIG DATA “MAC”

6 / 112


Big Data

● Data is the new bottleneck
– Web data
● Web pages
● Interaction Logs
– Social networks data
– Mobile devices
– Data generated by Sensors
● Old systems/techniques are not appropriate
● A new approach is needed

7 / 112

Big Data project parts

● Acquiring
● Processing
● Serving

8 / 112


Acquiring

● Gathering/receiving/storing data from sources
● Many kinds of sources
– Internet
– Sensors
– User behavior
– Mobile devices
– Health care data
– Banking data
– Social Networks
– …..

9 / 112

Processing

● Data is present in the system (acquired)
● This step is responsible for extracting value from data
– Eliminate duplicates
– Infer relations
– Calculate statistics
– Correlate information
– Ensure quality
– Generate recommendations
– ….

10 / 112

Serving

● In most cases, some interface has to be provided to access
the processed information
● Possibilities
– Big Data / No Big Data
– Real time access to results / non real time access
● Some examples:
– Search engine → inverted index
– Banking data → relational database
– Social Network → NoSQL database

11 / 112

Big Data system types

● Offline
– Latency is not a problem
● Online
– Response immediacy is important
● Mixed
– Online behavior, but internally a mixture of two systems
● One online
● One offline

Offline Online
MapReduce NoSQL
Hadoop Search engines
Distributed RDBMS

12 / 112

Big Data system types (II)

Diagram: how the Acquiring (A), Processing (P) and Serving (S) parts are arranged in offline, online and mixed systems.

13 / 112


“Swiss army knife of the 21st century”
Media Guardian Innovation Awards

http://www.guardian.co.uk/technology/2011/mar/25/media-guardian-innovation-awards-apache-hadoop

15 / 112

History

● 2004-2006
– GFS and MapReduce papers published by Google
– Doug Cutting implements an open source version for Nutch
● 2006-2008
– Hadoop project becomes independent from Nutch
– Web scale reached in 2008
● 2008-now
– Hadoop becomes popular and is commercially exploited

Source: Hadoop: a brief history. Doug Cutting

16 / 112

Hadoop

“The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers
using a simple programming model”
From Apache Hadoop page

17 / 112

Main ideas

● Distributed
– Distributed storage
– Distributed computation platform
● Built to be fault tolerant
● Shared nothing architecture
● Programmer isolation from distributed system difficulties
– By providing simple primitives for programming

18 / 112

Hadoop Distributed File System (HDFS)

● Distributed
– Aggregates the individual storage of each node
● Files formed by blocks
– Typically 64 or 128 MB (configurable)
– Stored in the OS filesystem: Xfs, Ext3, etc.
● Fault tolerant
– Blocks replicated more than once

19 / 112

How files are stored

NameNode
Data.txt blocks:
1 → DN1, DN2
2 → DN1, DN4
3 → DN2, DN3
4 → DN4, DN3

DataNode 1 (DN1): blocks 1, 2
DataNode 2 (DN2): blocks 1, 3
DataNode 3 (DN3): blocks 3, 4
DataNode 4 (DN4): blocks 2, 4
20 / 112

MapReduce

● MapReduce is the abstraction behind Hadoop
● The unit of execution is the Job
● A Job has
– An input
– An output
– A map function
– A reduce function
● Input and output are sequences of key/value pairs
● The map and reduce functions are provided by the developer
– The execution is distributed and parallelized by Hadoop

21 / 112

Job phases

● Two different phases: mapping and reducing
● Mapping phase
– Map function is applied to Input data
● Intermediate data is generated
● Reducing phase
– Reduce function is applied to intermediate data
● Final output is generated

22 / 112

MapReduce

● Two functions (Map & Reduce)
– Map(u, v) : [w,x]*
– Reduce(w, x*) : [y, z]*
● Example: word count
– Map([document, null]) -> [word, 1]*
– Reduce(word, 1*) -> [word, total]
● MapReduce & SQL
– SELECT word, count(*) GROUP BY word
● Distributed execution in a cluster
– Horizontal scalability

23 / 112

Word Count

Input:
This is a line
Also this

Map:
map(“This is a line”) =
this, 1
is, 1
a, 1
line, 1
map(“Also this”) =
also, 1
this, 1

Reduce:
reduce(a, {1}) = a, 1
reduce(also, {1}) = also, 1
reduce(is, {1}) = is, 1
reduce(line, {1}) = line, 1
reduce(this, {1, 1}) = this, 2

Result:
a, 1
also, 1
is, 1
line, 1
this, 2
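
The word count above can be reproduced with a small in-memory simulation. This is only a sketch, not Hadoop code: `run_job`, `map_fn` and `reduce_fn` are hypothetical names, and the words are lowercased on the assumption that “This” and “this” should be counted together, as in the slide's result.

```python
from collections import defaultdict


def map_fn(_key, line):
    # Emit (word, 1) for every word in the line.
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    # Sum all the 1s received for this word.
    yield word, sum(counts)


def run_job(records, map_fn, reduce_fn):
    # Shuffle: group the intermediate pairs by key.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # Reduce: one call per key, in key order.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


lines = [(0, "This is a line"), (1, "Also this")]
word_counts = run_job(lines, map_fn, reduce_fn)
print(word_counts)
```

In a real cluster the grouping step is done by the shuffle across machines; here a dictionary plays that role.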

24 / 112

Map examples

● Swap Mapper
– Swaps key and value
map(key, value):
    emit(value, key)

● Split Key Mapper
– Splits the key into words and emits a pair for each word
map(key, value):
    words = key.split(" ")
    for each word in words:
        emit(word, value)

25 / 112

Map examples (II)

● Filter Mapper
– Filters out some records
map(key, value):
    if (key != "the"):
        emit(key, value)

● Key/value concatenation mapper
– Concatenates the key and the value in the key
map(key, value):
    emit(key + " " + value, null)
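
As a rough illustration, the four mappers can be written as Python generators, where `yield` plays the role of `emit`. The function names are made up for this sketch, and `None` stands in for `null`.

```python
def swap_mapper(key, value):
    # Swap key and value.
    yield value, key


def split_key_mapper(key, value):
    # One output pair per word of the key.
    for word in key.split(" "):
        yield word, value


def filter_mapper(key, value):
    # Drop records whose key is "the".
    if key != "the":
        yield key, value


def concat_mapper(key, value):
    # Move the value into the key; the output value is empty.
    yield f"{key} {value}", None


print(list(split_key_mapper("big data", 7)))
```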

26 / 112

Reduce examples

● Count reducer
– Counts the number of elements for each key
reduce(key, values):
    count = 0
    for each value in values:
        count++
    emit(key, count)

● Average reducer
– Computes the average value for each key
reduce(key, values):
    count = 0
    total = 0
    for each value in values:
        count++
        total += value
    emit(key, total / count)

27 / 112

Reduce examples (II)

● Keep first reducer
– Keeps the first key/value input pair
reduce(key, values):
    emit(key, first(values))

● Value concatenation reducer
– Concatenates the values in one string
reduce(key, values):
    result = ""
    for each value in values:
        result += " " + value
    emit(key, result)
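
The count, average, keep-first and concatenation reducers could look like this in Python. It is a sketch with hypothetical function names. Each reducer receives an iterable of values and is assumed to be called with at least one value, as happens in MapReduce. Unlike the pseudocode, `concat_reducer` does not produce a leading space.

```python
def count_reducer(key, values):
    # Number of values received for the key.
    yield key, sum(1 for _ in values)


def average_reducer(key, values):
    count = 0
    total = 0
    for value in values:
        count += 1
        total += value
    yield key, total / count


def keep_first_reducer(key, values):
    # Only the first value is read; the rest are ignored.
    yield key, next(iter(values))


def concat_reducer(key, values):
    yield key, " ".join(str(value) for value in values)


print(list(average_reducer("page", [2, 4, 6])))
```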

28 / 112

Identity map and reduce

● The identity functions are those that keep the input
unchanged
– Map identity
map(key, value):
    emit(key, value)

– Reduce identity
reduce(key, values):
    for each value in values:
        emit(key, value)

29 / 112

Putting it all together

map(k, v) → [w, x]*
reduce(w, [x]+) → [y, z]*

● Job flow:
– The mapper generates key/value pairs
– These pairs are grouped by the key
– Hadoop calls the reduce function for each group
– The output of the reduce function is the final Job output
● Hadoop will distribute the work
– Different nodes in the cluster will process the data in parallel

30 / 112

Job Execution

Diagram: on each node (Node 1, Node 2), input splits (blocks) feed the map tasks, which produce intermediate data; the reduce tasks consume the intermediate data and write the output.
31 / 112


Job Execution (II)

● Key/value pairs are sorted by key in the shuffle & sorting phase
– That is needed in order to group records by key when calling the
reducer
– It also means that calls to the reduce function are done in key order
● The reduce function with key “A” is always called before the reduce
function with key “B” within the same reduce task
● Reducers start downloading data from the mappers as soon
as possible
– In order to reduce the shuffle & sorting phase time
– Number of reducers can be configured by the programmer

32 / 112

Partial Sort Job

● A job formed with the identity map and the identity reducer
– It just sorts the data by key within each reducer

Input file: D B A B C D E A

Map 1: D B A B
Map 2: C D E A

Intermediate data: Map 1 → (D A) (B B), Map 2 → (D A) (C E)

Reduce 1: A A D D
Reduce 2: B B C E

Output files: one per reducer

33 / 112
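
A small simulation of this partial sort job, as a sketch only. The `partition` function is a hypothetical partitioner chosen so that the result matches the diagram (A and D go to the first reducer, everything else to the second); Hadoop's default partitioner hashes the key instead.

```python
def partial_sort(map_outputs, partition, num_reducers):
    # Route every key emitted by the identity mappers to a reducer.
    reducer_inputs = [[] for _ in range(num_reducers)]
    for split in map_outputs:
        for key in split:
            reducer_inputs[partition(key)].append(key)
    # Each reducer sees its own keys in sorted order.
    return [sorted(keys) for keys in reducer_inputs]


def partition(key):
    # Hypothetical partitioner reproducing the slide.
    return 0 if key in ("A", "D") else 1


splits = [["D", "B", "A", "B"], ["C", "D", "E", "A"]]
outputs = partial_sort(splits, partition, num_reducers=2)
print(outputs)
```

Each output file is sorted, but concatenating them does not give a globally sorted file, hence “partial”.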

Input Splits

● Each map task processes one input split
– Map task starts processing at the first complete record, and finishes
processing the record crossed by the rightmost boundary

Diagram: a file of records divided into Input Splits 1 to 4, each processed by its own Map Task (1 to 4); records can cross split boundaries.
34 / 112

Combiner

● Intermediate data goes from the map tasks to the reduce tasks
through the network
– Network can be saturated
● Combiners can be used to reduce the amount of data sent to
the reducers
– When the operation is commutative and associative
● A combiner is a function similar to the reducer
– But it is executed in the map task, just after all the mapping has been
done
● Combiners can't have side effects
– Because Hadoop can decide to execute them or not
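
A sketch of a combiner-friendly average, under the assumption that each mapper emits `(key, (value, 1))`. Summing `(sum, count)` pairs is commutative and associative, and the combiner's output has the same shape as its input, so the result is the same whether the combiner runs or not. All names here are hypothetical.

```python
from collections import defaultdict


def group(pairs):
    # Group (key, value) pairs by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def combine(key, partials):
    # Add up (sum, count) pairs into a single (sum, count) pair.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return key, (total, count)


def reduce_mean(key, partials):
    _, (total, count) = combine(key, partials)
    return key, total / count


# Output of two map tasks: (key, (value, 1)).
task1 = [("a", (1, 1)), ("a", (3, 1)), ("b", (10, 1))]
task2 = [("a", (5, 1)), ("b", (20, 1))]

# With combiner: each task collapses its own pairs before the shuffle.
combined = [combine(k, v) for task in (task1, task2) for k, v in group(task).items()]
means = dict(reduce_mean(k, v) for k, v in group(combined).items())

# Without combiner: the reducer receives every pair.
no_combiner = dict(reduce_mean(k, v) for k, v in group(task1 + task2).items())
print(means, len(combined), len(task1 + task2))
```

Here four records cross the network instead of five; with millions of values per key the saving is much larger.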

35 / 112

Design patterns

1. Filtering
2. Secondary sorting
3. Distributed execution
4. Computing statistics
5. Count distinct
6. Sorting
7. Joins
8. Reconciliation

37 / 112

Filtering

● Filtering:
● We process the input file in parallel with Hadoop and
emit a smaller dataset in the end

Input data → if(condition) { emit(); } → Output data

38 / 112


Secondary sorting

● Receive reducer values in a specific order
● Moving averages:
– Secondary sort by timestamp
– Fill an in-memory window and perform average.
● Top N items in a group:
– Secondary sort by <X>
– Emit the first N elements in a group
● Useful, yet quite difficult to implement in Hadoop.
Diagram: the key is handled by the Partitioner, the Sort Comparator and the Group Comparator.
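
The “top N items in a group” case can be sketched in Python, assuming records are `(group, score, item)` tuples. A plain `sorted` call stands in for the sort comparator and `itertools.groupby` for the group comparator; the function name is hypothetical and this is not how the Hadoop API looks.

```python
from itertools import groupby, islice


def top_n_per_group(records, n):
    # "Sort comparator": order by group, then by descending score.
    ordered = sorted(records, key=lambda r: (r[0], -r[1]))
    # "Group comparator": only the group field decides where a reduce call starts.
    for group_key, rows in groupby(ordered, key=lambda r: r[0]):
        # Values arrive already sorted, so the first n are the top n.
        yield group_key, [item for _, _, item in islice(rows, n)]


records = [("es", 3, "x"), ("es", 9, "y"), ("fr", 1, "z"), ("es", 5, "w")]
top2 = list(top_n_per_group(records, 2))
print(top2)
```

The reducer never has to hold a whole group in memory, which is the point of secondary sorting.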
40 / 112


Distributed execution without Hadoop

● Distributed queue
– A common queue is needed to coordinate and assign work
● Distributed workers
– Consumers working on each node, getting work from the queue
● Problems:
– Difficult to coordinate
● Failover
● Losing messages
● Load balancing
– Queue must scale

42 / 112

Distributed execution with Hadoop

● Map-only Jobs.
● Use Hadoop just for the sake of “parallelizing something”.
● Anything that doesn't involve a “group by” (no shuffle/reducer)
● Examples:
– Text categorization
– Filtering
– Crawling
– Updating a DB
– Distributed grep
● NLineInputFormat can be handy for that.

Map 1 Map 2 … Map n

43 / 112

Disadvantages

● Work is done in batches
– And the distribution is probably not even
● Some resources are wasted
● There are some tricks to alleviate the problem
– Task timeout + saving remaining work to next execution

44 / 112


Computing statistics (I)

● Count, Sum, Average, Std. Dev...
● Aggregated by something
● Recall SQL: select user, count(clicks) … group by user

user, click user, click Map: emit (user, click)

Reduce by user: count values

user, count(click)

46 / 112

Computing statistics (II)

● When computing sum(), avg(), etc., Combiners are often needed

● Imagine a user performed 3 million clicks
– Then, a reducer will receive 3 million records
– This reducer will be the bottleneck of the Job. Everyone needs to wait
for it to count 3 million things.

● Solution: Perform partial counts in a Combiner

● Combiner is executed before shuffling, after Mapper.

47 / 112

Computing statistics (III)

● Using a Combiner:

user, click user, click Map

user, count(click) user, count(click) Combine

user, sum(count(click)) Reduce

● For each group, reducer aggregates N counts in the worst case! (N =
#mappers)
48 / 112


Distinct

● How to calculate distinct count(something) group by X ?
● It is somewhat easy (2 M/Rs):

M/R 1 (eliminates duplicates):
– emit ({X, something}, null)
– so rows are grouped by ({X, something})
– In the reducer, just emit the first (ignore duplicates)
M/R 2 (groups by X and count):
– For each input X, something → emit (X, 1)
– group by (X)
– The reducer counts incoming values

50 / 112

Distinct: example

Input: (X1, s1), (X1, s1), (X1, s2), (X1, s1), (X1, s2), (X2, s1), (X2, s1), (X2, s3), (X2, s1)
M/R 1 output: (X1, s1), (X1, s2), (X2, s1), (X2, s3)
M/R 2 output: X1 → 2, X2 → 2
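
The two-job distinct count can be simulated on the example data. This is a sketch: `job1_dedup` and `job2_count` are hypothetical names, and a Python set stands in for grouping by the composite key and keeping the first record of each group.

```python
def job1_dedup(records):
    # M/R 1: group by (X, something) and keep one record per group.
    return sorted(set(records))


def job2_count(unique_pairs):
    # M/R 2: group by X and count the remaining records.
    counts = {}
    for x, _something in unique_pairs:
        counts[x] = counts.get(x, 0) + 1
    return counts


records = [
    ("X1", "s1"), ("X1", "s1"), ("X1", "s2"), ("X1", "s1"), ("X1", "s2"),
    ("X2", "s1"), ("X2", "s1"), ("X2", "s3"), ("X2", "s1"),
]
unique_pairs = job1_dedup(records)
distinct_counts = job2_count(unique_pairs)
print(unique_pairs, distinct_counts)
```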

51 / 112

Distinct: Secondary sort

● We can calculate distinct in only one Job
● Using Secondary Sorting

M/R:
– emit ({X, something}, null)
– group by (X), secondary sort by (something)
– The reducer: iterate, count & emit “something, count” when
“something” changes. Reset the counter each “something” change.

● Need to use a Combiner to eliminate duplicates (otherwise
reducer would receive too many records).
● distinct count() is more parallelizable with 2 Jobs than with 1!

52 / 112


Sorting

● We have seen how sorting is (partially) inherent in Hadoop.
● But if we want “pure” sorting:
– Use one Reducer (not scalable)
– Use an advanced partitioning strategy

● Yahoo! TeraSort (http://sortbenchmark.org/Yahoo2009.pdf)
● Use sampling to calculate data distribution
● Implement custom Partitioning according to distribution

54 / 112

Sorting (II)

● Hash partitioning:
0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 ...

● Distribution-aware partitioning:

0 1 2
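
A sketch of distribution-aware partitioning in the spirit of the sampling approach: sample the keys, derive split points, and send each key to the reducer that owns its range. The function names, the sample size and the use of `bisect` are assumptions of this example, not TeraSort's actual implementation.

```python
import bisect
import random


def split_points(sample, num_reducers):
    # Pick num_reducers - 1 boundaries that cut the sample into equal parts.
    ordered = sorted(sample)
    return [ordered[len(ordered) * i // num_reducers] for i in range(1, num_reducers)]


def range_partition(key, points):
    # Index of the key range (and therefore the reducer) that contains the key.
    return bisect.bisect_right(points, key)


def total_sort(keys, num_reducers, sample_size=100, seed=0):
    rng = random.Random(seed)
    sample = rng.sample(keys, min(sample_size, len(keys)))
    points = split_points(sample, num_reducers)
    buckets = [[] for _ in range(num_reducers)]
    for key in keys:
        buckets[range_partition(key, points)].append(key)
    # Each reducer sorts its own range; ranges are already ordered among themselves.
    return [sorted(bucket) for bucket in buckets]


keys = [37, 5, 99, 5, 12, 64, 80, 1, 23, 50]
parts = total_sort(keys, num_reducers=3)
print(parts)
```

Concatenating the reducer outputs in order yields a fully sorted dataset, which hash partitioning cannot guarantee.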

55 / 112


Joins

● Joining two (or more) datasets is a quite common need.
● The difficulty is that both datasets may be “too big”
– Otherwise, an in-memory join can be done quite easily just by loading
one of the datasets into RAM.

● “Big joins” are commonly done “reduce-side”:

Map dataset 1: (K1, d21) Map dataset 2: (K1, d11)
(K2, d22) (K2, d12)

Reduce by common key (K1, K2,...)

K1 → d11, d21
Reduce: Join
K2 → d12, d22

● The so-called “map-side joins” are more complex and tricky.

57 / 112

Joins: 1-N relation

● Use secondary sorting to receive the “one” side of the relation
first
– Otherwise you need to use memory to perform the join
● Does not scale
● Employee (E) – Sales join (S)

SSESSS You need to use memory

ESSSSS Memory not needed

58 / 112

Left – Right – Inner joins

● Join between Employee and Sales

reducer(key, values):
    employee = null
    first = first(values)
    rest = rest(values)
    if isEmployee(first):
        employee = first

    if employee == null:
        // right join: SSSSS (sales without employee)
    else if size(rest) == 0:
        // left join: E (employee without sales)
    else:
        // inner join: ESSSSS
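
The reducer above can be sketched as a streaming Python generator. It assumes values arrive as `(tag, payload)` pairs with the employee (tag `"E"`) first when present, which is what the secondary sort guarantees; the tags, the `how` parameter and the function name are hypothetical.

```python
from itertools import chain


def join_reducer(key, tagged_values, how="inner"):
    values = iter(tagged_values)
    first = next(values)
    if first[0] == "E":
        # The "one" side arrived first: keep only this record in memory.
        employee, sales = first[1], values
    else:
        # No employee for this key: put the first sale back in front.
        employee, sales = None, chain([first], values)

    matched = False
    for _tag, sale in sales:
        matched = True
        if employee is not None or how == "right":
            yield key, employee, sale
    if employee is not None and not matched and how == "left":
        yield key, employee, None


rows = list(join_reducer(1, [("E", "ann"), ("S", 10), ("S", 20)]))
print(rows)
```

Sales are streamed one at a time, so a key with millions of sales never has to fit in memory.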

59 / 112


Reconciliation

● Hadoop can be used to “simulate a database”.
● For that, we need to:
– Merge data state s(t) with past state s(t-1) using a Join.
– Update rows with the same ID (performing whatever logic).
– Store the result as the next full data state.
– Rotate states:
● s(t-1) = s(t)
● s(t) = s(t+1)

s(t-1), s(t) → M/R → s(t+1)
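
One reconciliation step could be sketched as follows, with in-memory dictionaries standing in for the datasets stored in HDFS. The `reconcile` function, the `merge` callback and the sample rows are hypothetical; in a real job the join by ID would be a reduce-side join.

```python
def reconcile(previous_state, updates, merge):
    # Join both datasets by row ID and build the next full state.
    next_state = dict(previous_state)
    for row_id, row in updates.items():
        if row_id in next_state:
            # Same ID on both sides: apply the update logic.
            next_state[row_id] = merge(next_state[row_id], row)
        else:
            # New ID: insert the row as it is.
            next_state[row_id] = row
    return next_state


previous = {1: {"price": 10}, 2: {"price": 20}}
updates = {2: {"price": 25}, 3: {"price": 30}}
next_state = reconcile(previous, updates, merge=lambda old, new: {**old, **new})
print(next_state)
```

The previous state is left untouched, so it can be kept as a backup until the rotation is done.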

61 / 112

Real-life Hadoop projects

● 95% of real-life Hadoop projects are a mixture of the patterns we
just saw.
● Example: A vertical search-engine.
– Distributed execution: Feed crawl / parse
– Data reconciliation: Merge data by listing ID
– Join: Augment listings with geographical info
– …
● Example: Payments data stats.
– Secondary sort: weekly, daily & monthly stats
– Distributed execution: Random-updates to a DB
● ...

62 / 112

Real-life MapReduce

1. Data analytics
2. Crawling
3. Full-text indexing
4. Reputation systems
5. Data Mining

64 / 112

Data analytics

● Obvious use case for MapReduce.
● Examples: calculate unique visits per page.
– Top products per month.
– Unique clicks per banner.
– Etc.
● Offline analytics (Batch-oriented).
– Online analytics not a good fit for MapReduce.

65 / 112

Data analytics: How it works

● A batch process that uses all historical data.
– Recompute everything always.
– Easier to manage and maintain than incremental computation.
● A chain of MapReduce steps produces the final output.
● There are tools that ease building / maintaining the MapReduce
chain:
– Hive, Pig, Cascading, Pangool for programming a MapReduce flow
easily.
– Oozie, Azkaban for connecting existing MapReduce jobs easily.
● Scheduling flows and such.

66 / 112

Data analytics: Difficulties

● Some things are harder to calculate than others.
● Calculating unique visits per page.
– A simple solution in two MapReduce steps or a more sophisticated one
in a single MapReduce step.
– Approximated methods can be used as well.
● Calculating the median.
– Need to sort all the dataset and iterate twice if we don’t know the
number of elements.

67 / 112

Data analytics: Examples

● Gather clicks on pages.
– Save (click, page, timestamp) in the HDFS.
● A MapReduce job groups by page and counts the number of
clicks:
● map: emit(page, click).
● reduce: (page, list<click>) emits (page, totalClicks).
● We now have the total number of clicks per page.

68 / 112

Data analytics: Examples (II)

● Another MapReduce job groups by day and page and counts
the number of clicks:
● map: emit((page, day), click).
● reduce: Same as before.
● We now have the total number of clicks per page and day.
● These are simple examples, but data analytics can get as
sophisticated as you want.
– Example: calculate a 10 bar histogram of the distribution of clicks over
the hours of the day for each page.

69 / 112


Crawling: Web Crawling

● Web Crawling:
– “A Web crawler is a computer program that browses the World Wide
Web in a methodical, automated manner or in an orderly fashion.”
● Applications:
– Search engines.
– NLP (Sentiment analysis).
● Examples:

71 / 112


Crawling: Web Crawling (at scale)

● How to parallelize storage and bandwidth?
● How to deduplicate stored URLs?
● Other complexities: politeness, infinite loops, robots.txt,
canonical URLs, pagination, parsing, …
● Relevancy: Pagerank.

73 / 112

Crawling: Nutch

● What is Nutch?
– Open source web-search software project.
– Apache project.
– Hadoop, Tika, Lucene, SOLR.

● (Brief) history
– Started in 2002/2003
– 2005: MapReduce
– 2006: Hadoop
– 2006/2007: Tika
– 2010: Apache top-level project (TLP)

74 / 112

Crawling: Nutch: How it works

● “Select, Crawl, Parse, Dedup by URL” loop.
● Lucene, SOLR for indexing.
– We will see them later.
● CrawlDB: Pages are saved in HDFS.
● MapReduce makes storage and computing scalable.
– Helps in deduplicating pages by URL.
– Helps in identifying new pages to crawl.

75 / 112

Crawling: Not-Only Web Crawling

● Custom crawlers:
– Tweets.
– XML feeds.
● Simpler, as we usually don’t need to traverse a tree.
– Sometimes only crawling a fixed seed of resources is enough.
● Applications
– Vertical search engines.
– Reputation systems.

76 / 112

Crawling: Example: Crawling tweets at scale

● Use a scalable computing engine for fetching tweets.
– Storm is a good fit.
– Hadoop can be used as well.
● Tricky usage of MapReduce: Create as many groups as crawlers
and embed a Crawler in them.
● Save raw feed data (JSON) in HDFS.
● MapReduce: Parse JSON tweets.
● MapReduce: Deduplicate tweets.
● MapReduce: Analyze tweets and perform data analysis.

77 / 112

Crawling: Example: Crawling tweets at scale

HDFS → M/R Parse → M/R Dedup → M/R Analysis → Results

78 / 112



Full-text indexing: Definitions

● Search engine:
– An information retrieval system designed to help find information
stored on a computer system.
● Inverted index:
– Index data structure storing a mapping from content, such as words or
numbers, to its locations in a database file, or in a document or a set of
documents.
– B-Trees are not inverted indexes!
● Stemming.
● Relevancy in results.

80 / 112

Full-text indexing: Applications

● Web search engines
– Finding relevant pages for a topic
● Vertical search engines
– Finding jobs by description
● Social networks
– Finding messages by text
● e-Commerce
– Finding articles by description
● In general, any service or application needing efficient text
information retrieval

81 / 112

Full-text indexing (at scale)

● Real-time indexing versus batch-indexing:
– The first is cool: it is real-time, but it is difficult. We will not address it
now.
– The second is not real-time, but it is simpler.
● How to batch-index a big corpus dataset?
– Need a scalable storage, (HDFS).
● How to deduplicate documents?
– MapReduce to the rescue (like we saw before).
● How to generate multiple indexes?
– MapReduce can help (we will see how).

82 / 112

Full-text indexing: MapReduce

● MapReduce can be used to generate an inverted index.
– Vertical partitioning vs. horizontal partitioning.
● Example:
– Map: emit(word, docId)
– Reduce: emit(word, list<docIds>)
● Quite simple. But what about stop words, stemming, etc?
● How to store the index?
● Better not to reinvent the wheel.
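
To make the map and reduce steps concrete, here is a toy inverted index in Python. It is a sketch only: it has no stop words, stemming or on-disk format, which is exactly what Lucene adds. The function name and the sample documents are made up.

```python
from collections import defaultdict


def build_inverted_index(docs):
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Map: emit(word, docId) for every word of the document.
        for word in text.lower().split():
            postings[word].add(doc_id)
    # Reduce: emit(word, list<docIds>), with duplicates removed.
    return {word: sorted(ids) for word, ids in sorted(postings.items())}


docs = {1: "big data", 2: "big index", 3: "data data"}
index = build_inverted_index(docs)
print(index)
```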

83 / 112

Full-text indexing: Lucene / SOLR

● Lucene: created by Doug Cutting
– From Nutch.
– Mainstream open-source implementation of an inverted index.
– Efficient disk allocation, highly performant.
● SOLR: Mainstream open-source search server.
– Provides stemming, analyzers, HTTP servlets, etc.
– Lacks some other desirable properties:
● Elasticity, real-time indexing, horizontal partitioning (although work
in progress).
● Still the reference technology for creating and serving inverted
indexes.

84 / 112

Full-text indexing: MapReduce meets SOLR

● We can use MapReduce for scaling the indexing process.
● At the same time, we can use SOLR for creating the resulting
index.
– SOLR is used as-a-library.
● Generated indexes are later deployed to the search servers.

85 / 112

Full-text indexing: Example

● A vertical job search engine.
● Jobs are parsed from crawled feeds and saved in the HDFS.
● MapReduce for deduplicating job offers.
– map: emit(jobId, job)
– reduce (jobId, list<job>) -> emit (jobId, job)
● Retention policy: keep latest job.

86 / 112

Full-text indexing: Example (II)

● MapReduce for augmenting job information (adding
geographical information).
– map1: emit(job.city, job)
– map2: emit(city, geoInfo)
– reduce (job.city, list<job>)(city, geoInfo) -> for all jobs emit(job, geoInfo)
● MapReduce for distributing the index process:
– map: emit(job.country, job)
– reduce: (job.country, list<job>) -> Create index for country “job.country”
using SOLR.
● Deploy per-country indexes to search cluster.

87 / 112

Full-text indexing: Example

XML feeds → HDFS → M/R Parse → M/R Dedup → M/R Geo info (joined with Geo info) → M/R Index → Indexes → Deploy → Search Cluster
88 / 112


Reputation: Definitions

● What is reputation?
● Reputation in social communities.
– eBay, StackOverflow...
● Reputation in social media.
– Twitter, Facebook...
● Why is it important?

90 / 112

Reputation: Relationships

● Modelling relationships is needed for calculating reputation
● Graph-like models arise
● Usually stored as edges
– A interacts with B
– or, A trusts B
– or, A → B
● Internet-scale graphs can be stored in HDFS
– Each edge in a row
– Add needed metadata to edges: date, etc.

91 / 112

Reputation: MapReduce analysis on edges

● All people whom “A” interacted with.
– map: (a, b)
– reduce: (a, list<b>).
● Essentially things like PageRank can be very easily
implemented.
● PageRank as a measure of page relevancy from page inlinks.
– But it can be extrapolated to any kind of authority and trustiness metric.
– E.g. People relevancy from social networks.

92 / 112

Reputation: Going deeper on graphs

● Friends of friends of friends.
– 1 MapReduce step: My friends.
– 2 MapReduce steps: Friends of my friends.
– 3 MapReduce steps: Friends of friends of my friends.
● Iterative MapReduce solves it.
● But there are better foundational models such as Google’s
Pregel.
– Exploiting data locality in graphs.
– Apache Giraph.
– Apache Hama.

93 / 112

Reputation: Difficulties

● Sometimes multiple MapReduce steps are needed for
calculating a final metric.
– Because data doesn’t fit in memory.
● Intermediate relations need to be calculated.
– And later filtered out.
● “Polynomial effect”: Calculate all pairwise relations in a set:
N*(N-1)/2
– Possible bottleneck.

94 / 112

Reputation: Difficulties: Data imbalance

● When grouping by something, some groups may be much
bigger than others.
– Causing “data imbalance”.
● Data imbalance in MapReduce is a big problem.
– Some machines will finish quickly while one will be busy for hours.
● Inefficient usage of resources.
● Data processing doesn’t scale linearly anymore.
– Next MapReduce step can’t start as long as current one hasn’t yet
finished.

95 / 112

Reputation: Example

● Input: Tweets.
● Distributed crawling for fetching the tweets.
– Save them in the HDFS.
● Parse the tweets. Define the graph of trustiness.
– A trusts B if A follows B.
● Execute PageRank over the graph.
– Spreads trustiness to all vertices.
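
A compact, single-machine sketch of PageRank over a “follows” graph. The damping factor of 0.85, the fixed number of iterations and the handling of users who follow nobody (their rank is spread evenly) are assumptions of this example, and the graph is invented. In MapReduce, each iteration would be one job: the map sends rank shares along the edges and the reduce sums them per vertex.

```python
def pagerank(graph, damping=0.85, iterations=50):
    # graph maps each user to the list of users they follow (trust).
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        received = {node: 0.0 for node in nodes}
        for source in nodes:
            # Users who follow nobody spread their rank over everyone.
            targets = graph.get(source) or nodes
            share = rank[source] / len(targets)
            for target in targets:
                received[target] += share
        rank = {
            node: (1 - damping) / len(nodes) + damping * received[node]
            for node in nodes
        }
    return rank


follows = {"A": ["B"], "C": ["B"], "D": ["B"], "B": ["A"]}
ranks = pagerank(follows)
print(max(ranks, key=ranks.get))
```

B is followed by everyone else and ends up with the highest score; A ranks second because B, the most trusted user, follows A.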

96 / 112

Reputation: Example

HDFS → M/R Parse → Graph of trustiness (vertices A, B, C, D) → M/R PageRank → Results
97 / 112



Data mining: Text classification

● Document classification
– Documents are texts in this case.
● Assigns a text one or more categories.
– Binary classifiers versus multi-category classifiers
– Multi-category classifiers can be built from multiple binary classifiers
● Two steps: generating the model and classifying.

99 / 112

Data mining: Text classification: Steps

● Generating the model
– The resultant model may or may not fit in memory.
● Let’s assume the final model fits in memory.
– Use a large dataset for generating the model.
● MapReduce helps scale the model generation process.
● Example: Build multiple binary classifiers -> parallelize by classifier.
● Example: Calculate conditional probabilities of a Bayesian model.
Parallelize by word (like in the WordCount example).
● Classifying
– MapReduce also helps in classifying a large dataset.
● If model fits in memory, parallelize documents to classify and load the
model in memory.
– Batch-classifying: parallelize documents to classify. Output is the set of
documents with the assigned categories.

100 / 112

Data mining: Others

● Mahout library for other data mining problems.
– Clustering, logistic regression, etc.
● Recommendation algorithms.
– Many recommendation algorithms are based on calculating
correlations.
– Calculating correlations in parallel with MapReduce is easy.
● Remember: always in the “batch” or “offline” domain.
– Recommendations are reloaded after batch process finishes.

101 / 112

Tuple MapReduce
Pere Ferrera, Ivan de Prado, Eric Palacios, Jose Luis Fernandez-Marquez, Giovanna Di Marzo Serugendo:
Tuple MapReduce: Beyond classic MapReduce. In ICDM 2012: Proceedings of the IEEE International
Conference on Data Mining (To appear) (10.7% Acceptance rate)

Common MapReduce problems

● Lack of compound records
– By default, key & value are considered monolithic entities.
● In real life, this case is rare.
– Alleviated by some serialization libraries (Thrift, Protocol Buffers)
● Sorting within a group
– The MapReduce foundation does not support it
● Although MapReduce implementations overcome this problem with
“tricks”
● Joins
– Need compound records and sorting within a group to be implemented
– Not directly supported by MapReduce

103 / 112

Tuple Map Reduce: rationale

● Compound records, sorting within a group and joins are design
patterns that arise in most MapReduce applications...
● … but MapReduce does not make the implementation easy
● An evolution of MapReduce paradigm is needed to cover these
design patterns:

Tuple MapReduce

104 / 112

Tuple MapReduce

● We extended the classic (key, value) MapReduce model
● Use n-sized Tuples instead of (key, value)
● Define a Tuple-based M/R
– Covering most common use cases

105 / 112

Group by / Sort by

● You can think of a M/R as a SELECT … GROUP BY …
● With Tuple MapReduce, you simply “group by” a subset of
Tuple fields
– Easier, more intuitive than having objects for each kind of Key you
want to group by.
● Alternatively, you may “sort by” a wider subset
– Hiding all complex logic behind secondary sort

106 / 112

Tuple-Join MapReduce

● Extend the whole idea to allow for easier joins

Tuple1: (a,b,c,d) Tuple2: (a,b,f,g,h)

Join by (a,b)

● Formally speaking:

107 / 112

Pangool

http://pangool.net

108 / 112


Pangool: What?

● Better, simpler, powerful API replacement for Hadoop's API
● What do we mean by API replacement?
– APIs on top of Hadoop: Pig, Hive, Cascading.
– Using them always comes with a tradeoff.
– Paradigms other than MapReduce, not always the best choice.
– Performance restrictions.

● Pangool is still MapReduce, low-level and high performing
– Yet a lot simpler!

109 / 112

Pangool: Why?

● Hadoop has a steep learning curve
● Default API is too low-level
● Making things efficient is hard (binary comparisons, ser/de...)
● There are some common patterns (joins, secondary sorting...)

How can we make the common patterns simpler without losing flexibility and power?

110 / 112

Pangool API

● Schema, Tuple, …
● Reducers, Mappers, etc are instances instead of static classes
– Easier to configure them: new MyReducer(5, 2.0);
● Still tied to Hadoop's particularities in some ways
– NullWritable, etc

● Let's see an example

111 / 112

Thanks!!

Iván de Prado Alonso
Pere Ferrera Bertran
