1. Big Data, MapReduce and beyond
Iván de Prado Alonso // @ivanprado
Pere Ferrera Bertran // @ferrerabertran
@datasalt
2. Outline
Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent.
1. Big Data. What and why
2. MapReduce & Hadoop
3. MapReduce Design Patterns
4. Real-life MapReduce
   1. Data analytics
   2. Crawling
   3. Full-text indexing
   4. Reputation systems
   5. Data Mining
5. Tuple MapReduce & Pangool
4. In the past...
● Data and computation fit on one monolithic machine
● Monolithic databases: RDBMS
● Scalability:
– Vertical: buy better hardware
● Distributed systems
– Not very common
– Logic centric: data moves to where the logic is
● Distributed storage: SAN
5. Distributed systems are hard
● Building distributed systems is hard
– If you can scale vertically at a reasonable cost, why deal with
the complexity of distributed systems?
● But circumstances are changing:
– Big Data
● Big data refers to the massive amounts of data that are difficult
to analyze and handle using common database management
tools
6. BIG DATA “MAC”
7. Big Data
● Data is the new bottleneck
– Web data
● Web pages
● Interaction Logs
– Social networks data
– Mobile devices
– Data generated by Sensors
● Old systems/techniques are no longer appropriate
● A new approach is needed
8. Big Data project parts
● Acquiring
● Processing
● Serving
9. Acquiring
● Gathering/receiving/storing data from sources
● Many kinds of sources
– Internet
– Sensors
– User behavior
– Mobile devices
– Health care data
– Banking data
– Social Networks
– …..
10. Processing
● Data is present in the system (acquired)
● This step is responsible for extracting value from the data
– Eliminate duplicates
– Infer relations
– Calculate statistics
– Correlate information
– Ensure quality
– Generate recommendations
– ….
11. Serving
● In most cases, some interface has to be provided to access
the processed information
● Possibilities
– Big Data / No Big Data
– Real time access to results / non real time access
● Some examples:
– Search engine → inverted index
– Banking data → relational database
– Social Network → NoSQL database
12. Big Data system types
● Offline
– Latency is not a problem
● Online
– Response immediacy is important
● Mixed
– Online behavior, but internally a mixture of two systems
● One online
● One offline
Offline              Online
MapReduce            NoSQL
Hadoop               Search engines
Distributed RDBMS
13. Big Data Systems types II
[Diagram: offline, online and mixed systems drawn as combinations of A, P and S blocks (presumably the Acquiring, Processing and Serving parts)]
15. “Swiss army knife of the 21st century”
Media Guardian Innovation Awards
http://www.guardian.co.uk/technology/2011/mar/25/media-guardian-innovation-awards-apache-hadoop
16. History
● 2004-2006
– GFS and MapReduce papers published by Google
– Doug Cutting implements an open source version for Nutch
● 2006-2008
– Hadoop project becomes independent from Nutch
– Web scale reached in 2008
● 2008-now
– Hadoop becomes popular and is commercially exploited
Source: Hadoop: a brief history. Doug Cutting
17. Hadoop
“The Apache Hadoop software library is a framework that allows for
the distributed processing of large data sets across clusters of
computers using a simple programming model”
From Apache Hadoop page
18. Main ideas
● Distributed
– Distributed storage
– Distributed computation platform
● Built to be fault tolerant
● Shared nothing architecture
● Programmer isolation from distributed system difficulties
– By providing simple programming primitives
19. Hadoop Distributed File System (HDFS)
● Distributed
– Aggregates the individual storage of each node
● Files formed by blocks
– Typically 64 or 128 MB (configurable)
– Stored in the OS filesystem: XFS, ext3, etc.
● Fault tolerant
– Blocks replicated more than once
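As a rough illustration of the points above, the following Python sketch splits a file into fixed-size blocks and assigns each block to several nodes. The round-robin placement, the function name `place_blocks` and the node names are assumptions made for this example only; real HDFS applies its own placement policy.

```python
import math

def place_blocks(file_size, block_size, nodes, replication=2):
    """Return {block_index: [node, ...]} using a simple round-robin placement."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    n_blocks = math.ceil(file_size / block_size)
    placement = {}
    for block in range(n_blocks):
        # Each replica of a block goes to a different node.
        placement[block] = [nodes[(block + r) % len(nodes)] for r in range(replication)]
    return placement

MB = 1024 * 1024
layout = place_blocks(200 * MB, 64 * MB, ["DN1", "DN2", "DN3", "DN4"])
```

With two replicas per block, losing any single node still leaves one copy of every block available.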
20. How files are stored
NameNode: Data.txt
Block   Stored on
1       DN1, DN2
2       DN1, DN4
3       DN2, DN3
4       DN4, DN3
DataNode 1 (DN1): blocks 1, 2
DataNode 2 (DN2): blocks 1, 3
DataNode 3 (DN3): blocks 3, 4
DataNode 4 (DN4): blocks 2, 4
21. MapReduce
● MapReduce is the abstraction behind Hadoop
● The unit of execution is the Job
● A Job has
– An input
– An output
– A map function
– A reduce function
● Input and output are sequences of key/value pairs
● The map and reduce functions are provided by the developer
– The execution is distributed and parallelized by Hadoop
22. Job phases
● Two different phases: mapping and reducing
● Mapping phase
– Map function is applied to Input data
● Intermediate data is generated
● Reducing phase
– Reduce function is applied to intermediate data
● Final output is generated
23. MapReduce
● Two functions (Map & Reduce)
– Map(u, v) : [w,x]*
– Reduce(w, x*) : [y, z]*
● Example: word count
– Map([document, null]) -> [word, 1]*
– Reduce(word, 1*) -> [word, total]
● MapReduce & SQL
– SELECT word, count(*) GROUP BY word
● Distributed execution in a cluster
– Horizontal scalability
24. Word Count
Input:
This is a line
Also this
Map:
map(“This is a line”) = this, 1; is, 1; a, 1; line, 1
map(“Also this”) = also, 1; this, 1
Reduce:
reduce(a, {1}) = a, 1
reduce(also, {1}) = also, 1
reduce(is, {1}) = is, 1
reduce(line, {1}) = line, 1
reduce(this, {1, 1}) = this, 2
Result:
a, 1
also, 1
is, 1
line, 1
this, 2
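The word count example can be reproduced on one machine with a minimal sketch. The helper `run_job` is a hypothetical stand-in for the framework (it groups by key in memory and calls the reducer in key order), not the Hadoop API.

```python
from collections import defaultdict

def map_fn(_key, line):
    # Emit (word, 1) for every word in the line.
    for word in line.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    # Sum all the 1s received for this word.
    yield word, sum(counts)

def run_job(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            groups[k].append(v)          # "shuffle": group values by key
    output = []
    for k in sorted(groups):             # reducers are called in key order
        output.extend(reduce_fn(k, groups[k]))
    return output

lines = [(0, "This is a line"), (1, "Also this")]
result = run_job(lines, map_fn, reduce_fn)
```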
25. Map examples
● Swap Mapper
– Swaps key and value
map(key, value):
    emit(value, key)
● Split Key Mapper
– Splits the key into words and emits a pair for each word
map(key, value):
    words = key.split(" ")
    for each word in words:
        emit(word, value)
26. Map examples (II)
● Filter Mapper
– Filters out some records
map(key, value):
    if (key <> "the"):
        emit(key, value)
● Key/value concatenation mapper
– Concatenates the key and the value into the key
map(key, value):
    emit(key + " " + value, null)
27. Reduce examples
● Count reducer
– Counts the number of elements for each key
reduce(key, values):
    count = 0
    for each value in values:
        count++
    emit(key, count)
● Average reducer
– Computes the average value for each key
reduce(key, values):
    count = 0
    total = 0
    for each value in values:
        count++
        total += value
    emit(key, total / count)
28. Reduce examples (II)
● Keep first reducer
– Keeps the first key/value input pair
reduce(key, values):
    emit(key, first(values))
● Value concatenation reducer
– Concatenates the values into one string
reduce(key, values):
    result = ""
    for each value in values:
        result += " " + value
    emit(key, result)
29. Identity map and reduce
● The identity functions are those that keep the input unchanged
– Map identity
map(key, value):
    emit(key, value)
– Reduce identity
reduce(key, values):
    for each value in values:
        emit(key, value)
30. Putting all together
map(k, v) → [w, x]*
reduce(w, [x]+) → [y, z]*
● Job flow:
– The mapper generates key/value pairs
– These pairs are grouped by key
– Hadoop calls the reduce function for each group
– The output of the reduce function is the final Job output
● Hadoop will distribute the work
– Different nodes in the cluster will process the data in parallel
31. Job Execution
[Diagram: on Node 1 and Node 2, input splits (blocks) feed the map tasks, which produce intermediate data that the reduce tasks turn into the output]
32. Job Execution (II)
● Key/value pairs are sorted by key in the shuffle & sorting phase
– That is needed in order to group records by key when calling the
reducer
– It also means that calls to the reduce function are made in key order
● The reduce function for key “A” is always called before the reduce
function for key “B” within the same reduce task
● Reducers start downloading data from the mappers as soon
as possible
– In order to reduce the shuffle & sorting phase time
– Number of reducers can be configured by the programmer
33. Partial Sort Job
● A job formed with the identity map and the identity reducer
– It just sorts the data by key within each reducer
Input file:          D B A B (Map 1)       C D E A (Map 2)
Intermediate data:   D A | B B             D A | C E
Output files:        A A D D (Reduce 1)    B B C E (Reduce 2)
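The partial sort can be simulated in a few lines. The `partition` function below is an assumption chosen only to reproduce the slide's example (keys A and D go to the first reducer, everything else to the second); Hadoop's default partitioner hashes the key instead.

```python
def partial_sort(map_inputs, partition, num_reducers):
    reducer_inputs = [[] for _ in range(num_reducers)]
    for records in map_inputs:              # one list of keys per map task
        for key in records:                 # identity map: pass the key through
            reducer_inputs[partition(key)].append(key)
    # Each reducer sees its keys in sorted order; identity reduce keeps them.
    return [sorted(keys) for keys in reducer_inputs]

def partition(key):
    return 0 if key in ("A", "D") else 1

outputs = partial_sort([list("DBAB"), list("CDEA")], partition, 2)
```

Each output file is sorted, but the two files together are not globally sorted, which is why the job is called a partial sort.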
34. Input Splits
● Each map task processes one input split
– A map task starts processing at the first complete record, and finishes
processing the record crossed by the rightmost boundary
[Diagram: a file of records divided into Input Splits 1 to 4, each processed by its own Map Task 1 to 4]
35. Combiner
● Intermediate data goes from the map tasks to the reduce tasks
through the network
– Network can be saturated
● Combiners can be used to reduce the amount of data sent to
the reducers
– When the operation is commutative and associative
● A combiner is a function similar to the reducer
– But it is executed in the map task, just after all the mapping has been
done
● Combiners can't have side effects
– Because Hadoop can decide to execute them or not
37. Design patterns
1. Filtering
2. Secondary sorting
3. Distributed execution
4. Computing statistics
5. Count distinct
6. Sorting
7. Joins
8. Reconciliation
38. Filtering
● Filtering:
– We process the input file in parallel with Hadoop and emit a smaller dataset in the end
Input data → if(condition) { emit(); } → Output data
40. Secondary sorting
● Receive reducer values in a specific order
● Moving averages:
– Secondary sort by timestamp
– Fill an in-memory window and perform average.
● Top N items in a group:
– Secondary sort by <X>
– Emit the first N elements in a group
● Useful, yet quite difficult to implement in Hadoop.
[Diagram: the Key is handled by three components: Sort Comparator, Group Comparator and Partitioner]
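The “top N items in a group” case can be sketched locally as follows. The sort key plays the role of the sort comparator and `groupby` plays the role of the group comparator; this is an in-memory imitation of the idea under those assumptions, not Hadoop code, and the record layout `(group, score)` is hypothetical.

```python
from itertools import groupby, islice
from operator import itemgetter

def top_n_per_group(records, n):
    # Sort comparator: order by group, then by descending score.
    ordered = sorted(records, key=lambda r: (r[0], -r[1]))
    result = {}
    # Group comparator: consecutive records with the same group form one reduce call.
    for group, rows in groupby(ordered, key=itemgetter(0)):
        # The values already arrive sorted, so the first n are the top n.
        result[group] = [score for _, score in islice(rows, n)]
    return result

sales = [("shop1", 5), ("shop2", 7), ("shop1", 9), ("shop1", 1), ("shop2", 3)]
top2 = top_n_per_group(sales, 2)
```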
42. Distributed execution without Hadoop
● Distributed queue
– A common queue is needed to coordinate and assign work
● Distributed workers
– Consumers working on each node, getting work from the queue
● Problems:
– Difficult to coordinate
● Failover
● Losing messages
● Load balancing
– Queue must scale
43. Distributed execution with Hadoop
● Map-only Jobs.
● Use Hadoop just for the sake of “parallelizing something”.
● Anything that doesn't involve a “group by” (no shuffle/reducer)
● Examples:
– Text categorization
– Filtering
– Crawling
– Updating a DB
– Distributed grep
[Diagram: independent tasks Map 1, Map 2, … Map n]
● NLineInputFormat can be handy for that.
44. Disadvantages
● Work is done in batches
– And the distribution is probably not even
● Some resources are wasted
● There are some tricks to alleviate the problem
– Task timeout + saving remaining work to next execution
46. Computing statistics (I)
● Count, Sum, Average, Std. Dev...
● Aggregated by something
● Recall SQL: select user, count(clicks) … group by user
Map: (user, click), (user, click), … → emit (user, click)
Reduce by user: count values → (user, count(click))
47. Computing statistics (II)
● When computing sum(), avg(), etc., Combiners are often needed
● Imagine a user performed 3 million clicks
– Then, a reducer will receive 3 million records
– This reducer will be the bottleneck of the Job. Everyone needs to wait
for it to count 3 million things.
● Solution: Perform partial counts in a Combiner
● Combiner is executed before shuffling, after Mapper.
48. Computing statistics (III)
● Using a Combiner:
Map:      (user, click)          (user, click)
Combine:  (user, count(click))   (user, count(click))
Reduce:   (user, sum(count(click)))
● For each group, the reducer aggregates N counts in the worst case! (N =
#mappers)
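A minimal sketch of the partial counts, assuming each input split is a list of hypothetical `(user, click)` pairs. Here `map_task` stands for the mapper plus its combiner, and `reduce_counts` for the reducer.

```python
from collections import Counter

def map_task(clicks):
    """Map + combine: count clicks per user inside one map task."""
    return Counter(user for user, _click in clicks)

def reduce_counts(partial_counts):
    """Reduce: sum the partial counts produced by every map task."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)       # Counter.update adds counts together
    return dict(total)

split1 = [("ann", "c1"), ("bob", "c2"), ("ann", "c3")]
split2 = [("ann", "c4"), ("bob", "c5")]
totals = reduce_counts([map_task(split1), map_task(split2)])
```

The reducer receives one partial count per user and map task instead of one record per click.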
50. Distinct
● How to calculate count(distinct something) group by X?
● It is somewhat easy (2 M/Rs):
M/R 1 (eliminates duplicates):
– emit ({X, something}, null)
– so rows are grouped by ({X, something})
– In the reducer, just emit the first (ignore duplicates)
M/R 2 (groups by X and count):
– For each input X, something → emit (X, 1)
– group by (X)
– The reducer counts incoming values
51. Distinct: example
Input: (X1, s1), (X1, s1), (X1, s2), (X1, s1), (X1, s2), (X2, s1), (X2, s1), (X2, s3), (X2, s1)
M/R 1 output: (X1, s1), (X1, s2), (X2, s1), (X2, s3)
M/R 2 output: X1 → 2, X2 → 2
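The two jobs can be sketched as two plain functions applied one after the other. The function names are hypothetical, and a `set` stands in for the group-by-and-keep-first step of the first job.

```python
from collections import defaultdict

def job1_dedup(pairs):
    """M/R 1: group by (X, something) and keep one row per group."""
    return sorted(set(pairs))

def job2_count(unique_pairs):
    """M/R 2: emit (X, 1) for each unique row and sum per X."""
    counts = defaultdict(int)
    for x, _something in unique_pairs:
        counts[x] += 1
    return dict(counts)

rows = [("X1", "s1"), ("X1", "s1"), ("X1", "s2"), ("X1", "s1"), ("X1", "s2"),
        ("X2", "s1"), ("X2", "s1"), ("X2", "s3"), ("X2", "s1")]
distinct_counts = job2_count(job1_dedup(rows))
```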
52. Distinct: Secondary sort
● We can calculate distinct in only one Job
● Using Secondary Sorting
M/R:
– emit ({X, something}, null)
– group by (X), secondary sort by (something)
– The reducer: iterate, count & emit “something, count” when
“something” changes. Reset the counter each “something” change.
● Need to use a Combiner to eliminate duplicates (otherwise
reducer would receive too many records).
● distinct count() is more parallelizable with 2 Jobs than with 1!
54. Sorting
● We have seen how sorting is (partially) inherent in Hadoop.
● But if we want “pure” sorting:
– Use one Reducer (not scalable)
– Use an advanced partitioning strategy
● Yahoo! TeraSort (http://sortbenchmark.org/Yahoo2009.pdf)
● Use sampling to calculate data distribution
● Implement custom Partitioning according to distribution
55. Sorting (II)
● Hash partitioning:
0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 ...
● Distribution-aware partitioning:
0 1 2
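The difference between the two strategies can be sketched as follows. The sampling size, the helper names and the boundary selection are assumptions for illustration; this is not the TeraSort implementation.

```python
import bisect
import random

def build_split_points(sample, num_reducers):
    """Pick num_reducers - 1 boundaries from a sorted sample of the keys."""
    ordered = sorted(sample)
    step = len(ordered) / num_reducers
    return [ordered[int(step * i)] for i in range(1, num_reducers)]

def range_partition(key, split_points):
    """Distribution-aware: keys below the first boundary go to reducer 0, and so on."""
    return bisect.bisect_right(split_points, key)

def hash_partition(key, num_reducers):
    """Hash partitioning: neighbouring keys end up on different reducers."""
    return hash(key) % num_reducers

random.seed(0)
keys = [random.randint(0, 999) for _ in range(1000)]
splits = build_split_points(random.sample(keys, 100), 3)
buckets = [[] for _ in range(3)]
for k in keys:
    buckets[range_partition(k, splits)].append(k)
# Sorting each reducer's bucket and concatenating gives a total order.
globally_sorted = [k for bucket in buckets for k in sorted(bucket)]
```

Because every key in reducer i is smaller than every key in reducer i + 1, the concatenated output files are fully sorted, while the sampled boundaries keep the reducers roughly balanced.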
57. Joins
● Joining two (or more) datasets is a quite common need.
● The difficulty is that both datasets may be “too big”
– Otherwise, an in-memory join can be done quite easily just by loading
one of the datasets into RAM.
● “Big joins” are commonly done “reduce-side”:
Map dataset 1: (K1, d11), (K2, d12)
Map dataset 2: (K1, d21), (K2, d22)
Reduce by common key (K1, K2, ...)
Reduce (join): K1 → d11, d21
               K2 → d12, d22
● The so-called “map-side joins” are more complex and tricky.
58. Joins: 1-N relation
● Use secondary sorting to receive the “one” side of the relation
first
– Otherwise you need to use memory to perform the join
● Does not scale
● Employee (E) – Sales join (S)
SSESSS → you need to use memory
ESSSSS → memory not needed
59. Left – Right – Inner joins
● Join between Employee and Sales
reducer(key, values):
    employee = null
    first = first(values)
    rest = rest(values)
    if isEmployee(first):
        employee = first
    if employee = null:
        // right join: SSSSS
    else if size(rest) = 0:
        // left join: E
    else:
        // inner join: ESSSSS
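A runnable sketch of this reducer, assuming secondary sorting has already placed the employee record (if any) first. The `("E", …)` / `("S", …)` tagging of records is a hypothetical convention chosen for the example.

```python
from itertools import chain

def join_reducer(key, values):
    """values: ("E", employee) or ("S", sale) records, employee first if present."""
    values = iter(values)
    first = next(values)                    # a reduce group is never empty
    if first[0] == "E":
        employee, sales = first[1], values
    else:
        employee, sales = None, chain([first], values)
    emitted = False
    for _tag, sale in sales:                # sales are streamed, not buffered
        emitted = True
        yield key, employee, sale           # inner join row, or right join row if employee is None
    if not emitted:
        yield key, employee, None           # left join row: employee without sales

inner = list(join_reducer(7, [("E", "Ann"), ("S", 100), ("S", 250)]))
right = list(join_reducer(8, [("S", 40)]))
left = list(join_reducer(9, [("E", "Bob")]))
```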
61. Reconciliation
● Hadoop can be used to “simulate a database”.
● For that, we need to:
– Merge data state s(t) with past state s(t-1) using a Join.
– Update rows with the same ID (performing whatever logic).
– Store the result as the next full data state.
– Rotate states:
● s(t-1) = s(t)
● s(t) = s(t+1)
[Diagram: s(t-1) and s(t) → M/R → s(t+1)]
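One reconciliation step might look like the sketch below. The record layout (a `ts` field) and the `newest_wins` merge rule are assumptions for illustration; the merge logic is whatever the application needs.

```python
def reconcile(previous_state, new_data, merge):
    """Join the previous full state with new records by ID and return the next full state."""
    next_state = dict(previous_state)
    for record_id, record in new_data:
        if record_id in next_state:
            # Same ID on both sides: apply the update logic.
            next_state[record_id] = merge(next_state[record_id], record)
        else:
            next_state[record_id] = record
    return next_state

def newest_wins(old, new):
    return new if new["ts"] >= old["ts"] else old

state = {1: {"ts": 10, "price": 5}, 2: {"ts": 10, "price": 8}}
batch = [(2, {"ts": 11, "price": 9}), (3, {"ts": 11, "price": 4})]
state = reconcile(state, batch, newest_wins)
```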
62. Real-life Hadoop projects
● 95% of real-life Hadoop projects are a mixture of the patterns we
just saw.
● Example: A vertical search-engine.
– Distributed execution: Feed crawl / parse
– Data reconciliation: Merge data by listing ID
– Join: Augment listings with geographical info
– …
● Example: Payments data stats.
– Secondary sort: weekly, daily & monthly stats
– Distributed execution: Random-updates to a DB
● ...
64. Real-life MapReduce
1. Data analytics
2. Crawling
3. Full-text indexing
4. Reputation systems
5. Data Mining
65. Data analytics
● Obvious use case for MapReduce.
● Examples: calculate unique visits per page.
– Top products per month.
– Unique clicks per banner.
– Etc.
● Offline analytics (Batch-oriented).
– Online analytics is not a good fit for MapReduce.
66. Data analytics: How it works
● A batch process that uses all historical data.
– Recompute everything always.
– Easier to manage and maintain than incremental computation.
● A chain of MapReduce steps produces the final output.
● There are tools that ease building / maintaining the MapReduce
chain:
– Hive, Pig, Cascading, Pangool for programming a MapReduce flow
easily.
– Oozie, Azkaban for connecting existing MapReduce jobs easily.
● Scheduling flows and such.
67. Data analytics: Difficulties
● Some things are harder to calculate than others.
● Calculating unique visits per page.
– A simple solution in two MapReduce steps or a more sophisticated one
in a single MapReduce step.
– Approximated methods can be used as well.
● Calculating the median.
– Need to sort the whole dataset and iterate twice if we don’t know the
number of elements.
68. Data analytics: Examples
● Gather clicks on pages.
– Save (click, page, timestamp) in HDFS.
● A MapReduce job groups by page and counts the number of
clicks:
● map: emit(page, click).
● reduce: (page, list<click>) emits (page, totalClicks).
● We now have the total number of clicks per page.
69. Data analytics: Examples (II)
● Another MapReduce job groups by day and page and counts
the number of clicks:
● map: emit((page, day), click).
● reduce: Same as before.
● We now have the total number of clicks per page and day.
● These are simple examples, but data analytics can get as
sophisticated as you want.
– Example: calculate a 10 bar histogram of the distribution of clicks over
the hours of the day for each page.
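The per-page-and-day count can be sketched locally as below. It assumes clicks are stored as `(click_id, page, unix_timestamp)` tuples and that days are computed in UTC; both are choices made for this example.

```python
from collections import Counter
from datetime import datetime, timezone

def clicks_per_page_and_day(clicks):
    """clicks: iterable of (click_id, page, unix_timestamp)."""
    counts = Counter()
    for _click_id, page, ts in clicks:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()
        # map emits ((page, day), click); reduce counts the clicks per key
        counts[(page, day)] += 1
    return dict(counts)

clicks = [("c1", "/home", 0), ("c2", "/home", 3600),
          ("c3", "/home", 86400), ("c4", "/about", 10)]
per_day = clicks_per_page_and_day(clicks)
```

Dropping `day` from the key gives the total clicks per page from the previous example.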
71. Crawling: Web Crawling
● Web Crawling:
– “A Web crawler is a computer program that browses the World Wide
Web in a methodical, automated manner or in an orderly fashion.”
● Applications:
– Search engines.
– NLP (Sentiment analysis).
● Examples:
73. Crawling: Web Crawling (at scale)
● How to parallelize storage and bandwidth?
● How to deduplicate stored URLs?
● Other complexities: politeness, infinite loops, robots.txt,
canonical URLs, pagination, parsing, …
● Relevancy: PageRank.
74. Crawling: Nutch
● What is Nutch?
– Open source web-search software project.
– Apache project.
– Hadoop, Tika, Lucene, SOLR.
● (Brief) history
– Started in 2002/2003
– 2005: MapReduce
– 2006: Hadoop
– 2006/2007: Tika
– 2010: Apache top-level project (TLP)
75. Crawling: Nutch: How it works
● “Select, Crawl, Parse, Dedup by URL” loop.
● Lucene, SOLR for indexing.
– We will see them later.
● CrawlDB: Pages are saved in HDFS.
● MapReduce makes storage and computing scalable.
– Helps in deduplicating pages by URL.
– Helps in identifying new pages to crawl.
76. Crawling: Not-Only Web Crawling
● Custom crawlers:
– Tweets.
– XML feeds.
● Simpler, as we usually don’t need to traverse a tree.
– Sometimes only crawling a fixed seed of resources is enough.
● Applications
– Vertical search engines.
– Reputation systems.
77. Crawling: Example: Crawling tweets at scale
● Use a scalable computing engine for fetching tweets.
– Storm is a good fit.
– Hadoop can be used as well.
● Tricky usage of MapReduce: Create as many groups as crawlers
and embed a Crawler in them.
● Save raw feed data (JSON) in HDFS.
● MapReduce: Parse JSON tweets.
● MapReduce: Deduplicate tweets.
● MapReduce: Analyze tweets and perform data analysis.
78. Crawling: Example: Crawling tweets at scale
[Diagram: HDFS → M/R Parse → M/R Dedup → M/R Analysis → Results]
80. Full-text indexing: Definitions
● Search engine:
– An information retrieval system designed to help find information
stored on a computer system.
● Inverted index:
– Index data structure storing a mapping from content, such as words or
numbers, to its locations in a database file, or in a document or a set of
documents.
– B-Trees are not inverted indexes!
● Stemming.
● Relevancy in results.
81. Full-text indexing: Applications
● Web search engines
– Finding relevant pages for a topic
● Vertical search engines
– Finding jobs by description
● Social networks
– Finding messages by text
● e-Commerce
– Finding articles by description
● In general, any service or application needing efficient text
information retrieval
82. Full-text indexing (at scale)
● Real-time indexing versus batch-indexing:
– The first is cool: it is real-time, but it is difficult. We will not address it
now.
– The second is not real-time, but it is simpler.
● How to batch-index a big corpus dataset?
– Need scalable storage (HDFS).
● How to deduplicate documents?
– MapReduce to the rescue (like we saw before).
● How to generate multiple indexes?
– MapReduce can help (we will see how).
83. Full-text indexing: MapReduce
● MapReduce can be used to generate an inverted index.
– Vertical partitioning vs. horizontal partitioning.
● Example:
– Map: emit(word, docId)
– Reduce: emit(word, list<docIds>)
● Quite simple. But what about stop words, stemming, etc?
● How to store the index?
● Better not to reinvent the wheel.
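For intuition only, the map and reduce steps of the example can be sketched in memory as follows. Whitespace tokenization and the optional `stop_words` parameter are simplifications assumed here; a real index would rely on a library such as Lucene for analysis and storage.

```python
from collections import defaultdict

def build_inverted_index(documents, stop_words=frozenset()):
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():       # map: emit(word, docId)
            if word not in stop_words:
                index[word].add(doc_id)
    # reduce: emit(word, list<docIds>)
    return {word: sorted(ids) for word, ids in index.items()}

docs = {1: "big data is big", 2: "data mining"}
index = build_inverted_index(docs, stop_words={"is"})
```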
84. Full-text indexing: Lucene / SOLR
● Lucene: Doug Cutting’s
– From Nutch.
– Mainstream open-source implementation of an inverted index.
– Efficient disk allocation, highly performant.
● SOLR: Mainstream open-source search server.
– Provides stemming, analyzers, HTTP servlets, etc.
– Lacks some other desirable properties:
● Elasticity, real-time indexing, horizontal partitioning (although work
in progress).
● Still the reference technology for creating and serving inverted
indexes.
85. Full-text indexing: MapReduce meets SOLR
● We can use MapReduce for scaling the indexing process.
● At the same time, we can use SOLR for creating the resulting
index.
– SOLR is used as-a-library.
● Generated indexes are later deployed to the search servers.
86. Full-text indexing: Example
● A vertical job search engine.
● Jobs are parsed from crawled feeds and saved in HDFS.
● MapReduce for deduplicating job offers.
– map: emit(jobId, job)
– reduce (jobId, list<job>) -> emit (jobId, job)
● Retention policy: keep latest job.
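The deduplication step with a “keep latest” policy can be sketched like this. The `id` and `updated` fields are a hypothetical schema for a job offer, used only for the example.

```python
def dedup_latest(jobs):
    """jobs: iterable of dicts with 'id' and 'updated' fields (hypothetical schema)."""
    latest = {}
    for job in jobs:                                  # map: emit(jobId, job)
        current = latest.get(job["id"])
        # reduce: among all offers with the same jobId, keep the most recent one
        if current is None or job["updated"] > current["updated"]:
            latest[job["id"]] = job
    return latest

offers = [
    {"id": 1, "updated": 1, "title": "a"},
    {"id": 1, "updated": 3, "title": "b"},
    {"id": 2, "updated": 2, "title": "c"},
    {"id": 1, "updated": 2, "title": "d"},
]
unique_offers = dedup_latest(offers)
```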
87. Full-text indexing: Example (II)
● MapReduce for augmenting job information (adding
geographical information).
– map1: emit(job.city, job)
– map2: emit(city, geoInfo)
– reduce: (city, geoInfo and list<job>) -> for each job, emit(job, geoInfo)
● MapReduce for distributing the index process:
– map: emit(job.country, job)
– reduce: (job.country, list<job>) -> Create index for country “job.country”
using SOLR.
● Deploy per-country indexes to search cluster.
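The two jobs above can be sketched as a reduce-side join followed by a group-by-country step. This is a single-process illustration under assumptions: the record fields (`id`, `city`, `country`, `lat`, `lon`) are hypothetical, the "job"/"geo" tags are one common way to tell the two inputs apart in the reducer, and a plain list of documents per country stands in for the index that SOLR would build.

```python
from collections import defaultdict


def map_job(job):
    # map1: key each job by its city and tag the value with its origin.
    yield job["city"], ("job", job)


def map_geo(city, geo_info):
    # map2: key each geo record by city, tagged as well.
    yield city, ("geo", geo_info)


def reduce_join(city, tagged_values):
    # Separate the two sides of the join, then attach geo info to every job.
    geo = None
    jobs = []
    for tag, value in tagged_values:
        if tag == "geo":
            geo = value
        else:
            jobs.append(value)
    for job in jobs:
        yield job, geo  # geo stays None when the city is unknown


def augment(jobs, geo_table):
    groups = defaultdict(list)  # stands in for the shuffle phase
    for job in jobs:
        for key, value in map_job(job):
            groups[key].append(value)
    for city, info in geo_table.items():
        for key, value in map_geo(city, info):
            groups[key].append(value)
    result = []
    for city in sorted(groups):
        result.extend(reduce_join(city, groups[city]))
    return result


def group_by_country(augmented):
    # Second job: one group (and hence one index) per country.
    per_country = defaultdict(list)
    for job, geo in augmented:
        per_country[job["country"]].append({**job, "geo": geo})
    return dict(per_country)


jobs = [
    {"id": "j1", "city": "Madrid", "country": "ES"},
    {"id": "j2", "city": "Paris", "country": "FR"},
    {"id": "j3", "city": "Madrid", "country": "ES"},
]
geo_table = {
    "Madrid": {"lat": 40.4, "lon": -3.7},
    "Paris": {"lat": 48.9, "lon": 2.4},
}
indexes = group_by_country(augment(jobs, geo_table))
```

In a real deployment each reducer of the second job would write its documents into an index for its country; that step is deliberately left out here.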
88. Full-text indexing: Example
[Figure: XML feeds → M/R Parse → HDFS → M/R Dedup → M/R Geo info (using geo info) → M/R Index → Indexes → Deploy → Search Cluster]
89. Real-life MapReduce
1. Data analytics
2. Crawling
3. Full-text indexing
4. Reputation systems
5. Data Mining
90. Reputation: Definitions
● What is reputation?
● Reputation in social communities.
– eBay, StackOverflow...
● Reputation in social media.
– Twitter, Facebook...
● Why is it important?
91. Reputation: Relationships
● Modelling relationships is needed for calculating reputation
● Graph-like models arise
● Relationships are usually stored as edges
– A interacts with B
– or, A trusts B
– or, A → B
● Internet-scale graphs can be stored in HDFS
– One edge per row
– Add needed metadata to edges: date, etc.
92. Reputation: MapReduce analysis on edges
● All people whom “A” interacted with.
– map: emit(a, b)
– reduce: (a, list<b>)
● Algorithms like PageRank can be implemented easily on top of this.
● PageRank is a measure of page relevance computed from page inlinks.
– But it can be extrapolated to any kind of authority and trustiness metric.
– E.g. people's relevance computed from social networks.
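As a rough illustration of how PageRank fits the map/reduce shape, the sketch below runs it in a single Python process. The damping factor of 0.85, the fixed number of iterations and the even redistribution of rank from vertices without outlinks are common textbook choices assumed here, not details taken from the slides.

```python
from collections import defaultdict


def adjacency(edges):
    # map: emit(a, b); reduce: collect (a, list<b>).
    links = defaultdict(list)
    for a, b in edges:
        links[a].append(b)
    return links


def pagerank_step(links, ranks, damping=0.85):
    n = len(ranks)
    contributions = defaultdict(float)
    dangling = 0.0
    # map: every vertex sends rank / outdegree to each vertex it points to.
    for vertex, rank in ranks.items():
        targets = links.get(vertex, [])
        if targets:
            for target in targets:
                contributions[target] += rank / len(targets)
        else:
            dangling += rank  # no outlinks: spread this rank evenly
    # reduce: sum the contributions received by each vertex.
    return {
        v: (1 - damping) / n + damping * (contributions[v] + dangling / n)
        for v in ranks
    }


def pagerank(edges, iterations=50):
    links = adjacency(edges)
    vertices = sorted({v for edge in edges for v in edge})
    ranks = {v: 1.0 / len(vertices) for v in vertices}
    for _ in range(iterations):  # one MapReduce job per iteration
        ranks = pagerank_step(links, ranks)
    return ranks


# A, C and D all point to B; B points back to A.
edges = [("A", "B"), ("C", "B"), ("D", "B"), ("B", "A")]
ranks = pagerank(edges)
```

With this toy graph, B ends up with the highest rank because it receives the most inlinks, and the ranks still sum to 1.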
93. Reputation: Going deeper on graphs
● Friends of friends of friends.
– 1 MapReduce step: My friends.
– 2 MapReduce steps: Friends of my friends.
– 3 MapReduce steps: Friends of friends of my friends.
● Iterative MapReduce solves it.
● But there are better foundational models such as Google’s
Pregel.
– Exploiting data locality in graphs.
– Apache Giraph.
– Apache Hama.
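The iterative approach can be pictured as one self-join per step: pairs (person, reached) are joined with the edge list on `reached`. The following is a minimal in-memory sketch of that idea, with made-up names; it is not how Pregel, Giraph or Hama work internally.

```python
from collections import defaultdict


def expand(frontier, edges):
    # One MapReduce step: join (person, reached) with edges (reached, next).
    by_source = defaultdict(set)
    for a, b in edges:
        by_source[a].add(b)
    result = set()
    for person, reached in frontier:
        for nxt in by_source.get(reached, ()):
            if nxt != person:  # do not report a person as their own friend
                result.add((person, nxt))
    return result


def friends_at_depth(edges, depth):
    frontier = set(edges)  # depth 1: my friends
    for _ in range(depth - 1):
        frontier = expand(frontier, edges)
    return frontier


edges = [("ana", "bob"), ("bob", "cai"), ("cai", "dan")]
depth2 = friends_at_depth(edges, 2)
depth3 = friends_at_depth(edges, 3)
```

Each extra level costs one more pass over the whole edge list, which is why models that keep the graph partitioned across iterations are attractive.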
94. Reputation: Difficulties
● Sometimes multiple MapReduce steps are needed for
calculating a final metric.
– Because data doesn’t fit in memory.
● Intermediate relations need to be calculated.
– And later filtered out.
● “Polynomial effect”: calculating all pairwise relations in a set of N elements produces N*(N-1)/2 pairs, which grows quadratically.
– Possible bottleneck.
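A quick calculation shows how fast the number of pairs grows. The snippet below checks the N*(N-1)/2 formula against an explicit enumeration for a tiny set and then evaluates it for a group of one million elements (an arbitrary size chosen for illustration).

```python
from itertools import combinations


def pair_count(n):
    # Number of unordered pairs in a set of n elements.
    return n * (n - 1) // 2


members = ["a", "b", "c", "d"]
pairs = list(combinations(members, 2))  # explicit enumeration for a tiny set

small = pair_count(len(members))
large = pair_count(1_000_000)
```

A group of four elements yields 6 pairs, while a group of one million yields 499,999,500,000, so a single large group can dominate the whole job.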
95. Reputation: Difficulties: Data imbalance
● When grouping by something, some groups may be much
bigger than others.
– Causing “data imbalance”.
● Data imbalance in MapReduce is a big problem.
– Some machines will finish quickly while one will be busy for hours.
● Inefficient usage of resources.
● Data processing doesn’t scale linearly anymore.
– The next MapReduce step can’t start until the current one has finished.
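The effect of a skewed key can be simulated by counting how many records each reducer receives. The sketch assumes a simple hash partitioner (CRC32 modulo the number of reducers, chosen here only because it is deterministic), four reducers and a made-up dataset where one key accounts for 90% of the records.

```python
import zlib
from collections import Counter


def partition(key, num_reducers):
    # Deterministic hash partitioner: the same key always goes to the same reducer.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducer_loads(keys, num_reducers):
    loads = Counter({r: 0 for r in range(num_reducers)})
    for key in keys:
        loads[partition(key, num_reducers)] += 1
    return loads


# One "celebrity" key holds 9,000 of the 10,000 records.
keys = ["celebrity"] * 9_000 + [f"user{i}" for i in range(1_000)]
loads = reducer_loads(keys, 4)

slowest = max(loads.values())  # the job finishes when this reducer does
ideal = len(keys) / 4          # load per reducer with perfect balance
```

Whatever the hash values turn out to be, the reducer that receives the hot key processes at least 9,000 records against an ideal of 2,500, so adding reducers does not shorten the job.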
96. Reputation: Example
● Input: Tweets.
● Distributed crawling for fetching the tweets.
– Save them in the HDFS.
● Parse the tweets. Define the graph of trustiness.
– A trusts B if A follows B.
● Execute PageRank over the graph.
– Spreads trustiness to all vertices.
97. Reputation: Example
[Figure: HDFS → M/R Parse → graph of trustiness (vertices A, B, C, D) → M/R PageRank → Results]
98. Real-life MapReduce
1. Data analytics
2. Crawling
3. Full-text indexing
4. Reputation systems
5. Data Mining
99. Data mining: Text classification
● Document classification
– Documents are texts in this case.
● Assigns one or more categories to a text.
– Binary classifiers versus multi-category classifiers
– Multi-category classifiers can be built from multiple binary classifiers
● Two steps: generating the model and classifying.
100. Data mining: Text classification: Steps
● Generating the model
– The resulting model may or may not fit in memory.
● Let’s assume the final model fits in memory.
– Use a large dataset for generating the model.
● MapReduce helps scale the model generation process.
● Example: Build multiple binary classifiers -> parallelize by classifier.
● Example: Calculate conditional probabilities of a Bayesian model. Parallelize by word (as in the WordCount example).
● Classifying
– MapReduce also helps in classifying a large dataset.
● If the model fits in memory, parallelize the documents to classify and load the model in memory.
– Batch classification: parallelize the documents to classify. The output is the set of documents with their assigned categories.
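The Bayesian example can be sketched as a WordCount keyed by (category, word), followed by a classifier that loads the resulting counts in memory. This is a simplified multinomial Naive Bayes with Laplace smoothing, run in one process; the tokenization, the smoothing choice and the toy training set are assumptions made for illustration.

```python
import math
from collections import Counter


def map_doc(category, text):
    # Like WordCount, but the key is (category, word).
    for word in text.lower().split():
        yield (category, word), 1


def train(labelled_docs):
    word_counts = Counter()  # reduce: sum of the ones per (category, word)
    doc_counts = Counter()
    for category, text in labelled_docs:
        doc_counts[category] += 1
        for key, one in map_doc(category, text):
            word_counts[key] += one
    totals = Counter()
    for (category, _word), n in word_counts.items():
        totals[category] += n
    vocab = {word for (_category, word) in word_counts}
    return {"word_counts": word_counts, "doc_counts": doc_counts,
            "totals": totals, "vocab": vocab}


def classify(model, text):
    n_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best, best_score = None, -math.inf
    for category in sorted(model["doc_counts"]):
        score = math.log(model["doc_counts"][category] / n_docs)
        for word in text.lower().split():
            count = model["word_counts"][(category, word)]
            # Laplace smoothing keeps unseen words from zeroing the score.
            score += math.log((count + 1) / (model["totals"][category] + vocab_size))
        if score > best_score:
            best, best_score = category, score
    return best


training = [
    ("spam", "cheap pills buy now"),
    ("spam", "buy cheap watches"),
    ("ham", "meeting agenda for monday"),
    ("ham", "lunch on monday"),
]
model = train(training)
# Batch classification: each document is independent, so this loop parallelizes.
labels = [classify(model, doc) for doc in ["buy cheap pills", "monday meeting"]]
```

On a cluster, `train` would be the MapReduce job and `classify` would run inside the mappers of a second job, each holding a copy of the model.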
101. Data mining: Others
● Mahout library for other data mining problems.
– Clustering, logistic regression, etc.
● Recommendation algorithms.
– Many recommendation algorithms are based on calculating
correlations.
– Calculating correlations in parallel with MapReduce is easy.
● Remember: always in the “batch” or “offline” domain.
– Recommendations are reloaded after the batch process finishes.
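As a simple stand-in for a correlation measure, the sketch below counts how often two items appear together in the same user's history, which has the same map/reduce shape as WordCount. Item co-occurrence is used here only as an illustrative choice; it is not claimed to be the algorithm used by Mahout or by the authors, and the data is invented.

```python
from collections import Counter
from itertools import combinations


def map_user(items):
    # Emit every unordered pair of distinct items seen by one user.
    for a, b in combinations(sorted(set(items)), 2):
        yield (a, b), 1


def cooccurrences(histories):
    counts = Counter()  # reduce: sum of the ones per item pair
    for items in histories.values():
        for pair, one in map_user(items):
            counts[pair] += one
    return counts


def recommend(counts, item):
    # Rank the other items by how often they co-occur with `item`.
    scores = Counter()
    for (a, b), n in counts.items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    return [other for other, _ in scores.most_common()]


histories = {
    "u1": ["book", "lamp", "pen"],
    "u2": ["book", "pen"],
    "u3": ["lamp", "mug"],
}
counts = cooccurrences(histories)
suggestions = recommend(counts, "book")
```

The counts are computed offline in batch; a serving layer would only look up the precomputed list for each item.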
102. Tuple MapReduce
Pere Ferrera, Ivan de Prado, Eric Palacios, Jose Luis Fernandez-Marquez, Giovanna Di Marzo Serugendo: Tuple MapReduce: Beyond classic MapReduce. In ICDM 2012: Proceedings of the IEEE International Conference on Data Mining (to appear) (10.7% acceptance rate)
103. Common MapReduce problems
● Lack of compound records
– By default, key & value are considered monolithic entities.
● In real life, this case is rare.
– Alleviated by some serialization libraries (Thrift, Protocol Buffers)
● Sorting within a group
– The MapReduce foundation does not support it
● Although MapReduce implementations overcome this problem with “tricks”
● Joins
– Need compound records and sorting within a group to be implemented
– Not directly supported by MapReduce
104. Tuple MapReduce: rationale
● Compound records, sorting within a group and joins are design patterns that arise in most MapReduce applications...
● … but MapReduce does not make their implementation easy
● An evolution of the MapReduce paradigm is needed to cover these design patterns:
Tuple MapReduce
105. Tuple MapReduce
● We extended the classic (key, value) MapReduce model
● Use n-sized Tuples instead of (key, value)
● Define a Tuple-based M/R
– Covering most common use cases
106. Group by / Sort by
● You can think of a M/R as a SELECT … GROUP BY …
● With Tuple MapReduce, you simply “group by” a subset of
Tuple fields
– Easier, more intuitive than having objects for each kind of Key you
want to group by.
● Alternatively, you may “sort by” a wider subset
– Hiding all complex logic behind secondary sort
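The group-by / sort-by idea can be illustrated with a small in-memory helper. This is a sketch of the concept only, not the Pangool API: the function `tuple_mapreduce`, the dictionary-based tuples and the rule that the sort fields must start with the group fields are assumptions made for this example.

```python
from itertools import groupby


def key_fn(fields):
    # Build a key extractor over a subset of tuple fields.
    return lambda t: tuple(t[f] for f in fields)


def tuple_mapreduce(tuples, group_by, sort_by, reducer):
    if list(sort_by[:len(group_by)]) != list(group_by):
        raise ValueError("sort_by must start with the group_by fields")
    # Sorting by the wider subset gives secondary sort inside each group.
    ordered = sorted(tuples, key=key_fn(sort_by))
    out = []
    for group, members in groupby(ordered, key=key_fn(group_by)):
        out.extend(reducer(group, list(members)))
    return out


def first_and_last(group, members):
    # Members arrive ordered by timestamp within each user.
    yield (group[0], members[0]["url"], members[-1]["url"])


visits = [
    {"user": "ana", "ts": 3, "url": "/c"},
    {"user": "bob", "ts": 1, "url": "/x"},
    {"user": "ana", "ts": 1, "url": "/a"},
    {"user": "ana", "ts": 2, "url": "/b"},
]
result = tuple_mapreduce(visits, ["user"], ["user", "ts"], first_and_last)
```

The caller only names fields; no custom key class, comparator or partitioner has to be written to get each user's visits in time order.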
107. Tuple-Join MapReduce
● Extend the whole idea to allow for easier joins
Tuple1: (a,b,c,d) Tuple2: (a,b,f,g,h)
Join by (a,b)
● Formally speaking:
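An informal sketch of the join by (a, b), not the formal definition: tuples from both sources that share the same values for the join fields meet in one group, and the reducer combines them. The helper `tuple_join`, the inner-join behaviour and the sample values are assumptions made for illustration.

```python
from collections import defaultdict


def tuple_join(left, right, join_fields):
    key = lambda t: tuple(t[f] for f in join_fields)
    # Each group keeps the tuples of both sources side by side.
    groups = defaultdict(lambda: ([], []))
    for t in left:
        groups[key(t)][0].append(t)
    for t in right:
        groups[key(t)][1].append(t)
    joined = []
    for k in sorted(groups):
        lefts, rights = groups[k]
        for l in lefts:
            for r in rights:  # inner join: needs a match on both sides
                joined.append({**l, **r})
    return joined


tuple1 = [
    {"a": 1, "b": 1, "c": "c1", "d": "d1"},
    {"a": 2, "b": 2, "c": "c2", "d": "d2"},
]
tuple2 = [
    {"a": 1, "b": 1, "f": "f1", "g": "g1", "h": "h1"},
    {"a": 3, "b": 3, "f": "f3", "g": "g3", "h": "h3"},
]
joined = tuple_join(tuple1, tuple2, ["a", "b"])
```

Only the (1, 1) key appears in both sources, so the result contains a single tuple carrying the fields of both.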
108. Pangool
http://pangool.net
109. Pangool: What?
● A better, simpler, yet powerful replacement for Hadoop's API
● What do we mean by API replacement?
– There are APIs on top of Hadoop: Pig, Hive, Cascading.
– Using them always comes with a tradeoff.
– They offer paradigms other than MapReduce, which are not always the best choice.
– They impose performance restrictions.
● Pangool is still MapReduce, low-level and high performing
– Yet a lot simpler!
110. Pangool: Why?
● Hadoop has a steep learning curve
● Default API is too low-level
● Making things efficient is hard (binary comparisons, ser/de...)
● There are some common patterns (joins, secondary sorting...)
● How can we make these common patterns simpler without losing flexibility and power?
111. Pangool API
● Schema, Tuple, …
● Reducers, Mappers, etc are instances instead of static classes
– Easier to configure them: new MyReducer(5, 2.0);
● Still tied to Hadoop's particularities in some ways
– NullWritable, etc
● Let's see an example
Editor's Notes
● The Guardian Innovation Awards.
● Admittedly, Swiss Army knives are useful… Who hasn't needed a magnifying glass in an emergency! Hadoop is like a Swiss Army knife: very useful, but you sweat blood before you manage to pull out the accessory you want.
● Distributed: it harnesses the power of several machines in a cluster.
● Large data sets: Hadoop is not appropriate for small data sets.
● Simple Programming Model: Hadoop is not just a framework, it is a new paradigm for distributed programming.
● Hadoop rests mainly on two modules: a distributed file system, for storing large volumes of data, and a new programming paradigm, MapReduce. Let's look at them one by one.