Tuple MapReduce: Beyond classic MapReduce
Pedro Ferrera, Ivan de Prado, Eric Palacios
DataSalt
Barcelona, SPAIN
pere,ivan,epalacios@datasalt.com

Jose Luis Fernandez-Marquez, Giovanna Di Marzo Serugendo
University of Geneva, CUI
Geneva, SWITZERLAND
joseluis.fernandez@unige.ch

Outline

● Introduction
● Related Work
● Classic MapReduce
– The problems of MapReduce
● Tuple MapReduce
– The basic Tuple MapReduce
– Joins
– Generalization of MapReduce
● Pangool
● Conclusions and Future work

Introduction
● Huge amounts of information → need for new processing technologies.
● MapReduce → major contribution ...
– … but involves a steep learning curve.
● Most design patterns found in real-world problems are not well covered.
● We propose Tuple MapReduce as a better foundation model.
● Tuple MapReduce on Hadoop → Pangool
– No key architectural changes needed.

Related work

● MapReduce: Google paper from 2004
● Hadoop
● Higher level tools
– Sawzall, FlumeJava, Pig, Hive, Jaql, Cascading
● Higher-level abstractions are very popular
– This supports the idea that MapReduce is too low-level a paradigm
● Merge MapReduce
– Targets the problem of relational operations (joins)
– Requires changes to the architecture and a new merge step

Classic MapReduce

● Jobs
– Input file, output file
– Developer provides two functions: map and reduce

● Distributed execution of work
– First, the map function runs in the map phase
– Then the reduce function runs in the reduce phase
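
The two phases can be sketched as a tiny in-memory simulation. This is only an illustration of the model, not Hadoop code: the names map_reduce, count_words_map and count_words_reduce are hypothetical, and the word-count job is chosen just as a familiar example.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Run the map phase, group values by key, then run the reduce phase."""
    groups = defaultdict(list)
    for record in records:                  # map phase
        for key, value in map_fn(record):
            groups[key].append(value)       # "shuffle": collect values per key
    output = []
    for key in sorted(groups):              # reduce phase, one call per key
        output.extend(reduce_fn(key, groups[key]))
    return output

def count_words_map(line):
    for word in line.split():
        yield word, 1

def count_words_reduce(word, counts):
    yield word, sum(counts)

result = map_reduce(["a b a", "b c"], count_words_map, count_words_reduce)
print(result)
```

Note that the values of each key arrive at the reducer in no particular order; only the keys are sorted.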

The problems of MapReduce

● Compound records
– Real-world problems involve multi-field records. They don’t fit well into the key/value schema
● Sorting
– No inherent sorting of the records within a reduce group.
– Implementations (Hadoop) resort to the “secondary sort” trick
● Join
– A quite common operation
– Not directly possible in MapReduce without using “tricks”:
● secondary sorting
● compound records

Tuple MapReduce

● Idea: replace key/value by tuples
● group-by and sort-by clauses
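
A minimal in-memory sketch of this idea, assuming tuples are represented as Python dicts and the clauses as lists of field names. The function tuple_map_reduce and the shop/price/item data are hypothetical and are not part of Pangool.

```python
from itertools import groupby

def tuple_map_reduce(records, map_fn, reduce_fn, group_by, sort_by):
    """Map records to tuples, sort them by sort_by, reduce each group_by group."""
    tuples = [t for record in records for t in map_fn(record)]
    tuples.sort(key=lambda t: tuple(t[f] for f in sort_by))
    output = []
    # groupby only merges adjacent items, so group_by must be a prefix of sort_by
    for key, group in groupby(tuples, key=lambda t: tuple(t[f] for f in group_by)):
        output.extend(reduce_fn(key, list(group)))
    return output

def to_tuple(record):
    shop, price, item = record
    yield {"shop": shop, "price": price, "item": item}

def cheapest_item(key, tuples):
    # Tuples arrive sorted by price within the shop, so the first is the cheapest
    yield key[0], tuples[0]["item"]

sales = [("s1", 5, "pen"), ("s2", 3, "cup"), ("s1", 2, "ink")]
cheapest = tuple_map_reduce(sales, to_tuple, cheapest_item,
                            group_by=["shop"], sort_by=["shop", "price"])
print(cheapest)
```

The reducer does not need to sort or buffer anything itself: the order inside each group is declared once in the sort-by clause.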

Tuple MapReduce (II)
● group-by and sort-by constraint
– group-by as a prefix of sort-by
– Needed to implement Tuple MapReduce on top of a MapReduce architecture
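
One way to see why the prefix constraint matters: after sorting, a reducer can only process a group in one pass if the tuples of that group are adjacent. The sketch below counts the contiguous runs of the group-by key after sorting; count_runs and the sample rows are hypothetical names and data used for illustration only.

```python
from itertools import groupby

def count_runs(tuples, group_by, sort_by):
    """Sort by sort_by, then count contiguous runs that share the group_by fields."""
    ordered = sorted(tuples, key=lambda t: tuple(t[f] for f in sort_by))
    return sum(1 for _ in groupby(ordered, key=lambda t: tuple(t[f] for f in group_by)))

rows = [
    {"url": "a", "date": 2},
    {"url": "b", "date": 1},
    {"url": "a", "date": 1},
]

# group-by is a prefix of sort-by: each URL forms exactly one run
runs_prefix = count_runs(rows, ["url"], ["url", "date"])
# group-by is not a prefix: the tuples of URL "a" are split into two runs
runs_no_prefix = count_runs(rows, ["url"], ["date", "url"])
print(runs_prefix, runs_no_prefix)
```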

● Contrary to MapReduce, Tuple MapReduce:
– provides compound records → tuple
– provides intra-reduce sorting

Example: cumulative visits

● Cumulative # of visits up to each single date
– Input → URL, date, visits
– Expected output → URL, date, cumulative visits
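
A sketch of this example in plain Python, assuming group-by URL and sort-by (URL, date). The sample rows are made up for illustration, and dates are assumed to be ISO strings so that they sort chronologically.

```python
from itertools import groupby

# Hypothetical input tuples: (url, date, visits)
visits = [
    ("a.com", "2012-01-02", 3),
    ("b.com", "2012-01-01", 7),
    ("a.com", "2012-01-01", 2),
    ("a.com", "2012-01-03", 1),
]

def cumulative_visits(rows):
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))        # sort-by: url, date
    output = []
    for url, group in groupby(ordered, key=lambda r: r[0]):   # group-by: url
        total = 0
        for _, date, count in group:   # dates arrive in order, so a running sum is enough
            total += count
            output.append((url, date, total))
    return output

for row in cumulative_visits(visits):
    print(row)
```

Because the dates reach the reducer already sorted, it keeps a single counter per URL instead of buffering all the records of the group.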

Join-Tuple MapReduce

● Joins among heterogeneous datasets
– Tuples associated with a source-id.
● Tuples reach the reducer sorted by source-id
– enabling memoryless reduce joins
– and grouped by some common fields

Example: join between clients and payments
● clients inner join payments
– Result columns: name, client_id, payment_id, amount
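
A sketch of such a reduce-side join in plain Python. The input schemas are assumptions, since the slide only shows the result columns: clients is taken to be (client_id, name) with unique client_id, and payments to be (payment_id, client_id, amount). The source-id constants and sample rows are hypothetical.

```python
from itertools import groupby

CLIENTS, PAYMENTS = 0, 1   # source-ids: clients sort before payments

clients = [(1, "Ann"), (2, "Bob"), (3, "Eve")]
payments = [(10, 1, 25.0), (11, 2, 40.0), (12, 1, 5.0)]

def inner_join(clients, payments):
    # Tag every tuple with its source-id and put the join field first
    tagged = [(cid, CLIENTS, name) for cid, name in clients]
    tagged += [(cid, PAYMENTS, (pid, amount)) for pid, cid, amount in payments]
    tagged.sort(key=lambda t: (t[0], t[1]))   # sort-by: client_id, source-id
    output = []
    for cid, group in groupby(tagged, key=lambda t: t[0]):   # group-by: client_id
        name = None
        for _, source, payload in group:
            if source == CLIENTS:
                name = payload                 # the client tuple arrives first
            elif name is not None:             # payments without a client are dropped
                pid, amount = payload
                output.append((name, cid, pid, amount))
    return output

for row in inner_join(clients, payments):
    print(row)
```

Since the client tuple reaches the reducer before its payments, the reducer holds one client name at a time and streams the payments, which is the sense in which the join needs no buffering.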

Generalization of MapReduce

● MapReduce is a Tuple MapReduce with...
– tuples of two values and
– group-by and sort-by set to the first value
● The opposite is also possible → implementing Tuple MapReduce on top of existing MapReduce implementations.
– Architectural changes are not needed.
– Pangool is a proof of that.
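
A rough sketch of how the opposite direction can work on a key/value engine, following the usual secondary-sort pattern: the sort-by fields form a composite key, partitioning uses only the group-by prefix, and the reducer groups on that prefix. This is an in-memory illustration under those assumptions, not Pangool's actual code; emulate_on_key_value and num_reducers are hypothetical names.

```python
from itertools import groupby

def emulate_on_key_value(tuples, group_by, sort_by, num_reducers=2):
    """Emulate group-by/sort-by using only composite keys, partitions and sorting."""
    n = len(group_by)
    if list(sort_by[:n]) != list(group_by):
        raise ValueError("group_by must be a prefix of sort_by")
    # Composite key = the sort-by fields in order; value = the whole tuple
    pairs = [(tuple(t[f] for f in sort_by), t) for t in tuples]
    # Partitioner: hash only the group-by prefix so a group stays on one reducer
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[hash(key[:n]) % num_reducers].append((key, value))
    groups = []
    for part in partitions:
        part.sort(key=lambda kv: kv[0])                        # sort by full composite key
        for prefix, group in groupby(part, key=lambda kv: kv[0][:n]):
            groups.append((prefix, [value for _, value in group]))
    return groups

rows = [
    {"url": "a", "date": 2},
    {"url": "b", "date": 1},
    {"url": "a", "date": 1},
]
groups = dict(emulate_on_key_value(rows, ["url"], ["url", "date"]))
print(groups)
```

Which partition a group lands in depends on the hash function, but every group ends up whole on a single reducer with its tuples in sort-by order.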

Pangool (pangool.net)

● Tuple MapReduce implementation on top of Hadoop.
– On top of an existing MapReduce implementation.
● It is just a library. No architecture change was needed.
● Used in real-world applications
– Banking
– Searching
– Social networks

Pangool benchmark – secondary sort

Pangool benchmark – join

Pangool performance

● Only 5% to 8% slower than Hadoop
– Pretty good considering that Pangool is built on top of the Hadoop API
● The difference would probably disappear with a native implementation
● Much better than higher-level APIs
– Probably because Pangool is a low-level API

Conclusions and Future work

● The MapReduce key/value model has been shown to be too strict.
● Tuple MapReduce keeps the MapReduce features
– Enhancing them with
● compound records,
● joins and
● intra-reduce sorting.
● Pangool is a proof of its viability,
– including on existing implementations like Hadoop, without changing the architecture
● Future work would involve abstractions for flow creation
– Simplifying job chaining and data flow.

Thanks!

● Any questions or doubts?

– ivan@datasalt.com
– @ivanprado
