Tuple MapReduce is a new
foundational model that extends MapReduce with the notion of
tuples. It bridges the gap between the low-level constructs
provided by MapReduce and the higher-level needs of
programmers, such as compound records, sorting, and joins.
This paper also presents Pangool, an open-source framework
implementing Tuple MapReduce. Pangool eases the design and
implementation of applications based on MapReduce and
increases their flexibility while maintaining
Hadoop’s performance.
1. Tuple MapReduce: Beyond classic MapReduce
Pedro Ferrera, Ivan de Prado, Eric Palacios
DataSalt
Barcelona, SPAIN
pere,ivan,epalacios@datasalt.com
Jose Luis Fernandez-Marquez, Giovanna Di Marzo Serugendo
University of Geneva, CUI
Geneva, SWITZERLAND
joseluis.fernandez@unige.ch
2. Outline
● Introduction
● Related Work
● Classic MapReduce
– The problems of MapReduce
● Tuple MapReduce
– The basic Tuple MapReduce
– Joins
– Generalization of MapReduce
● Pangool
● Conclusions and Future work
3. Introduction
● A huge amount of information → need for new processing
technologies.
● MapReduce → major contribution ...
– … but involves a sharp learning curve.
● Most design patterns found in real-world problems are not
well covered.
● We propose Tuple MapReduce as a better foundation model.
● Tuple MapReduce on Hadoop → Pangool
– No key architectural changes needed.
4. Related work
● MapReduce: Google paper in 2004
● Hadoop
● Higher level tools
– Sawzall, FlumeJava, Pig, Hive, Jaql, Cascading
● Higher-level abstractions are very popular
– This supports the idea that MapReduce is too low-level a paradigm
● Merge MapReduce
– Targets the problem of relational operations (joins)
– Requires changes in the architecture and a new merge step
5. Classic MapReduce
● Jobs
– input file, output file
– Developer provides two functions: map and reduce
● Distributed execution of work
– First, the map function runs in the map phase
– Then the reduce function runs in the reduce phase
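The two-phase flow above can be sketched in a few lines. This is a toy, single-process illustration and not Hadoop code: `run_mapreduce`, `word_count_map` and `word_count_reduce` are hypothetical names, and the word-count job is an assumed example that does not come from the slides.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Toy single-process MapReduce: map, group by key, then reduce."""
    groups = defaultdict(list)
    # Map phase: every input record may emit any number of (key, value) pairs.
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # Reduce phase: one call per distinct key, keys visited in sorted order.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


def word_count_map(line):
    for word in line.split():
        yield word, 1


def word_count_reduce(word, counts):
    yield word, sum(counts)


result = run_mapreduce(["a b a", "b c"], word_count_map, word_count_reduce)
```

The developer only writes the two functions; grouping and ordering by key happen in the framework, which is the part a real cluster distributes.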
6. The problems of MapReduce
● Compound records
– Real-world problems involve multi-field records, which do not fit well
into the key/value schema
● Sorting
– No inherent sorting of the records within a reduce group.
– “Secondary sorting trick” needed in implementations (Hadoop)
● Join
– A quite common operation
– Not directly possible in MapReduce without using “tricks”:
● secondary sorting
● compound records
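The "secondary sorting trick" mentioned above can be illustrated roughly as follows. This is a plain-Python simulation under stated assumptions, not Hadoop code: the sample records are made up, and in Hadoop the same effect typically also needs a custom partitioner and grouping comparator, which are only mimicked here by the sort and the `groupby` key.

```python
from itertools import groupby

# Hypothetical records of the form (url, date, visits).
records = [
    ("b.com", "2012-01-02", 5),
    ("a.com", "2012-01-03", 1),
    ("a.com", "2012-01-01", 7),
]

# The trick: move the field to sort on into a compound key ...
keyed = [((url, date), visits) for url, date, visits in records]
# ... sort on the whole compound key (what the shuffle would do) ...
keyed.sort(key=lambda kv: kv[0])
# ... but group only on the first part of the key.
grouped = {
    url: [(key[1], visits) for key, visits in group]
    for url, group in groupby(keyed, key=lambda kv: kv[0][0])
}
```

Each reduce group now sees its values ordered by date, at the cost of hand-building a compound key and splitting it again in the reducer.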
7. Tuple MapReduce
● Idea: replace key/value by tuples
● group-by and sort-by clauses
8. Tuple MapReduce (II)
● group-by and sort-by constraint
– group-by as a prefix of sort-by
– Needed if you want to be able to implement Tuple MapReduce over a
MapReduce architecture
● Contrary to MapReduce, Tuple MapReduce:
– provides compound records → tuple
– provides intra-reduce sorting
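A minimal sketch of the model, assuming tuples are represented as Python dicts and the group-by and sort-by clauses as non-empty lists of field names. `run_tuple_mapreduce` is a hypothetical helper written for illustration; it is not Pangool's API.

```python
from itertools import groupby
from operator import itemgetter


def run_tuple_mapreduce(tuples, group_by, sort_by, reduce_fn):
    """Toy Tuple MapReduce over dict tuples (single process, in memory)."""
    # The constraint from the slide: group-by must be a prefix of sort-by.
    if list(sort_by[:len(group_by)]) != list(group_by):
        raise ValueError("group-by must be a prefix of sort-by")
    # One sort serves both clauses because of the prefix constraint.
    ordered = sorted(tuples, key=itemgetter(*sort_by))
    output = []
    for _, group in groupby(ordered, key=itemgetter(*group_by)):
        output.extend(reduce_fn(list(group)))
    return output


# Hypothetical sample data: page views per user with a timestamp.
events = [
    {"user": "ann", "ts": 3, "page": "/c"},
    {"user": "bob", "ts": 1, "page": "/a"},
    {"user": "ann", "ts": 1, "page": "/b"},
]


def pages_in_order(group):
    yield group[0]["user"], [t["page"] for t in group]


result = run_tuple_mapreduce(events, ["user"], ["user", "ts"], pages_in_order)
```

Because group-by is a prefix of sort-by, a single sorted pass delivers each group contiguously and already ordered, which is why the model can run on an unmodified MapReduce shuffle.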
9. Example: cumulative visits
● Cumulative # of visits up to each single date
Input → URL, date, visits
Expected output → URL, date, cumulative visits
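The cumulative-visits example can be sketched as group-by URL with sort-by (URL, date). The function name `cumulative_visits` and the sample rows are assumptions for illustration; dates are assumed to be ISO-formatted strings so that lexical order matches chronological order.

```python
from itertools import groupby


def cumulative_visits(records):
    """records: iterable of (url, date, visits) with ISO date strings."""
    # sort-by (url, date); group-by url is a prefix of that ordering.
    ordered = sorted(records, key=lambda r: (r[0], r[1]))
    output = []
    for url, group in groupby(ordered, key=lambda r: r[0]):
        running_total = 0
        # Tuples arrive in date order, so a single running sum is enough.
        for _, date, visits in group:
            running_total += visits
            output.append((url, date, running_total))
    return output


rows = [
    ("a.com", "2012-01-02", 3),
    ("b.com", "2012-01-01", 4),
    ("a.com", "2012-01-01", 10),
    ("a.com", "2012-01-03", 2),
]
result = cumulative_visits(rows)
```

The reducer keeps only one counter per group, which is possible because the dates reach it already sorted.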
10. Join-Tuple MapReduce
● Joins among heterogeneous datasets
– Tuples associated with a source-id.
● Tuples reach the reducer sorted by source-id
– enabling memoryless reduce joins
– and grouped by some common fields
11. Example: join between clients and payments
[Figure: clients inner join payments; columns shown: name, client_id, payment_id, amount]
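A rough sketch of a reduce-side inner join between clients and payments, driven by a source-id. The record layouts are assumptions, since the slide only lists the column names: clients are taken to be (client_id, name) and payments (payment_id, client_id, amount). The constants and the function name are hypothetical.

```python
from itertools import groupby

CLIENTS, PAYMENTS = 0, 1  # source ids; clients sort before payments


def join_clients_payments(clients, payments):
    """Inner join on client_id; returns (name, client_id, payment_id, amount)."""
    # Tag every tuple with the common field and its source id.
    tagged = [(cid, CLIENTS, (name,)) for cid, name in clients]
    tagged += [(cid, PAYMENTS, (pid, amount)) for pid, cid, amount in payments]
    # group-by client_id, sort-by (client_id, source id).
    tagged.sort(key=lambda t: (t[0], t[1]))
    joined = []
    for client_id, group in groupby(tagged, key=lambda t: t[0]):
        name = None
        for _, source, fields in group:
            if source == CLIENTS:
                name = fields[0]      # the client tuple arrives first
            elif name is not None:    # inner join: skip orphan payments
                payment_id, amount = fields
                joined.append((name, client_id, payment_id, amount))
    return joined


clients = [(1, "Ann"), (2, "Bob")]
payments = [(10, 2, 5.0), (11, 1, 7.5), (12, 1, 1.0), (13, 3, 9.0)]
result = join_clients_payments(clients, payments)
```

Since the client tuple reaches the reducer before its payments, the reducer holds a single name in memory and streams the payments, which is what makes the join memoryless.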
12. Generalization of MapReduce
● MapReduce is a Tuple MapReduce with...
– tuples of two values and
– group-by and sort-by set to first value
● The opposite is also possible → implementing Tuple MapReduce
on top of existing MapReduce implementations.
– Architectural changes are not needed.
– Pangool is a proof of that.
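The first direction of this generalization can be sketched as follows: classic MapReduce expressed as a toy Tuple MapReduce over two-field tuples, with group-by and sort-by both set to the first field. Both functions are hypothetical illustrations and say nothing about how Pangool is implemented.

```python
from itertools import groupby


def tuple_mapreduce(tuples, group_by, sort_by, reduce_fn):
    """Toy engine: plain Python tuples; clauses are lists of field indexes."""
    assert sort_by[:len(group_by)] == group_by, "group-by must prefix sort-by"
    ordered = sorted(tuples, key=lambda t: [t[i] for i in sort_by])
    out = []
    for _, group in groupby(ordered, key=lambda t: [t[i] for i in group_by]):
        out.extend(reduce_fn(list(group)))
    return out


def classic_mapreduce(pairs, reduce_fn):
    """Classic MapReduce = 2-field tuples, group-by and sort-by on field 0."""
    def adapter(group):
        key = group[0][0]
        values = [value for _, value in group]
        return reduce_fn(key, values)
    return tuple_mapreduce(pairs, group_by=[0], sort_by=[0], reduce_fn=adapter)


pairs = [("b", 1), ("a", 1), ("b", 1)]
result = classic_mapreduce(pairs, lambda key, values: [(key, sum(values))])
```

The adapter only unpacks each group back into a key and a list of values, so nothing of the classic model is lost.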
13. Pangool (pangool.net)
● Tuple MapReduce implementation on top of Hadoop.
– On top of existing MapReduce implementation.
● It is just a library. No architecture change was needed.
● Used in real-world applications
– Banking
– Searching
– Social networks
16. Pangool performance
● Only between 5% and 8% worse than Hadoop
– Pretty good considering that Pangool is built on top of the Hadoop API
● The difference would probably disappear with a native
implementation
● Much better than higher-level APIs
– Probably because Pangool is a low-level API
17. Conclusions and Future work
● The MapReduce key/value model has been shown to be too strict.
● Tuple MapReduce keeps MapReduce's features
– Enhancing them with
● compound records,
● joins and
● intra-reduce sorting.
● Pangool is a proof of its viability,
– including in existing implementations like Hadoop without changing the
architecture
● Future work would involve abstractions for flow creation
– Simplifying job chaining and data flow.
18. Thanks!
● Any questions or doubts?
– ivan@datasalt.com
– @ivanprado