Tuple MapReduce: Beyond classic MapReduce

Pedro Ferrera, Ivan de Prado, Eric Palacios
DataSalt, Barcelona, SPAIN
pere,ivan,epalacios@datasalt.com

Jose Luis Fernandez-Marquez, Giovanna Di Marzo Serugendo
University of Geneva, CUI, Geneva, SWITZERLAND
joseluis.fernandez@unige.ch
Outline

●   Introduction
●   Related Work
●   Classic MapReduce
    –   The problems of MapReduce
●   Tuple MapReduce
    –   The basic Tuple MapReduce
    –   Joins
    –   Generalization of MapReduce
●   Pangool
●   Conclusions and Future work



                                      2 / 18
Introduction
●   Huge amounts of information → need for new processing
    technologies.
●   MapReduce → major contribution ...
    –   … but involves a steep learning curve.
●   Most design patterns found in real-world problems are not
    well covered.
●   We propose Tuple MapReduce as a better foundation model.
●   Tuple MapReduce on Hadoop → Pangool
    –   No key architectural changes needed.




                                                               3 / 18
Related work

●   MapReduce: Google paper from 2004
●   Hadoop
●   Higher-level tools
    –   Sawzall, FlumeJava, Pig, Hive, Jaql, Cascading
●   Higher-level abstractions are very popular
    –   This supports the idea that MapReduce is too low-level a paradigm
●   Merge MapReduce
    –   Targets the problem of relational operations (joins)
    –   Requires changes in the architecture and a new merge step




                                                                     4 / 18
Classic MapReduce

●   Jobs
    –   input file, output file
    –   Developer provides two functions: map and reduce




●   Distributed execution of work
    –   First the map function, in the map phase
    –   Then the reduce function, in the reduce phase
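
The two-function contract above can be sketched in a few lines. This is a single-process stand-in, not Hadoop: `run_mapreduce` and the word-count functions are hypothetical names chosen for illustration, and the input is a list of strings rather than a file.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process stand-in for a MapReduce job (no distribution, no files)."""
    # Map phase: each input record yields zero or more (key, value) pairs.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # Reduce phase: one call per distinct key, visited in key order.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output

def word_count_map(line):
    # Emit (word, 1) for every word in the line.
    for word in line.split():
        yield word, 1

def word_count_reduce(word, counts):
    # Sum the partial counts collected for one word.
    yield word, sum(counts)

result = run_mapreduce(["a b a", "b c"], word_count_map, word_count_reduce)
print(result)  # [('a', 2), ('b', 2), ('c', 1)]
```

Note that the values handed to `reduce_fn` arrive in no guaranteed order, which is the limitation the next slide discusses.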




                                                           5 / 18
The problems of MapReduce

●   Compound records
    –   Real-world problems include multi-field records, which don’t fit well
        into the key/value schema.
●   Sorting
    –   No inherent sorting of the records within each reduce group.
    –   Implementations (Hadoop) resort to the “secondary sort” trick.
●   Join
    –   A quite common operation
    –   Not directly possible in MapReduce without using “tricks”:
           ●   secondary sorting
           ●   compound records
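
As a rough illustration of the “secondary sort” trick, the sketch below packs two fields into a composite key, sorts on the whole key, and then groups on its first component only. It is an in-memory simulation with made-up records; on Hadoop the same effect typically requires a custom partitioner and grouping comparator, which are not shown here.

```python
from itertools import groupby

# Hypothetical input: (url, date, visits) records, deliberately unordered.
records = [
    ("a.com", "2012-01-02", 5),
    ("b.com", "2012-01-01", 7),
    ("a.com", "2012-01-01", 3),
]

# Step 1: pack the fields into a composite key; the value carries the rest.
pairs = [((url, date), visits) for url, date, visits in records]

# Step 2: sort on the whole composite key (the role the shuffle sort plays).
pairs.sort(key=lambda kv: kv[0])

# Step 3: group on the first key component only (a custom grouping rule).
grouped = {
    url: [(key[1], visits) for key, visits in group]
    for url, group in groupby(pairs, key=lambda kv: kv[0][0])
}
print(grouped)
```

Each URL now sees its records in date order, but only because the developer hand-built the compound key and the grouping rule.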



                                                                             6 / 18
Tuple MapReduce

●   Idea: replace key/value pairs with tuples
●   group-by and sort-by clauses




                                        7 / 18
Tuple MapReduce (II)
●   group-by and sort-by constraint
    –   group-by as a prefix of sort-by
    –   Needed if you want to be able to implement Tuple MapReduce over a
        MapReduce architecture




●   Contrary to MapReduce, Tuple MapReduce:
    –   provides compound records → tuple
    –   provides intra-reduce sorting
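
A minimal sketch of the model, assuming tuples are plain dicts and the group-by and sort-by clauses are lists of field names. `run_tuple_mapreduce` and the sample data are hypothetical and run in a single process; this is not Pangool's API.

```python
from itertools import groupby

def key_of(fields):
    # Build a sort/group key function from a list of field names.
    return lambda t: tuple(t[f] for f in fields)

def run_tuple_mapreduce(records, map_fn, reduce_fn, group_by, sort_by):
    """Single-process sketch of a Tuple MapReduce job."""
    # The constraint from the slide: group-by must be a prefix of sort-by.
    if sort_by[:len(group_by)] != group_by:
        raise ValueError("group-by must be a prefix of sort-by")
    tuples = [t for record in records for t in map_fn(record)]
    # One sort on the sort-by fields leaves every group contiguous and ordered.
    tuples.sort(key=key_of(sort_by))
    output = []
    for _, group in groupby(tuples, key=key_of(group_by)):
        output.extend(reduce_fn(list(group)))
    return output

def identity_map(t):
    yield t

def first_page(group):
    # The group arrives sorted by timestamp, so the first tuple is the earliest.
    first = group[0]
    yield first["user"], first["page"]

events = [
    {"user": "u1", "ts": 2, "page": "/b"},
    {"user": "u2", "ts": 1, "page": "/x"},
    {"user": "u1", "ts": 1, "page": "/a"},
]
result = run_tuple_mapreduce(
    events, identity_map, first_page,
    group_by=["user"], sort_by=["user", "ts"],
)
print(result)  # [('u1', '/a'), ('u2', '/x')]
```

The prefix check is what lets a single sort serve both clauses: sorting by the sort-by fields already places tuples with equal group-by fields next to each other.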




                                                                            8 / 18
Example: cumulative visits

●   Cumulative number of visits up to each date
    –   Input → URL, date, visits
    –   Expected output → URL, date, cumulative visits
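
One way to picture this example: group by URL, sort by (URL, date), and keep a running total inside each group. The sketch below is an in-memory approximation with made-up rows and a hypothetical `cumulative_visits` helper; dates are assumed to be ISO strings so that string order matches chronological order.

```python
from itertools import groupby

def cumulative_visits(records):
    """records: iterable of (url, date, visits); returns (url, date, running total)."""
    # Sort by (url, date): the group-by field (url) is a prefix of the sort-by fields.
    ordered = sorted(records, key=lambda r: (r[0], r[1]))
    output = []
    for url, group in groupby(ordered, key=lambda r: r[0]):
        total = 0
        # Dates arrive in ascending order, so a running sum is enough.
        for _, date, visits in group:
            total += visits
            output.append((url, date, total))
    return output

rows = [
    ("a.com", "2012-01-02", 5),
    ("b.com", "2012-01-01", 7),
    ("a.com", "2012-01-01", 3),
    ("a.com", "2012-01-03", 2),
]
result = cumulative_visits(rows)
for row in result:
    print(row)
```

Because the records of each URL reach the reducer already ordered by date, the reducer holds only one counter and never buffers the group.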




                                                          9 / 18
Join-Tuple MapReduce

●   Joins among heterogeneous datasets
    –   Tuples are tagged with a source-id.
    –   Tuples reach the reducer grouped by some common fields and,
        within each group, sorted by source-id.
         ●   This enables memoryless reduce-side joins.




                                                            10 / 18
Example: join between clients and payments
●   Inner join of clients and payments
    –   Fields shown: name, client_id, payment_id, amount
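
A sketch of how such a join could work, under assumptions the slide does not state: clients are (client_id, name) rows with at most one row per client_id, payments are (payment_id, client_id, amount) rows, and the source-ids 0 and 1 are arbitrary. Tuples are sorted by (client_id, source-id) and grouped by client_id, so the client row reaches the reducer before that client's payments.

```python
from itertools import groupby

CLIENTS, PAYMENTS = 0, 1  # assumed source-ids; clients sort before payments

def inner_join(clients, payments):
    # Map side: tag each tuple as (client_id, source_id, payload).
    tagged = [(cid, CLIENTS, name) for cid, name in clients]
    tagged += [(cid, PAYMENTS, (pid, amount)) for pid, cid, amount in payments]
    # Sort by (client_id, source_id); group by client_id only.
    tagged.sort(key=lambda t: (t[0], t[1]))
    output = []
    for cid, group in groupby(tagged, key=lambda t: t[0]):
        name = None  # the only state kept per group
        for _, source, payload in group:
            if source == CLIENTS:
                name = payload
            elif name is not None:
                # A payment with a matching client: emit the joined row.
                pid, amount = payload
                output.append((name, cid, pid, amount))
    return output

clients = [(1, "Ana"), (2, "Bo"), (3, "Cy")]
payments = [(10, 1, 50.0), (11, 3, 20.0), (12, 1, 5.0), (13, 4, 9.0)]
result = inner_join(clients, payments)
for row in result:
    print(row)
```

The reducer streams through the payments without buffering them, which is the sense in which the join is memoryless. Payments without a client and clients without payments produce no output, as expected of an inner join.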




                                         11 / 18
Generalization of MapReduce

●   MapReduce is a Tuple MapReduce with...
    –   tuples of two values and
    –   group-by and sort-by both set to the first value
●   The opposite is also possible → implementing Tuple MapReduce
    on top of existing MapReduce implementations.
    –   Architectural changes are not needed.
    –   Pangool is a proof of that.
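
The first direction can be sketched directly: a key/value job is a tuple job over two-field tuples whose group-by and sort-by are both the first field. Both functions below are hypothetical, single-process illustrations; fields are addressed by position.

```python
from itertools import groupby

def run_tuple_mapreduce(tuples, reduce_fn, group_by, sort_by):
    """Tiny in-memory tuple engine; the clauses are lists of field positions."""
    key_fn = lambda fields: (lambda t: tuple(t[i] for i in fields))
    ordered = sorted(tuples, key=key_fn(sort_by))
    output = []
    for _, group in groupby(ordered, key=key_fn(group_by)):
        output.extend(reduce_fn(list(group)))
    return output

def classic_mapreduce(pairs, reduce_fn):
    """Key/value MapReduce as the special case: 2-field tuples, both clauses on field 0."""
    def tuple_reduce(group):
        key = group[0][0]
        values = [value for _, value in group]
        return reduce_fn(key, values)
    return run_tuple_mapreduce(pairs, tuple_reduce, group_by=[0], sort_by=[0])

pairs = [("b", 1), ("a", 1), ("b", 1)]
result = classic_mapreduce(pairs, lambda key, values: [(key, sum(values))])
print(result)  # [('a', 1), ('b', 2)]
```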




                                                            12 / 18
Pangool (pangool.net)

●   Tuple MapReduce implementation on top of Hadoop.
    –   On top of an existing MapReduce implementation.
         ●   It is just a library. No architecture change was needed.
●   Used in real-world applications
    –   Banking
    –   Searching
    –   Social networks




                                                                        13 / 18
Pangool benchmark – secondary sort




                                     14 / 18
Pangool benchmark – join




                           15 / 18
Pangool performance

●   Only between 5% and 8% worse than Hadoop
    –   Pretty good considering that Pangool is built on top of the Hadoop API
         ●   The difference would probably disappear with a native
             implementation
●   Much better than higher-level APIs
    –   Probably because Pangool is a low-level API




                                                                        16 / 18
Conclusions and Future work

●   The MapReduce key/value model has been shown to be too strict.
●   Tuple MapReduce keeps MapReduce's features
    –   Enhancing them with
         ●   compound records,
         ●   joins and
         ●   intra-reduce sorting.
●   Pangool is a proof of its viability,
    –   including on existing implementations like Hadoop without changing the
        architecture
●   Future work would involve abstractions for flow creation
    –   Simplifying job chaining and data flow.



                                                                         17 / 18
Thanks!





                       ●   Any questions or doubts?

                           –   ivan@datasalt.com
                           –   @ivanprado




                                                                             18 / 18
