Optimization of Collective Communication Operations in MPICH

         Possamai Lino, 800509




   Parallel Computing Lecture – February 2006
Introduction

   Solving many scientific problems requires high computational
   power.
   Parallel architectures were created to increase the speed of
   calculation.
   Computational speed can be increased further by optimizing the
   operations of the message passing interface.
   Profiling showed that more than 40% of the time spent in MPI
   functions went to ‘Reduce’ and ‘AllReduce’,
   and 25% of the time was spent running on a non-power-of-two
   number of processors.




Possamai Lino           Parallel Computing Lecture              2
Cost model
   The time taken to send a message from node i to node j is
   modeled as α+nβ for bi-directional communication.
   α is the latency, β is the per-byte transfer time, and n is the
   number of bytes sent during the communication.
   γ is the per-byte cost of the reduction operation computed
   locally.
   For uni-directional communication the cost is modeled as
   αuni+nβuni.

   The ratio fα=αuni/α characterizes the type of network; the
   bandwidth parameter has the analogous ratio fβ=βuni/β.
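The cost model can be sketched in a few lines of Python; the function names and the example parameter values below are illustrative choices of mine, not values from the slides.

```python
# Sketch of the alpha-beta-gamma cost model defined above.
# Function names and example parameter values are illustrative.

def send_cost(n, alpha, beta):
    """Time to send an n-byte message: latency + n * per-byte cost."""
    return alpha + n * beta

def reduce_cost(n, gamma):
    """Time to reduce n bytes locally at gamma seconds per byte."""
    return n * gamma

# Example: a 1 MiB message with 10 us latency and 1 ns/byte.
alpha, beta, gamma = 10e-6, 1e-9, 0.5e-9
t_msg = send_cost(1 << 20, alpha, beta)
t_red = reduce_cost(1 << 20, gamma)
```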




*-to-all operations

   AllGather (all-to-all)
       Gathers data from all nodes and distributes the combined
       result to all of them.
   Broadcast (one-to-all)
       Broadcasts data from a root node to every other node.
   All-to-all
       Each node sends its own unique data to every other
       process. It differs from allgather because the data
       owned by each node are not pieces of a single vector.


Allgather
   The old algorithm uses a ring method.
   At each of the p-1 steps, node i sends its data to node i+1 and
   receives data from node i-1 (with wrap-around).
   It is still used for medium/large messages and for a
   non-power-of-two number of processes.
   A first optimization uses recursive vector doubling with
   distance doubling, as in the figure.
   The amount of data sent by each process at step k is 2^k n/p,
   where k ranges from 0 to log2 p - 1.
   So the total cost is:
       α log2 p + nβ(p-1)/p
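A minimal simulation of the recursive doubling scheme, tracking only which data blocks each rank holds (the function name and set-based representation are mine):

```python
# Simulation of allgather by recursive vector doubling with
# distance doubling, for p a power of two: at step k each rank
# exchanges all blocks it currently holds with the partner at
# distance 2^k (rank XOR 2^k).

def allgather_recursive_doubling(p):
    assert p & (p - 1) == 0, "p must be a power of two"
    data = [{r} for r in range(p)]   # each rank starts with its own block
    k = 1
    while k < p:
        # every rank r exchanges with r ^ k; both keep the union
        data = [data[r] | data[r ^ k] for r in range(p)]
        k *= 2
    return data

result = allgather_recursive_doubling(8)
# After log2(8) = 3 steps, every rank holds all 8 blocks.
```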


Broadcast

   The binomial tree algorithm is
   the old algorithm used in MPICH.
   It is good for short messages
   because of the latency term.
   Van de Geijn proposed an
   algorithm for long messages
   that divides the message,
   scatters the pieces among the
   nodes, and finally collects them
   back to every node (allgather).
   The total cost is:
       [α log2 p + nβ(p-1)/p] +
       [(p-1)α + nβ(p-1)/p] =
       α(log2 p + p - 1) + 2nβ(p-1)/p.
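Plugging illustrative α and β values into the two cost expressions shows the crossover the slide describes; the parameter values here are assumptions chosen only to make the comparison visible, not measurements.

```python
from math import log2

# Compare the binomial-tree broadcast cost with van de Geijn's
# scatter + allgather cost, using the formulas on this slide.

def binomial_bcast(p, n, alpha, beta):
    """Binomial tree: log2(p) steps, full message each step."""
    return log2(p) * (alpha + n * beta)

def scatter_allgather_bcast(p, n, alpha, beta):
    """Scatter + ring allgather (formula from the slide)."""
    return alpha * (log2(p) + p - 1) + 2 * n * beta * (p - 1) / p

alpha, beta, p = 10e-6, 1e-9, 64
# Short messages favour the binomial tree (fewer latency terms);
# long messages favour scatter + allgather (less bandwidth used).
assert binomial_bcast(p, 1024, alpha, beta) < scatter_allgather_bcast(p, 1024, alpha, beta)
assert scatter_allgather_bcast(p, 1 << 24, alpha, beta) < binomial_bcast(p, 1 << 24, alpha, beta)
```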

Reduce operations
   Reduce
     A root node computes a reduction function over the
     data gathered from all processes.

   Reduce-scatter (all-to-all reduction)
     A reduction in which, at the end, the result vector
     is scattered among the processes.

   AllReduce
      A reduction followed by an allgather of the resulting
      vector.
Terminology

   Recursive vector halving: the vector to be reduced is
   recursively halved at each step.
   Recursive vector doubling: small pieces of the vector
   scattered across processes are recursively gathered or
   combined to form the whole vector.
   Recursive distance halving: the distance over which
   processes communicate is recursively halved at each step
   (p/2, p/4, ..., 1).
   Recursive distance doubling: the distance over which
   processes communicate is recursively doubled at each step
   (1, 2, 4, ..., p/2).
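The two distance schedules can be made concrete with two tiny helpers (function names are mine):

```python
# Generate the communication distances used by the two schedules
# defined above, for p a power of two.

def distance_doubling(p):
    """Distances 1, 2, 4, ..., p/2 used when distance doubles."""
    out, d = [], 1
    while d < p:
        out.append(d)
        d *= 2
    return out

def distance_halving(p):
    """Distances p/2, p/4, ..., 1 used when distance halves."""
    return distance_doubling(p)[::-1]

# distance_doubling(8) -> [1, 2, 4]; distance_halving(8) -> [4, 2, 1]
```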




Reduce-scatter operation                              1/2

   The old algorithm implements this operation as a binomial
   tree reduction to rank 0, followed by a linear scatterv.
   The total cost is the cost of the binomial tree reduction
   plus the cost of the linear scatterv:
   (log2 p + p - 1) α + (log2 p + (p-1)/p) nβ + (log2 p) nγ
   The choice of the best algorithm depends on whether the
   reduction operation is commutative or non-commutative.
   For commutative reduction operations and short
   messages, the recursive-halving algorithm is used.
   For non-commutative operations, recursive doubling is used.


Recursive-halving (commutative)
   The implementation differs depending on
   whether p is a power of two or not.
   In the first case, log2 p steps are taken,
   each performing bi-directional
   communication.
   The amount of data sent is halved at each
   step: each process sends the data needed
   by all processes in the other half and
   receives the data needed by all processes
   in its own half.
   In the second case, the number of
   processes is first reduced to the nearest
   lower power of two p', and the recursive-
   halving algorithm is applied to the
   remaining nodes.
   Finally, the result vector is distributed
   to the r = p - p' excluded processes.
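For the power-of-two case, the progressive halving of each rank's responsibility can be simulated; the half-open-range representation below is my own sketch of the idea, not the paper's code.

```python
# Track the half-open range of result blocks each rank is still
# responsible for: recursive halving shrinks it by half at each of
# the log2(p) steps until each rank owns exactly its own block.

def recursive_halving_blocks(p):
    assert p & (p - 1) == 0, "p must be a power of two"
    ranges = [(0, p) for _ in range(p)]   # each rank starts owning everything
    for _ in range(p.bit_length() - 1):   # log2(p) steps
        new = []
        for r, (lo, hi) in enumerate(ranges):
            mid = (lo + hi) // 2
            # keep the half that contains this rank's final block
            new.append((lo, mid) if r < mid else (mid, hi))
        ranges = new
    return ranges

blocks = recursive_halving_blocks(8)
# Every rank ends with exactly one block: rank r owns block r.
```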

Recursive-doubling (non commutative)

   Similar to the optimized allgather algorithm.
   At each step k, from 0 to log2 p - 1, processes communicate
   n - (n/p)·2^k data.




Reduce-scatter for long messages
 The previous algorithms work well
 when messages are short.
 Otherwise, the pairwise
 exchange algorithm is used.
 It needs p-1 steps: in each
 step i, each process sends
 data to process (rank + i),
 receives data from process
 (rank - i), and finally
 performs a local reduction.
 The amount of data sent at
 each step is n/p.
 It has the same bandwidth
 requirement as the recursive-
 halving algorithm.
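The p-1 step schedule can be written down directly; the function name is mine, a sketch of the rule stated above.

```python
# Pairwise-exchange schedule: at step i (1 <= i < p), each rank
# sends to (rank + i) mod p and receives from (rank - i) mod p,
# then reduces the received n/p block locally.

def pairwise_schedule(rank, p):
    return [((rank + i) % p, (rank - i) % p) for i in range(1, p)]

sched = pairwise_schedule(0, 5)
# Over the p-1 steps, every rank talks to every other rank exactly
# once in each direction.
```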
Switching between algorithms as optimization




Reduce

   The old algorithm uses a binomial
   tree that takes log2 p steps.
   Good for short messages, but
   not the best for long ones.
   The authors propose an optimized
   algorithm, due to Rabenseifner,
   that uses less bandwidth.
   It is a reduce-scatter
   (recursive halving) followed by
   a binomial tree gather to the
   root node.
   The cost is the sum of the
   reduce-scatter and gather costs.
   Good for a power-of-two number
   of processes.
Reduce (non-power of two nodes)

   In this case, before using the above algorithm, the
   number of processes must be arranged.
   It is first reduced to the nearest lower power of two,
   p' = 2^⌊log2 p⌋,
   so the number of nodes removed is r = p - p'.
   The pre-reduction combines half of the data of each of
   the first 2r nodes onto the even-ranked node of its pair.
   After that, the first r nodes plus the last p-2r nodes
   form a power-of-two set, and the power-of-two algorithm
   can be applied.
   Reduction cost: (1-fα)α + (1+fβ)βn/2 + γn/2
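The pre-step bookkeeping — computing p', r, and which ranks stay active — can be sketched as follows; the function name and list representation are mine.

```python
# Compute the nearest lower power of two p', the excess rank count
# r = p - p', and which ranks remain active after the first 2r
# ranks are folded pairwise onto their even-ranked partner.

def po2_prestep(p):
    pprime = 1
    while pprime * 2 <= p:
        pprime *= 2
    r = p - pprime
    # odd ranks among the first 2r hand their data to the even rank
    # below them and drop out; everyone else stays active
    active = [rank for rank in range(p) if rank >= 2 * r or rank % 2 == 0]
    return pprime, r, active

pprime, r, active = po2_prestep(13)
# p = 13 gives p' = 8, r = 5, and 8 active ranks for the
# power-of-two phase.
```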

Reduce/AllReduce npo2 schema




AllReduce (power of two nodes)

   A recursive doubling algorithm
   is used for short messages, and
   for long messages with user-
   defined reduction operations.
   For long messages with predefined
   reduction operations, Rabenseifner's
   algorithm is used.
   Similar to the reduce implementation,
   it starts with a reduce-scatter and is
   followed by an allgather.
   Total cost: 2α log2 p + 2nβ(p-1)/p
   + nγ(p-1)/p
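Comparing this cost with the recursive doubling cost log2 p (α + nβ + nγ) under assumed parameter values illustrates why the short/long switch is used; the α, β, γ values below are illustrative, not measurements.

```python
from math import log2

# Compare the two allreduce costs on this slide: recursive doubling
# (log2 p exchanges of the full vector) versus Rabenseifner's
# reduce-scatter + allgather.

def recursive_doubling_cost(p, n, a, b, g):
    return log2(p) * (a + n * b + n * g)

def rabenseifner_cost(p, n, a, b, g):
    return 2 * a * log2(p) + 2 * n * b * (p - 1) / p + n * g * (p - 1) / p

a, b, g, p = 10e-6, 1e-9, 0.5e-9, 64
# Short vector: latency dominates, recursive doubling wins.
assert recursive_doubling_cost(p, 64, a, b, g) < rabenseifner_cost(p, 64, a, b, g)
# Long vector: bandwidth dominates, Rabenseifner wins.
assert rabenseifner_cost(p, 1 << 24, a, b, g) < recursive_doubling_cost(p, 1 << 24, a, b, g)
```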


AllReduce (non power of two nodes)

   The implementation is similar to reduce, but after the
   pre-reduction on the power-of-two set of nodes and the
   recursive algorithm, an allgather operation follows.
   The allgather is implemented using recursive vector doubling
   and distance halving for the first r+(2r-p) nodes.
   For the other r nodes (p-r+2r-p), additional overhead is
   needed to send them the result data.
   This step takes αuni+nβuni.




Results comparison

   Vendor MPI vs. the newer MPI implementation, and older vs.
   newer implementations of the operations.




Index

   Introduction
   Cost Model
   *-to-all operations
      Allgather
      Broadcast
   Reduce operations
      Terminology
      Reduce-scatter
      Reduce
      AllReduce
   Results comparison


References

   Thakur, Rabenseifner, Gropp,
   "Optimization of Collective Communication Operations in
   MPICH",
   The International Journal of High Performance Computing
   Applications, Vol. 19, No. 1, Spring 2005, pp. 49-66.





More Related Content

What's hot

All-Reduce and Prefix-Sum Operations
All-Reduce and Prefix-Sum Operations All-Reduce and Prefix-Sum Operations
All-Reduce and Prefix-Sum Operations Syed Zaid Irshad
 
Parallel computing chapter 3
Parallel computing chapter 3Parallel computing chapter 3
Parallel computing chapter 3Md. Mahedi Mahfuj
 
Cost optimal algorithm
Cost optimal algorithmCost optimal algorithm
Cost optimal algorithmHeman Pathak
 
Chapter - 04 Basic Communication Operation
Chapter - 04 Basic Communication OperationChapter - 04 Basic Communication Operation
Chapter - 04 Basic Communication OperationNifras Ismail
 
WVKULAK13_submission_14
WVKULAK13_submission_14WVKULAK13_submission_14
WVKULAK13_submission_14Max De Koninck
 
Design Of Area Delay Efficient Fixed-Point Lms Adaptive Filter For EEG Applic...
Design Of Area Delay Efficient Fixed-Point Lms Adaptive Filter For EEG Applic...Design Of Area Delay Efficient Fixed-Point Lms Adaptive Filter For EEG Applic...
Design Of Area Delay Efficient Fixed-Point Lms Adaptive Filter For EEG Applic...IJTET Journal
 
A CRITICAL IMPROVEMENT ON OPEN SHOP SCHEDULING ALGORITHM FOR ROUTING IN INTER...
A CRITICAL IMPROVEMENT ON OPEN SHOP SCHEDULING ALGORITHM FOR ROUTING IN INTER...A CRITICAL IMPROVEMENT ON OPEN SHOP SCHEDULING ALGORITHM FOR ROUTING IN INTER...
A CRITICAL IMPROVEMENT ON OPEN SHOP SCHEDULING ALGORITHM FOR ROUTING IN INTER...IJCNCJournal
 

What's hot (18)

All-Reduce and Prefix-Sum Operations
All-Reduce and Prefix-Sum Operations All-Reduce and Prefix-Sum Operations
All-Reduce and Prefix-Sum Operations
 
D0341015020
D0341015020D0341015020
D0341015020
 
Parallel computing chapter 3
Parallel computing chapter 3Parallel computing chapter 3
Parallel computing chapter 3
 
Cost optimal algorithm
Cost optimal algorithmCost optimal algorithm
Cost optimal algorithm
 
Chap3 slides
Chap3 slidesChap3 slides
Chap3 slides
 
Chapter - 04 Basic Communication Operation
Chapter - 04 Basic Communication OperationChapter - 04 Basic Communication Operation
Chapter - 04 Basic Communication Operation
 
Chap6 slides
Chap6 slidesChap6 slides
Chap6 slides
 
Solution(1)
Solution(1)Solution(1)
Solution(1)
 
Basic Communication
Basic CommunicationBasic Communication
Basic Communication
 
Chapter 4 pc
Chapter 4 pcChapter 4 pc
Chapter 4 pc
 
Chap11 slides
Chap11 slidesChap11 slides
Chap11 slides
 
Chap7 slides
Chap7 slidesChap7 slides
Chap7 slides
 
Broadcast in Hypercube
Broadcast in HypercubeBroadcast in Hypercube
Broadcast in Hypercube
 
WVKULAK13_submission_14
WVKULAK13_submission_14WVKULAK13_submission_14
WVKULAK13_submission_14
 
Design Of Area Delay Efficient Fixed-Point Lms Adaptive Filter For EEG Applic...
Design Of Area Delay Efficient Fixed-Point Lms Adaptive Filter For EEG Applic...Design Of Area Delay Efficient Fixed-Point Lms Adaptive Filter For EEG Applic...
Design Of Area Delay Efficient Fixed-Point Lms Adaptive Filter For EEG Applic...
 
A CRITICAL IMPROVEMENT ON OPEN SHOP SCHEDULING ALGORITHM FOR ROUTING IN INTER...
A CRITICAL IMPROVEMENT ON OPEN SHOP SCHEDULING ALGORITHM FOR ROUTING IN INTER...A CRITICAL IMPROVEMENT ON OPEN SHOP SCHEDULING ALGORITHM FOR ROUTING IN INTER...
A CRITICAL IMPROVEMENT ON OPEN SHOP SCHEDULING ALGORITHM FOR ROUTING IN INTER...
 
Matrix multiplication
Matrix multiplicationMatrix multiplication
Matrix multiplication
 
Dg34662666
Dg34662666Dg34662666
Dg34662666
 

Similar to Optimization of Collective Communication in MPICH

Elementary Parallel Algorithms
Elementary Parallel AlgorithmsElementary Parallel Algorithms
Elementary Parallel AlgorithmsHeman Pathak
 
High Speed Signed multiplier for Digital Signal Processing Applications
High Speed Signed multiplier for Digital Signal Processing ApplicationsHigh Speed Signed multiplier for Digital Signal Processing Applications
High Speed Signed multiplier for Digital Signal Processing ApplicationsIOSR Journals
 
chap4_slides.ppt
chap4_slides.pptchap4_slides.ppt
chap4_slides.pptStrangerMe2
 
Terascale Learning
Terascale LearningTerascale Learning
Terascale Learningpauldix
 
Parallel Programming on the ANDC cluster
Parallel Programming on the ANDC clusterParallel Programming on the ANDC cluster
Parallel Programming on the ANDC clusterSudhang Shankar
 
pMatlab on BlueGene
pMatlab on BlueGenepMatlab on BlueGene
pMatlab on BlueGenevsachde
 
第13回 配信講義 計算科学技術特論A(2021)
第13回 配信講義 計算科学技術特論A(2021)第13回 配信講義 計算科学技術特論A(2021)
第13回 配信講義 計算科学技術特論A(2021)RCCSRENKEI
 
cis97007
cis97007cis97007
cis97007perfj
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentIJERD Editor
 
Parallel Hardware Implementation of Convolution using Vedic Mathematics
Parallel Hardware Implementation of Convolution using Vedic MathematicsParallel Hardware Implementation of Convolution using Vedic Mathematics
Parallel Hardware Implementation of Convolution using Vedic MathematicsIOSR Journals
 
El text.tokuron a(2019).jung190711
El text.tokuron a(2019).jung190711El text.tokuron a(2019).jung190711
El text.tokuron a(2019).jung190711RCCSRENKEI
 
Training the neural network using levenberg marquardt’s algorithm to optimize
Training the neural network using levenberg marquardt’s algorithm to optimizeTraining the neural network using levenberg marquardt’s algorithm to optimize
Training the neural network using levenberg marquardt’s algorithm to optimizeIAEME Publication
 
Manifold Blurring Mean Shift algorithms for manifold denoising, report, 2012
Manifold Blurring Mean Shift algorithms for manifold denoising, report, 2012Manifold Blurring Mean Shift algorithms for manifold denoising, report, 2012
Manifold Blurring Mean Shift algorithms for manifold denoising, report, 2012Florent Renucci
 

Similar to Optimization of Collective Communication in MPICH (20)

Elementary Parallel Algorithms
Elementary Parallel AlgorithmsElementary Parallel Algorithms
Elementary Parallel Algorithms
 
High Speed Signed multiplier for Digital Signal Processing Applications
High Speed Signed multiplier for Digital Signal Processing ApplicationsHigh Speed Signed multiplier for Digital Signal Processing Applications
High Speed Signed multiplier for Digital Signal Processing Applications
 
Chap4 slides
Chap4 slidesChap4 slides
Chap4 slides
 
Chap4 slides
Chap4 slidesChap4 slides
Chap4 slides
 
Chap4 slides
Chap4 slidesChap4 slides
Chap4 slides
 
chap4_slides.ppt
chap4_slides.pptchap4_slides.ppt
chap4_slides.ppt
 
Terascale Learning
Terascale LearningTerascale Learning
Terascale Learning
 
parallel
parallelparallel
parallel
 
Parallel Programming on the ANDC cluster
Parallel Programming on the ANDC clusterParallel Programming on the ANDC cluster
Parallel Programming on the ANDC cluster
 
pMatlab on BlueGene
pMatlab on BlueGenepMatlab on BlueGene
pMatlab on BlueGene
 
第13回 配信講義 計算科学技術特論A(2021)
第13回 配信講義 計算科学技術特論A(2021)第13回 配信講義 計算科学技術特論A(2021)
第13回 配信講義 計算科学技術特論A(2021)
 
IJET-V3I1P14
IJET-V3I1P14IJET-V3I1P14
IJET-V3I1P14
 
Lecturre 07 - Chapter 05 - Basic Communications Operations
Lecturre 07 - Chapter 05 - Basic Communications  OperationsLecturre 07 - Chapter 05 - Basic Communications  Operations
Lecturre 07 - Chapter 05 - Basic Communications Operations
 
Todtree
TodtreeTodtree
Todtree
 
cis97007
cis97007cis97007
cis97007
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
 
Parallel Hardware Implementation of Convolution using Vedic Mathematics
Parallel Hardware Implementation of Convolution using Vedic MathematicsParallel Hardware Implementation of Convolution using Vedic Mathematics
Parallel Hardware Implementation of Convolution using Vedic Mathematics
 
El text.tokuron a(2019).jung190711
El text.tokuron a(2019).jung190711El text.tokuron a(2019).jung190711
El text.tokuron a(2019).jung190711
 
Training the neural network using levenberg marquardt’s algorithm to optimize
Training the neural network using levenberg marquardt’s algorithm to optimizeTraining the neural network using levenberg marquardt’s algorithm to optimize
Training the neural network using levenberg marquardt’s algorithm to optimize
 
Manifold Blurring Mean Shift algorithms for manifold denoising, report, 2012
Manifold Blurring Mean Shift algorithms for manifold denoising, report, 2012Manifold Blurring Mean Shift algorithms for manifold denoising, report, 2012
Manifold Blurring Mean Shift algorithms for manifold denoising, report, 2012
 

More from Lino Possamai

Music Motive @ H-ack
Music Motive @ H-ack Music Motive @ H-ack
Music Motive @ H-ack Lino Possamai
 
Metodi matematici per l’analisi di sistemi complessi
Metodi matematici per l’analisi di sistemi complessiMetodi matematici per l’analisi di sistemi complessi
Metodi matematici per l’analisi di sistemi complessiLino Possamai
 
Multidimensional Analysis of Complex Networks
Multidimensional Analysis of Complex NetworksMultidimensional Analysis of Complex Networks
Multidimensional Analysis of Complex NetworksLino Possamai
 
A static Analyzer for Finding Dynamic Programming Errors
A static Analyzer for Finding Dynamic Programming ErrorsA static Analyzer for Finding Dynamic Programming Errors
A static Analyzer for Finding Dynamic Programming ErrorsLino Possamai
 
On Applying Or-Parallelism and Tabling to Logic Programs
On Applying Or-Parallelism and Tabling to Logic ProgramsOn Applying Or-Parallelism and Tabling to Logic Programs
On Applying Or-Parallelism and Tabling to Logic ProgramsLino Possamai
 
Cure, Clustering Algorithm
Cure, Clustering AlgorithmCure, Clustering Algorithm
Cure, Clustering AlgorithmLino Possamai
 

More from Lino Possamai (7)

Music Motive @ H-ack
Music Motive @ H-ack Music Motive @ H-ack
Music Motive @ H-ack
 
Metodi matematici per l’analisi di sistemi complessi
Metodi matematici per l’analisi di sistemi complessiMetodi matematici per l’analisi di sistemi complessi
Metodi matematici per l’analisi di sistemi complessi
 
Multidimensional Analysis of Complex Networks
Multidimensional Analysis of Complex NetworksMultidimensional Analysis of Complex Networks
Multidimensional Analysis of Complex Networks
 
Slashdot.Org
Slashdot.OrgSlashdot.Org
Slashdot.Org
 
A static Analyzer for Finding Dynamic Programming Errors
A static Analyzer for Finding Dynamic Programming ErrorsA static Analyzer for Finding Dynamic Programming Errors
A static Analyzer for Finding Dynamic Programming Errors
 
On Applying Or-Parallelism and Tabling to Logic Programs
On Applying Or-Parallelism and Tabling to Logic ProgramsOn Applying Or-Parallelism and Tabling to Logic Programs
On Applying Or-Parallelism and Tabling to Logic Programs
 
Cure, Clustering Algorithm
Cure, Clustering AlgorithmCure, Clustering Algorithm
Cure, Clustering Algorithm
 

Recently uploaded

TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 

Recently uploaded (20)

TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 

Optimization of Collective Communication in MPICH

  • 1. Optimization of Collective Communication Operations in MPICH Possamai Lino, 800509 Parallel Computing Lecture – February 2006
  • 2. Introduction To resolve many scientific problems, high calculation power is needed. Parallel architectures were created to increase the speed of calculation. Increasing the computational speed of calculation is achieved also optimizing the operations used in the message passing interface. More than 40% of the time spent in MPI function was spent in the function ‘Reduce’ and ‘AllReduce’. And 25% of time using a non-power of two number of processors Possamai Lino Parallel Computing Lecture 2
  • 3. Cost model The time taken to send a message from node i to j is modeled as α+nβ for bi-directional communications. α is the latency, β is the bandwidth and n (bytes) is the amount of data sent during the communication. γ is the byte cost for the reducing operation computed locally. For uni-diretional communication the cost is modeled as αuni+nβuni. Ratio that indicate the type of network is defined as fα=αuni/α. Same for bandwidth parameter. Possamai Lino Parallel Computing Lecture 3
  • 4. *-to-all operations AllGather (all-to-all) Consist of gathering data from all nodes and distribute it to all Broadcast (one-to-all) Is a operation of broadcasting a data from a root node to every other node. All-to-all Each node send his unique data to every other processes. Different from allgather because the data owned by each node are not part of a unique vector Possamai Lino Parallel Computing Lecture 4
  • 5. Allgather The old algorithm uses a ring method. At each step (p-1 in total), node i send its data to node i+1 and receives data from i-1 (with wrap around). Actually used for large/medium messages and for non-power of two number of processes. A first optimization consist of using a recursive vector doubling with distance doubling technique as in figure. The amount of data sent by each process is 2kn/p, where k is the current step, ranging from 0 to log2 p - 1. So, the total cost is: α log2 p + nβ(p-1)/p Possamai Lino Parallel Computing Lecture 5
  • 6. Broadcast Binomial tree algorithm is the old algorithm used in MPICH. Good for short messages because of the latency term. Van Der Geijn has proposed an algorithm for long messages that takes a message, divide and scatter it between nodes and finally, collect them back to every node (allgather). The total cost is: [α log2 p + nβ(p-1)/p ] + [(p-1)α + nβ(p-1)/p]= α(log2 p + p - 1) + 2nβ(p- 1)/p. Possamai Lino Parallel Computing Lecture 6
  • 7. Reduce operations Reduce A root node computes a reduction function using the data gathered from all processes Reduce-scatter (all-to-all reduction) Is a reduction in which, at the end, the result vector is scattered between the processes AllReduce Is a reduction followed by a allgather of the resulting vector Possamai Lino Parallel Computing Lecture 7
• 8. Terminology. Recursive vector halving: the vector to be reduced is recursively halved at each step. Recursive vector doubling: small pieces of the vector scattered across processes are recursively gathered or combined to form the full vector. Recursive distance halving: the distance over which processes communicate is halved at each step (p/2, p/4, ..., 1). Recursive distance doubling: the distance over which processes communicate is doubled at each step (1, 2, 4, ..., p/2).
• 9. Reduce-scatter operation 1/2. The old algorithm implements this operation as a binomial tree reduction to rank 0, followed by a linear scatterv. The total cost is the sum of the binomial tree reduction and the linear scatterv: (log2 p + p - 1)α + (log2 p + (p-1)/p)nβ + (log2 p)nγ. The choice of the best algorithm depends on whether the reduction operation is commutative or non-commutative: for commutative operations and short messages, the recursive-halving algorithm is used; for non-commutative operations, recursive doubling is used.
• 10. Recursive halving (commutative). The implementation differs depending on whether p is a power of two. In the first case, log2 p steps are taken, each performing a bidirectional exchange, and the data sent is halved at each step: each process sends the data needed by all processes in the other half and receives the data needed by all processes in its own half. In the second case, the number of processes is first reduced to the nearest lower power of two p' before applying the recursive-halving algorithm to the remaining nodes; finally, the result vector is distributed to the r = p - p' excluded processes.
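The power-of-two case can be simulated; a sketch using summation as the (commutative) reduction, with ranks pairing at distances p/2, p/4, ..., 1 and the kept segment halving each step:

```python
def reduce_scatter_recursive_halving(vectors):
    """Simulate reduce-scatter via recursive vector halving / distance halving.
    vectors[r] is rank r's full input vector; assumes p = len(vectors) and the
    vector length are powers of two with p <= length. Returns each rank's chunk."""
    p = len(vectors)
    n = len(vectors[0])
    work = [list(v) for v in vectors]
    lo, hi = [0] * p, [n] * p          # each rank's current segment [lo, hi)
    dist = p // 2
    while dist >= 1:
        snapshot = [list(w) for w in work]
        for r in range(p):
            partner = r ^ dist
            mid = (lo[r] + hi[r]) // 2
            if r < partner:            # keep lower half of the segment,
                hi[r] = mid            # send upper half to partner
            else:                      # keep upper half,
                lo[r] = mid            # send lower half to partner
            for i in range(lo[r], hi[r]):            # reduce received half
                work[r][i] += snapshot[partner][i]
        dist //= 2
    # rank r ends up owning the n/p elements starting at offset r*n/p
    return [work[r][lo[r]:hi[r]] for r in range(p)]
```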
• 11. Recursive doubling (non-commutative). Similar to the optimized allgather algorithm. At each step k, from 0 to log2 p - 1, processes communicate (n - 2^k n/p) data.
• 12. Reduce-scatter for long messages. The previous algorithms work well when messages are short. Otherwise, the pairwise exchange algorithm is used: p - 1 steps are needed, where at step i each process sends data to process (rank + i) and receives data from process (rank - i), modulo p, then performs a local reduction. The amount of data sent at each step is n/p, so the bandwidth requirement is the same as for the recursive-halving algorithm.
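The pairwise exchange can be simulated the same way; a sketch using summation as the reduction (works for any p, assuming the vector length is divisible by p):

```python
def reduce_scatter_pairwise(vectors):
    """Simulate the pairwise-exchange reduce-scatter used for long messages.
    vectors[r] is rank r's full input; returns rank r's reduced n/p chunk."""
    p = len(vectors)
    n = len(vectors[0])
    chunk = n // p
    def piece(r, dest):                 # rank r's data destined for rank dest
        return vectors[r][dest * chunk:(dest + 1) * chunk]
    # each rank starts from its own contribution to its own chunk
    acc = [piece(r, r) for r in range(p)]
    for step in range(1, p):            # p-1 steps
        for r in range(p):
            src = (r - step) % p        # receive from rank - step (mod p) ...
            recv = piece(src, r)        # ... the piece destined for this rank
            acc[r] = [a + b for a, b in zip(acc[r], recv)]  # local reduction
    return acc
```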
• 13. Switching between algorithms as optimization.
• 14. Reduce. The old algorithm uses a binomial tree that takes log2 p steps; good for short messages but not the best for long ones. The authors propose an optimized algorithm, due to Rabenseifner, that uses less bandwidth: a reduce-scatter (recursive halving) followed by a binomial tree gather to the root node. The cost is the sum of the reduce-scatter and gather costs. Good for power-of-two numbers of processes.
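Again using the slide-3 cost model, the bandwidth advantage can be checked numerically. The formulas below are a sketch: the binomial tree sends and reduces the full vector in each of its log2 p steps, while Rabenseifner's algorithm is modeled as the sum of the reduce-scatter and binomial-gather terms; parameter values in the test are illustrative.

```python
import math

def binomial_reduce_cost(p, n, alpha, beta, gamma):
    """log2(p) steps, each transferring and reducing the full n-byte vector."""
    return math.log2(p) * (alpha + n * beta + n * gamma)

def rabenseifner_reduce_cost(p, n, alpha, beta, gamma):
    """Reduce-scatter (recursive halving) followed by binomial gather to root."""
    reduce_scatter = (math.log2(p) * alpha
                      + n * beta * (p - 1) / p
                      + n * gamma * (p - 1) / p)
    gather = math.log2(p) * alpha + n * beta * (p - 1) / p
    return reduce_scatter + gather
```

The latency term doubles but the bandwidth and computation terms shrink from n log2 p to roughly 2n, so Rabenseifner's algorithm wins once messages are long enough.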
• 15. Reduce (non-power-of-two nodes). In this case, before using the above algorithm, the number of processes must be arranged: it is reduced to the nearest lower power of two, p' = 2^⌊log2 p⌋, so the number of nodes removed is r = p - p'. The reduction is obtained by combining half of the data of each of the first 2r nodes into the corresponding even-ranked node. Afterwards, the first r nodes plus the last p - 2r nodes form a power-of-two set, and the power-of-two algorithm can be applied. Cost of this preliminary reduction step: (1 + fα)α + (1 + fβ)βn/2 + γn/2.
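The preliminary step p' = 2^⌊log2 p⌋, r = p - p' is straightforward to compute; a small sketch:

```python
def to_power_of_two(p):
    """Return the nearest lower power of two p' and the count r = p - p'
    of processes folded away before the power-of-two algorithm runs."""
    p_prime = 1
    while p_prime * 2 <= p:
        p_prime *= 2
    return p_prime, p - p_prime
```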
• 16. Reduce/AllReduce non-power-of-two scheme.
• 17. AllReduce (power-of-two nodes). A recursive doubling algorithm is used for short messages, and for long messages with user-defined reduction operations. For long messages with predefined reduction operations, Rabenseifner's algorithm is used: similar to the reduce implementation, it starts with a reduce-scatter and is followed by an allgather. Total cost: 2α log2 p + 2nβ(p-1)/p + nγ(p-1)/p.
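The recursive doubling variant (the short-message case) can be simulated; a sketch using summation as the reduction, with ranks pairing at distances 1, 2, ..., p/2 and exchanging the full vector at every step:

```python
def allreduce_recursive_doubling(vectors):
    """Simulate allreduce via recursive distance doubling; p = len(vectors)
    must be a power of two. Every rank ends with the elementwise sum."""
    p = len(vectors)
    work = [list(v) for v in vectors]
    dist = 1
    while dist < p:                      # log2(p) steps
        snapshot = [list(w) for w in work]
        for r in range(p):
            partner = r ^ dist           # exchange full vector with partner
            work[r] = [a + b for a, b in zip(work[r], snapshot[partner])]
        dist *= 2
    return work
```

Each step moves the whole n-byte vector, so this costs α log2 p + nβ log2 p + nγ log2 p: latency-optimal, but more bandwidth than Rabenseifner's reduce-scatter + allgather for long messages.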
• 18. AllReduce (non-power-of-two nodes). Similar to the reduce implementation, but after the reduction on the power-of-two subset of nodes, an allgather operation follows the recursive algorithm. The allgather is implemented using recursive vector doubling with distance halving on the p' participating nodes (the first r plus the last p - 2r). The remaining r excluded nodes then receive the result data in an additional step, which costs αuni + nβuni.
• 19. Results comparison: vendor MPI vs. the newer MPI implementation; older vs. newer implementations of the operations.
• 20. Index: Introduction, Cost Model, *-to-all operations, Allgather, Broadcast, Reduce operations, Terminology, Reduce-scatter, Reduce, AllReduce, Results comparison.
• 21. References. R. Thakur, R. Rabenseifner, and W. Gropp, "Optimization of Collective Communication Operations in MPICH," The International Journal of High Performance Computing Applications, Vol. 19, No. 1, Spring 2005, pp. 49–66.