Introduction to
Parallel Computing


     Jörn Dinkla
     http://www.dinkla.com

          Version 1.1
Dipl.-Inform. Jörn Dinkla
 Java (J2SE, JEE)
 Programming Languages
   Scala, Groovy, Haskell
 Parallel Computing
   GPU Computing
 Model driven
 Eclipse-Plugins
Overview
 Progress in computing
 Traditional Hard- and Software
 Theoretical Computer Science
   Algorithms
   Machines
   Optimization
 Parallelization
 Parallel Hard- and Software
Progress in Computing
1. New applications
   Not feasible before
   Not needed before
   Not possible before
2. Better applications
    Faster
    More data
    Better quality
      precision, accuracy, exactness
Progress in Computing
 Two ingredients
   Hardware
     Machine(s) to execute program
   Software
     Model / language to formulate program
     Libraries
     Methods
How was progress achieved?
 Hardware
   CPU, memory, disks, networks
   Faster and larger
 Software
   New and better algorithms
   Programming methods and languages
Traditional Hardware
 Von Neumann-Architecture
   CPU, I/O and memory connected by a single bus
   [Diagram: CPU, I/O and memory on a shared bus, with a cache]
 John Backus 1977
   "von Neumann bottleneck"
Improvements
   Increasing clock frequency
   Memory hierarchy / caches
   Parallelizing the ALU
   Pipelining
   Very Long Instruction Words (VLIW)
   Instruction-Level Parallelism (ILP)
   Superscalar processors
   Vector data types
   Multithreading
   Multicore / Manycore
Moore's Law
 Expected to hold until about 2020
Clock frequency
 No increase since 2005
Physical Limits
 Increasing the clock frequency means
   Higher energy consumption
   Higher heat dissipation
 Limit to transistor size

   Faster processors impossible!?
2005
“The Free Lunch Is Over:
   A Fundamental Turn Toward
   Concurrency in Software”

       Herb Sutter
       Dr. Dobb’s Journal, March 2005
Multicore
 Transistor count
    Doubles every 2-3 years
 Calculation speed
    No increase

  Multicore

 Efficient?
How to use the cores?
 Multi-Tasking OS
   Different tasks
 Speeding up same task
     Assume 2 CPUs
     Problem is divided in half
     Each CPU calculates a half
     Time taken is half of the original time?
Traditional Software
 Computation is expressed as an "algorithm"
    "a step-by-step procedure for calculations"
    algorithm = logic + control
 Example
   1.   Open file
   2.   For all records in the file
        1.   Add the salary
   3.   Close file
   4.   Print out the sum of the salaries

 Keywords
    Sequential, Serial, Deterministic
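The salary algorithm above can be sketched in Java (the language used for the solutions later in this deck). The in-memory list is a hypothetical stand-in for the records read from the file:

```java
import java.util.List;

// Sketch of the sequential salary algorithm from the slide; the list
// stands in for the records of the opened file.
public class SalarySum {
    static int sumSalaries(List<Integer> records) {
        int sum = 0;
        for (int salary : records) {   // step 2: for all records in the file
            sum += salary;             // step 2.1: add the salary
        }
        return sum;                    // step 4: the sum of the salaries
    }

    public static void main(String[] args) {
        System.out.println(sumSalaries(List.of(3, 7, 5, 1, 2)));  // prints 18
    }
}
```

Note the keywords from the slide: the computation is sequential and deterministic, one step at a time.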
Traditional Software
 Improvements
   Better algorithms
   Programming languages (OO)
   Development methods (agile)
 Limits
   Theoretical Computer Science
   Complexity theory (NP, P, NC)
Architecture
 Simplification: Ignore the bus
 [Diagram: the CPU/I/O/memory-on-a-bus picture is drawn as a single node containing CPU, I/O and memory]
More than one CPU?
 How should they communicate?
 [Diagram: two separate nodes, each with its own CPU, I/O and memory]
Message Passing
 Distributed system
 Loose coupling
 [Diagram: two CPU/I/O/memory nodes exchanging messages over a network]
Shared Memory
 Shared memory
 Tight coupling
 [Diagram: two CPUs attached to the same I/O and memory]
Shared Memory
 Global vs. local
 Memory hierarchy
 [Diagram: two CPUs, each with its own local memory and I/O, plus a shared memory between them]
Overview: Memory
 Unshared Memory
   Message Passing
   Actors
 Shared Memory
   Threads
 Memory hierarchies / hybrid
   Partitioned Global Address Space (PGAS)
 Transactional Memory
Sequential Algorithms
 Random Access Machine (RAM)
   Step by step, deterministic

   PC   int sum = 0            Addr Value
        for i=0 to 4             0     3
          sum += mem[i]          1     7
        mem[5]= sum              2     5
                                 3     1
                                 4     2
                                 5    18
Sequential Algorithms
int sum = 0
for i=0 to 4
  sum += mem[i]
Addr Value   Addr Value   Addr Value   Addr Value   Addr Value   Addr Value
 0     3      0     3      0     3      0     3      0     3      0     3
 1     7      1     7      1     7      1     7      1     7      1     7
 2     5      2     5      2     5      2     5      2     5      2     5
 3     1      3     1      3     1      3     1      3     1      3     1
 4     2      4     2      4     2      4     2      4     2      4     2
 5     0      5     3      5    10      5    15      5    16      5    18
More than one CPU
 How many programs should run?
   One
     In lock-step
        All processors do the same
     In any order
   More than one
     Distributed system
Two Processors
PC 1   int sum = 0              int sum = 0
       for i=0 to 2      PC 2   for i=3 to 4
         sum += mem[i]            sum += mem[i]
       mem[5]= sum              mem[5]= sum

 Lockstep                       Addr Value
                                 0     3
 Memory Access!                  1     7
                                 2     5
                                 3     1
                                 4     2
                                 5    18
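The two-processor sum above can be sketched with Java threads. As a hedge against the "Memory Access!" problem the slide flags, each thread here keeps a private partial sum and the halves are combined only after both threads have been joined:

```java
// Sketch of the two-processor sum: each thread sums its half of mem into
// its own slot of 'partial'; combining after join() avoids the racy
// concurrent write to mem[5] shown on the slide.
public class TwoThreadSum {
    static int sum(int[] mem) throws InterruptedException {
        int[] partial = new int[2];
        Thread t1 = new Thread(() -> { for (int i = 0; i <= 2; i++) partial[0] += mem[i]; });
        Thread t2 = new Thread(() -> { for (int i = 3; i <= 4; i++) partial[1] += mem[i]; });
        t1.start(); t2.start();
        t1.join(); t2.join();              // wait for both halves
        return partial[0] + partial[1];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(sum(new int[]{3, 7, 5, 1, 2}));  // prints 18
    }
}
```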
Flynn's Taxonomy
 1966

                        Instruction
                     Single    Multiple
           Single     SISD      MISD
   Data
          Multiple   SIMD       MIMD
Flynn's Taxonomy
 SISD
   RAM, Von Neumann
 SIMD
   Lockstep, vector processor, GPU
 MISD
   Fault tolerance
 MIMD
   Distributed system
Extension MIMD
 How many programs?

 SPMD
   One program
   Not in lockstep as in SIMD
 MPMD
   Many programs
Processes & Threads
 Process
   Operating system
      Address space
      IPC
   Heavyweight
   Contains 1..* threads
 Thread
   Smallest unit of execution
   Lightweight
Overview: Algorithms
   Sequential
   Parallel
   Concurrent (overlapping execution)
   Distributed
   Randomized
   Quantum
Computer Science
 Theoretical Computer Science
     Studied long before 2005
     1989: Gibbons, Rytter
     1990: Ben-Ari
     1996: Lynch
Gap: Theory and Practice
 Galactic algorithms
 Written for abstract machines
   PRAM, special networks, etc.
 Simplifying assumptions
   No boundaries
   Exact arithmetic
   Infinite memory, network speed, etc.
Sequential algorithms
 Implementing a sequential algorithm
   Machine architecture
   Programming language
   Performance
     Processor, memory and cache speed
   Boundary cases
   Sometimes hard
Parallel algorithms
 Implementing a parallel algorithm
   Adapt algorithm to architecture
      No PRAM or sorting network!
   Problems with shared memory
   Synchronization
   Harder!
Parallelization
 Transforming
   a sequential
   into a parallel algorithm

 Tasks
   Adapt to architecture
   Rewrite
   Test correctness against the "golden" sequential code
Granularity
 “Size” of the threads?
   How much computation?
 Coarse vs. fine grain
 Right choice
   Important for good performance
   Algorithm design
Computational thinking
 “… is the thought processes involved
  in formulating problems and their
  solutions so that the solutions are
  represented in a form that can be
  effectively carried out by an
  information-processing agent.”
              Cuny, Snyder, Wing 2010
Computational thinking
 “… is the new literacy of the 21st
  Century.”
               Cuny, Snyder, Wing 2010



 Expert level needed for parallelization!
Problems: Shared Memory
 Destructive updates
   i += 1
 Parallel, independent processes
   How do the others know that i increased?
   Synchronization needed
      Memory barrier
      Complicated for beginners
Problems: Shared Memory

PC 1   int sum = 0              int sum = 0
       for i=0 to 2      PC 2   for i=3 to 4
         sum += mem[i]            sum += mem[i]
       mem[5]= sum              mem[5]= sum

 Which one writes mem[5] first? Addr Value
                                 0     3
                                 1     7
                                 2     5
                                 3     1
                                 4     2
                                 5    18
Problems: Shared Memory

PC 1   int sum = 0              int sum = 0
       for i=0 to 2      PC 2   for i=3 to 4
         sum += mem[i]            sum += mem[i]
       mem[5]= sum
       sync()                   sync()
                                mem[5] += sum


 Synchronization needed
Problems: Shared Memory
 The memory barrier
    When is a value actually read or written?
    Optimizing compilers may reorder reads and writes

 int a = b + 5
    Read b
    Add 5, store the result in a temporary c
    Write c to a

 Solutions (Java)
    volatile
    java.util.concurrent.atomic
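A minimal sketch of the visibility problem and the volatile fix named above. The method name runDemo and the join timeout are illustrative choices, not part of the deck; the guarantee shown (a volatile write becomes visible to other threads) is the Java Memory Model behaviour:

```java
// Without 'volatile' on 'done', the reader thread may cache the field and
// spin forever; 'volatile' forces reads and writes to go to shared memory.
public class VisibilityDemo {
    static volatile boolean done = false;

    // Returns true when the reader thread observed the write and terminated.
    static boolean runDemo() throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (!done) { /* spin until the write becomes visible */ }
        });
        reader.start();
        done = true;        // volatile write: guaranteed visible to the reader
        reader.join(2000);  // with 'volatile' the reader terminates promptly
        return !reader.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runDemo() ? "reader saw the write" : "reader stuck");
    }
}
```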
Problems: Shared Memory
 Thread safety
 Reentrant code

  class X {
    int x;
    void inc() { x += 1; }  // not thread-safe: a read-modify-write
  }
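The inc() above loses updates when called from several threads at once. Two fixes, using the synchronized keyword and java.util.concurrent.atomic as listed on the previous slide (the class names SyncX/AtomicX and the hammer helper are made up for this sketch):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SafeCounters {
    // Variant 1: 'synchronized' serializes access, one thread at a time.
    static class SyncX {
        private int x;
        synchronized void inc() { x += 1; }
        synchronized int get() { return x; }
    }

    // Variant 2: java.util.concurrent.atomic gives a lock-free increment.
    static class AtomicX {
        private final AtomicInteger x = new AtomicInteger();
        void inc() { x.incrementAndGet(); }
        int get() { return x.get(); }
    }

    // Hammer a counter from several threads; a correct counter never loses updates.
    static int hammer(int threads, int perThread) throws InterruptedException {
        AtomicX counter = new AtomicX();
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> { for (int j = 0; j < perThread; j++) counter.inc(); });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        return counter.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(hammer(4, 1000)); // prints 4000, never less
    }
}
```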
Problems: Threads
 Deadlock
   A wants B, B wants A, both waiting
 Starvation
   A wants B, but never gets it
 Race condition
   A writes to mem, B reads/writes mem
Shared Mem: Solutions
 Shared mutable state
   Synchronize properly


 Isolated mutable state
   Don't share state


 Immutable or unshared
   Don't mutate state!
Solutions
 Transactional Memory
   Every access within transaction
   See databases
 Actor models
   Message passing
 Immutable state / pure functional
Speedup and Efficiency
 Running time
   T(1) with one processor
   T(n) with n processors
 Speedup
   How much faster?
   S(n) = T(1) / T(n)
Speedup and Efficiency
 Efficiency
   Are all the processors used?
   E(n) = S(n) / n = T(1) / (n * T(n))
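The two definitions can be tried out directly. The running times (10 s sequentially, 3 s on 4 processors) are invented for illustration:

```java
// Speedup S(n) = T(1) / T(n) and efficiency E(n) = S(n) / n, as defined
// on the slides above.
public class Metrics {
    static double speedup(double t1, double tn) { return t1 / tn; }
    static double efficiency(double t1, double tn, int n) { return speedup(t1, tn) / n; }

    public static void main(String[] args) {
        // e.g. 10 s sequentially, 3 s on 4 processors:
        System.out.println(speedup(10, 3));        // ~3.33, not the ideal 4
        System.out.println(efficiency(10, 3, 4));  // ~0.83: processors idle about 17% of the time
    }
}
```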
Amdahl's Law
 The sequential fraction limits the speedup
   S(n) = 1 / ((1 − P) + P / n)
   P = parallelizable fraction of the program
   S(n) ≤ 1 / (1 − P) for any number of processors
Amdahl's Law
 Corollary
   Maximize the parallel part
   Only parallelize when parallel part is large
    enough
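Amdahl's bound as a one-line function, assuming the usual formulation S(n) = 1 / ((1 − p) + p / n); the 90%-parallel example values are illustrative:

```java
// Amdahl's law: with parallel fraction p and n processors, the speedup is
// bounded by 1 / ((1 - p) + p / n); as n grows it approaches 1 / (1 - p).
public class Amdahl {
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        System.out.println(speedup(0.9, 16));        // ~6.4: far below the ideal 16
        System.out.println(speedup(0.9, 1_000_000)); // ~10: the 1/(1-p) ceiling
    }
}
```

This is the corollary from the slide: with p = 0.9, even a million processors never get past a tenfold speedup, so maximizing the parallel part matters more than adding processors.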
P-Completeness
 Is there an efficient parallel version for
  every algorithm?
   No! Some problems are inherently hard to parallelize
   P-Completeness
   Example: Circuit Value Problem (CVP)
Optimization
 What can I achieve?
 When do I stop?
 How many threads should I use?
Optimization
 I/O bound
   Thread is waiting for memory, disk, etc.
 Computation bound
   Thread is calculating the whole time

 Watch processor utilization!
Optimization
 I/O bound
   Use asynchronous/non-blocking I/O
   Increase number of threads
 Computation bound
   Number of threads = Number of cores
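The thread-count rule for computation-bound work, sketched with the standard JDK calls (availableProcessors, Executors); pool usage beyond sizing is omitted:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// For computation-bound work: size the thread pool to the number of cores
// the JVM reports, so every thread can calculate the whole time.
public class PoolSizing {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        System.out.println("threads = cores = " + cores);
        pool.shutdown();
    }
}
```

For I/O-bound work the pool would instead be larger than the core count, since threads spend most of their time waiting.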
Processors
 Multicore CPU
 Graphical Processing Unit (GPU)
 Field-Programmable Gate Array
  (FPGA)
GPU Computing
 Finer granularity than CPU
   Specialized processors
   512 cores on a Fermi
 High memory bandwidth: 192 GB/s
CPU vs. GPU
 [Diagram comparing CPU and GPU architectures; source: SGI]
FPGA
 Configurable hardware circuits
 Programmed in Verilog, VHDL
 Now: OpenCL
   Much higher level of abstraction
 Under development, promising
 No performance test results yet (2011/12)
Networks / Cluster
 Combination of
   CPU
   Memory
   Network
   GPU
   FPGA
 [Diagram: two nodes, each combining CPU, memory, GPU and FPGA, connected by a network]

 Vast possibilities
Example
 2 x connected by network
   2 CPUs, each with local cache
   Global memory
 [Diagram: two nodes on a network; each node has two CPUs with local memories and one shared memory]
Example
 1 CPU with local cache
 Connected by shared memory
   2 GPUs with local memory ("device")
 [Diagram: a CPU with its memory connected to two GPUs, each with its own device memory]
Next Step: Hybrid
 Hybrid / Heterogeneous
   Multi-Core / Many-Core
   Plus special purpose hardware
     GPU
     FPGA
Optimal combination?
 Which network gives the best
  performance?
   Complicated
   Technical restrictions
      4x PCI-Express 16x Motherboards
      Power consumption
      Cooling
Example: K-Computer
   SPARC64 VIIIfx 2.0 GHz
   705,024 cores
   10.51 petaflop/s
   No GPUs

 #1 in 2011
Example: Tianhe-1A
   14,336 Xeon X5670
   7,168 Tesla M2050
   2,048 NUDT FT1000
   2.57 petaflop/s

 #2 in 2011
Example: HPC at home
 Workstations and blades
   8 x 512 cores = 4096 cores
Frameworks: Shared Mem
 C/C++
     OpenMP
     POSIX Threads (pthreads)
     Intel Thread Building Blocks
     Windows Threads
 Java
   java.util.concurrent
Frameworks: Actors
 C/C++
   Theron
 Java / JVM
   Akka
   Scala
   GPars (Groovy)
GPU Computing
 NVIDIA CUDA
   NVIDIA
 OpenCL
     AMD
     NVIDIA
     Intel
     Altera
     Apple
 WebCL
   Nokia
   Samsung
Advanced courses
 Best practices for concurrency in Java
   Java's java.util.concurrent
   Actor models
   Transactional Memory


 See http://www.dinkla.com
Advanced courses
 GPU Computing
     NVIDIA CUDA
     OpenCL
     Using NVIDIA CUDA with Java
     Using OpenCL with Java
 See http://www.dinkla.com
References: Practice
 Mattson, Sanders, Massingill
   Patterns for
    Parallel Programming
 Breshears
   The Art of Concurrency
References: Practice
 Pacheco
   An Introduction to
    Parallel Programming
 Herlihy, Shavit
   The Art of
    Multiprocessor Programming
References: Theory
 Gibbons, Rytter
   Efficient Parallel Algorithms
 Lynch
   Distributed Algorithms
 Ben-Ari
   Principles of Concurrent and
    Distributed Programming
References: GPU Computing
 Scarpino
   OpenCL in Action


 Sanders, Kandrot
   CUDA by Example
References: Background
 Hennessy, Patterson
   Computer Architecture: A Quantitative
    Approach

Die ‚komplexe‘ Perspektive - Einführung in die digitale WirtschaftDie ‚komplexe‘ Perspektive - Einführung in die digitale Wirtschaft
Die ‚komplexe‘ Perspektive - Einführung in die digitale Wirtschaft
 
Geschäftsmodelle - Ein kurzer Überblick
Geschäftsmodelle -Ein kurzer ÜberblickGeschäftsmodelle -Ein kurzer Überblick
Geschäftsmodelle - Ein kurzer Überblick
 
Buchvorstellung "Libertarian Anarchy: Against the State" von Gerard Casey
Buchvorstellung "Libertarian Anarchy: Against the State" von Gerard CaseyBuchvorstellung "Libertarian Anarchy: Against the State" von Gerard Casey
Buchvorstellung "Libertarian Anarchy: Against the State" von Gerard Casey
 
Multi-GPU-Computing: Eins, zwei, drei, ganz viele
Multi-GPU-Computing: Eins, zwei, drei, ganz vieleMulti-GPU-Computing: Eins, zwei, drei, ganz viele
Multi-GPU-Computing: Eins, zwei, drei, ganz viele
 
Tipps & Tricks für den erfolgreichen Einsatz von GPU-Computing
Tipps & Tricks für den erfolgreichen Einsatz von GPU-ComputingTipps & Tricks für den erfolgreichen Einsatz von GPU-Computing
Tipps & Tricks für den erfolgreichen Einsatz von GPU-Computing
 
GPU-Computing mit CUDA und OpenCL in der Praxis
GPU-Computing mit CUDA und OpenCL in der PraxisGPU-Computing mit CUDA und OpenCL in der Praxis
GPU-Computing mit CUDA und OpenCL in der Praxis
 
Subversion Schulung
Subversion SchulungSubversion Schulung
Subversion Schulung
 
Test-Driven-Development mit JUnit 4
Test-Driven-Development mit JUnit 4Test-Driven-Development mit JUnit 4
Test-Driven-Development mit JUnit 4
 
Ant im Detail
Ant im DetailAnt im Detail
Ant im Detail
 

Recently uploaded

Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 

Recently uploaded (20)

Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 

Introduction To Parallel Computing

  • 1. Introduction to Parallel Computing – Jörn Dinkla – http://www.dinkla.com – Version 1.1
  • 2. Dipl.-Inform. Jörn Dinkla – Java (J2SE, JEE) – Programming languages: Scala, Groovy, Haskell – Parallel computing – GPU computing – Model driven – Eclipse plugins
  • 3. Overview – Progress in computing – Traditional hard- and software – Theoretical computer science: algorithms, machines, optimization – Parallelization – Parallel hard- and software
  • 4. Progress in Computing – 1. New applications: not feasible, not needed, or not possible before – 2. Better applications: faster, more data, better quality (precision, accuracy, exactness)
  • 5. Progress in Computing – Two ingredients – Hardware: machine(s) to execute the program – Software: a model/language to formulate the program, libraries, methods
  • 6. How was progress achieved? – Hardware: CPU, memory, disks, networks; faster and larger – Software: new and better algorithms, programming methods and languages
  • 7. Traditional Hardware – Von Neumann architecture: CPU, I/O and memory connected by a bus, plus cache – John Backus 1977: the “von Neumann bottleneck”
  • 8. Improvements – Increasing clock frequency – Memory hierarchy / cache – Parallelizing the ALU – Pipelining – Very long instruction words (VLIW) – Instruction-level parallelism (ILP) – Superscalar processors – Vector data types – Multithreading – Multicore / manycore
  • 10. Clock frequency – No increase since 2005
  • 11. Physical Limits – Increasing the clock frequency means more energy consumption and more heat dissipation – There is a limit to transistor size – Faster processors impossible!?
  • 12. 2005 – “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software” – Herb Sutter, Dr. Dobb’s Journal, March 2005
  • 13. Multicore – Transistor count doubles every 2–3 years – Calculation speed: no increase – Multicore – Efficient?
  • 14. How to use the cores? – Multi-tasking OS: different tasks – Speeding up the same task – Assume 2 CPUs – The problem is divided in half – Each CPU calculates one half – Is the time taken half of the original time?
  • 15. Traditional Software – Computation is expressed as an “algorithm”: “a step-by-step procedure for calculations” – algorithm = logic + control – Example: 1. Open file 2. For all records in the file, add the salary 3. Close file 4. Print out the sum of the salaries – Keywords: sequential, serial, deterministic
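The salary example on this slide can be sketched in Java. This is a minimal illustration of the four steps; the Employee record, names and salary values are assumptions for the example, not part of the deck.

```java
import java.util.List;

// Sequential "sum of salaries" algorithm from the slide, step by step.
public class SalarySum {
    // Illustrative stand-in for a record read from the file.
    record Employee(String name, double salary) {}

    static double sumSalaries(List<Employee> records) {
        double sum = 0;                 // initialize before the loop
        for (Employee e : records) {    // step 2: for all records in the file
            sum += e.salary();          //         add the salary
        }
        return sum;                     // step 4: the value to print out
    }

    public static void main(String[] args) {
        // Stands in for steps 1 and 3 (open/close file).
        List<Employee> file = List.of(
            new Employee("A", 1000.0),
            new Employee("B", 2000.0));
        System.out.println(sumSalaries(file)); // prints 3000.0
    }
}
```

Each step runs strictly after the previous one, which is exactly what "sequential, serial, deterministic" means here.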
  • 16. Traditional Software – Improvements: better algorithms, programming languages (OO), development methods (agile) – Limits: theoretical computer science, complexity theory (NP, P, NC)
  • 17. Architecture – Simplification: ignore the bus – CPU, I/O and memory, drawn without the bus
  • 18. More than one CPU? – How should they communicate? – Two CPUs, each with its own I/O and memory
  • 19. Message Passing – Distributed system – Loose coupling – The CPUs exchange messages over a network
  • 20. Shared Memory – Shared memory – Tight coupling – The CPUs access the same memory
  • 21. Shared Memory – Global vs. local – Memory hierarchy: local memories plus a shared memory
  • 22. Overview: Memory – Unshared memory: message passing, actors – Shared memory: threads – Memory hierarchies / hybrid – Partitioned Global Address Space (PGAS) – Transactional memory
  • 23. Sequential Algorithms – Random Access Machine (RAM) – Step by step, deterministic – Program: int sum = 0; for i = 0 to 4: sum += mem[i]; mem[5] = sum – Memory: mem[0..4] = 3, 7, 5, 1, 2; afterwards mem[5] = 18
  • 24. Sequential Algorithms – int sum = 0; for i = 0 to 4: sum += mem[i] – With mem[0..4] = 3, 7, 5, 1, 2 the value stored in mem[5] progresses 0 → 3 → 10 → 15 → 16 → 18
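The RAM trace above can be reproduced directly: an int array plays the role of the machine's memory, and the loop is the program from the slide.

```java
// RAM example from the slides: sum the values in memory cells 0..4 and
// store the result in cell 5.
public class RamSum {
    static int[] run() {
        int[] mem = {3, 7, 5, 1, 2, 0};   // cells 0..5; cell 5 holds the result
        int sum = 0;
        for (int i = 0; i <= 4; i++) {
            sum += mem[i];                // running sum: 3, 10, 15, 16, 18
        }
        mem[5] = sum;                     // the final store of the trace
        return mem;
    }

    public static void main(String[] args) {
        System.out.println(run()[5]);     // prints 18
    }
}
```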
  • 25. More than one CPU – How many programs should run? – One: in lockstep (all processors do the same) or in any order – More than one: distributed system
  • 26. Two Processors – PC 1: int sum = 0; for i = 0 to 2: sum += mem[i]; mem[5] = sum – PC 2: int sum = 0; for i = 3 to 4: sum += mem[i]; mem[5] = sum – Lockstep – Memory access!
  • 27. Flynn‘s Taxonomy – 1966 – Instruction (single/multiple) × data (single/multiple): SISD, MISD, SIMD, MIMD
  • 28. Flynn‘s Taxonomy – SISD: RAM, von Neumann – SIMD: lockstep, vector processor, GPU – MISD: fault tolerance – MIMD: distributed system
  • 29. Extension MIMD – How many programs? – SPMD: one program, not in lockstep as in SIMD – MPMD: many programs
  • 30. Processes & Threads – Process: operating system, address space, IPC, heavyweight, contains 1..* threads – Thread: smallest unit of execution, lightweight
  • 31. Overview: Algorithms – Sequential – Parallel – Concurrent (these overlap) – Distributed – Randomized – Quantum
  • 32. Computer Science – Theoretical computer science studied this a long time before 2005 – 1989: Gibbons, Rytter – 1990: Ben-Ari – 1996: Lynch
  • 33. Gap: Theory and Practice – Galactic algorithms – Written for abstract machines: PRAM, special networks, etc. – Simplifying assumptions: no boundaries, exact arithmetic, infinite memory, network speed, etc.
  • 34. Sequential algorithms – Implementing a sequential algorithm – Machine architecture – Programming language – Performance: processor, memory and cache speed – Boundary cases – Sometimes hard
  • 35. Parallel algorithms – Implementing a parallel algorithm – Adapt the algorithm to the architecture – No PRAM or sorting network! – Problems with shared memory – Synchronization – Harder!
  • 36. Parallelization – Transforming a sequential into a parallel algorithm – Tasks: adapt to the architecture, rewrite, test correctness w.r.t. the “golden” sequential code
  • 37. Granularity – “Size” of the threads? – How much computation? – Coarse vs. fine grain – The right choice is important for good performance and algorithm design
  • 38. Computational thinking – “… is the thought processes involved in formulating problems and their solutions so that the solutions are represented in a form that can be effectively carried out by an information-processing agent.” – Cuny, Snyder, Wing 2010
  • 39. Computational thinking – “… is the new literacy of the 21st Century.” – Cuny, Snyder, Wing 2010 – Expert level needed for parallelization!
  • 40. Problems: Shared Memory – Destructive updates: i += 1 – Parallel, independent processes – How do the others know that i increased? – Synchronization needed – Memory barrier – Complicated for beginners
  • 41. Problems: Shared Memory – PC 1: int sum = 0; for i = 0 to 2: sum += mem[i]; mem[5] = sum – PC 2: int sum = 0; for i = 3 to 4: sum += mem[i]; mem[5] = sum – Which one writes mem[5] first?
  • 42. Problems: Shared Memory – PC 1: int sum = 0; for i = 0 to 2: sum += mem[i]; mem[5] = sum; sync() – PC 2: int sum = 0; for i = 3 to 4: sum += mem[i]; sync(); mem[5] += sum – Synchronization needed
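The two-processor sum of slides 41/42 can be sketched with Java threads. As an assumption not in the deck, an AtomicInteger plays the role of the synchronized update of mem[5], so the combine step is safe regardless of which thread finishes first.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Two "processors" each compute a partial sum and then combine their
// results into a shared cell, as on the slides.
public class TwoProcessorSum {
    static final int[] mem = {3, 7, 5, 1, 2};

    static int parallelSum() throws InterruptedException {
        AtomicInteger total = new AtomicInteger(0);   // the shared "mem[5]"
        Thread pc1 = new Thread(() -> {               // PC 1 sums mem[0..2]
            int sum = 0;
            for (int i = 0; i <= 2; i++) sum += mem[i];
            total.addAndGet(sum);                     // safe combine, any order
        });
        Thread pc2 = new Thread(() -> {               // PC 2 sums mem[3..4]
            int sum = 0;
            for (int i = 3; i <= 4; i++) sum += mem[i];
            total.addAndGet(sum);
        });
        pc1.start(); pc2.start();
        pc1.join(); pc2.join();                       // the sync() of slide 42
        return total.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(parallelSum());            // prints 18
    }
}
```

With a plain `mem[5] = sum` instead of the atomic combine, one thread's write could simply overwrite the other's, which is exactly the problem slide 41 asks about.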
  • 43. Problems: Shared Memory – The memory barrier – When is a value read or written? – Optimizing compilers change the semantics – int a = b + 5 becomes: read b; add 5, storing a temporary c; write c to a – Solutions (Java): volatile, java.util.concurrent.atomic
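A minimal sketch of the two Java solutions named on the slide: a volatile flag guarantees its write is visible to other threads, and an AtomicInteger makes increments safe without explicit locking. The worker thread and loop count are illustrative assumptions.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class VisibilityDemo {
    static volatile boolean done = false;            // write is visible across threads
    static final AtomicInteger counter = new AtomicInteger(0);

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                counter.incrementAndGet();           // atomic i += 1
            }
            done = true;                             // not cached away by the optimizer
        });
        worker.start();
        worker.join();
        System.out.println(done + " " + counter.get()); // prints "true 1000"
    }
}
```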
  • 44. Problems: Shared Memory – Thread safety – Reentrant code – Example of an unsafe class: class X { int x; void inc() { x += 1; } }
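The class X on this slide is not thread-safe: `x += 1` is a read-modify-write, so two concurrent calls can lose an update. One way to fix it, sketched here as an assumption beyond what the slide shows, is to synchronize the methods.

```java
// Thread-safe variant of the slide's class X.
public class SafeX {
    private int x;

    synchronized void inc() { x += 1; }   // only one thread at a time
    synchronized int get() { return x; }

    public static void main(String[] args) throws InterruptedException {
        SafeX s = new SafeX();
        Runnable body = () -> { for (int i = 0; i < 10_000; i++) s.inc(); };
        Thread a = new Thread(body), b = new Thread(body);
        a.start(); b.start();
        a.join(); b.join();
        System.out.println(s.get());      // prints 20000
    }
}
```

An `AtomicInteger` field would work just as well and avoids the lock.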
  • 45. Problems: Threads – Deadlock: A wants B, B wants A, both waiting – Starvation: A wants B, but never gets it – Race condition: A writes to memory while B reads/writes it
  • 46. Shared Mem: Solutions – Shared mutable state: synchronize properly – Isolated mutable state: don‘t share state – Immutable or unshared: don‘t mutate state!
  • 47. Solutions – Transactional memory: every access within a transaction (see databases) – Actor models: message passing – Immutable state / pure functional
  • 48. Speedup and Efficiency – Running time: T(1) with one processor, T(n) with n processors – Speedup: how much faster? – S(n) = T(1) / T(n)
  • 49. Speedup and Efficiency – Efficiency: are all the processors used? – E(n) = S(n) / n = T(1) / (n * T(n))
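The two formulas from slides 48/49 as code; the timings are made-up numbers for illustration.

```java
// Speedup S(n) = T(1) / T(n) and efficiency E(n) = S(n) / n.
public class SpeedupDemo {
    static double speedup(double t1, double tn) {
        return t1 / tn;
    }

    static double efficiency(double t1, double tn, int n) {
        return speedup(t1, tn) / n;
    }

    public static void main(String[] args) {
        double t1 = 100.0;   // assumed sequential time
        double t4 = 30.0;    // assumed time on 4 processors
        System.out.println(speedup(t1, t4));        // 3.33x faster
        System.out.println(efficiency(t1, t4, 4));  // ~0.83: cores are 83% utilized
    }
}
```

An efficiency of 1.0 would mean all four processors are fully used; anything less is overhead from communication and synchronization.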
  • 52. Amdahl‘s Law – Corollary – Maximize the parallel part – Only parallelize when the parallel part is large enough
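The Amdahl's-law slides themselves (50/51) are missing from this transcript; the standard formula the corollary refers to is S(n) = 1 / ((1 - p) + p / n), where p is the parallelizable fraction of the program.

```java
// Amdahl's law: the serial fraction (1 - p) bounds the achievable speedup.
public class Amdahl {
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        // Even with unlimited cores, a 90%-parallel program tops out at 10x.
        System.out.println(speedup(0.9, 4));         // ~3.08 on 4 cores
        System.out.println(speedup(0.9, 1_000_000)); // approaches 10.0
    }
}
```

This is why the slide says to maximize the parallel part: the limit as n grows is 1 / (1 - p), so only p decides how far parallelization can take you.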
  • 53. P-Completeness – Is there an efficient parallel version of every algorithm? – No! There are hardly parallelizable problems – P-completeness – Example: the Circuit Value Problem (CVP)
  • 55. Optimization – What can I achieve? – When do I stop? – How many threads should I use?
  • 56. Optimization – I/O bound: the thread is waiting for memory, disk, etc. – Computation bound: the thread is calculating the whole time – Watch processor utilization!
  • 57. Optimization – I/O bound: use asynchronous/non-blocking I/O, increase the number of threads – Computation bound: number of threads = number of cores
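The "number of threads = number of cores" rule for computation-bound work can be sketched with java.util.concurrent (named later in the deck): size a fixed thread pool to the available cores. The task bodies are placeholders.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CorePool {
    public static void main(String[] args) throws InterruptedException {
        int cores = Runtime.getRuntime().availableProcessors();
        // One thread per core: more would only add context-switch overhead
        // for CPU-bound work.
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        for (int i = 0; i < cores; i++) {
            final int id = i;
            pool.submit(() -> System.out.println("worker " + id + " crunching"));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

For I/O-bound work the same pool would be sized larger (or replaced by non-blocking I/O), since threads spend most of their time waiting rather than computing.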
  • 58. Processors – Multicore CPU – Graphics Processing Unit (GPU) – Field-Programmable Gate Array (FPGA)
  • 59. GPU Computing – Finer granularity than a CPU – Specialized processors – 512 cores on a Fermi – High memory bandwidth: 192 GB/sec
  • 60. CPU vs. GPU – Source: SGI
  • 61. FPGA – Configurable hardware circuits – Programmed in Verilog, VHDL – Now: OpenCL – Much higher level of abstraction – Under development, promising – No performance test results yet (2011/12)
  • 62. Networks / Cluster – Combinations of CPUs, memory, networks, GPUs and FPGAs – Vast possibilities
  • 63. Example – 2 nodes connected by a network – 2 CPUs each, with local caches – Global memory
  • 64. Example – 1 CPU with local cache – Connected by shared memory – 2 GPUs with local (“device”) memory
  • 65. Next Step: Hybrid – Hybrid / heterogeneous – Multi-core / many-core – Plus special-purpose hardware: GPU, FPGA
  • 66. Optimal combination? – Which network gives the best performance? – Complicated – Technical restrictions: 4x PCI-Express 16x motherboards – Power consumption – Cooling
  • 67. Example: K-Computer – SPARC64 VIIIfx 2.0 GHz – 705024 cores – 10.51 petaflop/s – No GPUs – #1 in 2011
  • 68. Example: Tianhe-1A – 14336 Xeon X5670 – 7168 Tesla M2050 – 2048 NUDT FT1000 – 2.57 petaflop/s – #2 in 2011
  • 69. Example: HPC at home – Workstations and blades – 8 x 512 cores = 4096 cores
  • 70. Frameworks: Shared Mem – C/C++: OpenMP, POSIX Threads (pthreads), Intel Threading Building Blocks, Windows Threads – Java: java.util.concurrent
  • 71. Frameworks: Actors – C/C++: Theron – Java / JVM: Akka, Scala, GPars (Groovy)
  • 72. GPU Computing – NVIDIA CUDA: NVIDIA – OpenCL: AMD, NVIDIA, Intel, Altera, Apple – WebCL: Nokia, Samsung
  • 73. Advanced courses – Best practices for concurrency in Java – Java‘s java.util.concurrent – Actor models – Transactional memory – See http://www.dinkla.com
  • 74. Advanced courses – GPU Computing – NVIDIA CUDA – OpenCL – Using NVIDIA CUDA with Java – Using OpenCL with Java – See http://www.dinkla.com
  • 75. References: Practice – Mattson, Sanders, Massingill: Patterns for Parallel Programming – Breshears: The Art of Concurrency
  • 76. References: Practice – Pacheco: An Introduction to Parallel Programming – Herlihy, Shavit: The Art of Multiprocessor Programming
  • 77. References: Theory – Gibbons, Rytter: Efficient Parallel Algorithms – Lynch: Distributed Algorithms – Ben-Ari: Principles of Concurrent and Distributed Programming
  • 78. References: GPU Computing – Scarpino: OpenCL in Action – Sanders, Kandrot: CUDA by Example
  • 79. References: Background – Hennessy, Patterson: Computer Architecture: A Quantitative Approach