3. Overview
Progress in computing
Traditional Hard- and Software
Theoretical Computer Science
Algorithms
Machines
Optimization
Parallelization
Parallel Hard- and Software
4. Progress in Computing
1. New applications
Not feasible before
Not needed before
Not possible before
2. Better applications
Faster
More data
Better quality
precision, accuracy, exactness
5. Progress in Computing
Two ingredients
Hardware
Machine(s) to execute program
Software
Model / language to formulate program
Libraries
Methods
6. How was progress achieved?
Hardware
CPU, memory, disks, networks
Faster and larger
Software
New and better algorithms
Programming methods and languages
7. Traditional Hardware
Von Neumann architecture
Diagram: CPU, memory, and I/O connected by a bus
John Backus, 1977: the "von Neumann bottleneck"
Mitigated by caches
8. Improvements
Increasing Clock Frequency
Memory Hierarchy / Cache
Parallelizing ALU
Pipelining
Very Long Instruction Word (VLIW)
Instruction-Level parallelism (ILP)
Superscalar processors
Vector data types
Multithreaded
Multicore / Manycore
14. How to use the cores?
Multi-Tasking OS
Different tasks
Speeding up same task
Assume 2 CPUs
Problem is divided in half
Each CPU calculates a half
Time taken is half of the original time?
15. Traditional Software
Computation is expressed as an "algorithm"
"a step-by-step procedure for calculations"
algorithm = logic + control
Example
1. Open file
2. For all records in the file
1. Add the salary
3. Close file
4. Print out the sum of the salaries
Keywords
Sequential, Serial, Deterministic
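The salary-sum steps above can be sketched in Java; a hypothetical in-memory list stands in for the file's records, so the open/close steps collapse into the method boundary:

```java
import java.util.List;

public class SalarySum {
    // Hypothetical record list standing in for "open file ... close file".
    static int sumSalaries(List<Integer> salaries) {
        int sum = 0;
        for (int s : salaries) {   // "for all records in the file"
            sum += s;              // "add the salary"
        }
        return sum;                // "print out the sum" happens at the caller
    }

    public static void main(String[] args) {
        System.out.println(sumSalaries(List.of(3, 7, 5, 1, 2)));  // 18
    }
}
```

The loop is sequential, serial, and deterministic: the same input always produces the same sum in the same order of steps.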
16. Traditional Software
Improvements
Better algorithms
Programming languages (OO)
Development methods (agile)
Limits
Theoretical Computer Science
Complexity theory (NP, P, NC)
23. Sequential Algorithms
Random Access Machine (RAM)
Step by step, deterministic
int sum = 0
for i=0 to 4
sum += mem[i]
mem[5] = sum
Addr  Value
0     3
1     7
2     5
3     1
4     2
5     18
24. Sequential Algorithms
int sum = 0
for i=0 to 4
sum += mem[i]
Memory after each step (mem[0..4] = 3, 7, 5, 1, 2 are unchanged; mem[5] holds the running sum):
mem[5]: 0 → 3 → 10 → 15 → 16 → 18
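The RAM trace above can be run directly; a minimal Java sketch, with an array standing in for the machine's memory cells:

```java
public class RamSum {
    // Addresses 0..4 hold the input; address 5 holds the running sum.
    static int[] mem = {3, 7, 5, 1, 2, 0};

    public static void main(String[] args) {
        int sum = 0;
        for (int i = 0; i <= 4; i++) {
            sum += mem[i];   // read one cell per step, deterministically
            mem[5] = sum;    // store the running sum at address 5
        }
        System.out.println(mem[5]);  // 18
    }
}
```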
25. More than one CPU
How many programs should run?
One
In lock-step
All processors do the same
In any order
More than one
Distributed system
26. Two Processors
PC 1:            PC 2:
int sum = 0      int sum = 0
for i=0 to 2     for i=3 to 4
sum += mem[i]    sum += mem[i]
mem[5] = sum     mem[5] = sum
Lockstep execution
Both write mem[5]: memory access conflict!
Addr  Value
0     3
1     7
2     5
3     1
4     2
5     18
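The two-processor split can be sketched with Java threads; the partial array is an assumption standing in for each CPU's private sum, so the threads share no mutable cell until both have finished:

```java
public class TwoProcessorSum {
    static int[] mem = {3, 7, 5, 1, 2, 0};
    static int[] partial = new int[2];   // one slot per "CPU": disjoint, no conflict

    public static void main(String[] args) throws InterruptedException {
        // CPU 1 sums mem[0..2], CPU 2 sums mem[3..4], independently.
        Thread cpu1 = new Thread(() -> { for (int i = 0; i <= 2; i++) partial[0] += mem[i]; });
        Thread cpu2 = new Thread(() -> { for (int i = 3; i <= 4; i++) partial[1] += mem[i]; });
        cpu1.start(); cpu2.start();
        cpu1.join(); cpu2.join();        // wait for both before combining
        mem[5] = partial[0] + partial[1];
        System.out.println(mem[5]);      // 18
    }
}
```

Joining both threads before the combine step avoids the conflicting writes to mem[5] shown on the slide.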
27. Flynn's Taxonomy
1966
                 Instruction
                 Single   Multiple
Data  Single     SISD     MISD
      Multiple   SIMD     MIMD
32. Computer Science
Theoretical Computer Science
A long time before 2005
1989: Gibbons, Rytter
1990: Ben-Ari
1996: Lynch
33. Gap: Theory and Practice
Galactic algorithms
Written for abstract machines
PRAM, special networks, etc.
Simplifying assumptions
No boundaries
Exact arithmetic
Infinite memory, network speed, etc.
34. Sequential algorithms
Implementing a sequential algorithm
Machine architecture
Programming language
Performance
Processor, memory and cache speed
Boundary cases
Sometimes hard
35. Parallel algorithms
Implementing a parallel algorithm
Adapt algorithm to architecture
No PRAM or sorting network!
Problems with shared memory
Synchronization
Harder!
36. Parallelization
Transforming
a sequential
into a parallel algorithm
Tasks
Adapt to architecture
Rewrite
Test correctness w.r.t. the "golden" sequential code
37. Granularity
“Size” of the threads?
How much computation?
Coarse vs. fine grain
Right choice
Important for good performance
Algorithm design
38. Computational thinking
“… is the thought processes involved
in formulating problems and their
solutions so that the solutions are
represented in a form that can be
effectively carried out by an
information-processing agent.”
Cuny, Snyder, Wing 2010
39. Computational thinking
“… is the new literacy of the 21st
Century.”
Cuny, Snyder, Wing 2010
Expert level needed for parallelization!
40. Problems: Shared Memory
Destructive updates
i += 1
Parallel, independent processes
How do the others know that i increased?
Synchronization needed
Memory barrier
Complicated for beginners
41. Problems: Shared Memory
PC 1:            PC 2:
int sum = 0      int sum = 0
for i=0 to 2     for i=3 to 4
sum += mem[i]    sum += mem[i]
mem[5] = sum     mem[5] = sum
Which write to mem[5] happens first?
Addr  Value
0     3
1     7
2     5
3     1
4     2
5     15 or 3 (not the intended 18!)
42. Problems: Shared Memory
PC 1:            PC 2:
int sum = 0      int sum = 0
for i=0 to 2     for i=3 to 4
sum += mem[i]    sum += mem[i]
mem[5] = sum
sync()           sync()
                 mem[5] += sum
Synchronization needed
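The sync-then-combine scheme above can be sketched with java.util.concurrent.atomic, letting an atomic accumulator play the role of the synchronized mem[5] update (class and variable names are illustrative):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SyncedSum {
    static int[] mem = {3, 7, 5, 1, 2};
    static AtomicInteger total = new AtomicInteger(0);  // safe shared accumulator

    public static void main(String[] args) throws InterruptedException {
        Thread cpu1 = new Thread(() -> {
            int sum = 0;
            for (int i = 0; i <= 2; i++) sum += mem[i];
            total.addAndGet(sum);   // atomic read-modify-write: no lost update
        });
        Thread cpu2 = new Thread(() -> {
            int sum = 0;
            for (int i = 3; i <= 4; i++) sum += mem[i];
            total.addAndGet(sum);
        });
        cpu1.start(); cpu2.start();
        cpu1.join(); cpu2.join();
        System.out.println(total.get());  // 18, regardless of which thread finishes first
    }
}
```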
43. Problems: Shared Memory
The memory barrier
When is a value read or written?
Optimizing compilers change semantics
int a = b + 5
Read b
Add 5 to b, store temporary in c
Write c to a
Solutions (Java)
volatile
java.util.concurrent.atomic
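A minimal sketch of volatile for visibility, with a hypothetical writer/reader pair: without the volatile keyword, the reader's loop is not guaranteed ever to observe the writer's update.

```java
public class VolatileFlag {
    // volatile write/read establishes happens-before: the reader sees
    // every ordinary write the writer made before setting the flag.
    static volatile boolean done = false;
    static int result = 0;

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (!done) { /* spin until the writer publishes */ }
            System.out.println(result);  // guaranteed to see 42
        });
        reader.start();
        result = 42;   // ordinary write...
        done = true;   // ...published by the volatile write
        reader.join();
    }
}
```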
45. Problems: Threads
Deadlock
A wants B, B wants A, both waiting
Starvation
A wants B, but never gets it
Race condition
A writes to mem, B reads/writes mem
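A race condition can be made visible by contrasting an unsynchronized counter with an atomic one; the iteration count and thread setup below are hypothetical:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RaceDemo {
    static int plain = 0;                           // unsynchronized: race condition
    static AtomicInteger atomic = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                plain += 1;               // read-modify-write: updates can be lost
                atomic.incrementAndGet(); // atomic: never loses an update
            }
        };
        Thread a = new Thread(work), b = new Thread(work);
        a.start(); b.start();
        a.join(); b.join();
        // atomic is always 200000; plain is often less on real hardware
        System.out.println("plain=" + plain + " atomic=" + atomic.get());
    }
}
```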
46. Shared Mem: Solutions
Shared mutable state
Synchronize properly
Isolated mutable state
Don't share state
Immutable or unshared
Don't mutate state!
47. Solutions
Transactional Memory
Every access within transaction
See databases
Actor models
Message passing
Immutable state / pure functional
48. Speedup and Efficiency
Running time
T(1) with one processor
T(n) with n processors
Speedup
How much faster?
S(n) = T(1) / T(n)
49. Speedup and Efficiency
Efficiency
Are all the processors used?
E(n) = S(n) / n = T(1) / (n * T(n))
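The two formulas can be checked with hypothetical timings, say 12 s sequentially and 4 s on 4 processors:

```java
public class Speedup {
    // S(n) = T(1) / T(n)
    static double speedup(double t1, double tn) { return t1 / tn; }
    // E(n) = S(n) / n
    static double efficiency(double t1, double tn, int n) { return speedup(t1, tn) / n; }

    public static void main(String[] args) {
        System.out.println(speedup(12, 4));        // 3.0: three times faster
        System.out.println(efficiency(12, 4, 4));  // 0.75: processors busy 75% on average
    }
}
```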
52. Amdahl's Law
Corollary
Maximize the parallel part
Only parallelize when the parallel part is large enough
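Amdahl's law bounds the speedup by 1 / ((1 - p) + p/n) for parallel fraction p on n processors; a small sketch with hypothetical values shows why the corollary holds:

```java
public class Amdahl {
    // Amdahl's law: maximum speedup with parallel fraction p on n processors.
    static double amdahl(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        System.out.println(amdahl(0.95, 8));     // ~5.93: large p pays off
        System.out.println(amdahl(0.50, 1000));  // ~2.0: small p caps the speedup
    }
}
```

Even with 1000 processors, a 50% sequential part limits the speedup to under 2.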
53. P-Completeness
Is there an efficient parallel version for
every algorithm?
No! Hardly parallelizable problems
P-Completeness
Example Circuit-Value-Problem (CVP)
56. Optimization
I/O bound
Thread is waiting for memory, disk, etc.
Computation bound
Thread is calculating the whole time
Watch processor utilization!
57. Optimization
I/O bound
Use asynchronous/non-blocking I/O
Increase number of threads
Computation bound
Number of threads = Number of cores
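Pool sizing along these lines can be sketched with java.util.concurrent.Executors; the 4x multiplier for the I/O-bound pool is a hypothetical rule of thumb, not a fixed rule:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizing {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();

        // Computation-bound: one thread per core avoids oversubscription.
        ExecutorService cpuPool = Executors.newFixedThreadPool(cores);

        // I/O-bound: more threads than cores, since most are blocked waiting.
        ExecutorService ioPool = Executors.newFixedThreadPool(cores * 4);

        cpuPool.shutdown();
        ioPool.shutdown();
    }
}
```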
72. GPU Computing
NVIDIA CUDA
Vendor: NVIDIA
OpenCL
Vendors: AMD, NVIDIA, Intel, Altera, Apple
WebCL
Vendors: Nokia, Samsung
73. Advanced courses
Best practices for concurrency in Java
Java's java.util.concurrent
Actor models
Transactional Memory
See http://www.dinkla.com
74. Advanced courses
GPU Computing
NVIDIA CUDA
OpenCL
Using NVIDIA CUDA with Java
Using OpenCL with Java
See http://www.dinkla.com
75. References: Practice
Mattson, Sanders, Massingill
Patterns for
Parallel Programming
Breshears
The Art of Concurrency
76. References: Practice
Pacheco
An Introduction to
Parallel Programming
Herlihy, Shavit
The Art of
Multiprocessor Programming
77. References: Theory
Gibbons, Rytter
Efficient Parallel Algorithms
Lynch
Distributed Algorithms
Ben-Ari
Principles of Concurrent and
Distributed Programming