2. OUTLINE
• NAS Benchmark Suite
• Experiments
• Paraver Visualization
• Code View
• Communication
• Disk I/O
• Load Balancing
• L1D Cache Miss
• Cycles per Instruction (CPI)
• Execution Time
• Benchmarking Time
• Conclusion
3. NAS Benchmark Suite
• NAS
... is a set of benchmarks.
... evaluates the performance of highly parallel supercomputers.
... is developed and maintained by the NASA Advanced Supercomputing (NAS) Division.
4. NAS Benchmark Suite
• NAS Kernel Applications
  • IS - Integer Sort
  • EP - Embarrassingly Parallel
  • CG - Conjugate Gradient
  • MG - Multi-Grid
  • FT - discrete 3D Fast Fourier Transform
• Problem Sizes
  • S : small size
  • W : workstation size
  • A, B, C : standard test sizes; each class ~4x larger than the previous one
  • D, E, F : large test sizes; each class ~16x larger than the previous one
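
A quick arithmetic check of this scaling, using the key counts of the IS kernel (exponents recalled from the NPB problem-size tables; treat the exact values as an assumption here):

    N_S = 2^{16},\quad N_W = 2^{20},\quad N_A = 2^{23}
    N_B = 2^{25} = 4\,N_A,\qquad N_C = 2^{27} = 4\,N_B
    N_D = 2^{31} = 16\,N_C

So classes A through C each grow by a factor of ~4, while class D is ~16x class C.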
6. Experiments
• NAS Parallel Benchmark version 3.2.1
• IS Kernel Application:
... sorts N keys in parallel.
... tests
  • integer computation speed
  • communication performance
• S Problem Size:
... small, for quick test purposes
... has 2^16 keys
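
The heart of IS is a ranking (counting) sort. A minimal sequential sketch of that step, assuming the class-S parameters (2^16 keys; the key range of 2^11 is an assumption from the NPB parameter tables, and the real benchmark distributes keys over MPI processes and uses its own random-number generator):

    #include <stdio.h>
    #include <stdlib.h>

    #define TOTAL_KEYS (1 << 16)   /* class S: 2^16 keys               */
    #define MAX_KEY    (1 << 11)   /* class S key range (assumed 2^11) */

    int main(void) {
        int *key   = malloc(TOTAL_KEYS * sizeof *key);
        int *count = calloc(MAX_KEY, sizeof *count);

        /* Generate a pseudo-random key sequence
           (rand() stands in for the benchmark's own generator). */
        srand(314159);
        for (int i = 0; i < TOTAL_KEYS; i++)
            key[i] = rand() % MAX_KEY;

        /* Ranking by counting sort: build a histogram of key values... */
        for (int i = 0; i < TOTAL_KEYS; i++)
            count[key[i]]++;

        /* ...then a prefix sum turns counts into key ranks. */
        for (int k = 1; k < MAX_KEY; k++)
            count[k] += count[k - 1];

        printf("keys ranked: %d\n", count[MAX_KEY - 1]);  /* = TOTAL_KEYS */
        free(key);
        free(count);
        return 0;
    }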
7. Experiments
• IS Benchmarking Procedure (general outline; see the code sketch after this list)
1. Generate a sequence of N keys.
2. Load the N keys into the memory system.
3. Timing begins.
4. Loop: sorting & partial verification.
5. Timing ends.
6. Full verification.
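
Mapped onto code, the timed region looks roughly like this (a structural sketch with hypothetical stub routines, not the benchmark's actual source; NPB IS repeats the timed loop 10 times):

    #include <stdio.h>
    #include <time.h>

    #define MAX_ITERATIONS 10          /* NPB IS repeats the sort 10 times */

    /* Hypothetical stand-ins for the benchmark's internal routines. */
    static void rank_keys(void)          { /* counting sort + ranking here */ }
    static int  partial_verify(int iter) { (void)iter; return 1; /* spot-check a few ranks */ }
    static int  full_verify(void)        { return 1; /* check the complete ordering */ }

    int main(void) {
        /* Steps 1-2: key generation and loading happen before timing starts. */

        clock_t t0 = clock();                          /* step 3: time begins */
        for (int iter = 1; iter <= MAX_ITERATIONS; iter++) {
            rank_keys();                               /* step 4: sort ...    */
            if (!partial_verify(iter))
                printf("partial verification failed at iteration %d\n", iter);
        }
        clock_t t1 = clock();                          /* step 5: time ends   */

        printf("benchmarking time: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
        return full_verify() ? 0 : 1;                  /* step 6: untimed     */
    }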
9. Experiments
Procedure:
• The benchmark is not instrumented manually.
• Paraver traces are generated automatically:
  • LD_PRELOAD is exported, injecting the tracing library into the benchmark at load time.
• The benchmark is executed with 2, 4, 8, 16, 32, and 64 processors.
• The benchmark results are analyzed.
• The generated traces are examined in the Paraver tools.
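
The LD_PRELOAD trick works because the preloaded library's symbols shadow the real MPI ones, so every MPI call can be recorded before being forwarded through the PMPI layer. A toy interposer in that style (a sketch of the mechanism only; the actual traces come from the Paraver tool chain's tracing library, and the file names below are illustrative):

    /* trace.c - toy MPI interposer, injected with LD_PRELOAD.
     * Build (illustrative): mpicc -shared -fPIC trace.c -o libtrace.so
     * Run   (illustrative): LD_PRELOAD=./libtrace.so mpirun -np 4 ./is.S.4
     */
    #include <mpi.h>
    #include <stdio.h>

    int MPI_Send(const void *buf, int count, MPI_Datatype type,
                 int dest, int tag, MPI_Comm comm) {
        int rank;
        PMPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* query rank via the PMPI layer */
        fprintf(stderr, "[rank %d] MPI_Send: %d element(s) to rank %d\n",
                rank, count, dest);             /* record the event */
        return PMPI_Send(buf, count, type, dest, tag, comm);  /* forward the real call */
    }

A real tracer wraps every MPI entry point this way and time-stamps each event, producing the records that Paraver later displays.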
24. Benchmarking Time - reminder
• IS Benchmarking Procedure (generally)
1. Generate a sequence of N keys.
2. Load the N keys into the memory system.
3. Timing begins.
4. Loop: sorting & partial verification.
5. Timing ends.
6. Full verification.
• Benchmarking Time = execution time of the parallel algorithm (only steps 3-5 are timed)
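
The speedup plotted on the next two slides is the usual ratio of benchmarking times (assuming T(1) denotes the single-processor run):

    S(p) = \frac{T(1)}{T(p)}

A value S(p) < 1 means the run on p processors is actually slower than the serial run.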
27. Benchmarking Time
• SpeedUp of My Computer
[Figure: speedup vs. number of processors (1, 2, 4, 8, 16, 32, 64) on MyComputer; y-axis "SpeedUp" from 0 to 1.2]
28. Benchmarking Time
• SpeedUp of Boada
[Figure: speedup vs. number of processors (1, 2, 4, 8, 16, 32, 64) on Boada; y-axis "SpeedUp" from 0 to 7]
30. Conclusion
• The IS application
  • ... does not involve much communication.
  • ... is dominated by computation and memory loading.
  • ... shows low cache-miss and high CPI values in the computation phase.
• NAS is designed for highly parallel supercomputers.
• MyComputer is inadequate to meet the requirements of NAS.
• MyComputer cannot achieve any speedup in this application.
• Boada speeds up until the process count reaches the number of processors it has.
• MyComputer saves less time on disk I/O operations.
• The CPI values in Boada's computation phase are lower.