Fine-tuned the cache hierarchy of an Alpha microprocessor for three individual benchmarks (GCC, Anagram and Go) by modifying cache design parameters: number of cache levels, cache type (separate or unified, where more than one level is used), size, associativity, block size and block replacement policy. Compared the performance of each benchmark across configurations using CPI and a cost function.
Cache Design for an Alpha Microprocessor
University of Texas at Dallas
Department of Electrical Engineering
EEDG 6304 – Computer Architecture
Project #1
“CACHE DESIGN”
Team Members
Bharat Biyani (2021152193)
Shree Viswa Shamanthan L D (2021180127)
TABLE OF CONTENTS
Sr. No. Description
1 INTRODUCTION
2 SIMULATION APPROACH
3 CPI FORMULAE
4 PART 2: FIND CPI
5 PART 3: OPTIMIZE CPI FOR EACH BENCHMARK
6 PART 4: DEFINE COST FUNCTION
7 PART 5: OPTIMIZE CACHE FOR PERFORMANCE/COST
8 CONCLUSION
9 APPENDIX
1. INTRODUCTION
Caches are used by the central processing unit (CPU) of a computer to reduce the
average time to access memory. A cache is a smaller, faster memory that stores copies of
the data from frequently used main memory locations. Caches are therefore an integral part of
microprocessor design. A cache design is typically demonstrated to industry and customers through
benchmark results. The main aim of this project is to analyze the cache performance of an Alpha
microprocessor for three individual benchmarks (GCC, Anagram and Go) under the following
design constraints.
The cache design parameters that can be tuned in our example are
Cache Levels: One or two levels, for data and instruction caches
Unified caches: Selection of separate vs. unified instruction/data caches
Size: Cache size is the most important factor to avoid capacity misses.
Block size: Block size of the cache, usually 64 or 32 bytes.
Block replacement policy: Selection between FIFO, LRU and Random.
Associativity: Selection of cache associativity (e.g. direct mapped (1-way set
associative), 2-way set associative, etc.)
While larger caches generally mean better performance, they also come at a greater cost.
Sensible design choices and trade-offs are therefore required, and we use a cost
function to identify the optimal configuration.
2. SIMULATION APPROACH
Three benchmarks (GCC, Anagram and Go) were selected and run on the SimpleScalar
simulator (simplesim-3.0). Several cache parameters were varied in turn. The
following steps were carried out in order to determine the optimal configuration:
1. Select a particular benchmark.
2. Select one cache structure combination from these 3 cases for each benchmark
L1 Separate - L2 Separate
L1 Separate - L2 Unified
L1 Unified - L2 Unified
Where L1 and L2 are Levels of Cache.
3. For each cache structural combination, vary the block replacement policy iteratively
(LRU, FIFO and Random).
4. For each cache structural combination and replacement policy, vary the block size
iteratively (32 bytes, 64 bytes).
5. For each cache structural combination, replacement policy and block size, vary the
associativity iteratively (1, 2, 4, 8).
6. Calculate the CPI for each setting (shown below)
7. Repeat steps 1 to 6 for all possible combinations of configurations.
8. Calculate the cost for each cache configuration using the cost function (defined in
Part 4).
9. Compare the results and plot them. The optimal cache configuration for each
benchmark, as well as for all benchmarks combined, is selected with the help of the defined
cost function.
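The sweep described in the steps above can be sketched in Python as follows. This only illustrates the enumeration and is not the script used in the project; the names `STRUCTURES`, `POLICIES` and `sweep` are hypothetical, and the single-letter policy codes mirror the `l`/`f`/`r` suffixes seen in the configuration labels of the plots. It enumerates the settings of one cache; in the experiments the L1 and L2 caches are each varied.

```python
from itertools import product

# Parameter values listed in the steps above.
STRUCTURES = ["L1sep-L2sep", "L1sep-L2unified", "L1unified-L2unified"]
POLICIES = ["l", "f", "r"]       # LRU, FIFO, Random
BLOCK_SIZES = [32, 64]           # bytes
ASSOCIATIVITIES = [1, 2, 4, 8]   # ways


def sweep():
    """Return every (structure, policy, block size, associativity) setting."""
    return list(product(STRUCTURES, POLICIES, BLOCK_SIZES, ASSOCIATIVITIES))


settings = sweep()
print(len(settings))  # 3 * 3 * 2 * 4 = 72
```

Each tuple would then be turned into a simulator run and a CPI value.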
3. CPI FORMULAE
3.1. L1 separate and L2 separate
CPI = CPI ideal + 6 * (L1InsMissRate * %L1Ins + L1DataMissRate * %L1Data) + 50 * (L2InsMissRate * %L2Ins + L2DataMissRate * %L2Data)
3.2. L1 separate and L2 unified
CPI = CPI ideal + 6 * (L1InsMissRate * %L1Ins + L1DataMissRate * %L1Data) + 50 * (L2MissRate * %L2Data)
3.3. L1 unified and L2 unified
CPI = CPI ideal + 6 * (L1MissRate * %L1Data) + 50 * (L2MissRate * %L2Data)
Where,
%LxIns = Instruction Accesses for Lx / Total Memory Accesses
%LxData = Data Accesses for Lx / Total Memory Accesses
For a unified cache, %LxData counts all accesses to that cache. The factors 6 and 50 act as the penalties, in cycles, applied to L1 and L2 misses.
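As a rough sketch, the three formulae collapse into one Python helper if each cache is passed as a (miss rate, accesses) pair: separate instruction and data caches contribute two pairs per level, a unified cache contributes one. The function name `cpi` and its parameters are illustrative, and the default `cpi_ideal=1.0` is the value used in the Part 2 calculations.

```python
L1_FACTOR = 6    # multiplier on the L1 miss term in the formulae above
L2_FACTOR = 50   # multiplier on the L2 miss term in the formulae above


def cpi(l1_caches, l2_caches, total_accesses, cpi_ideal=1.0):
    """Evaluate the CPI formula.

    l1_caches, l2_caches: lists of (miss_rate, accesses) pairs, one pair per
    cache at that level (two for separate I/D caches, one for a unified cache).
    """
    l1_term = sum(rate * acc / total_accesses for rate, acc in l1_caches)
    l2_term = sum(rate * acc / total_accesses for rate, acc in l2_caches)
    return cpi_ideal + L1_FACTOR * l1_term + L2_FACTOR * l2_term


# Toy numbers, for illustration only (not simulation results):
# a unified L1 missing on 1% of accesses, a unified L2 seeing 10% of the
# accesses and missing on 10% of them.
example = cpi([(0.01, 1000)], [(0.10, 100)], total_accesses=1000)
print(round(example, 2))  # 1 + 6 * 0.01 + 50 * 0.01 = 1.56
```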
4. PART 2: FIND CPI
In this part, the CPI for the 3 individual benchmarks was calculated. Our baseline configuration
will be the Alpha 21264 EV6 configuration:
Cache Levels: Two Levels.
Unified caches: Separate L1 data and Instruction cache, unified L2 cache.
Size: 64KB Separate L1 data and instruction caches, 1MB unified L2 cache.
Block size: 64 bytes
Block replacement policy: FIFO
Associativity: 2- way set associative L1 cache, Direct-mapped L2 cache.
-------------------------------------------------------------------------------------------------------------------------------
4.1. GCC Benchmark
Total Memory Accesses = 337327101
Number of L1 Instruction cache accesses = 337327101
Number of L1Data cache access = 124102799
Number of L2 access = 3330118
L1 Ins miss rate = 0.0047, L1 Data miss rate = 0.0106, L2 miss rate = 0.1311
CPI = CPI ideal + 6 * (L1InsMissRate * %L1Ins + L1DataMissRate * %L1Data) + 50 * (L2MissRate * %L2Data)
CPI = 1 + 6 * (0.0047 * (337327101/337327101) + 0.0106 * (124102799/337327101)) + 50 * (0.1311 * (3330118/337327101))
CPI = 1.11631
----------------------------------------------------------------------------------------------------------------------------
4.2. Anagram Benchmark
Total Memory Accesses = 25724898
Number of L1 Instruction cache accesses = 25724898
Number of L1Data cache access = 11182060
Number of L2 access = 92401
L1 Ins miss rate = 0, L1 Data miss rate = 0.0048, L2 miss rate = 0.3191
CPI = CPI ideal + 6 * (L1InsMissRate * %L1Ins + L1DataMissRate * %L1Data) + 50 * (L2MissRate * %L2Data)
CPI = 1 + 6 * (0 * (25724898/25724898) + 0.0048 * (11182060/25724898)) + 50 * (0.3191 * (92401/25724898))
CPI = 1.06983
----------------------------------------------------------------------------------------------------------------------------
4.3. GO Benchmark
Total Memory Accesses = 545823664
Number of L1 Instruction cache accesses = 545823664
Number of L1Data cache access = 213791111
Number of L2 access = 1021478
L1 Ins miss rate = 0.0013, L1 Data miss rate = 0.0010, L2 miss rate = 0.0907
CPI = CPI ideal + 6 * (L1InsMissRate * %L1Ins + L1DataMissRate * %L1Data) + 50 * (L2MissRate * %L2Data)
CPI = 1 + 6 * (0.0013 * (545823664/545823664) + 0.0010 * (213791111/545823664)) + 50 * (0.0907 * (1021478/545823664))
CPI = 1.01864
CONCLUSION FOR PART 2:
Sr. No. Benchmark CPI
1 GCC 1.11631
2 Anagram 1.06983
3 GO 1.01864
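To see where the stall cycles come from, the Part 2 arithmetic can be split into its L1 and L2 terms. The sketch below hard-codes the access counts and miss rates reported above; the dictionary layout and the name `breakdown` are assumptions made for the example.

```python
# (total accesses, L1D accesses, L2 accesses, L1I miss, L1D miss, L2 miss)
BASELINE = {
    "GCC":     (337327101, 124102799, 3330118, 0.0047, 0.0106, 0.1311),
    "Anagram": (25724898, 11182060, 92401, 0.0, 0.0048, 0.3191),
    "GO":      (545823664, 213791111, 1021478, 0.0013, 0.0010, 0.0907),
}


def breakdown(total, l1d_acc, l2_acc, l1i_miss, l1d_miss, l2_miss):
    """Return (L1 term, L2 term, CPI) for L1 separate / L2 unified."""
    # In the reported counts the L1 instruction accesses equal the total,
    # so %L1Ins is 1.
    l1_term = 6 * (l1i_miss * 1.0 + l1d_miss * l1d_acc / total)
    l2_term = 50 * (l2_miss * l2_acc / total)
    return l1_term, l2_term, 1.0 + l1_term + l2_term


for name, stats in BASELINE.items():
    l1_term, l2_term, cpi_value = breakdown(*stats)
    print(f"{name:8s} L1 term {l1_term:.5f}  L2 term {l2_term:.5f}  CPI {cpi_value:.5f}")
```

For the reported GCC numbers, the L2 term (about 0.065) is larger than the L1 term (about 0.052).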
5. PART 3: OPTIMIZE CPI FOR EACH BENCHMARK
Given a two-level cache hierarchy with 128KB available for the L1 cache and 1MB available for the L2 cache,
the goal is to find the optimal configuration (in terms of achieved CPI) for each benchmark.
Decisions must be made between unified and separate caches, associativity, replacement policy, etc.
5.1. Assumptions
1. Both L1 and L2 use the same block size (e.g. if the L1 cache uses 32-byte blocks, the L2 cache also uses
32-byte blocks). We have considered only 32-byte and 64-byte block sizes for both L1 and L2 in order
to find the optimal configuration. A larger block size would decrease the number
of lines in the cache and increase the miss penalty.
2. L1 block size cannot be larger than L2 block size.
3. Associativity takes the values 1, 2, 4 and 8. The design does not consider associativity
higher than 8 because it costs much more for very little performance
improvement in practice. A direct-mapped design (1-way associative) is also taken into
consideration even though it gives poor performance, because it helps in analyzing the
design.
4. While optimizing, only three replacement policies are considered: Random, FIFO (First
In First Out) and LRU (Least Recently Used).
The graphs below show the CPI plotted against the various configurations for L1 separate-L2
separate, L1 separate-L2 unified, and L1 unified-L2 unified for all three benchmarks.
5.2. GCC Benchmark
[Figure: GCC benchmark. CPI (axis range 0.95 to 1.3) plotted against L1--L2 cache configurations, with one series each for L1L2unified, L1sepL2unified and L1sepL2sep. Configurations are labelled in the form 2048:32:2:f--16384:32:2:f.]
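The labels on the configuration axis can be decoded with a few lines of Python. The sketch assumes each half of a label reads sets:block size:associativity:replacement policy, which matches the SimpleScalar cache option syntax without the leading cache name; the report does not spell this out, so the field meanings are an assumption. `parse_cache` and `cache_bytes` are hypothetical helper names.

```python
POLICY_NAMES = {"l": "LRU", "f": "FIFO", "r": "Random"}


def parse_cache(spec):
    """Split 'sets:block:assoc:policy' into a dictionary of fields."""
    sets, block, assoc, policy = spec.split(":")
    return {
        "sets": int(sets),
        "block_bytes": int(block),
        "assoc": int(assoc),
        "policy": POLICY_NAMES[policy],
    }


def cache_bytes(spec):
    """Capacity = number of sets * block size * associativity."""
    c = parse_cache(spec)
    return c["sets"] * c["block_bytes"] * c["assoc"]


l1_spec, l2_spec = "2048:32:2:f--16384:32:2:f".split("--")
print(cache_bytes(l1_spec) // 1024, "KB L1")   # 128 KB L1
print(cache_bytes(l2_spec) // 1024, "KB L2")   # 1024 KB L2
```

Under this reading, the label describes a 128KB L1 and a 1MB L2, which agrees with the sizes stated for Part 3.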
CONCLUSION FOR PART 3:
The graphs plotted above help us decide between the design choices. The optimum CPI in each case is as
follows:
Sr. No.  Benchmark  Configuration     Block Size (bytes)  Replacement Policy  Associativity  Optimal CPI
                                      L1     L2           L1       L2         L1     L2
1        GCC        L1L2unified *     64     64           LRU      LRU        8      8       1.05296841
                    L1sepL2unified    64     64           LRU      LRU        8      8       1.059915236
                    L1sepL2sep        64     64           LRU      LRU        8      1       1.061853108
2        Anagram    L1L2unified       64     64           LRU      FIFO       4      1       1.07084492
                    L1sepL2unified *  64     64           LRU      FIFO       4      1       1.069724523
                    L1sepL2sep        64     64           FIFO     FIFO       4      1       1.093832533
3        Go         L1L2unified *     64     64           Random   LRU        8      4       1.011049037
                    L1sepL2unified    64     64           Random   FIFO       8      2       1.012941826
                    L1sepL2sep        64     64           LRU      Random     4      8       1.013631953
The row marked with * is the optimal CPI configuration for the corresponding benchmark.
Hence, the best configuration for each benchmark from the above table is shown below:
-------------------------------------------------------------------------------------------------------------------------------
GO Benchmark
L1 unified 64 bytes block size, 8 way set associative and Random replacement Policy
L2 unified 64 bytes block size, 4 way set associative and LRU replacement Policy
CPI Optimum = 1.01105
-------------------------------------------------------------------------------------------------------------------------------
ANAGRAM Benchmark
L1 separate 64 bytes block size, 4-way set associative and LRU replacement policy
L2 unified 64 bytes block size, 1-way set associative and FIFO replacement policy
CPI Optimum = 1.06972
-------------------------------------------------------------------------------------------------------------------------------
GCC Benchmark:
L1 unified 64 bytes block size, 8-way set associative and LRU replacement Policy
L2 unified 64 bytes block size, 8-way set associative and LRU replacement Policy.
CPI Optimum = 1.05297
-------------------------------------------------------------------------------------------------------------------------------
6. PART 4: DEFINE COST FUNCTION
The cost function plays a major role in choosing among the parameters that give a low CPI. We therefore
define a cost function that helps the architect design a cache that is both cost-efficient and
performance-effective. It keeps us from simply choosing whichever design gives the least CPI.
The cost function is defined as follows.
Cost Function = 0.35 * (L1 cache size) + 0.25 * (L2 cache size) + 0.075 * (L1 associativity) + 0.075 * (L2 associativity) + 0.15 * (Unified/Separate) + 0.05 * (L1 policy) + 0.05 * (L2 policy) + 0 * (block size)
Each term is explained in the table below.
We normalized the cost function so that the total cost does not exceed 100 units. For
example, if a configuration costs 85 units and its L1 cache size doubles, the cost rises by
0.35 * 100 = 35 units to 120 units. We therefore consider the assumed cost function reasonable.
Each entry lists the cost function term, its weight, its overall weight (in units) and a comment.
L1 Cache Size (weight 0.35)
  Overall weight: 64KB: 50 units; 128KB: 100 units
  Comment: If cache size doubles, cost also doubles.
L2 Cache Size (weight 0.25)
  Overall weight: 512KB: 100 units; 1MB: 200 units
  Comment: If cache size doubles, cost also doubles.
L1 Associativity (weight 0.075)
  Overall weight: 1: 2 units; 2: 4 units; 4: 8 units; 8: 16 units
  Comment: Increasing associativity increases the number of comparators in hardware, which in turn increases cost.
L2 Associativity (weight 0.075)
  Overall weight: 1: 2 units; 2: 4 units; 4: 8 units; 8: 16 units
  Comment: Increasing associativity increases the number of comparators in hardware, which in turn increases cost.
Unified/Separate (weight 0.15)
  Overall weight: L1 Separate - L2 Separate: 1.5 units; L1 Separate - L2 Unified: 0.5 units; L1 Unified - L2 Unified: 0 units
  Comment: Separate data and instruction caches involve some additional hardware cost compared to unified caches.
L1 Policy (weight 0.05)
  Overall weight: LRU: 0.1 units; FIFO: 0.05 units; Random: 0 units
  Comment: For Random, no extra hardware is required. From FIFO to LRU, cost should increase by 5%.
L2 Policy (weight 0.05)
  Overall weight: LRU: 0.1 units; FIFO: 0.05 units; Random: 0 units
  Comment: For Random, no extra hardware is required. From FIFO to LRU, cost should increase by 5%.
Block Size (weight 0)
  Overall weight: 32 Bytes: 0 units; 64 Bytes: 0 units
  Comment: With respect to hardware, there is no additional cost for a change in block size.
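A minimal Python sketch of the cost function follows, assuming that each term is the weight multiplied by the units listed for the chosen option. The lookup tables copy the values given above; the function name `cost` and its argument names are illustrative.

```python
L1_SIZE_UNITS = {"64KB": 50, "128KB": 100}
L2_SIZE_UNITS = {"512KB": 100, "1MB": 200}
ASSOC_UNITS = {1: 2, 2: 4, 4: 8, 8: 16}
STRUCTURE_UNITS = {"L1sepL2sep": 1.5, "L1sepL2unified": 0.5, "L1L2unified": 0.0}
POLICY_UNITS = {"LRU": 0.1, "FIFO": 0.05, "Random": 0.0}


def cost(l1_size, l2_size, l1_assoc, l2_assoc, structure, l1_policy, l2_policy):
    """Weighted sum of the per-parameter units (block size has weight 0)."""
    return (
        0.35 * L1_SIZE_UNITS[l1_size]
        + 0.25 * L2_SIZE_UNITS[l2_size]
        + 0.075 * ASSOC_UNITS[l1_assoc]
        + 0.075 * ASSOC_UNITS[l2_assoc]
        + 0.15 * STRUCTURE_UNITS[structure]
        + 0.05 * POLICY_UNITS[l1_policy]
        + 0.05 * POLICY_UNITS[l2_policy]
    )


# 128KB L1 and 1MB L2, both 2-way, L1 separate / L2 unified, LRU and FIFO.
example_cost = cost("128KB", "1MB", 2, 2, "L1sepL2unified", "LRU", "FIFO")
print(round(example_cost, 4))  # 35 + 50 + 0.3 + 0.3 + 0.075 + 0.005 + 0.0025 = 85.6825
```

The two size terms dominate the total; associativity, structure and policy only adjust it slightly.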
7. PART 5: OPTIMIZE CACHE FOR PERFORMANCE/ COST
To find the optimal configuration, cache parameters such as associativity, replacement
policy and cache type are considered along with the cost. For each benchmark, the product of
CPI and cost is plotted against the cache configuration. The configuration that gives
the lowest product in the graph is considered the optimal configuration.
7.1. GCC Benchmark
[Figure: GCC benchmark. CPI * Cost (axis range 80 to 110) plotted against L1--L2 cache configurations, with one series each for L1L2unified, L1sepL2unified and L1sepL2sep. Configurations are labelled in the form 2048:32:2:f--16384:32:1:f.]
Hence, from the graphs above, the optimum configuration for each of the three
benchmarks is as follows:
Sr. No.  Benchmark  Configuration     Block Size (bytes)  Replacement Policy  Associativity  CPI * Cost
                                      L1     L2           L1       L2         L1     L2
1        GCC        L1L2unified *     64     64           LRU      LRU        4      2       91.53571716
                    L1sepL2unified    64     64           LRU      LRU        4      2       92.14588838
                    L1sepL2sep        64     64           LRU      LRU        4      4       92.65215522
2        Anagram    L1L2unified       64     64           LRU      Random     2      1       91.51083284
                    L1sepL2unified *  64     64           LRU      Random     2      1       91.49413323
                    L1sepL2sep        64     64           LRU      Random     4      2       93.74364974
3        Go         L1L2unified       64     64           LRU      FIFO       4      2       86.86445493
                    L1sepL2unified *  64     64           LRU      FIFO       2      2       86.83958025
                    L1sepL2sep        64     64           LRU      Random     2      2       86.98242473
The row marked with * has the lowest CPI * Cost for the corresponding benchmark. Hence, the
optimum configuration considering all three benchmarks together, with the different specification
constraints, cost and CPI, would be:
-----------------------------------------------------------------------------------------------------------------------------------
Benchmark: GO
Cache Levels: L1 and L2
Configuration: L1 separate and L2 unified
Block size: 64 Bytes for L1 and 64 Bytes for L2
Block replacement policy: LRU for L1 and FIFO for L2
Associativity: 2-way associative for L1 and 2-way associative for L2
------------------------------------------------------------------------------------------------------------------------------------
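The selection can be checked by taking the minimum CPI * Cost product per benchmark and then across benchmarks. The values below are copied from the Part 5 table; the variable names are illustrative.

```python
# CPI * Cost products reported in the Part 5 table.
PRODUCTS = {
    "GCC": {
        "L1L2unified": 91.53571716,
        "L1sepL2unified": 92.14588838,
        "L1sepL2sep": 92.65215522,
    },
    "Anagram": {
        "L1L2unified": 91.51083284,
        "L1sepL2unified": 91.49413323,
        "L1sepL2sep": 93.74364974,
    },
    "Go": {
        "L1L2unified": 86.86445493,
        "L1sepL2unified": 86.83958025,
        "L1sepL2sep": 86.98242473,
    },
}

# Lowest product per benchmark: (configuration name, product).
best = {
    bench: min(rows.items(), key=lambda item: item[1])
    for bench, rows in PRODUCTS.items()
}
for bench, (config, value) in best.items():
    print(f"{bench:8s} {config:15s} {value:.5f}")

# Lowest product across all benchmarks.
overall_bench, (overall_config, overall_value) = min(
    best.items(), key=lambda item: item[1][1]
)
print("Overall:", overall_bench, overall_config)
```

The overall minimum falls on the Go benchmark with L1 separate and L2 unified.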
8. CONCLUSION
Thus, the best possible configuration and its cost have been computed by
varying parameters such as cache type, associativity and replacement policy.
8.1 Optimal CPI
CPI is lowest mostly for LRU, high associativity and large block size, for the
following reasons.
High Associativity: This helps to reduce conflict misses.
Large Block Size: It takes advantage of spatial locality.
LRU: It gave the best results among the replacement policies considered, as it tracks how recently
each block was used and evicts the least recently used block.
8.2 Optimal CPI v/s Cost
We observed that high-cost cache architectures achieve the best CPI values,
but the percentage improvement in CPI is small relative to the percentage increase in cost.
The definition of the cost function is crucial: its values should not be so
large that CPI no longer affects the optimization, and not so small that cost
does not affect the optimization. A configuration with FIFO as the replacement
policy, 2-way or 4-way associativity, and L1 separate with L2 unified usually gives
the best performance/cost design, because neither cost nor CPI takes a
very high or very low value with such a configuration, so their product (CPI * Cost) tends to be minimal.