Unit 7 & 8 
Performance Analysis and Optimization 
By 
Leena Chandrashekar, 
Assistant Professor, ECE Dept, 
RNSIT, Bangalore 
Performance or Efficiency Measures 
• Means time, space, power, cost 
• Depends on input data, hardware platform, 
compiler, compiler options. 
• Measure based on complexity, time and 
power, memory, cost and weight. 
• Development time, ease of maintenance, and 
extensibility. 
The System 
• Hardware 
oComputational and control elements 
oCommunication system 
oMemory 
• Software 
oAlgorithms and data Structures 
oControl and Scheduling 
Some Limitations 
• Amdahl’s Law 
Example 
Consider a system with the following characteristics: The task 
to be analyzed and improved currently executes in 100 time 
units, and the goal is to reduce execution time in 80 time 
units. The algorithm under consideration in the task uses 40 
time units. 
n = 2; if the algorithm's execution time is reduced by 20 time units (a factor-of-2 speedup of that portion), 
the required result is met. This indicates the necessary speedup. 
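A worked version of the arithmetic implied above, assuming the usual Amdahl form in which only the 40-unit algorithm is sped up by a factor n: 
new time = (100 - 40) + 40/n = 60 + 40/n 
Setting 60 + 40/n = 80 gives 40/n = 20, so n = 2. 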
• Example 
Consider a system with the following characteristics: The task 
to be analyzed and improved currently executes in 100 time 
units, and the goal is to reduce execution time to 50 time units. 
The algorithm to be improved uses 40 time units. 
Simplifying gives n = -4: the algorithm would have to run in 
negative time to meet the new specification. This would require 
a non-causal system. 
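The same arithmetic shows why this goal cannot be met: 
60 + 40/n = 50 gives 40/n = -10, so n = -4. 
Even if the algorithm took zero time, the task would still need 60 time units, which already exceeds the 50-unit goal. 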
Complexity Analysis – A High-Level Measure 
Instructions Operations 
int total (int myArray[], int n) --- 2 
{ 
int sum=0; ---1 
int i =0; ---1 
for (i=0;i<n;i++) ---2*n +1 
{ 
sum= sum + myArray[i]; --- 3*n 
} 
return sum; ---1 
} 
Total = 5n+6 operations 
• 5n+6; for given n, the no. of operations are 
 n=10 ; 56 
 n=100; 506 
 n=1000; 5006 
 n= 10,000; 50,006 
The count grows in linear proportion to n, and the relative 
contribution of the constant term keeps shrinking. 
The Methodology 
1. Decompose the problem into a set of operations 
2. Count the total number of such operations 
3. Derive a formula, based on some parameter n that 
is size of the problem 
4. Use order of magnitudes estimation to assess 
behavior 
Most Important Slide 
A Simple experiment 
• Linear 
• Quadratic 
• Logarithmic 
• Exponential 
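A minimal C sketch of such an experiment, counting the basic operations performed by one representative routine from each growth class; the routines and the counting are illustrative only, not taken from the slides. 

#include <stdio.h>

/* O(n): touch each element once */
long linear_ops(int n) { long c = 0; for (int i = 0; i < n; i++) c++; return c; }

/* O(n^2): nested scan over all pairs */
long quad_ops(int n) { long c = 0; for (int i = 0; i < n; i++) for (int j = 0; j < n; j++) c++; return c; }

/* O(log n): halve the range until it is exhausted */
long log_ops(int n) { long c = 0; while (n > 1) { n /= 2; c++; } return c; }

/* O(2^n): enumerate every subset of n items (keep n small) */
long exp_ops(int n) { long c = 0; for (long m = 0; m < (1L << n); m++) c++; return c; }

int main(void)
{
    for (int n = 2; n <= 16; n *= 2)
        printf("n=%2d linear=%ld quadratic=%ld log=%ld exponential=%ld\n",
               n, linear_ops(n), quad_ops(n), log_ops(n), exp_ops(n));
    return 0;
}

Running it for doubling values of n makes the four growth rates visible directly in the printed counts. 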
Asymptotic Complexity 
• F(n)=5n+6 
• The behavior of f(n) as n grows is referred to as its 
asymptotic complexity 
• This is only an approximation, since many other factors 
need to be considered, such as operations requiring 
varying amounts of time 
• As n increases, concentrate on the highest-order 
term and drop the lower-order terms, such as the 
constant 6 
Comparing Algorithms 
Based on 
• Worst case performance (upper bound) 
• Average case 
• Best case performance (lower bound) 
f(N) = O(g(N)) – complexity function – Big-O notation 
 The complexity of an algorithm approaches a bounding function; the algorithm is said 
to be of the order of that bound. 
 If such a function is expressed as a function of the problem size N, 
and that function is called g(N), then the comparison can be written as 
f(N) = O(g(N)). 
 If there is a constant c such that f(N) ≤ c·g(N) for all N beyond some N0, then f(N) is of the order of 
g(N). 
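Applying this definition to the earlier count f(N) = 5N + 6 (a worked step added for clarity): 
5N + 6 ≤ 5N + N = 6N for all N ≥ 6 
so with c = 6 and g(N) = N, f(N) = O(N). 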
Big-O Arithmetic 
Analyzing Code 
• Constant Time Statements 
 int x,y; Declarations & Initializations 
 char myChar=‘a’; 
 x=y; Assignment 
 x=5*y+4*z; Arithmetic 
 A[j] Array Referencing 
 if(x<12) Conditional tests 
 Cursor = Head -> Next; Referencing/dereferencing pointers. 
Looping Constructs 
• For Loops, While Loops 
• Determine number of iterations and number of steps 
per iteration. 
int sum=0; 1 
for (int j=0;j<N;j++) 3*N 
sum=sum+j; 1*N 
Each iteration of the loop takes 4 steps = O(1) steps per iteration. 
Total time is N·O(1) = O(N·1) = O(N); the cost per iteration is 
constant, so the loop as a whole is linear in N. 
While Loop 
bool done=false; 
int result=1; 
int n=N; // assume n starts at the problem size N 
while (!done) 
{ 
result=result*n; ----1(multiply)+1(assignment) 
n--; -----1(decrement) 
if (n<=1) 
done=true; 
} 
Total time is N·O(1)=O(N) 
Sequences of Statements 
int j,k,i,sum=0; 
for (j=0;j<N;j++) 
for (k=0;k<j;k++) 
sum=sum+k*j; 
for (i=0;i<N;i++) 
sum=sum+i; 
The complexity is the sum of the two parts: 
Total time is N²+N=O(N²) 
Conditional Statements 
if (condition) 
statement1; ----- O(n²) 
else 
statement2; ----- O(n) 
Consider the worst-case complexity/maximum running time, here O(n²). 
Function Calls 
• Cost = making the call + passing the arguments + executing 
the function body + returning a value. 
• Making and returning from call – O(1) 
• Passing arguments – depends on how it is passed – 
passed by value/reference 
• Cost of execution – body of function 
• Determining cost of return – values returned 
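A small C sketch of the argument-passing point; the structure type and sizes are hypothetical, but they show why passing a large object by value costs more per call than passing a pointer to it. 

typedef struct { int samples[1024]; } Buffer; /* hypothetical large argument (about 4 KB with 4-byte ints) */

/* Pass by value: the whole structure is copied on every call. */
int sum_by_value(Buffer b)
{
    int s = 0;
    for (int i = 0; i < 1024; i++) s += b.samples[i];
    return s;
}

/* Pass by reference (pointer): only an address is copied. */
int sum_by_pointer(const Buffer *b)
{
    int s = 0;
    for (int i = 0; i < 1024; i++) s += b->samples[i];
    return s;
}

Both bodies do the same O(N) work; the difference lies in the per-call overhead of copying the argument. 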
Analyzing Algorithms 
• Complexity functions for common algorithms 
• Analyzing search algorithms 
Linear Search – O(N) 
Binary Search – O(log₂N) (see the sketch below) 
• Analyzing sort algorithms 
 Selection Sort – O(N²) 
 Quick Sort – O(N log₂N) 
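A standard C binary search, included to show where the O(log₂N) figure comes from: each iteration halves the remaining range (the array is assumed sorted). 

/* Returns the index of key in a[0..n-1], or -1 if absent; a must be sorted ascending. */
int binary_search(const int a[], int n, int key)
{
    int low = 0, high = n - 1;
    while (low <= high)
    {
        int mid = low + (high - low) / 2;   /* avoids overflow of (low + high) */
        if (a[mid] == key)
            return mid;
        else if (a[mid] < key)
            low = mid + 1;                  /* discard the lower half */
        else
            high = mid - 1;                 /* discard the upper half */
    }
    return -1;
}

Since the range shrinks by half each time, at most about log₂N iterations are needed, compared with N for the linear search. 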
Analyzing Data Structures 
• Insert/delete at the beginning 
• Insert/delete at the end 
• Insert/delete in the middle 
• Access at the beginning, the end, and in the middle. 
• Each operation has an associated complexity function, at worst O(N), that differs between the two structures. 
(Compared for two structures: Array vs. Linked List) 
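A brief C sketch contrasting insertion at the beginning for the two structures; the node type and helpers are illustrative, not taken from the slides. 

#include <stdlib.h>
#include <string.h>

typedef struct Node { int value; struct Node *next; } Node;

/* Linked list: insert at the head is O(1) – only one pointer is relinked. */
Node *insert_head(Node *head, int value)
{
    Node *n = malloc(sizeof *n);   /* allocation failure handling omitted for brevity */
    n->value = value;
    n->next = head;
    return n;
}

/* Array: insert at index 0 is O(N) – every existing element must shift right. */
void array_insert_front(int a[], int *count, int value)
{
    memmove(&a[1], &a[0], (size_t)(*count) * sizeof a[0]);
    a[0] = value;
    (*count)++;
}

The same kind of side-by-side comparison can be made for the other operations listed above. 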
Instructions in Detail 
• Addressing Mode 
• Flow of control – Sequential 
Branch 
Loop 
Function Call 
• Analyzing the flow of control – Assembly and C language 
• Example 
ld r0,#0AAh --- 400ns 
push r0 ---600ns 
add r0,r1 ----400ns 
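Using the figures given above, the straight-line cost of this fragment is simply the sum of the individual instruction times: 
total = 400 ns + 600 ns + 400 ns = 1400 ns 
Branches, loops, and calls are then analyzed by multiplying such sums by the number of times each path executes. 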
Co-routine 
• A co-routine is a special kind of procedure call in 
which there is a mutual call exchange between 
cooperating procedures – 2 procedures sharing time. 
• Similar to a procedure call, but operating within a shared time budget. 
• Procedures execute to completion, whereas co-routines 
can exit and re-enter at multiple points throughout their bodies. 
• The control procedure starts the process. Each 
context switch is triggered by any of the 
following – the control procedure, an external event (e.g. 
a timing signal), or an internal event (e.g. a data value). 
• The process continues until both procedures are 
completed. 
• The scheme carries a time burden; for faster response, 
preemption must be used. 
(Figure: a control procedure alternately dispatching Procedure2 and Procedure3) 
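A minimal C sketch of the idea, assuming a simple step-at-a-time style: each procedure does a small amount of work per call and returns, and a control procedure alternates between them until both finish. The names (phase1_step, phase2_step, control) are illustrative only. 

#include <stdbool.h>
#include <stdio.h>

/* Each co-routine keeps its own state and performs one step per call. */
static bool phase1_step(void)
{
    static int i = 0;
    if (i < 3) { printf("phase1 step %d\n", i++); return true; }
    return false;                      /* finished */
}

static bool phase2_step(void)
{
    static int i = 0;
    if (i < 3) { printf("phase2 step %d\n", i++); return true; }
    return false;                      /* finished */
}

/* Control procedure: switches context between the two until both complete. */
void control(void)
{
    bool p1 = true, p2 = true;
    while (p1 || p2)
    {
        if (p1) p1 = phase1_step();    /* context switch to procedure 1 */
        if (p2) p2 = phase2_step();    /* context switch to procedure 2 */
    }
}

As the slide notes, this sharing is cooperative; a step that runs too long delays the other procedure, which is why preemption is needed for faster response. 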
Interrupt call 
(Figure: a foreground task is interrupted; control passes through the interrupt handler to the ISR and back) 
Time Metrics 
• Response Time 
• Execution time 
• Throughput 
• Time loading – percentage of time that CPU is 
doing useful work. 
• Memory loading – percentage of usable memory in use 
by the application. 
Response Time 
• Time interval between the event and completion of 
associated action 
• Ex – A/D command and acquisition 
• Polled Loops – The response time consists of 3 
components 
Hardware delays in external device to set the 
signaling event 
 Time to test the flag 
Time needed to respond to and process the 
event associated with the flag. 
External Hardware Device Delay 
• Two Cases considered 
a) Case 1 - The response through external system to 
prior internal event 
b) Case 2- An asynchronous external event 
(Figure, Case 1: an internal event from the causal system passes through the external system, which introduces a delay, and the response returns to the responding system) 
• The time to get to the polling loop from the internal causal event 
• The delay through an external device 
• The time to generate the response 
• Flag time - Determined from the execution time of the 
machine's bit test instruction 
• Processing time – time to perform the task associated with 
triggering event 
Case 2 Asynchronous Event from External Device 
• The time of occurrence of the event cannot be predicted in advance. 
Co-routine 
• Interrupt Driven Environment 
• Preemptive Schedule 
• Non-preemptive Schedule 
Interrupt Driven Environment 
• Context switch to interrupt handler 
• To acknowledge the interrupt 
• Context switch to the processing routine 
• Context switch back to the original routine 
Preemptive Schedule 
• Context Switch 
• Task Execution 
• Interrupt latency – highest-priority and lowest-priority cases 
Case 1 Highest Priority – 3 Factors 
• The time from the leading edge of the interrupt in the 
external device until that edge is recognized inside the 
system. 
• The time to complete the current instruction if interrupts are 
enabled. Most processors complete the current instruction 
before switching context. Some permit an interrupt to be 
recognized at the micro instruction level. Thus the time is 
going to be bounded by the longest instruction. 
• The time to complete the current task if interrupts are 
disabled. This time will be bounded by the task size. 
Case 2 Low Priority Task 
• 2 Cases 
 First, the interrupt occurs and is processed. 
 Second, the interrupt occurs and its processing is itself interrupted (preempted). Unless 
interrupts are disabled, the situation is non-deterministic. 
In critical cases, one may have to change 
the priority or place limits on the number of 
preemptions. 
• Non-Preemptive Schedule 
 Since preemption is not allowed, times are computed as 
in highest priority case. 
Time Loading 
• Is percentage of time that the CPU is doing useful 
work – execution of tasks assigned to embedded 
system 
• The time loading is measured in terms of the execution 
times of the primary and secondary (supporting) tasks. 
• Time loading = primary / (primary + secondary) 
• To compute the time, 3 methods are used 
Instruction counting 
Simulation 
Physical measurement 
Instruction Counting 
• For periodic systems, the execution time consumed in each 
cycle is computed from the times of the individual 
modules and divided by the cycle period 
• For sporadic systems, the maximum task execution 
rates are used, and the percentages are combined 
over all of the tasks. 
• Effective instruction counting requires understanding 
of basic flow of control through a piece of software. 
Altering the flow involves context switch 
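A small worked example of the periodic case, using hypothetical numbers (not from the slides): suppose the cycle period is 10 ms and instruction counting shows that the code executed each cycle amounts to 30,000 instructions averaging 100 ns each. 
execution time per cycle = 30,000 x 100 ns = 3 ms 
time loading = 3 ms / 10 ms = 30% 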
Simulation 
• Requires a complete understanding of the system, an accurate 
workload, and an accurate model of the system 
• The model can include hardware, software, or both 
• Tools like Verilog or VHDL are used for hardware 
modeling 
• SystemC or a variety of software languages can be 
used for software modeling 
Model 
• 2 major categories of models: behavioral (conceptual) and 
structural (analytic) 
• Behavioral – symbols for qualitative aspects 
• Structural – mathematical or logical relations to represent the 
behavior 
 System-level model 
 Functional model 
 Physical model 
 Structural model 
 Behavioral model 
 Data model 
Timers 
• Timers can be associated with various buses 
or pieces of code in the system 
• Start the timer at the beginning of the code block and 
stop it at the end 
• For determining the timing of blocks 
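A minimal C sketch of the start/stop idea using the standard library clock() as a stand-in; on a real target the two reads would come from a free-running hardware timer instead. 

#include <stdio.h>
#include <time.h>

static void code_block_under_test(void)   /* placeholder for the code being measured */
{
    volatile long x = 0;
    for (long i = 0; i < 1000000; i++) x += i;
}

void time_block(void)
{
    clock_t start = clock();               /* start timer at the beginning of the block */
    code_block_under_test();
    clock_t stop = clock();                /* stop timer at the end of the block */

    double seconds = (double)(stop - start) / CLOCKS_PER_SEC;
    printf("block took %.6f s\n", seconds);
}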
Instrumentation 
• Numerous instruments – logic analyzer, code 
analyzer 
• Measure maximum and minimum times, time loops, identify 
non-executed code, capture execution rates and 
frequently used code 
• Limitation – the results depend on the inputs applied to the 
system and may not cover typical or boundary conditions 
• They are not predictive – they don't guarantee 
performance under all circumstances 
• Still, they provide significant information 
Memory Loading 
• Most devices come with large 
memory 
• But amount of memory may be 
reduced to save weight 
(aircraft/spacecraft) 
• Memory loading is defined as the 
percentage of usable memory being used 
by the application 
• Memory map – useful in 
understanding the allocation and 
use of available memory 
(Figure – A Memory Map: memory-mapped I/O and DMA, firmware, RAM, stack space, system memory) 
• The total memory loading MT will be the sum of the individual 
loadings for instructions, stack, and RAM 
• The values Mi reflect the memory loading for each 
portion of memory 
• Pi represents the percentage of total memory 
allocated to that portion (e.g. to the program) 
• MT is expressed as a percentage 
• Memory-mapped I/O and DMA are not included in 
the calculation; these are fixed by the hardware design 
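One plausible reading of this notation (the slide itself omits the formula, so the exact form is an assumption): weight each area's loading by its share of total memory and sum, 
MT = MP*PP + MR*PR + MS*PS 
where MP, MR, MS are the loadings of the program (firmware), RAM, and stack areas, and PP, PR, PS are the percentages of total memory allocated to each. 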
Example 
• Let the system be implemented as follows 
Mi=15Mb;MR=100Kb;MS=150Kb 
PT=55%;PR=33%;PS=10% 
Find value of MT 
Designing a Memory Map 
• Allocate minimum amount of memory necessary for 
the instructions and the stack 
• The firmware contains the program that implements 
the application 
• Memory loading is computed by dividing the number 
of user locations by the maximum allowable 
• RAM area – global variables, registers 
• RAM improves the instruction fetch speed 
• The size of the RAM area is decided at design time 
Stack Area 
• Stores context information and auto variables 
• Multiple stacks depending on design 
• Capacity – design time 
• Maximum stack size can be computed using 
• US=Smax*Tmax 
• Memory loading 
Evaluating Performance 
• Depends on information 
• Exact times if computable 
• Measurement technique 
Criterion               Analytic method   Simulation            Measurement 
Stage                   Any               Any                   Post-prototype 
Time required           Small             Medium                Varies 
Tools                   Analysis          Computer languages    Instrumentation 
Accuracy                Low               Moderate              Varies 
Trade-off evaluation    Easy              Moderate              Difficult 
Cost                    Small             Medium                High 
Scalability             Low               Medium                High 
Early Stages 
• The model should be hierarchical. Complex system 
can be modeled by decomposing it to simpler parts. 
Progressive refinement, abstraction, reuse of existing 
components 
• The model should express concurrent and temporal 
interdependencies among physical and modeled 
elements. Understand dynamic performance and 
interaction between other elements 
• The model should preferably be graphical, though this is not strictly necessary 
• It should permit worst-case and scenario analysis, including boundary 
conditions 
Mid Stages 
• Real components of design 
• Prototype modules and integrate them into 
subsystems 
Later Stages 
• Integrate into larger system 
Performance Optimization 
• What is being optimized ? 
• Why is it being optimized? 
• What is the effect on overall system? 
• Is the optimization appropriate for the operating context? 
Common Mistakes 
• Expecting an improvement in one aspect of the design 
to improve overall performance in proportion to that 
improvement 
• Using hardware-independent metrics to predict 
performance 
• Using peak performance 
• Comparing performance based on only a couple of metrics 
• Using synthetic benchmarks 
Tricks of the Trade 
Response times and time loading can be reduced in 
a number of ways 
1. Perform measurements and computations at rates 
commensurate with the rate of change of the data, and match the 
type of data, number of significant digits, and operations to what is actually needed 
2. Use look-up tables or combinational logic (a sketch follows this list) 
3. Modify certain operations to reduce time or space 
requirements 
4. Learn from compiler experts 
5. Loop management 
6. Flow-of-control optimization 
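A small C illustration of item 2, using a hypothetical population-count routine: the table trades a little memory for a much shorter, branch-free computation per call. 

#include <stdint.h>

/* 256-entry table: bit_count_table[b] = number of 1 bits in the byte b. Built once at start-up. */
static uint8_t bit_count_table[256];

void init_bit_count_table(void)
{
    for (int b = 0; b < 256; b++) {
        int count = 0;
        for (int v = b; v != 0; v >>= 1)
            count += v & 1;
        bit_count_table[b] = (uint8_t)count;
    }
}

/* Look-up version: four table reads instead of a 32-iteration bit loop. */
int popcount32(uint32_t x)
{
    return bit_count_table[x & 0xFF]
         + bit_count_table[(x >> 8) & 0xFF]
         + bit_count_table[(x >> 16) & 0xFF]
         + bit_count_table[(x >> 24) & 0xFF];
}

The same pattern applies to trigonometric tables, CRC tables, and other precomputed results. 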
Tricks of the Trade 
7. Use registers and caches 
8. Use only the values that are actually necessary 
9. Optimize the common path of frequently used 
code blocks 
10. Use page-mode accesses 
11. Know when to use recursion vs. iteration 
12. Use macros and inline functions 
Hardware Accelerators 
• One technique to improve the performance of software 
implementation is to move some functionality to hardware 
• Such a collection of components is called hardware 
accelerators 
• Often attached to CPU bus 
• Communication with CPU is accomplished by – shared 
variables, shared memory 
• An accelerator is distinguished from a coprocessor: 
• The accelerator does not execute instructions; its interface 
appears as I/O 
• It is designed to perform a specific operation and is generally 
implemented as an ASIC, FPGA, or CPLD 
Hardware Accelerators 
• Hardware accelerators are used when there are 
functions whose operations do not map onto the 
CPU 
• Examples – bit and bit field operations, differing 
precisions, high speed arithmetic, FFT calculations, 
high speed/demand input output operations, 
streaming applications 
Optimizing for Power Consumption 
• Safe mode, low power mode, sleep mode 
• The Advanced Configuration and Power Interface (ACPI) 
is an industry standard 
• Both software and hardware approaches can be used 
• Software considerations: 
The algorithms used 
Location of code 
Use of software to control various subsystems 
Techniques to measure power consumption 
• Identify the portion of the code to be analyzed 
• Measure the current/power consumed by the processor while 
that code is being executed 
• Modify the loop so that the code comprising it is 
disabled; ensure the compiler has not optimized the loop or 
section of code away 
• Measure the current/power consumed by the processor again; the 
difference approximates the consumption of the code under test 
Power consumption is also affected by: 
• The kinds of instructions executed 
• The collection or sequence of instructions executed 
• The locations of the instructions and their operands 
Relative Power Consumption for Common Processor Operations 

Operation                      Relative power consumption 
16-bit add                     1 
16-bit multiply                3.6 
8x128x16 SRAM read             4.4 
8x128x16 SRAM write            9 
I/O access                     10 
16-bit DRAM memory transfer    33 

Using a cache has a significant effect on system power consumption: SRAM consumes 
more power than DRAM on a per-cell basis, and caches are generally SRAM, so the size of 
the cache should be optimized. 
Other Techniques 
• Power aware compilers 
• Use of registers effectively 
• Look for Cache conflicts and eliminate if 
possible 
• Unroll loops 
• Eliminate recursive procedures 
Hardware Power Optimization Techniques 
Power Management Schemes 
• The best option is to turn the system off when not in use – power 
consumption is then limited to leakage, the lower bound of 
consumption (static power) 
• Upper bound – apply power to all parts of the system – 
maximum value (dynamic power) 
• The goal is to find a mid-range power consumption value, governed 
by the specifications 
• Example – a topographic mapping satellite 
• Approaches 
Decide which portions of the system to power down 
Decide which components have to shut down instantly 
Recognize which components do not power up instantly 
Basis for the system power-down/power-up sequence 
Predictive Shutdown 
• The approaches discussed on the previous slide are not possible 
everywhere. 
• Knowledge of the current status and previous state must be 
considered when deciding to shut the system down – predictive shutdown 
• A similar technique is used in the branch-prediction logic of an 
instruction prefetch pipeline 
• This can lead to premature shutdown or restart 
Timers 
• Another technique is to use timers 
• Timers monitor the system behavior and turn the device off when the timer 
expires 
• The device turns on again based on demand 
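A minimal C sketch of the timer policy just described; device_power_down(), device_power_up(), and now_seconds() are hypothetical hooks, not part of any particular API. 

#include <stdbool.h>

#define IDLE_TIMEOUT_S 30             /* example value: shut down after 30 s with no requests */

extern void device_power_down(void);  /* hypothetical hardware hooks */
extern void device_power_up(void);
extern double now_seconds(void);

static bool powered = true;
static double last_request_time;

void on_request(void)                 /* called whenever the device is needed */
{
    if (!powered) { device_power_up(); powered = true; }
    last_request_time = now_seconds();
}

void on_timer_tick(void)              /* called periodically by a system timer */
{
    if (powered && now_seconds() - last_request_time > IDLE_TIMEOUT_S) {
        device_power_down();          /* timer expired with no activity: turn the device off */
        powered = false;
    }
}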
Producer, service, consumer 
• Based on queuing theory 
• The producer is the part of the system which is to be powered on 
• The consumer is the part of the system which needs a service 
• A power manager monitors the behavior of the system and applies a 
schedule, based on Markov modeling, that maximizes the system's 
computational performance while satisfying the power budget 
Example 
• The operating system is responsible for dynamically 
controlling the power in a simple I/O subsystem 
• The dynamically controlled portion supports two modes – OFF 
and ON 
• The dynamic subcomponents consume 10 watts when on and 
0 watts when off 
• Switching takes 2 seconds and consumes 40 joules to switch 
from the off state to the on state, and one second and 10 joules to 
switch from on to off 
• Requests arrive with a period of 25 seconds 
• Three alternative schemes are illustrated graphically 
• Observe the same average throughput with substantially reduced 
power consumption 
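Some arithmetic that follows directly from the numbers above (the per-request service time is not given, so only the fixed costs are worked out): 
Left on for a full 25 s period: 10 W x 25 s = 250 J. 
One off-to-on plus one on-to-off transition: 40 J + 10 J = 50 J, taking 2 s + 1 s = 3 s. 
Switching off between requests therefore pays only when it removes more than 50 J / 10 W = 5 s of otherwise idle on-time per period. 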
Example 
Advanced Configuration and Power interface (ACPI) 
• ACPI is an industry-standard power management scheme that 
was initially applied to PCs, specifically under Windows. 
• The standard provides some basic power management 
facilities as well as interfaces to the hardware 
• The software, more specifically the operating system, provides the 
power management module 
• It is the responsibility of the OS to specify the power management 
policy for the system 
• The OS uses the ACPI module to send the required controls to the 
hardware and to monitor the state of the hardware as an input to 
the power manager 
• The behavior of the ACPI scheme is expressed in the state 
diagram 
ACPI 
• The standard supports 5 global power states 
1. G3- hard off or full off – defined as physically off 
state – system consumes no power 
2. G2- soft off requires full OS reboot to restore 
system to full operational condition 
3. G1- sleeping state – the system appears to be off. 
The time required to return to an operational 
condition is inversely proportional to power 
consumption 
4. G0 – working state in which the system is fully 
usable 
5. Legacy state – the system does not comply with ACPI 
• Substates 
1. S1- low wakeup latency – ensures no loss of 
system context 
2. S2- low wakeup latency state – has loss of 
CPU and system cache state 
3. S3- low wakeup latency state – all system 
state except for main memory is lost 
4. S4- lowest power sleeping state – all the 
devices are off 
Caches and Performance 
• Based on locality of reference characteristics, small amounts 
of high speed memory to hold a subset of instructions and 
data for immediate use can be used 
• Such a scheme gives the illusion that the program has 
unlimited amounts of high speed memory 
• The bulk of instructions and data are held in memory with 
much longer cycle/access times than available in the system 
CPU 
• One major problem in real-time embedded applications is that 
cache behavior is non-deterministic 
• It is difficult to predict when there will be a cache hit or miss 
• This makes it difficult to set reasonable upper bounds on execution 
times for tasks 
Pipelining 
• The problem is due to 2 sources – conditional branches and 
shared access with preemption 
• Conditional branches are handled with good branch 
prediction algorithms, but cannot be solved completely 
• The path taken and a successful cache access may vary with 
iteration 
• This is overcome with pipelined architectures 
• Pipelining techniques are used to prefetch data and 
instructions while other activities are taking place 
• The selection of an alternate branch requires that the pipe be 
flushed and refilled 
• This may lead to cache miss and time delay 
Preemption and multi tasking 
• In a multi tasking interrupt context, one task may 
preempt the other 
• This requires different blocks of data/instructions, so there 
will be a significant number of cache misses at each task 
switch 
• A similar situation arises on a von Neumann machine 
– the same memory holds both code and data 
Shared Access 
• Example – consider a direct mapping caching scheme 
• With a 1K cache and blocks of 64 words, main 
memory addresses 0, 1024, 2048, and so on map to the same cache block 
• Assume the following memory map: 
• Instructions are loaded starting at location 1024, and data is 
loaded starting at location 8192. Consider the simple code 
fragment 
for (i=0; i<10; i++) 
{ 
a[i] = b[i]+4; 
} 
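The collision described in the following bullets can be seen from the mapping arithmetic: with 64-word blocks and a 1K-word (16-block) direct-mapped cache, 
instruction block: 1024 / 64 = block 16, and 16 mod 16 = cache block 0 
data block: 8192 / 64 = block 128, and 128 mod 16 = cache block 0 
Both map to cache block 0, so each data access evicts the instructions and vice versa. 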
• On the first access, the instruction access will miss 
and bring in the appropriate block from main 
memory 
• The instruction will execute and have to bring in data 
• The data access will miss and bring in the 
appropriate block from main memory 
• Because block 0 is occupied, the data block will 
overwrite the instructions in cache block 0 
• On second access, the instruction access will again 
miss and bring in the appropriate block from the 
main memory 
• The miss occurs because the instructions had been 
overwritten by the incoming data 
• The instruction will execute and have to bring in the 
data again. Because block 0 is again occupied, the 
data block will overwrite block 0 again 
• This process repeats causing serious degradation 
• There is also a time burden of searching and 
managing the cache 
• The continuing main memory accesses can also 
increase the power consumption of the system 
Possible solutions 
1. Use a set associative rather than direct 
mapping scheme 
2. Move to Harvard or Aiken Architecture 
3. Support an instruction cache and data cache 
Smart Memory Allocation for Real time (SMART) 
• Cache is decomposed into restricted regions and 
common portions 
• A critical task is assigned a restricted portion on start 
up 
• All cache accesses are restricted to those partitions 
and to common area 
• The task retains exclusive rights to such areas until 
terminated or aborted 
• This remains an open problem and various heuristic 
schemes have been explored and utilized 
09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 76

More Related Content

What's hot

RTOS for Embedded System Design
RTOS for Embedded System DesignRTOS for Embedded System Design
RTOS for Embedded System Designanand hd
 
Unit 2 processor&amp;memory-organisation
Unit 2 processor&amp;memory-organisationUnit 2 processor&amp;memory-organisation
Unit 2 processor&amp;memory-organisationPavithra S
 
Trends and challenges in vlsi
Trends and challenges in vlsiTrends and challenges in vlsi
Trends and challenges in vlsilabishettybhanu
 
ARM Exception and interrupts
ARM Exception and interrupts ARM Exception and interrupts
ARM Exception and interrupts NishmaNJ
 
Pipeline hazards in computer Architecture ppt
Pipeline hazards in computer Architecture pptPipeline hazards in computer Architecture ppt
Pipeline hazards in computer Architecture pptmali yogesh kumar
 
Design and development of carry select adder
Design and development of carry select adderDesign and development of carry select adder
Design and development of carry select adderABIN THOMAS
 
System On Chip (SOC)
System On Chip (SOC)System On Chip (SOC)
System On Chip (SOC)Shivam Gupta
 
Computer organisation -morris mano
Computer organisation  -morris manoComputer organisation  -morris mano
Computer organisation -morris manovishnu murthy
 
REAL TIME OPERATING SYSTEM
REAL TIME OPERATING SYSTEMREAL TIME OPERATING SYSTEM
REAL TIME OPERATING SYSTEMprakrutijsh
 
Processors used in System on chip
Processors used in System on chip Processors used in System on chip
Processors used in System on chip A B Shinde
 

What's hot (20)

RTOS for Embedded System Design
RTOS for Embedded System DesignRTOS for Embedded System Design
RTOS for Embedded System Design
 
Unit 2 processor&amp;memory-organisation
Unit 2 processor&amp;memory-organisationUnit 2 processor&amp;memory-organisation
Unit 2 processor&amp;memory-organisation
 
pipelining
pipeliningpipelining
pipelining
 
ARM Processors
ARM ProcessorsARM Processors
ARM Processors
 
Task assignment and scheduling
Task assignment and schedulingTask assignment and scheduling
Task assignment and scheduling
 
Introduction to Genetic algorithm and its significance in VLSI design and aut...
Introduction to Genetic algorithm and its significance in VLSI design and aut...Introduction to Genetic algorithm and its significance in VLSI design and aut...
Introduction to Genetic algorithm and its significance in VLSI design and aut...
 
Trends and challenges in vlsi
Trends and challenges in vlsiTrends and challenges in vlsi
Trends and challenges in vlsi
 
ARM Exception and interrupts
ARM Exception and interrupts ARM Exception and interrupts
ARM Exception and interrupts
 
Pipeline hazards in computer Architecture ppt
Pipeline hazards in computer Architecture pptPipeline hazards in computer Architecture ppt
Pipeline hazards in computer Architecture ppt
 
dft
dftdft
dft
 
Design and development of carry select adder
Design and development of carry select adderDesign and development of carry select adder
Design and development of carry select adder
 
Radix 4 booth
Radix 4 boothRadix 4 booth
Radix 4 booth
 
Arm architecture
Arm architectureArm architecture
Arm architecture
 
System On Chip (SOC)
System On Chip (SOC)System On Chip (SOC)
System On Chip (SOC)
 
Computer organisation -morris mano
Computer organisation  -morris manoComputer organisation  -morris mano
Computer organisation -morris mano
 
VLSi
VLSiVLSi
VLSi
 
REAL TIME OPERATING SYSTEM
REAL TIME OPERATING SYSTEMREAL TIME OPERATING SYSTEM
REAL TIME OPERATING SYSTEM
 
Design for Testability
Design for Testability Design for Testability
Design for Testability
 
ARM Processor
ARM ProcessorARM Processor
ARM Processor
 
Processors used in System on chip
Processors used in System on chip Processors used in System on chip
Processors used in System on chip
 

Similar to Unit7 & 8 performance analysis and optimization

6_RealTimeScheduling.pdf
6_RealTimeScheduling.pdf6_RealTimeScheduling.pdf
6_RealTimeScheduling.pdfTigabu Yaya
 
Chapter One.pdf
Chapter One.pdfChapter One.pdf
Chapter One.pdfabay golla
 
ManSciProjMan-CPM.pptx
ManSciProjMan-CPM.pptxManSciProjMan-CPM.pptx
ManSciProjMan-CPM.pptxssuser85ddaa
 
Cpm module iii reference
Cpm module iii referenceCpm module iii reference
Cpm module iii referenceahsanrabbani
 
Project Management & Engineering Economics
Project Management & Engineering EconomicsProject Management & Engineering Economics
Project Management & Engineering EconomicsDeepak Paithankar
 
Scheduling of Heterogeneous Tasks in Cloud Computing using Multi Queue (MQ) A...
Scheduling of Heterogeneous Tasks in Cloud Computing using Multi Queue (MQ) A...Scheduling of Heterogeneous Tasks in Cloud Computing using Multi Queue (MQ) A...
Scheduling of Heterogeneous Tasks in Cloud Computing using Multi Queue (MQ) A...IRJET Journal
 
Data Structure and Algorithm chapter two, This material is for Data Structure...
Data Structure and Algorithm chapter two, This material is for Data Structure...Data Structure and Algorithm chapter two, This material is for Data Structure...
Data Structure and Algorithm chapter two, This material is for Data Structure...bekidea
 
Scheduling Task-parallel Applications in Dynamically Asymmetric Environments
Scheduling Task-parallel Applications in Dynamically Asymmetric EnvironmentsScheduling Task-parallel Applications in Dynamically Asymmetric Environments
Scheduling Task-parallel Applications in Dynamically Asymmetric EnvironmentsLEGATO project
 
Project management@ ppt doms
Project management@ ppt doms Project management@ ppt doms
Project management@ ppt doms Babasab Patil
 
A Heterogeneous Static Hierarchical Expected Completion Time Based Scheduling...
A Heterogeneous Static Hierarchical Expected Completion Time Based Scheduling...A Heterogeneous Static Hierarchical Expected Completion Time Based Scheduling...
A Heterogeneous Static Hierarchical Expected Completion Time Based Scheduling...IRJET Journal
 
Sw metrics for regression testing
Sw metrics for regression testingSw metrics for regression testing
Sw metrics for regression testingJyotsna Sharma
 
Pert,cpm, resource allocation and gert
Pert,cpm, resource allocation and gertPert,cpm, resource allocation and gert
Pert,cpm, resource allocation and gertRaj J Das
 
Algorithm analysis
Algorithm analysisAlgorithm analysis
Algorithm analysisAkshay Dagar
 
Data Structures and Algorithm - Week 11 - Algorithm Analysis
Data Structures and Algorithm - Week 11 - Algorithm AnalysisData Structures and Algorithm - Week 11 - Algorithm Analysis
Data Structures and Algorithm - Week 11 - Algorithm AnalysisFerdin Joe John Joseph PhD
 
A Review of Different Types of Schedulers Used In Energy Management
A Review of Different Types of Schedulers Used In Energy ManagementA Review of Different Types of Schedulers Used In Energy Management
A Review of Different Types of Schedulers Used In Energy ManagementIRJET Journal
 
Scheduling Algorithm Based Simulator for Resource Allocation Task in Cloud Co...
Scheduling Algorithm Based Simulator for Resource Allocation Task in Cloud Co...Scheduling Algorithm Based Simulator for Resource Allocation Task in Cloud Co...
Scheduling Algorithm Based Simulator for Resource Allocation Task in Cloud Co...IRJET Journal
 

Similar to Unit7 & 8 performance analysis and optimization (20)

6_RealTimeScheduling.pdf
6_RealTimeScheduling.pdf6_RealTimeScheduling.pdf
6_RealTimeScheduling.pdf
 
Chapter One.pdf
Chapter One.pdfChapter One.pdf
Chapter One.pdf
 
ManSciProjMan-CPM.pptx
ManSciProjMan-CPM.pptxManSciProjMan-CPM.pptx
ManSciProjMan-CPM.pptx
 
Cpm module iii reference
Cpm module iii referenceCpm module iii reference
Cpm module iii reference
 
Os2
Os2Os2
Os2
 
Project Management & Engineering Economics
Project Management & Engineering EconomicsProject Management & Engineering Economics
Project Management & Engineering Economics
 
Project management
Project managementProject management
Project management
 
RTS
RTSRTS
RTS
 
Scheduling of Heterogeneous Tasks in Cloud Computing using Multi Queue (MQ) A...
Scheduling of Heterogeneous Tasks in Cloud Computing using Multi Queue (MQ) A...Scheduling of Heterogeneous Tasks in Cloud Computing using Multi Queue (MQ) A...
Scheduling of Heterogeneous Tasks in Cloud Computing using Multi Queue (MQ) A...
 
Data Structure and Algorithm chapter two, This material is for Data Structure...
Data Structure and Algorithm chapter two, This material is for Data Structure...Data Structure and Algorithm chapter two, This material is for Data Structure...
Data Structure and Algorithm chapter two, This material is for Data Structure...
 
Scheduling Task-parallel Applications in Dynamically Asymmetric Environments
Scheduling Task-parallel Applications in Dynamically Asymmetric EnvironmentsScheduling Task-parallel Applications in Dynamically Asymmetric Environments
Scheduling Task-parallel Applications in Dynamically Asymmetric Environments
 
Project management@ ppt doms
Project management@ ppt doms Project management@ ppt doms
Project management@ ppt doms
 
10. resource management
10. resource management10. resource management
10. resource management
 
A Heterogeneous Static Hierarchical Expected Completion Time Based Scheduling...
A Heterogeneous Static Hierarchical Expected Completion Time Based Scheduling...A Heterogeneous Static Hierarchical Expected Completion Time Based Scheduling...
A Heterogeneous Static Hierarchical Expected Completion Time Based Scheduling...
 
Sw metrics for regression testing
Sw metrics for regression testingSw metrics for regression testing
Sw metrics for regression testing
 
Pert,cpm, resource allocation and gert
Pert,cpm, resource allocation and gertPert,cpm, resource allocation and gert
Pert,cpm, resource allocation and gert
 
Algorithm analysis
Algorithm analysisAlgorithm analysis
Algorithm analysis
 
Data Structures and Algorithm - Week 11 - Algorithm Analysis
Data Structures and Algorithm - Week 11 - Algorithm AnalysisData Structures and Algorithm - Week 11 - Algorithm Analysis
Data Structures and Algorithm - Week 11 - Algorithm Analysis
 
A Review of Different Types of Schedulers Used In Energy Management
A Review of Different Types of Schedulers Used In Energy ManagementA Review of Different Types of Schedulers Used In Energy Management
A Review of Different Types of Schedulers Used In Energy Management
 
Scheduling Algorithm Based Simulator for Resource Allocation Task in Cloud Co...
Scheduling Algorithm Based Simulator for Resource Allocation Task in Cloud Co...Scheduling Algorithm Based Simulator for Resource Allocation Task in Cloud Co...
Scheduling Algorithm Based Simulator for Resource Allocation Task in Cloud Co...
 

Unit7 & 8 performance analysis and optimization

  • 1. Unit 7 & 8 Performance Analysis and Optimization By Leena Chandrashekar, Assistant Professor, ECE Dept, RNSIT, Bangalore 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 1
  • 2. Performance or Efficiency Measures • Means time, space, power, cost • Depends on input data, hardware platform, compiler, compiler options. • Measure based on complexity, time and power, memory, cost and weight. • Development time, Ease of maintainance and extensibility. 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 2
  • 3. The System • Hardware oComputational and control elements oCommunication system oMemory • Software oAlgorithms and data Structures oControl and Scheduling 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 3
  • 4. Some Limitations • Amdahl’s Law Example Consider a system with the following characteristics: The task to be analyzed and improved currently executes in 100 time units, and the goal is to reduce execution time in 80 time units. The algorithm under consideration in the task uses 40 time units. n=2; If execution speed is decreased by 20 time units , required result is met. Indicates the necessary requirement. 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 4
  • 5. • Example Consider a system with the following characteristics: The task to be analyzed and improved currently executes in 100 time units, ad the goal is to reduce execution time to 50 time units. The algorithm to be improved uses 40 time units. Simplifying n=-4. The algorithm speed will have to run in negative time to meet the new specification. This is non-causal system. 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 5
  • 6. Complexity Analysis – A High-Level Measure Intructions Operations int total (int myArray[], int n) --- 2 { int sum=0; ---1 int i =0; ---1 for (i=0;i<n;i++) ---2*n +1 { sum= sum + myArray[i]; --- 3*n } return sum; ---1 } Total = 5n+6 operations 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 6
  • 7. • 5n+6; for given n, the no. of operations are  n=10 ; 56  n=100; 506  n=1000; 5006  n= 10,000; 50,006 Linear proportion to n; and final number is decreasing. 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 7
  • 8. The Methodology 1. Decompose the problem into a set of operations 2. Count the total number of such operations 3. Derive a formula, based on some parameter n that is size of the problem 4. Use order of magnitudes estimation to assess behavior Most Important Slide 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 8
  • 9. A Simple experiment • Linear • Quadratic • Logarithmic • Exponential 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 9
  • 10. Asymptotic Complexity • F(n)=5n+6 • The function grows asymptotically and referred to as asymptotic complexity • This is only an approximation as many other factors need to be considered like operations requiring varying amounts of time • As n increases, concentrate on the highest order term and drop the lower order term such as 6(constant term) 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 10
  • 11. Comparing Algorithms Based on • Worst case performance(upper bound) • Average case • Best performance(lower bound) F(N) = O(g(N)) – complexity function – Big-O notation  The complexity of an algorithm approaches a bound called order of the bound.  If such a function is expressed as a function of the problem size N, and that function is called g(N), then comparison can be written as f(N)=O(g(N)).  If there is a constant c such that f(N)<cg(N) then f(N) is of order of g(N). 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 11
  • 12. Big-O Arithmetic 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 12
  • 13. Analyzing Code • Constant Time Statements  int x,y; Declarations & Initializations  char myChar=‘a’;  x=y; Assignment  x=5*y+4*z; Arithmetic  A[j] Array Referencing  if(x<12) Conditional tests  Cursor = Head -> Next; Referencing/deferencing pointers. 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 13
  • 14. Looping Constants • For Loops, While Loops • Determine number of iterations and number of steps per iteration. int sum=0; 1 for (int j=0;j<N;j++) 3*N sum=sum+j; 1*N Total time for loop = 4 steps=O(1) steps per iteration. Total time is N.O(1)= O(N.1)=O(N) complexity of the loop is a constant. 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 14
  • 15. While Loop bool done=false; int result=1; int n; While(!done) { result=result*n; ----1(multiply)+1(assignment) n-; -----1(decrement) if(n<=1) done=true; } Total time is N.O(1)=O(N) 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 15
  • 16. Sequences of Statements int j,k,sum=0; for (j=0;j<N;j++) for(k=0;K<j;k++) sum=sum+k*j; for(i=0;i<N;i++) sum=sum+i; The complexity is given by Total time is N3+N=O(N3) 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 16
  • 17. Conditional Statements if(condition) { statement1; ----- O(n2) else statement2; ----- O(n) } Consider worst case complexity/maximum running time. 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 17
  • 18. Function Calls • Cost = the call+ passing the arguments+ executing the function/=returning a value. • Making and returning from call – O(1) • Passing arguments – depends on how it is passed – passed by value/reference • Cost of execution – body of function • Determining cost of return – values returned 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 18
  • 19. Analyzing Algorithms • Complexity Function for • Analyzing Search Algorithms Linear Search – O(N) Binary Search – O(log2N) • Analyzing Sort Algorithms  Selection Sort – O(N2)  Quick Sort - O(Nlog2N) 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 19
  • 20. Analyzing Data Structures • Insert/delete at the beginning • Insert/delete at the end • Insert/delete in the middle • Access at the beginning, the end and in the middle. • Each has a complexity function of O(N) Array Linked List 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 20
  • 21. Instructions in Detail • Addressing Mode • Flow of control – Sequential Branch Loop Function Call • Analyzing the flow of control – Assembly and C language • Example ld r0,#0AAh --- 400ns push r0 ---600ns add r0,r1 ----400ns 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 21
  • 22. Co-routine • A co-routine is a special kind of procedure call in which there is a mutual call exchange between cooperating procedures – 2 procedures sharing time. • Similar to procedure and time budget. • Procedures execute till the end whereas co-routine exit and return throughout the body of the procedure. • The control procedure starts the process. Each context switch is determined by any of the of the following – Control procedure, External event – a timing signal, internal event – a data value. 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 22
  • 23. • The process continues until both procedures are completed. • It is time burdened and for faster response preemption must be used. Control Procedure Procedure2 Procedure3 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 23
  • 24. Interrupt call Interrupt Handler Foreground Task ISR 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 24
  • 25. Time Metrics • Response Time • Execution time • Throughput • Time loading – percentage of time that CPU is doing useful work. • Memory loading – percentage of usable memory. 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 25
  • 26. Response Time • Time interval between the event and completion of associated action • Ex – A/D command and acquisition • Polled Loops – The response time consists of 3 components Hardware delays in external device to set the signaling event  Time to test the flag Time needed to respond to and process the event associated with the flag. 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 26
  • 27. External Hardware Device Delay • Two Cases considered a) Case 1 - The response through external system to prior internal event b) Case 2- An asynchronous external event Internal Event Casual System Responding System Response from External system Delay through External System 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 27
  • 28. • Time to get the polling loop from the internal causal event • The delay through an external device • The time to generate the response • Flag time - Determined from the execution time of the machine's bit test instruction • Processing time – time to perform the task associated with triggering event 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 28
  • 29. Case 2 Asynchronous Event from External Device • The occurrence of event cannot be determined. 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 29
  • 30. Co-routine • Interrupt Driven Environment • Preemptive Schedule • Non-preemptive Schedule 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 30
  • 31. Interrupt Driven Environment • Context switch to interrupt handler • To acknowledge the interrupt • Context switch to processing routine Context switch back to original routine 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 31
  • 32. Preemptive Schedule • Context Switch • Task Execution • Interrupt latency – Highest Priority Lowest Priority Case 1 Highest Priority – 3 Factors • The time from the leading edge of the interrupt in the external device until that edge is recognized inside the system. • The time to complete the current instruction if interrupts are enabled. Most processors complete the current instruction before switching context. Some permit an interrupt to be recognized at the micro instruction level. Thus the time is going to be bounded by the longest instruction. • The time to complete the current task if interrupts are disabled. This time will be bounded by the task size. 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 32
  • 33. Case 2 Low Priority Task • 2 Cases  First, the interrupt occurs and is processed.  Second, the interrupt occurs and is interrupted. Unless interrupts are disabled, the situation is non-deterministic. In critical cases, one may have to change the priority or place limits on the number of preemptions. • Non-Preemptive Schedule  Since preemption is not allowed, times are computed as in highest priority case. 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 33
  • 34. Time Loading • Is percentage of time that the CPU is doing useful work – execution of tasks assigned to embedded system • The time loading is measured in terms of execution times of primary and secondary(supported) tasks. • Time loading = primary/primary+secondary • To compute the time, 3 methods are used Instruction counting Simulation Physical measurement 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 34
  • 35. Instruction Counting • For periodic systems, the total execution time is computed and then divided by time for the individual module • For sporadic systems, the maximum task execution rates are used, and the percentages are combined over all of the tasks. • Effective instruction counting requires understanding of basic flow of control through a piece of software. Altering the flow involves context switch 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 35
  • 36. Simulation • Complete understanding of the system and accurate workload, accurate model of system • Model can include hardware or software or both • Tools like Verilog or VHDL is used for hardware modeling • System C or a variety of software languages can be used for software modeling 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 36
  • 37. Model • 2 major categories of models are behavior or conceptual and structural or analytic • Behavioral – symbols for qualitative aspects • Structural – mathematical or logical relations to represent the behavior  System-level model  Functional model  Physical model  Structural model  Behavioral model  Data model 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 37
  • 38. Timers • Timers can be associated with various buses or pieces of code in the system • Start timer at beginning of the code and end timer at end of code • For determining the timing of blocks 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 38
  • 39. Instrumentation • Numerous instruments – logic analyzer, code analyzer • Maximum and minimum times, time loops, identify non executed code, capture the rates of execution, frequently used code • Limitation – there are like input to the system, not good for typical and boundary conditions • They are not predictive – don’t guarantee performance under all circumstance • Provide significant information 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 39
  • 40. Memory Loading • Most devices come with large memory • But amount of memory may be reduced to save weight (aircraft/spacecraft) • Memory loading is defined as percentage of usable memory for a application • Memory map – useful in understanding the allocation and use of available memory Memory mapped I/O and DMA Firmware RAM Stack Space System Memory 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 A Memory Map 40
  • 41. • The total memory loading will be sum of individual loadings for instructions, stack and RAM • The values of Mi reflect memory loading for each portion of memory • Pi represent the percentage of total memory allocated for program • MT is represented as percentage • Memory mapped I/O and DMA are not included in the calculation, these are fixed by hardware design 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 41
  • 42. Example • Let the system be implemented as follows Mi=15Mb;MR=100Kb;MS=150Kb PT=55%;PR=33%;PS=10% Find value of MT 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 42
  • 43. Designing a Memory Map • Allocate minimum amount of memory necessary for the instructions and the stack • The firmware contains the program that implements the application • Memory loading is computed by dividing the number of user locations by the maximum allowable • Ram area – global variables, registers • Ram improves the instruction fetch speed • Size of Ram area is decided at design time 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 43
  • 44. Stack Area • Stores context information and auto variables • Multiple stacks depending on design • Capacity – design time • Maximum stack size can be computed using • US=Smax*Tmax • Memory loading 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 44
  • 45. Evaluating Performance • Depends on information • Exact times if computable • Measurement technique Criterion Analytic method Simulation Measurement Stage Any Any Post prototype Time Required SSmmaallll MMeeddiiuumm VVaarriieess Tools Analysis Computer languages Instrumentation Accuracy Low Moderate Varies Trade-off Evaluation Easy Moderate Difficult Cost Small Medium High Scalability Low Medium High 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 45
  • 46. Early Stages • The model should be hierarchical. Complex system can be modeled by decomposing it to simpler parts. Progressive refinement, abstraction, reuse of existing components • The model should express concurrent and temporal interdependencies among physical and modeled elements. Understand dynamic performance and interaction between other elements • Model should be graphical; not necessary • Permit worst case and scenario analysis, boundary condition 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 46
  • 47. Mid Stages • Real components of design • Prototype modules and integrate them into subsystems Later Stages • Integrate into larger system 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 47
  • 48. Performance Optimization • What is being optimized ? • Why is it being optimized? • What is the effect on overall system? • Is optimization appropriate operating context? 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 48
  • 49. Common Mistakes • Expecting improvement in one aspect of the design to improve overall performance proportional to improvement • Using hardware independent metrics to predict performance • Using peak performance • Comparing performance based on couple of metrics • Using synthetic benchmarks 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 49
  • 50. Tricks of the Trade Response times and time loading can be reduced in number of ways 1. Perform measurements and computations at a rate of change and values of the data, type of data, number of significant digits and operations 2. Use of look up tables or combinational logic 3. Modification of certain operations to reduce certain parameters 4. Learn from compiler experts 5. Loop management 6. Flow of control optimization 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 50
  • 51. Tricks of the Trade 7. Use registers and caches 8. Use of only necessary values 9. Optimize a common path of frequently used code block 10.Use page mode accesses 11.Know when to use recursion vs. iteration 12.Macros and Inlining functions 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 51
  • 52. Hardware Accelerators • One technique to improve the performance of software implementation is to move some functionality to hardware • Such a collection of components is called hardware accelerators • Often attached to CPU bus • Communication with CPU is accomplished by – shared variables, shared memory • An accelerator is distinguished from coprocessor • The accelerator does not execute instructions; its interface appears as I/O • Designed to perform a specific operation and is generally implemented as an ASIC,FPGA, CPLD 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 52
  • 53. Hardware Accelerators • Hardware accelerators are used when there are functions whose operations do not map onto the CPU • Examples – bit and bit field operations, differing precisions, high speed arithmetic, FFT calculations, high speed/demand input output operations, streaming applications 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 53
  • 54. Optimizing for Power Consumption • Safe mode, low power mode, sleep mode • Advanced Configuration and power interface (ACPI) is international standard Software Hardware •Software The algorithms used Location of code Use of software to control various subsystems 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 54
  • 55. Techniques to measure power consumption • Identify the portion of the code to be analyzed • Measure the current power consumed by processor while code is being executed • Modify the loop, such that code comprising the loop is disabled. Ensure compiler has not optimized the loop or section of code out • Measure current power consumed by processor • Kind the instructions • Collection or sequence of instructions executed • Locations of the instructions and their operands 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 55
  • 56. 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 56
  • 57. Relative Power consumption for Common Processor Operation Operation Relative Power Consumption 16-Bit Add 1 16-Bit Multiply 3.6 8x128x16 4.4 SRAM Read 8x128x16 SRAM Write 9 I/O access 10 16-bit DRAM 33 Memory transfer Using cache have significant effect on system power consumption, SRAM consumes more power than DRAM on per-cell basis and cache is generally SRAM. The size of cache should be optimized. 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 57
  • 58. Other Techniques • Power aware compilers • Use of registers effectively • Look for Cache conflicts and eliminate if possible • Unroll loops • Eliminate recursive procedures 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 58
• 59. Hardware Power Optimization Techniques – Power Management Schemes
• The lowest-power option is to turn the system off when it is not in use; consumption is then limited to leakage – the lower bound of consumption, called static power
• The upper bound is reached when power is applied to all parts of the system – the maximum value, dominated by dynamic power
• The goal is to find an intermediate power consumption value, governed by the specifications
• Example – a topographic mapping satellite
• Approaches:
o Decide which portions of the system to power down
o Decide which components have to shut down instantly
o Recognize which components do not power up instantly
09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 59
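As background (a standard CMOS model, not taken from the slides), dynamic power is commonly approximated as

P_dynamic ≈ a · C · V^2 · f

where a is the switching activity factor, C the switched capacitance, V the supply voltage, and f the clock frequency. Static (leakage) power remains even when the clock is stopped, which is why fully removing power is the lower bound described above, and why reducing the supply voltage or clock frequency attacks only the dynamic component.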
• 60. Basis for the system power-down / power-up sequence
09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 60
• 61. Predictive Shutdown
• The approaches discussed on the previous slide are not applicable everywhere
• Knowledge of the current status and of previous states must be considered when deciding to shut the system down – predictive shutdown
• A similar technique is used in the branch prediction logic of an instruction prefetch pipeline
• Prediction can lead to premature shutdown or restart
Timers
• Another technique is to use timers (a sketch follows below)
• A timer monitors the system's behavior and turns the device off when the timer expires
• The device turns on again on demand
09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 61
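A minimal sketch in C of the timer-based approach. The helper functions get_time_ms() and enter_low_power_mode(), and the 5-second threshold, are hypothetical names and values chosen for illustration; a real system would use its RTOS or hardware timer services and a threshold derived from the power budget.

#include <stdint.h>

#define IDLE_TIMEOUT_MS 5000u               /* illustrative threshold */

extern uint32_t get_time_ms(void);          /* hypothetical time source */
extern void enter_low_power_mode(void);     /* hypothetical power control */

static uint32_t last_activity_ms;

void note_activity(void)        /* call on every request or interrupt */
{
    last_activity_ms = get_time_ms();
}

void idle_task(void)            /* called repeatedly from the idle loop */
{
    if (get_time_ms() - last_activity_ms >= IDLE_TIMEOUT_MS)
        enter_low_power_mode(); /* the device is turned on again on demand */
}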
• 62. Producer, Service, Consumer
• Based on queuing theory
• The producer is the part of the system that provides the service and is the part to be powered up or down
• The consumer is the part of the system that needs the service
• A power manager monitors the behavior of the system and applies a schedule, based on a Markov model, that maximizes the system's computational performance while satisfying the power budget
09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 62
• 63. Example
• The operating system is responsible for dynamically controlling the power in a simple I/O subsystem
• The dynamically controlled portion supports two modes – OFF and ON
• The dynamic subcomponents consume 10 W when on and 0 W when off
• Switching from the off state to the on state takes 2 seconds and consumes 40 J; switching from on to off takes 1 second and consumes 10 J
• Requests arrive with a period of 25 seconds
• Three alternative schemes are illustrated graphically on the next slide (a break-even estimate follows the figure)
• Observe that the same average throughput is obtained with substantially reduced power consumption
09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 63
  • 64. Example 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 64
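A rough break-even estimate for this example, under stated assumptions: the service itself is taken to use a negligible fraction of the 25-second period, the 40 J and 10 J figures cover the transition intervals, and an idle-but-on device still draws the full 10 W.

Switching energy per off/on cycle ≈ 40 J + 10 J = 50 J
Energy saved by staying off for t seconds ≈ 10 W × t
Break-even: 10 W × t = 50 J, so t = 5 s

Since roughly 25 − 2 − 1 = 22 seconds of each period could be spent powered off – far more than the 5-second break-even – a power-down scheme spends on the order of 50 J per period instead of the 10 W × 25 s = 250 J of an always-on scheme, while still serving one request per period.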
• 65. Advanced Configuration and Power Interface (ACPI)
• ACPI is an industry-standard power management scheme that was initially applied to PCs, specifically to Windows systems
• The standard provides basic power management facilities as well as interfaces to the hardware
• The software, more specifically the operating system, provides the power management module
• It is the responsibility of the OS to specify the power management policy for the system
• The OS uses the ACPI module to send the required controls to the hardware and to monitor the state of the hardware as an input to the power manager
• The behavior of the ACPI scheme is expressed in a state diagram
09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 65
  • 66. ACPI 09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 66
• 67. The standard supports five global power states:
1. G3 – hard off or full off – defined as a physically off state; the system consumes no power
2. G2 – soft off – requires a full OS reboot to restore the system to a fully operational condition
3. G1 – sleeping state – the system appears to be off; the time required to return to an operational condition is inversely proportional to the power consumed while sleeping
4. G0 – working state, in which the system is fully usable
5. Legacy state – the system does not comply with ACPI
09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 67
• 68. Substates (of the G1 sleeping state):
1. S1 – low wake-up latency state – ensures no loss of system context
2. S2 – low wake-up latency state – CPU and system cache state are lost
3. S3 – low wake-up latency state – all system state except main memory is lost
4. S4 – lowest-power sleeping state – all devices are powered off
09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 68
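An illustrative encoding in C of the global states just listed, together with a hypothetical policy function. This is not the real ACPI firmware interface; it is only a sketch of how an OS power manager might choose a target state from an expected idle time, with the thresholds invented for the example.

typedef enum {
    ACPI_G0_WORKING,
    ACPI_G1_SLEEPING,      /* covers substates S1..S4 */
    ACPI_G2_SOFT_OFF,      /* full reboot needed to resume */
    ACPI_G3_HARD_OFF       /* physically off, no power consumed */
} acpi_global_state_t;

/* Hypothetical policy: the longer the expected idle time,
   the deeper the state chosen. */
acpi_global_state_t choose_state(unsigned expected_idle_seconds)
{
    if (expected_idle_seconds < 10)
        return ACPI_G0_WORKING;
    if (expected_idle_seconds < 600)
        return ACPI_G1_SLEEPING;
    return ACPI_G2_SOFT_OFF;
}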
• 69. Caches and Performance
• Based on locality of reference, small amounts of high-speed memory can be used to hold the subset of instructions and data needed for immediate use
• Such a scheme gives the illusion that the program has unlimited amounts of high-speed memory
• The bulk of the instructions and data are held in memory with much longer cycle/access times than those of the CPU
• One major problem in real-time embedded applications is that cache behavior is non-deterministic
• It is difficult to predict when there will be a cache hit or miss
• It is therefore difficult to set reasonable upper bounds on execution times for tasks
09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 69
• 70. Pipelining
• The problem is due to two sources – conditional branches, and shared access with preemption
• Conditional branches are handled with good branch prediction algorithms, but the problem cannot be eliminated completely
• The path taken, and whether a cache access succeeds, may vary from iteration to iteration
• Pipelined architectures help by prefetching data and instructions while other activities are taking place
• However, taking the alternate branch requires that the pipe be flushed and refilled
• This may lead to cache misses and time delays
09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 70
• 71. Preemption and Multitasking
• In a multitasking or interrupt-driven context, one task may preempt another
• The preempting task requires different blocks of data and instructions, so a task switch produces a significant number of cache misses
• A similar situation arises on a von Neumann machine, where the same memory holds both code and data
09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 71
• 72. Shared Access
• Example – consider a direct-mapped caching scheme
• With a 1K-word cache and blocks of 64 words, main-memory blocks starting at addresses 0, 1024, 2048, and so on all map to the same cache block
• Assume the following memory map: instructions are loaded starting at location 1024, and data are loaded starting at location 8192
• Consider the simple code fragment
for (i = 0; i < 10; i++)
{
    a[i] = b[i] + 4;
}
09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 72
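As a quick check of the mapping (assuming word addressing and the layout above): a 1K-word direct-mapped cache with 64-word blocks has 1024 / 64 = 16 cache blocks, and an address maps to cache block (address / 64) mod 16. The instructions at address 1024 map to (1024 / 64) mod 16 = 16 mod 16 = 0, and the data at address 8192 map to (8192 / 64) mod 16 = 128 mod 16 = 0, so the instructions and the data compete for the same cache block 0, producing the thrashing described next.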
• 73.
• On the first access, the instruction fetch will miss and bring the appropriate block in from main memory
• The instruction will execute and will then have to bring in the data
• The data access will miss and bring the appropriate block in from main memory
• Because cache block 0 is occupied, the data block will overwrite the instructions in cache block 0
• On the second access, the instruction fetch will again miss and bring the appropriate block in from main memory
• The miss occurs because the instructions have been overwritten by the incoming data
09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 73
• 74.
• The instruction will execute and will have to bring in the data again; because cache block 0 is again occupied, the data block will overwrite block 0 once more
• This process repeats, causing serious performance degradation (thrashing)
• There is also the time burden of searching and managing the cache
• The continuing main-memory accesses can also increase the power consumption of the system
09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 74
• 75. Possible solutions
1. Use a set-associative rather than a direct-mapped caching scheme
2. Move to a Harvard or Aiken architecture
3. Support separate instruction and data caches
09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 75
• 76. Smart Memory Allocation for Real Time (SMART)
• The cache is decomposed into restricted regions and a common portion
• A critical task is assigned a restricted portion at start-up
• All of that task's cache accesses are restricted to its partition and to the common area
• The task retains exclusive rights to its partition until it terminates or is aborted
• Cache management for real-time systems remains an open problem, and various heuristic schemes have been explored and utilized
09-Nov-14 ECE Dept, RNSIT,VTU, Aug - Dec 2014 76