1. Getting Back Memory and Performance
Jen Costillo
While you wait, download:
http://tinyurl.com/nha7853
V1.2 release
2. Why Optimize?
Lower memory -> cheaper BOM
Lower Memory footprint
Low RAM footprint
Performance
Bottlenecks
Power considerations
Maintenance
Sometimes your compiler can’t do everything.
Sometimes you want a challenge.
11/15/2015 2Costillo Rebelbot
3. Things Covered
Basics of:
Profiling tools
Code optimization
RAM optimization
Intro to tools:
Keil
IDA
Map files
Compiler/linker documentation secrets
4. Things Not Covered (but are pretty cool)
Virtual memory
Caching for speed
Branch optimization
Processor pipeline considerations
Specifics of a particular processor family
Instead focus on where to find the info to accomplish
the goal
5. Performance
Speed of execution
Resolve bottlenecks
Meeting design guidelines
Meet power consumption model
[Figure: space versus time trade-off, axes labeled Space and Time]
10. Before You Dive In
Don’t optimize too early
Keep a baseline
Profiling
Keep track of memory utilization
Leverage tools with compiler tool chain/IDE
Create your own profiling systems
12. Performance Measurements
1. Model: what do you expect the numbers to be?
a. Tasks
b. RTOS task switching
c. ISRs
2. Measure
a. Review compiler listings and map files
b. Home-grown profiling tools
c. Leverage simulators
d. IDE/RTOS toolchain tools
3. Modify with basic tools
a. Leverage toolchain intrinsics
b. Utilize the compiler optimizer
c. Use Big-O to improve code structure
d. Count and shrink the instruction count
e. Write assembly based on processor pipeline knowledge
13. Measurements Interface Tradeoffs
Type | Pros | Cons
Logging-based (human readable):
Serial or other live stream | Happens in “real” time | Serial port is slow; overhead is high; can disrupt execution order
Circular buffer | Faster than serial | Needs extraction tool; not read in real time; limited data
File system with extraction tool | Stays until you extract it | Size is limited by allocation; requires extraction tool; not read in real time
HW-based (execution disruption is low, but output needs decoding for readability):
GPIOs | Setup is low | Potentially high pin count
PWMs | Low pin count | Overhead can be high; can be painful without a spectrum analyzer
DAC | Low pin count; 2^n event levels from n bits | Needs an oscilloscope
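A minimal sketch of the circular buffer option, assuming a hypothetical event record (timestamp plus a small event code) and a hypothetical buffer size of 8. The names `prof_log_event`, `prof_log_count`, and `prof_log_get` are illustrative and not from the lab code. Writes are cheap; reading happens later through an extraction tool.

```c
#include <assert.h>
#include <stdint.h>

#define PROF_LOG_SIZE 8u  /* hypothetical size; a power of two so the index can be masked */

typedef struct {
    uint32_t timestamp;   /* e.g. a free-running timer tick */
    uint8_t  event_id;    /* small code, decoded later by the extraction tool */
} prof_event_t;

static prof_event_t prof_log[PROF_LOG_SIZE];
static uint32_t prof_head;  /* total number of events written so far */

/* Store one event, overwriting the oldest entry once the buffer is full. */
void prof_log_event(uint32_t timestamp, uint8_t event_id)
{
    prof_event_t *slot = &prof_log[prof_head & (PROF_LOG_SIZE - 1u)];
    slot->timestamp = timestamp;
    slot->event_id = event_id;
    prof_head++;
}

/* Number of valid entries currently held. */
uint32_t prof_log_count(void)
{
    return (prof_head < PROF_LOG_SIZE) ? prof_head : PROF_LOG_SIZE;
}

/* Return the i-th entry, where i = 0 is the oldest one still in the buffer. */
prof_event_t prof_log_get(uint32_t i)
{
    assert(i < prof_log_count());
    uint32_t oldest = (prof_head < PROF_LOG_SIZE) ? 0u : prof_head - PROF_LOG_SIZE;
    return prof_log[(oldest + i) & (PROF_LOG_SIZE - 1u)];
}
```

This shows the "limited data" trade-off directly: after ten events, only the last eight remain.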
14. Modification Improvement Strategies
Type | Method | Impact
Algorithm efficiency | Review your Big O(n); leverage preprocessor intrinsics; count instructions and write new code | Function ROM, scale, speed
Code size | Utilize optimizer flags; C/assembly based on processor knowledge | Function ROM, processor pipeline, target size or call frequency
Memory usage | Leverage compiler intrinsics; utilize optimizer or linker flags | RAM (stack, heap), memory location
15. Profiling LED Sample Task
[Diagram: MainThread (500ms timer) -> Signal Set -> thrLED (Toggle LED4); EXTI GPIO ISR -> Signal Set -> thrIsrLED (sample trigger)]
16. Lab1 - Make Profiler Module
Go to project Options and in C/C++, change
LAB0 -> LAB1
Load Saleae logic settings in /Saleae (Optional)
Exercise: How to improve
Observe:
Measurement accuracy
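One possible shape for a home-grown profiler module, sketched under assumptions: the pin write is a hypothetical hook (`prof_pin_write_fn`) so the same code can run off-target, and `sim_pin_write` is a stand-in for a real GPIO write. None of these names come from the lab project. On hardware, a logic analyzer would watch the pin go high on entry and low on exit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical port hook: on the target this would write a GPIO register
 * or call the vendor HAL. It is a function pointer here so the module can
 * be exercised without hardware. */
typedef void (*prof_pin_write_fn)(uint8_t pin, uint8_t level);

static prof_pin_write_fn prof_write;

void prof_init(prof_pin_write_fn write_fn)
{
    assert(write_fn != NULL);
    prof_write = write_fn;
}

/* Raise the pin before the code under test, lower it afterwards. */
void prof_enter(uint8_t pin) { if (prof_write) prof_write(pin, 1u); }
void prof_exit(uint8_t pin)  { if (prof_write) prof_write(pin, 0u); }

/* Off-target stand-in: remembers the level and counts rising edges per pin. */
#define SIM_PIN_COUNT 4u
uint8_t  sim_level[SIM_PIN_COUNT];
uint32_t sim_rising_edges[SIM_PIN_COUNT];

void sim_pin_write(uint8_t pin, uint8_t level)
{
    if (pin >= SIM_PIN_COUNT) return;
    if (level && !sim_level[pin]) sim_rising_edges[pin]++;
    sim_level[pin] = level;
}
```

The pin write itself costs cycles, so the measured pulse includes the profiler's own overhead. That is one source of the accuracy limits this lab asks you to observe.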
17. Lab2 - Improve Profiler Module
Go to project Options and in C/C++, change
LAB1 -> LAB2
Exercise: Count Instruction Cycles
Observe:
8MHz processor speed
~56us Interrupt Delay (448 cycles)
~12ms blip (~96k cycles)
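The cycle figures above follow from time multiplied by clock rate. A small sketch, with hypothetical helper names, that reproduces the arithmetic:

```c
#include <assert.h>
#include <stdint.h>

/* 1 MHz is one cycle per microsecond, so cycles = microseconds * MHz. */
uint32_t us_to_cycles(uint32_t time_us, uint32_t clock_mhz)
{
    return time_us * clock_mhz;
}

/* Inverse conversion, rounding down to whole microseconds. */
uint32_t cycles_to_us(uint32_t cycles, uint32_t clock_mhz)
{
    assert(clock_mhz != 0u);
    return cycles / clock_mhz;
}
```

At 8MHz, `us_to_cycles(56, 8)` gives 448 and `us_to_cycles(12000, 8)` gives 96000, matching the observations.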
19. Intro to reading a map file
Cross reference – location of data/functions
Symbol Table – size in section
Memory Map – memory section
Image Component Sizes – by module
Callgraph – depth of stack usage *
Summary
*Keil uses a separate .htm file
20. Image Breakdown
Segment | Access | Data type | Contains
.text/.code | Read only | Code | Functions, const, string literals and pre-defined values
.bss/.zinit | Read/write | Zero-init | Uninitialized global and static variables
.data | Read/write | Initialized | Global variables, static variables
STACK | Read/write | Call stack | Local function vars
HEAP | Read/write | - | malloc()
https://en.wikipedia.org/wiki/Data_segment
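A short illustration of where typical C declarations usually land. All names here are made up for the example, and the exact section names and placement depend on the toolchain and linker script, so check the map file rather than trusting the comments.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

const uint8_t lookup[4] = {1, 2, 4, 8};  /* read-only data, usually kept in ROM with the code */
uint32_t zero_init_counter;              /* .bss: no initializer, zeroed at startup */
static uint16_t sample_buf[16];          /* .bss: static and zero-initialized */
uint32_t threshold = 1000;               /* .data: initial value copied from ROM to RAM at startup */

uint32_t segment_demo(void)
{
    uint32_t local_sum = 0;              /* stack: lives only for this call */
    uint8_t *scratch = malloc(4);        /* heap: allocated at run time */
    if (scratch == NULL) {
        return 0;
    }
    memcpy(scratch, lookup, 4);
    for (int i = 0; i < 4; i++) {
        local_sum += scratch[i];
    }
    free(scratch);
    sample_buf[0] = (uint16_t)local_sum;
    zero_init_counter++;
    return local_sum + threshold;
}
```

Note that `threshold` costs both ROM (the initial value) and RAM, while `zero_init_counter` costs RAM only.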
21. Estimate Instruction Count
Exercise: Estimate the number of instructions executed, once via
profiling and once by counting instructions. Are they close?
22. Estimate Clocks Per Instruction (CPI)
Method 1:
CPI = (Execution Time * Clock Frequency) / Instruction Count
(12ms * 8MHz) / x = 96,000 / x = ???
Hard to count; a reasonable number is ~1-2
Method 2:
CPI = weighted average of instruction types
Instruction | Cycle Count
HAL_GPIO_TogglePin:
LDR R2, [R0,#0x14] | 2
EORS R2, R1 | 1
STR R2, [R0,#0x14] | 2
BX LR | 1 + (pipeline refill 1-3)
Total instructions: 4, total cycles: 6-8
CPI: 1.5 - 2
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0439b/CHDDIGAC.html
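Both methods can be sketched in a few lines. CPI is cycles divided by instructions, and cycles is execution time multiplied by clock frequency. The function names are hypothetical, and results are scaled by 100 to stay in integer math. For Method 2, costing the branch at 1 to 3 cycles reproduces the 6 to 8 cycle total on this slide.

```c
#include <assert.h>
#include <stdint.h>

/* Method 1: CPI = (execution time * clock frequency) / instruction count.
 * Time in microseconds times clock in MHz gives cycles. Returns CPI * 100. */
uint32_t cpi_x100_from_time(uint32_t time_us, uint32_t clock_mhz, uint32_t instr_count)
{
    assert(instr_count != 0u);
    uint64_t cycles = (uint64_t)time_us * clock_mhz;
    return (uint32_t)((cycles * 100u) / instr_count);
}

/* Method 2: weighted average over instruction classes. */
typedef struct {
    uint32_t count;        /* how many instructions of this class */
    uint32_t cycles_each;  /* cycles per instruction of this class */
} instr_class_t;

/* Returns CPI * 100, or 0 when there are no instructions. */
uint32_t cpi_x100_weighted(const instr_class_t *classes, uint32_t n)
{
    uint32_t total_instr = 0;
    uint32_t total_cycles = 0;
    for (uint32_t i = 0; i < n; i++) {
        total_instr += classes[i].count;
        total_cycles += classes[i].count * classes[i].cycles_each;
    }
    return total_instr ? (total_cycles * 100u) / total_instr : 0u;
}
```

With two load/store instructions at 2 cycles, one EORS at 1 cycle, and one branch at 1 or 3 cycles, Method 2 returns 150 or 200, i.e. a CPI of 1.5 to 2. For Method 1, a made-up count of 64,000 instructions in 12ms at 8MHz gives 150.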
23. Instruction Counting versus Time
Counting/CPI:
Tedious counting
Non-representative result
Not required in most cases
Works best in small encapsulated functions with limited call stack depth
Use when concerned with nanoseconds.
Time:
Look at average execution times, not instructions
Break down only to the level of detail required
Best for large procedures and subsystems
Use when at microsecond and millisecond scope
24. Lab2B – More Granularity
Go to project Options and in C/C++, change
LAB2 -> LAB2B
Exercise:
Determine where processing time is spent
Toggle on each f() call in task
Observe:
LCD calls are long
[Trace annotations: LCD Clear, LCD Display]
26. Now for something interesting
thrIsrLED() includes:
Creates a sliding averaging window structure
Collects a gyro data sample with each button press.
Calculates the magnitude of the 3 axes, just for fun.
Adds it to a sliding averaging window
Prints out the average of the window on the LCD screen.
[Diagram: EXTI GPIO ISR -> Signal Set -> thrIsrLED (sample trigger)]
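A rough sketch of the data path described here, not the lab's actual code: the window length, type names, and function names are all assumptions. It uses an integer square root instead of `math.h`, and a plain `%` for the index, which the later labs revisit.

```c
#include <stdint.h>

#define WINDOW_LEN 8u   /* hypothetical window length */

typedef struct {
    uint32_t samples[WINDOW_LEN];
    uint32_t sum;     /* running sum of the samples currently in the window */
    uint32_t index;   /* next slot to overwrite */
    uint32_t count;   /* number of valid samples, saturates at WINDOW_LEN */
} sliding_window_t;

/* Integer square root, rounded down, using the bit-by-bit method. */
static uint32_t isqrt32(uint32_t x)
{
    uint32_t res = 0;
    uint32_t bit = 1u << 30;
    while (bit > x) bit >>= 2;
    while (bit != 0u) {
        if (x >= res + bit) {
            x -= res + bit;
            res = (res >> 1) + bit;
        } else {
            res >>= 1;
        }
        bit >>= 2;
    }
    return res;
}

/* Magnitude of a 3-axis sample: sqrt(x^2 + y^2 + z^2), rounded down.
 * Three squared 16-bit values still fit in 32 bits. */
uint32_t gyro_magnitude(int16_t x, int16_t y, int16_t z)
{
    uint32_t sq = (uint32_t)((int32_t)x * x)
                + (uint32_t)((int32_t)y * y)
                + (uint32_t)((int32_t)z * z);
    return isqrt32(sq);
}

/* Add a value and return the current average. The struct must start zeroed. */
uint32_t window_add(sliding_window_t *w, uint32_t value)
{
    w->sum -= w->samples[w->index];   /* drop the oldest sample from the sum */
    w->samples[w->index] = value;
    w->sum += value;
    w->index = (w->index + 1u) % WINDOW_LEN;
    if (w->count < WINDOW_LEN) w->count++;
    return w->sum / w->count;
}
```

Keeping a running sum means each new sample costs one subtract and one add instead of a full pass over the window.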
27. Lab 3 – Measure Data Processing Time
Go to project Options and in C/C++, change LAB2 -> LAB3
Exercise:
Measure code in terms of time and size
Utilize compiler listings under IDA
Observations:
Profiler time goes up (8ms-14ms to 11ms-15ms)
Algorithm choices are poor
29. Using IDA
Select “New” to disassemble a new file.
Open “Optimizer.axf”
Change Processor Type to ARM Little Endian
Select “Ok” and “yes” for everything else
31. Revisit Estimate Instruction Count
Exercise: Is utilizing IDA to count instructions
effective?
32. Quick Tips
Big O notation matters:
Iteration improvements pay off big
Most compilers are smart enough to optimize if you tell them.
Use powers of 2 for buffer sizes on averaging windows.
Skip % operations. They usually become some version of / and are expensive.
While/subtraction loops can be faster in some cases.
Math considerations:
Data type matters
Division becomes >> operations.
pow(), sqrt(), and math.h are expensive. Focus on “good enough”
NOTE: some of these will appear in the next labs
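To make the power-of-2 tip concrete, here is a small sketch with a hypothetical buffer length of 16. The mask and shift forms give the same results as `%` and `/` for unsigned values. Many compilers already make this substitution when the length is an unsigned compile-time constant, so check the listing before assuming a win.

```c
#include <stdint.h>

#define BUF_LEN   16u              /* hypothetical length, must be a power of two */
#define BUF_MASK  (BUF_LEN - 1u)
#define BUF_SHIFT 4u               /* log2(BUF_LEN) */

/* Wrap-around index using the remainder operator. */
uint32_t next_index_mod(uint32_t i)  { return (i + 1u) % BUF_LEN; }

/* Same result using a bit mask, valid only for power-of-two lengths. */
uint32_t next_index_mask(uint32_t i) { return (i + 1u) & BUF_MASK; }

/* Average of a full window using division. */
uint32_t average_div(uint32_t sum)   { return sum / BUF_LEN; }

/* Same result using a right shift. */
uint32_t average_shift(uint32_t sum) { return sum >> BUF_SHIFT; }
```

The explicit forms matter most when the length is a run-time variable or the operands are signed, where the compiler cannot safely substitute a mask or shift on its own.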
34. RAM Optimization
Symptoms:
You are out of space and can’t link
You keep blowing your stack/heap (i.e. things are suddenly in the weeds or show weird values)
malloc() keeps failing.
Solutions:
Look at your local variables inside functions.
Remove debug helper variables.
Reduce stack size if possible
Alter your memory map
Pass inputs on the stack versus sending a pointer to a struct
Reference globals and static variables
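The "inputs on stack versus pointer to struct" point can be illustrated with a hypothetical state struct. Passing it by value copies the whole struct for every call, typically onto the stack, while passing a pointer costs one pointer-sized argument. The struct and function names below are made up for the example.

```c
#include <stdint.h>

#define HISTORY_LEN 32u

/* Hypothetical state block: well over 100 bytes. */
typedef struct {
    int16_t  axis[3];
    uint32_t history[HISTORY_LEN];
} gyro_state_t;

/* By value: the caller makes a full copy of the struct for this call. */
uint32_t sum_by_value(gyro_state_t s)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < HISTORY_LEN; i++) sum += s.history[i];
    return sum;
}

/* By pointer: only an address is passed, and const documents read-only use. */
uint32_t sum_by_pointer(const gyro_state_t *s)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < HISTORY_LEN; i++) sum += s->history[i];
    return sum;
}
```

Both return the same result. The difference shows up in stack depth on the call graph and in the copy instructions in the listing.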
35. Lab 4 – Lower RAM footprint
Go to project Options and in C/C++, change
LAB3 -> LAB4
Exercise:
Role of global, static, and local variables in RAM
footprint - what happens as they shift attributions?
Tradeoffs in hiding variables in the call stacks
Select the right stack size for your task
Observation:
Smaller .DATA segment
Decreased algorithm size
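One common way to pick a stack size is to paint the stack with a known pattern and later count how much was never overwritten. The sketch below assumes a descending stack, an arbitrary fill pattern, and hypothetical function names. FreeRTOS offers the same idea through `uxTaskGetStackHighWaterMark()`.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xA5A5A5A5u  /* arbitrary pattern unlikely to occur by chance */

/* Fill the whole stack region with the pattern before the task starts. */
void stack_paint(uint32_t *stack, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        stack[i] = STACK_FILL;
    }
}

/* Assumes a descending stack: usage grows from the highest index toward
 * index 0, so untouched words are counted upward from index 0. */
size_t stack_unused_words(const uint32_t *stack, size_t words)
{
    size_t n = 0;
    while (n < words && stack[n] == STACK_FILL) {
        n++;
    }
    return n;
}
```

Run the worst-case workload, read the unused count, and keep a safety margin when shrinking the stack.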
37. Deeper Code Space Optimization
Reduce number of instructions
Use listing file
Check stack and heap usage *
Use compiler flags
Use processor intrinsics
38. Lab 5 – More Code Space Optimizer Through Toolchain
Go to project Options and in C/C++, change
LAB4 -> LAB5
Exercise:
Optimize only SlidingWindowAverage() with intrinsics.
Use smarter math operation selections. Is there a tradeoff?
Observation:
Check current size on listing
40. Open Lab - Things to try
Unroll loop (need to standardize buffer size)
• #pragma unroll [(n)]
Write a better square root function using a lookup table.
• Hint: lookup table
Actually turn on Optimizer for space or speed
• Type –Os3
Check call graph for deepest stack usage
• Optimizer.htm
Turn on RTOS run time stat feature in FreeRTOSconfig.h file
• configGENERATE_RUN_TIME_STATS
Customize map file
• HINT: remove PADDING
Who can get the MOST EFFICIENT CODE?
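As a starting point for the lookup-table square root, here is one possible sketch for 16-bit inputs. It is an assumption about what such a function could look like, not the lab's solution. It searches a table of squares instead of calling `sqrt()`. The table is built at run time here to keep the example short, but on a target it could be a `const` array in ROM.

```c
#include <assert.h>
#include <stdint.h>

/* squares[i] = i * i; the largest entry, 255 * 255 = 65025, fits in 16 bits. */
static uint16_t squares[256];

void sqrt_lut_init(void)
{
    for (uint32_t i = 0; i < 256u; i++) {
        squares[i] = (uint16_t)(i * i);
    }
}

/* Square root of a 16-bit value, rounded down, by binary search over the
 * table. At most 8 table reads, with no division or floating point. */
uint8_t sqrt_lut(uint16_t x)
{
    assert(squares[255] == 65025u);  /* sqrt_lut_init() must run first */
    uint32_t lo = 0;
    uint32_t hi = 255;
    while (lo < hi) {
        uint32_t mid = (lo + hi + 1u) / 2u;
        if (squares[mid] <= x) {
            lo = mid;
        } else {
            hi = mid - 1u;
        }
    }
    return (uint8_t)lo;
}
```

The trade-off is 512 bytes of table against the code size and cycles of the library routine, so compare both in the map file and with the profiler.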