Introduction to Multi-Core
A multi-core processor is an integrated circuit that contains two or more
processor cores.
This leads to:
o enhanced performance, reduced power consumption, and more efficient
simultaneous processing of multiple tasks
What changes are expected in software design:
o To achieve competitive application performance on these new processors, many
applications must be written (or rewritten) as parallel, multithreaded
applications.
o Multithreaded development can be difficult, expensive, time consuming, and
error prone, and it requires new programming skill sets (a minimal sketch of
such a rewrite follows below)
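As an illustration of such a rewrite, the sketch below splits a serial sum across POSIX threads. The thread count, array size, and the names worker and partial are illustrative choices, not from the text; compile with -pthread.

    /* Parallel sum with pthreads: each core can run one worker on its
     * own slice of the data. */
    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4
    #define N 1000000

    static double data[N];
    static double partial[NUM_THREADS];   /* one result slot per thread */

    static void *worker(void *arg)
    {
        long id = (long)arg;
        long chunk = N / NUM_THREADS;
        long start = id * chunk;
        long end = (id == NUM_THREADS - 1) ? N : start + chunk;
        double sum = 0.0;
        for (long i = start; i < end; i++)
            sum += data[i];
        partial[id] = sum;        /* no sharing: each thread owns one slot */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NUM_THREADS];
        for (long i = 0; i < N; i++)
            data[i] = 1.0;
        for (long t = 0; t < NUM_THREADS; t++)
            pthread_create(&tid[t], NULL, worker, (void *)t);
        double total = 0.0;
        for (long t = 0; t < NUM_THREADS; t++) {
            pthread_join(tid[t], NULL);   /* wait, then combine partials */
            total += partial[t];
        }
        printf("sum = %f\n", total);
        return 0;
    }

Even this small example shows the new skill set involved: partitioning the data, avoiding shared writes, and joining the results.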
Adding cores introduces additional overheads and latencies
o Execution can become serialized between communicating and even
non-communicating cores (e.g. hardware barriers, fences, resource contention)
o Various interdependent sources of latency and overhead
Architecture: cache coherency
System: processor scheduling
Application: synchronization
o Sensitive to real workloads (e.g. data dependencies)
o As the number of cores increases, these overheads and latencies grow
Is a multiprocessor the same as a multi-core?
o Multi-core: multiple CPU cores within a single processor chip
o Multi-processor: multiple processors within a single system
o From a software perspective, either term can be used.
(Diagram: multiprocessor vs. multi-core)
Example of a multi-core architecture:
ARM MPCore
Two basic models of multi-core:
o Each core acts independently: “multiple single cores”
o Cores cooperate with each other: “true multi-core”
What is “multiple single cores”?
Each core acts independently
o Pros
Simplified porting from single-core systems
Minimal interaction between cores: less overhead and a more
predictable system
No cache coherency issues between the cores
Tool support can remain the same as for single core
Good scalability, though it depends on hardware support
o Cons
Load balancing issues: some cores may be idle while others are overloaded.
Hardware must support this mode of operation by providing I/O
queues for network interfaces.
What is true multi-core?
Cores cooperate with each other
o Pros
Better possibilities for load balancing, meaning more effective use of
system resources
L1 instruction cache can be used more efficiently (cache affinity)
o Cons
Porting from single core is typically more complicated
Possible cache coherency issues between the cores
The system becomes more complex, especially when dependencies exist
between tasks; as a result, hard real-time scheduling is harder to
achieve
Examples of true multi-core designs: Master-Slave, SMP …
Different flavors of Multi-core
SMP (Symmetric Multi-Processing)
o Identical processor cores
o Dynamic task allocation (each task can run on any of the identical processors)
o Shared view of memory
Synchronization and communication via shared memory
o Normally a homogeneous CPU arrangement
AMP (Asymmetric Multi-Processing)
o Static task allocation (each processor is assigned a particular kind of task)
o Distributed or common view of memory
Synchronization and communication via a message-passing mechanism
o Either homogeneous or heterogeneous CPU cores
Cache coherency requires special attention
Master-Slave MP architecture
o The master core is responsible for all I/O operations and uses the other
cores as slaves; it decides which task each core performs
o Slave cores do not communicate with each other directly, only through the
master core
OS + Multi-core Design:
Each CPU has its own OS
• Statically allocate physical memory to each CPU
• Each CPU runs its own independent OS
• Share peripherals
• Each CPU handles its own processes' system calls
• Used in early multiprocessor systems
• Simple to implement
• Avoids concurrency issues by not sharing
• Issues:
1. Each processor has its own scheduling queue.
2. Each processor has its own memory partition.
3. Consistency is an issue with independent disk buffer caches and
potentially shared files.
OS + Master-Slave Multiprocessors
• OS mostly runs on a single fixed CPU.
• User-level applications run on the other CPUs.
• All system calls are passed to the Master CPU for processing
• Very little synchronization required
• Simple to implement
• Single centralised scheduler to keep all processors busy
• Memory can be allocated as needed to all CPUs.
• Issues: the master CPU becomes the bottleneck.
OS + SMP
• OS kernel runs on all processors, while load and resources are balanced between all
processors.
• One alternative: a single mutex (mutual exclusion object) that makes the entire kernel a
large critical section; only one CPU can be in the kernel at a time; only slightly better
than master-slave
• Better alternative: identify independent parts of the kernel and give each of them its
own critical section, which allows parallelism in the kernel (see the locking sketch
after this list)
• Issues: a difficult task; the code is mostly similar to uniprocessor code; the hard part
is identifying independent parts that don't interfere with each other
• CPUs connected via shared bus to shared memory
• Each processor has L1 Cache
• Any task can run on any CPU; every CPU is equal from the system's point of view
• No master-slave configuration
• Each processor is able to access the entire memory map
• Processors are identical and of equal capability
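A userspace sketch of the two locking alternatives above, using pthread mutexes as an analogy for kernel locks; big_lock, fs_lock, net_lock, and the syscall_* functions are illustrative names, not real kernel code.

    /* One "big lock" vs. per-subsystem locks. With big_lock, only one
     * CPU is "in the kernel" at a time; with per-subsystem locks,
     * independent paths run in parallel. */
    #include <pthread.h>

    static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

    void syscall_bigkernel(void)
    {
        pthread_mutex_lock(&big_lock);   /* whole kernel = one critical section */
        /* ... any kernel work ... */
        pthread_mutex_unlock(&big_lock);
    }

    /* Finer granularity: each independent subsystem gets its own lock */
    static pthread_mutex_t fs_lock  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t net_lock = PTHREAD_MUTEX_INITIALIZER;

    void syscall_read_file(void)
    {
        pthread_mutex_lock(&fs_lock);    /* filesystem path only */
        /* ... file system work ... */
        pthread_mutex_unlock(&fs_lock);
    }

    void syscall_send_packet(void)
    {
        pthread_mutex_lock(&net_lock);   /* can run in parallel with fs path */
        /* ... networking work ... */
        pthread_mutex_unlock(&net_lock);
    }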
Application porting on Multi-core:
Identify the threads (tasks) that can be executed concurrently by different cores
How to choose these tasks?
o Minimize inter-task dependencies
o Each task should have schedulable real-time characteristics for single core
o Avoid tasks that are too short, because of per-task overhead
o Leave room for tuning at the implementation stage
o Identify inter-task dependencies
o Inter-task dependencies may cause performance degradation, as one core will
have to wait for other cores, which can result in missed deadlines.
o Inter-task dependencies may affect your scheduling decisions
o Define what management and I/O tasks you assign to a “master” core and what
is shared between several cores
Memory management can be done either by the master core or by all cores
Ethernet and other I/O
DMA
o Define the scheduling policies
Take cache considerations on multi-core into account:
1. For example, it may be more efficient to co-schedule two tasks that
are using the same working set in L2 cache
2. Running several “big” working sets on different cores that thrash
each other in L2 at the same time can be painful
3. Data cache affinity: sometimes it is worth giving a task priority to
run on the same core and take advantage of a “hot” cache (see the
affinity sketch below)
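A minimal sketch of exploiting data cache affinity on Linux, assuming the GNU-specific pthread_setaffinity_np(); the function name and core number are illustrative.

    /* Pin the calling thread to one core so repeated runs of a task
     * find their working set still "hot" in that core's cache. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    int pin_self_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);          /* allow this thread on 'core' only */
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

Pinning trades load-balancing freedom for cache affinity, the same trade-off the scheduler faces.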
What is Cache Coherency?
Cache coherency is a state where each processor in a multiprocessor system sees the same
value for a data item in its cache as the value that is in System Memory.
This state is transparent to the software but affects software performance
For Example:
• Processor A and B both cache address x
• A writes to x
– Updates cache
• How does B find out?
There are many cache coherence protocols, such as:
– MESI (a worked example follows the state list below)
MESI
Modified
o This cache has modified the data; it must be written back to memory
Exclusive
o No other processor has the line cached; it can be modified locally
Shared
o Not modified, and other processors have it cached; to change it, the other
processors must be told to invalidate their cache lines
Invalid
o The cached line is no longer valid (some other processor may have updated it)
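As a worked example, here is the earlier Processor A/B scenario traced through the MESI states (an illustrative trace, written as C comments):

    int x;   /* one shared variable; assume it occupies one cache line */

    /* Core A reads x  -> A: Exclusive (no other cached copy)            */
    /* Core B reads x  -> A: Shared,   B: Shared                         */
    /* Core A writes x -> A: Modified, B: Invalid (B's copy invalidated) */
    /* Core B reads x  -> A supplies/writes back the line;               */
    /*                    A: Shared,   B: Shared                         */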
Specifics required to work with an MP core:
Identification
o A CPU ID to uniquely identify each CPU to software
o Ability to indicate the need to be memory coherent
Can maintain memory coherency
o Caches can participate in MESI protocol
Provides consistent view of memory
o With a defined memory ordering
o Atomic and synchronization primitives
Communication with peers
o IPI
o Message passing
Interrupt distribution
o An interrupt distribution unit controlling each individual processor's
interrupt controller unit
Multi-core/Multi-Processor design issues:
Cache coherency
Design of multi-threaded applications for multi-core
o Functional decomposition
o Domain decomposition (independent data sets)
Snooping (Cache/Memory snooping)
Interrupt distribution
Processor affinity
Inter-processor Interrupts
Memory access
Concurrency
o Interrupt
o Instruction/data
o Memory/peripherals
Memory consistency/memory ordering model (by hardware + by compiler
optimization)
SMP protection by OS/HW.
o Spinlocks
o Atomic operations as the basis of all protection tools (ARM LL/SC
operations); see the spinlock sketch after this list
Debugging tools
Performance
Profiling
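A minimal spinlock sketch built on C11 atomics; on ARM, the test-and-set below compiles down to an LDREX/STREX (LL/SC) loop. The type and function names are illustrative.

    #include <stdatomic.h>

    typedef struct { atomic_flag locked; } spinlock_t;
    /* initialize with: spinlock_t lock = { ATOMIC_FLAG_INIT }; */

    void spin_lock(spinlock_t *l)
    {
        /* atomic test-and-set: spin until we observe the flag clear */
        while (atomic_flag_test_and_set_explicit(&l->locked,
                                                 memory_order_acquire))
            ;   /* busy-wait */
    }

    void spin_unlock(spinlock_t *l)
    {
        atomic_flag_clear_explicit(&l->locked, memory_order_release);
    }

The acquire/release ordering here is exactly the memory-consistency concern listed above.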
Linux SMP Design:
Process affinity
o Each processor has its own runqueue
o A runqueue is the list of all active processes to be scheduled
Load Balancing
o Shifts processes from an overloaded processor to another, symmetric processor
o Part of the scheduler
o Should maintain processor affinity for cache efficiency
Interrupt Affinity
o Requires help from the hardware interrupt distribution system (APIC)
o The APIC can direct an interrupt to only one of the cores
o Linux exposes each interrupt's CPU mask (e.g. /proc/irq/N/smp_affinity) so
that this APIC behavior can be changed (see the sketch below)
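A sketch of steering an interrupt to one core via the /proc interface; the IRQ number 19 is hypothetical, and the mask is a hex CPU bitmask (0x2 = CPU 1 only).

    #include <stdio.h>

    /* Write a CPU bitmask to /proc/irq/<irq>/smp_affinity (needs root). */
    int set_irq_affinity(int irq, unsigned mask)
    {
        char path[64];
        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
        FILE *f = fopen(path, "w");
        if (!f)
            return -1;
        fprintf(f, "%x\n", mask);     /* e.g. 0x2 = deliver to CPU 1 only */
        fclose(f);
        return 0;
    }
    /* usage: set_irq_affinity(19, 0x2);  (IRQ 19 is a made-up example) */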
smp_processor_id
o Returns the CPU identifier on which the current code is executing
Per-CPU variables
o A per-CPU memory region is defined at kernel start, where per-CPU
variables are placed
o Each such variable is associated with a single core
o Defining a variable as per-CPU creates an array of variables, one per CPU
instance (see the sketch below)
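A kernel-side sketch of a per-CPU counter (the names my_counter and count_event are illustrative); it also shows CPU identification via get_cpu():

    #include <linux/kernel.h>
    #include <linux/percpu.h>
    #include <linux/smp.h>

    static DEFINE_PER_CPU(unsigned long, my_counter);  /* one copy per CPU */

    static void count_event(void)
    {
        int cpu = get_cpu();            /* disable preemption, get CPU id */
        per_cpu(my_counter, cpu)++;     /* touch only this CPU's copy */
        pr_info("event %lu on CPU %d\n", per_cpu(my_counter, cpu), cpu);
        put_cpu();                      /* re-enable preemption */
    }

Because each core updates only its own copy, no lock is needed for purely local counting.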
Spinlock
o Disabling preemption and interrupts will not help in an MP environment;
a spinlock is needed (see the sketch below)
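A kernel sketch of the point above, assuming the standard Linux spinlock API; disabling local interrupts alone would not stop another core, but spin_lock_irqsave() does both. The lock and data names are illustrative.

    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(my_lock);
    static int shared_state;

    static void update_state(int v)
    {
        unsigned long flags;
        spin_lock_irqsave(&my_lock, flags);     /* lock + mask local IRQs */
        shared_state = v;                       /* safe against all cores */
        spin_unlock_irqrestore(&my_lock, flags);
    }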
Big-lock
o Introduced in the 2.2 kernel to serialize access across the system
What about BHs (bottom halves)?
o Tasklets are executed on the processor that schedules them (see the
sketch below)
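A sketch using the classic tasklet API (the pre-5.9 DECLARE_TASKLET signature; the names are illustrative); tasklet_schedule() queues the tasklet on the CPU that calls it.

    #include <linux/interrupt.h>

    static void my_tasklet_fn(unsigned long data)
    {
        /* runs later in softirq context, on the CPU that scheduled it */
    }
    static DECLARE_TASKLET(my_tasklet, my_tasklet_fn, 0);

    static irqreturn_t my_irq_handler(int irq, void *dev)
    {
        tasklet_schedule(&my_tasklet);   /* defer work to this same CPU */
        return IRQ_HANDLED;
    }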