1. M. S. Ramaiah School of Advanced Studies 1
CSN2502 ACA Presentation
Anshuman Biswal
PT 2012 Batch, Reg. No.: CJB0412001
M. Sc. (Engg.) in Computer Science and Networking
Module Leader: Padma Priya Dharishini P.
Module Name: Advanced Computer Architecture
Module Code: CSN2502
Array Processor
Marking
Head                                       Maximum Score
Technical Content                          10
Grasp and Understanding                    10
Delivery – Technical and General Aspects   10
Handling Questions                         10
Total                                      40
Presentation Outline
• History of the array processor
• Array processor
• How can an array processor help?
• Array processor classification
• Array processor architecture
• Performance and scalability of the array processor
• Why use an array processor?
• When to use and when not to use an array processor
History of the Array Processor
• Array processor development began in the early 1960s at Westinghouse, in its Solomon project.
• Solomon's goal was to improve math performance by using a large number of math co-processors under the control of a single CPU.
• The CPU fed a single common instruction to all of the arithmetic logic units (ALUs), one per "cycle", but with a different data point for each one to work on.
• This allowed the Solomon machine to apply a single algorithm to a large data set, fed in the form of an array.
• In 1962 Westinghouse cancelled the project, but the effort was restarted at the University of Illinois as the ILLIAC IV.
• The ILLIAC IV was finally delivered in 1972, and its basic design shaped the fastest machines into the 1990s.
Array Processor
• An array processor is a synchronous parallel computer with multiple ALUs, called processing elements (PEs), that can operate in parallel in lock-step fashion.
• It is composed of N identical PEs under the control of a single control unit, plus a number of memory modules.
• Array processors also frequently use a form of parallel computation called pipelining, in which an operation is divided into smaller steps that are performed simultaneously.
• This can greatly improve performance on certain workloads, mainly in numerical simulation.
• These machines appeared in the 1970s and dominated supercomputer design through the 1970s into the 1990s, notably in the various Cray platforms.
• The rapid rise in the price-to-performance ratio of conventional microprocessor designs led to the vector supercomputer's demise in the late 1990s.
How can an array processor help?
• In general terms, CPUs manipulate one or two pieces of data at a time. For instance, most CPUs have an instruction that essentially says "add A to B and put the result in C". The data for A, B and C could, in theory at least, be encoded directly into the instruction. In efficient implementations, however, things are rarely that simple. The data is rarely sent in raw form; instead it is "pointed to" by passing in the address of a memory location that holds the data. Decoding this address and getting the data out of memory takes some time.
• To reduce the time this takes, most modern CPUs use a technique known as instruction pipelining, in which instructions pass through several sub-units in turn.
• Array processors take this concept one step further: instead of pipelining just the instructions, they also pipeline the data itself, which allows significant savings in decoding time.
How can an array processor help? An Example
• Consider the simple task of adding two groups of 10 numbers together. In a normal programming language you might write something like:
  – execute this loop 10 times
  – read the next instruction and decode it
  – fetch this number
  – fetch that number
  – add them
  – put the result here
  – end loop
• To an array processor the same task looks like:
  – read the instruction and decode it
  – fetch these 10 numbers
  – fetch those 10 numbers
  – add them
  – put the results here
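The contrast above can be sketched in plain Python. This is only a simulation of the idea: the "array instruction" is a single operation (`operator.add`) mapped across whole operand arrays, while the scalar version repeats the fetch-and-add step once per element.

```python
from operator import add

a = list(range(10))       # "these 10 numbers":  0..9
b = list(range(10, 20))   # "those 10 numbers": 10..19

# Scalar-CPU style: fetch, decode and execute once per element.
c_loop = []
for i in range(10):
    c_loop.append(a[i] + b[i])

# Array-processor style: one decoded operation broadcast over both arrays.
c_vec = list(map(add, a, b))
```

In real SIMD hardware the second form needs only one instruction fetch and decode for all ten additions, which is exactly the kind of saving listed on the next slide.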
How can an array processor help?
• There are several savings inherent in this approach (based on the example on the previous slide):
  A. Only two address translations are needed.
  B. The instruction is fetched and decoded only once instead of ten times.
  C. The code itself is smaller, which can lead to more efficient memory use.
  D. It improves performance by avoiding pipeline stalls.
Array Processor Classification
• SIMD (Single Instruction, Multiple Data): an array processor with a single-instruction, multiple-data organization.
  – It executes vector instructions by means of multiple functional units responding to a common instruction.
  – Examples: ILLIAC-IV, CM-2 (Connection Machine), MP-1 (MasPar-1), BSP (Burroughs Scientific Processor).
• Attached array processor: an auxiliary processor attached to a general-purpose computer.
  – Its intent is to improve the performance of the host computer in specific numeric calculation tasks.
Array Processor Architecture - SIMD
• SIMD has two basic configurations:
  – a. Array processors using RAM, also known as the dedicated memory organization
    • ILLIAC-IV, CM-2, MP-1
  – b. Associative processors using content-addressable memory, also known as the global memory organization
    • BSP
SIMD Architecture – Array Processor using RAM
• Here we have a control unit (CU) and multiple synchronized PEs.
• The CU controls all the PEs below it.
• The CU decodes all the instructions given to it and decides where each decoded instruction should be executed.
• Vector instructions are broadcast to all the PEs; this broadcasting exploits spatial parallelism through the duplicated PEs.
• Scalar instructions are executed directly inside the CU.
[Diagram: host computer feeding the CU, which drives the array of PEs]
SIMD Architecture – Array Processor using RAM
Control Unit
• A simple CPU
• Can execute instructions without PE intervention
• Coordinates all PEs
• 64 64-bit registers, D0–D63
• 4 64-bit accumulators, A0–A3
• Operations:
  – integer operations
  – shifts
  – Boolean operations
  – loop control
  – memory indexing
[Diagram: the CU's ALU with registers D0–D63 and accumulators A0–A3]
SIMD Architecture – Array Processor using RAM
Processing Element
• A PE consists of an ALU with working registers and a local memory PMEMi, which is used to store distributed data.
• All PEs perform the same function synchronously, in lock-step fashion, under the supervision of the CU.
• Before execution in a PE, the vector instructions must be loaded into its PMEM.
• Data can be loaded into the PMEM from an external source or by the CU.
• When executing an instruction, not all PEs have to work; only the enabled PEs do. Several masking schemes can be used to enable and disable a PE during the execution of an instruction.
• A PE contains the following registers:
  – 64-bit registers: A (accumulator), B (second operand for binary operations), R (routing register for inter-PE communication), S (status register)
  – X: 16-bit index register for PMEM
  – D: 8-bit mode register
• Communication:
  – PMEM is accessible only from the local PE
  – PEs communicate amongst themselves through R
[Diagram: PEi with its ALU, registers, local memory PMEMi, and routing links to PEi-1, PEi+1, PEi-8 and PEi+8]
SIMD Architecture – Array Processor using RAM
Interconnection Network and Host Computer
• IN: all communication between PEs goes through the interconnection network, which performs all the routing and manipulation functions. The interconnection network is under the control of the CU.
• Host computer: the array processor is interfaced to the host computer, which handles resource management and peripheral and I/O supervision.
SIMD Architecture – Masking and Data Routing Organization
[Diagram: PEi (for i = 0, 1, …, N-1) with registers Ai, Bi, Di, Ii, Ri, Si, Xi, its ALU, and memory PEMi; connections to the other PEs via the interconnection network and to the CU]
SIMD Architecture – Masking and Data Routing Organization
• One PE is connected to another PE via its routing register R.
• When one PE communicates with another, it is the contents of the R register that are transferred.
• All inputs and outputs go through this register; the inputs and outputs are isolated by master-slave flip-flops.
• The D register is the address register; it stores the 8-bit address of the PE.
• During an instruction cycle only the enabled PEs accept the operands sent to them, while the other PEs discard theirs. For an enabled PE the status register S = 1; for a masked PE, S = 0.
• A is the accumulator, and B holds the second operand of binary operations.
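The enable/mask behaviour of the S register can be sketched with a toy simulation (illustrative only, not the exact ILLIAC IV scheme): the CU broadcasts "A <- A + B", but only PEs whose status bit is 1 apply it, while masked PEs keep their accumulator unchanged.

```python
# Per-PE registers for four hypothetical PEs.
A = [5, 5, 5, 5]   # accumulator of each PE
B = [1, 2, 3, 4]   # second operand of each PE
S = [1, 0, 1, 0]   # status register: 1 = enabled, 0 = masked

# The CU broadcasts "A <- A + B"; masked PEs discard the operation.
A = [a + b if s == 1 else a for a, b, s in zip(A, B, S)]
```

After the broadcast, only PE0 and PE2 have updated accumulators; PE1 and PE3 still hold their old values.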
SIMD Architecture – Associative Processor using Content-Addressable Memory
• In this configuration the PEs do not have private memories. The memories attached to the PEs are replaced by parallel memory modules shared by all PEs via an alignment network.
• The alignment network does path switching between the PEs and the parallel memories.
• PE-to-PE communication is also via the alignment network.
• The alignment network is controlled by the CU.
• The number of PEs (N) and the number of memory modules (K) may not be equal; in fact, they are chosen to be relatively prime.
• An alignment network should allow conflict-free access to the shared memories by as many PEs as possible.
[Diagram: host computer, CU, PEs and shared memory modules connected through the alignment network]
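Why choose N and K relatively prime? A toy sketch (the modulo mapping below is a simple skewed-storage scheme, assumed here purely for illustration): when the PEs stride through addresses and the stride is coprime to K, every PE lands in a different memory module, so the access is conflict-free.

```python
from math import gcd

N, K = 8, 17                 # 8 PEs, 17 memory modules; gcd(8, 17) = 1
stride = 4                   # access pattern: PE i reads address i * stride
assert gcd(stride, K) == 1   # stride coprime to the module count

# Memory module holding each PE's operand under the skewed mapping.
modules = [(i * stride) % K for i in range(N)]
assert len(set(modules)) == N   # all distinct: no two PEs collide
```

With K a multiple of the stride instead (say K = 8, stride = 4), several PEs would map to the same module and the accesses would serialize.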
Attached Array Processor
• In this configuration the attached array processor has an input/output interface to the common processor and another interface to a local memory.
• The local memory connects to the main memory through a high-speed memory bus.
Performance and Scalability of the Array Processor
Why use the array processor?
• The principal reason for using an array processor is speed.
• The design of most array processors optimizes performance for repetitive arithmetic operations, making them much faster at vector arithmetic than the host CPU. Since most array processors operate asynchronously from the host CPU, they act as co-processors that increase the capacity of the system.
• The second advantage is that the AP has its own local memory. On systems with limited physical memory or address space, this can be an important consideration.
When to use and when not to use the array processor?
• The AP (array processor) is most efficient at repetitive operations such as FFTs and multiplying large vectors. Its efficiency degrades for non-repetitive operations, or for operations requiring a great number of decisions based on the results of computations.
• Since APs have their own program and data memory, the AP instructions and data must be transferred to, and the results transferred from, the AP. These I/O operations may cost more CPU time than the amount saved by using the array processor.
• As a general rule, the AP is more efficient than the CPU when multiple or complex operations (such as FFTs) that are highly repetitive are performed on relatively large amounts of data (thousands of words or more). In other cases the AP will not help much and will keep other processes from using a valuable resource.
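The transfer-cost trade-off above can be made concrete with a back-of-the-envelope model (all numbers below are made up for illustration): suppose the AP is ten times faster per word but charges a fixed I/O cost to move instructions and data across. It then only wins past a break-even data size.

```python
def host_time(n, t_op=1.0):
    """Time for the host CPU to process n words, t_op per word."""
    return n * t_op

def ap_time(n, t_op=1.0, speedup=10.0, transfer=5000.0):
    """Time for the AP: fixed transfer cost plus faster per-word work."""
    return transfer + n * t_op / speedup

small, large = 100, 100_000   # "a few words" vs "thousands of words or more"
```

For the small job the fixed transfer cost dominates and the host CPU finishes first; for the large job the repetitive arithmetic dominates and the AP wins comfortably.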
Conclusion
• Though an array processor can improve performance, not all problems can be attacked with this sort of solution. Instructions that process an array of data at a time necessarily add complexity to the core CPU. That complexity typically makes other instructions run slower, and the more complex instructions also add to the complexity of the decoders, which might slow down the decoding of common instructions such as normal addition.
• So array processors work best only when there are large amounts of data to be worked on. For this reason, these sorts of CPUs were found primarily in supercomputers, which were themselves generally found in places such as weather prediction centres and physics labs, where huge amounts of data are "crunched".
• This architecture relies on all the data sets being acted on by a single instruction. If the data sets depend on each other, however, parallel processing cannot be applied. For example, if data A has to be processed before data B, then A and B cannot be done simultaneously. This dependency is what makes parallel processing difficult to implement, and it is why sequential machines remain extremely common.
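The dependency argument can be seen in a few lines of Python: an elementwise add touches each index independently, so a SIMD machine could compute every element at once, while a running sum written this way forces the "A before B" ordering because every step needs the previous result.

```python
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

# Independent: c[i] uses only a[i] and b[i]; all i could run in parallel.
c = [x + y for x, y in zip(a, b)]

# Dependent: p[i] needs p[i-1], so the iterations of this loop cannot
# run in lock step.
p = [a[0]]
for i in range(1, len(a)):
    p.append(p[-1] + a[i])   # running (prefix) sum
```

The second loop as written serializes completely; restructuring such recurrences for parallel hardware takes different algorithms entirely, which is exactly the difficulty the slide describes.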