Wiener Filter Realization using Hardware:
QR decomposition of matrices and inversion
by Givens’ Rotation
***************************************
7th Semester Project Report
Akashdip Das
Abantika Chowdhury
Sayan Chaudhuri
Guide : Dr. Ayan Banerjee
Electronics and Telecommunication Engineering Department
December, 2016
Contents

1 Abstract
2 Introduction
3 Wiener Filtering
4 Q-R decomposition of a matrix
5 Hardware for inversion of an upper triangular matrix (R)
  5.1 Storage in a RAM
  5.2 Address generation mechanism
  5.3 Hardware for finding the inverse of diagonal elements
  5.4 Hardware for finding the inverse of the other elements
6 Conclusions
  6.1 Multi-port RAM for faster performance
  6.2 Distributed arithmetic for computing the product of the two matrices
7 Acknowledgements
1 Abstract

Super-resolution reconstruction is a method for reconstructing higher resolution images from a set of low resolution observations. The sub-pixel differences among different observations of the same scene allow higher resolution images of better quality to be created. In the last thirty years, many methods for creating high resolution images have been proposed, but hardware implementations of such methods remain limited. Wiener filter design is one of the techniques we use initially for this process, and it involves matrix inversion. A novel method for matrix inversion is proposed in this report, based on QR decomposition computed using Givens Rotation.
2 Introduction

The process of super resolution initially requires that the image be restored from the effects of noise and degradation (assumed isotropic). For that purpose the Wiener filter is used, which forms an estimate of the original image from the degraded one. The fundamentals of Wiener filtering are discussed in Section 3. Wiener filtering requires the inverse of a given matrix; the method followed here is QR decomposition (discussed in Section 4). QR decomposition produces an upper triangular matrix, which we invert in the proposed algorithm. Various techniques for decomposition of the matrix have been discussed in [3] and [4]. However, the matrix inversion they propose is not a general solution to the problem; rather, the solution was illustrated for a specific 3x3 system. QR decomposition factors a matrix into an upper triangular matrix and an orthogonal matrix. The inverse of the orthogonal matrix is simply obtained by computing its transpose. The inversion of the upper triangular matrix is discussed in this report. The solutions available for this step cover only 3x3 or 4x4 systems, so in this report we generalize the inversion to an nxn system. The hardware required for this purpose is developed in Section 5, along with reasoning and justification. The hardware that has been developed has scope for enhanced performance, which is discussed in Section 6.
3 Wiener Filtering

In signal processing, the Wiener filter is a filter used to produce an estimate of a desired or target random process by linear time-invariant (LTI) filtering of an observed noisy process, assuming known stationary signal and noise spectra, and additive noise. The Wiener filter minimizes the mean square error between the estimated random process and the desired process. The goal of the Wiener filter is to compute a statistical estimate of an unknown signal using a related signal as an input and filtering that known signal to produce the estimate as an output. For example, the known signal might consist of an unknown signal of interest that has been corrupted by additive noise. The Wiener filter can be used to filter out the noise from the corrupted signal to provide an estimate of the underlying signal of interest. The Wiener filter is based on a statistical approach built around the MMSE (Minimum Mean Square Error) criterion. The causal finite impulse response (FIR) Wiener filter, instead of using some given data matrix X and output vector Y, finds optimal tap weights by using the statistics of the input and output signals. It populates the input matrix X with estimates of the auto-correlation of the input signal (T) and populates the output vector Y with estimates of the cross-correlation between the output and input signals (V).

In order to derive the coefficients of the Wiener filter, consider the signal w[n] being fed to a Wiener filter of order N with coefficients {a_0, ..., a_N}. The output of the filter is denoted x[n] and is given by

\[
x[n] = \sum_{i=0}^{N} a_i w[n-i].
\]

The residual error is denoted e[n] and is defined as e[n] = x[n] - s[n]. The Wiener filter is designed so as to minimize the mean square error (MMSE criterion), which can be stated concisely as

\[
a_i = \arg\min E[e^2[n]],
\]

where E[·] denotes the expectation operator. In the general case, the coefficients a_i may be complex and may be derived for the case where w[n] and s[n] are complex as well. With a complex signal, the matrix to be solved is a Hermitian Toeplitz matrix rather than a symmetric Toeplitz matrix. For simplicity, the following considers only the case where all these quantities are real. The mean square error (MSE) may be rewritten as:
\[
E[e^2[n]] = E[(x[n] - s[n])^2] = E[x^2[n]] + E[s^2[n]] - 2E[x[n]s[n]]
\]
\[
= E\Big[\Big(\sum_{i=0}^{N} a_i w[n-i]\Big)^2\Big] + E[s^2[n]] - 2E\Big[\sum_{i=0}^{N} a_i w[n-i]\, s[n]\Big]
\]
To find the vector [a_0, ..., a_N] which minimizes the expression above, calculate its derivative with respect to each a_i:

\[
\frac{\partial}{\partial a_i} E[e^2[n]] = \frac{\partial}{\partial a_i}\Big( E\Big[\Big(\sum_{i=0}^{N} a_i w[n-i]\Big)^2\Big] + E[s^2[n]] - 2E\Big[\sum_{i=0}^{N} a_i w[n-i]\, s[n]\Big] \Big)
\]
\[
= 2E\Big[\Big(\sum_{j=0}^{N} a_j w[n-j]\Big) w[n-i]\Big] - 2E[s[n]w[n-i]]
\]
\[
= 2\sum_{j=0}^{N} E[w[n-j]w[n-i]]\, a_j - 2E[w[n-i]s[n]]
\]
Assuming that w[n] and s[n] are each stationary and jointly stationary, the sequences R_w[m] and R_{ws}[m], known respectively as the autocorrelation of w[n] and the cross-correlation between w[n] and s[n], can be defined as follows:

\[
R_w[m] = E\{w[n]\, w[n+m]\}, \qquad R_{ws}[m] = E\{w[n]\, s[n+m]\}
\]
The derivative of the MSE may therefore be rewritten as (notice that R_{ws}[-i] = R_{sw}[i])

\[
\frac{\partial}{\partial a_i} E[e^2[n]] = 2\sum_{j=0}^{N} R_w[j-i]\, a_j - 2R_{sw}[i], \qquad i = 0, ..., N.
\]

Letting the derivative be equal to zero results in

\[
\sum_{j=0}^{N} R_w[j-i]\, a_j = R_{sw}[i], \qquad i = 0, ..., N,
\]

which can be rewritten in matrix form

\[
\underbrace{\begin{bmatrix}
R_w[0] & R_w[1] & \cdots & R_w[N] \\
R_w[1] & R_w[0] & \cdots & R_w[N-1] \\
\vdots & \vdots & \ddots & \vdots \\
R_w[N] & R_w[N-1] & \cdots & R_w[0]
\end{bmatrix}}_{T}
\underbrace{\begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_N \end{bmatrix}}_{a}
=
\underbrace{\begin{bmatrix} R_{sw}[0] \\ R_{sw}[1] \\ \vdots \\ R_{sw}[N] \end{bmatrix}}_{v}
\]
These equations are known as the Wiener–Hopf equations. The matrix T appearing in the equation is a symmetric Toeplitz matrix. Under suitable conditions on R_w, these matrices are known to be positive definite and therefore non-singular, yielding a unique solution for the Wiener filter coefficient vector:

\[
a = T^{-1} v
\]

It is this equation that makes it necessary to design matrix inversion hardware that is faster than the existing ones, so that there is less delay in image processing, and that generalizes to the NxN case. In this report the inversion of the matrix is done using QR decomposition via Givens Rotation.
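As a quick numerical illustration of the Wiener–Hopf normal equations above, the sketch below solves a two-tap (N = 1) system T a = v directly; the correlation values are made-up placeholders, not data from this project.

```python
# Solve T a = v for a two-tap Wiener filter (N = 1).
# The correlation values below are illustrative placeholders.
Rw = [1.0, 0.5]          # autocorrelation of the input w[n]: Rw[0], Rw[1]
Rsw = [0.8, 0.3]         # cross-correlation: Rsw[0], Rsw[1]

# Symmetric Toeplitz system from the Wiener-Hopf equations.
T = [[Rw[0], Rw[1]],
     [Rw[1], Rw[0]]]
v = [Rsw[0], Rsw[1]]

# Closed-form inverse of the 2x2 matrix T gives a = T^{-1} v.
det = T[0][0] * T[1][1] - T[0][1] * T[1][0]
a = [( T[1][1] * v[0] - T[0][1] * v[1]) / det,
     (-T[1][0] * v[0] + T[0][0] * v[1]) / det]
```

Multiplying T by the resulting vector a reproduces v, confirming that the coefficients satisfy the normal equations.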
4 Q-R decomposition of a matrix

QR decomposition is one of the most important operations in linear algebra. It can be used to find a matrix inverse, to solve a set of simultaneous equations, and in numerous applications in scientific computing. It represents one of the relatively small number of matrix operation primitives from which a wide range of algorithms can be realized. QR decomposition is an elementary operation which decomposes a matrix into an orthogonal and a triangular matrix. The QR decomposition of a real square matrix A is a factorization A = QR, where Q is an orthogonal matrix (Q^T Q = I) and R is an upper triangular matrix. More generally, an m x n matrix (with m >= n) of full rank can be factored as the product of an m x n matrix with orthonormal columns (Q^T Q = I) and an n x n upper triangular matrix. There are different methods which can be used to compute the QR decomposition: the Gram-Schmidt ortho-normalization method, Householder reflections, and Givens rotations. Each decomposition method has a number of advantages and disadvantages because of its specific solution process. The Givens Rotation technique is discussed here.

If there are two nonzero vectors x and y in a plane, the angle θ between them satisfies

\[
\cos\theta = \frac{(x, y)}{\|x\|_2 \|y\|_2}.
\]

The rotation will be performed using a 16-bit pipelined CORDIC. This formula can be extended to n dimensions; the angle θ can be defined as
\[
\theta = \arccos\frac{(x, y)}{\|x\|_2 \|y\|_2}
\]

Two basic identities used below are

\[
((A^{-1})^{-1}) = A
\]

and A = QR, where R is an upper triangular matrix and Q is an orthogonal matrix, so that

\[
I = Q Q^T
\]
Consider a 4x4 system:

\[
A = \begin{bmatrix}
a_{1,1} & a_{1,2} & a_{1,3} & a_{1,4} \\
a_{2,1} & a_{2,2} & a_{2,3} & a_{2,4} \\
a_{3,1} & a_{3,2} & a_{3,3} & a_{3,4} \\
a_{4,1} & a_{4,2} & a_{4,3} & a_{4,4}
\end{bmatrix}
\qquad
R = \begin{bmatrix}
a_{1,1} & a_{1,2} & a_{1,3} & a_{1,4} \\
0 & a_{2,2} & a_{2,3} & a_{2,4} \\
0 & 0 & a_{3,3} & a_{3,4} \\
0 & 0 & 0 & a_{4,4}
\end{bmatrix}
\]
The matrix of a Givens rotation (here acting on rows 2 and 3 of the 4x4 system) is

\[
G(i, j, \theta) = \begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & \cos\theta & \sin\theta & 0 \\
0 & -\sin\theta & \cos\theta & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
\]
The Givens Rotation process uses a cycle of rotations, each of which nulls one element below the diagonal of the matrix, eventually forming the R matrix of the QR factorization. The Q matrix is obtained by concatenating all the Givens rotations. For a 3x3 system, R is found from three rotations, each of which nulls one element. The Givens rotation matrices needed for a 3x3 system are
\[
G_1 = \begin{bmatrix} \cos\theta_1 & 0 & \sin\theta_1 \\ 0 & 1 & 0 \\ -\sin\theta_1 & 0 & \cos\theta_1 \end{bmatrix}
\quad
G_2 = \begin{bmatrix} \cos\theta_2 & \sin\theta_2 & 0 \\ -\sin\theta_2 & \cos\theta_2 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\quad
G_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta_3 & \sin\theta_3 \\ 0 & -\sin\theta_3 & \cos\theta_3 \end{bmatrix}
\]
The rotation parameters c_k = cos θ_k and s_k = sin θ_k used to null A(3,1), A(2,1) and A(3,2) can be obtained from

\[
c_1 = \frac{A_1(1,1)}{\sqrt{A_1(3,1)^2 + A_1(1,1)^2}}, \qquad s_1 = \frac{A_1(3,1)}{\sqrt{A_1(3,1)^2 + A_1(1,1)^2}}
\]
\[
c_2 = \frac{A_1(1,1)}{\sqrt{A_1(2,2)^2 + A_1(3,2)^2}}, \qquad s_2 = \frac{A_1(2,1)}{\sqrt{A_1(2,1)^2 + A_1(1,1)^2}}
\]
\[
c_3 = \frac{A_1(1,1)}{\sqrt{A_1(2,2)^2 + A_1(3,2)^2}}, \qquad s_3 = \frac{A_1(3,2)}{\sqrt{A_1(2,2)^2 + A_1(3,2)^2}}
\]
\[
Q = G_1^T G_2^T G_3^T
\]
\[
A_2 = G_1 A_1, \qquad A_3 = G_2 A_2, \qquad R = G_3 A_3
\]
\[
A = QR, \qquad A^{-1} = (QR)^{-1} = R^{-1} Q^{-1} = R^{-1} Q^T
\]

This necessitates forming the inverse of the upper triangular matrix and subsequently multiplying it by the transpose of the orthogonal matrix.
Figure 1: Basic hardware for matrix inversion using QR decomposition. The G matrix is formed using Givens Rotation performed using CORDIC.
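To make the 3x3 flow above concrete, the following pure-Python sketch performs a Givens-rotation QR decomposition of a small example matrix and checks that A = QR. It is an illustrative software model under floating-point arithmetic, not the CORDIC hardware described here.

```python
import math

def givens_qr(A):
    """QR decomposition by Givens rotations: each rotation nulls one
    sub-diagonal entry; accumulating the rotations on I gives Q^T."""
    n = len(A)
    R = [row[:] for row in A]                                   # becomes upper triangular
    QT = [[float(i == j) for j in range(n)] for i in range(n)]  # accumulates G3 G2 G1
    for j in range(n - 1):                  # column being cleared
        for i in range(j + 1, n):           # entry R[i][j] to be nulled
            r = math.hypot(R[j][j], R[i][j])
            if r == 0.0:
                continue
            c, s = R[j][j] / r, R[i][j] / r
            for k in range(n):              # rotate rows j and i of R and of Q^T
                R[j][k], R[i][k] = c * R[j][k] + s * R[i][k], \
                                   -s * R[j][k] + c * R[i][k]
                QT[j][k], QT[i][k] = c * QT[j][k] + s * QT[i][k], \
                                     -s * QT[j][k] + c * QT[i][k]
    Q = [[QT[j][i] for j in range(n)] for i in range(n)]        # Q = (G3 G2 G1)^T
    return Q, R

A = [[2.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
Q, R = givens_qr(A)
```

Once R is available, A^{-1} = R^{-1} Q^T follows by the triangular inversion described in Section 5.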
5 Hardware for inversion of an upper triangular matrix (R)

We have designed the hardware for inversion of a generalised N x N upper triangular matrix R, where

\[
R = \begin{bmatrix}
r_{1,1} & r_{1,2} & \cdots & r_{1,n} \\
0 & r_{2,2} & \cdots & r_{2,n} \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & r_{n,n}
\end{bmatrix}
\]
Let B be R^{-1}. The algorithm is as follows:

for (row = 1; row <= n; row++)
    B(row, row) = 1 / R(row, row)
for (row = 1; row <= n; row++)
    for (col = row + 1; col <= n; col++)
        s = 0
        for (k = 1; k <= col - 1; k++)
            s = s + B(row, k) * R(k, col)
        s = -s / R(col, col)
        B(row, col) = s
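A direct software rendering of the pseudocode above (0-based indices; a sketch for checking the recurrence, not the hardware itself):

```python
def invert_upper_triangular(R):
    """Invert an upper-triangular matrix R following the algorithm above.
    Diagonal entries of B are reciprocals; each off-diagonal entry comes
    from forward substitution on the identity B * R = I."""
    n = len(R)
    B = [[0.0] * n for _ in range(n)]
    for row in range(n):
        B[row][row] = 1.0 / R[row][row]
    for row in range(n):
        for col in range(row + 1, n):
            s = sum(B[row][k] * R[k][col] for k in range(col))
            B[row][col] = -s / R[col][col]
    return B

R = [[2.0, 1.0, 3.0],
     [0.0, 4.0, 5.0],
     [0.0, 0.0, 1.0]]
B = invert_upper_triangular(R)
```

Multiplying B by R gives the identity matrix, confirming B = R^{-1}.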
We observe that the inverse of an upper triangular matrix is also an upper triangular matrix, with diagonal elements equal to the reciprocals of the diagonal elements of the original matrix. The inverses of the other elements are calculated recursively using the algorithm mentioned above. An example to illustrate how the algorithm works is shown below. Let A be an upper triangular matrix and B be its inverse:

\[
A = \begin{bmatrix}
a_{1,1} & a_{1,2} & a_{1,3} & \cdots & a_{1,n} \\
0 & a_{2,2} & a_{2,3} & \cdots & a_{2,n} \\
0 & 0 & a_{3,3} & \cdots & a_{3,n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & a_{n,n}
\end{bmatrix}
\qquad
B = \begin{bmatrix}
b_{1,1} & b_{1,2} & b_{1,3} & \cdots & b_{1,n} \\
0 & b_{2,2} & b_{2,3} & \cdots & b_{2,n} \\
0 & 0 & b_{3,3} & \cdots & b_{3,n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & b_{n,n}
\end{bmatrix}
\]

Since AB = I,
\[
\begin{bmatrix}
a_{1,1} & a_{1,2} & a_{1,3} & \cdots & a_{1,n} \\
0 & a_{2,2} & a_{2,3} & \cdots & a_{2,n} \\
0 & 0 & a_{3,3} & \cdots & a_{3,n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & a_{n,n}
\end{bmatrix}
\begin{bmatrix}
b_{1,1} & b_{1,2} & b_{1,3} & \cdots & b_{1,n} \\
0 & b_{2,2} & b_{2,3} & \cdots & b_{2,n} \\
0 & 0 & b_{3,3} & \cdots & b_{3,n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & b_{n,n}
\end{bmatrix}
=
\begin{bmatrix}
1 & 0 & 0 & \cdots & 0 \\
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1
\end{bmatrix}
\]

Multiplying the i-th row of matrix A with the i-th column of B yields a_{i,i} b_{i,i} = 1. Hence we see that b_{i,i} = 1/a_{i,i}.
Now to solve for the non-diagonal elements of matrix B: we first multiply the first row of A by the second column of B to get a_{1,1} b_{1,2} + a_{1,2} b_{2,2} = 0. We already know the value of b_{2,2}, so the only unknown is b_{1,2}. In general, to obtain the value of b_{i,j}, we multiply the i-th row of A by the j-th column of B and equate the product to 0, proceeding in a proper sequence of steps so that the values of b needed for the forward substitution have already been obtained.
5.1 Storage in a RAM

In any n x n matrix the total number of elements is n^2. In the upper triangular matrix generated here the number of elements that can be non-zero is n(n+1)/2, since the n(n-1)/2 elements in the bottom-left triangle are zero. So, to minimise hardware, we have come up with an algorithm that omits storage of these zeros in the RAM. If the zeros were not omitted, the position of the element r_{i,j} would be j + (i-1)n. Since this is not the case, we are required to develop an algorithm that generates the RAM location address for given i, j and n.
5.2 Address generation mechanism

Since r_{i,j} = 0 for i > j in the upper triangular matrix, there is no need to store these zeros individually in the RAM. Instead we omit them and find the RAM location corresponding to the inputs (i, j): given r_{i,j}, a corresponding location in the RAM is obtained. In our mechanism, where zeros are not stored, the address in the RAM for r_{i,j} is

\[
n(i-1) + j - \frac{i(i-1)}{2} - 1.
\]

This formula is obtained from the fact that in the unpacked layout the address of the element r_{i,j} would be j + (i-1)n, but here row k omits its k-1 leading zeros; the cumulative number of zeros omitted before r_{i,j} is therefore \sum_{k=1}^{i-1}(k-1) + (i-1) = i(i-1)/2.
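The address formula can be sanity-checked in a few lines of Python (a software sketch of the address generation block, with 1-based i and j as in the text):

```python
def ram_address(i, j, n):
    """0-based packed-RAM address of r[i][j] (1-based, i <= j) when the
    lower-triangle zeros of an n x n upper-triangular matrix are omitted."""
    return n * (i - 1) + j - i * (i - 1) // 2 - 1

# The addresses enumerate the stored elements consecutively, row by row.
n = 4
addresses = [ram_address(i, j, n) for i in range(1, n + 1)
             for j in range(i, n + 1)]
```

For n = 4 the n(n+1)/2 = 10 stored elements map onto addresses 0 through 9 with no gaps.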
Figure 2: Block diagram of the address generation block
Figure 3: Circuit diagram of the address generation block
Hardware required:
4 adders/subtractors
2 multipliers
1 one-bit right shifter
5.3 Hardware for finding the inverse of diagonal elements

The following circuit (Figure 4) can be used for inversion of the diagonal elements of the upper triangular matrix. The circuit consists of a loadable up counter that counts up to the number of rows in the matrix; a comparator indicates that the process must stop when the value n is reached. The circuit sends the counter value to the address generator block of RAM A, and the same address is sent to RAM B so that the data is modified in the same location in both RAM A and RAM B.

Hardware required:
1 loadable up counter
1 comparator
1 inverter block that computes the inverse of a 16-bit number

Time required: n clock pulses
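In software terms, the pass implemented by this circuit amounts to the following sketch (the packed RAMs are modelled as Python lists, and the address formula of Section 5.2 is restated here for self-containment):

```python
def ram_address(i, j, n):
    """0-based packed address of r[i][j] (1-based, i <= j), zeros omitted."""
    return n * (i - 1) + j - i * (i - 1) // 2 - 1

def invert_diagonal_pass(ram_a, n):
    """One counter sweep: read r[i][i] from RAM A and write 1/r[i][i] to
    the same packed address in RAM B -- n steps, one per clock pulse."""
    ram_b = [0.0] * len(ram_a)
    for row in range(1, n + 1):          # loadable up counter: 1 .. n
        addr = ram_address(row, row, n)
        ram_b[addr] = 1.0 / ram_a[addr]
    return ram_b

# Packed storage of a 3x3 upper-triangular matrix, row by row.
ram_a = [2.0, 1.0, 3.0,   # r11 r12 r13
              4.0, 5.0,   # r22 r23
                   1.0]   # r33
ram_b = invert_diagonal_pass(ram_a, 3)
```

After the pass, RAM B holds the reciprocals at the diagonal addresses and zeros elsewhere, ready for the off-diagonal stage of Section 5.4.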
Figure 4: Schematic hardware design for inversion of diagonal elements
5.4 Hardware for finding the inverse of the other elements

The following circuit (Figure 5) can be used for computing the inverse entries corresponding to all elements other than the diagonal elements.

Hardware required:
3 loadable up counters
4 address generation blocks
1 divider
1 multiplier
4 adders/subtractors
1 register
Necessary control circuits for termination of loops

Number of clock cycles needed: O(n^2)
Figure 5: Schematic hardware design for inversion of elements other than those lying in the principal diagonal
6 Conclusions

6.1 Multi-port RAM for faster performance

One of the obstacles in the way of obtaining high performance in computing is the memory wall. If the processing elements cannot get data from the register file (RF) at the processing rate, a bottleneck arises that adversely affects overall performance. In order to supply data properly to the computational units, such a computation system needs a register file that can meet the requirements of the different computing units on the FPGA. The demand to process more data per unit time requires multiple read and write operations at a time, which can be achieved by using multi-port register files (MPo-RFs) instead of conventional single-port RFs (SPo-RFs). Multi-ported memories are challenging to implement on FPGAs since the block RAMs included in the fabric typically have only two ports. Hence we must construct memories requiring more than two ports either out of logic elements or by combining multiple block RAMs. Some conventional multi-port register file implementations that can be used:
1. Distributed Memory
2. Replication
3. Banking
4. Multi-pumping
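As a toy illustration of the replication scheme (option 2), the Python model below keeps two identical copies of the memory so that one write port can coexist with two independent read ports. This is a behavioural sketch, not an FPGA primitive, and the class and method names are our own.

```python
class ReplicatedRAM:
    """1-write / 2-read memory built from two identical banks:
    every write updates both copies, each copy serves one read port."""
    def __init__(self, depth):
        self.banks = [[0] * depth, [0] * depth]

    def write(self, addr, data):
        # The single write port fans out to both replicas.
        for bank in self.banks:
            bank[addr] = data

    def read(self, addr0, addr1):
        # Two reads in the same cycle, one served by each replica.
        return self.banks[0][addr0], self.banks[1][addr1]

ram = ReplicatedRAM(16)
ram.write(3, 5)
ram.write(7, 9)
```

Here `ram.read(3, 7)` returns both stored values in one access; the cost of replication is the doubled storage, traded for the extra read port.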
6.2 Distributed arithmetic for computing the product of the two matrices

Distributed arithmetic is a technique developed for the real-time computation of the inner product of a vector with constant elements and a vector with varying coefficients. The inner product is computed without splitting it into separate multiplication and addition operations. During calculation, the operations performed are summation and shifting of partial inner products of the unchanging vector with bit-slices of the changing vector. All possible values of these partial inner products are precomputed and written into a Look-Up Table (LUT). Here the contents of the LUT are computed online: they remain unchanged for the period of multiplication of the left matrix by one column of the right matrix. Despite the need to compute the LUT contents, the total number of addition micro-operations decreases in comparison with the classical way of calculating a matrix product.

Figure 6: 4 Read + 1 Write block RAM as an example of Multiport RAM
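A behavioural Python sketch of the distributed-arithmetic inner product (unsigned inputs and an illustrative 4-bit width; the LUT indexing scheme shown is one common choice, not necessarily the one used in [10]):

```python
def da_inner_product(const_vec, x_vec, bits=4):
    """Distributed-arithmetic inner product (unsigned sketch).
    The LUT holds every partial sum of const_vec selected by a bit-slice;
    the result is accumulated by shift-and-add, MSB first."""
    n = len(const_vec)
    # Precompute all 2^n partial sums: entry m sums const_vec[i] where bit i of m is set.
    lut = [sum(const_vec[i] for i in range(n) if (m >> i) & 1)
           for m in range(1 << n)]
    acc = 0
    for b in range(bits - 1, -1, -1):     # iterate bit positions, MSB first
        slice_bits = 0
        for i in range(n):                # gather bit b of every x element
            slice_bits |= ((x_vec[i] >> b) & 1) << i
        acc = (acc << 1) + lut[slice_bits]   # shift and add one LUT lookup
    return acc
```

Each of the `bits` iterations costs one LUT lookup and one addition, replacing the n multiplications of the classical inner product.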
7 Acknowledgements

The authors would like to thank their project guide, Dr. Ayan Banerjee, for his invaluable suggestions and proper direction throughout the course of the project. Thankfulness and heartfelt gratitude are also extended to Mr. Anirban Chakraborty, who is currently pursuing his Ph.D. under the guidance of Prof. Ayan Banerjee.
References

[1] Gonzalez, R. C.; Woods, R. E. (2002). Digital Image Processing. Upper Saddle River, NJ: Prentice Hall.

[2] Seyid, K.; Blanc, S.; Leblebici, Y. (2015). "Hardware Implementation of Real-Time Multiple Frame Super-Resolution," Very Large Scale Integration (VLSI-SoC), 2015 IFIP/IEEE International Conference on.

[3] Chisty, Nafiz Ahmed. "Matrix Inversion Using QR Decomposition by Parabolic Synthesis."

[4] Brown, Robert Grover; Hwang, Patrick Y. C. (1996). Introduction to Random Signals and Applied Kalman Filtering (3rd ed.). New York: John Wiley & Sons. ISBN 0-471-12839-2.

[5] Boulfelfel, D.; Rangayyan, R. M.; Hahn, L. J.; Kloiber, R. (1994). "Three-dimensional restoration of single photon emission computed tomography images," IEEE Transactions on Nuclear Science, 41(5): 1746-1754, October 1994.

[6] Wiener, Norbert (1949). Extrapolation, Interpolation, and Smoothing of Stationary Time Series. New York: Wiley. ISBN 0-262-73005-7.

[7] Kailath, Thomas; Sayed, Ali H.; Hassibi, Babak (2000). Linear Estimation. Prentice-Hall, NJ. ISBN 978-0-13-022464-4.

[8] Wiener, N. "The interpolation, extrapolation and smoothing of stationary time series," Report of the Services 19, Research Project DIC-6037, MIT, February 1942.

[9] Kolmogorov, A. N. "Stationary sequences in Hilbert space" (in Russian), Bull. Moscow Univ., 1941, vol. 2, no. 6, 1-40. English translation in Kailath, T. (ed.), Linear Least Squares Estimation, Dowden, Hutchinson & Ross, 1977.

[10] Lesnikov, Vladislav; Naumovich, Tatiana; Chastikov, Alexander. "Modification of the architecture of a distributed arithmetic," East-West Design & Test Symposium (EWDTS), 2015 IEEE, pp. 1-4, 2015.

[11] Lopes, Álvaro (Senior Software Engineer, Critical Software). "Tips & Tricks: Creating a 2W+4R FPGA Block RAM, Part 1."

[12] "An Efficient FPGA Implementation of Scalable Matrix Inversion Core using QR Decomposition."