1. SPEECH SIGNAL PROCESSING
KERALA UNIVERSITY M-TECH 1ST SEMESTER
Lizy Abraham
Assistant Professor
lizytvm@yahoo.com
+919495123331
Department of ECE
LBS Institute of Technology for Women
(A Govt. of Kerala Undertaking)
Poojappura
Trivandrum -695012
Kerala, India
2. SYLLABUS TSC 1004 SPEECH SIGNAL PROCESSING 3-0-0-3
Speech Production: Acoustic theory of speech production (excitation, vocal tract model for speech analysis, formant structure, pitch). Articulatory phonetics (articulation, voicing, articulatory model). Acoustic phonetics (basic speech units and their classification).
Speech Analysis: Short-time speech analysis. Time domain analysis (short-time energy, short-time zero-crossing rate, ACF). Frequency domain analysis (filter banks, STFT, spectrogram, formant estimation & analysis). Cepstral analysis.
Parametric Representation of Speech: AR model, ARMA model. LPC analysis (LPC model, autocorrelation method, covariance method, Levinson-Durbin algorithm, lattice form). LSF, LAR, MFCC, sinusoidal model, GMM, HMM.
Speech Coding: Phase vocoder, LPC, sub-band coding, adaptive transform coding, harmonic coding, vector quantization based coders, CELP.
Speech Processing: Fundamentals of speech recognition, speech segmentation, text-to-speech conversion, speech enhancement, speaker verification, language identification, issues of voice transmission over the Internet.
3. REFERENCE
1. Douglas O'Shaughnessy, Speech Communications: Human and Machine, 2nd ed., IEEE Press, 1999. ISBN 0780334493.
2. Nelson Morgan and Ben Gold, Speech and Audio Signal Processing: Processing and Perception of Speech and Music, John Wiley & Sons, 1999. ISBN 0471351547.
3. Rabiner and Schafer, Digital Processing of Speech Signals, Prentice Hall, 1978.
4. Rabiner and Juang, Fundamentals of Speech Recognition, Prentice Hall, 1994.
5. Thomas F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice, 1st ed., Prentice Hall. ISBN 013242942X.
6. Donald G. Childers, Speech Processing and Synthesis Toolboxes, John Wiley & Sons, 1999. ISBN 0471349593.
For the end-semester exam (100 marks), the question paper shall have six questions of 20 marks each covering the entire syllabus, of which any five shall be answered. It shall have 75% problems and 25% theory. For the internal marks of 50: two tests of 20 marks each, and 10 marks for assignments (minimum two) / term project.
5. SPEECH PROCESSING AND RELATED FIELDS
[Diagram: speech processing draws on algorithms (programming), signal processing (Fourier transforms, discrete-time filters, AR(MA) models, statistical SP, stochastic models), information theory (entropy, communication theory, rate-distortion theory), acoustics (psychoacoustics, room acoustics, speech production), and phonetics.]
8. HOW IS SPEECH PRODUCED?
Speech can be defined as "an acoustic pressure signal that is articulated in the vocal tract."
Speech is produced when air is forced from the lungs through the vocal cords and along the vocal tract.
9. This air flow is referred to as the "excitation signal".
The excitation signal causes the vocal cords to vibrate and propagates the energy to excite the oral and nasal openings, which play a major role in shaping the sound produced.
Vocal tract components:
– Oral tract: from the lips to the vocal cords.
– Nasal tract: from the velum to the nostrils.
12. • Larynx: the source of speech
• Vocal cords (folds): the two folds of tissue in the larynx. They
can open and shut like a pair of fans.
• Glottis: the gap between the vocal cords. As air is forced
through the glottis the vocal cords will start to vibrate and
modulate the air flow.
• The frequency of vibration determines the pitch of the voice (for a male, 50–200 Hz; for a female, up to 500 Hz).
15. Classes of speech sounds
Voiced sounds
The vocal cords vibrate open and closed
Quasi-periodic pulses of air
The rate of the opening and closing determines the pitch
Unvoiced sounds
Produced by forcing air at high velocities through a constriction
Noise-like turbulence
Show little long-term periodicity
Short-term correlations are still present
e.g. /S/, /F/
Plosive sounds
A complete closure in the vocal tract
Air pressure is built up and released suddenly
e.g. /B/, /P/
17. SPEECH SOUNDS
Coarse classification with phonemes.
A phone is the acoustic realization of a phoneme.
Allophones are context-dependent variants of a phoneme.
18. PHONEME HIERARCHY
Speech sounds are language dependent; there are about 50 phonemes in English.
Vowels: iy, ih, ae, aa, ah, ao, ax, eh, er, ow, uh, uw
Diphthongs: ay, ey, oy, aw
Consonants:
Plosives: p, b, t, d, k, g
Nasals: m, n, ng
Fricatives: f, v, th, dh, s, z, sh, zh, h
Lateral liquid: l
Retroflex liquid: r
Glides: w, y
21. Sounds like /SH/ and /S/ look like (spectrally shaped) random noise, while the vowel sounds /UH/, /IY/, and /EY/ are highly structured and quasi-periodic.
These differences result from the distinctively different ways that these sounds are produced.
26. ACOUSTIC CHARACTERISTICS OF SPEECH
Pitch:
The signal within each voiced interval is periodic. The period T is called the pitch period; it depends on the vowel being spoken and changes over time (T ≈ 70 samples in this example).
f0 = 1/T is the fundamental frequency (the pitch frequency, not to be confused with the formant frequencies).
27. FORMANTS
Formants can be recognized in the frequency content of the signal segment.
Formants are best described as high-energy peaks in the frequency spectrum of a speech sound.
28. The resonant frequencies of the vocal tract are
called formant frequencies or simply formants.
The peaks of the spectrum of the vocal tract
response correspond approximately to its
formants.
Under the linear time-invariant all-pole
assumption, each vocal tract shape is
characterized by a collection of formants.
29. Because the vocal tract is assumed stable with
poles inside the unit circle, the vocal tract
transfer function can be expressed either in
product or partial fraction expansion form:
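The two forms can be written out explicitly; with poles c_k occurring in conjugate pairs and an overall gain A, a standard reconstruction (notation assumed) is:

```latex
V(z) \;=\; \frac{A}{\prod_{k=1}^{N}\bigl(1 - c_k z^{-1}\bigr)\bigl(1 - c_k^{*} z^{-1}\bigr)}
\;=\; \sum_{k=1}^{N}\left(\frac{A_k}{1 - c_k z^{-1}} \;+\; \frac{A_k^{*}}{1 - c_k^{*} z^{-1}}\right)
```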
31. A detailed acoustic theory must consider the effects of the
following:
• Time variation of the vocal tract shape
• Losses due to heat conduction and viscous friction at the
vocal tract walls
• Softness of the vocal tract walls
• Radiation of sound at the lips
• Nasal coupling
• Excitation of sound in the vocal tract
Let us begin by considering a simple case of a lossless tube:
32. 28 December 2012
MULTI-TUBE APPROXIMATION OF THE VOCAL TRACT
We can represent the vocal tract as a concatenation of N lossless tubes with areas {Ak} and equal length Δx = l/N.
The wave propagation time through each tube is τ = Δx/c = l/(Nc).
34. Consider an N-tube model of the previous figure. Each tube has length lk
and cross sectional area of Ak.
Assume:
No losses
Planar wave propagation
The wave equations for section k: 0≤x≤lk
36. SOUND PROPAGATION IN THE CONCATENATED TUBE MODEL
Boundary conditions:
Physical principle of continuity: pressure and volume velocity must be continuous both in time and in space everywhere in the system.
At the kth/(k+1)st junction we have:
38. PROPAGATION OF SOUND IN A UNIFORM TUBE
The vocal tract transfer function of volume velocities is
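For the lossless uniform tube with glottal volume-velocity source U_G and zero pressure at the open lips, a standard form of this transfer function (a reconstruction consistent with the boundary conditions stated on the next slide) is:

```latex
V(\Omega) \;=\; \frac{U(l,\Omega)}{U_G(\Omega)} \;=\; \frac{1}{\cos(\Omega l/c)}
```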
39. PROPAGATION OF SOUND IN A UNIFORM TUBE
Using the boundary conditions U(0, s) = UG(s) and P(−l, s) = 0
(derivation in the Quatieri text, pages 119–125; the derivation of eqn. 4.18 is important)
The poles of the transfer function T(jΩ) are where cos(Ωl/c) = 0.
40. PROPAGATION OF SOUND IN A UNIFORM TUBE (CONT'D)
For c = 34,000 cm/sec and l = 17 cm, the natural frequencies (also called the formants) are at 500 Hz, 1500 Hz, 2500 Hz, ...
The transfer function of a tube with no side branches, excited at one end with the response measured at the other, has only poles.
The formant frequencies will have finite bandwidth when vocal tract losses are considered (e.g., radiation, walls, viscosity, heat).
The length of the vocal tract, l, corresponds to λ1/4, 3λ2/4, 5λ3/4, ..., where λi is the wavelength of the ith natural frequency.
41. UNIFORM TUBE MODEL
Example: consider a uniform tube of length l = 35 cm. If the speed of sound is 350 m/s, calculate its resonances in Hz. Compare its resonances with a tube of length l = 17.5 cm.
Ω = kπc/(2l), k = 1, 3, 5, ...
f = Ω/(2π) = k·c/(4l) = k·350/(4 × 0.35) = 250k
f = 250, 750, 1250, ... Hz
42. UNIFORM TUBE MODEL
For the 17.5 cm tube:
f = Ω/(2π) = k·c/(4l) = k·350/(4 × 0.175) = 500k
f = 500, 1500, 2500, ... Hz
46. VOWELS
Modeled as a tube closed at one end and open at the other
the closure is a membrane with a slit in it
the tube has uniform cross sectional area
membrane represents the source of energy (vocal folds)
the energy travels through the tube
the tube generates no energy on its own
the tube represents an important class of resonators
odd quarter length relationship
Fn=(2n-1)c/4l
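The odd quarter-wavelength relationship above is easy to check numerically; the function name and defaults below are illustrative, not from the slides:

```python
# Resonances of a uniform lossless tube, closed at one end and open at the
# other: F_n = (2n - 1) * c / (4 * l), the odd quarter-wavelength relationship.

def tube_formants(length_m, c=350.0, n_formants=3):
    """First n_formants resonance frequencies (Hz) of a uniform tube of
    length length_m metres, with speed of sound c in m/s."""
    return [(2 * n - 1) * c / (4.0 * length_m) for n in range(1, n_formants + 1)]

# The two worked examples from the uniform-tube slides:
print(tube_formants(0.175))  # approx. 500, 1500, 2500 Hz
print(tube_formants(0.35))   # approx. 250, 750, 1250 Hz
```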
48. VOWELS
Filter characteristics for vowels
the vocal tract is a dynamic filter
it is frequency dependent
it has, theoretically, an infinite number of resonances
each resonance has a center frequency, an amplitude and a
bandwidth
for speech, these resonances are called formants
formants are numbered in succession from the lowest
F1, F2, F3, etc.
49. Fricatives
Modeled as a tube with a very severe constriction
The air exiting the constriction is turbulent
Because of the turbulence, there is no periodicity
unless accompanied by voicing
50. When a fricative constriction is tapered
the back cavity is involved
this resembles a tube closed at both ends
Fn=nc/2l
such a situation occurs primarily for articulation
disorders
57. SHORT-TIME SPEECH ANALYSIS
Segments (or frames, or vectors) are typically of length 20 ms.
Over such a short interval, speech characteristics are approximately constant.
This allows for relatively simple modeling.
Often overlapping segments are extracted.
59. The system is an all-pole system with system function of the form:
For all-pole linear systems, the input and output are related by
a difference equation of the form:
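In standard notation (gain G, predictor coefficients a_k; a reconstruction consistent with the linear prediction material later in the deck):

```latex
H(z) \;=\; \frac{S(z)}{E(z)} \;=\; \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}},
\qquad
s[n] \;=\; \sum_{k=1}^{p} a_k\, s[n-k] \;+\; G\, e[n]
```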
63. SHORT-TIME ENERGY
Simple to compute, and useful for estimating properties of the excitation function in the model.
In this case the operator T{·} simply squares the windowed samples.
64. SHORT-TIME ZERO-CROSSING RATE
Weighted average of the number of times the
speech signal changes sign within the time
window. Representing this operator in terms of
linear filtering leads to:
65. Since |sgn{x[m]} − sgn{x[m − 1]}| is equal to 1 if x[m] and x[m − 1] have different algebraic signs and 0 if they have the same sign, it follows that the short-time zero-crossing rate is a weighted sum of all the instances of alternating sign (zero-crossings) that fall within the support region of the shifted window w[n̂ − m].
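The two measures can be sketched in a few lines; the synthetic unvoiced-to-voiced test signal, window length, and function names below are illustrative assumptions, not the text's own code:

```python
import numpy as np

def short_time_energy(x, win):
    """E[n] = sum_m (x[m] w[n-m])^2: square the windowed samples and sum."""
    return np.convolve(x**2, win**2, mode="same")

def short_time_zcr(x, win):
    """Weighted average of sign alternations inside the sliding window."""
    signs = np.sign(x)
    signs[signs == 0] = 1.0
    crossings = 0.5 * np.abs(np.diff(signs, prepend=signs[0]))
    return np.convolve(crossings, win, mode="same") / win.sum()

# Synthetic unvoiced-to-voiced transition: quiet noise, then a 120 Hz tone
fs = 16000
t = np.arange(fs // 4) / fs
rng = np.random.default_rng(0)
unvoiced = 0.05 * rng.standard_normal(t.size)
voiced = 0.5 * np.sin(2 * np.pi * 120 * t)
x = np.concatenate([unvoiced, voiced])

win = np.hamming(401)               # 25 ms Hamming window at 16 kHz
energy = short_time_energy(x, win)
zcr = short_time_zcr(x, win)
# Expected: low energy / high ZCR in the unvoiced half, the reverse when voiced.
```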
66. The figure shows an example of the short-time energy and zero-crossing rate for a segment of speech with a transition from unvoiced to voiced speech.
In both cases, the window is a Hamming window of duration 25 ms (equivalent to 401 samples at a 16 kHz sampling rate).
Thus, both the short-time energy and the short-time zero-crossing rate are outputs of a lowpass filter whose frequency response is as shown.
67. The short-time energy and zero-crossing rate functions are slowly varying compared to the time variations of the speech signal; therefore, they can be sampled at a much lower rate than the original speech signal.
For finite-length windows like the Hamming window, this reduction of the sampling rate is accomplished by moving the window position n̂ in jumps of more than one sample.
68. during the unvoiced interval, the zero-crossing
rate is relatively high compared to the zero-
crossing rate in the voiced interval.
Conversely, the energy is relatively low in the
unvoiced region compared to the energy in the
voiced region.
69. SHORT-TIME AUTOCORRELATION FUNCTION (STACF)
The autocorrelation function is often used as a means of detecting periodicity in signals, and it is also the basis for many spectrum analysis methods.
The STACF is defined as the deterministic autocorrelation function of the sequence x_n̂[m] = x[m] w[n̂ − m] selected by the window shifted to time n̂, i.e.,
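A minimal numeric sketch of the STACF used for periodicity detection (the synthetic 100 Hz frame and the pitch search range are assumptions for illustration):

```python
import numpy as np

def stacf(x, win):
    """Deterministic autocorrelation R[k] = sum_m (x[m]w[m]) (x[m+k]w[m+k])
    of one windowed frame; non-negative lags only."""
    seg = x * win
    r = np.correlate(seg, seg, mode="full")
    return r[seg.size - 1:]

# A crude quasi-periodic "voiced" frame: 40 ms of a 100 Hz square-ish wave
fs = 8000
t = np.arange(int(0.04 * fs)) / fs
frame = np.sign(np.sin(2 * np.pi * 100.0 * t))

r = stacf(frame, np.hamming(frame.size))

# Periodicity detection: the strongest peak away from lag 0 gives the pitch
lo, hi = int(fs / 400), int(fs / 50)       # search 400 Hz down to 50 Hz
period = lo + int(np.argmax(r[lo:hi]))
pitch_hz = fs / period
```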
71. e[n] is the excitation to the linear system with impulse response h[n].
A well known, and easily proved, property of the autocorrelation function is that the autocorrelation function of s[n] = e[n] ∗ h[n] is the convolution of the autocorrelation functions of e[n] and h[n].
73. SHORT-TIME FOURIER TRANSFORM (STFT)
The expression for the discrete-time STFT at time n,
where w[n] is assumed to be non-zero only in the interval [0, Nw − 1] and is referred to as the analysis window or sometimes as the analysis filter.
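With the window support stated above, the discrete STFT takes the standard form (a reconstruction; N is the DFT length):

```latex
X(n,k) \;=\; \sum_{m=-\infty}^{\infty} x[m]\, w[n-m]\, e^{-j\frac{2\pi}{N}km},
\qquad k = 0, 1, \ldots, N-1
```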
78. SHORT-TIME SYNTHESIS
The problem of obtaining a sequence back from its discrete-time STFT.
The inverse relation provides a synthesis equation for the discrete-time STFT.
79. FILTER BANK SUMMATION (FBS) METHOD
The discrete STFT is considered to be the set of outputs of a bank of filters.
The output of each filter is modulated with a complex exponential, and these modulated filter outputs are summed at each instant of time to obtain the corresponding time sample of the original sequence.
That is, given a discrete STFT X(n, k), the FBS method synthesizes a sequence y[n] satisfying the following equation:
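The FBS result takes the standard form (a reconstruction; it requires w[0] ≠ 0 and a window no longer than the DFT length N):

```latex
y[n] \;=\; \frac{1}{N\,w[0]} \sum_{k=0}^{N-1} X(n,k)\, e^{\,j\frac{2\pi}{N}kn}
```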
84. OVERLAP-ADD (OLA) METHOD
Just as the FBS method was motivated from the filtering view of the STFT, the OLA method is motivated from the Fourier transform view of the STFT.
In this method, for each fixed time, we take the inverse DFT of the corresponding frequency function.
However, instead of dividing out the analysis window from each of the resulting short-time sections, we perform an overlap-and-add operation between the short-time sections.
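A small numerical sketch of OLA synthesis (hop, window, and signal are illustrative; a periodic Hann window at 50% overlap makes the shifted windows sum to one, which is why no division by the analysis window is needed):

```python
import numpy as np

L, R, N = 256, 128, 512          # window length, hop, DFT length
x = np.random.default_rng(1).standard_normal(2048)
win = np.hanning(L + 1)[:L]      # periodic Hann: shifted copies at hop L/2 sum to 1

# Analysis: one DFT per frame -> discrete STFT X(r, k)
starts = range(0, x.size - L + 1, R)
frames = [np.fft.fft(x[s:s + L] * win, N) for s in starts]

# OLA synthesis: inverse-DFT each frame and overlap-add the short-time sections
y = np.zeros(x.size)
for s, X in zip(starts, frames):
    y[s:s + L] += np.fft.ifft(X).real[:L]

# Away from the partially covered edges, sum_r w[n - rR] = 1, so y matches x there.
```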
85. Given a discrete STFT X(n, k), the OLA method synthesizes a sequence y[n] given by
105. PHASE VOCODER
The Fourier series is computed over a sliding window of a single pitch period duration and provides a measure of the amplitude and frequency trajectories of the musical tones.
108. This can be interpreted as a real sinewave that is amplitude- and phase-modulated by the STFT, the "carrier" of the latter being the kth filter's center frequency.
The STFT of a continuous-time signal is,
110. where is an initial condition.
The signal is likewise referred to as the
instantaneous amplitude for each channel. The
resulting filter-bank output is a sinewave with
generally a time-varying amplitude and
frequency modulation.
An alternative expression is,
111. which is the time-domain counterpart to the
frequency-domain phase derivative.
112. We can sample the continuous-time STFT, with sampling interval T, to obtain the discrete-time STFT.
123. HOMOMORPHIC (CEPSTRAL) SPEECH ANALYSIS
The use of the short-time cepstrum as a representation of speech and as a basis for estimating the parameters of the speech generation model.
The cepstrum of a discrete-time signal:
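The standard definitions (complex cepstrum x̂[n] and real cepstrum c[n]; a reconstruction consistent with the following slides):

```latex
\hat{x}[n] \;=\; \frac{1}{2\pi}\int_{-\pi}^{\pi} \log\!\left[X\!\left(e^{j\omega}\right)\right] e^{\,j\omega n}\, d\omega,
\qquad
c[n] \;=\; \frac{1}{2\pi}\int_{-\pi}^{\pi} \log\!\left|X\!\left(e^{j\omega}\right)\right| e^{\,j\omega n}\, d\omega
```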
125. That is, the complex cepstrum operator
transforms convolution into addition.
This property is what makes the cepstrum
useful for speech analysis, since the model for
speech production involves convolution of the
excitation with the vocal tract impulse
response, and our goal is often to separate the
excitation signal from the vocal tract signal.
126. The key issue in the definition and computation
of the complex cepstrum is the computation of
the complex logarithm.
i.e., the computation of the phase angle arg[X(e^jω)], which must be done so as to preserve an additive combination of phases for two signals combined by convolution.
127. THE SHORT-TIME CEPSTRUM
The short-time cepstrum is a sequence of
cepstra of windowed finite-duration segments
of the speech waveform.
129. RECURSIVE COMPUTATION OF THE COMPLEX CEPSTRUM
Another approach to compute the complex cepstrum applies only to minimum-phase signals,
i.e., signals having a z-transform whose poles and zeros are inside the unit circle.
An example would be the impulse response of an all-pole vocal tract model with system function
130. In this case, all the poles ck must be inside the unit circle for stability of the system.
132. The low quefrency part of the cepstrum is
expected to be representative of the slow
variations (with frequency) in the log spectrum,
while the high quefrency components would
correspond to the more rapid fluctuations of
the log spectrum.
133. The spectrum for the voiced segment has a structure of periodic ripples due to the harmonic structure of the quasi-periodic segment of voiced speech.
This periodic structure in the log spectrum manifests itself in the cepstrum peak at a quefrency of about 9 ms.
The existence of this peak in the quefrency range of expected pitch periods strongly signals voiced speech.
Furthermore, the quefrency of the peak is an accurate estimate of the pitch period during the corresponding speech interval.
The autocorrelation function also displays an indication of periodicity, but not nearly as unambiguously as does the cepstrum.
The rapid variations of the unvoiced spectra, by contrast, appear random with no periodic structure.
As a result, there is no strong peak indicating periodicity as in the voiced case.
134. These slowly varying log spectra clearly retain
the general spectral shape with peaks
corresponding to the formant resonance
structure for the segment of speech under
analysis.
135. APPLICATION TO PITCH DETECTION
The cepstrum was first applied in speech
processing to determine the excitation
parameters for the discrete-time speech model.
The successive spectra and cepstra are for 50
ms segments obtained by moving the window
in steps of 12.5 ms (100 samples at a
sampling rate of 8000 samples/sec).
136. For positions 1 through 5, the window includes only unvoiced speech.
For positions 6 and 7, the signal within the window is partly voiced and partly unvoiced.
For positions 8 through 15, the window includes only voiced speech.
The rapid variations of the unvoiced spectra appear random with no periodic structure.
The spectra for voiced segments have a structure of periodic ripples due to the harmonic structure of the quasi-periodic segment of voiced speech.
138. The cepstrum peak at a quefrency of about 11–12 ms strongly signals voiced speech, and the quefrency of the peak is an accurate estimate of the pitch period during the corresponding speech interval.
The presence of a strong peak implies voiced speech, and the quefrency location of the peak gives the estimate of the pitch period.
139. MEL-FREQUENCY CEPSTRUM COEFFICIENTS (MFCC)
The idea is to compute a frequency analysis based upon a filter bank with approximately critical-band spacing of the filters and bandwidths.
For a 4 kHz bandwidth, approximately 20 filters are used.
A short-time Fourier analysis is done first, resulting in a DFT X_n̂[k] for analysis time n̂.
Then the DFT values are grouped together in critical bands and weighted by a triangular weighting function.
140. The bandwidths are constant for center frequencies below 1 kHz and then increase exponentially up to half the sampling rate (4 kHz), resulting in a total of 22 filters.
The mel-frequency spectrum at analysis time n̂ is defined for r = 1, 2, ..., R as
142. is a normalizing factor for the rth mel-filter.
For each frame, a discrete cosine transform of the log of the magnitude of the filter outputs is computed to form the function mfcc_n̂[m], i.e.,
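The analysis just described can be sketched compactly; the mel warping formula, filter count, and DCT convention below are common choices assumed for illustration, not taken from the slides:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular weighting functions, approximately critical-band spaced."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for r in range(1, n_filters + 1):
        lo, c, hi = bins[r - 1], bins[r], bins[r + 1]
        for k in range(lo, c):
            fb[r - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fb[r - 1, k] = (hi - k) / max(hi - c, 1)
    return fb

def mfcc(frame, fs, n_filters=22, n_coeffs=13):
    """DFT -> mel filterbank -> log -> DCT, as outlined in the slides."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(frame.size)))
    fb = mel_filterbank(n_filters, frame.size, fs)
    logmel = np.log10(np.maximum(fb @ spec, 1e-10))   # avoid log(0)
    m = np.arange(n_coeffs)[:, None]                  # DCT-II of log outputs
    r = np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * m * (2 * r + 1) / (2 * n_filters))
    return dct @ logmel

fs = 8000
t = np.arange(256) / fs
coeffs = mfcc(np.sin(2 * np.pi * 440.0 * t), fs)      # one 32 ms frame
```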
144. The figure shows the result of MFCC analysis of a frame of voiced speech in comparison with the short-time Fourier spectrum, the LPC spectrum, and a homomorphically smoothed spectrum.
All these spectra are different, but they have in common peaks at the formant resonances.
At higher frequencies, the reconstructed mel-spectrum has more smoothing due to the structure of the filter bank.
145. THE SPEECH SPECTROGRAM
Simply a display of the magnitude of the STFT.
Specifically, the images in the figure are plots of the STFT magnitude, where the plot axes are labeled in terms of analog time and frequency through the relations tr = rRT and fk = k/(NT), where T is the sampling period of the discrete-time signal x[n] = xa(nT).
146. In order to make the display smooth, R is usually quite small compared to both the window length L and the number of samples in the frequency dimension, N, which may be much larger than the window length L.
Such a function of two variables can be plotted on a two-dimensional surface as either a gray-scale or a color-mapped image.
The bars on the right calibrate the color map (in dB).
148. If the analysis window is short, the spectrogram is called a wide-band spectrogram, which is characterized by good time resolution and poor frequency resolution.
When the window length is long, the spectrogram is a narrow-band spectrogram, which is characterized by good frequency resolution and poor time resolution.
149. THE SPECTROGRAM
• A classic analysis tool.
– Consists of DFTs of overlapping, windowed frames.
• Displays the distribution of energy in time and frequency.
– 10·log10 |Xm(f)|² is typically displayed.
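The display described above can be sketched as follows; the window lengths are chosen to contrast wide-band vs narrow-band behaviour, and all names are illustrative:

```python
import numpy as np

def spectrogram_db(x, L, R, N):
    """10*log10 |X_r[k]|^2 for Hamming-windowed frames of length L,
    hop R, DFT length N (N > L smooths the frequency dimension)."""
    win = np.hamming(L)
    frames = [x[s:s + L] * win for s in range(0, x.size - L + 1, R)]
    S = np.abs(np.fft.rfft(frames, N, axis=-1))
    return 10.0 * np.log10(S**2 + 1e-12)   # rows: time, columns: frequency

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000.0 * t)         # a 1 kHz tone

wide = spectrogram_db(x, L=64, R=16, N=512)      # short window: wide-band
narrow = spectrogram_db(x, L=512, R=16, N=512)   # long window: narrow-band
# Both should peak near f = 1000 Hz, i.e. bin k = f*N/fs = 64.
```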
152. Note the three broad peaks in the spectrum
slice at time tr = 430 ms, and observe that
similar slices would be obtained at other times
around tr = 430 ms.
These large peaks are representative of the
underlying resonances of the vocal tract at the
corresponding time in the production of the
speech signal.
153. The lower spectrogram is not as sensitive to
rapid time variations, but the resolution in the
frequency dimension is much better.
This window length is on the order of several
pitch periods of the waveform during voiced
intervals.
As a result, the spectrogram no longer displays
vertically oriented striations since several
periods are included in the window.
156. CEPSTRAL ANALYSIS
Signal s = convolution (∗) of glottal excitation e and vocal tract filter h:
s(n) = e(n) ∗ h(n), where n is the time index
After the Fourier transform: FT{s(n)} = FT{e(n) ∗ h(n)}
Convolution (∗) becomes multiplication (·), and n (time) becomes w (frequency):
S(w) = E(w) · H(w)
Taking the magnitude of the spectrum: |S(w)| = |E(w)| · |H(w)|
log10 |S(w)| = log10 |E(w)| + log10 |H(w)|
Ref: http://iitg.vlab.co.in/?sub=59&brch=164&sim=615&cnt=1
157. CEPSTRUM
C(n) = IDFT[ log10 |S(w)| ] = IDFT[ log10 |E(w)| + log10 |H(w)| ]
Pipeline: s(n) → windowing → DFT → log|X(w)| → IDFT → C(n)
n = time index, w = frequency, IDFT = inverse discrete Fourier transform
In C(n), the excitation and vocal tract contributions appear at two different quefrency positions.
Application: useful for (i) glottal excitation and (ii) vocal tract filter analysis.
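The pipeline above can be sketched on a synthetic voiced frame; the one-pole "vocal tract" and the 64-sample pitch period are assumptions for illustration. The cepstral peak lands at the pitch-period quefrency:

```python
import numpy as np

fs = 8000
period = 64                       # synthetic pitch period -> f0 = 125 Hz
n = np.arange(400)                # one 50 ms frame

# Crude voiced model: impulse-train excitation e(n) through a one-pole
# "vocal tract" filter:  s(n) = e(n) + 0.9 s(n-1)
e = (n % period == 0).astype(float)
s = np.zeros(n.size)
prev = 0.0
for i in range(n.size):
    s[i] = e[i] + 0.9 * prev
    prev = s[i]

# C(n) = IDFT[ log10 |DFT(windowed frame)| ]
spec = np.abs(np.fft.fft(s * np.hamming(s.size), 1024))
cep = np.fft.ifft(np.log10(spec + 1e-12)).real

# E and H separate along quefrency: the smooth vocal-tract log spectrum
# stays at low quefrency, while the excitation produces a peak at the
# pitch period. Search quefrencies corresponding to 40-400 Hz pitch:
q_lo, q_hi = 20, 200
q_peak = q_lo + int(np.argmax(cep[q_lo:q_hi]))
pitch_hz = fs / q_peak
```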
160. The time-decimated subband outputs are quantized and encoded, then decoded at the receiver.
In subband coding, a small number of filters with wide and overlapping bandwidths are chosen, and each bandpass filter output is quantized individually.
Although the bandpass filters are wide and overlapping, careful design of the filters results in a cancellation of the quantization noise that leaks across bands.
161. Quadrature mirror filters are one such filter class; the figure shows an example of a two-band subband coder using two overlapping quadrature mirror filters.
Quadrature mirror filters can be further subdivided from high to low by splitting the full band into two, then the resulting lower band into two, and so on.
162. This octave-band splitting, together with the
iterative decimation, can be shown to yield a
perfect reconstruction filter bank
Such octave-band filter banks, and their conditions for perfect reconstruction, are closely related to wavelet analysis/synthesis structures.
164. LINEAR PREDICTION (INTRODUCTION)
The object of linear prediction is to estimate the output sequence from a linear combination of input samples, past output samples, or both:
ŷ(n) = Σ_{j=0}^{q} b(j) x(n − j) − Σ_{i=1}^{p} a(i) y(n − i)
The factors a(i) and b(j) are called predictor coefficients.
165. LINEAR PREDICTION (INTRODUCTION)
Many systems of interest to us are describable by a linear, constant-coefficient difference equation:
Σ_{i=0}^{p} a(i) y(n − i) = Σ_{j=0}^{q} b(j) x(n − j)
If Y(z)/X(z) = H(z), where H(z) is a ratio of polynomials N(z)/D(z), then
N(z) = Σ_{j=0}^{q} b(j) z^{−j}  and  D(z) = Σ_{i=0}^{p} a(i) z^{−i}
Thus the predictor coefficients give us immediate access to the poles and zeros of H(z).
166. LINEAR PREDICTION (TYPES OF SYSTEM MODEL)
There are two important variants:
All-pole model (in statistics, the autoregressive (AR) model): the numerator N(z) is a constant.
All-zero model (in statistics, the moving-average (MA) model): the denominator D(z) is equal to unity.
The mixed pole-zero model is called the autoregressive moving-average (ARMA) model.
167. LINEAR PREDICTION (DERIVATION OF LP EQUATIONS)
Given a zero-mean signal y(n), in the AR model:
ŷ(n) = −Σ_{i=1}^{p} a(i) y(n − i)
The error is:
e(n) = y(n) − ŷ(n) = Σ_{i=0}^{p} a(i) y(n − i),  with a(0) = 1
To derive the predictor we use the orthogonality principle, which states that the desired coefficients are those which make the error orthogonal to the samples y(n − 1), y(n − 2), ..., y(n − p).
168. LINEAR PREDICTION (DERIVATION OF LP EQUATIONS)
Thus we require that
⟨y(n − j) e(n)⟩ = 0  for j = 1, 2, ..., p
Or,
⟨y(n − j) Σ_{i=0}^{p} a(i) y(n − i)⟩ = 0
Interchanging the operations of averaging and summing, and representing ⟨·⟩ by summing over n, we have
Σ_{i=0}^{p} a(i) Σ_n y(n − i) y(n − j) = 0,  j = 1, ..., p
The required predictors are found by solving these equations.
169. LINEAR PREDICTION (DERIVATION OF LP EQUATIONS)
The orthogonality principle also states that the resulting minimum error is given by
E = ⟨e²(n)⟩ = ⟨y(n) e(n)⟩
Or,
Σ_{i=0}^{p} a(i) Σ_n y(n − i) y(n) = E
We can minimize the error over all time:
Σ_{i=0}^{p} a(i) r_{i−j} = 0,  j = 1, 2, ..., p
Σ_{i=0}^{p} a(i) r_i = E
where r_i = Σ_{n=−∞}^{∞} y(n) y(n − i)
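These normal equations can be checked numerically on a synthetic AR signal; the direct Toeplitz solve below stands in for the Levinson-Durbin recursion covered in the syllabus, and all names and the test signal are illustrative:

```python
import numpy as np

def lp_normal_equations(y, p):
    """Solve sum_i a(i) r_{i-j} = 0 (j = 1..p) with a(0) = 1, using the
    long-term autocorrelation r_i = sum_n y(n) y(n-i)."""
    r = np.array([np.dot(y[:y.size - i], y[i:]) for i in range(p + 1)])
    R = np.array([[r[abs(i - j)] for i in range(p)] for j in range(p)])
    a = np.linalg.solve(R, -r[1:])          # a(1)..a(p)
    return np.concatenate(([1.0], a))

# Synthetic AR(2) signal with known coefficients:
# y(n) = 1.5 y(n-1) - 0.7 y(n-2) + e(n)  =>  a(1) = -1.5, a(2) = +0.7
rng = np.random.default_rng(0)
e = rng.standard_normal(20000)
y = np.zeros(e.size)
for i in range(2, e.size):
    y[i] = 1.5 * y[i - 1] - 0.7 * y[i - 2] + e[i]

a_hat = lp_normal_equations(y, 2)
```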
170. LINEAR PREDICTION (APPLICATIONS)
Autocorrelation matching: we have a signal y(n) with known autocorrelation r_yy(n). We model this with the AR system shown below:
e(n) → [H(z)] → y(n)
H(z) = σ / A(z) = σ / (1 − Σ_{i=1}^{p} a_i z^{−i})
171. LINEAR PREDICTION (ORDER OF LINEAR PREDICTION)
The choice of predictor order depends on the analysis bandwidth. The rule of thumb is:
p = 2·BW/1000 + c
For a normal vocal tract, there is an average of about one formant per kilohertz of BW.
One formant requires two complex conjugate poles.
Hence for every formant we require two predictor coefficients, or two coefficients per kilohertz of bandwidth.
172. LINEAR PREDICTION (AR MODELING OF SPEECH SIGNAL)
True model:
[Block diagram: a voiced impulse generator (with pitch and gain) drives a glottal filter G(z); an unvoiced generator produces uncorrelated noise with its own gain; a V/U switch selects between them; the volume velocity U(n) passes through the vocal tract filter H(z) and the lip radiation filter R(z) to produce the speech signal s(n).]
173. LINEAR PREDICTION (AR MODELING OF SPEECH SIGNAL)
Using LP analysis:
[Block diagram: the voiced impulse generator (with pitch) and the unvoiced white noise generator, selected by a V/U switch and scaled by a gain, excite a single all-pole (AR) filter H(z) whose output is the estimated speech signal s(n).]