SPEECH SIGNAL PROCESSING
KERALA UNIVERSITY M-TECH 1ST SEMESTER

Lizy Abraham
Assistant Professor, Department of ECE
LBS Institute of Technology for Women
(A Govt. of Kerala Undertaking)
Poojappura, Trivandrum - 695012, Kerala, India
lizytvm@yahoo.com | +919495123331
SYLLABUS TSC 1004 SPEECH SIGNAL PROCESSING 3-0-0-3

  Speech Production :- Acoustic theory of speech production (Excitation, Vocal tract model for
  speech analysis, Formant structure, Pitch). Articulatory Phonetics (Articulation, Voicing,
  Articulatory model). Acoustic Phonetics (Basic speech units and their classification).
  Speech Analysis :- Short-Time Speech Analysis, Time domain analysis (Short-time energy,
  short-time zero-crossing rate, ACF). Frequency domain analysis (Filter Banks, STFT,
  Spectrogram, Formant Estimation & Analysis). Cepstral Analysis.
  Parametric representation of speech :- AR model, ARMA model. LPC Analysis (LPC model,
  Autocorrelation method, Covariance method, Levinson-Durbin Algorithm, Lattice form). LSF,
  LAR, MFCC, Sinusoidal Model, GMM, HMM.
  Speech coding :- Phase Vocoder, LPC, Sub-band coding, Adaptive Transform Coding, Harmonic
  Coding, Vector Quantization based Coders, CELP.
  Speech processing :- Fundamentals of speech recognition, speech segmentation, text-to-speech
  conversion, speech enhancement, speaker verification, language identification, issues of
  voice transmission over the Internet.
REFERENCES

 1. Douglas O'Shaughnessy, Speech Communications: Human & Machine, 2nd edition,
    IEEE Press, 1999. ISBN: 0780334493.
 2. Nelson Morgan and Ben Gold, Speech and Audio Signal Processing: Processing and
    Perception of Speech and Music, John Wiley & Sons, July 1999. ISBN: 0471351547.
 3. Rabiner and Schafer, Digital Processing of Speech Signals, Prentice Hall, 1978.
 4. Rabiner and Juang, Fundamentals of Speech Recognition, Prentice Hall, 1994.
 5. Thomas F. Quatieri, Discrete-Time Speech Signal Processing: Principles and
    Practice, 1st edition, Prentice Hall. ISBN: 013242942X.
 6. Donald G. Childers, Speech Processing and Synthesis Toolboxes, John Wiley &
    Sons, September 1999. ISBN: 0471349593.

 For the end-semester exam (100 marks), the question paper shall have six questions
 of 20 marks each covering the entire syllabus, of which any five shall be answered.
 It shall have 75% problems and 25% theory. For the internal marks of 50: two tests
 of 20 marks each, and 10 marks for assignments (minimum two) / a term project.
Speech processing means the processing of discrete-time speech signals.
Speech processing sits at the intersection of several fields:

  Algorithms (programming)
  Acoustics: psychoacoustics, room acoustics, speech production
  Phonetics
  Signal processing: Fourier transforms, discrete-time filters, AR(MA) models,
    statistical signal processing, stochastic models
  Information theory: entropy, communication theory, rate-distortion theory
HOW IS SPEECH PRODUCED?

   Speech can be defined as "an acoustic pressure
   signal that is articulated in the vocal tract."

   Speech is produced when air is forced
   from the lungs through the vocal cords
   and along the vocal tract.
This air flow is referred to as the "excitation signal".

The excitation signal causes the vocal cords to
vibrate and propagates energy toward the oral
and nasal openings, which play a major role in
shaping the sound produced.

Vocal tract components:
  - Oral tract: from the lips to the vocal cords.
  - Nasal tract: from the velum to the nostrils.
•   Larynx: the source of speech.

•   Vocal cords (folds): the two folds of tissue in the larynx. They
    can open and shut like a pair of fans.

•   Glottis: the gap between the vocal cords. As air is forced
    through the glottis the vocal cords start to vibrate and
    modulate the air flow.

•   The frequency of vibration determines the pitch of the voice (for
    a male, 50-200 Hz; for a female, up to 500 Hz).
SPEECH PRODUCTION MODEL




Places of articulation (front to back): labial, dental, alveolar,
post-alveolar/palatal, velar, uvular, pharyngeal, laryngeal/glottal.
Classes of speech sounds

   Voiced sounds
      The vocal cords vibrate open and closed
      Quasi-periodic pulses of air
      The rate of opening and closing determines the pitch
   Unvoiced sounds
      Produced by forcing air at high velocity through a constriction
      Noise-like turbulence
      Show little long-term periodicity
      Short-term correlations are still present
      E.g. "S", "F"
   Plosive sounds
      A complete closure in the vocal tract
      Air pressure builds up and is released suddenly
      E.g. "B", "P"
Speech Model




SPEECH SOUNDS

  Speech sounds are coarsely classified into phonemes.

  A phone is the acoustic realization of a
  phoneme.

  Allophones are context-dependent realizations
  of a phoneme.
PHONEME HIERARCHY

Speech sounds are language dependent; there are about 50 phonemes in English.

  Vowels: iy, ih, ae, aa, ah, ao, ax, eh, er, ow, uh, uw
  Diphthongs: ay, ey, oy, aw
  Consonants:
    Glides: w, y
    Plosives: p, b, t, d, k, g
    Nasals: m, n, ng
    Fricatives: f, v, th, dh, s, z, sh, zh, h
    Liquids: lateral liquid l, retroflex liquid r
Sounds like /SH/ and /S/ look like
(spectrally shaped) random noise,
while the vowel sounds /UH/, /IY/,
and /EY/ are highly structured and
quasi-periodic.

These differences result from the
distinctively different ways that these
sounds are produced.
Vowel Chart

         Front    Center    Back
High     i, ɪ               u, ʊ
Mid      e, ɛ     ə, ʌ      o
Low      æ                  a
SPEECH WAVEFORM CHARACTERISTICS

    Loudness
    Voiced/unvoiced
    Pitch
      Fundamental frequency
    Spectral envelope
      Formants
Acoustic characteristics of speech

 Pitch:
 The signal within each voiced interval is periodic. The period T is
 called the pitch period. The pitch depends on the vowel being spoken and
 changes over time (T ~ 70 samples in this example).
 f0 = 1/T is the fundamental frequency (not to be confused with the
 formant frequencies).
FORMANTS

 Formants can be recognized in the frequency content
 of a signal segment.

 Formants are best described as high-energy peaks in the
 frequency spectrum of a speech sound.
The resonant frequencies of the vocal tract are
called formant frequencies or simply formants.
The peaks of the spectrum of the vocal tract
response correspond approximately to its
formants.
Under the linear time-invariant all-pole
assumption, each vocal tract shape is
characterized by a collection of formants.

Because the vocal tract is assumed stable with
poles inside the unit circle, the vocal tract
transfer function can be expressed either in
product or partial fraction expansion form:




A detailed acoustic theory must consider the effects of the
following:
• Time variation of the vocal tract shape
• Losses due to heat conduction and viscous friction at the
vocal tract walls
• Softness of the vocal tract walls
• Radiation of sound at the lips
• Nasal coupling
• Excitation of sound in the vocal tract
Let us begin by considering a simple case of a lossless tube:


MULTI-TUBE APPROXIMATION OF THE VOCAL TRACT

 We can represent the vocal tract as a concatenation of N lossless tubes with areas {Ak} and
 equal length ∆x = l/N.
 The wave propagation time through each tube is τ = ∆x/c = l/(Nc).
Consider an N-tube model as in the previous figure. Each tube has length lk
and cross-sectional area Ak.
Assume:
   No losses
   Planar wave propagation
The wave equations hold for each section k over 0 ≤ x ≤ lk.
SOUND PROPAGATION IN THE CONCATENATED TUBE MODEL

 Boundary conditions:
 Physical principle of continuity:
     Pressure and volume velocity must be continuous both in time and in space
     everywhere in the system.
 These conditions are applied at the junction between the kth and (k+1)st tubes.
ANALOGY WITH ELECTRICAL TRANSMISSION LINE
PROPAGATION OF SOUND IN A UNIFORM TUBE

 The vocal tract transfer function of volume velocities follows from the wave
 equations with the appropriate boundary conditions.
PROPAGATION OF SOUND IN A UNIFORM TUBE

 Using the boundary conditions U(0,s) = U_G(s) and
 P(-l,s) = 0, the transfer function T(jΩ) is obtained
 (derivation in the Quatieri text, pages 122-125).
 The poles of the transfer function T(jΩ) occur where cos(Ωl/c) = 0.

             119-124: Quatieri.
             The derivation of eqn. 4.18 is important.
PROPAGATION OF SOUND IN A UNIFORM TUBE (CONT'D)

 For c = 34,000 cm/sec and l = 17 cm, the natural frequencies (also called the formants) are
 at 500 Hz, 1500 Hz, 2500 Hz, ...

 The transfer function of a tube with no side branches, excited at one end with the response
 measured at the other, has only poles.
 The formant frequencies have finite bandwidth when vocal tract losses are considered (e.g.,
 radiation, walls, viscosity, heat).
 The length of the vocal tract, l, corresponds to λ1/4, 3λ2/4, 5λ3/4, ..., where λi is the
 wavelength of the ith natural frequency.
UNIFORM TUBE MODEL

 Example
  Consider a uniform tube of length l = 35 cm. If the speed
  of sound is 350 m/s, calculate its resonances in Hz.
  Compare its resonances with a tube of length l = 17.5 cm.

  f = Ω/(2π), with Ω = k·πc/(2l), k = 1, 3, 5, ...

  f = Ω/(2π) = k·(πc/2l)·(1/2π) = k·c/(4l) = k·350/(4 × 0.35) = 250k

  f = 250, 750, 1250, ... Hz
UNIFORM TUBE MODEL

 For the 17.5 cm tube:

  f = k·c/(4l) = k·350/(4 × 0.175) = 500k

  f = 500, 1500, 2500, ... Hz
APPROXIMATING VOCAL TRACT SHAPES




VOWELS
Modeled as a tube closed at one end and open at the other
  the closure is a membrane with a slit in it
  the tube has uniform cross-sectional area
  the membrane represents the source of energy (the vocal folds)
     the energy travels through the tube
     the tube generates no energy of its own
  the tube represents an important class of resonators
     odd quarter-wavelength relationship
     Fn = (2n-1)c/(4l)
VOWELS

Filter characteristics for vowels
    the vocal tract is a dynamic filter
    it is frequency dependent
    it has, theoretically, an infinite number of resonances
    each resonance has a center frequency, an amplitude, and a
    bandwidth
    for speech, these resonances are called formants
    formants are numbered in succession from the lowest:
        F1, F2, F3, etc.
Fricatives
   Modeled as a tube with a very severe constriction
   The air exiting the constriction is turbulent
   Because of the turbulence, there is no periodicity
   unless the fricative is accompanied by voicing
When a fricative constriction is tapered
  the back cavity is involved
  this resembles a tube closed at both ends
    Fn = nc/(2l)
  such a situation occurs primarily in articulation
  disorders
Introduction to Digital Speech Processing
(Rabiner & Schafer), pages 20-23
Rabiner & Schafer: pages 98-105
SOUND SOURCE: VOCAL FOLD VIBRATION

 Modeled as a volume velocity source at the glottis, U_G(jΩ).
SHORT-TIME SPEECH ANALYSIS

 Segments (or frames, or vectors) are typically of
 length 20 ms.
   Speech characteristics are approximately constant over such an interval.
   This allows for relatively simple modeling.
 Often overlapping segments are extracted.
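A minimal Python sketch of this framing step (my own helper, not from the slides): it splits a sampled signal into overlapping frames; the 20 ms frame and 10 ms hop are typical, adjustable values.

import numpy as np

def frame_signal(x, fs, frame_ms=20.0, hop_ms=10.0):
    """Return overlapping frames (one per row) of frame_ms duration,
    advanced by hop_ms between successive frame starts."""
    flen = int(fs * frame_ms / 1000)      # samples per frame
    hop = int(fs * hop_ms / 1000)         # frame advance in samples
    n = 1 + max(0, (len(x) - flen) // hop)
    return np.stack([x[i * hop : i * hop + flen] for i in range(n)])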
SHORT-TIME ANALYSIS OF SPEECH
The system is an all-pole system, with a system function of the form

    H(z) = G / (1 - Σ_{k=1}^{p} a_k z^{-k})

For all-pole linear systems, the input and output are related by
a difference equation of the form

    s[n] = Σ_{k=1}^{p} a_k s[n-k] + G e[n]
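To make the all-pole relation concrete, here is a hedged sketch that synthesizes a signal from this difference equation with scipy.signal.lfilter; the coefficient values and the white-noise excitation are purely illustrative, not taken from the slides.

import numpy as np
from scipy.signal import lfilter

# Hypothetical all-pole model: H(z) = G / (1 - 1.3 z^-1 + 0.8 z^-2)
a = [1.0, -1.3, 0.8]          # denominator coefficients [1, -a1, -a2]
G = 0.5                        # gain
e = np.random.randn(16000)     # white-noise excitation (unvoiced-like)
s = lfilter([G], a, e)         # s[n] = 1.3 s[n-1] - 0.8 s[n-2] + G e[n]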
The operator T{·} defines the nature of the
short-time analysis function, and w[n̂ - m]
represents a time-shifted window sequence.
SHORT-TIME ENERGY

 Simple to compute, and useful for estimating
 properties of the excitation function in the
 model:

     E_n̂ = Σ_m (x[m] w[n̂ - m])²

     In this case the operator T{·} simply
     squares the windowed samples.
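A direct NumPy sketch of this definition (my own formulation): since the window enters squared, E can be computed by convolving x² with w².

import numpy as np

def short_time_energy(x, win):
    """E[n] = sum_m (x[m] w[n-m])^2 = (x^2 convolved with w^2)[n]."""
    return np.convolve(x ** 2, win ** 2, mode="same")

win = np.hamming(401)   # 25 ms window at 16 kHz, as in the text's example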
SHORT-TIME ZERO-CROSSING RATE

 A weighted average of the number of times the
 speech signal changes sign within the time
 window. Representing this operator in terms of
 linear filtering leads to:

     Z_n̂ = Σ_m (1/2)|sgn{x[m]} - sgn{x[m-1]}| w[n̂ - m]
Since (1/2)|sgn{x[m]} − sgn{x[m − 1]}| is equal to 1
if x[m] and x[m − 1] have different algebraic
signs and 0 if they have the same sign, it
follows that Z_n̂ is a weighted sum of all the
instances of alternating sign (zero-crossings)
that fall within the support region of the shifted
window w[n̂ − m].
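The same operation in NumPy (a sketch under the definition above; the treatment of exactly-zero samples is left crude):

import numpy as np

def short_time_zcr(x, win):
    """Z[n] = sum_m 0.5|sgn x[m] - sgn x[m-1]| w[n-m]."""
    alt = 0.5 * np.abs(np.diff(np.sign(x)))   # 1 at each sign alternation
    return np.convolve(alt, win, mode="same")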
The figure shows an example of the short-time energy and
zero-crossing rate for a segment of speech with
a transition from unvoiced to voiced speech.
 In both cases, the window is a Hamming
window of duration 25 ms (equivalent to 401
samples at a 16 kHz sampling rate).
 Thus, both the short-time energy and the
short-time zero-crossing rate are outputs of a
low-pass filter whose frequency response is as
shown.
Short-time energy and zero-crossing rate functions vary slowly
compared to the time variations of the speech signal, and therefore they
can be sampled at a much lower rate than that of the original speech
signal.
For finite-length windows like the Hamming window, this reduction of
the sampling rate is accomplished by moving the window position n̂ in
jumps of more than one sample.
During the unvoiced interval, the zero-crossing
rate is relatively high compared to the zero-
crossing rate in the voiced interval.
Conversely, the energy is relatively low in the
unvoiced region compared to the energy in the
voiced region.
SHORT-TIME AUTOCORRELATION FUNCTION (STACF)

 The autocorrelation function is often used as a means
 of detecting periodicity in signals, and it is also the
 basis for many spectrum analysis methods.
 The STACF is defined as the deterministic autocorrelation
 function of the sequence x_n̂[m] = x[m] w[n̂ - m]
 selected by the window shifted to time n̂, i.e.,

     R_n̂[k] = Σ_m x_n̂[m] x_n̂[m + k]
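A short NumPy sketch of the STACF for one analysis time n̂ (my own helper; it assumes n̂ is at least the window length):

import numpy as np

def st_autocorr(x, n_hat, win):
    """Autocorrelation of the windowed segment x[m] w[n_hat - m]."""
    seg = x[n_hat - len(win) + 1 : n_hat + 1] * win[::-1]
    full = np.correlate(seg, seg, mode="full")
    return full[len(seg) - 1 :]     # lags 0, 1, 2, ...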
Let e[n] be the excitation to the linear system with impulse
response h[n]. A well-known, and easily proved, property of the
autocorrelation function is that the autocorrelation function of
s[n] = e[n] ∗ h[n] is the convolution of the autocorrelation
functions of e[n] and h[n].
SHORT-TIME FOURIER TRANSFORM (STFT)

 The discrete-time STFT at time n is

     X(n, ω) = Σ_m x[m] w[n - m] e^{-jωm}

 where w[n] is assumed to be non-zero only
 in the interval [0, N_w - 1] and is referred to
 as the analysis window or sometimes the
 analysis filter.
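A frame-based NumPy sketch of the discrete STFT, sampled in time every hop samples (the function and parameter names are my own):

import numpy as np

def stft(x, win, hop, n_fft=None):
    """One row per frame: DFT of the windowed segment x[i:i+len(win)] * win."""
    n_fft = n_fft or len(win)
    frames = [x[i : i + len(win)] * win
              for i in range(0, len(x) - len(win) + 1, hop)]
    return np.array([np.fft.rfft(f, n_fft) for f in frames])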
FILTERING VIEW




SHORT-TIME SYNTHESIS

 The problem of obtaining a sequence back from its
 discrete-time STFT.

        The inverse relation represents a synthesis
        equation for the discrete-time STFT.
FILTER BANK SUMMATION (FBS) METHOD

 The discrete STFT is considered to be the set of
 outputs of a bank of filters.
 The output of each filter is modulated with a
 complex exponential, and these modulated
 filter outputs are summed at each instant of
 time to obtain the corresponding time sample
 of the original sequence.
 That is, given a discrete STFT X(n, k), the FBS
 method synthesizes a sequence y(n) satisfying
 the following equation:
OVERLAP-ADD METHOD
 Just as the FBS method was motivated by the
 filtering view of the STFT, the OLA method is motivated
 by the Fourier transform view of the STFT.
 In this method, for each fixed time, we take the
 inverse DFT of the corresponding frequency function.
 However, instead of dividing out the analysis window
 from each of the resulting short-time sections, we
 perform an overlap-and-add operation between the
 short-time sections.
Given a discrete STFT X(n, k), the OLA method
synthesizes a sequence y[n] given by the sum of the
inverse DFTs of the successive frames, overlapped and added.
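A sketch of OLA resynthesis matching the stft helper shown earlier (assuming a window whose shifted-window sum is constant, e.g., a Hann window with hop = len(win)//4):

import numpy as np

def ola_synthesis(X, win, hop):
    """Inverse-DFT each frame and overlap-add the short-time sections."""
    n = len(win)
    y = np.zeros((X.shape[0] - 1) * hop + n)
    for r, Xr in enumerate(X):
        y[r * hop : r * hop + n] += np.fft.irfft(Xr, n)
    return y * hop / win.sum()   # undo the constant window-sum gain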
Furthermore, if the discrete STFT has been
decimated in time by a factor L, it can be
similarly shown that exact reconstruction holds if the
analysis window satisfies the corresponding
sum condition Σ_r w[n - rL] = constant.
DESIGN OF DIGITAL FILTER BANKS
           282-297: Rabiner & Schafer
USING IIR FILTER




USING FIR FILTER




FILTER BANK ANALYSIS AND SYNTHESIS




FBS synthesis results in multiple time-shifted copies of the
input, weighted by samples of the analysis window.
PHASE VOCODER

 The Fourier series is computed over a sliding
 window of a single pitch period in duration, and
 provides a measure of the amplitude and frequency
 trajectories of the musical tones.
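Below is a compact phase-vocoder sketch for time-scale modification, written only to illustrate the principle (magnitude interpolation plus phase accumulation at a fixed synthesis hop); it is a bare-bones illustration, not the original vocoder of the slides, and all names and parameter values are my own.

import numpy as np

def pv_stretch(x, rate, n_fft=1024, hop=256):
    """Play x back at 'rate' times the original speed (rate > 1 = faster)."""
    win = np.hanning(n_fft)
    frames = [x[i : i + n_fft] * win
              for i in range(0, len(x) - n_fft, hop)]
    S = np.array([np.fft.rfft(f) for f in frames])            # (T, F)
    omega = 2 * np.pi * np.arange(S.shape[1]) * hop / n_fft   # nominal phase advance
    steps = np.arange(0, S.shape[0] - 1, rate)                # fractional analysis frames
    phase = np.angle(S[0])
    y = np.zeros(len(steps) * hop + n_fft)
    for k, t in enumerate(steps):
        i = int(t)
        frac = t - i
        mag = (1 - frac) * np.abs(S[i]) + frac * np.abs(S[i + 1])
        dphi = np.angle(S[i + 1]) - np.angle(S[i]) - omega
        dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))      # wrap to [-pi, pi]
        y[k * hop : k * hop + n_fft] += np.fft.irfft(mag * np.exp(1j * phase)) * win
        phase += omega + dphi                                  # accumulate true frequency
    return y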
Each channel output can be interpreted as a real sinewave
that is amplitude- and phase-modulated by the
STFT, the "carrier" of the latter being the kth
filter's center frequency.
The STFT of a continuous-time signal is defined analogously.
The phase is obtained by integrating the instantaneous
frequency from an initial condition.
The channel magnitude is likewise referred to as the
instantaneous amplitude for each channel. The
resulting filter-bank output is a sinewave with
generally time-varying amplitude and
frequency modulation.
An alternative expression exists,
which is the time-domain counterpart of the
frequency-domain phase derivative.
We can sample the continuous-time STFT, with
sampling interval T, to obtain the discrete-time
STFT.
SPEECH MODIFICATION




HOMOMORPHIC (CEPSTRAL) SPEECH ANALYSIS

 The short-time cepstrum is used as a representation of
 speech and as a basis for estimating the parameters
 of the speech generation model.
 The cepstrum of a discrete-time signal is the inverse
 Fourier transform of the logarithm of its Fourier transform.
That is, the complex cepstrum operator
transforms convolution into addition.
This property is what makes the cepstrum
useful for speech analysis, since the model for
speech production involves convolution of the
excitation with the vocal tract impulse
response, and our goal is often to separate the
excitation signal from the vocal tract signal.
The key issue in the definition and computation
of the complex cepstrum is the computation of
the complex logarithm, i.e., the computation of the phase angle
arg[X(e^{jω})], which must be done so as to
preserve an additive combination of phases for
two signals combined by convolution.
THE SHORT-TIME CEPSTRUM

 The short-time cepstrum is a sequence of
 cepstra of windowed finite-duration segments
 of the speech waveform.
RECURSIVE COMPUTATION OF THE COMPLEX CEPSTRUM

 Another approach to computing the complex
 cepstrum applies only to minimum-phase
 signals,
 i.e., signals having a z-transform whose poles
 and zeros are inside the unit circle.
 An example would be the impulse response of
 an all-pole vocal tract model with system
 function H(z) = G / Π_k (1 - c_k z^{-1}).
In this case, all the poles c_k must be inside
the unit circle for stability of the system.
SHORT-TIME HOMOMORPHIC FILTERING OF SPEECH
(page no. 63, Rabiner & Schafer)
The low-quefrency part of the cepstrum is
expected to be representative of the slow
variations (with frequency) in the log spectrum,
while the high-quefrency components correspond
to the more rapid fluctuations of
the log spectrum.
The spectrum of the voiced segment has a structure of periodic ripples
due to the harmonic structure of the quasi-periodic segment of voiced
speech.
This periodic structure in the log spectrum manifests itself in the
cepstrum as a peak at a quefrency of about 9 ms.
The existence of this peak in the quefrency range of expected pitch
periods strongly signals voiced speech.
Furthermore, the quefrency of the peak is an accurate estimate of the
pitch period during the corresponding speech interval.
The autocorrelation function also displays an indication of periodicity,
but not nearly as unambiguously as does the cepstrum.
The rapid variations of the unvoiced spectra, by contrast, appear random
with no periodic structure.
As a result, there is no strong peak indicating periodicity as in the
voiced case.
These slowly varying log spectra clearly retain
the general spectral shape, with peaks
corresponding to the formant resonance
structure for the segment of speech under
analysis.
APPLICATION TO PITCH DETECTION

 The cepstrum was first applied in speech
 processing to determine the excitation
 parameters for the discrete-time speech model.
 The successive spectra and cepstra are for 50
 ms segments obtained by moving the window
 in steps of 12.5 ms (100 samples at a
 sampling rate of 8000 samples/sec).
For positions 1 through 5, the window includes only
unvoiced speech;
for positions 6 and 7, the signal within the window is partly
voiced and partly unvoiced;
for positions 8 through 15, the window includes only voiced
speech.
The rapid variations of the unvoiced spectra appear random,
with no periodic structure.
The spectra for voiced segments have a structure of periodic
ripples due to the harmonic structure of the quasi-periodic
segment of voiced speech.
The cepstrum peak at a quefrency of about 11-12 ms
strongly signals voiced speech, and the
quefrency of the peak is an accurate estimate
of the pitch period during the corresponding
speech interval.
Presence of a strong peak implies voiced
speech, and the quefrency location of the peak
gives the estimate of the pitch period.
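A NumPy sketch of this cepstral pitch detector (my own helper; the peak threshold is a made-up value that a real system would tune):

import numpy as np

def cepstral_pitch(frame, fs, fmin=50.0, fmax=500.0, thresh=0.1):
    """Return a pitch estimate in Hz, or 0.0 if the frame looks unvoiced."""
    w = np.hamming(len(frame))
    spec = np.abs(np.fft.rfft(frame * w)) + 1e-10
    c = np.fft.irfft(np.log(spec))               # real cepstrum
    qmin, qmax = int(fs / fmax), int(fs / fmin)  # quefrency search range
    peak = qmin + np.argmax(c[qmin:qmax])
    return fs / peak if c[peak] > thresh else 0.0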
MEL-FREQUENCY CEPSTRUM COEFFICIENTS (MFCC)

 The idea is to compute a frequency analysis based
 upon a filter bank with approximately critical-band
 spacing of the filters and bandwidths.
 For a 4 kHz bandwidth, approximately 20 filters are
 used.
 A short-time Fourier analysis is done first, resulting in
 a DFT X_n̂[k] for analysis time n̂.
 Then the DFT values are grouped together in critical
 bands and weighted by a triangular weighting
 function.
The bandwidths are constant for center
frequencies below 1 kHz and then increase
exponentially up to half the sampling rate (4 kHz),
resulting in a total of 22 filters.
The mel-frequency spectrum at analysis time n̂
is defined for r = 1, 2, ..., R.
A normalizing factor for the rth mel-filter is included.
For each frame, a discrete cosine transform of
the log of the magnitude of the filter outputs is
computed to form the function mfcc_n̂[m].
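In practice the whole chain above is available in libraries; assuming the librosa package, a hedged equivalent is shown below (the filename is hypothetical, and the parameter choices are illustrative, with n_mels=22 echoing the 22 filters mentioned above):

import librosa

y, sr = librosa.load("speech.wav", sr=None)                  # hypothetical file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_mels=22)
# mfcc has shape (13, num_frames): one cepstral vector per analysis frame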
The figure shows the result of mfcc analysis of a frame of
voiced speech in comparison with the short-
time Fourier spectrum, the LPC spectrum, and a
homomorphically smoothed spectrum.
All these spectra are different, but they have in
common that they have peaks at the formant
resonances.
At higher frequencies, the reconstructed mel-
spectrum shows more smoothing due to the
structure of the filter bank.
THE SPEECH SPECTROGRAM

 Simply a display of the magnitude of the STFT.
 Specifically, the images in the figure are plots of

     20 log10 |X[rR, k]|

 where the plot axes are labeled in terms of
 analog time and frequency through the
 relations tr = rRT and fk = k/(NT), where T is
 the sampling period of the discrete-time signal
 x[n] = xa(nT).
To make the image smooth, R is usually quite
small compared to both the window length L
and the number of samples in the frequency
dimension, N, which may be much larger than
the window length L.
 Such a function of two variables can be plotted
on a two-dimensional surface as either a gray-
scale or a color-mapped image.
The bars on the right calibrate the color map (in
dB).
If the analysis window is short, the spectrogram
is called a wide-band spectrogram, which is
characterized by good time resolution and poor
frequency resolution.
When the window length is long, the
spectrogram is a narrow-band spectrogram,
which is characterized by good frequency
resolution and poor time resolution.
THE SPECTROGRAM

 • A classic analysis tool.
   - Consists of DFTs of overlapping, windowed frames.
 • Displays the distribution of energy in time
   and frequency.
   - 10 log10 |X_m(f)|² is typically displayed.
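A hedged SciPy sketch of the display quantity above (x and fs are assumed to hold the samples and sampling rate; the window lengths are illustrative):

import numpy as np
from scipy.signal import spectrogram

# Long window -> narrow-band (good frequency resolution);
# short window -> wide-band (good time resolution).
f, t, Sxx = spectrogram(x, fs=fs, window="hamming",
                        nperseg=512, noverlap=480)
S_db = 10 * np.log10(Sxx + 1e-12)   # energy distribution in dB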
THE SPECTROGRAM CONT.




Note the three broad peaks in the spectrum
slice at time tr = 430 ms, and observe that
similar slices would be obtained at other times
around tr = 430 ms.
These large peaks are representative of the
underlying resonances of the vocal tract at the
corresponding time in the production of the
speech signal.

The lower spectrogram is not as sensitive to
rapid time variations, but the resolution in the
frequency dimension is much better.
This window length is on the order of several
pitch periods of the waveform during voiced
intervals.
As a result, the spectrogram no longer displays
vertically oriented striations since several
periods are included in the window.
SHORT-TIME ACF

[Figure: short-time autocorrelation functions for the sounds /m/, /ow/, and /s/]
CEPSTRUM

Speech wave (S) = excitation (E) · filter (H): the glottal excitation (E) from
the vocal cords (glottis) drives the vocal tract filter (H) to produce the
speech wave (S).

       http://home.hib.no/al/engelsk/seksjon/SOFF-MASTER/ill061.gif
CEPSTRAL ANALYSIS
   The signal s is the convolution (∗) of the
      glottal excitation e and the vocal tract filter h:
      s(n) = e(n) ∗ h(n), where n is the time index.
   After the Fourier transform FT: FT{s(n)} = FT{e(n) ∗ h(n)}
      Convolution (∗) becomes multiplication (·)
      n (time) → ω (frequency):
   S(ω) = E(ω) · H(ω)
   Taking the magnitude of the spectrum:
   |S(ω)| = |E(ω)| · |H(ω)|
   log10 |S(ω)| = log10 |E(ω)| + log10 |H(ω)|

 Ref: http://iitg.vlab.co.in/?sub=59&brch=164&sim=615&cnt=1
CEPSTRUM
   C(n) = IDFT[log10 |S(ω)|]
        = IDFT[log10 |E(ω)| + log10 |H(ω)|]

Pipeline: s(n) → windowing → DFT → log|X(ω)| → IDFT → C(n)

 n = time index
 ω = frequency
 IDFT = inverse discrete Fourier transform

   In C(n), the contributions of E and H appear at two different
   (quefrency) positions.
   Application: useful for (i) glottal excitation and (ii) vocal tract
   filter analysis.
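The pipeline above, written out in NumPy (a sketch; the small constant only guards the logarithm):

import numpy as np

def real_cepstrum(s, win):
    """windowing -> DFT -> log|X(w)| -> IDFT, as in the diagram."""
    X = np.fft.fft(s * win)
    return np.fft.ifft(np.log10(np.abs(X) + 1e-10)).real
# Low-quefrency samples ~ vocal tract (H); a peak near the pitch
# period ~ glottal excitation (E).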
EXAMPLE OF CEPSTRUM
                      (sampling frequency 22.05 kHz)
SUB-BAND CODING
The time-decimated subband outputs are quantized
and encoded, then decoded at the receiver.
In subband coding, a small number of filters with wide
and overlapping bandwidths are chosen, and each
bandpass filter output is quantized individually.
Although the bandpass filters are wide and
overlapping, careful design of the filters results in a
cancellation of the quantization noise that leaks across
bands.
Quadrature mirror filters are one such filter
class;
the figure shows an example of a two-band subband
coder using two overlapping quadrature mirror
filters.
Quadrature mirror filters can be further
subdivided from high to low by splitting
the full band into two, then the resulting lower
band into two, and so on.
This octave-band splitting, together with the
iterative decimation, can be shown to yield a
perfect-reconstruction filter bank;
such octave-band filter banks, and their
conditions for perfect reconstruction, are
closely related to wavelet analysis/synthesis
structures. (A minimal two-band sketch follows.)
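As promised, a minimal two-band sketch using the simplest QMF pair (the Haar pair; this is my own illustration, not a design from the slides). Splitting the lowpass branch recursively gives the octave-band structure:

import numpy as np

def haar_analysis(x):
    """Two-band QMF split, decimated by 2 (x must have even length)."""
    lo = (x[0::2] + x[1::2]) / np.sqrt(2)   # lowpass branch
    hi = (x[0::2] - x[1::2]) / np.sqrt(2)   # highpass branch
    return lo, hi

def haar_synthesis(lo, hi):
    """Perfect-reconstruction synthesis for the Haar pair."""
    x = np.empty(2 * len(lo))
    x[0::2] = (lo + hi) / np.sqrt(2)
    x[1::2] = (lo - hi) / np.sqrt(2)
    return x

x = np.random.randn(16)
lo, hi = haar_analysis(x)
assert np.allclose(haar_synthesis(lo, hi), x)   # exact reconstruction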




LINEAR PREDICTION (INTRODUCTION):

 The object of linear prediction is to estimate
 the output sequence from a linear combination
 of input samples, past output samples, or both:

    ŷ(n) = Σ_{j=0}^{q} b(j) x(n-j) - Σ_{i=1}^{p} a(i) y(n-i)

   The factors a(i) and b(j) are called predictor
   coefficients.




LINEAR PREDICTION (INTRODUCTION):
 Many systems of interest to us are describable by a
 linear, constant-coefficient difference equation:

    Σ_{i=0}^{p} a(i) y(n-i) = Σ_{j=0}^{q} b(j) x(n-j)

 If Y(z)/X(z) = H(z), where H(z) is a ratio of polynomials
 N(z)/D(z), then

    N(z) = Σ_{j=0}^{q} b(j) z^{-j}   and   D(z) = Σ_{i=0}^{p} a(i) z^{-i}

   Thus the predictor coefficients give us immediate access to the
   poles and zeros of H(z).




LINEAR PREDICTION (TYPES OF SYSTEM MODEL):

 There are two important variants:
   All-pole model (in statistics, the autoregressive (AR) model):
      The numerator N(z) is a constant.
   All-zero model (in statistics, the moving-average (MA) model):
      The denominator D(z) is equal to unity.
   The mixed pole-zero model is called the
   autoregressive moving-average (ARMA) model.




LINEAR PREDICTION (DERIVATION OF LP EQUATIONS):

  Given a zero-mean signal y(n), in the AR model:

       ŷ(n) = -Σ_{i=1}^{p} a(i) y(n-i)

      The error is:

       e(n) = y(n) - ŷ(n)
            = Σ_{i=0}^{p} a(i) y(n-i),   with a(0) = 1

      To derive the predictor we use the orthogonality
      principle, which states that the desired
      coefficients are those which make the error orthogonal
      to the samples y(n-1), y(n-2), ..., y(n-p).




LINEAR PREDICTION (DERIVATION OF LP EQUATIONS):
    Thus we require that

       <y(n-j) e(n)> = 0   for j = 1, 2, ..., p

       or,

       <y(n-j) Σ_{i=0}^{p} a(i) y(n-i)> = 0

       Interchanging the operations of averaging and summing,
       and representing < > by summing over n, we have

       Σ_{i=0}^{p} a(i) Σ_n y(n-i) y(n-j) = 0,   j = 1, ..., p

       The required predictors are found by solving these
       equations.




LINEAR PREDICTION (DERIVATION OF LP EQUATIONS):
    The orthogonality principle also states that the resulting
    minimum error is given by

       E = <e²(n)> = <y(n) e(n)>

       or,

       Σ_{i=0}^{p} a(i) Σ_n y(n-i) y(n) = E

    We can minimize the error over all time:

       Σ_{i=0}^{p} a(i) r_{i-j} = 0,   j = 1, 2, ..., p

       Σ_{i=0}^{p} a(i) r_i = E

       where r_i = Σ_{n=-∞}^{∞} y(n) y(n-i)
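These normal equations are Toeplitz and are solved efficiently by the Levinson-Durbin recursion named in the syllabus; here is a compact NumPy sketch of the autocorrelation method (my own implementation of the standard recursion):

import numpy as np

def lpc_levinson(y, p):
    """Solve sum_i a(i) r_{i-j} = 0 (j = 1..p, a(0) = 1); return (a, E)."""
    r = np.correlate(y, y, mode="full")[len(y) - 1 : len(y) + p]  # r_0..r_p
    a = np.zeros(p + 1)
    a[0] = 1.0
    E = r[0]
    for i in range(1, p + 1):
        k = -np.dot(a[:i], r[i:0:-1]) / E   # reflection coefficient
        a[1 : i + 1] += k * a[i - 1 :: -1]  # order-update of a(1..i)
        E *= 1.0 - k * k                    # updated minimum error
    return a, E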




LINEAR PREDICTION (APPLICATIONS):

 Autocorrelation matching:
    We have a signal y(n) with known autocorrelation r_yy(n).
    We model this with the AR system shown below: an excitation
    e(n), scaled by the gain σ, drives the all-pole filter 1/A(z)
    to produce y(n), where

     H(z) = σ / A(z) = σ / (1 - Σ_{i=1}^{p} a_i z^{-i})




LINEAR PREDICTION (ORDER OF LINEAR PREDICTION):
  The choice of predictor order depends on the
  analysis bandwidth. The rule of thumb is:

       p = 2·BW/1000 + c

    For a normal vocal tract, there is on average
    about one formant per kilohertz of bandwidth.
    One formant requires two complex-conjugate poles.
    Hence for every formant we require two predictor
    coefficients, i.e., two coefficients per kilohertz of
    bandwidth.




LINEAR PREDICTION (AR MODELING OF SPEECH SIGNAL):

   True model:
     A voiced branch (a DT impulse generator controlled by the pitch,
     followed by the glottal filter G(z)) produces the voiced volume
     velocity u(n); an unvoiced branch uses an uncorrelated noise
     generator. A V/U switch selects the source, which, scaled by a
     gain, excites the vocal tract filter H(z) followed by the lip
     radiation filter R(z) to produce the speech signal s(n).




LINEAR PREDICTION (AR MODELING OF SPEECH SIGNAL):

  Using LP analysis:
    The same V/U switch selects between the voiced DT impulse
    generator (pitch-controlled) and a white-noise generator, but the
    glottal, vocal tract, and radiation filters are lumped into a single
    all-pole (AR) filter H(z) with an estimated gain, producing the
    speech signal s(n).
More Related Content

What's hot

Speech Recognition
Speech Recognition Speech Recognition
Speech Recognition Goa App
 
Speaker recognition using MFCC
Speaker recognition using MFCCSpeaker recognition using MFCC
Speaker recognition using MFCCHira Shaukat
 
Speech Recognition System By Matlab
Speech Recognition System By MatlabSpeech Recognition System By Matlab
Speech Recognition System By MatlabAnkit Gujrati
 
Power delay profile,delay spread and doppler spread
Power delay profile,delay spread and doppler spreadPower delay profile,delay spread and doppler spread
Power delay profile,delay spread and doppler spreadManish Srivastava
 
4.4 diversity combining techniques
4.4   diversity combining techniques4.4   diversity combining techniques
4.4 diversity combining techniquesJAIGANESH SEKAR
 
Linear Predictive Coding
Linear Predictive CodingLinear Predictive Coding
Linear Predictive CodingSrishti Kakade
 
3.Frequency Domain Representation of Signals and Systems
3.Frequency Domain Representation of Signals and Systems3.Frequency Domain Representation of Signals and Systems
3.Frequency Domain Representation of Signals and SystemsINDIAN NAVY
 
Small Scale Multi path measurements
Small Scale Multi path measurements Small Scale Multi path measurements
Small Scale Multi path measurements Siva Ganesan
 
Introduction to Digital Signal Processing
Introduction to Digital Signal ProcessingIntroduction to Digital Signal Processing
Introduction to Digital Signal Processingop205
 
Mel frequency cepstral coefficient (mfcc)
Mel frequency cepstral coefficient (mfcc)Mel frequency cepstral coefficient (mfcc)
Mel frequency cepstral coefficient (mfcc)BushraShaikh44
 
Small scale fading and multipath measurements
Small scale fading and multipath measurementsSmall scale fading and multipath measurements
Small scale fading and multipath measurementsVrince Vimal
 
Speech recognition An overview
Speech recognition An overviewSpeech recognition An overview
Speech recognition An overviewsajanazoya
 
DSP_FOEHU - Lec 05 - Frequency-Domain Representation of Discrete Time Signals
DSP_FOEHU - Lec 05 - Frequency-Domain Representation of Discrete Time SignalsDSP_FOEHU - Lec 05 - Frequency-Domain Representation of Discrete Time Signals
DSP_FOEHU - Lec 05 - Frequency-Domain Representation of Discrete Time SignalsAmr E. Mohamed
 
L 1 5 sampling quantizing encoding pcm
L 1 5 sampling quantizing encoding pcmL 1 5 sampling quantizing encoding pcm
L 1 5 sampling quantizing encoding pcmDEEPIKA KAMBOJ
 

What's hot (20)

Speech Recognition
Speech Recognition Speech Recognition
Speech Recognition
 
Matched filter
Matched filterMatched filter
Matched filter
 
Subband Coding
Subband CodingSubband Coding
Subband Coding
 
Speaker recognition using MFCC
Speaker recognition using MFCCSpeaker recognition using MFCC
Speaker recognition using MFCC
 
SPEECH CODING
SPEECH CODINGSPEECH CODING
SPEECH CODING
 
Adaptive equalization
Adaptive equalizationAdaptive equalization
Adaptive equalization
 
Adaptive filter
Adaptive filterAdaptive filter
Adaptive filter
 
Speech Recognition System By Matlab
Speech Recognition System By MatlabSpeech Recognition System By Matlab
Speech Recognition System By Matlab
 
Power delay profile,delay spread and doppler spread
Power delay profile,delay spread and doppler spreadPower delay profile,delay spread and doppler spread
Power delay profile,delay spread and doppler spread
 
4.4 diversity combining techniques
4.4   diversity combining techniques4.4   diversity combining techniques
4.4 diversity combining techniques
 
Linear Predictive Coding
Linear Predictive CodingLinear Predictive Coding
Linear Predictive Coding
 
3.Frequency Domain Representation of Signals and Systems
3.Frequency Domain Representation of Signals and Systems3.Frequency Domain Representation of Signals and Systems
3.Frequency Domain Representation of Signals and Systems
 
Small Scale Multi path measurements
Small Scale Multi path measurements Small Scale Multi path measurements
Small Scale Multi path measurements
 
Introduction to Digital Signal Processing
Introduction to Digital Signal ProcessingIntroduction to Digital Signal Processing
Introduction to Digital Signal Processing
 
Speech encoding techniques
Speech encoding techniquesSpeech encoding techniques
Speech encoding techniques
 
Mel frequency cepstral coefficient (mfcc)
Mel frequency cepstral coefficient (mfcc)Mel frequency cepstral coefficient (mfcc)
Mel frequency cepstral coefficient (mfcc)
 
Small scale fading and multipath measurements
Small scale fading and multipath measurementsSmall scale fading and multipath measurements
Small scale fading and multipath measurements
 
Speech recognition An overview
Speech recognition An overviewSpeech recognition An overview
Speech recognition An overview
 
DSP_FOEHU - Lec 05 - Frequency-Domain Representation of Discrete Time Signals
DSP_FOEHU - Lec 05 - Frequency-Domain Representation of Discrete Time SignalsDSP_FOEHU - Lec 05 - Frequency-Domain Representation of Discrete Time Signals
DSP_FOEHU - Lec 05 - Frequency-Domain Representation of Discrete Time Signals
 
L 1 5 sampling quantizing encoding pcm
L 1 5 sampling quantizing encoding pcmL 1 5 sampling quantizing encoding pcm
L 1 5 sampling quantizing encoding pcm
 

Similar to Speech signal processing lizy

SodaBottles-licensing Copyright-Fix.pdf
SodaBottles-licensing Copyright-Fix.pdfSodaBottles-licensing Copyright-Fix.pdf
SodaBottles-licensing Copyright-Fix.pdfNga Trinh
 
phonetics and phonology
phonetics and phonologyphonetics and phonology
phonetics and phonologyWu Heping
 
Lec 6 phonetics
Lec 6 phoneticsLec 6 phonetics
Lec 6 phoneticsAnshita111
 
Week 3& 4 phonetics and phonology
Week 3& 4 phonetics and phonologyWeek 3& 4 phonetics and phonology
Week 3& 4 phonetics and phonologyzouhirgabsi
 
Presentation 2 phonetic in prosthodontic
Presentation 2 phonetic in prosthodonticPresentation 2 phonetic in prosthodontic
Presentation 2 phonetic in prosthodonticPratik Hodar
 
Phonetic and phonology pp2
Phonetic and phonology pp2Phonetic and phonology pp2
Phonetic and phonology pp2zhian fadhil
 
Phonetic and phonology pp2
Phonetic and phonology pp2Phonetic and phonology pp2
Phonetic and phonology pp2zhian asaad
 
Class 09 emerson_phonetics_fall2014_phonemes_allophones_vot_epg
Class 09 emerson_phonetics_fall2014_phonemes_allophones_vot_epgClass 09 emerson_phonetics_fall2014_phonemes_allophones_vot_epg
Class 09 emerson_phonetics_fall2014_phonemes_allophones_vot_epgLisa Lavoie
 
C:\Fakepath\Phonetics&amp;Phonology
C:\Fakepath\Phonetics&amp;PhonologyC:\Fakepath\Phonetics&amp;Phonology
C:\Fakepath\Phonetics&amp;Phonologymaryrosearg
 
Principal characteristics of speech
Principal characteristics of speechPrincipal characteristics of speech
Principal characteristics of speechNikolay Karpov
 
Eng phon. 1st bim video lesson 1
Eng phon. 1st bim video  lesson 1Eng phon. 1st bim video  lesson 1
Eng phon. 1st bim video lesson 1UTPL UTPL
 

Similar to Speech signal processing lizy (20)

Part1 speech basics
Part1 speech basicsPart1 speech basics
Part1 speech basics
 
B110512
B110512B110512
B110512
 
SodaBottles-licensing Copyright-Fix.pdf
SodaBottles-licensing Copyright-Fix.pdfSodaBottles-licensing Copyright-Fix.pdf
SodaBottles-licensing Copyright-Fix.pdf
 
Phonetics
PhoneticsPhonetics
Phonetics
 
phonetics and phonology
phonetics and phonologyphonetics and phonology
phonetics and phonology
 
Lec 6 phonetics
Lec 6 phoneticsLec 6 phonetics
Lec 6 phonetics
 
Phonetics
PhoneticsPhonetics
Phonetics
 
Phonitics In Complete Denture (with animations)
 Phonitics In Complete Denture   (with animations) Phonitics In Complete Denture   (with animations)
Phonitics In Complete Denture (with animations)
 
speech and phonetics
speech and phoneticsspeech and phonetics
speech and phonetics
 
Slideshare
SlideshareSlideshare
Slideshare
 
Week 3& 4 phonetics and phonology
Week 3& 4 phonetics and phonologyWeek 3& 4 phonetics and phonology
Week 3& 4 phonetics and phonology
 
Presentation 2 phonetic in prosthodontic
Presentation 2 phonetic in prosthodonticPresentation 2 phonetic in prosthodontic
Presentation 2 phonetic in prosthodontic
 
Phonetic and phonology pp2
Phonetic and phonology pp2Phonetic and phonology pp2
Phonetic and phonology pp2
 
Phonetic and phonology pp2
Phonetic and phonology pp2Phonetic and phonology pp2
Phonetic and phonology pp2
 
Phonetics and phonology
Phonetics and phonologyPhonetics and phonology
Phonetics and phonology
 
Intro phonetics
Intro phoneticsIntro phonetics
Intro phonetics
 
Class 09 emerson_phonetics_fall2014_phonemes_allophones_vot_epg
Class 09 emerson_phonetics_fall2014_phonemes_allophones_vot_epgClass 09 emerson_phonetics_fall2014_phonemes_allophones_vot_epg
Class 09 emerson_phonetics_fall2014_phonemes_allophones_vot_epg
 
C:\Fakepath\Phonetics&amp;Phonology
C:\Fakepath\Phonetics&amp;PhonologyC:\Fakepath\Phonetics&amp;Phonology
C:\Fakepath\Phonetics&amp;Phonology
 
Principal characteristics of speech
Principal characteristics of speechPrincipal characteristics of speech
Principal characteristics of speech
 
Eng phon. 1st bim video lesson 1
Eng phon. 1st bim video  lesson 1Eng phon. 1st bim video  lesson 1
Eng phon. 1st bim video lesson 1
 

Recently uploaded

Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Speech signal processing lizy

  • 5. Speech Processing draws on several neighbouring fields (diagram): Algorithms (programming); Acoustics (psychoacoustics, room acoustics, speech production); Signal Processing (Fourier transforms, discrete-time filters, statistical SP, AR(MA) models, stochastic models); Information Theory (entropy, communication theory, rate-distortion theory); Phonetics. 5
  • 8. HOW IS SPEECH PRODUCED? Speech can be defined as "an acoustic pressure signal that is articulated in the vocal tract." Speech is produced when air is forced from the lungs through the vocal cords and along the vocal tract. 8
  • 9. This air flow is referred to as the "excitation signal". The excitation signal causes the vocal cords to vibrate and propagates the energy to excite the oral and nasal openings, which play a major role in shaping the sound produced. Vocal tract components: – Oral tract: from the lips to the vocal cords. – Nasal tract: from the velum to the nostrils. 9
  • 12. Larynx: the source of speech • Vocal cords (folds): the two folds of tissue in the larynx. They can open and shut like a pair of fans. • Glottis: the gap between the vocal cords. As air is forced through the glottis the vocal cords will start to vibrate and modulate the air flow. • The frequency of vibration determines the pitch of the voice (for a male, 50-200Hz; for a female, up to 500Hz). 12
  • 14. Places of articulation: alveolar, post-alveolar/palatal, dental, velar, uvular, labial, pharyngeal, laryngeal/glottal. 14
  • 15. Classes of speech sounds. Voiced sounds: the vocal cords vibrate open and closed, producing quasi-periodic pulses of air; the rate of opening and closing determines the pitch. Unvoiced sounds: air is forced at high velocity through a constriction, producing noise-like turbulence; they show little long-term periodicity, though short-term correlations are still present (e.g., "S", "F"). Plosive sounds: a complete closure in the vocal tract; air pressure is built up and released suddenly (e.g., "B", "P"). 15
  • 17. SPEECH SOUNDS Coarse classification is done with phonemes. A phone is the acoustic realization of a phoneme. Allophones are context-dependent variants of a phoneme. 17
  • 18. PHONEME HIERARCHY Speech sounds are language dependent; there are about 50 phonemes in English. Vowels and diphthongs: iy, ih, ae, aa, ay, ey, ah, ao, ax, eh, oy, aw, er, ow, uh, uw. Consonants: lateral liquid (l); glides (w, y); retroflex liquid (r); plosives (p, b, t, d, k, g); nasals (m, n, ng); fricatives (f, v, th, dh, s, z, sh, zh, h). 18
  • 21. Sounds like /SH/ and /S/ look like (spectrally shaped) random noise, while the vowel sounds /UH/, /IY/, and /EY/ are highly structured and quasi-periodic. These differences result from the distinctively different ways that these sounds are produced. 21
  • 23. Vowel Chart (front-center-back vs. high-mid-low): High: front i, ɪ; back u, ʊ. Mid: front e, ɛ; center ə, ʌ; back o. Low: front æ; back a.
  • 25. SPEECH WAVEFORM CHARACTERISTICS: loudness, voiced/unvoiced, pitch, fundamental frequency, spectral envelope, formants. 25
  • 26. Acoustic characteristics of speech. Pitch: the signal within each voiced interval is periodic, with period T called the pitch period. The pitch depends on the vowel being spoken and changes over time (T ≈ 70 samples in this example). f0 = 1/T is the fundamental frequency (not to be confused with the formant frequencies). 26
  • 27. FORMANTS Formants can be recognized in the frequency content of the signal segment. Formants are best described as high energy peaks in the frequency spectrum of speech sound. 27
  • 28. The resonant frequencies of the vocal tract are called formant frequencies or simply formants. The peaks of the spectrum of the vocal tract response correspond approximately to its formants. Under the linear time-invariant all-pole assumption, each vocal tract shape is characterized by a collection of formants. 28
  • 29. Because the vocal tract is assumed stable with poles inside the unit circle, the vocal tract transfer function can be expressed either in product or partial fraction expansion form: 29
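The transfer function itself appears only as an image on the slide; for reference, the standard all-pole forms (with gain G, poles c_k, and residues A_k assumed as symbols) are:

$$H(z) = \frac{G}{\prod_{k=1}^{p}\left(1 - c_k z^{-1}\right)} = \sum_{k=1}^{p} \frac{A_k}{1 - c_k z^{-1}}$$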
  • 31. A detailed acoustic theory must consider the effects of the following: • Time variation of the vocal tract shape • Losses due to heat conduction and viscous friction at the vocal tract walls • Softness of the vocal tract walls • Radiation of sound at the lips • Nasal coupling • Excitation of sound in the vocal tract Let us begin by considering a simple case of a lossless tube: 31
  • 32. MULTI-TUBE APPROXIMATION OF THE VOCAL TRACT We can represent the vocal tract as a concatenation of N lossless tubes with areas {Ak} and equal length ∆x = l/N. The wave propagation time through each tube is τ = ∆x/c = l/(Nc). 32
  • 34. Consider an N-tube model as in the previous figure. Each tube has length lk and cross-sectional area Ak. Assume no losses and planar wave propagation. The wave equations for section k hold for 0 ≤ x ≤ lk. 34
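The wave equations are not captured in the transcript; the standard traveling-wave solution for a lossless section k (with forward/backward volume-velocity waves u_k^± and air density ρ, as in Rabiner & Schafer) is:

$$u_k(x,t) = u_k^{+}\left(t - \frac{x}{c}\right) - u_k^{-}\left(t + \frac{x}{c}\right), \qquad p_k(x,t) = \frac{\rho c}{A_k}\left[u_k^{+}\left(t - \frac{x}{c}\right) + u_k^{-}\left(t + \frac{x}{c}\right)\right]$$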
  • 36. SOUND PROPAGATION IN THE CONCATENATED TUBE MODEL Boundary conditions follow from the physical principle of continuity: pressure and volume velocity must be continuous both in time and in space everywhere in the system. At the junction between the kth and (k+1)st tubes we have: 36
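The junction equations are an image on the slide; in the standard formulation the continuity conditions and the resulting reflection coefficient at the kth junction are:

$$p_k(l_k, t) = p_{k+1}(0, t), \qquad u_k(l_k, t) = u_{k+1}(0, t), \qquad r_k = \frac{A_{k+1} - A_k}{A_{k+1} + A_k}$$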
  • 37. ANALOGY WITH ELECTRICAL CIRCUIT TRANSMISSION LINE 37
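The slide itself is a figure; the correspondences it depicts are the standard ones: acoustic pressure p ↔ voltage, volume velocity u ↔ current, acoustic inductance ρ/A ↔ inductance per unit length, and acoustic capacitance A/(ρc²) ↔ capacitance per unit length.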
  • 38. PROPAGATION OF SOUND IN A UNIFORM TUBE The vocal tract transfer function between volume velocities is: 38
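The equation is an image on the slide; the standard result (this is the eqn. 4.18 of Quatieri referred to on the next slide) is:

$$T(\Omega) = \frac{U(l,\Omega)}{U_G(\Omega)} = \frac{1}{\cos(\Omega l / c)}$$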
  • 39. PROPAGATION OF SOUND IN A UNIFORM TUBE Using the boundary conditions U(0,s) = UG(s) and P(−l,s) = 0 (derivation in the Quatieri text, pages 122-125), the poles of the transfer function T(jΩ) occur where cos(Ωl/c) = 0. (Quatieri, pages 119-124: the derivation of eqn. 4.18 is important.) 39
  • 40. PROPAGATION OF SOUND IN A UNIFORM TUBE (CONT'D) For c = 34,000 cm/sec and l = 17 cm, the natural frequencies (also called the formants) are at 500 Hz, 1500 Hz, 2500 Hz, … The transfer function of a tube with no side branches, excited at one end with the response measured at the other, has only poles. The formant frequencies acquire finite bandwidth when vocal tract losses are considered (e.g., radiation, walls, viscosity, heat). The length of the vocal tract, l, corresponds to λ1/4, 3λ2/4, 5λ3/4, …, where λi is the wavelength of the ith natural frequency. 40
  • 41. UNIFORM TUBE MODEL Example: Consider a uniform tube of length l = 35 cm. If the speed of sound is 350 m/s, calculate its resonances in Hz, and compare them with a tube of length l = 17.5 cm. The resonances occur at Ω = kπc/(2l), k = 1, 3, 5, …, so f = Ω/(2π) = kc/(4l) = k · 350/(4 × 0.35) = 250k, giving f = 250, 750, 1250, … Hz. 41
  • 42. UNIFORM TUBE MODEL For the 17.5 cm tube: f = Ω/(2π) = kc/(4l) = k · 350/(4 × 0.175) = 500k, k = 1, 3, 5, …, giving f = 500, 1500, 2500, … Hz. 42
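As a quick check of these two examples, here is a minimal Python sketch of the odd quarter-wavelength resonance formula (the function name and defaults are illustrative):

```python
def tube_resonances(length_m, c=350.0, n=3):
    """First n resonances (Hz) of a uniform lossless tube, closed at the
    glottis and open at the lips: f_k = (2k - 1) * c / (4 * l)."""
    return [(2 * k - 1) * c / (4.0 * length_m) for k in range(1, n + 1)]

print(tube_resonances(0.35))    # [250.0, 750.0, 1250.0]
print(tube_resonances(0.175))   # [500.0, 1500.0, 2500.0]
```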
  • 46. VOWELS Modeled as a tube closed at one end and open at the other; the closure is a membrane with a slit in it; the tube has uniform cross-sectional area; the membrane represents the source of energy (the vocal folds); the energy travels through the tube; the tube generates no energy on its own; the tube represents an important class of resonators, with the odd quarter-wavelength relationship Fn = (2n − 1)c/(4l).
  • 48. VOWELS Filter characteristics for vowels: the vocal tract is a dynamic filter; it is frequency dependent; it has, theoretically, an infinite number of resonances; each resonance has a center frequency, an amplitude, and a bandwidth; for speech, these resonances are called formants; formants are numbered in succession from the lowest: F1, F2, F3, etc.
  • 49. Fricatives Modeled as a tube with a very severe constriction. The air exiting the constriction is turbulent; because of the turbulence, there is no periodicity unless the sound is accompanied by voicing.
  • 50. When a fricative constriction is tapered, the back cavity is involved; this resembles a tube closed at both ends, with resonances Fn = nc/(2l). Such a situation occurs primarily in articulation disorders.
  • 51. Introduction to Digital Speech Processing (Rabiner & Schafer), pp. 20-23. 51
  • 53. Rabiner & Schafer, pp. 98-105. 53
  • 55. SOUND SOURCE: VOCAL FOLD VIBRATION Modeled as a volume velocity source at the glottis, UG(jΩ). 55
  • 57. SHORT-TIME SPEECH ANALYSIS Segments (or frames, or vectors) are typically of length 20 ms. Speech characteristics are approximately constant over such short intervals, which allows for relatively simple modeling. Often overlapping segments are extracted. 57
  • 59. The system is an all-pole system; its system function, and the difference equation relating input and output of an all-pole linear system, take the forms shown below. 59
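Both forms are images on the slide; in the standard notation (excitation e[n], gain G, predictor coefficients a_k assumed as symbols) they read:

$$H(z) = \frac{S(z)}{E(z)} = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}}, \qquad s[n] = \sum_{k=1}^{p} a_k\, s[n-k] + G\, e[n]$$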
  • 61. The operator T{ } defines the nature of the short-time analysis function, and w[ˆn − m] represents a time-shifted window sequence. 61
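The general short-time analysis equation referred to here is, in the Rabiner & Schafer form:

$$Q_{\hat{n}} = \sum_{m=-\infty}^{\infty} T\{x[m]\}\, w[\hat{n} - m]$$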
  • 63. SHORT-TIME ENERGY Simple to compute, and useful for estimating properties of the excitation function in the model. In this case the operator T{ } is simply squaring the windowed samples. 63
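With T{·} the squaring operator, the short-time energy (reconstructed in the same notation) is:

$$E_{\hat{n}} = \sum_{m=-\infty}^{\infty} \left(x[m]\, w[\hat{n} - m]\right)^{2}$$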
  • 64. SHORT-TIME ZERO-CROSSING RATE Weighted average of the number of times the speech signal changes sign within the time window. Representing this operator in terms of linear filtering leads to: 64
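The equation is an image on the slide; the standard form, which the next slide then interprets term by term, is:

$$Z_{\hat{n}} = \sum_{m=-\infty}^{\infty} \frac{1}{2}\left|\operatorname{sgn}\{x[m]\} - \operatorname{sgn}\{x[m-1]\}\right| w[\hat{n} - m]$$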
  • 65. Since 0.5|sgn{x[m]} − sgn{x[m − 1]}| is equal to 1 if x[m] and x[m − 1] have different algebraic signs and 0 if they have the same sign, the short-time zero-crossing rate is a weighted sum of all the instances of alternating sign (zero-crossings) that fall within the support region of the shifted window w[ˆn − m]. 65
  • 66. The figure shows an example of the short-time energy and zero-crossing rate for a segment of speech with a transition from unvoiced to voiced speech. In both cases, the window is a Hamming window of duration 25 ms (equivalent to 401 samples at a 16 kHz sampling rate). Thus, both the short-time energy and the short-time zero-crossing rate are outputs of a lowpass filter whose frequency response is as shown. 66
  • 67. Short time energy and zero-crossing rate functions are slowly varying compared to the time variations of the speech signal, and therefore, they can be sampled at a much lower rate than that of the original speech signal. For finite-length windows like the Hamming window, this reduction of the sampling rate is accomplished by moving the window position ˆn in jumps of more than one sample 67
  • 68. During the unvoiced interval, the zero-crossing rate is relatively high compared to the zero-crossing rate in the voiced interval. Conversely, the energy is relatively low in the unvoiced region compared to the energy in the voiced region. 68
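A minimal NumPy sketch of frame-wise short-time energy and zero-crossing rate, under the 25 ms Hamming window / 16 kHz assumptions of the previous slides (the hop of 100 samples is an illustrative choice):

```python
import numpy as np

def short_time_energy_zcr(x, frame_len=400, hop=100):
    """Per-frame short-time energy and windowed zero-crossing rate."""
    w = np.hamming(frame_len)
    energy, zcr = [], []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len]
        energy.append(np.sum((frame * w) ** 2))
        signs = np.sign(frame)
        zcr.append(0.5 * np.sum(np.abs(np.diff(signs)) * w[1:]) / frame_len)
    return np.array(energy), np.array(zcr)

# Rule of thumb from the slides: high ZCR + low energy -> unvoiced;
# low ZCR + high energy -> voiced.
```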
  • 69. SHORT-TIME AUTOCORRELATION FUNCTION (STACF) The autocorrelation function is often used as a means of detecting periodicity in signals, and it is also the basis for many spectrum analysis methods. STACF is defined as the deterministic autocorrelation function of the sequence xˆn[m] = x[m]w[ˆn − m] that is selected by the window shifted to time ˆn, i.e., 69
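The defining equation (reconstructed in the same notation) is:

$$\hat{R}_{\hat{n}}[k] = \sum_{m=-\infty}^{\infty} x_{\hat{n}}[m]\, x_{\hat{n}}[m+k], \qquad x_{\hat{n}}[m] = x[m]\, w[\hat{n} - m]$$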
  • 71. e[n] is the excitation to the linear system with impulse response h[n]. A well-known, and easily proved, property of the autocorrelation function is that the autocorrelation function of s[n] = e[n] ∗ h[n] is the convolution of the autocorrelation functions of e[n] and h[n], i.e., Rs[k] = Re[k] ∗ Rh[k]. 71
  • 73. SHORT-TIME FOURIER TRANSFORM (STFT) The expression for the discrete-time STFT at time n is given below, where w[n] is assumed to be non-zero only in the interval [0, Nw − 1] and is referred to as the analysis window or sometimes as the analysis filter. 73
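The expression itself is an image on the slide; in the standard form it is:

$$X(n, \omega) = \sum_{m=-\infty}^{\infty} x[m]\, w[n - m]\, e^{-j\omega m}$$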
  • 78. SHORT-TIME SYNTHESIS The problem of obtaining a sequence back from its discrete-time STFT. The equation on the slide represents a synthesis equation for the discrete-time STFT. 78
  • 79. FILTER BANK SUMMATION (FBS) METHOD The discrete STFT is considered to be the set of outputs of a bank of filters. The output of each filter is modulated with a complex exponential, and these modulated filter outputs are summed at each instant of time to obtain the corresponding time sample of the original sequence. That is, given a discrete STFT X(n, k), the FBS method synthesizes a sequence y(n) satisfying the following equation: 79
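The FBS synthesis equation is an image on the slide; the standard result (assuming an N-point DFT with N at least the window length and w[0] ≠ 0) is:

$$y[n] = \frac{1}{N\, w[0]} \sum_{k=0}^{N-1} X(n, k)\, e^{\,j \frac{2\pi}{N} k n} = x[n]$$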
  • 84. OVERLAP-ADD (OLA) METHOD Just as the FBS method was motivated by the filtering view of the STFT, the OLA method is motivated by the Fourier transform view of the STFT. In this method, for each fixed time, we take the inverse DFT of the corresponding frequency function. However, instead of dividing out the analysis window from each of the resulting short-time sections, we perform an overlap-add operation between the short-time sections. 84
  • 85. Given a discrete STFT X(n, k), the OLA method synthesizes a sequence y[n] given by: 85
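The synthesis equation is an image on the slide; with frame interval L and N-point inverse DFTs it has the standard form:

$$y[n] = \sum_{p=-\infty}^{\infty} \left[\frac{1}{N} \sum_{k=0}^{N-1} X(pL, k)\, e^{\,j\frac{2\pi}{N}kn}\right] = x[n] \sum_{p=-\infty}^{\infty} w[pL - n]$$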
  • 87. Furthermore, if the discrete STFT had been decimated in time by a factor L, it can similarly be shown that perfect reconstruction (to within a scale factor) holds if the analysis window satisfies the condition below. 87
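The condition is an image on the slide; the standard requirement and the resulting reconstruction are:

$$\sum_{p=-\infty}^{\infty} w[pL - n] = \frac{W(e^{j0})}{L} \;\; \text{for all } n \qquad \Longrightarrow \qquad y[n] = \frac{W(e^{j0})}{L}\, x[n]$$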
  • 89. DESIGN OF DIGITAL FILTER BANKS (Rabiner & Schafer, pp. 282-297) 89
  • 101. FILTER BANK ANALYSIS AND SYNTHESIS 101
  • 104. FBS synthesis results in multiple copies of the input: 104
  • 105. PHASE VOCODER The Fourier series is computed over a sliding window of a single pitch period in duration and provides a measure of the amplitude and frequency trajectories of the musical tones. 105
  • 108. ...which can be interpreted as a real sinewave that is amplitude- and phase-modulated by the STFT, the "carrier" of the latter being the kth filter's center frequency. The STFT of a continuous-time signal is defined as: 108
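The defining integral is an image on the slide; the standard continuous-time STFT is:

$$X(t, \Omega) = \int_{-\infty}^{\infty} x(\tau)\, w(\tau - t)\, e^{-j\Omega\tau}\, d\tau$$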
  • 110. ...where the integration constant is an initial condition. The amplitude signal is likewise referred to as the instantaneous amplitude for each channel. The resulting filter-bank output is a sinewave with, in general, time-varying amplitude and frequency modulation. An alternative expression is: 110
  • 111. which is the time-domain counterpart to the frequency-domain phase derivative. 111
  • 112. We can sample the continuous-time STFT, with sampling interval T, to obtain the discrete-time STFT. 112
  • 123. HOMOMORPHIC (CEPSTRAL) SPEECH ANALYSIS The use of the short-time cepstrum as a representation of speech and as a basis for estimating the parameters of the speech generation model. The cepstrum of a discrete-time signal is defined as follows: 123
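The definition is an image on the slide; the complex cepstrum is conventionally written as the inverse transform of the complex logarithm of the Fourier transform, which turns convolution into addition:

$$\hat{x}[n] = \frac{1}{2\pi} \int_{-\pi}^{\pi} \log\!\left[X(e^{j\omega})\right] e^{\,j\omega n}\, d\omega, \qquad x = x_1 * x_2 \;\Rightarrow\; \hat{x} = \hat{x}_1 + \hat{x}_2$$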
  • 125. That is, the complex cepstrum operator transforms convolution into addition. This property, is what makes the cepstrum useful for speech analysis, since the model for speech production involves convolution of the excitation with the vocal tract impulse response, and our goal is often to separate the excitation signal from the vocal tract signal. 125
  • 126. The key issue in the definition and computation of the complex cepstrum is the computation of the complex logarithm, i.e., the computation of the phase angle arg[X(e^jω)], which must be done so as to preserve an additive combination of phases for two signals combined by convolution. 126
  • 127. THE SHORT-TIME CEPSTRUM The short-time cepstrum is a sequence of cepstra of windowed finite-duration segments of the speech waveform. 127
  • 129. RECURSIVE COMPUTATION OF THE COMPLEX CEPSTRUM Another approach to computing the complex cepstrum applies only to minimum-phase signals, i.e., signals having a z-transform whose poles and zeros all lie inside the unit circle. An example would be the impulse response of an all-pole vocal tract model. 129
  • 130. In this case, all the poles ck must be inside the unit circle for stability of the system. 130
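The recursion itself is an image on the slide; for a minimum-phase sequence h[n] the standard form (as in Oppenheim & Schafer) is:

$$\hat{h}[0] = \log h[0], \qquad \hat{h}[n] = \frac{h[n]}{h[0]} - \sum_{k=1}^{n-1} \left(\frac{k}{n}\right) \hat{h}[k]\, \frac{h[n-k]}{h[0]}, \quad n \geq 1$$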
  • 131. SHORT-TIME HOMOMORPHIC FILTERING OF SPEECH (Rabiner & Schafer, p. 63) 131
  • 132. The low quefrency part of the cepstrum is expected to be representative of the slow variations (with frequency) in the log spectrum, while the high quefrency components would correspond to the more rapid fluctuations of the log spectrum. 132
  • 133. the spectrum for the voiced segment has a structure of periodic ripples due to the harmonic structure of the quasi-periodic segment of voiced speech. This periodic structure in the log spectrum manifests itself in the cepstrum peak at a quefrency of about 9ms. The existence of this peak in the quefrency range of expected pitch periods strongly signals voiced speech. Furthermore, the quefrency of the peak is an accurate estimate of the pitch period during the corresponding speech interval. the autocorrelation function also displays an indication of periodicity, but not nearly as unambiguously as does the cepstrum. But the rapid variations of the unvoiced spectra appear random with no periodic structure. As a result, there is no strong peak indicating periodicity as in the voiced case. 133
  • 134. These slowly varying log spectra clearly retain the general spectral shape with peaks corresponding to the formant resonance structure for the segment of speech under analysis. 134
  • 135. APPLICATION TO PITCH DETECTION The cepstrum was first applied in speech processing to determine the excitation parameters for the discrete-time speech model. The successive spectra and cepstra are for 50 ms segments obtained by moving the window in steps of 12.5 ms (100 samples at a sampling rate of 8000 samples/sec). 135
  • 136. for the positions 1 through 5, the window includes only unvoiced speech for positions 6 and 7 the signal within the window is partly voiced and partly unvoiced. For positions 8 through 15 the window only includes voiced speech. the rapid variations of the unvoiced spectra appear random with no periodic structure. the spectra for voiced segments have a structure of periodic ripples due to the harmonic structure of the quasi-periodic segment of voiced speech. 136
  • 138. The cepstrum peak at a quefrency of about 11-12 ms strongly signals voiced speech, and the quefrency of the peak is an accurate estimate of the pitch period during the corresponding speech interval. Presence of a strong peak implies voiced speech, and the quefrency location of the peak gives the estimate of the pitch period. 138
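A minimal NumPy sketch of cepstral pitch detection along these lines (the 50-500 Hz search range and the single-frame interface are illustrative assumptions):

```python
import numpy as np

def cepstral_pitch(frame, fs, fmin=50.0, fmax=500.0):
    """Pitch estimate (Hz) from the real cepstrum of one windowed frame."""
    w = np.hamming(len(frame))
    log_mag = np.log(np.abs(np.fft.rfft(frame * w)) + 1e-12)  # avoid log(0)
    cepstrum = np.fft.irfft(log_mag)
    q_lo = int(fs / fmax)               # shortest plausible pitch period
    q_hi = int(fs / fmin)               # longest plausible pitch period
    peak_q = q_lo + np.argmax(cepstrum[q_lo:q_hi])
    return fs / peak_q                  # quefrency of the peak -> pitch

# A strong peak in cepstrum[q_lo:q_hi] signals voiced speech; a weak or
# absent peak suggests an unvoiced frame.
```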
  • 139. MEL-FREQUENCY CEPSTRUM COEFFICIENTS (MFCC) The idea is to compute a frequency analysis based upon a filter bank with approximately critical-band spacing of the filters and bandwidths. For a 4 kHz bandwidth, approximately 20 filters are used. A short-time Fourier analysis is done first, resulting in a DFT Xˆn[k] for analysis time ˆn. Then the DFT values are grouped together in critical bands and weighted by a triangular weighting function. 139
  • 140. The bandwidths are constant for center frequencies below 1 kHz and then increase exponentially up to half the sampling rate of 4 kHz, resulting in a total of 22 filters. The mel-frequency spectrum at analysis time ˆn is defined for r = 1, 2, …, R as: 140
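The defining equation is an image on the slide; in the Rabiner & Schafer notation, with V_r[k] the rth triangular weighting function supported on DFT bins L_r to U_r, it reads:

$$MF_{\hat{n}}[r] = \frac{1}{A_r} \sum_{k=L_r}^{U_r} \left| V_r[k]\, X_{\hat{n}}[k] \right|^{2}$$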
  • 142. Here Ar is a normalizing factor for the rth mel-filter. For each frame, a discrete cosine transform of the log of the magnitude of the filter outputs is computed to form the function mfccˆn[m], i.e.: 142
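The DCT equation is an image on the slide; the standard form, together with the normalizing factor just mentioned, is:

$$\text{mfcc}_{\hat{n}}[m] = \frac{1}{R} \sum_{r=1}^{R} \log\!\left(MF_{\hat{n}}[r]\right) \cos\!\left[\frac{2\pi}{R}\left(r + \frac{1}{2}\right) m\right], \qquad A_r = \sum_{k=L_r}^{U_r} \left|V_r[k]\right|^{2}$$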
  • 144. The figure shows the result of MFCC analysis of a frame of voiced speech in comparison with the short-time Fourier spectrum, the LPC spectrum, and a homomorphically smoothed spectrum. All these spectra are different, but they have in common that they have peaks at the formant resonances. At higher frequencies, the reconstructed mel-spectrum has more smoothing due to the structure of the filter bank. 144
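In practice, MFCCs are usually obtained from a library rather than built from scratch; a usage sketch with the librosa package (the file name is hypothetical):

```python
import librosa

y, sr = librosa.load("speech.wav", sr=16000)          # hypothetical input
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # 13 coefficients/frame
print(mfcc.shape)                                     # (13, number_of_frames)
```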
  • 145. THE SPEECH SPECTROGRAM Simply a display of the magnitude of the STFT. Specifically, the images in the figure are plots of 20 log10 |XrR[k]| as a function of (tr, fk), where the plot axes are labeled in terms of analog time and frequency through the relations tr = rRT and fk = k/(NT), where T is the sampling period of the discrete-time signal x[n] = xa(nT). 145
  • 146. In order to make the image smooth, R is usually quite small compared to both the window length L and the number of samples in the frequency dimension, N, which may be much larger than the window length L. Such a function of two variables can be plotted on a two-dimensional surface as either a gray-scale or a color-mapped image. The bars on the right calibrate the color map (in dB). 146
  • 148. If the analysis window is short, the spectrogram is called a wide-band spectrogram, which is characterized by good time resolution and poor frequency resolution. When the window length is long, the spectrogram is a narrow-band spectrogram, which is characterized by good frequency resolution and poor time resolution. 148
  • 149. THE SPECTROGRAM • A classic analysis tool. – Consists of DFTs of overlapping, windowed frames. • Displays the distribution of energy in time and frequency. – 10 log10 |Xm(f)|² is typically displayed. 149
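A minimal SciPy sketch of such a display, using a synthetic chirp in place of a speech file (window and overlap values are illustrative):

```python
import numpy as np
from scipy import signal
import matplotlib.pyplot as plt

fs = 16000
t = np.arange(fs) / fs
x = signal.chirp(t, f0=200, t1=1.0, f1=3000)       # synthetic test signal

# Short window -> wide-band spectrogram; long window -> narrow-band.
f, tt, Sxx = signal.spectrogram(x, fs, window="hamming",
                                nperseg=128, noverlap=96)
plt.pcolormesh(tt, f, 10 * np.log10(Sxx + 1e-12))  # 10*log10 |X|^2, in dB
plt.xlabel("Time (s)"); plt.ylabel("Frequency (Hz)")
plt.colorbar(label="dB"); plt.show()
```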
  • 152. Note the three broad peaks in the spectrum slice at time tr = 430 ms, and observe that similar slices would be obtained at other times around tr = 430 ms. These large peaks are representative of the underlying resonances of the vocal tract at the corresponding time in the production of the speech signal. 152
  • 153. The lower spectrogram is not as sensitive to rapid time variations, but the resolution in the frequency dimension is much better. This window length is on the order of several pitch periods of the waveform during voiced intervals. As a result, the spectrogram no longer displays vertically oriented striations since several periods are included in the window. 153
  • 154. SHORT-TIME ACF (figure: short-time autocorrelation functions for the phones /m/, /ow/, and /s/) 154
  • 155. CEPSTRUM Speech wave (X) = excitation (E) ∗ filter (H), where H is the vocal tract filter and E is the glottal excitation from the vocal cords (glottis). (Figure: http://home.hib.no/al/engelsk/seksjon/SOFF-MASTER/ill061.gif) 155
  • 156. CEPSTRAL ANALYSIS The signal s is the convolution (∗) of the glottal excitation e and the vocal tract filter h: s(n) = e(n) ∗ h(n), where n is the time index. After the Fourier transform, FT{s(n)} = FT{e(n) ∗ h(n)}, convolution (∗) becomes multiplication (·): with n (time) → ω (frequency), S(ω) = E(ω) · H(ω). Taking the magnitude of the spectrum, |S(ω)| = |E(ω)| · |H(ω)|, so log10 |S(ω)| = log10 |E(ω)| + log10 |H(ω)|. Ref: http://iitg.vlab.co.in/?sub=59&brch=164&sim=615&cnt=1 156
  • 157. CEPSTRUM C(n) = IDFT[log10 |S(ω)|] = IDFT[log10 |E(ω)| + log10 |H(ω)|]. (Block diagram: s(n) → windowing → DFT → log|X(ω)| → IDFT → C(n), where n is the time index, ω is frequency, and IDFT is the inverse discrete Fourier transform.) In C(n), the excitation and filter components appear at two different quefrency positions. Application: useful for (i) glottal excitation analysis and (ii) vocal tract filter analysis. 157
  • 158. EXAMPLE OF CEPSTRUM Sampling frequency 22.05 kHz. 158
  • 160. The time-decimated subband outputs are quantized and encoded, then decoded at the receiver. In subband coding, a small number of filters with wide and overlapping bandwidths are chosen, and each bandpass filter output is quantized individually. Although the bandpass filters are wide and overlapping, careful design of the filters results in a cancellation of the quantization noise that leaks across bands. 160
  • 161. Quadrature mirror filters are one such filter class; the figure shows an example of a two-band subband coder using two overlapping quadrature mirror filters. Quadrature mirror filters can be further subdivided from high to low by splitting the full band into two, then the resulting lower band into two, and so on. 161
  • 162. This octave-band splitting, together with the iterative decimation, can be shown to yield a perfect reconstruction filter bank. Such octave-band filter banks, and their conditions for perfect reconstruction, are closely related to wavelet analysis/synthesis structures. 162
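A minimal two-band sketch using the Haar pair, the simplest QMF-like split, showing where the per-band quantizers of a subband coder would sit (an illustration of the idea, not the filters used in practice):

```python
import numpy as np

def analysis(x):
    """Split x into low/high bands, each decimated by 2."""
    x = x[: len(x) // 2 * 2]                       # force even length
    return ((x[0::2] + x[1::2]) / np.sqrt(2),      # lowpass band
            (x[0::2] - x[1::2]) / np.sqrt(2))      # highpass band

def synthesis(low, high):
    """Invert analysis(): interleave the bands back into one signal."""
    x = np.empty(2 * len(low))
    x[0::2] = (low + high) / np.sqrt(2)
    x[1::2] = (low - high) / np.sqrt(2)
    return x

x = np.random.randn(1024)
low, high = analysis(x)
# In a coder, each band would be quantized here with its own step size,
# allocating fewer bits to perceptually less important bands.
assert np.allclose(synthesis(low, high), x)        # perfect reconstruction
```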
  • 164. LINEAR PREDICTION (INTRODUCTION): The object of linear prediction is to estimate the output sequence from a linear combination of input samples, past output samples, or both: ŷ(n) = Σ_{j=0}^{q} b(j) x(n − j) − Σ_{i=1}^{p} a(i) y(n − i). The factors a(i) and b(j) are called predictor coefficients.
  • 165. LINEAR PREDICTION (INTRODUCTION): Many systems of interest to us are describable by a linear, constant-coefficient difference equation: Σ_{i=0}^{p} a(i) y(n − i) = Σ_{j=0}^{q} b(j) x(n − j). If Y(z)/X(z) = H(z), where H(z) is a ratio of polynomials N(z)/D(z), then N(z) = Σ_{j=0}^{q} b(j) z^{−j} and D(z) = Σ_{i=0}^{p} a(i) z^{−i}. Thus the predictor coefficients give us immediate access to the poles and zeros of H(z).
  • 166. LINEAR PREDICTION (TYPES OF SYSTEM MODEL): There are two important variants. All-pole model (in statistics, the autoregressive (AR) model): the numerator N(z) is a constant. All-zero model (in statistics, the moving-average (MA) model): the denominator D(z) is equal to unity. The mixed pole-zero model is called the autoregressive moving-average (ARMA) model.
  • 167. LINEAR PREDICTION (DERIVATION OF LP EQUATIONS): Given a zero-mean signal y(n), in the AR model: ŷ(n) = −Σ_{i=1}^{p} a(i) y(n − i). The error is e(n) = y(n) − ŷ(n) = Σ_{i=0}^{p} a(i) y(n − i), with a(0) = 1. To derive the predictor we use the orthogonality principle, which states that the desired coefficients are those which make the error orthogonal to the samples y(n−1), y(n−2), …, y(n−p).
  • 168. LINEAR PREDICTION (DERIVATION OF LP EQUATIONS): Thus we require that ⟨y(n − j) e(n)⟩ = 0 for j = 1, 2, …, p, or ⟨y(n − j) Σ_{i=0}^{p} a(i) y(n − i)⟩ = 0. Interchanging the operations of averaging and summing, and representing ⟨·⟩ by summing over n, we have Σ_{i=0}^{p} a(i) Σ_n y(n − i) y(n − j) = 0 for j = 1, …, p. The required predictors are found by solving these equations.
  • 169. LINEAR PREDICTION (DERIVATION OF LP EQUATIONS): The orthogonality principle also states that the resulting minimum error is given by E = ⟨e²(n)⟩ = ⟨y(n) e(n)⟩, or Σ_{i=0}^{p} a(i) Σ_n y(n − i) y(n) = E. Minimizing the error over all time gives Σ_{i=0}^{p} a(i) r_{i−j} = 0 for j = 1, 2, …, p, and Σ_{i=0}^{p} a(i) r_i = E, where r_i = Σ_{n=−∞}^{∞} y(n) y(n − i).
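A minimal NumPy sketch of the autocorrelation method with the Levinson-Durbin recursion, using the a(0) = 1 sign convention of these slides (the order p = 10 and the random test frame are illustrative):

```python
import numpy as np

def autocorrelation(y, p):
    """r[i] = sum_n y(n) y(n-i), for i = 0..p."""
    N = len(y)
    return np.array([np.dot(y[: N - i], y[i:]) for i in range(p + 1)])

def levinson_durbin(r, p):
    """Solve the normal equations; returns a (with a[0] = 1) and error E."""
    a = np.zeros(p + 1)
    a[0] = 1.0
    E = r[0]
    for m in range(1, p + 1):
        k = -(r[m] + np.dot(a[1:m], r[m - 1:0:-1])) / E   # reflection coeff.
        prev = a.copy()
        for i in range(1, m):
            a[i] = prev[i] + k * prev[m - i]
        a[m] = k
        E *= (1.0 - k * k)                                # residual energy
    return a, E

frame = np.random.randn(400)          # stand-in for a windowed speech frame
a, E = levinson_durbin(autocorrelation(frame, 10), 10)
```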
  • 170. LINEAR PREDICTION (APPLICATIONS): Autocorrelation matching: we have a signal y(n) with known autocorrelation ryy(n). We model this with the AR system shown below, in which e(n) drives the filter H(z) = σ/A(z) = σ / (1 − Σ_{i=1}^{p} a_i z^{−i}) to produce y(n).
  • 171. LINEAR PREDICTION (ORDER OF LINEAR PREDICTION): The choice of predictor order depends on the analysis bandwidth. The rule of thumb is p = 2·BW/1000 + c. For a normal vocal tract, there is an average of about one formant per kilohertz of bandwidth. One formant requires two complex-conjugate poles; hence for every formant we require two predictor coefficients, i.e., two coefficients per kilohertz of bandwidth.
  • 172. LINEAR PREDICTION (AR MODELING OF SPEECH SIGNAL): True model (block diagram): a voiced branch, in which a DT impulse generator controlled by the pitch drives a glottal filter G(z), and an unvoiced branch, in which an uncorrelated-noise generator supplies the excitation; a voiced/unvoiced (V/U) switch and gain select the volume velocity U(n), which is passed through the vocal tract filter H(z) and the lip radiation filter R(z) to produce the speech signal s(n).
  • 173. LINEAR PREDICTION (AR MODELING OF SPEECH SIGNAL): Using LP analysis (block diagram): a DT impulse generator (voiced, with pitch estimate) or a white-noise generator (unvoiced), selected by the V/U switch and scaled by a gain, drives a single all-pole (AR) filter H(z) to produce the speech signal estimate s(n).