Automatic speech recognition
 What is the task?
 What are the main difficulties?
 How is it approached?
 How good is it?
 How much better could it be?
What is the task?
 Getting a computer to understand spoken language
 By “understand” we might mean
 React appropriately
 Convert the input speech into another medium, e.g. text
 Several variables impinge on this
How do humans do it?
 Articulation produces sound
waves which the ear conveys
to the brain for processing
How might computers do it?
 Digitization
 Acoustic analysis of the speech
signal
 Linguistic interpretation
[Figure: acoustic waveform → speech recognition → acoustic signal]
Basic Block Diagram
[Block diagram: speech recognition in MATLAB on the PC → parallel port control (pins P2 to P9) → ATmega32 microcontroller → LCD display]
What’s hard about that?
 Digitization
 Converting analogue signal into digital representation
 Signal processing
 Separating speech from background noise
 Phonetics
 Variability in human speech
 Phonology
 Recognizing individual sound distinctions (similar phonemes)
 Lexicology and syntax
 Disambiguating homophones
 Features of continuous speech
 Syntax and pragmatics
 Interpreting prosodic features
 Pragmatics
 Filtering of performance errors (disfluencies)
Digitization
 Analogue to digital conversion
 Sampling and quantizing
 Use filters to measure energy levels for various
points on the frequency spectrum
 Knowing the relative importance of different
frequency bands (for speech) makes this process
more efficient
 E.g. high frequency sounds are less informative, so
can be sampled using a broader bandwidth (log
scale)
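To make the sampling and quantizing step concrete, here is a minimal sketch in Python/NumPy (the implementation described later in these slides is in MATLAB). The `analogue_fn` helper and the 16 kHz / 16-bit choices are illustrative assumptions, not values taken from this system.

```python
import numpy as np

def digitize(analogue_fn, duration_s, fs=16000, bits=16):
    """Toy A/D conversion: sample a 'continuous' signal and quantize it.

    analogue_fn maps time in seconds to an amplitude in [-1, 1]; a real
    system samples a microphone signal instead (hypothetical helper,
    for illustration only).
    """
    t = np.arange(0, duration_s, 1.0 / fs)               # sampling instants
    samples = analogue_fn(t)                              # "analogue" amplitudes
    levels = 2 ** (bits - 1) - 1
    return np.round(samples * levels).astype(np.int16)    # quantized PCM values

# Example: half a second of a 440 Hz tone, sampled at 16 kHz with 16-bit quantization.
pcm = digitize(lambda t: np.sin(2 * np.pi * 440 * t), duration_s=0.5)
```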
Separating speech from background noise
 Noise cancelling microphones
 Two mics, one facing speaker, the other facing away
 Ambient noise is roughly same for both mics
 Knowing which bits of the signal relate to speech
 Spectrograph analysis
Variability in individuals’ speech
 Variation among speakers due to
 Vocal range (f0, and pitch range – see later)
 Voice quality (growl, whisper, physiological elements
such as nasality, adenoidality, etc)
 ACCENT !!! (especially vowel systems, but also
consonants, allophones, etc.)
 Variation within speakers due to
 Health, emotional state
 Ambient conditions
 Speech style: formal read vs spontaneous
Speaker-(in)dependent systems
 Speaker-dependent systems
 Require “training” to “teach” the system your individual idiosyncrasies
 The more the merrier, but typically nowadays 5 or 10 minutes is enough
 User asked to pronounce some key words which allow computer to infer
details of the user’s accent and voice
 Fortunately, languages are generally systematic
 More robust
 But less convenient
 And obviously less portable
 Speaker-independent systems
 Language coverage is reduced to compensate for the need to be flexible in phoneme identification
 Clever compromise is to learn on the fly
(Dis)continuous speech
 Discontinuous speech much easier to recognize
 Single words tend to be pronounced more clearly
 Continuous speech involves contextual coarticulation
effects
 Weak forms
 Assimilation
 Contractions
Performance errors
 Performance “errors” include
 Non-speech sounds
 Hesitations
 False starts, repetitions
 Filtering implies handling at syntactic level or above
 Some disfluencies are deliberate and have pragmatic
effect – this is not something we can handle in the
near future
Approaches to ASR
 Template based
 Neural network based
 Statistics based
Template-based approach
 Store examples of units (words, phonemes), then find
the example that most closely fits the input
 Extract features from speech signal, then it’s “just” a
complex similarity matching problem, using solutions
developed for all sorts of applications
 OK for discrete utterances, and a single user
Template-based approach
 Hard to distinguish very similar templates
 And quickly degrades when input differs from
templates
 Therefore needs techniques to mitigate this
degradation:
 More subtle matching techniques
 Multiple templates which are aggregated
 Taken together, these suggested …
Neural Network based approach
Statistics-based approach
 Collect a large corpus of transcribed speech recordings
 Train the computer to learn the correspondences
(“machine learning”)
 At run time, apply statistical processes to search
through the space of all possible solutions, and pick
the statistically most likely one
Statistics-based approach
 Acoustic and Lexical Models
 Analyse training data in terms of relevant features
 Learn from large amount of data different possibilities
 different phone sequences for a given word
 different combinations of elements of the speech signal for a
given phone/phoneme
 Combine these into a Hidden Markov Model expressing
the probabilities
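One common way of "picking the statistically most likely" answer from an HMM is Viterbi decoding. Below is a minimal sketch in Python/NumPy; the array names and shapes are assumptions made for the example, and this is not part of the template-matching system described later in these slides.

```python
import numpy as np

def viterbi(obs_loglik, log_trans, log_init):
    """Most likely HMM state sequence for one utterance (illustrative sketch).

    obs_loglik[t, s] : log-likelihood of frame t under state s (acoustic model)
    log_trans[s, s2] : log probability of moving from state s to state s2
    log_init[s]      : log probability of starting in state s
    """
    T, S = obs_loglik.shape
    delta = np.full((T, S), -np.inf)      # best log-score ending in each state
    psi = np.zeros((T, S), dtype=int)     # back-pointers to the best predecessor
    delta[0] = log_init + obs_loglik[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans    # S x S candidate scores
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + obs_loglik[t]
    # Trace back the best path from the best final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]
```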
HMMs for some words
 Identify individual phonemes
 Identify words
 Identify sentence structure and/or meaning
SPEECH RECOGNITION BLOCK DIAGRAM
BLOCK DIAGRAM DESCRIPTION
Speech Acquisition Unit
•It consists of a microphone to obtain the analog speech signal
•The acquisition unit also consists of an analog to digital converter
Speech Recognition Unit
•This unit is used to recognize the words contained in the input speech
signal.
•The speech recognition is implemented in MATLAB with the help of a
template matching algorithm
Device Control Unit
•This unit consists of a microcontroller, the ATmega32, to control the
various appliances
•The microcontroller is connected to the PC via the PC parallel port
•The microcontroller then reads the input word and controls the device
connected to it accordingly.
SPEECH RECOGNITION
[Pipeline: digitized speech X(n) → End Point Detection → XF(n) → Feature Extraction → MFCC → Dynamic Time Warping → recognized word]
END-POINT DETECTION
• The accurate detection of a word's start and end points means that
subsequent processing of the data can be kept to a minimum by
processing only the parts of the input corresponding to speech.
•We will use the endpoint detection algorithm proposed by Rabiner and
Sambur. This algorithm is based on two simple time-domain
measurements of the signal - the energy and the zero crossing rate.
The algorithm should tackle the following cases:
1. Words which begin with or end with a low energy phoneme
2. Words which end with a nasal
3. Speakers ending words with a trailing off in intensity or short breath
Steps for EPD
•Removal of noise by subtracting an estimate of the background noise from the signal
•Word extraction using three thresholds (a sketch follows this list):
1. ITU [Upper energy threshold]
2. ITL [Lower energy threshold]
3. IZCT [Zero crossing rate threshold]
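A minimal sketch of this energy/zero-crossing scheme in Python/NumPy (the actual implementation is in MATLAB). The noise-estimation window and the exact ITU/ITL/IZCT formulas below are illustrative assumptions rather than the Rabiner–Sambur constants.

```python
import numpy as np

def endpoint_detect(x, fs, frame_ms=10, noise_ms=100):
    """Rough endpoint detection from short-time energy and zero-crossing rate."""
    n = int(fs * frame_ms / 1000)                                # samples per frame
    frames = [x[i:i + n] for i in range(0, len(x) - n + 1, n)]
    energy = np.array([np.sum(np.asarray(f, float) ** 2) for f in frames])
    zcr = np.array([np.mean(np.abs(np.diff(np.sign(f)))) / 2 for f in frames])

    noise = slice(0, max(1, int(noise_ms / frame_ms)))           # leading noise-only frames
    itl = energy[noise].mean() + 2 * energy[noise].std()         # lower energy threshold (ITL)
    itu = 5 * itl                                                # upper energy threshold (ITU)
    izct = zcr[noise].mean() + 2 * zcr[noise].std()              # zero-crossing threshold (IZCT)

    above = np.where(energy > itu)[0]
    if len(above) == 0:
        return None                                              # no word found
    start, end = above[0], above[-1]
    # Move the endpoints outwards while the energy stays above ITL or the
    # zero-crossing rate still suggests (unvoiced) speech.
    while start > 0 and (energy[start - 1] > itl or zcr[start - 1] > izct):
        start -= 1
    while end < len(frames) - 1 and (energy[end + 1] > itl or zcr[end + 1] > izct):
        end += 1
    return start * n, (end + 1) * n                              # sample indices of the word
```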
Feature Extraction
 Input data to the algorithm is usually too large to be
processed
 Input data is highly redundant
 Raw analysis requires high computational powers and
large amounts of memory
 Thus, redundancies are removed and the data is transformed into a
reduced set of features
 DCT-based Mel Cepstrum
DCT Based MFCC
• Take the Fourier transform of a signal.
• Map the powers of the spectrum obtained above onto
the mel scale, using triangular overlapping windows.
• Take the logs of the powers at each of the mel
frequencies.
• Take the discrete cosine transform of the list of mel log
powers, as if it were a signal.
• The MFCCs are the amplitudes of the resulting
spectrum.
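The steps above translate fairly directly into code. A minimal Python/NumPy sketch follows (the project's implementation is in MATLAB); the FFT size, the number of filters, and the use of SciPy's DCT are assumptions made for the example.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, fs, n_filters=26, n_coeffs=13, n_fft=512):
    """MFCCs of one windowed frame: FFT -> mel filter bank -> log -> DCT."""
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2               # power spectrum

    # Triangular overlapping filters, equally spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[i, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)

    log_mel = np.log(fbank @ power + 1e-10)                      # log mel-band powers
    return dct(log_mel, type=2, norm='ortho')[:n_coeffs]         # first 13 coefficients
```

The final slice keeps the first 13 coefficients, matching the choice noted on the next slide.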
MFCC Computation
 Since the log magnitude spectrum is real and symmetric, the IDFT reduces to a DCT. The DCT produces highly uncorrelated features yt(m)(k). The zero-order MFCC coefficient yt(0)(k) is approximately equal to the log energy of the frame.
 The number of MFCC coefficients chosen was 13.
Feature extraction by MFCC processing
Dynamic Time Warping and Minimum
Distance Paths measurement
 Isolated word recognition:
• Task :
• Want to build an isolated word recogniser
• Method:
1. Record, parameterise and store vocabulary of reference words.
2. Record test word to be recognised and parameterize.
3. Measure distance between test word and each reference word.
4. Choose reference word ‘closest’ to test word.
Words are parameterised on a frame-by-frame basis
Choose a frame length over which speech remains reasonably stationary
Overlap frames, e.g. 40 ms frames with a 10 ms frame shift
We want to compare frames of test and reference words, i.e. calculate distances between them
[Figure: overlapping analysis frames]
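A frame-by-frame parameterisation of this kind can be sketched as follows (Python/NumPy for illustration; frame and shift lengths as in the example above).

```python
import numpy as np

def frame_signal(x, fs, frame_ms=40, shift_ms=10):
    """Split a signal into overlapping analysis frames, one row per frame."""
    x = np.asarray(x, dtype=float)
    n = int(fs * frame_ms / 1000)         # frame length in samples
    step = int(fs * shift_ms / 1000)      # frame shift in samples
    starts = np.arange(0, len(x) - n + 1, step)
    return np.stack([x[i:i + n] for i in starts])
```

Each row would then be windowed and reduced to an MFCC vector (as in the earlier sketch) before distances are computed.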
Calculating Distances
• Hard: the number of frames won't always correspond
• Easy: sum the differences between corresponding frames
• Solution 1: Linear Time Warping
Stretch shorter sound
• Problem?
Some sounds stretch more than others
• Solution 2:
Dynamic Time Warping (DTW)
[Example frame sequences: Test = 5 3 9 7 3, Reference = 4 7 4]
Using a dynamic alignment, make the most similar frames correspond
Find distances between the two utterances using these corresponding frames
Dynamic Programming
[Figure: waveforms of the word “Open” uttered at two different instants; the signals are not time aligned]
DTW Process
Place the distance between frame r of Test and frame c of Reference in cell (r, c) of the distance matrix. For the example sequences above (entries are absolute differences between frame values):

                Reference
                4   7   4
Test 1 (5):     1   2   1
Test 2 (3):     1   4   1
Test 3 (9):     5   2   5
Test 4 (7):     3   0   3
Test 5 (3):     1   4   1
Constraints
 Global
 Endpoint detection
 Path should be close to diagonal
 Local
 Must always travel upwards or eastwards
 No jumps
 Slope weighting
 Consecutive moves upwards/eastwards
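A minimal DTW sketch in Python/NumPy for the isolated-word recogniser described above (the real implementation is in MATLAB). It applies only the basic local constraint of monotonic up/right/diagonal moves; the global path constraint and slope weighting listed above are omitted for brevity.

```python
import numpy as np

def dtw_distance(test, ref):
    """Total cost of the best alignment between two sequences of feature frames."""
    n, m = len(test), len(ref)
    d = np.full((n + 1, m + 1), np.inf)
    d[0, 0] = 0.0
    for r in range(1, n + 1):
        for c in range(1, m + 1):
            cost = np.linalg.norm(np.asarray(test[r - 1], float) - np.asarray(ref[c - 1], float))
            d[r, c] = cost + min(d[r - 1, c],        # step up
                                 d[r, c - 1],        # step right
                                 d[r - 1, c - 1])    # diagonal step
    return d[n, m]

def recognise(test_frames, references):
    """Isolated word recognition: pick the reference word 'closest' to the test word.

    references maps each word label to its sequence of reference feature frames."""
    return min(references, key=lambda w: dtw_distance(test_frames, references[w]))
```

For the toy sequences above (Test = 5 3 9 7 3, Reference = 4 7 4) this formulation gives a total alignment cost of 5 (1 + 1 + 2 + 0 + 1 along the best path).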
Empirical Results : Known Speaker
SONY SUVARNA GEMINI HBO CNN NDTV IMAGINE ZEE CINEMA
SONY 9 0 1 0 0 0 0 0
SUVARNA 0 10 0 0 0 0 0 0
GEMINI 0 0 8 0 0 0 2 0
HBO 0 0 0 10 0 0 0 0
CNN 0 0 0 0 8 0 2 0
NDTV 0 0 0 0 0 10 0 0
IMAGINE 0 0 0 0 0 0 10 0
ZEE CINEMA 0 0 0 0 0 0 1 9
Empirical Results : Unknown Speaker
SONY SUVARNA GEMINI HBO CNN NDTV IMAGINE ZEE CINEMA
SONY 8 0 1 0 0 0 1 0
SUVARNA 0 8 0 0 0 0 0 2
GEMINI 1 0 8 0 0 0 1 0
HBO 0 0 0 10 0 0 0 0
CNN 1 0 0 0 8 0 2 0
NDTV 0 0 0 0 0 10 0 0
IMAGINE 0 0 0 0 0 0 10 0
ZEE CINEMA 0 2 0 0 0 0 0 8
Applications
 Medical Transcription
 Military
 Telephony and other domains
 Serving the disabled
Further Applications
• Home automation
• Automobile audio systems
• Telematics
Where from here?
Evolution of ASR (1975 → 1985 → 1995 → 2015)
 NOISE ENVIRONMENT: quiet room, fixed high-quality mic → normal office, various microphones, telephone → vehicle noise, radio, cell phones → wherever speech occurs
 SPEECH STYLE: careful reading → planned speech → natural human-machine dialog (user can adapt) → all styles including human-human (unaware)
 USER POPULATION: speaker-dependent → speaker independent and adaptive → regional accents, native speakers, competent foreign speakers → all speakers of the language including foreign
 COMPLEXITY: application-specific speech and language → expert years to create an application-specific language model → some application-specific data and one engineer year → application independent or adaptive
Concluding remarks
[Summary of the front end: recorded speech (noise padded) → gain adjustment → DC offset elimination → spectral subtraction → end point detection]
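As a hedged sketch of the pre-processing chain summarised above, here is a frame-wise version in Python/NumPy (illustrative only; the frame length, the number of leading noise frames, and peak normalisation as the "gain adjustment" are assumptions, not details taken from these slides).

```python
import numpy as np

def preprocess(x, fs, frame_ms=32, noise_frames=10):
    """DC offset elimination, gain adjustment, and magnitude spectral subtraction."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                                      # DC offset elimination
    x = x / (np.max(np.abs(x)) + 1e-12)                   # gain adjustment (peak normalisation)

    n = int(fs * frame_ms / 1000)
    frames = [x[i:i + n] for i in range(0, len(x) - n + 1, n)]
    specs = [np.fft.rfft(f) for f in frames]
    noise_mag = np.mean([np.abs(s) for s in specs[:noise_frames]], axis=0)

    cleaned = []
    for s in specs:
        mag = np.maximum(np.abs(s) - noise_mag, 0.0)      # spectral subtraction
        cleaned.append(np.fft.irfft(mag * np.exp(1j * np.angle(s)), n))
    return np.concatenate(cleaned)                        # ready for end point detection
```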