Visual-speech to text conversion applicable to telephone communication for deaf individuals
30th April 2013
In the lip-reading technique, speech can be understood by interpreting the movements of the lips, face, and tongue. However, the correspondence between visible movements and phonemes is not one-to-one: it is impossible to distinguish all phonemes using visual information alone.
INTRODUCTION
The Cued Speech system, developed by Cornett, contains two components: the hand shape and the hand position relative to the face. Hand shapes code the consonant phonemes, and hand positions code the vowel phonemes. Cued Speech improves speech perception to a large extent.
AIM OF NEW SYSTEM
To investigate the design of a system able to automatically recognize Cued Speech and convert it to text.
Such a system would make it possible for deaf or speech-impaired individuals to communicate with each other, and also with normal-hearing persons, using gestures captured by devices equipped with a camera.
METHODS
Corpus, feature extraction, and
statistical modeling
The data were derived from a video recording of the cuers pronouncing and coding in Cued Speech. The speakers' lips were painted blue, and landmarks of different colors were placed on the speakers' fingers, allowing a faster and more accurate image processing stage.
The audio part of the video recording was
synchronized with the image.
An automatic image processing method was applied to the video to extract the lip shape parameters: lip width (A), lip aperture (B), lip area (S), and the pinching of the upper (Bsup) and lower (Binf) lip.
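As a rough illustration of how such geometric lip parameters can be obtained from this kind of footage, the sketch below (Python with OpenCV, an assumption; the slides do not name the actual tool chain) thresholds the blue-painted lip region and derives width (A), aperture (B), and area (S) from the resulting contour. The HSV range and the helper name are illustrative; estimating Bsup and Binf would additionally require separating the upper and lower lip contours.

import cv2
import numpy as np

def lip_parameters(frame_bgr):
    # Isolate the blue-painted lips with an HSV colour threshold
    # (the range below is an assumed value, to be tuned on the actual footage).
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, np.array([100, 80, 40]), np.array([130, 255, 255]))

    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None                        # lips not found in this frame
    lips = max(contours, key=cv2.contourArea)

    x, y, w, h = cv2.boundingRect(lips)
    A = float(w)                           # lip width
    B = float(h)                           # lip aperture (outer height)
    S = float(cv2.contourArea(lips))       # lip area
    return A, B, S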
Concatenative feature fusion
The hand tracking stage extracts the xy coordinates of the finger landmarks at each time frame and uses those values as features in the HMM modeling. Concatenative feature fusion uses the concatenation of the synchronous lip shape and hand features as the joint feature vector given by
$O^{LH} = [\,O^{L\top}\; O^{H\top}\,]^{\top}$

where $O^{L}$ is the lip shape feature vector, $O^{H}$ is the hand feature vector, and $O^{LH}$ is the joint lip-hand feature vector. The dimensionality of the joint feature vector is $D^{LH} = D^{L} + D^{H}$.
(Figure: parameters used for lip shape modeling.)
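A minimal sketch of this concatenation step, assuming the two feature streams are already frame-synchronous and using illustrative dimensionalities (not taken from the study), is given below.

import numpy as np

def fuse_features(lip_feats, hand_feats):
    # lip_feats: (T, D_L) lip shape features; hand_feats: (T, D_H) hand features,
    # assumed synchronized frame by frame with the lip stream.
    assert lip_feats.shape[0] == hand_feats.shape[0], "streams must be frame-synchronous"
    return np.hstack([lip_feats, hand_feats])   # joint vectors of size D_L + D_H

# Example with assumed dimensionalities:
T, D_L, D_H = 120, 8, 10
joint = fuse_features(np.random.randn(T, D_L), np.random.randn(T, D_H))
print(joint.shape)   # (120, 18)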
RESULTS
Isolated word recognition
1. Recognition in the normal-hearing subject
2. Recognition in the deaf subject
3. Multi-speaker isolated word recognition: the aim was to investigate whether it is possible to train speaker-independent HMMs for Cued Speech recognition. The training data consisted of 750 words from the normal-hearing subject and 750 words from the deaf subject. For testing, 700 words from the normal-hearing subject and 700 words from the deaf subject were used.
Each state was modeled with a mixture of 4
Gaussian distributions.
For lip shape and hand shape integration,
concatenative feature fusion was used.
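To make the modeling step concrete, the sketch below trains one Gaussian-mixture HMM per word on the fused feature sequences and recognizes a test sequence by maximum likelihood. The hmmlearn library, the number of states, and the function names are assumptions; only the mixture of 4 Gaussians per state and the use of concatenative fusion come from the slides.

import numpy as np
from hmmlearn.hmm import GMMHMM   # assumed stand-in for an HTK-style toolkit

def train_word_models(training_data, n_states=5, n_mix=4):
    # training_data: dict mapping a word label to a list of joint feature
    # sequences, each of shape (T_i, D).  n_states is an assumed value;
    # the slides only state that each state used a mixture of 4 Gaussians.
    models = {}
    for word, sequences in training_data.items():
        X = np.vstack(sequences)
        lengths = [len(seq) for seq in sequences]
        hmm = GMMHMM(n_components=n_states, n_mix=n_mix,
                     covariance_type="diag", n_iter=20)
        hmm.fit(X, lengths)
        models[word] = hmm
    return models

def recognize(models, sequence):
    # Pick the word whose HMM assigns the highest log-likelihood to the sequence.
    return max(models, key=lambda w: models[w].score(sequence))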
4. Continuous phoneme recognition
Phoneme correct rate for continuous phoneme recognition in the case of the normal-hearing subject.
Phoneme correct rate for continuous phoneme recognition in the case of the deaf subject.
Hand shapes and lip shapes were integrated using concatenative feature fusion, and HMM-based automatic recognition was conducted. For continuous phoneme recognition, an 86% phoneme correct rate was achieved for the normal-hearing cuer and an 82.7% phoneme correct rate for the deaf cuer. Isolated word recognition experiments with both the normal-hearing and the deaf subject were also conducted, obtaining 94.9% and 89% accuracy, respectively.
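For readers unfamiliar with these metrics, "phoneme correct" and "accuracy" are usually computed from a minimum-edit-distance alignment between the reference and the recognized label strings: correct = (N - D - S)/N and accuracy = (N - D - S - I)/N, where N is the number of reference labels and D, S, I are deletions, substitutions, and insertions. The sketch below implements this standard HTK-style scoring; it is illustrative, not the scoring tool used in the study.

def phoneme_scores(ref, hyp):
    # ref, hyp: lists of phoneme labels (reference and recognizer output).
    # Returns (percent correct, percent accuracy) for this utterance.
    n, m = len(ref), len(hyp)
    # cost[i][j] = (edit distance, deletions, substitutions, insertions)
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, i, 0, 0)          # all deletions
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)          # all insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dele = cost[i - 1][j]
            subs = cost[i - 1][j - 1]
            inse = cost[i][j - 1]
            s = 0 if ref[i - 1] == hyp[j - 1] else 1
            cost[i][j] = min([
                (dele[0] + 1, dele[1] + 1, dele[2], dele[3]),
                (subs[0] + s, subs[1], subs[2] + s, subs[3]),
                (inse[0] + 1, inse[1], inse[2], inse[3] + 1),
            ])
    _, D, S, I = cost[n][m]
    N = float(n)
    return 100.0 * (N - D - S) / N, 100.0 * (N - D - S - I) / N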
CONCLUSION
A multi-speaker experiment using data from both the normal-hearing and the deaf subject showed an 89.6% word accuracy on average. This result indicates that training speaker-independent HMMs for Cued Speech using a large number of subjects should not face particular difficulties.
REFERENCES
G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior, "Recent advances in the automatic recognition of audiovisual speech," Proceedings of the IEEE, vol. 91, no. 9, pp. 1306–1326, 2003.
S. Nakamura, K. Kumatani, and S. Tamura, "Multi-modal temporal asynchronicity modeling by product HMMs for robust audio-visual speech recognition," in Proceedings of the Fourth IEEE International Conference on Multimodal Interfaces (ICMI'02), p. 305, 2002.
R. O. Cornett, "Cued speech," American Annals of the Deaf, vol. 112, pp. 3–13, 1967.
J. Leybaert, "Phonology acquired through the eyes and spelling in deaf children," Journal of Experimental Child Psychology, vol. 75, pp. 291–318, 2000.