Hearing by seeing: Can improving the visibility of the speaker's lips make you hear better? Najwa Alghamdi, MSc




Bio: Lecturer in the Information Technology Department, CCIS, KSU. SKERG member. PhD candidate at the University of Sheffield, member of the Computer Graphics and Virtual Reality and the Speech and Hearing research groups. Supervised by Dr. Steve Maddock, Prof. Guy Brown and Dr. Jon Barker. My research investigates methods for enhancing visual speech intelligibility* to support hard-of-hearing listeners (cochlear implant (CI) users in particular). Alghamdi, Najwa / Maddock, Steve / Brown, Guy J. / Barker, Jon (2015): "Investigating the impact of artificial enhancement of lip visibility on the intelligibility of spectrally-distorted speech", in FAAVSP-2015, 93-98.

* Speech intelligibility is a measure of how comprehensible speech is in given conditions.


Why CI users? 18% of Saudi children are hard of hearing. Cochlear implant surgeries are flourishing in Saudi Arabia, and the number of CI users is increasing.

Introduction: Cochlear implants help profoundly deaf people to become more aware of everyday sounds and to understand speech better when combined with lip-reading. The sound waveform is separated by band-pass filters into different frequency components. Users initially describe the sound characteristics as mechanical and synthetic.


Real Synthesized

Cochlear Implant (CI)

Introduction: Training after implantation. Auditory training consists of formal listening activities whose goal is to optimize speech perception. Auditory training helps the CI user make use of the new hearing. Typically, training uses audio-only speech stimuli. Recent studies suggest that using visual speech stimuli in the training may maximize the benefit of the training (Bernstein et al., 2013).


Audio-only / Audiovisual

Introduction: Enhancing training videos. Lander and Capek (2013) found that increasing and decreasing lip visibility, by applying lipstick and concealer respectively, affected speechreading performance for words and sentences compared to natural, unadorned lips.

Our idea is to artificially colour a speaker's lips in a video sequence to improve lip visibility.


Natural lips / with lipstick / with concealer

Aim of Research: Investigate whether or not artificially enhancing the appearance of a speaker's lips:
1. supports lip-reading, thus improving the intelligibility of visual speech;
2. improves auditory training.
Preliminary step: study non-native, normal-hearing listeners using a cochlear implant simulation. Why? Both CI users and non-native listeners deal with internal adverse conditions when listening to CI-processed speech: limited linguistic knowledge in non-native listeners (Bent et al., 2009) and a damaged inner ear in CI users. Non-native listeners may therefore help predict the performance of CI users.


Enhancement Method

Automatic tracking using Faceware Analyzer* → landmarks XML file → XML parser → smoothing landmarks using piecewise bicubic Bézier curves

Colour & luminance blending

Lip contour smoothing using an average filter

*http://facewaretech.com/products/software/analyzer/
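A minimal sketch of the landmark-smoothing idea above, assuming a simple chord-based placement of the inner Bézier control points; the function names and control-point rule are illustrative, not the pipeline's actual implementation.

```python
import numpy as np

def bezier_segment(p0, p1, p2, p3, n=20):
    """Evaluate a cubic Bezier segment at n parameter values in [0, 1]."""
    t = np.linspace(0.0, 1.0, n)[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

def smooth_lip_contour(landmarks, n_per_segment=20):
    """Join consecutive tracked landmarks with cubic Bezier segments.

    Inner control points are placed on the chord between neighbouring
    landmarks; this is a simple illustrative choice, not the study's rule.
    """
    pts = np.asarray(landmarks, dtype=float)
    segments = []
    for i in range(len(pts) - 1):
        p0, p3 = pts[i], pts[i + 1]
        p1 = p0 + (p3 - p0) / 3.0
        p2 = p0 + 2.0 * (p3 - p0) / 3.0
        segments.append(bezier_segment(p0, p1, p2, p3, n_per_segment))
    return np.vstack(segments)

# Hypothetical tracked points along the upper lip (pixel coordinates)
upper_lip = [(120, 200), (140, 190), (160, 191), (180, 202)]
contour = smooth_lip_contour(upper_lip)
print(contour.shape)  # (60, 2): a densely sampled, smoother contour
```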

Enhancement Method

Original / Simulated

Method: Subjects. 46 non-native Saudi listeners from King Saud University, Riyadh, Saudi Arabia. Minimum IELTS score = 5.5. Subjects were split into groups, as shown below.

Group                      Size   Pre-test   Training   Post-test
A (Audio-only)             13     A          A          A
V (Audiovisual)            19     A          V          A
E (Enhanced audiovisual)   14     A          E          A


Method: Stimuli. We used the Grid corpus*. Example: "bin blue at L 8 please".

Audio and video (facial) recordings of 1000 sentences from each of 34 talkers (18 male, 16 female). We used the audio and video recordings made by a single talker.

command: bin, lay, place, set
colour: blue, green, red, white
preposition: at, by, in, with
letter: 25 letters (no W)
digit: 10 digits
adverb: again, now, please, soon

*http://spandh.dcs.shef.ac.uk/gridcorpus/
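For illustration, a small sketch of the Grid sentence structure listed above; the keyword sets follow the table, and treating the 10 digits as 0-9 is an assumption about how they are realised.

```python
import random
import string

# Keyword slots of a Grid sentence (letters exclude W; digits assumed 0-9)
GRID = {
    "command":     ["bin", "lay", "place", "set"],
    "colour":      ["blue", "green", "red", "white"],
    "preposition": ["at", "by", "in", "with"],
    "letter":      [c for c in string.ascii_lowercase if c != "w"],  # 25 letters
    "digit":       [str(d) for d in range(10)],                      # 10 digits
    "adverb":      ["again", "now", "please", "soon"],
}

ORDER = ["command", "colour", "preposition", "letter", "digit", "adverb"]

def random_grid_sentence():
    """Draw one keyword per slot, e.g. 'bin blue at l 8 please'."""
    return " ".join(random.choice(GRID[slot]) for slot in ORDER)

print(random_grid_sentence())
```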


Method: Stimuli. The Grid videos are processed to produce the different stimuli.

The subjects need to identify the colour, letter and digit keywords of a Grid stimulus in all training and testing sessions.

A: Grid audio stimuli of a single speaker are spectrally distorted (vocoded) to simulate CI-processed speech (Tabri et al., 2011).
V: The audio tracks of the Grid videos are replaced with the spectrally distorted audio-only stimuli.
E: The speaker's lips in the audiovisual stimuli are automatically tracked and artificially coloured.


Results: three sets
1. The impact of using E speech in auditory training (training gain = post-test - pre-test).
2. A comparison of the intelligibility of A, V and E speech (training scores can be used to provide a subjective intelligibility assessment).
3. Letter confusion matrices from the post-test, to understand the possible sources of confusion when identifying letters during the audio-only post-test.


1. The Impact of Using Enhanced Audiovisual Speech in Auditory Training

                         A      V      E      ANOVA       Post-hoc
Pre-test mean scores     14%    14%    13%
Post-test mean scores    46%    54%    71%    p = 0.04    p = 0.037
Training gain            32%    40%    58%    p = 0.01    p = 0.009
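For example, the training gain follows directly from the rows above: 46% - 14% = 32% for A, 54% - 14% = 40% for V, and 71% - 13% = 58% for E.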


2. A comparison of the intelligibility of A, V and E speech
Speech intelligibility of X speech = (number of correctly identified keywords / 60) × 100, where X = {A, V or E}
ANOVA: p = 0.008; post-hoc: p = 0.006, between A & E


3. Letter Confusion Matrices from post-test results
Letter identification was the most challenging task: 25 letters to choose from (no W). Due to the vocoding process, some letters sound similar: (P, B), (G, T), (M, N) and the vowels.

Confusion matrix axes: letter presented in the test vs. user's response.
Clusters: 1 = diphthongs; 2 = contains plosive sounds; 3 = contains nasal sounds; 4 = contains fricative sounds; 5 = contains a lateral approximant sound.
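As an illustration of how such a matrix is assembled (not the study's actual analysis code), a minimal sketch that counts (presented letter, response) pairs over the 25-letter set; the example trials are made up:

```python
from collections import Counter

LETTERS = [c for c in "abcdefghijklmnopqrstuvxyz"]  # the 25 response options (no W)

def confusion_counts(trials):
    """trials: iterable of (presented, response) pairs -> Counter of cell counts."""
    return Counter((p, r) for p, r in trials if p in LETTERS and r in LETTERS)

def letter_accuracy(trials):
    """Proportion of trials where the response matches the presented letter."""
    trials = list(trials)
    return sum(p == r for p, r in trials) / len(trials)

# Made-up example trials illustrating typical vocoder confusions (P/B, G/T, M/N)
trials = [("p", "b"), ("p", "p"), ("b", "p"), ("g", "t"), ("m", "n"), ("a", "a")]
print(confusion_counts(trials))
print(f"letter identification: {letter_accuracy(trials):.0%}")
```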


3. Letter Confusion Matrices from post-test results
The analysis of the letter confusion matrices for the audio-only post-test shows that E subjects were better at letter and diphthong identification:

                        E     V     A
Letter identification   75%   65%   55%
Vowel identification    81%   66%   52%

E V A

3. Letter Confusion Matrices from post-test results
The visual signal might impede learning the discrimination of visually similar sounds such as P & B.


E V A


Conclusions
Audio-only post-training tests suggest that the enhanced visual signal improves participants' training gain. Intelligibility of spectrally-distorted speech is improved when a corresponding enhanced visual signal is introduced.
Next steps: expand the study; run a similar experiment with a group of CI and hearing-aid users.


Current Experiment in SKERG
Evaluation study of a new enhancement method that exaggerates the speaking style of the speaker in the video.

Normal / Exaggerated / Exaggerated with lipstick

[email protected]

www.najwa-alghamdi.net


This research has been supported by the Saudi Ministry of Education, King Saud University and Faceware Technologies Inc.


References
T. Bent, A. Buchwald, and D. B. Pisoni, "Perceptual adaptation and intelligibility of multiple talkers for two types of degraded speech," The Journal of the Acoustical Society of America, vol. 126, no. 5, pp. 2660-2669, 2009.
L. E. Bernstein, E. T. Auer Jr, S. P. Eberhardt, and J. Jiang, "Auditory perceptual learning for speech perception can be enhanced by audiovisual training," Frontiers in Neuroscience, vol. 7, 2013.
M. F. Dorman and P. C. Loizou, "The identification of consonants and vowels by cochlear implant patients using a 6-channel continuous interleaved sampling processor and by normal-hearing subjects using simulations of processors with two to nine channels," Ear and Hearing, vol. 19, no. 2, pp. 162-166, 1998.
K. Lander and C. Capek, "Investigating the impact of lip visibility and talking style on speechreading performance," Speech Communication, vol. 55, no. 5, pp. 600-605, 2013.
D. Tabri, K. M. S. A. Chacra, and T. Pring, "Speech perception in noise by monolingual, bilingual and trilingual listeners," International Journal of Language & Communication Disorders, vol. 46, no. 4, pp. 411-422, 2011.


Colouring the lips / Smoothing the lip contours

Smoothing uses Bézier curves of the form P(t) = sum over i of B_i^n(t) p_i, where the p_i are control points and B_i^n(t) is a Bernstein polynomial given by B_i^n(t) = C(n, i) t^i (1 - t)^(n - i).

Luminance Blending
Luminance blending was also used to improve the colour blending under different lighting conditions. This was done by applying the blending in luma/chroma (YCbCr) space and then converting the result back to RGB space.
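A minimal sketch of this kind of luma-preserving colour blend, assuming the standard ITU-R BT.601 YCbCr conversion; the blending weight and the exact constants used in the study are not given here, so they are illustrative assumptions.

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """rgb: float array in [0, 255], shape (..., 3) -> YCbCr (BT.601)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return np.stack([y, cb, cr], axis=-1)

def ycbcr_to_rgb(ycbcr):
    """Inverse BT.601 conversion back to RGB."""
    y, cb, cr = ycbcr[..., 0], ycbcr[..., 1], ycbcr[..., 2]
    r = y + 1.402 * (cr - 128.0)
    g = y - 0.344136 * (cb - 128.0) - 0.714136 * (cr - 128.0)
    b = y + 1.772 * (cb - 128.0)
    return np.stack([r, g, b], axis=-1)

def blend_lip_colour(pixel_rgb, lip_rgb, alpha=0.6):
    """Blend the chroma towards the target lip colour while keeping the
    original luma, so the colouring adapts to the local lighting."""
    px = rgb_to_ycbcr(np.asarray(pixel_rgb, dtype=float))
    lip = rgb_to_ycbcr(np.asarray(lip_rgb, dtype=float))
    out = px.copy()
    out[..., 1:] = (1 - alpha) * px[..., 1:] + alpha * lip[..., 1:]  # blend Cb, Cr only
    return np.clip(ycbcr_to_rgb(out), 0, 255)

print(blend_lip_colour([180, 120, 110], [170, 30, 60]))  # skin pixel pushed towards red
```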

CI simulation
The Grid audio was spectrally distorted using an eight-channel sine-wave vocoder (AngelSim*). Normal-hearing listeners can perform comparably to CI users when hearing exactly 8 channels (Dorman et al., 1998). The noise fluctuation of a noise vocoder is not present in a real CI (Bent et al., 2009), so we used a sine-wave vocoder.
The vocoding process:
1. The signal is divided into 8 channels by band-pass filters [200 to 7,000 Hz] (slope = 24 dB/octave);
2. Each channel is then low-pass filtered at 160 Hz (slope = 24 dB/octave) to obtain its envelope;
3. The envelope of each channel modulates a sine wave that replaces the channel's signal.

*http://www.tigerspeech.com
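The steps above can be sketched roughly as follows; AngelSim's exact filter design, channel spacing and filter orders are not reproduced, so the Butterworth filters and log-spaced band edges below are assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def sine_vocoder(signal, fs, n_channels=8, f_lo=200.0, f_hi=7000.0, env_cut=160.0):
    """Spectrally distort a signal with an n-channel sine-wave vocoder."""
    edges = np.logspace(np.log10(f_lo), np.log10(f_hi), n_channels + 1)
    t = np.arange(len(signal)) / fs
    env_sos = butter(4, env_cut, btype="low", fs=fs, output="sos")
    out = np.zeros(len(signal))
    for lo, hi in zip(edges[:-1], edges[1:]):
        band_sos = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
        band = sosfiltfilt(band_sos, signal)                 # 1) analysis band
        env = sosfiltfilt(env_sos, np.abs(band))             # 2) amplitude envelope
        carrier = np.sin(2 * np.pi * np.sqrt(lo * hi) * t)   # 3) sine at the band centre
        out += np.maximum(env, 0.0) * carrier                # envelope modulates the carrier
    return out

# Example with one second of synthetic input at 16 kHz
fs = 16000
vocoded = sine_vocoder(np.random.randn(fs), fs)
```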

Hearing
Hearing enables us to socialize, work, interact, communicate and even relax. Good hearing also helps to keep us safe, warning us of potential danger or alerting us to someone else's distress. Our hearing provides us with an enormous source of information.


The cochlea (the inner ear) is responsible for pitch perception. It is composed of thousands of hair cells that are arranged in pitch order. The vibration of sound causes the cochlear fluid to move, which stimulates the hair cells. The brain then perceives sounds via electrical pulses sent by the stimulated hair cells.


Hearing loss
18% of Saudi children are hearing impaired*.

* 2010, King Abdullah Center for Cochlear Implant.

Hearing loss
Solution: cochlear implant.


Healthy cochlea


Cochlear implant

Sensorineural hearing-loss conditions