Clinical Applications of Speech Technology

Phil Green
Speech and Hearing Research Group
Dept of Computer Science
University of Sheffield
[email protected]


Talk Overview
- SPandH: Speech and Hearing @ Sheffield
- The CAST group
- Building Automatic Speech Recognisers: conventional methodology
- ASR for clients with speech disorders
- Kinematic Maps
- Voice-driven Environmental Control
- VIVOCA
- Customising Voices
- Future Directions


SPandH
- Phonetics & Linguistics
- Hearing & Acoustics
- Electrical Engineering & Signal Processing
- Speech & Language Therapy


- Prof Mark Hawley, School of Health and Related Research: Assistive Technology
- Prof Pam Enderby, Institute of General Practice and Primary Care, University of Sheffield: Speech Therapy
- Prof Phil Green and Prof Roger K Moore, Speech and Hearing Research Group, Department of Computer Science, University of Sheffield: Speech Technology
- Dr Stuart Cunningham, Department of Human Communication Sciences, University of Sheffield: Speech Perception, Speech Technology
- Contact: [email protected]


Conventional Automatic Speech Recogniser Construction
- The standard technique uses generative statistical models.
- Each state is characterised by a Gaussian mixture distribution over the components of the acoustic vector x.
- The parameters of the distributions are estimated in training (EM, Baum-Welch).
- All this is the acoustic model. There will also be a language model.
- Decoding finds the model and state sequence most likely to generate the observation sequence X.
- Training is based on a large pre-recorded speaker-independent speech corpus.
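The decoding step on this slide can be illustrated with a small sketch. It assumes diagonal-covariance Gaussian mixtures and one model per word; the function names, the data layout and the toy numbers are illustrative assumptions, not taken from the talk or from any particular toolkit.

```python
import math

NEG_INF = float("-inf")

def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at acoustic vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def log_gmm(x, components):
    """Log density of a mixture; components = [(weight, mean, var), ...]."""
    terms = [math.log(w) + log_gaussian(x, m, v) for w, m, v in components]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))  # log-sum-exp

def viterbi(frames, log_init, log_trans, state_gmms):
    """Return (best log score, best state sequence) for one model."""
    n = len(state_gmms)
    score = [log_init[s] + log_gmm(frames[0], state_gmms[s]) for s in range(n)]
    backptrs = []
    for x in frames[1:]:
        new_score, ptr = [], []
        for s in range(n):
            # Best predecessor of state s at this frame.
            prev = max(range(n), key=lambda p: score[p] + log_trans[p][s])
            ptr.append(prev)
            new_score.append(
                score[prev] + log_trans[prev][s] + log_gmm(x, state_gmms[s])
            )
        score = new_score
        backptrs.append(ptr)
    last = max(range(n), key=lambda s: score[s])
    path = [last]
    for ptr in reversed(backptrs):  # trace the back-pointers to the first frame
        path.append(ptr[path[-1]])
    path.reverse()
    return score[last], path

def recognise(frames, word_models):
    """Pick the word whose model gives the highest Viterbi score."""
    return max(word_models, key=lambda w: viterbi(frames, *word_models[w])[0])

# Toy two-state left-to-right model over 1-D "acoustic vectors" (made-up values).
state_gmms = [
    [(1.0, [0.0], [1.0])],
    [(0.5, [4.0], [1.0]), (0.5, [6.0], [1.0])],
]
log_init = [0.0, NEG_INF]
log_trans = [[math.log(0.5), math.log(0.5)], [NEG_INF, 0.0]]
frames = [[0.1], [-0.2], [4.2], [5.8]]
score, path = viterbi(frames, log_init, log_trans, state_gmms)
print(path)
```

A real recogniser would add a language model and would train the mixture parameters with Baum-Welch rather than setting them by hand.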


Dysarthria
- Loss of control of the speech articulators
- Stroke victims, cerebral palsy, MS, ...
- Affects 170 per 100,000 population
- Severe cases are unintelligible to strangers
- Often accompanied by physical disability
- Example words: channel, lamp, radio


STARDUST: ASR for Dysarthric Speakers
- NHS NEAT funding
- Environmental control
- Small vocabulary, isolated words
- Speaker-dependent
- Sparse training data
- Variable training data


STARDUST Methodology
- Initial recordings


STARDUST training results
- ECS trial: halved the average time to execute a command


STARDUST Consistency Training


STARDUST Clinical Trial


OPTACIA: Kinematic Maps
- Pronunciation training aid
- EC funding
- Speech acoustics mapped to an x,y position in the map window in real time
- Mapping by a trained neural net
- Customise for exercises and clients
- [Diagram: Speech -> Signal Processing -> ANN Mapping]
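As a rough illustration of the mapping step, the sketch below passes one frame of acoustic features through a small feed-forward network and scales the two outputs to a pixel position in the map window. The network shape, the weights, the feature values and the window size are all hypothetical; the slides do not describe the actual OPTACIA network.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases):
    """One fully connected layer with sigmoid activations."""
    return [
        sigmoid(sum(w * i for w, i in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def map_frame(features, net, width, height):
    """Map one acoustic feature vector to an (x, y) pixel in the map window."""
    hidden = dense(features, net["w_hidden"], net["b_hidden"])
    out_x, out_y = dense(hidden, net["w_out"], net["b_out"])
    # Sigmoid outputs lie in (0, 1), so the point always falls inside the window.
    return round(out_x * (width - 1)), round(out_y * (height - 1))

# Hypothetical weights: 3 input features, 2 hidden units, 2 outputs (x and y).
toy_net = {
    "w_hidden": [[0.8, -0.4, 0.1], [-0.3, 0.9, 0.5]],
    "b_hidden": [0.0, -0.2],
    "w_out": [[1.5, -1.0], [-0.7, 2.0]],
    "b_out": [0.1, -0.1],
}
x, y = map_frame([0.2, 1.1, -0.5], toy_net, width=401, height=301)
print(x, y)
```

In a trained system the weights would be learned so that target sounds land in chosen regions of the map, and the function would be called once per frame to move the cursor in real time.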


Example: Vowel Map


SPECS: Speech-Driven Environmental Control Systems
- NHS HTD funding
- Industrial exploitation
- STARDUST on balloon board


VIVOCA: Voice Input Voice Output Communication Aid
- NHS NEAT funding
- Assists communication with strangers
- Client: buy tea [unintelligible]
- VIVOCA: A cup of tea with milk and no sugar please [intelligible synthesised speech]
- Runs on a PDA
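The step from recognised words to an output message can be pictured as a lookup from a short keyword sequence to a full sentence, which is then sent to the synthesiser. The sketch below assumes a simple phrase table; the table, the function name and the fallback behaviour are illustrative assumptions, and only the single example phrase comes from the slide.

```python
# Hypothetical phrase table: recognised keyword sequence -> message to synthesise.
PHRASES = {
    ("buy", "tea"): "A cup of tea with milk and no sugar please",
}

def build_message(recognised_words, phrases=PHRASES):
    """Return the message for a recognised word sequence, or None if unknown."""
    key = tuple(word.lower() for word in recognised_words)
    return phrases.get(key)

message = build_message(["buy", "tea"])
if message is None:
    print("No matching phrase; the device could ask the client to repeat.")
else:
    print(message)  # this text would be passed to the speech synthesiser
```

Keeping the input vocabulary small is what makes recognition of severely dysarthric speech tractable, while the output sentence can be as long as needed.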


Voices for VIVOCA
- It is possible to build voices from training data
- A local voice is preferable
- Yorkshire voices: Ian McMillan, Christa Ackroyd


Concatenative synthesis
- Input data: speech recordings -> unit segmentation -> unit database
- Text input -> unit selection -> concatenation + smoothing -> synthesised speech
- Festvox: http://festvox.org/
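The unit selection box in this pipeline is commonly framed as a search for the sequence of recorded units that joins most smoothly. The sketch below is a minimal dynamic-programming version that uses only a join cost; the unit records, the pitch-based cost and all names are hypothetical, and this is not a description of Festvox's own algorithm.

```python
def select_units(targets, database, join_cost):
    """Choose one recorded unit per target label, minimising the total join cost."""
    candidates = [database[label] for label in targets]
    # best[i][k] = (cheapest cost of a path ending in candidate k of target i,
    #               index of the previous candidate on that path)
    best = [[(0.0, None) for _ in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for unit in candidates[i]:
            costs = [
                best[i - 1][j][0] + join_cost(prev, unit)
                for j, prev in enumerate(candidates[i - 1])
            ]
            j_min = min(range(len(costs)), key=costs.__getitem__)
            row.append((costs[j_min], j_min))
        best.append(row)
    k = min(range(len(best[-1])), key=lambda j: best[-1][j][0])
    total = best[-1][k][0]
    chosen = []
    for i in range(len(candidates) - 1, -1, -1):  # follow the back-pointers
        chosen.append(candidates[i][k])
        k = best[i][k][1]
    chosen.reverse()
    return total, chosen

def pitch_join_cost(prev, unit):
    """Hypothetical join cost: pitch mismatch (Hz) at the concatenation point."""
    return abs(prev["end_pitch"] - unit["start_pitch"])

# Toy unit database with made-up pitch values.
database = {
    "t": [
        {"id": "t_a", "start_pitch": 100, "end_pitch": 120},
        {"id": "t_b", "start_pitch": 100, "end_pitch": 150},
    ],
    "iy": [
        {"id": "iy_a", "start_pitch": 148, "end_pitch": 140},
        {"id": "iy_b", "start_pitch": 180, "end_pitch": 170},
    ],
}
total, chosen = select_units(["t", "iy"], database, pitch_join_cost)
print([unit["id"] for unit in chosen], total)
```

A fuller system would also score each candidate against the desired context and prosody (a target cost) and would smooth the waveform at each join.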


Concatenative synthesis
- High quality
- Natural sounding
- Sounds like the original speaker
- Needs a lot of data (~600 sentences)
- Can be inconsistent
- Difficult to manipulate prosody


HMM synthesis


HMM synthesis: adaptation
- Speech recordings -> training -> average speaker model
- Input data: speech recordings -> adaptation -> adapted speaker model
- Text input -> synthesis -> synthesised speech
- HTS: http://hts.sp.nitech.ac.jp/
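To give a feel for what adapting an average speaker model can mean, the sketch below applies MAP re-estimation to a single Gaussian mean: the adapted mean is pulled from the average-voice value towards the new speaker's data in proportion to how much data there is. This is one textbook form of adaptation, chosen here as an assumption; the slides do not say which scheme was used, and HTS offers others. The function name, the prior weight `tau` and the numbers are illustrative.

```python
def map_adapt_mean(prior_mean, frames, tau=10.0):
    """MAP update of a Gaussian mean.

    prior_mean: mean vector from the average speaker model.
    frames:     feature vectors from the new speaker assigned to this Gaussian.
    tau:        prior weight; larger values keep the result closer to prior_mean.
    """
    n = len(frames)
    return [
        (tau * m + sum(frame[d] for frame in frames)) / (tau + n)
        for d, m in enumerate(prior_mean)
    ]

average_voice_mean = [0.0, 2.0]
new_speaker_frames = [[1.0, 4.0]] * 10  # made-up adaptation data
adapted = map_adapt_mean(average_voice_mean, new_speaker_frames, tau=10.0)
print(adapted)
```

With no adaptation frames the mean stays at the average-voice value, and with many frames it approaches the new speaker's own average, which is one way to see why a small amount of data can already shift the voice.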


HMM synthesis
- Consistent
- Intelligible
- Easier to manipulate prosody
- Needs relatively little adaptation data (>5 sentences)
- Less natural than concatenative synthesis


Personalisation for individuals with progressive speech disorders
- Voice banking: before deterioration
- Capturing the essence of a voice: during deterioration


HMM synthesis: adaptation for dysarthric speech
- Speech recordings -> training -> average speaker model
- Input data: speech recordings -> adaptation -> adapted speaker model
- Text input -> synthesis -> synthesised speech
- Duration, phonation and energy information
- HTS: http://hts.sp.nitech.ac.jp/


Future directions
- Personal Adaptive Listeners (PALS)
- Home Service
- Companions


The PALS Concept
- A PAL is a portable (PDA, wearable, ...) device which you own
- Your PAL is like your valet
- It knows a lot about you:
  - The way you speak, the words you like to use
  - Your interests, contacts, networks
- You talk with it
  - The knowledge makes conversational dialogues viable
- It does things for you:
  - Bookings, appointments, reminders
  - Communication
  - Access to services
- It learns to do a better job:
  - By explicit training (this is how I refer to things, these are the names I use, ...): USER-AS-TEACHER
  - By automatic adaptation: acoustic models, language models, dialogue models

