Diagnostic Assessment of Childhood Apraxia of Speech Using Automatic Speech Recognition (ASR) Systems Lawrence D. Shriberg 1 John-Paul Hosom 2 Jordan R

Diagnostic Assessment of Childhood Apraxia of Speech Using Automatic Speech Recognition (ASR) Systems

Lawrence D. Shriberg1

John-Paul Hosom2

Jordan R. Green3

1Waisman Center, University of Wisconsin - Madison

2Center for Spoken Language Understanding, Oregon Health & Science University

3Department of Special Education & Communication Disorders, University of Nebraska - Lincoln

This research is supported by NIDCD grants DC000496 and DC006722

http://www.waisman.wisc.edu/phonology/Index.htm









Acknowledgments

Phonology Project, Waisman Center, University of Wisconsin - Madison

Roger Brown            Katherina Hauner    Jane McSweeny
Catherine Coffey       Heather Karlsson    Connie Nadler
Peter Flipsen Jr. (a)  Ray Kent            Alison Scheer
Jordan Green (b)       Yunjung Kim         Christie Tilkens
Sheryl Hall            Joan Kwiatkowski    David Wilson

Collaborative Projects

• Thomas Campbell & colleagues: University of Pittsburgh
• John-Paul Hosom & colleagues: Oregon Health & Science University
• Barbara Lewis & colleagues: Case Western Reserve University
• Christopher Moore & colleagues: University of Washington
• Rhea Paul & colleagues: Yale Child Studies Center
• Bruce Pennington & colleagues: University of Colorado
• Joanne Roberts & colleagues: University of North Carolina
• Bruce Tomblin & colleagues: University of Iowa

(a) University of Tennessee, Knoxville
(b) University of Nebraska

Complex Disease Model for Childhood Speech Sound Disorders (SSD) of Unknown Origin

[Figure: model diagram. Recoverable labels:
Levels: I. Etiological Processes; II. Explanatory Processes; III. Nosological Entity; IV. Trait Markers (phenotypes, endophenotypes); V. Diagnostic Markers
Risk and Protective Factors: Genetic, Environmental
Labels: Cognitive-Linguistic; Auditory-Perceptual; Speech Motor Control; Psycho-social; Phonological Attunement
Subtypes: Speech Delay – Genetic (SD-GEN); Speech Delay – Otitis Media with Effusion (SD-OME); Speech Delay – Speech Motor Involvement (SD-SMI); Speech Delay – Developmental Psychosocial Involvement (SD-DPI); Speech Errors (SE)
Subtype abbreviations: SD-GEN, SD-OME, SD-AOS, SD-DYS, SD-DPI, SE-/s/, SE-/r/
Marker entries: > Omissions, < Distortions, < Backing; > M1 values; < F3–F2; Speech markers, > I-S Gap, > Backing; > Severity; 8 speech markers, Lex. Stress Ratio, > Coeff. Var. Ratio]

*Shriberg, Austin, et al. (1997)

[Figure builds: the model diagram is repeated across four slides, first with "SDCS*" entries in the marker cells for each subtype, then with the speech and acoustic marker entries (> M1 values, < F3–F2, > I-S Gap, > Backing, > Severity).]

Diagnostic Markers and Automatic Speech Recognition

• Childhood Apraxia of Speech is a controversial disorder because there is no consensus on the features that define it or on its etiologic conditions (Guyette & Diedrich, 1981; Shriberg et al., 1997)
• “Suspected Apraxia of Speech” (sAOS) proposed as interim term (Shriberg et al., 1997)
• Two proposed markers for sAOS:
    - Lexical Stress Ratio (LSR) (Shriberg et al., 2003a)
    - Coefficient of Variation Ratio (CVR) (Shriberg et al., 2003b)
• This work: pilot study toward complete automation of these markers, to address inherent human variability. The aim was to replicate the results of prior work.
• Automation uses techniques from automatic speech recognition (ASR)

Outline of Talk

• Complex Disease Model for Childhood Speech-Sound Disorders of Unknown Origin
• Diagnostic Markers for sAOS
• Applying ASR to the Lexical Stress Ratio (LSR)
    - The Lexical Stress Ratio
    - Identifying Vowel Boundaries Using ASR
    - Computing the LSR
    - Results
• Applying ASR to Coefficient of Variation Ratio (CVR)
• Summary & Conclusion

Applying ASR to the Lexical Stress Ratio: The Lexical Stress Ratio

• LSR (Shriberg et al., 2003a) measures “inappropriate lexical stress” observed in children with sAOS
• Inappropriate lexical stress: excessive stress on a syllable, or lack of stress on a syllable that is normally stressed
• Three factors used to measure lexical stress: frequency area, amplitude area, and duration of the first and second vowels in trochaic words
• The three values are combined into a single dimension, defined as stress
• Both high and low LSR values are associated with sAOS
• This work: pilot study on 2 subjects from the original LSR study

Applying ASR to the Lexical Stress Ratio: Identifying Vowel Boundaries Using ASR

• Primary issue in automating LSR: determining the boundaries of both vowels in known, isolated, two-syllable words (e.g. “ladder”)
• Vowel boundaries determined by “forced alignment”

[Figure: forced alignment of the phoneme sequence .pau l @ dc d 3r .pau]
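
A minimal sketch of how vowel boundaries might be read off a forced-alignment result. It assumes the aligner returns time-stamped segments for the phoneme sequence shown on the slide; the `Segment` type, the times, and the choice of "@" and "3r" as the two vowel labels are illustrative assumptions, not the aligner's actual output format.

```python
from typing import List, NamedTuple, Set, Tuple


class Segment(NamedTuple):
    label: str    # phoneme label from the alignment
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds


# Hypothetical alignment for "ladder"; all times are invented for illustration.
ALIGNMENT: List[Segment] = [
    Segment(".pau", 0.00, 0.20),
    Segment("l", 0.20, 0.28),
    Segment("@", 0.28, 0.45),
    Segment("dc", 0.45, 0.50),
    Segment("d", 0.50, 0.53),
    Segment("3r", 0.53, 0.75),
    Segment(".pau", 0.75, 1.00),
]

# Assumed vowel labels for this word.
VOWELS: Set[str] = {"@", "3r"}


def vowel_boundaries(
    segments: List[Segment], vowels: Set[str] = VOWELS
) -> Tuple[Tuple[float, float], Tuple[float, float]]:
    """Return (start, end) for the first and second vowel of a two-syllable word."""
    found = [(s.start, s.end) for s in segments if s.label in vowels]
    if len(found) != 2:
        raise ValueError(f"expected exactly 2 vowels, found {len(found)}")
    return found[0], found[1]


v1, v2 = vowel_boundaries(ALIGNMENT)
print("first vowel:", v1, "second vowel:", v2)
```

Because the word is known in advance, only the segment times are uncertain; the vowel identities come from the pronunciation, not from recognition.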

Applying ASR to the Lexical Stress Ratio: Computing the LSR

• Vowel duration (D) = (end time of vowel) – (begin time of vowel)
• Amplitude area (AA) = (average amplitude of vowel (dB)) × D
• Frequency area (FA) = (average F0 of vowel) × D

[Figure panels: wave, spec, phon, F0, amp]
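
The three per-vowel measures can be sketched as follows. This assumes frame-level F0 and amplitude tracks sampled at a fixed step (10 msec here) and that unvoiced frames carry an F0 of 0; both conventions, and the function name `vowel_measures`, are assumptions made for illustration.

```python
from statistics import mean
from typing import Sequence, Tuple

FRAME_STEP = 0.010  # assumed frame step in seconds


def vowel_measures(
    start: float,
    end: float,
    f0_hz: Sequence[float],
    amp_db: Sequence[float],
    frame_step: float = FRAME_STEP,
) -> Tuple[float, float, float]:
    """Return (D, AA, FA) for one vowel given its boundaries and frame tracks."""
    duration = end - start
    first = int(round(start / frame_step))
    last = int(round(end / frame_step))
    amp_frames = list(amp_db[first:last])
    # Assumption: frames with F0 == 0 are unvoiced and excluded from the average.
    f0_frames = [f for f in f0_hz[first:last] if f > 0]
    if not amp_frames or not f0_frames:
        raise ValueError("vowel contains no usable frames")
    amplitude_area = mean(amp_frames) * duration
    frequency_area = mean(f0_frames) * duration
    return duration, amplitude_area, frequency_area


# Invented tracks: 8 frames, vowel spanning 0.02 s to 0.06 s.
f0 = [0, 0, 200, 200, 210, 210, 0, 0]
amp = [30, 30, 60, 60, 62, 62, 30, 30]
d, aa, fa = vowel_measures(0.02, 0.06, f0, amp)
print(f"D={d:.3f} s, AA={aa:.2f} dB*s, FA={fa:.2f} Hz*s")
```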

Applying ASR to the Lexical Stress Ratio: Computing the LSR

$$S_{1,i} = \frac{1}{N}\sum_{n=1}^{N}\frac{FA_{v1,n,i}}{FA_{v2,n,i}}, \qquad S_{2,i} = \frac{1}{N}\sum_{n=1}^{N}\frac{AA_{v1,n,i}}{AA_{v2,n,i}}, \qquad S_{3,i} = \frac{1}{N}\sum_{n=1}^{N}\frac{D_{v1,n,i}}{D_{v2,n,i}}$$

$$LSR_i = C_1 S_{1,i} + C_2 S_{2,i} + C_3 S_{3,i}, \qquad C_1 = 0.490,\; C_2 = 0.507,\; C_3 = 0.303$$

v1 = first vowel, v2 = second vowel, N = number of utterances, n = utterance index, i = subject

• LSR computed as described in Shriberg et al. (2003a): ratios of the frequency area, amplitude area, and duration between the first and second vowel are computed and combined into a single score.
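
A sketch of the score for one subject, assuming the combination is a weighted sum of the three first-to-second-vowel ratios, each averaged over the subject's utterances, with weights 0.490 (frequency area), 0.507 (amplitude area), and 0.303 (duration) as they appear to read in this slide's equation. The `VowelPair` type and the example values are hypothetical.

```python
from statistics import mean
from typing import List, NamedTuple

# Weights for the FA, AA, and D ratios, as read from the slide (assumption).
C1, C2, C3 = 0.490, 0.507, 0.303


class VowelPair(NamedTuple):
    fa1: float  # frequency area, first vowel
    fa2: float  # frequency area, second vowel
    aa1: float  # amplitude area, first vowel
    aa2: float  # amplitude area, second vowel
    d1: float   # duration, first vowel
    d2: float   # duration, second vowel


def lexical_stress_ratio(utterances: List[VowelPair]) -> float:
    """Weighted sum of the mean v1/v2 ratios over one subject's utterances."""
    s1 = mean(u.fa1 / u.fa2 for u in utterances)
    s2 = mean(u.aa1 / u.aa2 for u in utterances)
    s3 = mean(u.d1 / u.d2 for u in utterances)
    return C1 * s1 + C2 * s2 + C3 * s3


# Invented measurements for two utterances of one subject.
utterances = [
    VowelPair(fa1=36.0, fa2=30.0, aa1=11.0, aa2=10.0, d1=0.15, d2=0.10),
    VowelPair(fa1=30.0, fa2=30.0, aa1=11.0, aa2=10.0, d1=0.05, d2=0.10),
]
lsr = lexical_stress_ratio(utterances)
print(f"LSR = {lsr:.3f}")
```

With equal measures on both vowels every ratio is 1, so the score equals the sum of the weights (1.300); values above or below that reflect more or less weight on the first syllable.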

Applying ASR to the Lexical Stress Ratio: Results

• Standard error of the mean estimated from data published by Shriberg et al. (2003a): 0.023
• Difference in results for Subject 1 is within the estimated standard error of the mean; the difference for Subject 2 is not
• For Subject 2, manually correcting a single gross error from the forced alignment procedure changed the automatic LSR to 0.88 (within the standard error of the mean)

Participant   Reported LSR   Automatic LSR   Absolute Difference
Subject 1     1.65           1.63            0.02
Subject 2     0.89           0.83            0.06
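
The comparison against the standard error of the mean can be checked with a few lines of arithmetic. The values are those on the slide; attributing the corrected value of 0.88 to Subject 2 is an assumption based on its closeness to that subject's reported LSR.

```python
SEM = 0.023  # standard error of the mean, as estimated on the slide

# (reported LSR, automatic LSR)
RESULTS = {
    "Subject 1": (1.65, 1.63),
    "Subject 2": (0.89, 0.83),
    # Assumption: the manually corrected alignment belongs to Subject 2.
    "Subject 2 (alignment corrected)": (0.89, 0.88),
}


def within_sem(reported: float, automatic: float, sem: float = SEM) -> bool:
    """True when the absolute difference does not exceed the standard error."""
    return abs(reported - automatic) <= sem


summary = {name: within_sem(*pair) for name, pair in RESULTS.items()}
for name, ok in summary.items():
    print(f"{name}: {'within' if ok else 'outside'} SEM")
```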

Outline of Talk

• Complex Disease Model…
• Diagnostic Markers for sAOS
• Applying ASR to the Lexical Stress Ratio (LSR)
• Applying ASR to Coefficient of Variation Ratio (CVR)
    - The Coefficient of Variation Ratio
    - Identifying Speech/Pause Regions Using ASR
    - Computing the CVR
    - Results
• Summary & Conclusion

Applying ASR to the Coefficient of Variation Ratio: The Coefficient of Variation Ratio

• CVR (Shriberg et al., 2003b) measures reduction in normal temporal variation of speech, as observed in children with sAOS
• Measurement of CVR depends on duration of speech events and duration of pause events
• Because of reduced variability of speech-event durations in children with sAOS, these children have higher CVR values relative to control group

$$CVR = \frac{CV_{pause}}{CV_{speech}} = \frac{\sigma_p / \mu_p}{\sigma_s / \mu_s}$$

σp = standard deviation of pause events, μp = mean duration of pause events, σs = standard deviation of speech events, μs = mean duration of speech events
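
As a sketch of the ratio defined on this slide: the coefficient of variation of pause-event durations divided by that of speech-event durations. The durations are invented, and the use of the sample standard deviation (`statistics.stdev`) rather than the population form is an assumption, since the slide does not say which is used.

```python
from statistics import mean, stdev
from typing import Sequence


def coefficient_of_variation(durations: Sequence[float]) -> float:
    """Standard deviation divided by mean (sample SD is an assumption)."""
    return stdev(durations) / mean(durations)


def cvr(pause_durations: Sequence[float], speech_durations: Sequence[float]) -> float:
    """Coefficient of Variation Ratio: CV of pause events over CV of speech events."""
    return coefficient_of_variation(pause_durations) / coefficient_of_variation(
        speech_durations
    )


# Invented event durations in seconds.
pauses = [0.2, 0.4, 0.6]
speech = [0.9, 1.0, 1.1]
ratio = cvr(pauses, speech)
print(f"CV pause = {coefficient_of_variation(pauses):.2f}, "
      f"CV speech = {coefficient_of_variation(speech):.2f}, CVR = {ratio:.2f}")
```

Less variable speech-event durations shrink the denominator, which is why reduced temporal variation in speech shows up as a higher CVR.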

The Coefficient of Variation Ratio

• In Shriberg et al. (2003b), speech/pause events detected by:
    (1) displaying speech amplitude envelope
    (2) human identification of pause event with largest amplitude
    (3) speech/pause classification using threshold from Step (2)
    (4) removing speech/pause regions with duration < 100 msec
• Preliminary results show good agreement between this Matlab-based algorithm and manual measurements from spectrograms (Green et al., this conference)
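
Steps (3) and (4) can be sketched as below, taking the threshold from Step (2) as a given input. This is not the Matlab implementation: the frame step, the rule of absorbing a short region into the preceding event, and the function name are assumptions chosen for illustration.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def segment_speech_pause(
    envelope: Sequence[float],
    threshold: float,
    frame_step: float = 0.010,   # assumed envelope frame step in seconds
    min_duration: float = 0.100,  # regions shorter than 100 msec are removed
) -> List[Tuple[str, float]]:
    """Return (label, duration) events from an amplitude envelope."""
    # Step (3): frame-by-frame classification against the threshold.
    labels = ["speech" if a > threshold else "pause" for a in envelope]
    runs = [(label, len(list(group))) for label, group in groupby(labels)]

    # Step (4): absorb short regions into the preceding event (assumed rule).
    # A short region at the very start has no predecessor and is kept as is.
    min_frames = round(min_duration / frame_step)
    merged: List[List] = []
    for label, n_frames in runs:
        if merged and (n_frames < min_frames or merged[-1][0] == label):
            merged[-1][1] += n_frames
        else:
            merged.append([label, n_frames])
    return [(label, n_frames * frame_step) for label, n_frames in merged]


# Invented envelope: pause, speech with a 30 msec dip, pause.
envelope = [0.1] * 20 + [0.9] * 30 + [0.1] * 3 + [0.9] * 30 + [0.1] * 20
events = segment_speech_pause(envelope, threshold=0.5)
print(events)
```

The 30 msec dip is shorter than 100 msec, so the two speech regions around it are reported as one speech event.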

Identifying Speech/Pause Regions Using ASR

• Can be difficult to identify speech/pause from only energy or amplitude envelope, so investigated speech/pause detection using ASR
• ASR system trained using 300 utterances from 3 children with speech delay of unknown origin
• All training data phonetically labeled by hand, time-aligned at the phoneme level
• ASR system trained to classify 8 broad-phonemic classes related to speech (e.g. “nasal”), instead of phonemes
• Grammar used by ASR system imposed constraints on sequences of phonemic classes to be consistent with English syllable structure

Computing the CVR

• ASR results (broad phonetic classes with English syllable structure) mapped to “speech” and “pause” events
• CVR computed as in Shriberg et al. (2003b):

$$CVR = \frac{\sigma_p / \mu_p}{\sigma_s / \mu_s}$$

[Figure panels: wave, spectrogram, phone class, speech/pau]
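
A minimal sketch of the mapping step, assuming the recognizer outputs time-stamped broad-class segments and that one class denotes pause. Only “nasal” is named in the talk; the other class names, the segment format, and the times are hypothetical.

```python
from typing import List, Tuple

ClassSegment = Tuple[str, float, float]  # (class label, start s, end s)
Event = Tuple[str, float, float]         # ("speech" | "pause", start s, end s)

# Hypothetical label for non-speech; every other class counts as speech.
PAUSE_CLASSES = {"pause"}


def to_events(segments: List[ClassSegment]) -> List[Event]:
    """Map broad-class segments to speech/pause events, merging neighbours."""
    events: List[Event] = []
    for label, start, end in segments:
        kind = "pause" if label in PAUSE_CLASSES else "speech"
        if events and events[-1][0] == kind:
            # Extend the current event instead of starting a new one.
            events[-1] = (kind, events[-1][1], end)
        else:
            events.append((kind, start, end))
    return events


# Invented recognizer output.
segments = [
    ("pause", 0.0, 0.3),
    ("nasal", 0.3, 0.4),
    ("vowel", 0.4, 0.6),
    ("pause", 0.6, 0.9),
    ("stop", 0.9, 1.0),
    ("vowel", 1.0, 1.2),
]
events = to_events(segments)
speech_durations = [end - start for kind, start, end in events if kind == "speech"]
pause_durations = [end - start for kind, start, end in events if kind == "pause"]
print(events)
```

The resulting duration lists are the inputs to the two coefficients of variation in the CVR.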

Results

• ASR-method CVR values within 3% of reported values

Participant   Method     Average CV of pause events   Average CV of speech events   CVR
Subject 1     reported   0.581                        0.407                         1.43
Subject 1     ASR        0.565                        0.398                         1.42
Subject 2     reported   0.545                        0.503                         1.08
Subject 2     ASR        0.509                        0.460                         1.11
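
The table can be checked arithmetically: each CVR should equal the pause CV divided by the speech CV, and the ASR values should differ from the reported ones by less than 3%. The numbers below are copied from the table; the helper name is illustrative.

```python
# (average CV of pause events, average CV of speech events, CVR)
TABLE = {
    "Subject 1": {"reported": (0.581, 0.407, 1.43), "ASR": (0.565, 0.398, 1.42)},
    "Subject 2": {"reported": (0.545, 0.503, 1.08), "ASR": (0.509, 0.460, 1.11)},
}


def percent_difference(reported: float, measured: float) -> float:
    """Absolute difference as a percentage of the reported value."""
    return 100.0 * abs(measured - reported) / reported


differences = {}
for subject, rows in TABLE.items():
    for method, (cv_pause, cv_speech, listed_cvr) in rows.items():
        # Recompute the ratio from the two CVs and compare with the listed CVR.
        recomputed = round(cv_pause / cv_speech, 2)
        print(f"{subject} {method}: CVR {listed_cvr} (recomputed {recomputed})")
    differences[subject] = percent_difference(rows["reported"][2], rows["ASR"][2])

for subject, diff in differences.items():
    print(f"{subject}: ASR CVR differs from reported by {diff:.1f}%")
```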

Summary & Conclusion

• Agreement between published results and current pilot-study results indicates potential for ASR-based LSR
• Improvement needed for automation of LSR: train the speech recognizer on children’s speech data
• ASR-based CVR results for two subjects considered comparable to reported CVR results
• Algorithms still require refinement; improvements possible
• Need to evaluate generalization of methods to other subjects

References

• Green, J., Beukelman, D., Ball, L., Ullman, C., and Maassen, K. (2004). “Development and Evaluation of a Computer-based System to Measure and Analyze Pause and Speech Events,” Conference on Motor Speech: Motor Speech Disorders, Speech Motor Control, Albuquerque, NM.
• Guyette, T. W. and Diedrich, W. M. (1981). “A Critical Review of Developmental Apraxia of Speech,” in Speech and Language: Advances in Basic Research and Practice, 5, pp. 1-45.
• Hawley, M. (2003). “Speech Training And Recognition for Dysarthric Users of Assistive Technology (STARDUST),” Wales International Conference on Electronic Assistive Technology, Cardiff, Wales, July 2003.
• Hosom, J. P. (2000). Automatic Time Alignment of Phonemes Using Acoustic-Phonetic Information. Ph.D. thesis, Oregon Graduate Institute of Science and Technology, Beaverton, Oregon.
• Kasi, K. and Zahorian, S. A. (2002). “Yet Another Algorithm for Pitch Tracking,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2002, Orlando, FL, 1, pp. 361-364.
• Marquardt, T. P., Sussman, H. M., Snow, T., and Jacks, A. (2002). “The Integrity of the Syllable in Developmental Apraxia of Speech,” in Journal of Communication Disorders, 35, pp. 31-49.
• Shriberg, L. D., Austin, D., Lewis, B. A., McSweeny, J. L., and Wilson, D. L. (1997). “The Speech Disorders Classification System (SDCS): Extensions and Lifespan Reference Data,” in Journal of Speech, Language, and Hearing Research, 40, pp. 723-740.
• Shriberg, L. D., Campbell, T. F., Karlsson, H. B., Brown, R. L., McSweeny, J. L., and Nadler, C. J. (2003a). “A Diagnostic Marker for Childhood Apraxia of Speech: The Lexical Stress Ratio,” in Special Issue: Diagnostic Markers for Child Speech-Sound Disorders, Clinical Linguistics & Phonetics, 17.7, pp. 549-574.
• Shriberg, L. D., Green, J. R., Campbell, T. F., McSweeny, J. L., and Scheer, A. (2003b). “A Diagnostic Marker for Childhood Apraxia of Speech: The Coefficient of Variation Ratio,” in Special Issue: Diagnostic Markers for Child Speech-Sound Disorders, Clinical Linguistics & Phonetics, 17.7, pp. 575-595.


Applying ASR to the Lexical Stress Ratio: Speech Data

• Data from Shriberg et al.’s 2003a study (LSR corpus):
    - 24 children with speech delay (control data)
    - 11 children with sAOS
    - Recordings of elicited samples of 8 trochaic words
    - Average age: 6 yrs, 4 mo. for children with speech delay; 7 yrs, 1 mo. for children with sAOS
• This study:
    - Pilot study
    - 2 children from LSR corpus

Applying ASR to the Lexical Stress Ratio: Measuring F0 and Amplitude

• Fundamental frequency (F0) measured by computing auto-correlation of relative changes in energy between 200 and 700 Hz (Hosom, 2000)
• Post-processing of F0 contour: dynamic programming method to correct pitch-doubling and pitch-halving errors (e.g. Kasi and Zahorian, 2002)
• 3 Hz average absolute error on single speaker under quiet conditions (cf. electro-glottography measurements)
• Amplitude computed as the log energy (in decibels) of the signal using a 20-msec Hamming window
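
The amplitude measure can be sketched as frame-wise log energy under a 20-msec Hamming window. The 10-msec frame step, the small floor that avoids log of zero, and the absence of any calibration reference (so the dB values are relative) are assumptions; the F0 method is not reproduced here.

```python
import math
from typing import List, Sequence


def hamming(length: int) -> List[float]:
    """Hamming window coefficients of the given length."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1)) for n in range(length)]


def log_energy_db(
    samples: Sequence[float],
    sample_rate: int,
    window_sec: float = 0.020,  # 20-msec window, as on the slide
    step_sec: float = 0.010,    # assumed frame step
    floor: float = 1e-10,       # assumed floor to avoid log10(0)
) -> List[float]:
    """Log energy in dB (relative, uncalibrated) for each analysis frame."""
    win_len = int(round(window_sec * sample_rate))
    step = int(round(step_sec * sample_rate))
    window = hamming(win_len)
    contour = []
    for start in range(0, len(samples) - win_len + 1, step):
        energy = sum((samples[start + k] * window[k]) ** 2 for k in range(win_len))
        contour.append(10.0 * math.log10(energy + floor))
    return contour


# Invented test signal: 0.1 s of a 200 Hz tone at two amplitudes.
RATE = 8000
tone = [math.sin(2 * math.pi * 200 * n / RATE) for n in range(800)]
loud = log_energy_db(tone, RATE)
quiet = log_energy_db([0.1 * x for x in tone], RATE)
print(f"{len(loud)} frames; loud minus quiet = {loud[0] - quiet[0]:.1f} dB")
```

Scaling the signal by 0.1 lowers the contour by about 20 dB, which is a quick sanity check on the decibel conversion.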

Speech Data

• Data from Shriberg et al.’s 2003b study (CVR corpus):
    - 30 children with normal speech acquisition (control data)
    - 30 children with speech delay (control data)
    - 15 children with sAOS
    - Recordings of conversational speech
    - Inclusionary criteria for children with sAOS based on set of provisional speech and prosody-voice markers considered consistent with sAOS
• This study:
    - Pilot study
    - 2 children from CVR corpus

Results

• Replicated Shriberg et al.’s CV computation using threshold of amplitude envelope
• Differences between replicated and reported values for average CVs of pause and speech < 0.01; smallest reported standard error of the mean = 0.015
• Comparison of results of individual samples from the ASR and Matlab methods:
    (A) Matlab CVR method yields some speech events that are “interrupted” by low-amplitude speech. ASR-based CVR may be less sensitive to these interruptions.
    (B) Other differences in the way the ASR and Matlab techniques detect speech and pause events

Why LSR = state-of-the-art system for adult speech, CVR = broad classes trained on child speech?

• LSR forced alignment could be
    (a) state-of-the-art system trained on adult speech
    (b) not-state-of-the-art system trained on children’s speech
• Phoneme identities are given, so mismatch between adult and child speech not as severe as in ASR
• Chose (a) because of ease of implementation
• For CVR task of speech/pause detection, existing forced alignment system
    (a) would require software modifications for ASR
    (b) was not required to identify phonemes
• Implemented broad-phonetic-category ASR because of presumed robustness and ease of implementation

Long-Term Goals

• ASR may also allow automatic measurement of other prosodic factors, such as syllable duration, inter-stress intervals, and linguistic rhythm
• Multiple measures of sAOS may be combined for improved sensitivity and specificity
• Evaluate specific factors that influence diagnosis
