A Machine Learning Approach to Tongue Motion Analysis in 2D Ultrasound Image Sequences

Lisa Tang (1), Ghassan Hamarneh (1) and Tim Bressmann (2)
(1) Medical Image Analysis Lab, School of Computing Science, Simon Fraser University
(2) Department of Speech-Language Pathology, Faculty of Medicine, University of Toronto




CONCLUSIONS

• We presented a machine learning approach to analyze and describe motions of the human tongue in dynamic US

• Results show that our proposed descriptors can be employed to perform different classification tasks effectively

• Future work includes applying the method to data with more varied articulations

Feature Extraction

Spatio-temporal gestural descriptors

• These descriptors are designed to explicitly encode changes in tongue motion over time

• We perform principal component analysis on the x and y components of all displacement fields (for all k in all n studies)

• We represent D(n,k) using the principal coefficients C of its projection onto the first M principal components:

P_k = [ C^x_1 ... C^x_M  C^y_1 ... C^y_M ]

• Our spatio-temporal gestural descriptor is then encoded as the concatenation (see the sketch below):

[ P_1 P_2 ... P_{K-1} ]
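A minimal sketch of this encoding in Python (our illustration, not the authors' code; the helper name build_gestural_descriptor, the array layout, and M = 10 are assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

def build_gestural_descriptor(fields, M=10):
    """Encode a sequence of displacement fields as concatenated PCA coefficients.

    fields: array of shape (K-1, H, W, 2) holding the x/y displacement per pixel.
    Returns the 1D descriptor [P_1 P_2 ... P_{K-1}], where each P_k holds the
    first M principal coefficients of the x and y components of D(n,k).
    """
    n_fields = fields.shape[0]
    x_flat = fields[..., 0].reshape(n_fields, -1)  # x components, one row per field
    y_flat = fields[..., 1].reshape(n_fields, -1)  # y components

    # The poster fits PCA on fields pooled over all k and all n studies; for
    # brevity this sketch fits on one sequence (requires K-1 >= M samples).
    pca_x = PCA(n_components=M).fit(x_flat)
    pca_y = PCA(n_components=M).fit(y_flat)

    # Row k is P_k = [ C^x_1 ... C^x_M  C^y_1 ... C^y_M ]
    P = np.hstack([pca_x.transform(x_flat), pca_y.transform(y_flat)])
    return P.ravel()  # concatenation [ P_1 P_2 ... P_{K-1} ]
```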

Velocity-based descriptors

• To capture regional velocity differences that may exist, we divide the image domain into 3 regions and compute the distributions of the x- and y-components of D(n,k)

• Entries of each histogram constitute a feature vector, e.g. Vx-P is the vector for the x-component in the posterior region

• Concatenating all feature vectors yields our velocity-based descriptor (see the sketch below)

Figure: the image domain is divided into Posterior, Blade, and Dorsum regions, yielding the feature-vector pairs {Vx-P, Vy-P}, {Vx-B, Vy-B}, and {Vx-D, Vy-D}
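A minimal sketch of the velocity-based descriptor (our illustration; the region masks, bin count, and displacement range are assumptions not specified on the poster):

```python
import numpy as np

def velocity_descriptor(field, masks, n_bins=16, rng=(-5.0, 5.0)):
    """Histogram the x/y displacement components within each tongue region.

    field: (H, W, 2) displacement field D(n,k)
    masks: dict of boolean (H, W) arrays, e.g. keys 'posterior', 'blade', 'dorsum'
    The bin count and displacement range are illustrative choices.
    """
    features = []
    for name, mask in masks.items():
        for c in (0, 1):  # 0: x component (Vx-*), 1: y component (Vy-*)
            hist, _ = np.histogram(field[..., c][mask], bins=n_bins, range=rng)
            features.append(hist / max(hist.sum(), 1))  # normalized histogram entries
    return np.concatenate(features)  # e.g. [Vx-P Vy-P Vx-B Vy-B Vx-D Vy-D]
```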


How does Dynamic Time Warping [4] work?

1. Perform spectral analysis on each audio signal to extract features that relate to pitch, as well as onset times of beats/notes

2. Construct a t x t similarity matrix S, where Sij is the cosine distance between the features of Am at the i-th timestep and those of An at the j-th timestep

3. Find the lowest-cost path through S using dynamic programming; the path indicates which frame of An matches each frame of Am (see the sketch below)

Figure: spectrogram of an audio signal (frequency vs. time) and the similarity matrix S over the frame indices of Am and An, with the lowest-cost path overlaid
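To make step 3 concrete, here is a minimal dynamic-programming sketch (our illustration; dtw_path is a hypothetical name, and S is assumed to be the precomputed t x t cosine-distance matrix):

```python
import numpy as np

def dtw_path(S):
    """Lowest-cost monotonic path through a frame-similarity matrix S
    via dynamic programming (step 3 above)."""
    t1, t2 = S.shape
    D = np.full((t1, t2), np.inf)  # accumulated cost
    D[0, 0] = S[0, 0]
    for i in range(t1):
        for j in range(t2):
            if (i, j) == (0, 0):
                continue
            prev = min(D[i - 1, j] if i else np.inf,             # advance Am only
                       D[i, j - 1] if j else np.inf,             # advance An only
                       D[i - 1, j - 1] if i and j else np.inf)   # advance both
            D[i, j] = S[i, j] + prev
    # Backtrack from (t1-1, t2-1) to (0, 0) to recover the matched frame pairs
    i, j = t1 - 1, t2 - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        costs = [D[a, b] if a >= 0 and b >= 0 else np.inf for a, b in steps]
        i, j = steps[int(np.argmin(costs))]
        path.append((i, j))
    return path[::-1]  # [(frame of Am, frame of An), ...]
```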

[1] Rastadmehr et al.: Increased midsagittal tongue velocity as indication of articulatory compensation in patients with lateral partial glossectomies. Head & Neck 30(6) (2008) 718–726

[2] Kocjancic, T.: Ultrasound study of tongue movements in childhood apraxia of speech. In: Ultrafest V. (2010) 1–2

[3] Herold et al.: Analysis of vowel-consonant-vowel sequences in patients with partial glossectomies using 2D ultrasound imaging. In: Ultrafest V. (2010) 1–2

[4] Turetsky, R., Ellis, D.: Ground-truth transcriptions of real music from force-aligned midi syntheses. In: 4th ISMIR. (2003) 135–141

[5] Metz et al.: Nonrigid registration of dynamic medical imaging data using nD+t B-splines and a groupwise optimization approach. Medical Image Analysis 15(2) (2011) 238–249

[6] Wu, J.: A Fast Dual Method for HIK SVM Learning. In: ECCV. (2010) 552–565



Data Normalization

1. One patient study is chosen as reference, and its audio signal Am is chosen as the template audio signal

2. For each other patient study n, we seek a mapping Tm: Am → An that aligns audio signal Am to An using Dynamic Time Warping [4]

3. We then compute from Tm the K indices that indicate frame correspondences

• Reading speeds vary across subjects so the same word is articulated at different times

• We thus need to resolve temporal correspondences of the US sequences across subjects, i.e. extract a subset of US frames from each sequence in which the same sounds were spoken

• In finding temporal correspondence across studies Um and Un , we use their audio recordings Am and An :

Figure: the mapping Tm computed between audio signals Am and An yields correspondences between US frames U(m,1) ... U(m,K) of study Um and U(n,1) ... U(n,K) of study Un
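Building on the dtw_path sketch above, steps 2-3 might look as follows (our illustration; frame_correspondences and the even sampling of template frames are assumptions):

```python
import numpy as np

def frame_correspondences(S, K):
    """From the DTW alignment of Am to An, pick K corresponding frame indices.

    S: cosine-distance matrix between per-frame audio features of Am and An.
    Returns (idx_m, idx_n): K frame indices into Um and Un that correspond.
    """
    path = np.array(dtw_path(S))  # rows of (frame of Am, frame of An)
    # Sample K evenly spaced template frames along Am ...
    idx_m = np.linspace(0, S.shape[0] - 1, K).round().astype(int)
    # ... and for each, take the matched frame of An from the warping path
    idx_n = np.array([path[path[:, 0] == i][0, 1] for i in idx_m])
    return idx_m, idx_n
```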

Motion Characterization

• We characterize tongue motions via groupwise registration of the K extracted US frames

• We employ the 2D+time registration algorithm of [5]

• Registration accuracy has been confirmed using expert-delineated tongue contours

• This generates a set of displacement fields {D(n,k) : k = 1 ... K-1}, each of which maps points in frame k to corresponding points in frame k+1 in R^2 (see the sketch below)
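To illustrate what these fields provide, a point on the tongue can be tracked across frames by following the displacements (a sketch of ours, assuming the fields are sampled on the pixel grid; track_point is a hypothetical name):

```python
import numpy as np
from scipy.ndimage import map_coordinates

def track_point(fields, p0):
    """Track a point through frames by following the displacement fields.

    fields: (K-1, H, W, 2) array; fields[k] maps frame k points to frame k+1.
    p0: (x, y) starting location in frame 0.
    """
    trajectory = [np.asarray(p0, float)]
    for k in range(fields.shape[0]):
        x, y = trajectory[-1]
        # Interpolate the field at the current (sub-pixel) location (row=y, col=x)
        dx = map_coordinates(fields[k, :, :, 0], [[y], [x]], order=1)[0]
        dy = map_coordinates(fields[k, :, :, 1], [[y], [x]], order=1)[0]
        trajectory.append(trajectory[-1] + [dx, dy])
    return np.array(trajectory)  # (K, 2) point positions over time
```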

INTRODUCTION

• Analysis of ultrasound (US) tongue sequences and accompanying audio recordings enables speech research

• Current goal: develop a procedure for tongue motion analysis

• Ultimate goal: develop reliable and robust indicators that quantify what constitutes normal and abnormal tongue movement

• Such indicators would aid the development of treatment strategies for speech impediments

• In contrast to previous tongue motion analyses, e.g. [1-3], we propose a method that does not require segmentations

• We analyze tongue motion captured in the US data via 3 classification tasks to be described below

• Given a training set of paired data {(ai, bi)}, where ai is the feature vector and bi the label of motion sample i, we train a Support Vector Machine (SVM) that predicts the label of each sample in a test set based on the sample's features

• Distance between ai and aj is measured with the histogram intersection kernel [6]:

K(ai, aj) = Σ_{f=1}^{F} min(ai_f, aj_f)

where F is the length of ai

• We then train the SVM using Intersection Coordinate Descent [6], a deterministic algorithm that was shown to be fast and accurate
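A sketch of training and prediction with this kernel, using scikit-learn's precomputed-kernel SVM as a generic stand-in for the Intersection Coordinate Descent solver of [6] (our illustration; hik and train_and_predict are hypothetical names):

```python
import numpy as np
from sklearn.svm import SVC

def hik(A, B):
    """Histogram intersection kernel: K(ai, aj) = sum_f min(ai_f, aj_f).

    A: (n, F) and B: (m, F) feature matrices -> (n, m) Gram matrix."""
    return np.minimum(A[:, None, :], B[None, :, :]).sum(axis=2)

def train_and_predict(a_train, b_train, a_test):
    """a_train: (n, F) motion descriptors; b_train: (n,) labels."""
    clf = SVC(kernel='precomputed')
    clf.fit(hik(a_train, a_train), b_train)   # Gram matrix on the training set
    return clf.predict(hik(a_test, a_train))  # test-vs-train kernel values
```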

EXPERIMENTAL RESULTS

Task 1

• Objective: examine how tongue velocity varies in different regions of the tongue as subjects spoke

• Setup: subjects recited a passage of over 50 words

• Motion samples: displacement field computed between two US frames

• Analysis: abnormal vs. normal tongue motion

Task 2

• Objective: examine whether the spatio-temporal descriptors can be used to predict utterance type

• Setup: subjects recited 3 utterances 5x, each a vowel-consonant-vowel (VCV) sequence: /aka/, /ishi/, /ushu/

• Motion samples: the sequence of displacement fields generated from an entire VCV sequence

• Analysis: /aka/ vs. /ishi/ vs. /ushu/

Task 3

• Objective: examine whether the spatio-temporal descriptors can be used to predict abnormal tongue motion

• Setup: same as Task 2

• Motion samples: same as Task 2

• Analysis: abnormal vs. normal tongue motion

Classification accuracies

• Task 1 (velocity-based descriptors, by feature group): Vx-B 86%, Vy-B 90%, Vx-D 81%, Vy-D 89%, Vx-P 90%, Vy-P 91%, All 94%

• Task 2 (utterance type): pairwise classification of 84–86% (e.g. /aka/ vs. /ishi/, /ishi/ vs. /ushu/); 3-class 74%

• Task 3 (abnormal vs. normal, per utterance): /aka/ 84%, /ishi/ 86%, /ushu/ 84%


