Available online at www.sciencedirect.com
Computer Speech & Language 47 (2018) 157�174
www.elsevier.com/locate/csl
Optimal sensor placement in electromagnetic articulography recording for speech production study
Ashok Kumar Pattem^a, Aravind Illa^{*,a}, Amber Afshan^b, Prasanta Kumar Ghosh^a
^a Department of Electrical Engineering, Indian Institute of Science, Bangalore, Karnataka 560012, India
^b Department of Electrical Engineering, University of California, 420 Westwood Plaza, Los Angeles, CA 90095, USA
Received 27 April 2017; received in revised form 13 July 2017; accepted 26 July 2017
Available online 27 July 2017
Abstract
Electromagnetic articulography (EMA) is a widely used technological solution for measuring the articulatory movements relevant to speech production research. EMA typically tracks articulatory flesh points by placing sensors, often heuristically, on the key articulators, including the lips, jaw, tongue and velum, in the mid-sagittal plane. In this work, we address the problem of optimal placement of EMA sensors by posing it as the optimal selection of points for minimizing the reconstruction error of the air-tissue boundaries in the real-time magnetic resonance imaging (rtMRI) video frames of the vocal tract (VT) in the mid-sagittal plane. We propose an algorithm for optimal placement of EMA sensors using dynamic programming. Experiments are performed using rtMRI video frames for read speech from four subjects, with the upper and lower lips as two fixed points. One optimal sensor on the upper VT boundary is found to be at an average distance of 21.41 (±25.54) mm from the velum tip. Similarly, for the lower VT boundary, one optimal sensor is found at the lower incisor at a distance of 26.37 (±8.08) mm from the lower lip, and three optimal sensors on the tongue: at the tongue tip (19.93 (±11.45) mm from the tongue base) and at 38.2 (±11.52) mm and 80.51 (±13.51) mm away from the tongue tip.
© 2017 Elsevier Ltd. All rights reserved.
Keywords: Electromagnetic articulography; Sensor placement; Speech production
1. Introduction
Recording the dynamics of the speech articulators (e.g., lips, tongue, jaw, velum) is critical for the study of speech production (Rubin and Vatikiotis-Bateson, 1998). Articulatory movement data for speech production research are acquired using different modalities such as mid-sagittal X-ray diagrams (Ladefoged et al., 1978), X-ray microbeam imaging (XRMB) (Westbury et al., 1990), ultrasound (Watkin and Rubin, 1989), electropalatography (Stone and Lundberg, 1996), tagged MRI (Parthasarathy et al., 2007), electromagnetic articulography (EMA) (Maurer et al., 1993) and real-time magnetic resonance imaging (rtMRI) (Demolin et al., 2000; Narayanan et al., 2004). rtMRI provides a complete 2D mid-sagittal view of articulatory dynamics during read speech (Narayanan et al., 2014). Among the different modalities, only the MRI technique provides three-dimensional images of the vocal tract for sustained vowels (Demolin et al., 1996). The air-tissue boundaries from rtMRI images provide a time-varying
This paper has been recommended for acceptance by Prof. R. K. Moore.
* Corresponding author.
E-mail address: [email protected] (A. Illa).
http://dx.doi.org/10.1016/j.csl.2017.07.008
0885-2308/© 2017 Elsevier Ltd. All rights reserved.
description of the vocal tract shape in the mid-sagittal plane. However, the rtMRI data has a low temporal resolution (23.18 frames/s) (Narayanan et al., 2014). It also remains a challenge to record good quality speech from the subject during an rtMRI scan, due to the loud MRI scanner noise.
Unlike rtMRI, XRMB provides articulatory movement data at a rate of more than 100 Hz (Westbury, 1994). In spite of its high temporal resolution, the XRMB technique is limited in the sense that it does not provide a complete mid-sagittal view of articulatory dynamics, since only a few pellets placed sparsely on various articulators are tracked (Westbury et al., 1990). Ultrasound also provides a high temporal resolution (50 frames/s or more; Slørdahl et al., 2001) and good quality audio can be recorded simultaneously. But ultrasound images are noisy and capture only the first air-tissue boundary (Bresch et al., 2008). Hence, it is not possible to record the dynamics of the anterior tongue tip and lips in ultrasound imaging. On the other hand, EMA has a high temporal resolution (sampling rate of ~500 Hz). But it cannot capture the structure of the pharyngeal wall, unlike rtMRI recording. The EMA data provides the coordinates of sensors sparsely placed on different articulators. Another advantage of EMA recording is that good quality audio can be recorded in parallel. However, proper care has to be taken during EMA recording to minimize measurement errors. The accuracy of the articulatory movement measurements by EMA is affected by sensor failures, electromagnetic interference, sensors going out of the measurement region, and numerical instabilities (Yunusova et al., 2009; Stella et al., 2012). Attempts have been made to handle out-of-range issues (Kroos, 2008) and to improve the measurement accuracy of EMA (Kroos, 2012; Uchida et al., 2016). For acquiring articulatory movements during speech production, it has been claimed that the AG501 provides greater accuracy and is more user-friendly than the AG500 (Stella et al., 2013). It is thus apparent that different modalities capture different amounts of spatial and temporal information about articulatory movements (Bresch et al., 2008), depending on the imaging technique used or the placement of the sensors and pellets. In this work, we focus on the optimal placement of sensors in the mid-sagittal plane for EMA recording, such that it provides maximal information about the air-tissue boundaries as observed in rtMRI recording.
EMA data has been crucial for several speech production studies, analyses and models, including experimental phonetics, articulatory movement modeling (Perkell et al., 1992; King and Wrench, 1999), examining the variability of coarticulation (Cho, 2004; Bombien et al., 2007; Hardcastle et al., 1996; Recasens, 2002; Hoole et al., 1993; Hoole and Gfoerer, 1990; Hoole and Nguyen, 1997; Mooshammer and Hoole, 1993; Mooshammer and Schiller, 1996; Katz et al., 1990; West, 2000), and understanding the coupling dynamics (Van Lieshout, 2001; Van Lieshout et al., 2002) of motor primitives in speech movements in normal and disordered speech (Schulz et al., 2000; Maassen et al., 2007; Van Lieshout, 2007; Van Lieshout et al., 2007) as well as during stuttering (Peters et al., 2000; McClean and Runyan, 2000; Namasivayam and Van Lieshout, 2001; 2008) and swallowing (Steele and Van Lieshout, 2004; 2005; Bennett et al., 2007; Steele and Van Lieshout, 2009). EMA data of articulatory kinematics available through MOCHA-TIMIT (Wrench, 2000) and USC-TIMIT (Narayanan et al., 2014) are widely used for acoustic-articulatory modeling for speech recognition (Frankel et al., 2000; Wrench and Richmond, 2000; Richardson et al., 2003), text-to-articulatory-movement prediction and analysis of critical articulators (Zhang and Renals, 2008; Ling et al., 2010), mapping from articulatory movements to the vocal tract spectrum (Payan and Perrier, 1997; Toda et al., 2004b; Steiner et al., 2013), acoustic-to-articulatory inversion (Toutios and Margaritis, 2003; Toda et al., 2004a; Ghosh and Narayanan, 2010; Uria et al., 2011; Ghosh and Narayanan, 2011), and multimodal speech animation (Kim et al., 2014; Engwall, 2003).
Given these widespread uses of EMA data, it is important to develop a principled approach to the placement of sensors during EMA recording. Since EMA data provides the movement of only a few sparsely placed sensors, they must be placed optimally in order to capture maximal information about the articulatory dynamics. In most of the existing EMA recordings, the sensors are placed following heuristic rules. For example, for recording using the Carstens AG100 system, the suggested positions of three EMA sensors on the tongue are 1 cm from the tongue tip, the midpoint of the tongue body, and 4 cm from the tongue tip as the tongue dorsum (UCLA, 2017). The TORGO Database of Dysarthric Articulation (Rudzicz et al., 2012) was recorded from dysarthria patients. It consists of both acoustics and articulatory data from EMA and 3D reconstruction from binocular video sequences. The sensors were placed on the tongue at three different locations, namely the tongue tip at 1 cm, tongue middle at 4 cm, and tongue back at 6 cm behind the anatomical tongue tip. In another study of pharyngealization using an articulograph (Ouni and Laprie, 2009), EMA data was collected by placing four sensors at 1.6, 3.6, 5.2 and 7 cm away from the tongue tip. Mücke et al. (2012) collected articulatory data from German speakers and used only two sensors on the tongue, at 1 and 4 cm away from the tongue tip, called the tongue blade and tongue body,
respectively. The database collected by Wong et al. (2011) had the same sensor placement on the tongue as that of Mücke et al. (2012), for the study of lingual kinematics in dysarthric patients. For estimating the control parameters of an articulatory model, Toutios et al. (2011) used four sensors on the tongue, where one sensor was placed approximately on the tongue tip and the rest were 1.4, 3.1 and 5.7 cm away from the tongue tip. Duran et al. (2013) used articulatory data for developing a context sequence model for speech production. The three sensor locations on the tongue were as follows: one on the tip and the other two at 3 and 4 cm away from the tip. Serrurier et al. (2008) collected EMA data during speaking and feeding activities. The sensor locations on the tongue were 1, 4 and 6 cm from the tongue tip. Feng (2008), in his study, placed the first sensor at a distance of 1 cm from the tongue tip and two other sensors posteriorly along the midline with a spacing of approximately 1–1.5 cm. Koos et al. (2013) used three sensors: one on the tip and two others 2 and 4 cm behind the tongue tip. For the collection of the MOCHA database, Wrench (2000) used three sensors at 1, ~3–4 and ~5–7 cm from the tongue tip. It is clear that there is no uniform mechanism by which the sensors are placed on the tongue across different EMA-based studies. In fact, the sensor positions change across various databases.
Wang et al. (2013) provided a recommendation for the placement of the EMA sensors using a finite element model of the tongue. However, no such recommendations are available for the placement of sensors on articulators other than the tongue. In this work, we propose an optimization framework for determining the optimal sensor locations in the mid-sagittal plane for EMA recording. We aim to find the sensor locations so that maximal information about the vocal tract shape in the mid-sagittal plane is preserved. This is done by optimizing the locations of seven sensors such that the vocal tract (VT) air-tissue boundaries (referred to as VT boundaries), as observed in rtMRI recording, can be reconstructed with minimum error. For this purpose, we have used manually annotated air-tissue boundaries from rtMRI videos of four subjects. We begin with the description of the dataset (Section 2). In Section 3, we present an objective function for the optimal placement of EMA sensors and the steps to solve it. We present the optimized sensor locations and related discussions in Section 4. Section 5 summarizes the key findings and future work.
2. Dataset
The experiments in this work are performed using the MRI-TIMIT database (Narayanan et al., 2011), which is an excellent resource for the analysis and understanding of articulatory movements in read speech. MRI-TIMIT contains audio recordings synchronized with a sequence of rtMRI images of the mid-sagittal upper airway, acquired from two female (F1, F2) and two male (M1, M2) native American English speakers aged between 23 and 33 years. A detailed description of the working principle and acquisition of MRI is given by Brown et al. (2014). The upper airway is imaged at a rate of 23.18 frames/s with a resolution of 68 × 68 pixels (each pixel of size 2.9 × 2.9 mm). Audio is simultaneously recorded at a sampling frequency of 20 kHz while subjects are imaged, using a custom fiber-optic microphone noise-canceling system. The recording is performed while each speaker utters a set of 460 sentences. The total durations of the recordings are 38.19, 37.99, 39.05, and 38.07 min for the four subjects, F1, F2, M1, and M2, respectively.
Selection of the optimal sensor locations in this work requires manual annotation, which involves marking air-tissue boundaries manually on every rtMRI image. However, the total numbers of rtMRI frames in MRI-TIMIT are 53.1 × 10³, 52.8 × 10³, 54.3 × 10³ and 52.9 × 10³ for F1, F2, M1 and M2, respectively. Since manual annotation of all these frames is time-consuming, we select a subset of five sentences for each speaker. These five sentences are chosen based on their phonetic richness so that the vocal tract shapes for most of the phonemes are covered. For this purpose, we have chosen a set of 51 phonemes (Liu, 1994). The five sentences are selected using a forward sentence selection algorithm in which the entropy computed using the histogram of the phonemes is maximized. The forward sentence selection works in a greedy manner: a sentence is chosen in each step such that its phonemes, when considered along with the phonemes of the already chosen sentences, maximize the entropy. Following the forward sentence selection algorithm, we obtain the following five sentences from the MRI-TIMIT corpus for each subject: (1) She always jokes about too much garlic in his food, (2) There was a gigantic wasp next to Irving's big top hat, (3) Laugh, dance and sing, if fortune smiles upon you, (4) I'd ride the subway but I haven't enough change, (5) Eating spinach nightly increases strength miraculously.
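The greedy entropy-maximizing selection described above can be sketched as follows. This is a minimal illustration; the function and variable names are ours, not from the paper, and the actual study applied the procedure to phoneme transcriptions of the 460 MRI-TIMIT sentences:

```python
import math
from collections import Counter

def phoneme_entropy(counts):
    """Shannon entropy (bits) of a phoneme histogram."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)

def greedy_select(sentences, num_select=5):
    """Greedily pick sentences whose phonemes maximize the histogram entropy.

    `sentences` maps a sentence id to its list of phoneme labels.
    """
    chosen, pool = [], dict(sentences)
    accumulated = Counter()
    for _ in range(num_select):
        best_id, best_h = None, -1.0
        for sid, phones in pool.items():
            # Entropy if this sentence were added to the already chosen set.
            h = phoneme_entropy(accumulated + Counter(phones))
            if h > best_h:
                best_id, best_h = sid, h
        chosen.append(best_id)
        accumulated += Counter(pool.pop(best_id))
    return chosen
```

Each step is locally optimal; as with any greedy selection, the final set of five sentences is not guaranteed to be the global entropy maximizer.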
These five sentences contain all 51 phonemes except five, which lie among the seven least frequent phonemes in the MRI-TIMIT corpus. The number of rtMRI frames from these five sentences turns out to be 474, 430, 540 and 460 for F1, F2, M1 and M2, respectively.
Fig. 1. Sample annotation of an rtMRI video frame. (For interpretation of the references to color in this figure, the reader is referred to the web
version of this article.)
The air-tissue boundaries in each rtMRI image are manually traced. This involves tracing three contours which constitute the air-tissue boundaries near different anatomical structures in the mid-sagittal plane of the upper airway, namely, Contour1: upper lip - hard palate - velum; Contour2: jaw - lower lip - tongue - epiglottis - larynx; Contour3: pharyngeal wall - glottis. The tracing and annotation is done by five people in the age group of 21–25 years, who have prior knowledge of the anatomy of the upper airway. The annotation task for different sentences of different subjects is balanced across the five annotators. A Graphical User Interface (GUI) developed in MATLAB R2013 has been designed to help the annotators draw the three contours. For tracing each contour, the annotators are asked to mark points (by clicking) on the air-tissue boundary. Annotators are allowed to mark as many points on each contour as they feel appropriate to correctly depict the contour curvature. In addition to marking the three contours of interest, annotators are also asked to mark the locations of the upper lip (UL), lower lip (LL), velum (VEL) tip and tongue base (TBa) in each rtMRI frame. A sample annotated frame is shown in Fig. 1, where Contour1 is indicated by blue dots, Contour2 by black dots and Contour3 by red dots. The annotation of the air-tissue boundaries is done in a manner similar to that of Lingala et al. (2017). The locations of the UL, LL, VEL tip and tongue base as marked by the annotator are shown in Fig. 1 with yellow squares. A large number of points on a contour results in a smaller distance between two consecutive points, referred to as the inter-point distance. The number of points marked by the annotators, averaged over all frames of F1, F2, M1, and M2 separately, is (60, 58, 59, 62), (79, 64, 70, 78) and (49, 51, 40, 52) for Contour1, Contour2, and Contour3, respectively. These correspond to minimum inter-point distances (in mm) of (1.54, 2.60, 2.40, 2.75), (2.29, 3.14, 2.93, 2.85) and (1.64, 2.26, 2.84, 2.62) and average inter-point distances (in mm) of (5.45, 6.20, 6.00, 6.04), (5.80, 7.12, 7.25, 6.56) and (5.37, 5.79, 6.25, 5.52) when averaged over all frames of each subject. This indicates that annotators, on average, mark two consecutive points at a distance ranging from 1.85 to 2.5 pixels. The minimum inter-point spacing (1.54 mm) is found for Contour1 of F1.
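The inter-point statistics above are computed directly from the marked points of a contour; a minimal sketch (the function name is ours), with distances in the same units as the input coordinates:

```python
import numpy as np

def interpoint_stats(contour):
    """Minimum and average distance between consecutive annotated points
    of a contour given as an (N, 2) array of (x, y) coordinates."""
    d = np.linalg.norm(np.diff(np.asarray(contour, float), axis=0), axis=1)
    return d.min(), d.mean()
```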
Before the annotators started manual tracing, the details of the annotation using the GUI were explained to them. It is known that some anatomical structures could remain invisible due to the presence of thermal noise (Huettel et al., 2004) or the absence of sufficient hydrogen content (Botta, 2000; Berger, 2002). In such scenarios, annotators are asked to trace the contours to the best of their judgment in the rtMRI frame. Tracings from each annotator in each rtMRI frame have been cross-checked and corrected by another annotator to improve the quality of the annotation. It is found that, on average, the duration for completing the annotation of one sentence is 7–8 h (6–10 min per rtMRI frame). It is also found that the annotation time for F2 and M1 is higher compared to the other two subjects. This could be due to morphological differences across subjects.
3. Optimal sensor placement
Due to the invasive nature of the recording, flesh points of only a few articulators are tracked in EMA. For example, in most of the EMA recordings in the literature, sensors have been placed in the front part of the vocal tract, including
UL, LL, lower incisor (LI), tongue tip (TT), tongue body (TB) and tongue dorsum (TD) in the mid-sagittal plane. Typical locations of these sensors (S1, ..., S6) are shown in Fig. 2, where the upper and lower vocal tract boundaries are manually drawn (blue and green contours, respectively) on a randomly chosen video frame from the MRI-TIMIT corpus. In a few recordings (Wrench, 2000), a sensor is attached to the velum to track its movements (indicated by S7 in Fig. 2). However, no sensors are typically attached to the back part of the vocal tract (behind the tongue dorsum on the lower VT boundary and behind the velum on the upper VT boundary), mainly to avoid discomfort to the subject during speaking. Thus, in this work, we optimize the location of the sensors in the front part of the vocal tract in the mid-sagittal plane. The front part of the vocal tract is assumed to begin with the UL and LL as marked by the annotators. It is essential to record the UL and LL in order to track the opening of the vocal tract. Hence, we assume that two of the seven sensors should be at the UL and LL, which form the vocal tract opening, and, thus, we do not optimize the locations of UL (S1) and LL (S2). Having fixed two of the seven sensor locations, the optimal locations of the remaining sensors are determined such that they carry maximal information about the shape of the front part of the vocal tract in the mid-sagittal plane. The front part of the vocal tract in the mid-sagittal plane is defined using the upper and lower VT boundaries, shown as blue and green bold contours, respectively, in Fig. 2. The velum is the only moving part of the upper VT boundary; hence, we assume that it is essential to capture the vocal tract shape up to the end of the velum segment. Similarly, we assume that, using EMA sensor locations, the lower part of the vocal tract can be reconstructed at most up to the end of the tongue, since no sensors are typically placed beyond the tongue due to the gag reflex, and it is also difficult to glue a sensor to anything behind the soft palate. Hence, it would be difficult to capture vocal tract shape information near and beyond the epiglottis.
Thus, the goal of optimum sensor placement becomes finding:
1) The optimum location of one sensor (S7) on the upper VT boundary such that the VT boundary from the UL to the end of the velum segment (shown by the blue bold curve between two cyan boxes in Fig. 2) can be reconstructed with minimal error,
2) The optimum locations of four sensors (S3, S4, S5, S6) on the lower VT boundary such that the VT boundary from the LL to the end of the tongue (Tend) (shown by the green bold curve between two cyan boxes in Fig. 2) can be reconstructed with minimal error.
Fig. 2. Illustration of the sensor locations; cyan boxes indicate the endpoints of the VT boundaries. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
By ensuring a good quality reconstruction of the upper and lower VT boundaries, a good quality reconstruction of the vocal tract shape and area function in the front part of the vocal tract is also ensured. The optimum locations of the seven sensors, thus obtained, would capture the information needed to recover (interpolate) the missing outline of the VT boundaries in the front part of the vocal tract. The location of one sensor on the upper VT boundary in a frame is optimized separately from the locations of the four sensors on the lower VT boundary in the same frame. This is mainly because the sensor locations on the lower VT boundary do not directly provide any information about the upper VT boundary shape and vice versa. Suppose there are $N_u$ points on the upper VT boundary of a test frame denoted by
$C_u = \{x_u(i), y_u(i), 1 \le i \le N_u\}$. Similarly, there are $N_l$ points on the lower VT boundary denoted by $C_l = \{x_l(i), y_l(i), 1 \le i \le N_l\}$. The upper VT boundary contour of the $k$th ($1 \le k \le K$) training rtMRI frame is denoted by $C_u^k = \{x_u^k(i), y_u^k(i), 1 \le i \le N_u^k\}$, where $K$ is the total number of training frames. Similarly, $C_l^k = \{x_l^k(i), y_l^k(i), 1 \le i \le N_l^k\}$ denotes the lower VT boundary contour of the $k$th training rtMRI frame. Interpolation is a key step in reconstructing the VT boundaries from the sensor locations; it allows reconstructing the missing points between any two given points on a boundary. We consider two types of interpolation, namely linear interpolation and data-driven interpolation. These are described below before the optimum sensor localization algorithm is presented.
3.1. Linear interpolation

Consider the $i$th and $j$th points on the upper VT boundary of a test frame, i.e., $[x_u(i), y_u(i)]$ and $[x_u(j), y_u(j)]$. Suppose we need to interpolate $N$ equidistant points $\{\hat{x}_u(n), \hat{y}_u(n), 1 \le n \le N\}$ between these two points such that $[\hat{x}_u(1), \hat{y}_u(1)] = [x_u(i), y_u(i)]$ and $[\hat{x}_u(N), \hat{y}_u(N)] = [x_u(j), y_u(j)]$. For linear interpolation, all these points must lie on the line joining the $i$th and $j$th points. The equation of this line is given by

$$\hat{y}_u(n) = \frac{y_u(j) - y_u(i)}{x_u(j) - x_u(i)}\,\hat{x}_u(n) + \frac{y_u(i)\,x_u(j) - y_u(j)\,x_u(i)}{x_u(j) - x_u(i)}, \quad 1 \le n \le N. \tag{1}$$

For equi-spacing, the distance between two consecutive points $[\hat{x}_u(n), \hat{y}_u(n)]$ and $[\hat{x}_u(n+1), \hat{y}_u(n+1)]$ must be

$$\Delta = \frac{\sqrt{\left(x_u(i) - x_u(j)\right)^2 + \left(y_u(i) - y_u(j)\right)^2}}{N - 1}. \tag{2}$$

Thus, the $n$th equi-spaced point, which is at a distance of $n\Delta$ away from $[x_u(i), y_u(i)]$, can be found by solving Eq. (1) and the following equation:

$$n\Delta = \sqrt{\left(\hat{x}_u(n) - x_u(i)\right)^2 + \left(\hat{y}_u(n) - y_u(i)\right)^2}. \tag{3}$$

Since Eq. (3) is quadratic in $\hat{x}_u(n)$ and $\hat{y}_u(n)$, we obtain two solutions and keep the one which lies closer to $[x_u(j), y_u(j)]$. This, in turn, ensures that the solution lies on the line segment joining $[x_u(i), y_u(i)]$ and $[x_u(j), y_u(j)]$. For the lower VT boundary, $\hat{x}_l(n)$ and $\hat{y}_l(n)$ are obtained in a similar manner.
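Because the interpolated points are constrained to the line segment and are equi-spaced, the solution of Eqs. (1)-(3) can equivalently be written in parametric form; a minimal sketch (the function name is ours):

```python
import numpy as np

def linear_interp(p_i, p_j, N):
    """Return N equi-spaced points on the segment from p_i to p_j,
    with the first and last points coinciding with p_i and p_j
    (the closed-form solution of Eqs. (1)-(3), written parametrically)."""
    t = np.linspace(0.0, 1.0, N)[:, None]   # fraction of the total segment length
    return (1.0 - t) * np.asarray(p_i, float) + t * np.asarray(p_j, float)
```

For example, `linear_interp((0, 0), (3, 4), 5)` places five points along a segment of length 5, so consecutive points are 1.25 apart.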
As long as the segment of the VT boundary between two given points can be well approximated by a line segment, linear interpolation works well. However, if the segment of the boundary has a non-linear shape, linear interpolation is not effective. Typically, VT boundaries have non-linear segments. Hence, a data-driven interpolation using a set of training boundaries is proposed to overcome this limitation.
3.2. Data-driven interpolation

In data-driven interpolation, the segment between any two points on a test VT boundary is reconstructed by finding the best segment from the VT boundaries of the training set. The best segment is obtained by first finding two points on the training VT boundaries that are closest to the two test points in the Euclidean sense; the segment between the two closest training points is then used for reconstruction. Consider the task of interpolating $N$ equidistant points between two given points on the upper VT boundary, $[x_u(i), y_u(i)]$ and $[x_u(j), y_u(j)]$. For this purpose, the $K$ upper VT boundaries from the training set are used. The one among the $K$ training VT boundaries which has the closest proximity to the two given points is selected. This is obtained by finding two points in each training boundary, each point being closest to one of the given points. Let $[x_u^k(i'), y_u^k(i')]$ be the point of the $k$th training boundary closest to $[x_u(i), y_u(i)]$, with distance

$$D_i^k = \sqrt{\left(x_u(i) - x_u^k(i')\right)^2 + \left(y_u(i) - y_u^k(i')\right)^2}.$$

Similarly, for $[x_u(j), y_u(j)]$, $[x_u^k(j'), y_u^k(j')]$ is the closest point, with distance

$$D_j^k = \sqrt{\left(x_u(j) - x_u^k(j')\right)^2 + \left(y_u(j) - y_u^k(j')\right)^2}.$$

The best boundary from the training set is obtained as $k^\star = \arg\min_k \left(D_i^k + D_j^k\right)$.
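The selection of the best training boundary can be sketched as follows (an illustrative implementation under our own naming; the affine-transformation step of the paper's Appendix A is not shown):

```python
import numpy as np

def best_training_boundary(p_i, p_j, train_boundaries):
    """Pick the training boundary k* minimizing D_i^k + D_j^k, and return
    its index together with the indices i', j' of its closest points.

    `train_boundaries` is a list of (N_k, 2) arrays of boundary points."""
    best = None
    for k, C in enumerate(train_boundaries):
        C = np.asarray(C, float)
        d_i = np.linalg.norm(C - np.asarray(p_i, float), axis=1)  # distances to first test point
        d_j = np.linalg.norm(C - np.asarray(p_j, float), axis=1)  # distances to second test point
        cost = d_i.min() + d_j.min()                              # D_i^k + D_j^k
        if best is None or cost < best[0]:
            best = (cost, k, int(d_i.argmin()), int(d_j.argmin()))
    _, k_star, i_prime, j_prime = best
    return k_star, i_prime, j_prime
```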
The segment (comprising $N' \ne N$ points) of the chosen training boundary from $[x_u^{k^\star}(i'), y_u^{k^\star}(i')]$ to $[x_u^{k^\star}(j'), y_u^{k^\star}(j')]$ is used to interpolate $N$ points between $[x_u(i), y_u(i)]$ and $[x_u(j), y_u(j)]$ using an affine transformation followed by resampling. First, the affine transformation is applied on the boundary segment from $[x_u^{k^\star}(i'), y_u^{k^\star}(i')]$ to $[x_u^{k^\star}(j'), y_u^{k^\star}(j')]$, which results in $N'$ points (non-equi-spaced) between the two given test points $[x_u(i), y_u(i)]$ and $[x_u(j), y_u(j)]$ (see Appendix A for details). After this, a piece-wise linear contour is obtained by linearly interpolating these $N'$ points, and then $N$ points, $\{\hat{x}_u(n), \hat{y}_u(n), 1 \le n \le N\}$, are sampled on this contour so that they are equispaced. The steps of resampling a contour are outlined in Appendix B. For the lower VT boundary, $\hat{x}_l(n)$ and $\hat{y}_l(n)$ are obtained in a similar manner. In the next two subsections, we describe the sensor location optimization algorithms for the upper and lower VT boundaries separately.
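The resampling step (detailed in Appendix B, which is not reproduced here) amounts to placing $N$ points equi-spaced in arc length along a piece-wise linear contour; a minimal sketch, with our own function name:

```python
import numpy as np

def resample_contour(points, N):
    """Resample a piece-wise linear contour into N points that are
    equi-spaced in arc length."""
    P = np.asarray(points, float)
    seg = np.linalg.norm(np.diff(P, axis=0), axis=1)   # segment lengths
    s = np.concatenate([[0.0], np.cumsum(seg)])        # cumulative arc length
    targets = np.linspace(0.0, s[-1], N)               # equi-spaced arc lengths
    x = np.interp(targets, s, P[:, 0])                 # interpolate x along arc length
    y = np.interp(targets, s, P[:, 1])                 # interpolate y along arc length
    return np.stack([x, y], axis=1)
```

For instance, an L-shaped contour through (0,0), (1,0), (1,1) has total length 2, so resampling with N = 5 places points every 0.5 units of arc length, including one at the corner.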
3.3. Optimum location of one sensor on the upper VT boundary

Consider the upper VT boundary $C_u = \{x_u(i), y_u(i), 1 \le i \le N_u\}$ of a test frame. $C_u$ begins at the location of the UL sensor and continues until the end of the velum segment, i.e., $[x_u(1), y_u(1)]$ and $[x_u(N_u), y_u(N_u)]$ denote the UL sensor location and the end of the velum segment, respectively (indicated by the cyan squares in Fig. 2). One sensor can be placed at any of the remaining $N_u - 2$ points. Thus, the optimal sensor location is obtained by first reconstructing the upper VT boundary using a sensor location anywhere among the remaining $N_u - 2$ points together with the end points, followed by searching for the sensor location which results in the least reconstruction error. For a given frame, the total reconstruction error (TRE) of the upper VT boundary is expressed as:

$$TRE_U = \sum_{i=1}^{N_u} \left[\left(x_u(i) - \hat{x}_u(i)\right)^2 + \left(y_u(i) - \hat{y}_u(i)\right)^2\right] \tag{4}$$

At first, we define the local mean squared error given two points $[x_u(s), y_u(s)]$ and $[x_u(e), y_u(e)]$, for interpolating the $N = (e - s) - 1$ in-between points $\{\hat{x}_u(n), \hat{y}_u(n), s < n < e\}$ by either linear or data-driven interpolation. The local mean squared error given the $s$th and $e$th points is defined as follows:

$$M_u^{Loc}(s, e) = \sum_{n=s+1}^{e-1} \left[\left(x_u(n) - \hat{x}_u(n)\right)^2 + \left(y_u(n) - \hat{y}_u(n)\right)^2\right] \tag{5}$$

For any chosen point $2 \le k \le N_u - 1$, using Eqs. (4) and (5), we can write $TRE_U = M_u^{Loc}(1, k) + M_u^{Loc}(k, N_u)$. Hence, for the upper VT boundary, the optimal sensor location $[x_u(k^\star), y_u(k^\star)]$ is obtained by performing the following optimization:

$$[x_u(k^\star), y_u(k^\star)] = \arg\min_{2 \le k \le N_u - 1} TRE_U \tag{6}$$

$$= \arg\min_{2 \le k \le N_u - 1} \; M_u^{Loc}(1, k) + M_u^{Loc}(k, N_u) \tag{7}$$
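The exhaustive search of Eqs. (6) and (7) can be sketched as follows. This is a minimal illustration with our own function names; `interp` stands for either of the two interpolators of Sections 3.1 and 3.2, here assumed to return points with the endpoints included:

```python
import numpy as np

def optimal_upper_sensor(C_u, interp):
    """Try each interior point k as the sensor, reconstruct the boundary
    from the two endpoints and k using `interp(p, q, N)`, and keep the k
    with the smallest total reconstruction error (Eq. (4))."""
    C = np.asarray(C_u, float)
    Nu = len(C)
    best_k, best_err = None, np.inf
    for k in range(1, Nu - 1):                 # interior points only
        left = interp(C[0], C[k], k + 1)       # reconstructs points 1..k (1-based)
        right = interp(C[k], C[-1], Nu - k)    # reconstructs points k..Nu
        recon = np.vstack([left, right[1:]])   # drop the duplicated point at k
        err = np.sum((C - recon) ** 2)         # TRE_U
        if err < best_err:
            best_k, best_err = k, err
    return best_k, best_err
```

With the linear interpolator, a boundary consisting of two straight segments is reconstructed exactly when the sensor sits at the corner, so the search returns that corner with zero error.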
3.4. Optimum locations of four sensors on the lower VT boundary

Consider the lower VT boundary $C_l = \{x_l(i), y_l(i), 1 \le i \le N_l\}$. Given the LL point $[x_l(1), y_l(1)]$ and the end point of the tongue contour $[x_l(N_l), y_l(N_l)]$, we obtain the optimal locations of four sensors (S3, S4, S5, S6) in an rtMRI video frame by minimizing the reconstruction error between the original and the interpolated lower VT boundary. These locations are denoted by $[x_l(k_p^\star), y_l(k_p^\star)]$, $1 \le p \le N_{opt}$, with $1 < k_p^\star < k_{p+1}^\star < N_l \;\forall p$, where $N_{opt} = 4$ denotes the number of optimal points. The total reconstruction error (TRE) of the lower VT boundary in a frame is defined as:

$$TRE_L = \sum_{i=1}^{N_l} \left[\left(x_l(i) - \hat{x}_l(i)\right)^2 + \left(y_l(i) - \hat{y}_l(i)\right)^2\right] \tag{8}$$

Similar to Eq. (5), the local mean squared error for the lower VT boundary between the $s$th and $e$th points is expressed as follows:

$$M_l^{Loc}(s, e) = \sum_{n=s+1}^{e-1} \left[\left(x_l(n) - \hat{x}_l(n)\right)^2 + \left(y_l(n) - \hat{y}_l(n)\right)^2\right] \tag{9}$$
where $[\hat{x}_l(n), \hat{y}_l(n)]$ is a point on the boundary reconstructed using either linear or data-driven interpolation, as described in Sections 3.1 and 3.2. Suppose the indices of the four points corresponding to the four sensors (S3, S4, S5, S6) are chosen to be $k_1$, $k_2$, $k_3$ and $k_4$, where $k_1 < k_2 < k_3 < k_4$. Then $TRE_L$ can be written in terms of $M_l^{Loc}$ as follows:

$$TRE_L = M_l^{Loc}(1, k_1) + \sum_{j=1}^{3} M_l^{Loc}(k_j, k_{j+1}) + M_l^{Loc}(k_4, N_l) \tag{10}$$

The indices of the optimal locations of the four sensors, $k_1^\star, k_2^\star, k_3^\star, k_4^\star$, are obtained by solving the following optimization:

$$\{k_1^\star, k_2^\star, k_3^\star, k_4^\star\} = \arg\min_{\substack{1 < k_1 < TBa, \\ TBa < k_2 < k_3 < k_4 < N_l}} TRE_L \tag{11}$$

A full search over four points to minimize $TRE_L$ would have an order complexity of $O(N_l^4)$. This is computationally prohibitive for $N_l = 120$, which is the average number of points marked by the annotators to depict the lower VT boundary. We design an algorithm following the principle of dynamic programming for an efficient solution of the optimization in Eq. (11). The steps of the algorithm are summarized in Algorithm 1.
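Since Algorithm 1 itself is not reproduced in this excerpt, the following is a sketch of how the dynamic-programming principle applies to the additive decomposition of Eq. (10): precompute the local errors $M^{Loc}(s, e)$ for all index pairs, then build up the best placement one sensor at a time. The function names are ours, and the TBa constraint of Eq. (11) is omitted for brevity; the published Algorithm 1 may differ in such details:

```python
import numpy as np

def optimal_sensors_dp(C, P, local_err):
    """Select P interior indices k_1 < ... < k_P of the contour C minimizing
    M(0, k_1) + sum_j M(k_j, k_{j+1}) + M(k_P, N-1), with 0 and N-1 the fixed
    endpoints.  `local_err(s, e)` returns the segment reconstruction error."""
    N = len(C)
    # Precompute the local error of every ordered index pair (O(N^2) table).
    M = np.full((N, N), np.inf)
    for s in range(N):
        for e in range(s + 1, N):
            M[s, e] = local_err(s, e)
    # dp[p, e]: best cost up to index e with p sensors placed, the p-th at e;
    # back[p, e] remembers the previous chosen index for backtracking.
    dp = np.full((P + 1, N), np.inf)
    back = np.zeros((P + 1, N), dtype=int)
    dp[0, 0] = 0.0                       # the fixed starting endpoint (LL)
    for p in range(1, P + 1):
        for e in range(p, N - 1):        # sensors stay strictly interior
            costs = dp[p - 1, :e] + M[:e, e]
            back[p, e] = int(np.argmin(costs))
            dp[p, e] = costs[back[p, e]]
    # Close the boundary at the fixed end point (Tend) and backtrack.
    final = dp[P, 1:N - 1] + M[1:N - 1, N - 1]
    k = 1 + int(np.argmin(final))
    total = float(final[k - 1])
    sensors = [k]
    for p in range(P, 1, -1):
        sensors.append(int(back[p, sensors[-1]]))
    return sensors[::-1], total
```

The table fill costs $O(P N_l^2)$ plus the $O(N_l^2)$ local-error evaluations, instead of the $O(N_l^4)$ of the full search.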
The four optimal points thus obtained are declared as the optimal sensor locations: one between the LL and TBa, and the remaining three on the tongue. It should be noted that, in finding the optimal sensor location S3, we constrain the optimal sensor to lie between the LL and TBa. This is done to ensure that there are only three sensors on the tongue, as is typically done in an EMA recording. In fact, the sensor S3 placed on the LI is used for recording jaw movement. Since the teeth do not appear in rtMRI recordings, we constrain the location of S3 to be on the lower VT boundary segment joining the LL and TBa. By constraining S3 to lie between the LL and TBa in the optimization, we assume that the optimized sensor location would be ideal for recording the jaw motion.
4. Experiments and results
4.1. Experimental setup
The annotated rtMRI video frames from MRI-TIMIT form the basis for the experiments in this work. The number of points marked by the annotators on the upper VT boundary varies across frames; this is true for the lower VT boundary as well. This is mainly because the annotators are allowed to mark as many points as they find appropriate to depict the air-tissue boundary. It could also be due to the fact that the shapes of both the upper and lower VT boundaries change from one frame to the next, requiring a different number of points to depict them. The points marked by the annotators are also found to be unequally spaced along the trajectory of the boundary, i.e., the marked points are dense in some parts of the boundary and sparse in others, as seen in Fig. 1. In order
to consider all parts of the upper and lower VT boundaries equally for selecting optimal sensor locations, we resample
the upper VT boundary such that the points are equally spaced along the boundary and the number of points on the
upper VT boundary is fixed (NU) across all frames. This is similarly done for the lower VT boundary using a fixed (NL)
number of points. The resampling is done by finding equi-distant points on the boundary obtained by linear interpolation
of the annotated points following the steps outlined in Appendix B. If NU and NL are small, the resampled points may
not capture the actual shape of a boundary. The higher the values of NU and NL, the better the representation of the boundary shapes. However, increasing NU and NL arbitrarily may not improve the representation of the boundary, as the information about the boundary shape is limited by the spatial resolution of the points marked by the annotators. Therefore, we determine the values of NU and NL such that the average distance between two consecutive points after resampling matches the average minimum distance (1.54 mm) between two consecutive points marked by the annotators (see Section 2 for details). This results in NU = 109 and NL = 113. With this fixed set of points, the smallest and largest distances between two consecutive resampled points are found to be 1.28 and 1.81 mm across all frames of all subjects.
The sensor locations are optimized separately on the upper and lower VT boundaries to achieve minimal reconstruction error using data-driven as well as linear interpolation, as outlined in Section 3. The optimization of the sensor locations is done separately for each of the four subjects (F1, F2, M1, M2). In particular, for the present study,
we find the optimum sensor locations in a five fold cross validation setup for each subject separately, where 1/5 of
all frames of a subject is used as the test set and the remaining are used as the training set in a round robin fashion.
Note that a training set is required only for the data-driven interpolation and not for the linear interpolation. The proposed data-driven interpolation rests on the assumption that the shape of a selected boundary segment from the training set would be similar to that of the test boundary segment. This assumption, in turn, requires
that the test segment would be located spatially close to the selected training segment. However, because of the head
movement of the subject, two segments corresponding to similar vocal tract shapes (e.g., VT shapes for same pho-
neme) may not match spatially. In order to compensate for this spatial offset, we perform an affine transformation
(following the steps outlined in Appendix A) on the upper VT boundary of each frame such that the begin and end
points of every boundary are mapped to (0,0) and (1,0), respectively. The lower VT boundary is similarly transformed before it is used for finding the optimum sensor locations.
The lower VT boundary changes its shape from one frame to the next. Hence, it is a challenge to associate anatomically identical points on two contours from two different frames. For the same reason, the optimal sensor locations in two different frames cannot be directly associated with each other. In order to report the location of an optimal sensor across different frames, we compute different inter-sensor distances as well as distances from known anatomical points on the VT
boundary in each frame. Finally, we report the mean and standard deviation (SD) of these distances over all frames.
Describing the optimal sensor locations in this manner helps in identifying the position of the optimal points with
respect to the fixed anatomical points and other optimal sensor locations on the VT boundary in the mid-sagittal plane.
4.2. Results
The performance of the proposed algorithm for selecting optimal sensor locations is reported in terms of the root mean squared error (RMSE) between the original VT boundary and the boundary reconstructed from the optimized sensor locations, defined as follows:

$$RMSE = \sqrt{\frac{1}{N_T} \sum_{n=1}^{N_T} \left\| \xi[n] - \hat{\xi}[n] \right\|_2^2} \qquad (12)$$

where $\xi[n] = \left[ \xi_x[n], \xi_y[n] \right]^T$ and $\hat{\xi}[n] = \left[ \hat{\xi}_x[n], \hat{\xi}_y[n] \right]^T$ are the points on the original and reconstructed boundaries in each frame, respectively, and $N_T$ is the total number of points in a frame.
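For concreteness, Eq. (12) amounts to the following computation (a minimal sketch; the function name and the (N_T, 2) array layout are our own, not part of the paper):

```python
import numpy as np

def boundary_rmse(orig, recon):
    """RMSE of Eq. (12); orig and recon are (N_T, 2) arrays of [x, y] points."""
    # squared Euclidean norm of the per-point error, averaged over the frame
    sq_norms = np.sum((np.asarray(orig, float) - np.asarray(recon, float)) ** 2, axis=1)
    return float(np.sqrt(np.mean(sq_norms)))
```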
Fig. 3 shows the bar plots of the RMSE for each fold, separately for every subject (one per row) and for the upper and lower VT boundaries (one per column). The bar height indicates the RMSE averaged over all test frames and the error bar indicates the corresponding SD. For every fold, Fig. 3 reports the RMSE obtained using both linear (dark-
gray bars) and data-driven (light-gray bars) interpolation. From the figure, we can see that the data-driven interpolation works better than the linear interpolation. This is mainly because the data-driven interpolation makes use of the boundary shapes of the training frames while the linear interpolation does not. The RMSE of the upper VT boundary is found to be higher than that of the lower VT boundary, because four sensor locations are used to reconstruct the lower VT boundary while only one sensor is used for the upper VT boundary. When averaged over all subjects and folds,
we observe that the upper VT boundary is reconstructed with an RMSE of 1.33 mm while the lower VT boundary is
reconstructed with an RMSE of 0.36 mm using the optimized sensor locations and data-driven interpolation.
After obtaining the optimal points for all the frames in each sentence of a subject, the mean and SD of the distances among the optimal sensor locations and the anatomical landmarks are computed. Here, one pixel in rtMRI corresponds to a 2.9 mm × 2.9 mm area in the physical dimension. Tables 1 and 2 report the mean and SD of various
distances on the lower and upper VT boundaries, respectively. All the distances are reported in millimeters. In Table 1, d(m, n) represents the distance between the mth and nth sensor locations on the VT boundary. Apart from the
Fig. 3. RMSE for each fold computed by linear and data-driven interpolation.
sensors, anatomical landmarks such as LL, TBa and Tend are also used. In Table 2, the variables S7^b and S7^a represent the optimal location of sensor S7 when obtained before (toward UL) and after (away from UL) the VEL point, respectively. d(S7, VEL) denotes the distance of S7 from the velum tip, where the distance to an optimal location before and after VEL is taken as positive and negative, respectively.
It is clear from Table 1 that the average distance between S4 and TBa lies in the range of 18–22 mm, indicating the optimized S4 location to be approximately the tongue tip position. The distance between the S4 and S5 locations indicates that S5 has to be placed at a distance of about 37–39 mm from the tongue tip. Similarly, the distance between the S5 and S6 locations indicates that the optimal location of S6 is nearly 77–85 mm away from the tongue tip. The distances in Table 2 indicate that the optimal location of S7 lies around VEL. Among all subjects, the optimal location of S7 furthest from VEL occurs at 18.84 mm after and 35.07 mm before the VEL tip, respectively. This suggests that the optimal point on the upper VT boundary primarily tracks the velum movement. The d(S7, VEL) values in Table 2 suggest that the optimal location of S7 occurs before VEL in most of the frames.
In order to produce different sounds, the vocal tract creates a wide variety of shapes. The tongue plays a crucial role in creating these different shapes by forming constrictions in different directions, as shown in Fig. 4. Due to the time-varying nature of vocal tract profiles, given an Nth point in a frame, the corresponding anatomical location on the vocal tract boundary need not be the Nth point in another frame. In order to illustrate how the optimal sensor locations vary depending on the vocal tract shapes for different phonemes, we choose four phonemes (a vowel, the voiced plosive /d/, a fricative, and the lateral /l/) and show the optimal sensor locations (using data-driven interpolation) on the upper and lower VT boundaries for all four subjects in Fig. 5.
The first row in Fig. 5 depicts the VT configuration for the vowel phoneme. A typical VT configuration for this vowel consists of an open VT at the front, the tongue raised at the back, and a wide gap between the tongue and palate, as shown in the figure. From the quantal nature of speech (Stevens, 1972; 1989; 2002), it is known that a target sound can be produced with some degree of articulatory freedom, where the articulation strategies are guided by a few principles and are rather constrained. This can be observed in the VT shapes in the sense that, while there is a gross similarity across subjects, there are subject-specific variations as well. These could be due to the different contexts in which the vowel is spoken as well as the different articulation styles of the subjects. The optimal sensor locations are shown using black dots on the upper and lower VT boundaries. It is clear that the shape of the velum changes across subjects and hence the optimal location of S7 changes depending on the subject. For example, for F1 and M1, the optimized S7 is very close to the
Table 2
The mean and SD (in brackets) of the distances (in mm) between the optimal sensor
location and the VEL on the upper VT boundary. FeAvg, MAvg and Average indicate
the distances averaged across females, males, and all subjects, respectively.
Subject    d(S7^a, VEL)     d(S7^b, VEL)     d(S7, VEL)
F1 17.09(15.97) 25.83(17.41) 13.70(25.76)
F2 8.72 (10.17) 35.07(21.05) 32.12(23.25)
FeAvg 15.60 (15.41) 30.83(19.99) 22.46(26.25)
M1 18.84 (12.77) 28.96 (17.88) 20.11(25.22)
M2 15.80 (14.30) 28.86(18.06) 20.89(24.43)
MAvg 17.47 (13.53) 28.91(17.95) 20.47(24.85)
Average 16.59 (14.46) 29.82(18.97) 21.41(25.54)
Table 1
The mean and SD (in brackets) of the distances (in mm) among different optimal sensor locations on
lower VT boundary. FeAvg, MAvg and Average indicate the distances averaged across females, males,
and all subjects, respectively.
Subject d(LL, S3) d(S3, TBa) d(TBa, S4) d(S4, S5) d(S5, S6) d(S6, Tend)
F1 26.41 (7.05) 10.37 (7.19) 21.24 (11.93) 37.67 (10.88) 40.15 (11.04) 35.11 (10.95)
F2 23.56 (7.18) 7.49 (6.82) 19.09 (10.12) 38.68 (10.7) 39.94 (11.3) 43.44 (11.05)
FeAvg 24.98 (7.46) 8.93 (7.41) 20.16 (11.48) 38.17 (11.08) 40.04 (11.50) 39.27 (12.13)
M1 26.56 (7.76) 10.91 (7.83) 18.25 (10.81) 37.68 (10.93) 42.99 (11.47) 40.16 (11.14)
M2 28.97 (8.35) 10.98 (8.42) 21.16 (11.22) 38.77 (12.46) 46.19 (12.10) 43.78 (13.51)
MAvg 27.76 (8.41) 10.94 (8.44) 19.70 (11.42) 38.22 (11.90) 44.59 (12.28) 41.97 (12.63)
Average 26.37 (8.08) 9.93 (8.03) 19.93 (11.45) 38.2 (11.52) 42.31 (12.11) 40.66 (12.48)
Fig. 4. Different vocal tract profiles due to constrictions created by the tongue in different directions.
VEL tip, while for F2 and M2 the optimal location is away from the VEL tip. The optimal location of S3 is close to the TBa for all subjects. The optimal location of S4 appears to coincide with TT, and S5 and S6 are optimally placed to capture the tongue shape well. The second row in Fig. 5 depicts the vocal tract configuration for the voiced plosive consonant phoneme /d/, for which a slightly open vocal tract at the front can be seen along with the constriction created by the raised tongue against the palate. It is interesting to observe that the optimal location of S4 occurs exactly at the point of constriction. This could be due to the high curvature of the tongue near the constriction, requiring an optimal point for the best reconstruction of the tongue shape. The third row of images in Fig. 5 depicts the vocal tract configuration for the fricative consonant phoneme, highlighting the tongue constriction against the palate. Unlike the optimal locations for the phoneme /d/, the optimal locations of the sensors on the tongue do not coincide with the constriction. This is because the shape of the tongue during the fricative is different from that during /d/; in particular, the curvature of the tongue near the constriction for the fricative is lower than that for /d/. The fourth row in Fig. 5 depicts the vocal tract configuration for the lateral alveolar approximant /l/. The tongue tip holds its contact with the palate for producing /l/. Unlike /d/, the optimal location of S4 does not occur at the constriction point for all subjects. It is clear that the shape of the tongue
Fig. 5. The vocal tract profiles and optimal sensor locations for the four phonemes for all the speakers. Optimal locations of S1 and S2 are not shown since they are fixed at UL and LL, which are the points at which the upper and lower VT boundaries begin, respectively.
varies across subjects for the sound /l/. The curvature of the tongue near the constriction is high for subjects F1 and F2,
while that is not so for M1 and M2. Interestingly, the optimal location of S4 occurs exactly at the point of constriction
for F1 and F2 while that does not happen for M1 and M2. These illustrations show that the optimal sensor locations
vary according to the uttered sound and the speaker’s VT morphology and articulation.
4.3. Discussions
The optimal sensor location is computed in each rtMRI video frame separately. The optimal locations are found to vary across frames within every utterance. However, in practice, it is not feasible to change the sensor location on a frame-by-frame basis. Hence, we examine the quality of VT boundary reconstruction with fixed sensor locations for each subject separately, based on the average distances reported in Table 1. For example, for subject F1, S3 is placed at a distance of 26.41 mm from LL, and S4, S5, and S6 are placed at distances of 21.24, 37.67, and 40.15 mm from TBa, S4, and S5, respectively (as per the first row in Table 1). Similarly, for F1, S7 is placed at a distance of 13.70 mm from VEL toward the UL. Using these fixed locations in each frame, the RMSE values of the reconstructed VT boundaries (over all frames of each subject) are reported in Table 3 under the sub-column titled 'subject dependent' under
the column titled ‘frame independent’. The RMSE values under the column titled ‘frame dependent’ correspond to the
reconstructed boundaries using optimized sensor location separately in each frame. This is identical to the average per-
formance across five folds shown in Fig. 3. It is clear that the RMSE, averaged across all subjects, increases by 0.26 and 0.34 mm (absolute) for the lower and upper VT, respectively, when a fixed set of sensor locations is used in all frames compared to frame-specific optimized sensor locations. We also report the RMSE when sensors are placed using inter-
sensor distances averaged across subjects within and across genders as indicated by ‘gender specific’ and ‘subject
independent’ sub-columns in Table 3. For these purpose, we use the inter-sensor distances following the third, sixth
and seventh rows of Tables 1 and 2. For example, for subject independent evaluation, this results in a distance of
26.37 mm between LL and S3. S4, S5 and S6 are placed at a distance of 19.93, 38.22, and 42.31 mm from TBa, S4 and S5; respectively (as seen in the seventh row of Table 1). S7 is placed at a distance of 21.41 mm from the VEL tip toward
the UL (as per the seventh row in Table 2). It is clear that the RMSE increases further when the sensors are placed in
gender specific as well as subject independent manner. This is mainly due to the fact that the VT morphology changes
across subjects and an average location across multiple subjects may not work well for individual ones. Although the
morphologies of the male and female subjects are different, we do not find any significant differences between the RMSE values for male and female subjects when sensors are placed in a gender-specific manner. However, the average RMSE values for the male subjects are higher than those of the female subjects by 0.02 mm in the lower VT and 0.2–0.3 mm in the upper VT.
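As an illustration of this frame-independent placement, average arc-length distances (such as those in Table 1) can be mapped to point indices on a resampled boundary contour; this sketch and its function name are ours, not part of the paper's pipeline:

```python
import numpy as np

def place_at_arclength(points, cumulative_mm):
    """Return indices of the contour points nearest to the given cumulative
    arc-length distances (in mm) measured from the contour's start point."""
    points = np.asarray(points, dtype=float)
    # cumulative arc length at every contour point
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)
    arc = np.concatenate([[0.0], np.cumsum(seg)])
    return [int(np.argmin(np.abs(arc - d))) for d in cumulative_mm]
```

For a subject-dependent placement one would pass the cumulative sums of the per-segment averages (e.g., for F1, 26.41 mm from LL for S3, then the TBa offset plus 21.24, 37.67, and 40.15 mm for S4–S6).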
In practice, it is challenging to place the sensors accurately enough to match the average optimal locations obtained from the optimization presented in this work. This could be due to several reasons, including the degree of co-operation from the subject, the difficulty of reaching critical locations in the vocal tract without causing much discomfort to the subject, the mismatch between the plane of sensor placement and the mid-sagittal plane as observed in rtMRI, the viscosity of the glue, the varying degree of salivation across people, and the placement of the wires. Placing EMA sensors on the tongue of a subject, in general, causes discomfort during speaking with wires in the mouth. In particular, when a sensor is placed near the tongue dorsum or behind it, it could cause a gag reflex, resulting in discomfort to the subject. Typically, the EMA sensors are placed on the tongue in the mid-sagittal plane based on visual inspection. This could cause a wrong estimate of the mid-sagittal
Table 3
The mean and SD (in brackets) of the RMSE (in mm) of the reconstructed VT boundaries when the sensors are placed in a frame-specific as well as a frame-independent manner. FeAvg, MAvg and Average indicate the RMSE averaged across females, males, and all subjects, respectively.
Subject   Frame dependent            Frame independent
          Subject dependent          Subject dependent          Gender specific            Subject independent
          Lower VT    Upper VT       Lower VT    Upper VT       Lower VT    Upper VT       Lower VT    Upper VT
F1 0.35 (0.07) 1.48 (0.47) 0.57 (0.24) 1.86 (0.52) 0.56 (0.25) 1.86 (0.55) 0.56 (0.23) 1.88 (0.51)
F2 0.39 (0.07) 1.41 (0.32) 0.67 (0.25) 1.82 (0.51) 0.69 (0.27) 1.82 (0.53) 0.69 (0.26) 1.81 (0.51)
FeAvg 0.37 (0.07) 1.44 (0.41) 0.62 (0.25) 1.83 (0.52) 0.63 (0.27) 1.84 (0.54) 0.62 (0.25) 1.87 (0.52)
M1 0.33 (0.06) 1.34 (0.31) 0.57 (0.21) 1.68 (0.52) 0.57 (0.20) 1.70 (0.53) 0.57 (0.20) 1.70 (0.52)
M2 0.37 (0.09) 1.07 (0.26) 0.66 (0.48) 1.35 (0.39) 0.66 (0.49) 1.34 (0.41) 0.67 (0.49) 1.36 (0.34)
MAvg 0.35 (0.08) 1.22 (0.31) 0.61 (0.36) 1.53 (0.49) 0.61 (0.37) 1.53 (0.51) 0.61 (0.37) 1.54 (0.49)
Average 0.36 (0.08) 1.33 (0.38) 0.62 (0.32) 1.67 (0.53) 0.62 (0.32) 1.68 (0.54) 0.62 (0.32) 1.69 (0.53)
plane, particularly due to tongue twitching and the manner in which the sensor is placed. Salivation often results in detachment of the sensor, which exacerbates the problem. Also, due to the invasive nature of the EMA recording, more sensors result in more discomfort to the subject during speaking. In this work, we did not include a factor for the subjective discomfort level in finding the optimal sensor locations. According to the proposed optimization, as the number of sensors increases, the reconstruction error decreases. However, considering the invasive nature of the EMA recording, the choice of the number of sensors should be determined by jointly considering the objective metric (RMSE) as well as the subjective metric (discomfort in speaking). Another practical constraint in determining the optimal sensor locations would be the required minimum distance between two sensors. For example, it is recommended that two sensors in an EMA recording be placed at a minimum distance of 8 mm to avoid inter-sensor interference (AG500, 2017). Such a constraint is critical when one plans to find the optimal locations of a relatively large number of sensors.
5. Conclusions
In this work, we propose an algorithm for finding optimal sensor locations for EMA recording by formulating it as a problem of optimal point selection on the air-tissue boundaries for minimizing the reconstruction error in the rtMRI video frames. Air-tissue boundaries are reconstructed using two types of interpolation functions, namely linear and data-driven. We have considered four different speakers to examine how the algorithm performs in predicting optimal sensor locations in VTs with varying morphology and articulation styles. We have considered rtMRI frames covering different vocal tract shapes corresponding to most of the phonemes of American English. The RMSE of the reconstructed boundary has a range of 0.33–0.39 mm and 1.07–1.48 mm when optimal sensor locations are used for reconstruction in the lower and upper VT, respectively. When averaged over all four subjects, the proposed data-driven interpolation reveals that, for minimizing the reconstruction error of the lower VT boundary, one sensor should be placed at the lower incisor at a distance of 26.37(±8.08) mm from the lower lip and three sensors on the tongue, at TT (19.93(±11.45) mm from the tongue base) and at 38.2(±11.52) mm and 80.51(±13.51) mm away from TT. Similarly, for minimal reconstruction error of the upper VT boundary, one sensor should be placed at a distance of 21.41(±25.54) mm from the velum tip. This leads to average reconstruction RMSEs of 0.62 mm and 1.69 mm for the lower and upper VT boundaries, respectively.
In the current work, we have optimized the sensor locations on the VT boundary based on each frame independently. Optimal sensor locations could also be found by minimizing the reconstruction error over all the frames jointly. It would also be interesting to observe how the locations of the optimal sensors vary when considering the multi-slice data in an rtMRI frame, which adds information from the coronal plane. Modeling pharyngeal constriction based on optimally placed sensors in the anterior tract is also a problem worth investigating. These are parts of our future work.
Acknowledgments
We thank all the annotators who participated in marking the air-tissue boundaries in the rtMRI video frames.
Appendix A. Converting a two-dimensional contour with start and end points [x1, y1] and [x2, y2], respectively, to a new contour with start and end points [x3, y3] and [x4, y4], respectively, using an affine transformation
Given two locations [x1, y1] and [x2, y2], a contour between them can be transformed to a new one starting from [x3, y3] and ending at [x4, y4] by performing an affine transformation

$$\begin{bmatrix} \tilde{x} \\ \tilde{y} \end{bmatrix} = \begin{bmatrix} a_1 & -a_2 \\ a_2 & a_1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} b_1 \\ b_2 \end{bmatrix},$$

where $[x, y]$ and $[\tilde{x}, \tilde{y}]$ denote points on the contour before and after transformation, respectively. The affine parameters $a_1, a_2, b_1, b_2$ are computed by solving the following equations, obtained by equating the relations of the start and end points before and after transformation:

$$\begin{bmatrix} x_1 & -y_1 & 1 & 0 \\ y_1 & x_1 & 0 & 1 \\ x_2 & -y_2 & 1 & 0 \\ y_2 & x_2 & 0 & 1 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ b_1 \\ b_2 \end{bmatrix} = \begin{bmatrix} x_3 \\ y_3 \\ x_4 \\ y_4 \end{bmatrix} \qquad (1)$$
The above equation is of the form $Ax = b$, and by performing elementary row operations on $A$, a row-equivalent form turns out to be

$$\begin{bmatrix} x_1 - x_2 & -(y_1 - y_2) & 0 & 0 \\ y_1 - y_2 & x_1 - x_2 & 0 & 0 \\ x_2 & -y_2 & 1 & 0 \\ y_2 & x_2 & 0 & 1 \end{bmatrix}.$$

Hence, the solution of the above equation exists if $(x_1 - x_2)^2 + (y_1 - y_2)^2 \neq 0$. In other words, the begin and end points of the contour should not be identical for the solution to exist. This is true for the optimization we have considered for optimal sensor placement.
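This transformation can be sketched as follows, solving the 4×4 system of Eq. (1) directly; the function name and default targets (the (0,0)/(1,0) normalization used in Section 4.1) are illustrative:

```python
import numpy as np

def affine_normalize(contour, target_start=(0.0, 0.0), target_end=(1.0, 0.0)):
    """Map a contour so that its endpoints land on target_start/target_end
    via the rotation-scaling-translation transform of Appendix A."""
    (x1, y1), (x2, y2) = contour[0], contour[-1]
    (x3, y3), (x4, y4) = target_start, target_end
    # system of Eq. (1): rows relate the start/end points before and after
    A = np.array([[x1, -y1, 1, 0],
                  [y1,  x1, 0, 1],
                  [x2, -y2, 1, 0],
                  [y2,  x2, 0, 1]], dtype=float)
    a1, a2, b1, b2 = np.linalg.solve(A, [x3, y3, x4, y4])
    R = np.array([[a1, -a2], [a2, a1]])
    return np.asarray(contour, float) @ R.T + np.array([b1, b2])
```

As the appendix notes, the system is solvable whenever the contour's endpoints are distinct.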
Appendix B. Resampling a two-dimensional contour of N (non-equi-spaced) points with Nd equi-spaced points

Suppose the contour P of length L contains N points, starting from p1 = [x1, y1] and ending at pN = [xN, yN]. The steps outlined in Algorithm 2 describe the resampling of the same contour with Nd points which are equally spaced at a distance d = L/Nd. The estimated ordered points on the contour are p1, p̂2, p̂3, ..., p̂(Nd−1), pN.
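A compact version of this resampling, using linear interpolation along the cumulative arc length as in Algorithm 2 (a sketch, not the algorithm verbatim; here the Nd points span the full length with spacing L/(Nd − 1) so that both endpoints are retained exactly):

```python
import numpy as np

def resample_contour(points, n_out):
    """Resample a polyline at n_out equi-spaced arc-length positions,
    keeping the original first and last points."""
    points = np.asarray(points, dtype=float)
    # cumulative arc length at each original point
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)
    arc = np.concatenate([[0.0], np.cumsum(seg)])
    targets = np.linspace(0.0, arc[-1], n_out)   # equi-spaced positions
    out = np.empty((n_out, 2))
    out[:, 0] = np.interp(targets, arc, points[:, 0])
    out[:, 1] = np.interp(targets, arc, points[:, 1])
    return out
```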
References
TaggedPCarstens Medizinelektronik Gmbh, AG500 Manual. 2017. http://www.ag500.de/manual/ag500/AG500_manual.pdf. (Accessed:15/2/2017).
TaggedPUCLA phonetics lab 2017. http://www.linguistics.ucla.edu/faciliti/facilities/physiology/ema.html#Where_Sensors. (Accessed:13/4/2017).
TaggedPBennett, J.W., Van Lieshout, P., Steele, C.M., 2007. Tongue control for speech and swallowing in healthy younger and older subjects. Int. J. Oro-
fac. Myol. 33, 5–18.
TaggedPBerger, A., 2002. How does it work? Magnetic resonance imaging. BMJ: Br. Med. J. 324 (7328), 35.
TaggedPBombien, L., Mooshammer, C., Hoole, P., Rathcke, T., K€uhnert, B., 2007. Articulatory strengthening in initial German /kl/ clusters under prosodic
variation. In: Proceedings of the Sixteenth International Congress of Phonetic Sciences. Saarbr€ucken, Germany, pp. 457–460.
TaggedPBotta, M., 2000. Second coordination sphere water molecules and relaxivity of gadolinium (III) complexes: implications for MRI contrast agents.
Eur. J. Inorg. Chem. 2000 (3), 399–407.
TaggedPBresch, E., Kim, Y.-C., Nayak, K., Byrd, D., Narayanan, S., 2008. Seeing speech: capturing vocal tract shaping using real-time magnetic reso-
nance imaging. IEEE Signal Process. Mag. 25 (3), 123–132.
TaggedPBrown, R.W., Cheng, Y.-C. N., Haacke, E.M., Thompson, M.R., Venkatesan, R., 2014. Magnetic Resonance Imaging: Physical Principles and
Sequence Design. John Wiley & Sons.
TaggedPCho, T., 2004. Prosodically conditioned strengthening and vowel-to-vowel coarticulation in English. J. Phon. 32 (2), 141–176.
TaggedPDemolin, D., Metens, T., Soquet, A., 1996. Three-dimensional measurement of the vocal tract by MRI. 1, 272–275.
TaggedPDemolin, D., Metens, T., Soquet, A., 2000. Real time MRI and articulatory coordinations in vowels. In: Proceedings of the Fifth Seminar on
Speech Production: Models and Data, pp. 86–93.
TaggedPDuran, D., Bruni, J., Dogil, G., 2013. Acoustic and articulatory information as joint factors coexisting in the context sequence model of speech pro-
duction. 19 (1), 060091.
TaggedPEngwall, O., 2003. Combining MRI, EMA and EPG measurements in a three-dimensional tongue model. Speech Commun. 41 (2), 303–329.
TaggedPFeng, Y., 2008. Dissociating the Role of auditory and Somatosensory Feedback in Speech Production: Sensorimotor Adaptation to Formant Shifts
and Articulatory Perturbations.
TaggedPFrankel, J., Richmond, K., King, S., Taylor, P., 2000. An automatic speech recognition system using neural networks and linear dynamic models to
recover and model articulatory traces. In: Proceedings of the International Conference on Spoken Language Processing (ICSLP), pp. 254–257.
TaggedPGhosh, P.K., Narayanan, S., 2010. A generalized smoothness criterion for acoustic-to-articulatory inversion. J. Acoust. Soc. Am. 128 (4), 2162–
2172.
TaggedPGhosh, P.K., Narayanan, S.S., 2011. A subject-independent acoustic-to-articulatory inversion. In: Proceedings of the IEEE International Confer-
ence on Acoustics, Speech and Signal Processing, pp. 4624–4627.
TaggedPHardcastle, W., Vaxelaire, B., Gibbon, F., Hoole, P., Nguyen, N., 1996. EMA/EPG study of lingual coarticulation in /kl/ clusters. In: Proceedings
of the Speech Production Seminar, pp. 53–56.
TaggedPHoole, P., Gfoerer, S., 1990. Electromagnetic articulography as a tool in the study of lingual coarticulation. J. Acoust. Soc. Am. 87 (S1), S123.
TaggedPHoole, P., Nguyen, N., 1997. Electromagnetic articulography in coarticulation research. Forschungsberichte des Instituts f€ur Phonetik und
Sprachliche Kommunikation der Universit€at M€unchen 35, 177–184.
TaggedPHoole, P., Nguyen-Trong, N., Hardcastle, W., 1993. A comparative investigation of coarticulation in fricatives: electropalatographic, electromag-
netic, and acoustic data. Lang. Speech 36 (2�3), 235–260.
TaggedPHuettel, S.A., Song, A.W., McCarthy, G., 2004. Functional magnetic resonance imaging. 1. Sinauer Associates, Sunderland.
TaggedPKatz, W., Machetanz, J., Orth, U., Sch€onle, P., 1990. A kinematic analysis of anticipatory coarticulation in the speech of anterior aphasic subjects
using electromagnetic articulography. Brain Lang. 38 (4), 555–575.
TaggedPKim, J., Lammert, A.C., Ghosh, P.K., Narayanan, S.S., 2014. Co-registration of speech production datasets from electromagnetic articulography
and real-time magnetic resonance imaging. J. Acoust. Soc. Am. 135 (2), EL115–EL121.
TaggedPKing, S., Wrench, A., 1999. Dynamical system modelling of articulator movements. In: Proceedings of the International Congress of Phonetic
Sciences, pp. 2259–2262.
TaggedPKoos, B., Horn, H., Schaupp, E., Axmann, D., Berneburg, M., 2013. Lip and tongue movements during phonetic sequences: analysis and definition
of normal values. Eur. J. Orthod. 35 (1), 51–58.
TaggedPKroos, C., 2008. Measurement accuracy in 3D electromagnetic articulography (Carstens AG500). In: Proceedings of the Eight International
Seminar on Speech Production, pp. 61–64.
TaggedPKroos, C., 2012. Evaluation of the measurement precision in three-dimensional electromagnetic articulography (Carstens AG500). J. Phon. 40 (3),
453–465.
TaggedPLadefoged, P., Harshman, R., Goldstein, L., Rice, L., 1978. Generating vocal tract shapes from formant frequencies. J. Acoust. Soc. Am. 64 (4),
1027–1035.
TaggedPLing, Z.-H., Richmond, K., Yamagishi, J., 2010. HMM-based text-to-articulatory-movement prediction and analysis of critical articulators. In:
Proceedings of the InterspeecH, pp. 2194–2197.
TaggedPLingala, S.G., Zhu, Y., Kim, Y.-C., Toutios, A., Narayanan, S., Nayak, K.S., 2017. A fast and flexible MRI system for the study of dynamic vocal
tract shaping. Magn. Reson. Med. 77 (1), 112–125.
TaggedPLiu, F.-H., 1994. Environmental Adaptation for Robust Speech Recognition. Carnegie Mellon University, Pittsburgh (Ph.D. thesis).
TaggedPMaassen, B., Kent, R., Peters, H., 2007. Speech Motor Control: In Normal and Disordered Speech. Oxford University Press.
TaggedPMaurer, D., Gr€one, B., Landis, T., Hoch, G., Sch€onle, P., 1993. Re-examination of the relation between the vocal tract and the vowel sound with
electromagnetic articulography (EMA) in vocalizations. Clin. Linguist. Phon. 7 (2), 129–143.
A.K. Pattem et al. / Computer Speech & Language 47 (2018) 157�174 173
McClean, M.D., Runyan, C.M., 2000. Variations in the relative speeds of orofacial structures with stuttering severity. J. Speech Lang. Hear. Res. 43 (6), 1524–1531.
Mooshammer, C., Hoole, P., 1993. Articulation and coarticulation in velar consonants. Forschungsberichte-Institut für Phonetik und Sprachliche Kommunikation der Universität München 31, 249–262.
Mooshammer, C., Schiller, N.O., 1996. Coarticulatory effects on kinematic parameters of rhotics in German. In: Proceedings of the First ESCA Tutorial and Research Workshop on Speech Production Modeling: From Control Strategies to Acoustics. Autrans, pp. 25–28.
Mücke, D., Nam, H., Hermes, A., Goldstein, L., 2012. Coupling of tone and constriction gestures in pitch accents. In: Hoole, P. (Ed.), Consonant Clusters and Structural Complexity. Mouton de Gruyter, Berlin, pp. 205–230.
Namasivayam, A.K., Van Lieshout, P.H.H.M., 2001. Compensation and adaptation to static perturbations in people who stutter. In: Speech Motor Control in Normal and Disordered Speech: 4th International Speech Motor Conference. Nijmegen, Netherlands, pp. 253–257.
Namasivayam, A.K., Van Lieshout, P., 2008. Investigating speech motor practice and learning in people who stutter. J. Fluen. Disord. 33 (1), 32–51.
Narayanan, S., Bresch, E., Ghosh, P.K., Goldstein, L., Katsamanis, A., Kim, Y., Lammert, A.C., Proctor, M.I., Ramanarayanan, V., Zhu, Y., 2011. A multimodal real-time MRI articulatory corpus for speech research. In: Proceedings of the Interspeech, pp. 837–840.
Narayanan, S., Nayak, K., Lee, S., Sethy, A., Byrd, D., 2004. An approach to real-time magnetic resonance imaging for speech production. J. Acoust. Soc. Am. 115 (4), 1771–1776.
Narayanan, S., et al., 2014. Real-time magnetic resonance imaging and electromagnetic articulography database for speech production research (TC). J. Acoust. Soc. Am. 136 (3), 1307–1311.
Ouni, S., Laprie, Y., 2009. Studying pharyngealization using an articulograph. In: Proceedings of the International Workshop on Pharyngeals and Pharyngealisation. Newcastle.
Parthasarathy, V., Prince, J.L., Stone, M., Murano, E.Z., NessAiver, M., 2007. Measuring tongue motion from tagged cine-MRI using harmonic phase (HARP) processing. J. Acoust. Soc. Am. 121 (1), 491–504.
Payan, Y., Perrier, P., 1997. Synthesis of V-V sequences with a 2D biomechanical tongue model controlled by the equilibrium point hypothesis. Speech Commun. 22 (2), 185–205.
Perkell, J.S., Cohen, M.H., Svirsky, M.A., Matthies, M.L., Garabieta, I., Jackson, M.T.T., 1992. Electromagnetic midsagittal articulometer systems for transducing speech articulatory movements. J. Acoust. Soc. Am. 92 (6), 3078–3096.
Peters, H.F.M., Hulstijn, W., Van Lieshout, P., 2000. Recent developments in speech motor research into stuttering. Folia Phoniatrica et Logopaedica 52 (1–3), 103–119.
Recasens, D., 2002. An EMA study of VCV coarticulatory direction. J. Acoust. Soc. Am. 111 (6), 2828–2841.
Richardson, M., Bilmes, J., Diorio, C., 2003. Hidden-articulator Markov models for speech recognition. Speech Commun. 41 (2), 511–529.
Rubin, P., Vatikiotis-Bateson, E., 1998. Measuring and modeling speech production. In: Animal Acoustic Communication. Springer Berlin Heidelberg, pp. 251–290.
Rudzicz, F., Namasivayam, A.K., Wolff, T., 2012. The TORGO database of acoustic and articulatory speech from speakers with dysarthria. Lang. Res. Eval. 46 (4), 523–541.
Schulz, G., Sulc, S., Leon, S., Gilligan, G., 2000. Speech motor learning in Parkinson disease. J. Med. Speech Lang. Pathol. 8 (4), 243–247.
Serrurier, A., Barney, A., Badin, P., Boë, L.-J., Savariaux, C., 2008. Comparative articulatory modelling of the tongue in speech and feeding. In: Proceedings of the International Seminar on Speech Production (ISSP).
Slørdahl, S.A., Bjærum, S., Amundsen, B.H., Støylen, A., Heimdal, A., Rabben, S.I., Torp, H., 2001. High frame rate strain rate imaging of the interventricular septum in healthy subjects. Eur. J. Ultrasound 14 (2), 149–155.
Steele, C.M., Van Lieshout, P., 2004. Use of electromagnetic midsagittal articulography in the study of swallowing. J. Speech Lang. Hear. Res. 47 (2), 342–352.
Steele, C.M., Van Lieshout, P., 2005. Does barium influence tongue behaviors during swallowing? Am. J. Speech Lang. Pathol. 14 (1), 27–39.
Steele, C.M., Van Lieshout, P., 2009. Tongue movements during water swallowing in healthy young and older adults. J. Speech Lang. Hear. Res. 52 (5), 1255–1267.
Steiner, I., Richmond, K., Ouni, S., 2013. Speech animation using electromagnetic articulography as motion capture data. In: Proceedings of the Twelfth International Conference on Auditory-Visual Speech Processing. France, pp. 55–60.
Stella, M., Bernardini, P., Sigona, F., Stella, A., Grimaldi, M., Gili Fivela, B., 2012. Numerical instabilities and three-dimensional electromagnetic articulography. J. Acoust. Soc. Am. 132 (6), 3941–3949.
Stella, M., Stella, A., Sigona, F., Bernardini, P., Grimaldi, M., Fivela, B.G., 2013. Electromagnetic articulography with AG500 and AG501. In: Proceedings of the Interspeech, pp. 1316–1320.
Stevens, K.N., 1972. The quantal nature of speech: evidence from articulatory-acoustic data. In: David, E.E., Denes, P.B. (Eds.), Human Communication: A Unified View. McGraw-Hill, New York, pp. 51–56.
Stevens, K.N., 1989. On the quantal nature of speech. J. Phon. 17 (1), 3–45.
Stevens, K.N., 2002. Toward a model for lexical access based on acoustic landmarks and distinctive features. J. Acoust. Soc. Am. 111 (4), 1872–1891.
Stone, M., Lundberg, A., 1996. Three-dimensional tongue surface shapes of English consonants and vowels. J. Acoust. Soc. Am. 99 (6), 3728–3737.
Toda, T., Black, A.W., Tokuda, K., 2004. Acoustic-to-articulatory inversion mapping with Gaussian mixture model. In: Proceedings of the Interspeech, pp. 1129–1132.
Toda, T., Black, A.W., Tokuda, K., 2004. Mapping from articulatory movements to vocal tract spectrum with Gaussian mixture model for articulatory speech synthesis. In: Proceedings of the Fifth ISCA Speech Synthesis Workshop. Pittsburgh, pp. 31–36.
Toutios, A., Margaritis, K., 2003. Acoustic-to-articulatory inversion of speech: a review. In: Proceedings of the 12th International TAINN. https://pdfs.semanticscholar.org/c756/3df3ecb34774f661d6681263874353d58119.pdf.
Toutios, A., Ouni, S., Laprie, Y., 2011. Estimating the control parameters of an articulatory model from electromagnetic articulograph data. J. Acoust. Soc. Am. 129 (5), 3245–3257.
Uchida, H., Wakamiya, K., Kaburagi, T., 2016. Improvement of measurement accuracy for the three-dimensional electromagnetic articulograph by optimizing the alignment of the transmitter coils. Acoust. Sci. Technol. 37 (3), 106–114.
Uria, B., Renals, S., Richmond, K., 2011. A deep neural network for acoustic-articulatory speech inversion. In: Proceedings of the NIPS Workshop on Deep Learning and Unsupervised Feature Learning.
Van Lieshout, P., 2001. Coupling dynamics of motion primitives in speech movements and its potential relevance for fluency. Soc. Chaos Theory Psychol. Life Sci. Newslett. 8 (4), 18.
Van Lieshout, P., 2007. Dynamical systems theory and its application in speech. In: Speech Motor Control in Normal and Disordered Speech. Oxford University Press, chapter 3, pp. 51–82.
Van Lieshout, P., Bose, A., Square, P.A., Steele, C.M., 2007. Speech motor control in fluent and dysfluent speech production of an individual with apraxia of speech and Broca's aphasia. Clin. Linguist. Phon. 21 (3), 159–188.
Van Lieshout, P., Rutjens, C., Spauwen, P., 2002. The dynamics of interlip coupling in speakers with a repaired unilateral cleft-lip history. J. Speech Lang. Hear. Res. 45 (1), 5–19.
Wang, Y.K., Nash, M.P., Pullan, A.J., Kieser, J.A., Röhrle, O., 2013. Model-based identification of motion sensor placement for tracking retraction and elongation of the tongue. Biomech. Model. Mechanobiol. 12 (2), 383–399.
Watkin, K.L., Rubin, J.M., 1989. Pseudo-three-dimensional reconstruction of ultrasonic images of the tongue. J. Acoust. Soc. Am. 85 (1), 496–499.
West, P., 2000. Long-distance coarticulatory effects of British English /l/ and /r/: an EMA, EPG and acoustic study. In: Proceedings of the Fifth Seminar on Speech Production: Models and Data. Kloster Seeon, Bavaria, Germany, pp. 105–108.
Westbury, J., 1994. X-ray Microbeam Speech Production Database Users Handbook. Madison.
Westbury, J., Milenkovic, P., Weismer, G., Kent, R., 1990. X-ray microbeam speech production database. J. Acoust. Soc. Am. 88 (S1), S56.
Wong, M.N., Murdoch, B.E., Whelan, B.-M., 2011. Lingual kinematics in dysarthric and nondysarthric speakers with Parkinson's disease. Parkinsons Dis. 2011, 352838, 8 pages. doi:10.4061/2011/352838.
Wrench, A., Richmond, K., 2000. Continuous speech recognition using articulatory data. In: Proceedings of the International Conference on Spoken Language Processing, pp. 145–148.
Wrench, A.A., 2000. A multichannel articulatory database and its application for automatic speech recognition. In: Proceedings of the Fifth Seminar on Speech Production.
Yunusova, Y., Green, J.R., Mefferd, A., 2009. Accuracy assessment for AG500, electromagnetic articulograph. J. Speech Lang. Hear. Res. 52 (2), 547–555.
Zhang, L., Renals, S., 2008. Acoustic-articulatory modeling with the trajectory HMM. IEEE Signal Process. Lett. 15, 245–248.