Available online at www.sciencedirect.com
Computer Speech & Language 47 (2018) 157�174
www.elsevier.com/locate/csl
Optimal sensor placement in electromagnetic articulography recording for speech production study
Ashok Kumar Pattem^a, Aravind Illa^{*,a}, Amber Afshan^b, Prasanta Kumar Ghosh^a
^a Department of Electrical Engineering, Indian Institute of Science, Bangalore, Karnataka 560012, India
^b Department of Electrical Engineering, University of California, 420 Westwood Plaza, Los Angeles, CA 90095, USA
Received 27 April 2017; received in revised form 13 July 2017; accepted 26 July 2017
Available online 27 July 2017
Abstract
Electromagnetic articulography (EMA) is a widely used technological solution for measuring the articulatory movements relevant to speech production research. EMA typically tracks articulatory flesh points by placing sensors, often heuristically, on the key articulators, including the lips, jaw, tongue and velum, in the mid-sagittal plane. In this work, we address the problem of optimal placement of EMA sensors by posing it as the optimal selection of points for minimizing the reconstruction error of the air-tissue boundaries in the real-time magnetic resonance imaging (rtMRI) video frames of the vocal tract (VT) in the mid-sagittal plane. We propose an algorithm for optimal placement of EMA sensors using dynamic programming. Experiments are performed using rtMRI video frames for read speech from four subjects, with the upper and lower lips as two fixed points. One optimal sensor on the upper VT boundary is found to be at an average distance of 21.41 (±25.54) mm from the velum tip. Similarly, for the lower VT boundary, one optimal sensor is found at the lower incisor at a distance of 26.37 (±8.08) mm from the lower lip, and three optimal sensors on the tongue: at the tongue tip (19.93 (±11.45) mm from the tongue base) and at 38.2 (±11.52) mm and 80.51 (±13.51) mm away from the tongue tip.
© 2017 Elsevier Ltd. All rights reserved.
Keywords: Electromagnetic articulography; Sensor placement; Speech production
1. Introduction
Recording the dynamics of the speech articulators (e.g., lips, tongue, jaw, velum) is critical for the study of speech production (Rubin and Vatikiotis-Bateson, 1998). Articulatory movement data for speech production research are acquired using different modalities such as mid-sagittal X-ray diagrams (Ladefoged et al., 1978), X-ray microbeam imaging (XRMB) (Westbury et al., 1990), ultrasound (Watkin and Rubin, 1989), electropalatography (Stone and Lundberg, 1996), tagged MRI (Parthasarathy et al., 2007), electromagnetic articulography (EMA) (Maurer et al., 1993) and real-time magnetic resonance imaging (rtMRI) (Demolin et al., 2000; Narayanan et al., 2004). rtMRI provides a complete 2D mid-sagittal view of articulatory dynamics during read speech (Narayanan et al., 2014). Among the different modalities, only the MRI technique provides three-dimensional images of the vocal tract for sustained vowels (Demolin et al., 1996). The air-tissue boundaries from rtMRI images provide a time-varying
This paper has been recommended for acceptance by Prof. R. K. Moore.
* Corresponding author.
E-mail address: [email protected] (A. Illa).
http://dx.doi.org/10.1016/j.csl.2017.07.008
0885-2308/© 2017 Elsevier Ltd. All rights reserved.
description of the vocal tract shape in the mid-sagittal plane. However, the rtMRI data has a low temporal resolution (23.18 frames/s) (Narayanan et al., 2014). It also remains a challenge to record good quality speech from the subject during an rtMRI scan, due to the loud MRI scanner noise.
Unlike rtMRI, XRMB provides articulatory movement data at a rate of more than 100 Hz (Westbury, 1994). In spite of its high temporal resolution, the XRMB technique is limited in the sense that it does not provide a complete mid-sagittal view of articulatory dynamics, since only a few pellets placed sparsely on various articulators are tracked (Westbury et al., 1990). Ultrasound also provides a high temporal resolution (50 frames/s or more; Slørdahl et al., 2001) and good quality audio can be recorded simultaneously. But ultrasound images are noisy and capture only the first air-tissue boundary (Bresch et al., 2008). Hence, it is not possible to record the dynamics of the anterior tongue tip and lips in ultrasound imaging. On the other hand, EMA has a high temporal resolution (sampling rate of ~500 Hz). But it cannot capture the structure of the pharyngeal wall, unlike rtMRI recording. The EMA data provides the coordinates of sensors sparsely placed on different articulators. Another advantage of EMA recording is that good quality audio can be recorded in parallel. However, proper care has to be taken during EMA recording to minimize measurement errors. The accuracy of the articulatory movement measurements by EMA is affected by sensor failures, electromagnetic interference, sensors going out of the measurement region, and numerical instabilities (Yunusova et al., 2009; Stella et al., 2012). Attempts have been made to handle out-of-range issues (Kroos, 2008) and to improve the measurement accuracy of EMA (Kroos, 2012; Uchida et al., 2016). For acquiring articulatory movements during speech production, it has been claimed that the AG501 provides greater accuracy and is more user-friendly than the AG500 (Stella et al., 2013). It is thus apparent that different modalities capture different amounts of spatial and temporal information about articulatory movements (Bresch et al., 2008), depending on the imaging technique used or the placement of the sensors and pellets. In this work, we focus on the optimal placement of sensors in the mid-sagittal plane for EMA recording, such that it provides maximal information about the air-tissue boundaries as observed in rtMRI recording.
EMA data has been crucial for several speech production studies, analyses and models, including experimental phonetics, articulatory movement modeling (Perkell et al., 1992; King and Wrench, 1999), examining the variability of coarticulation (Cho, 2004; Bombien et al., 2007; Hardcastle et al., 1996; Recasens, 2002; Hoole et al., 1993; Hoole and Gfoerer, 1990; Hoole and Nguyen, 1997; Mooshammer and Hoole, 1993; Mooshammer and Schiller, 1996; Katz et al., 1990; West, 2000), and understanding the coupling dynamics (Van Lieshout, 2001; Van Lieshout et al., 2002) of motor primitives in speech movements in normal and disordered speech (Schulz et al., 2000; Maassen et al., 2007; Van Lieshout, 2007; Van Lieshout et al., 2007) as well as during stuttering (Peters et al., 2000; McClean and Runyan, 2000; Namasivayam and Van Lieshout, 2001; 2008) and swallowing (Steele and Van Lieshout, 2004; 2005; Bennett et al., 2007; Steele and Van Lieshout, 2009). EMA data of articulatory kinematics available through MOCHA-TIMIT (Wrench, 2000) and USC-TIMIT (Narayanan et al., 2014) are widely used for acoustic-articulatory modeling for speech recognition (Frankel et al., 2000; Wrench and Richmond, 2000; Richardson et al., 2003), text-to-articulatory-movement prediction and analysis of critical articulators (Zhang and Renals, 2008; Ling et al., 2010), mapping from articulatory movements to the vocal tract spectrum (Payan and Perrier, 1997; Toda et al., 2004b; Steiner et al., 2013), acoustic-to-articulatory inversion (Toutios and Margaritis, 2003; Toda et al., 2004a; Ghosh and Narayanan, 2010; Uria et al., 2011; Ghosh and Narayanan, 2011), and multimodal speech animation (Kim et al., 2014; Engwall, 2003).
Given these widespread uses of EMA data, it is important to develop a principled approach to the placement of sensors during EMA recording. Since EMA data provides the movement of only a few sparsely placed sensors, they must be placed optimally in order to capture maximal information about the articulatory dynamics. In most of the existing EMA recordings, the sensors are placed following heuristic rules. For example, for recording using the Carstens AG100 system, the suggested positions of three EMA sensors on the tongue are 1 cm from the tongue tip, the midpoint of the tongue body, and 4 cm from the tongue tip as the tongue dorsum (UCLA, 2017). The TORGO Database of Dysarthric Articulation (Rudzicz et al., 2012) was recorded from dysarthria patients. It consists of both acoustics and articulatory data from EMA and 3D reconstruction from binocular video sequences. The sensors were placed on the tongue at three different locations, namely the tongue tip at 1 cm, tongue middle at 4 cm, and tongue back at 6 cm behind the anatomical tongue tip. In another study of pharyngealization using an articulograph (Ouni and Laprie, 2009), EMA data was collected by placing four sensors at 1.6, 3.6, 5.2 and 7 cm away from the tongue tip. Mücke et al. (2012) collected articulatory data from German speakers and used only two sensors on the tongue, at 1 and 4 cm away from the tongue tip, called the tongue blade and tongue body,
respectively. The database collected by Wong et al. (2011) had the same sensor placement on the tongue as that of Mücke et al. (2012), for the study of lingual kinematics in dysarthric patients. For estimating the control parameters of an articulatory model, Toutios et al. (2011) used four sensors on the tongue, where one sensor was placed approximately on the tongue tip and the rest were 1.4, 3.1 and 5.7 cm away from the tongue tip. Duran et al. (2013) used articulatory data for developing a context sequence model for speech production. The three sensor locations on the tongue were as follows: one on the tip and the other two at 3 and 4 cm away from the tip. Serrurier et al. (2008) collected EMA data during speaking and feeding activities. The sensor locations on the tongue were 1, 4 and 6 cm from the tongue tip. Feng (2008), in his study, placed the first sensor at a distance of 1 cm from the tongue tip and two other sensors posteriorly along the midline with a spacing of approximately 1–1.5 cm. Koos et al. (2013) used three sensors: one on the tip and two others 2 and 4 cm behind the tongue tip. For the collection of the MOCHA database, Wrench (2000) used three sensors at 1, ~3–4 and ~5–7 cm from the tongue tip. It is clear that there is no uniform mechanism by which the sensors are placed on the tongue across different EMA-based studies. In fact, the sensor positions change across various databases.
Wang et al. (2013) provided a recommendation for the placement of the EMA sensors using a finite element model of the tongue. However, no such recommendations are available for the placement of sensors on articulators other than the tongue. In this work, we propose an optimization framework for determining the optimal sensor locations in the mid-sagittal plane for EMA recording. We aim to find the sensor locations so that maximal information about the vocal tract shape in the mid-sagittal plane is preserved. This is done by optimizing the locations of seven sensors such that the vocal tract (VT) air-tissue boundaries (referred to as VT boundaries), as observed in rtMRI recording, can be reconstructed with minimum error. For this purpose, we have used manually annotated air-tissue boundaries from rtMRI videos of four subjects. We begin with the description of the dataset (Section 2). In Section 3, we present an objective function for the optimal placement of EMA sensors and the steps to solve it. We present the optimized sensor locations and related discussions in Section 4. Section 5 summarizes the key findings and future work.
2. Dataset
The experiments in this work are performed using the MRI-TIMIT database (Narayanan et al., 2011), which is an excellent resource for the analysis and understanding of articulatory movements in read speech. MRI-TIMIT contains audio recordings synchronized with a sequence of rtMRI images of the mid-sagittal upper airway, acquired from two female (F1, F2) and two male (M1, M2) native American English speakers aged between 23 and 33 years. A detailed description of the working principle and acquisition of MRI is given by Brown et al. (2014). The upper airway is imaged at a rate of 23.18 frames/s with a resolution of 68 × 68 pixels (each pixel of size 2.9 × 2.9 mm). Audio is simultaneously recorded at a sampling frequency of 20 kHz while subjects are imaged, using a custom fiber-optic microphone noise-canceling system. The recording is performed while each speaker utters a set of 460 sentences. The total durations of the recordings are 38.19, 37.99, 39.05, and 38.07 min for the four subjects, F1, F2, M1, and M2, respectively.
Selection of the optimal sensor locations in this work requires manual annotation, which involves marking air-tissue boundaries manually on every rtMRI image. However, the total numbers of rtMRI frames in MRI-TIMIT are 53.1 × 10³, 52.8 × 10³, 54.3 × 10³ and 52.9 × 10³ for F1, F2, M1 and M2, respectively. Since manual annotation of all these frames is time-consuming, we select a subset of five sentences for each speaker. These five sentences are chosen based on their phonetic richness so that the vocal tract shapes for most of the phonemes are covered. For this purpose, we have chosen a set of 51 phonemes (Liu, 1994). The five sentences are selected using a forward sentence selection algorithm in which the entropy computed using the histogram of the phonemes is maximized. The forward sentence selection works in a greedy manner: a sentence is chosen in each step such that its phonemes, when considered along with the phonemes of the already chosen sentences, maximize the entropy. Following the forward sentence selection algorithm, we obtain the following five sentences from the MRI-TIMIT corpus for each subject: (1) She always jokes about too much garlic in his food, (2) There was a gigantic wasp next to Irving's big top hat, (3) Laugh, dance and sing, if fortune smiles upon you, (4) I'd ride the subway but I haven't enough change, (5) Eating spinach nightly increases strength miraculously.
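The greedy entropy-maximizing selection described above can be sketched as follows. This is a minimal illustration; the function and variable names are ours, not from the paper, and the actual study applied the procedure to phoneme transcriptions of the 460 MRI-TIMIT sentences:

```python
import math
from collections import Counter

def phoneme_entropy(counts):
    """Shannon entropy (bits) of a phoneme histogram."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)

def greedy_select(sentences, num_select=5):
    """Greedily pick sentences whose phonemes maximize the histogram entropy.

    `sentences` maps a sentence id to its list of phoneme labels.
    """
    chosen, pool = [], dict(sentences)
    accumulated = Counter()
    for _ in range(num_select):
        best_id, best_h = None, -1.0
        for sid, phones in pool.items():
            # Entropy if this sentence were added to the already chosen set.
            h = phoneme_entropy(accumulated + Counter(phones))
            if h > best_h:
                best_id, best_h = sid, h
        chosen.append(best_id)
        accumulated += Counter(pool.pop(best_id))
    return chosen
```

Each step is locally optimal; as with any greedy selection, the final set of five sentences is not guaranteed to be the global entropy maximizer.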
These five sentences contain all 51 phonemes except five, which lie among the seven least frequent phonemes in the MRI-TIMIT corpus. The number of rtMRI frames from these five sentences turns out to be 474, 430, 540 and 460 for F1, F2, M1 and M2, respectively.
Fig. 1. Sample annotation of an rtMRI video frame. (For interpretation of the references to color in this figure, the reader is referred to the web
version of this article.)
The air-tissue boundaries in each rtMRI image are manually traced. This involves tracing three contours which constitute the air-tissue boundaries near different anatomical structures in the mid-sagittal plane of the upper airway, namely, Contour1: upper lip - hard palate - velum; Contour2: jaw - lower lip - tongue - epiglottis - larynx; Contour3: pharyngeal wall - glottis. The tracing and annotation is done by five people in the age group of 21–25 years, who have prior knowledge of the anatomy of the upper airway. The annotation task for different sentences of different subjects is balanced across the five annotators. A Graphical User Interface (GUI) developed in MATLAB R2013 has been designed to help the annotators draw the three contours. For tracing each contour, the annotators are asked to mark points (by clicking) on the air-tissue boundary. Annotators are allowed to mark as many points on each contour as they feel appropriate to correctly depict the contour curvature. In addition to marking the three contours of interest, annotators are also asked to mark the locations of the upper lip (UL), lower lip (LL), velum (VEL) tip and tongue base (TBa) in each rtMRI frame. A sample annotated frame is shown in Fig. 1, where Contour1 is indicated by blue dots, Contour2 by black dots and Contour3 by red dots. The annotation of the air-tissue boundaries is done in a manner similar to that of Lingala et al. (2017). The locations of the UL, LL, VEL tip and tongue base as marked by the annotator are shown in Fig. 1 with yellow squares. A large number of points on a contour results in a smaller distance between two consecutive points, referred to as the inter-point distance. The number of points marked by the annotators, averaged over all frames of F1, F2, M1, and M2 separately, is (60, 58, 59, 62), (79, 64, 70, 78) and (49, 51, 40, 52) for Contour1, Contour2, and Contour3, respectively. These correspond to minimum inter-point distances (in mm) of (1.54, 2.60, 2.40, 2.75), (2.29, 3.14, 2.93, 2.85) and (1.64, 2.26, 2.84, 2.62) and average inter-point distances (in mm) of (5.45, 6.20, 6.00, 6.04), (5.80, 7.12, 7.25, 6.56) and (5.37, 5.79, 6.25, 5.52) when averaged over all frames of each subject. This indicates that annotators, on average, mark two consecutive points at a distance ranging from 1.85 to 2.5 pixels. The minimum inter-point spacing (1.54 mm) is found for Contour1 of F1.
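The inter-point statistics above are computed directly from the marked points of a contour; a minimal sketch (the function name is ours), with distances in the same units as the input coordinates:

```python
import numpy as np

def interpoint_stats(contour):
    """Minimum and average distance between consecutive annotated points
    of a contour given as an (N, 2) array of (x, y) coordinates."""
    d = np.linalg.norm(np.diff(np.asarray(contour, float), axis=0), axis=1)
    return d.min(), d.mean()
```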
Before the annotators started manual tracing, the details of the annotation using the GUI were explained to them. It is known that some anatomical structures could remain invisible due to the presence of thermal noise (Huettel et al., 2004) or the absence of sufficient hydrogen content (Botta, 2000; Berger, 2002). In such scenarios, annotators are asked to trace the contours to the best of their judgment in the rtMRI frame. Tracings from each annotator in each rtMRI frame have been cross-checked and corrected by another annotator to improve the quality of the annotation. It is found that, on average, the duration for completing the annotation of one sentence is 7–8 h (6–10 min per rtMRI frame). It is also found that the annotation time for F2 and M1 is higher compared to the other two subjects. This could be due to morphological differences across subjects.
3. Optimal sensor placement
Due to the invasive nature of the recording, flesh points of only a few articulators are tracked in EMA. For example, in most of the EMA recordings in the literature, sensors have been placed in the front part of the vocal tract, including
UL, LL, lower incisor (LI), tongue tip (TT), tongue body (TB) and tongue dorsum (TD) in the mid-sagittal plane. Typical locations of these sensors (S1, ..., S6) are shown in Fig. 2, where the upper and lower vocal tract boundaries are manually drawn (blue and green contours, respectively) on a randomly chosen video frame from the MRI-TIMIT corpus. In a few recordings (Wrench, 2000), a sensor is attached to the velum to track its movements (indicated by S7 in Fig. 2). However, no sensors are typically attached to the back part of the vocal tract (behind the tongue dorsum on the lower VT boundary and behind the velum on the upper VT boundary), mainly to avoid discomfort to the subject during speaking. Thus, in this work, we optimize the location of the sensors in the front part of the vocal tract in the mid-sagittal plane. The front part of the vocal tract is assumed to begin with the UL and LL as marked by the annotators. It is essential to record the UL and LL in order to track the opening of the vocal tract. Hence, we assume that two of the seven sensors should be at the UL and LL, which form the vocal tract opening, and, thus, we do not optimize the locations of UL (S1) and LL (S2). Having fixed two of the seven sensor locations, the optimal locations of the remaining sensors are determined such that they carry maximal information about the shape of the front part of the vocal tract in the mid-sagittal plane. The front part of the vocal tract in the mid-sagittal plane is defined using the upper and lower VT boundaries, shown as blue and green bold contours, respectively, in Fig. 2. The velum is the only moving part of the upper VT boundary; hence, we assume that it is essential to capture the vocal tract shape up to the end of the velum segment. Similarly, we assume that, using EMA sensor locations, the lower part of the vocal tract can be reconstructed at most up to the end of the tongue, since no sensors are typically placed beyond the tongue due to the gag reflex, and it is also difficult to glue a sensor to anything behind the soft palate. Hence, it would be difficult to capture vocal tract shape information near and beyond the epiglottis.
Thus, the goal of optimum sensor placement becomes finding:
1) The optimum location of one sensor (S7) on the upper VT boundary such that the VT boundary from the UL to the end of the velum segment (shown by the blue bold curve between two cyan boxes in Fig. 2) can be reconstructed with minimal error,
2) The optimum locations of four sensors (S3, S4, S5, S6) on the lower VT boundary such that the VT boundary from the LL to the end of the tongue (Tend) (shown by the green bold curve between two cyan boxes in Fig. 2) can be reconstructed with minimal error.
Fig. 2. Illustration of the sensor locations; cyan boxes indicate the endpoints of the VT boundaries. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
By ensuring a good quality reconstruction of the upper and lower VT boundaries, a good quality reconstruction of the vocal tract shape and area function in the front part of the vocal tract is also ensured. The optimum locations of the seven sensors, thus obtained, would capture the information needed to recover (interpolate) the missing outline of the VT boundaries in the front part of the vocal tract. The location of one sensor on the upper VT boundary in a frame is optimized separately from the locations of the four sensors on the lower VT boundary in the same frame. This is mainly because the sensor locations on the lower VT boundary do not directly provide any information about the upper VT boundary shape and vice versa. Suppose there are $N_u$ points on the upper VT boundary of a test frame denoted by
$C_u = \{x_u(i), y_u(i), 1 \le i \le N_u\}$. Similarly, there are $N_l$ points on the lower VT boundary denoted by $C_l = \{x_l(i), y_l(i), 1 \le i \le N_l\}$. The upper VT boundary contour of the $k$th ($1 \le k \le K$) training rtMRI frame is denoted by $C_u^k = \{x_u^k(i), y_u^k(i), 1 \le i \le N_u^k\}$, where $K$ is the total number of training frames. Similarly, $C_l^k = \{x_l^k(i), y_l^k(i), 1 \le i \le N_l^k\}$ denotes the lower VT boundary contour of the $k$th training rtMRI frame. Interpolation is a key step in reconstructing the VT boundaries from the sensor locations; it allows reconstructing the missing points between any two given points on a boundary. We consider two types of interpolation, namely linear interpolation and data-driven interpolation. These are described below before the optimum sensor localization algorithm is presented.
3.1. Linear interpolation

Consider the $i$th and $j$th points on the upper VT boundary of a test frame, i.e., $[x_u(i), y_u(i)]$ and $[x_u(j), y_u(j)]$. Suppose we need to interpolate $N$ equidistant points $\{\hat{x}_u(n), \hat{y}_u(n), 1 \le n \le N\}$ between these two points such that $[\hat{x}_u(1), \hat{y}_u(1)] = [x_u(i), y_u(i)]$ and $[\hat{x}_u(N), \hat{y}_u(N)] = [x_u(j), y_u(j)]$. For linear interpolation, all these points must lie on the line joining the $i$th and $j$th points. The equation of this line is given by

$$\hat{y}_u(n) = \frac{y_u(j) - y_u(i)}{x_u(j) - x_u(i)}\,\hat{x}_u(n) + \frac{y_u(i)\,x_u(j) - y_u(j)\,x_u(i)}{x_u(j) - x_u(i)}, \quad 1 \le n \le N. \tag{1}$$

For equi-spacing, the distance between two consecutive points $[\hat{x}_u(n), \hat{y}_u(n)]$ and $[\hat{x}_u(n+1), \hat{y}_u(n+1)]$ must be

$$\Delta = \frac{\sqrt{\left(x_u(i) - x_u(j)\right)^2 + \left(y_u(i) - y_u(j)\right)^2}}{N - 1}. \tag{2}$$

Thus, the $n$th equi-spaced point, which is at a distance of $n\Delta$ away from $[x_u(i), y_u(i)]$, can be found by solving Eq. (1) and the following equation:

$$n\Delta = \sqrt{\left(\hat{x}_u(n) - x_u(i)\right)^2 + \left(\hat{y}_u(n) - y_u(i)\right)^2}. \tag{3}$$

Since Eq. (3) is quadratic in $\hat{x}_u(n)$ and $\hat{y}_u(n)$, we obtain two solutions and keep the one which lies closer to $[x_u(j), y_u(j)]$. This, in turn, ensures that the solution lies on the line segment joining $[x_u(i), y_u(i)]$ and $[x_u(j), y_u(j)]$. For the lower VT boundary, $\hat{x}_l(n)$ and $\hat{y}_l(n)$ are obtained in a similar manner.
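Because the interpolated points are constrained to the line segment and are equi-spaced, the solution of Eqs. (1)-(3) can equivalently be written in parametric form; a minimal sketch (the function name is ours):

```python
import numpy as np

def linear_interp(p_i, p_j, N):
    """Return N equi-spaced points on the segment from p_i to p_j,
    with the first and last points coinciding with p_i and p_j
    (the closed-form solution of Eqs. (1)-(3), written parametrically)."""
    t = np.linspace(0.0, 1.0, N)[:, None]   # fraction of the total segment length
    return (1.0 - t) * np.asarray(p_i, float) + t * np.asarray(p_j, float)
```

For example, `linear_interp((0, 0), (3, 4), 5)` places five points along a segment of length 5, so consecutive points are 1.25 apart.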
As long as the segment of the VT boundary between two given points can be well approximated by a line segment, linear interpolation works well. However, if the segment of the boundary has a non-linear shape, linear interpolation is not effective. Typically, VT boundaries have non-linear segments. Hence, a data-driven interpolation using a set of training boundaries is proposed to overcome this limitation.
3.2. Data-driven interpolation

In data-driven interpolation, the segment between any two points on a test VT boundary is reconstructed by finding the best segment from the VT boundaries of the training set. The best segment is obtained by first finding two points on the training VT boundaries that are closest to the two test points in the Euclidean sense; the segment between the two closest training points is then used for reconstruction. Consider the task of interpolating $N$ equidistant points between two given points on the upper VT boundary, $[x_u(i), y_u(i)]$ and $[x_u(j), y_u(j)]$. For this purpose, the $K$ upper VT boundaries from the training set are used. The one among the $K$ training VT boundaries which has the closest proximity to the two given points is selected. This is obtained by finding two points in each training boundary, each point being closest to one of the given points. Let $[x_u^k(i'), y_u^k(i')]$ be the point of the $k$th training boundary closest to $[x_u(i), y_u(i)]$, with distance

$$D_i^k = \sqrt{\left(x_u(i) - x_u^k(i')\right)^2 + \left(y_u(i) - y_u^k(i')\right)^2}.$$

Similarly, for $[x_u(j), y_u(j)]$, $[x_u^k(j'), y_u^k(j')]$ is the closest point, with distance

$$D_j^k = \sqrt{\left(x_u(j) - x_u^k(j')\right)^2 + \left(y_u(j) - y_u^k(j')\right)^2}.$$

The best boundary from the training set is obtained as $k^\star = \arg\min_k \left(D_i^k + D_j^k\right)$.
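The selection of the best training boundary can be sketched as follows (an illustrative implementation under our own naming; the affine-transformation step of the paper's Appendix A is not shown):

```python
import numpy as np

def best_training_boundary(p_i, p_j, train_boundaries):
    """Pick the training boundary k* minimizing D_i^k + D_j^k, and return
    its index together with the indices i', j' of its closest points.

    `train_boundaries` is a list of (N_k, 2) arrays of boundary points."""
    best = None
    for k, C in enumerate(train_boundaries):
        C = np.asarray(C, float)
        d_i = np.linalg.norm(C - np.asarray(p_i, float), axis=1)  # distances to first test point
        d_j = np.linalg.norm(C - np.asarray(p_j, float), axis=1)  # distances to second test point
        cost = d_i.min() + d_j.min()                              # D_i^k + D_j^k
        if best is None or cost < best[0]:
            best = (cost, k, int(d_i.argmin()), int(d_j.argmin()))
    _, k_star, i_prime, j_prime = best
    return k_star, i_prime, j_prime
```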
The segment (comprising $N' \ne N$ points) of the chosen training boundary from $[x_u^{k^\star}(i'), y_u^{k^\star}(i')]$ to $[x_u^{k^\star}(j'), y_u^{k^\star}(j')]$ is used to interpolate $N$ points between $[x_u(i), y_u(i)]$ and $[x_u(j), y_u(j)]$ using an affine transformation followed by resampling. First, the affine transformation is applied on the boundary segment from $[x_u^{k^\star}(i'), y_u^{k^\star}(i')]$ to $[x_u^{k^\star}(j'), y_u^{k^\star}(j')]$, which results in $N'$ points (non-equi-spaced) between the two given test points $[x_u(i), y_u(i)]$ and $[x_u(j), y_u(j)]$ (see Appendix A for details). After this, a piece-wise linear contour is obtained by linearly interpolating these $N'$ points, and then $N$ points, $\{\hat{x}_u(n), \hat{y}_u(n), 1 \le n \le N\}$, are sampled on this contour so that they are equispaced. The steps of resampling a contour are outlined in Appendix B. For the lower VT boundary, $\hat{x}_l(n)$ and $\hat{y}_l(n)$ are obtained in a similar manner. In the next two subsections, we describe the sensor location optimization algorithms for the upper and lower VT boundaries separately.
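The resampling step (detailed in Appendix B, which is not reproduced here) amounts to placing $N$ points equi-spaced in arc length along a piece-wise linear contour; a minimal sketch, with our own function name:

```python
import numpy as np

def resample_contour(points, N):
    """Resample a piece-wise linear contour into N points that are
    equi-spaced in arc length."""
    P = np.asarray(points, float)
    seg = np.linalg.norm(np.diff(P, axis=0), axis=1)   # segment lengths
    s = np.concatenate([[0.0], np.cumsum(seg)])        # cumulative arc length
    targets = np.linspace(0.0, s[-1], N)               # equi-spaced arc lengths
    x = np.interp(targets, s, P[:, 0])                 # interpolate x along arc length
    y = np.interp(targets, s, P[:, 1])                 # interpolate y along arc length
    return np.stack([x, y], axis=1)
```

For instance, an L-shaped contour through (0,0), (1,0), (1,1) has total length 2, so resampling with N = 5 places points every 0.5 units of arc length, including one at the corner.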
3.3. Optimum location of one sensor on the upper VT boundary

Consider the upper VT boundary $C_u = \{x_u(i), y_u(i), 1 \le i \le N_u\}$ of a test frame. $C_u$ begins at the location of the UL sensor and continues until the end of the velum segment, i.e., $[x_u(1), y_u(1)]$ and $[x_u(N_u), y_u(N_u)]$ denote the UL sensor location and the end of the velum segment, respectively (indicated by the cyan squares in Fig. 2). One sensor can be placed at any of the remaining $N_u - 2$ points. Thus, the optimal sensor location is obtained by first reconstructing the upper VT boundary using a sensor location anywhere among the remaining $N_u - 2$ points together with the end points, followed by searching for the sensor location which results in the least reconstruction error. For a given frame, the total reconstruction error (TRE) of the upper VT boundary is expressed as:

$$TRE_U = \sum_{i=1}^{N_u} \left[\left(x_u(i) - \hat{x}_u(i)\right)^2 + \left(y_u(i) - \hat{y}_u(i)\right)^2\right] \tag{4}$$

At first, we define the local mean squared error given two points $[x_u(s), y_u(s)]$ and $[x_u(e), y_u(e)]$, for interpolating the $N = (e - s) - 1$ in-between points $\{\hat{x}_u(n), \hat{y}_u(n), s < n < e\}$ by either linear or data-driven interpolation. The local mean squared error given the $s$th and $e$th points is defined as follows:

$$M_u^{Loc}(s, e) = \sum_{n=s+1}^{e-1} \left[\left(x_u(n) - \hat{x}_u(n)\right)^2 + \left(y_u(n) - \hat{y}_u(n)\right)^2\right] \tag{5}$$

For any chosen point $2 \le k \le N_u - 1$, using Eqs. (4) and (5), we can write $TRE_U = M_u^{Loc}(1, k) + M_u^{Loc}(k, N_u)$. Hence, for the upper VT boundary, the optimal sensor location $[x_u(k^\star), y_u(k^\star)]$ is obtained by performing the following optimization:

$$[x_u(k^\star), y_u(k^\star)] = \arg\min_{2 \le k \le N_u - 1} TRE_U \tag{6}$$

$$= \arg\min_{2 \le k \le N_u - 1} \; M_u^{Loc}(1, k) + M_u^{Loc}(k, N_u) \tag{7}$$
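The exhaustive search of Eqs. (6) and (7) can be sketched as follows. This is a minimal illustration with our own function names; `interp` stands for either of the two interpolators of Sections 3.1 and 3.2, here assumed to return points with the endpoints included:

```python
import numpy as np

def optimal_upper_sensor(C_u, interp):
    """Try each interior point k as the sensor, reconstruct the boundary
    from the two endpoints and k using `interp(p, q, N)`, and keep the k
    with the smallest total reconstruction error (Eq. (4))."""
    C = np.asarray(C_u, float)
    Nu = len(C)
    best_k, best_err = None, np.inf
    for k in range(1, Nu - 1):                 # interior points only
        left = interp(C[0], C[k], k + 1)       # reconstructs points 1..k (1-based)
        right = interp(C[k], C[-1], Nu - k)    # reconstructs points k..Nu
        recon = np.vstack([left, right[1:]])   # drop the duplicated point at k
        err = np.sum((C - recon) ** 2)         # TRE_U
        if err < best_err:
            best_k, best_err = k, err
    return best_k, best_err
```

With the linear interpolator, a boundary consisting of two straight segments is reconstructed exactly when the sensor sits at the corner, so the search returns that corner with zero error.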
3.4. Optimum locations of four sensors on the lower VT boundary

Consider the lower VT boundary $C_l = \{x_l(i), y_l(i), 1 \le i \le N_l\}$. Given the LL point $[x_l(1), y_l(1)]$ and the end point of the tongue contour $[x_l(N_l), y_l(N_l)]$, we obtain the optimal locations of four sensors (S3, S4, S5, S6) in an rtMRI video frame by minimizing the reconstruction error between the original and the interpolated lower VT boundary. These locations are denoted by $[x_l(k_p^\star), y_l(k_p^\star)]$, $1 \le p \le N_{opt}$, with $1 < k_p^\star < k_{p+1}^\star < N_l \;\forall p$, where $N_{opt} = 4$ denotes the number of optimal points. The total reconstruction error (TRE) of the lower VT boundary in a frame is defined as:

$$TRE_L = \sum_{i=1}^{N_l} \left[\left(x_l(i) - \hat{x}_l(i)\right)^2 + \left(y_l(i) - \hat{y}_l(i)\right)^2\right] \tag{8}$$

Similar to Eq. (5), the local mean squared error for the lower VT boundary between the $s$th and $e$th points is expressed as follows:

$$M_l^{Loc}(s, e) = \sum_{n=s+1}^{e-1} \left[\left(x_l(n) - \hat{x}_l(n)\right)^2 + \left(y_l(n) - \hat{y}_l(n)\right)^2\right] \tag{9}$$
where $[\hat{x}_l(n), \hat{y}_l(n)]$ is a point on the boundary reconstructed using either linear or data-driven interpolation, as described in Sections 3.1 and 3.2. Suppose the indices of the four points corresponding to the four sensors (S3, S4, S5, S6) are chosen to be $k_1$, $k_2$, $k_3$ and $k_4$, where $k_1 < k_2 < k_3 < k_4$. Then $TRE_L$ can be written in terms of $M_l^{Loc}$ as follows:

$$TRE_L = M_l^{Loc}(1, k_1) + \sum_{j=1}^{3} M_l^{Loc}(k_j, k_{j+1}) + M_l^{Loc}(k_4, N_l) \tag{10}$$

The indices of the optimal locations of the four sensors, $k_1^\star, k_2^\star, k_3^\star, k_4^\star$, are obtained by solving the following optimization:

$$\{k_1^\star, k_2^\star, k_3^\star, k_4^\star\} = \arg\min_{\substack{1 < k_1 < TBa, \\ TBa < k_2 < k_3 < k_4 < N_l}} TRE_L \tag{11}$$

A full search over four points to minimize $TRE_L$ would have an order complexity of $O(N_l^4)$. This is computationally prohibitive for $N_l = 120$, which is the average number of points marked by the annotators to depict the lower VT boundary. We design an algorithm following the principle of dynamic programming for an efficient solution of the optimization in Eq. (11). The steps of the algorithm are summarized in Algorithm 1.
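Since Algorithm 1 itself is not reproduced in this excerpt, the following is a sketch of how the dynamic-programming principle applies to the additive decomposition of Eq. (10): precompute the local errors $M^{Loc}(s, e)$ for all index pairs, then build up the best placement one sensor at a time. The function names are ours, and the TBa constraint of Eq. (11) is omitted for brevity; the published Algorithm 1 may differ in such details:

```python
import numpy as np

def optimal_sensors_dp(C, P, local_err):
    """Select P interior indices k_1 < ... < k_P of the contour C minimizing
    M(0, k_1) + sum_j M(k_j, k_{j+1}) + M(k_P, N-1), with 0 and N-1 the fixed
    endpoints.  `local_err(s, e)` returns the segment reconstruction error."""
    N = len(C)
    # Precompute the local error of every ordered index pair (O(N^2) table).
    M = np.full((N, N), np.inf)
    for s in range(N):
        for e in range(s + 1, N):
            M[s, e] = local_err(s, e)
    # dp[p, e]: best cost up to index e with p sensors placed, the p-th at e;
    # back[p, e] remembers the previous chosen index for backtracking.
    dp = np.full((P + 1, N), np.inf)
    back = np.zeros((P + 1, N), dtype=int)
    dp[0, 0] = 0.0                       # the fixed starting endpoint (LL)
    for p in range(1, P + 1):
        for e in range(p, N - 1):        # sensors stay strictly interior
            costs = dp[p - 1, :e] + M[:e, e]
            back[p, e] = int(np.argmin(costs))
            dp[p, e] = costs[back[p, e]]
    # Close the boundary at the fixed end point (Tend) and backtrack.
    final = dp[P, 1:N - 1] + M[1:N - 1, N - 1]
    k = 1 + int(np.argmin(final))
    total = float(final[k - 1])
    sensors = [k]
    for p in range(P, 1, -1):
        sensors.append(int(back[p, sensors[-1]]))
    return sensors[::-1], total
```

The table fill costs $O(P N_l^2)$ plus the $O(N_l^2)$ local-error evaluations, instead of the $O(N_l^4)$ of the full search.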
The four optimal points thus obtained are declared as the optimal sensor locations: one between the LL and TBa, and the remaining three on the tongue. It should be noted that, in finding the optimal sensor location S3, we constrain the optimal sensor to lie between the LL and TBa. This is done to ensure that there are only three sensors on the tongue, as is typically done in an EMA recording. In fact, the sensor S3 placed on the LI is used for recording jaw movement. Since the teeth do not appear in rtMRI recordings, we constrain the location of S3 to be on the lower VT boundary segment joining the LL and TBa. By constraining S3 to lie between the LL and TBa in the optimization, we assume that the optimized sensor location would be ideal for recording the jaw motion.
4. Experiments and results
4.1. Experimental setup
The annotated rtMRI video frames from MRI-TIMIT form the basis for the experiments in this work. The number of points marked by the annotators on the upper VT boundary varies across frames; this is true for the lower VT boundary as well. This is mainly because the annotators are allowed to mark as many points as they find appropriate to depict the air-tissue boundary. It could also be due to the fact that the shapes of both the upper and lower VT boundaries change from one frame to the next, requiring a different number of points to depict them. The points marked by the annotators are also found to be unequally spaced along the trajectory of the boundary, i.e., the marked points are dense in some parts of the boundary and sparse in others, as seen in Fig. 1. In order
to consider all parts of the upper and lower VT boundaries equally for selecting optimal sensor locations, we resample
the upper VT boundary such that the points are equally spaced along the boundary and the number of points on the
upper VT boundary is fixed (NU) across all frames. This is similarly done for the lower VT boundary using a fixed (NL)
number of points. The resampling is done by finding equi-distant points on the boundary obtained by linear interpolation
of the annotated points following the steps outlined in Appendix B. If NU and NL are small, the resampled points may
not capture the actual shape of a boundary. The higher the values of NU and NL, the better the representation of the boundary shapes. However, increasing NU and NL arbitrarily may not improve the representation of the boundary, as the information about the boundary shape is limited by the spatial resolution of the points marked by the annotators. Therefore, we determine the values of NU and NL such that the average distance between two consecutive points after resampling matches the average minimum distance (1.54 mm) between two consecutive points marked by the annotators (see Section 2 for details). This results in NU = 109 and NL = 113. With this fixed set of points, the smallest and largest distances between two consecutive resampled points are found to be 1.28 and 1.81 mm across all frames of all subjects.
The sensor locations are optimized separately on the upper and lower VT boundaries to achieve minimal reconstruction error using data-driven as well as linear interpolation, as outlined in Section 3. The optimization of the sensor locations is done separately for each of the four subjects (F1, F2, M1, M2). In particular, for the present study,
we find the optimum sensor locations in a five fold cross validation setup for each subject separately, where 1/5 of
all frames of a subject is used as the test set and the remaining are used as the training set in a round robin fashion.
Note that a training set is required only for the data-driven interpolation and not for the linear interpolation. The proposed data-driven interpolation rests on the assumption that the shape of a selected boundary segment from the training set would be similar to that of the test boundary segment. This assumption, in turn, requires
that the test segment would be located spatially close to the selected training segment. However, because of the head
movement of the subject, two segments corresponding to similar vocal tract shapes (e.g., VT shapes for same pho-
neme) may not match spatially. In order to compensate for this spatial offset, we perform an affine transformation
(following the steps outlined in Appendix A) on the upper VT boundary of each frame such that the begin and end
points of every boundary are mapped to (0,0) and (1,0), respectively. The lower VT boundary is similarly transformed before it is used for finding the optimum sensor locations.
The lower VT boundary changes its shape from one frame to the next. Hence, it is a challenge to associate anatomically identical points on two contours from two different frames. For the same reason, the optimal sensor locations in two different frames cannot be directly associated with each other. In order to report the location of an optimal sensor across different frames, we compute different inter-sensor distances as well as distances from known anatomical points on the VT
boundary in each frame. Finally, we report the mean and standard deviation (SD) of these distances over all frames.
Describing the optimal sensor locations in this manner helps in identifying the position of the optimal points with
respect to the fixed anatomical points and other optimal sensor locations on the VT boundary in the mid-sagittal plane.
4.2. Results
The performance of the proposed algorithm for selecting optimal sensor locations is reported in terms of the root mean squared error (RMSE) between the original VT boundary and the boundary reconstructed from the optimized sensor locations, defined as follows:

$$RMSE = \sqrt{\frac{1}{N_T} \sum_{n=1}^{N_T} \left\| \xi[n] - \hat{\xi}[n] \right\|_2^2} \qquad (12)$$

where $\xi[n] = \left[ \xi_x[n], \xi_y[n] \right]^T$ and $\hat{\xi}[n] = \left[ \hat{\xi}_x[n], \hat{\xi}_y[n] \right]^T$ are the points on the original and reconstructed boundaries in each frame, respectively, and $N_T$ is the total number of points in a frame.
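For concreteness, Eq. (12) amounts to the following computation (a minimal sketch; the function name and the (N_T, 2) array layout are our own, not part of the paper):

```python
import numpy as np

def boundary_rmse(orig, recon):
    """RMSE of Eq. (12); orig and recon are (N_T, 2) arrays of [x, y] points."""
    # squared Euclidean norm of the per-point error, averaged over the frame
    sq_norms = np.sum((np.asarray(orig, float) - np.asarray(recon, float)) ** 2, axis=1)
    return float(np.sqrt(np.mean(sq_norms)))
```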
Fig. 3 shows the bar plots of the RMSE for each fold, separately for every subject (one per row) and for the upper and lower VT boundaries (one per column). The bar height indicates the RMSE averaged over all test frames and the error bar indicates the corresponding SD. For every fold, Fig. 3 reports the RMSE obtained using both linear (dark-
gray bars) and data-driven (light-gray bars) interpolation. From the figure, we can see that the data-driven interpolation works better than the linear interpolation. This is mainly because the data-driven interpolation makes use of the boundary shapes of the training frames while the linear interpolation does not. The RMSE of the upper VT boundary is found to be higher than that of the lower VT boundary, because four sensor locations are used to reconstruct the lower VT boundary while only one sensor is used for the upper VT boundary. When averaged over all subjects and folds,
we observe that the upper VT boundary is reconstructed with an RMSE of 1.33 mm while the lower VT boundary is
reconstructed with an RMSE of 0.36 mm using the optimized sensor locations and data-driven interpolation.
After obtaining the optimal points for all the frames in each sentence of a subject, the mean and SD of the distances among the optimal sensor locations and the anatomical landmarks are computed. Here, one pixel in rtMRI corresponds to a 2.9 mm × 2.9 mm area in the physical dimension. Tables 1 and 2 report the mean and SD of various
distances on the lower and upper VT boundaries, respectively. All the distances are reported in millimeters. In Table 1, d(m, n) represents the distance between the mth and nth sensor locations on the VT boundary. Apart from the
Fig. 3. RMSE for each fold computed by linear and data-driven interpolation.
sensors, anatomical landmarks such as LL, TBa and Tend are also used. In Table 2, the variables S7^b and S7^a represent the optimal location of sensor S7 when obtained before (toward UL) and after (away from UL) the VEL point, respectively. d(S7, VEL) denotes the distance of S7 from the velum tip, where the distance to an optimal location before and after VEL is taken as positive and negative, respectively.
It is clear from Table 1 that the average distance between S4 and TBa lies in the range of 18–22 mm, indicating the optimized S4 location to be approximately the tongue tip position. The distance between the S4 and S5 locations indicates that S5 has to be placed at a distance of about 37–39 mm from the tongue tip. Similarly, the distance between the S5 and S6 locations indicates that the optimal location of S6 is nearly 77–85 mm away from the tongue tip. The distances in Table 2 indicate that the optimal location of S7 lies around VEL. Among all subjects, the optimal location of S7 furthest from VEL occurs at 18.84 mm after and 35.07 mm before the VEL tip, respectively. This suggests that the optimal point on the upper VT boundary primarily tracks the velum movement. The d(S7, VEL) values in Table 2 suggest that the optimal location of S7 occurs before VEL in most of the frames.
In order to produce different sounds, the vocal tract creates a wide variety of shapes. The tongue plays a crucial role in creating these different shapes by forming constrictions in different directions, as shown in Fig. 4. Due to the time-varying nature of vocal tract profiles, given an Nth point in a frame, the corresponding anatomical location on the vocal tract boundary need not be the Nth point in another frame. In order to illustrate how the optimal sensor locations vary depending on the vocal tract shapes for different phonemes, we choose four phonemes (a vowel, the voiced plosive /d/, a fricative, and the lateral /l/) and show the optimal sensor locations (using data-driven interpolation) on the upper and lower VT boundaries for all four subjects in Fig. 5.
The first row in Fig. 5 depicts the VT configuration for the vowel phoneme. A typical VT configuration for this vowel consists of an open VT at the front, the tongue raised at the back, and a wide gap between the tongue and palate, as shown in the figure. From the quantal nature of speech (Stevens, 1972; 1989; 2002), it is known that a target sound can be produced with some degree of articulatory freedom, where the articulation strategies are guided by a few principles and are rather constrained. This can be observed in the VT shapes in the sense that, while there is a gross similarity across subjects, there are subject-specific variations as well. These could be due to the different contexts in which the vowel is spoken as well as the different articulation styles of the subjects. The optimal sensor locations are shown using black dots on the upper and lower VT boundaries. It is clear that the shape of the velum changes across subjects and hence the optimal location of S7 changes depending on the subject. For example, for F1 and M1, the optimized S7 is very close to the
Table 2
The mean and SD (in brackets) of the distances (in mm) between the optimal sensor
location and the VEL on the upper VT boundary. FeAvg, MAvg and Average indicate
the distances averaged across females, males, and all subjects, respectively.
Subject    d(S7^a, VEL)     d(S7^b, VEL)     d(S7, VEL)
F1 17.09(15.97) 25.83(17.41) 13.70(25.76)
F2 8.72 (10.17) 35.07(21.05) 32.12(23.25)
FeAvg 15.60 (15.41) 30.83(19.99) 22.46(26.25)
M1 18.84 (12.77) 28.96 (17.88) 20.11(25.22)
M2 15.80 (14.30) 28.86(18.06) 20.89(24.43)
MAvg 17.47 (13.53) 28.91(17.95) 20.47(24.85)
Average 16.59 (14.46) 29.82(18.97) 21.41(25.54)
Table 1
The mean and SD (in brackets) of the distances (in mm) among different optimal sensor locations on
lower VT boundary. FeAvg, MAvg and Average indicate the distances averaged across females, males,
and all subjects, respectively.
Subject d(LL, S3) d(S3, TBa) d(TBa, S4) d(S4, S5) d(S5, S6) d(S6, Tend)
F1 26.41 (7.05) 10.37 (7.19) 21.24 (11.93) 37.67 (10.88) 40.15 (11.04) 35.11 (10.95)
F2 23.56 (7.18) 7.49 (6.82) 19.09 (10.12) 38.68 (10.7) 39.94 (11.3) 43.44 (11.05)
FeAvg 24.98 (7.46) 8.93 (7.41) 20.16 (11.48) 38.17 (11.08) 40.04 (11.50) 39.27 (12.13)
M1 26.56 (7.76) 10.91 (7.83) 18.25 (10.81) 37.68 (10.93) 42.99 (11.47) 40.16 (11.14)
M2 28.97 (8.35) 10.98 (8.42) 21.16 (11.22) 38.77 (12.46) 46.19 (12.10) 43.78 (13.51)
MAvg 27.76 (8.41) 10.94 (8.44) 19.70 (11.42) 38.22 (11.90) 44.59 (12.28) 41.97 (12.63)
Average 26.37 (8.08) 9.93 (8.03) 19.93 (11.45) 38.2 (11.52) 42.31 (12.11) 40.66 (12.48)
Fig. 4. Different vocal tract profiles due to constrictions created by the tongue in different directions.
VEL tip, while for F2 and M2 the optimal location is away from the VEL tip. The optimal location of S3 is close to the TBa for all subjects. The optimal location of S4 appears to coincide with TT, and S5 and S6 are optimally placed to capture the tongue shape well. The second row in Fig. 5 depicts the vocal tract configuration for the voiced plosive consonant phoneme /d/, for which a slightly open vocal tract at the front can be seen along with the constriction created by the raised tongue against the palate. It is interesting to observe that the optimal location of S4 occurs exactly at the point of constriction. This could be due to the high curvature of the tongue near the constriction, requiring an optimal point for the best reconstruction of the tongue shape. The third row of images in Fig. 5 depicts the vocal tract configuration for the fricative consonant phoneme, highlighting the tongue constriction against the palate. Unlike the optimal locations for the phoneme /d/, the optimal locations of the sensors on the tongue do not coincide with the constriction. This is because the shape of the tongue during the fricative is different from that during /d/; in particular, the curvature of the tongue near the constriction for the fricative is lower than that for /d/. The fourth row in Fig. 5 depicts the vocal tract configuration for the lateral alveolar approximant /l/. The tongue tip holds its contact with the palate for producing /l/. Unlike /d/, the optimal location of S4 does not occur at the constriction point for all subjects. It is clear that the shape of the tongue
Fig. 5. The vocal tract profiles and optimal sensor locations for the four phonemes for all the speakers. Optimal locations of S1 and S2 are not shown since they are fixed at UL and LL, which are the points at which the upper and lower VT boundaries begin, respectively.
varies across subjects for the sound /l/. The curvature of the tongue near the constriction is high for subjects F1 and F2,
while that is not so for M1 and M2. Interestingly, the optimal location of S4 occurs exactly at the point of constriction
for F1 and F2 while that does not happen for M1 and M2. These illustrations show that the optimal sensor locations
vary according to the uttered sound and the speaker’s VT morphology and articulation.
4.3. Discussions
The optimal sensor location is computed in each rtMRI video frame separately. The optimal locations are found to vary across frames within every utterance. However, in practice, it is not feasible to change the sensor location on a frame-by-frame basis. Hence, we examine the quality of VT boundary reconstruction with fixed sensor locations for each subject separately, based on the average distances reported in Table 1. For example, for subject F1, S3 is placed at a distance of 26.41 mm from LL, and S4, S5, and S6 are placed at distances of 21.24, 37.67, and 40.15 mm from TBa, S4, and S5, respectively (as per the first row in Table 1). Similarly, for F1, S7 is placed at a distance of 13.70 mm from VEL toward the UL. Using these fixed locations in each frame, the RMSE values of the reconstructed VT boundaries (over all frames of each subject) are reported in Table 3 under the sub-column titled 'subject dependent' under
the column titled ‘frame independent’. The RMSE values under the column titled ‘frame dependent’ correspond to the
reconstructed boundaries using optimized sensor location separately in each frame. This is identical to the average per-
formance across five folds shown in Fig. 3. It is clear that the RMSE, averaged across all subjects, increases by 0.26 and 0.34 mm (absolute) for the lower and upper VT, respectively, when a fixed set of sensor locations is used in all frames compared to frame-specific optimized sensor locations. We also report the RMSE when sensors are placed using inter-
sensor distances averaged across subjects within and across genders as indicated by ‘gender specific’ and ‘subject
independent’ sub-columns in Table 3. For these purpose, we use the inter-sensor distances following the third, sixth
and seventh rows of Tables 1 and 2. For example, for subject independent evaluation, this results in a distance of
26.37 mm between LL and S3. S4, S5 and S6 are placed at a distance of 19.93, 38.22, and 42.31 mm from TBa, S4 and S5; respectively (as seen in the seventh row of Table 1). S7 is placed at a distance of 21.41 mm from the VEL tip toward
the UL (as per the seventh row in Table 2). It is clear that the RMSE increases further when the sensors are placed in
gender specific as well as subject independent manner. This is mainly due to the fact that the VT morphology changes
across subjects and an average location across multiple subjects may not work well for individual ones. Although the
morphologies of the male and female subjects are different, we do not find any significant differences between the RMSE values for male and female subjects when sensors are placed in a gender-specific manner. However, the average RMSE values for the male subjects are higher than those of the female subjects by 0.02 mm in the lower VT and 0.2–0.3 mm in the upper VT.
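As an illustration of this frame-independent placement, average arc-length distances (such as those in Table 1) can be mapped to point indices on a resampled boundary contour; this sketch and its function name are ours, not part of the paper's pipeline:

```python
import numpy as np

def place_at_arclength(points, cumulative_mm):
    """Return indices of the contour points nearest to the given cumulative
    arc-length distances (in mm) measured from the contour's start point."""
    points = np.asarray(points, dtype=float)
    # cumulative arc length at every contour point
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)
    arc = np.concatenate([[0.0], np.cumsum(seg)])
    return [int(np.argmin(np.abs(arc - d))) for d in cumulative_mm]
```

For a subject-dependent placement one would pass the cumulative sums of the per-segment averages (e.g., for F1, 26.41 mm from LL for S3, then the TBa offset plus 21.24, 37.67, and 40.15 mm for S4–S6).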
In practice, it is challenging to place the sensors accurately enough to match the average optimal locations obtained from the optimization presented in this work. This could be due to several reasons, including the degree of co-operation from the subject, the difficulty of reaching critical locations in the vocal tract without causing much discomfort to the subject, the mismatch between the plane of sensor placement and the mid-sagittal plane as observed in rtMRI, the viscosity of the glue, the varying degree of salivation across people, and the placement of the wires. Placing EMA sensors on the tongue of a subject, in general, causes discomfort during speaking with wires in the mouth. In particular, when a sensor is placed near the tongue dorsum or behind it, it could cause a gag reflex, resulting in discomfort to the subject. Typically, the EMA sensors are placed on the tongue in the mid-sagittal plane based on visual inspection. This could cause a wrong estimate of the mid-sagittal
Table 3
The mean and SD (in brackets) of the RMSE (in mm) of the reconstructed VT boundaries when the sensors are placed in a frame-specific as well as a frame-independent manner. FeAvg, MAvg and Average indicate the RMSE averaged across females, males, and all subjects, respectively.
Subject   Frame dependent            Frame independent
          Subject dependent          Subject dependent          Gender specific            Subject independent
          Lower VT    Upper VT       Lower VT    Upper VT       Lower VT    Upper VT       Lower VT    Upper VT
F1 0.35 (0.07) 1.48 (0.47) 0.57 (0.24) 1.86 (0.52) 0.56 (0.25) 1.86 (0.55) 0.56 (0.23) 1.88 (0.51)
F2 0.39 (0.07) 1.41 (0.32) 0.67 (0.25) 1.82 (0.51) 0.69 (0.27) 1.82 (0.53) 0.69 (0.26) 1.81 (0.51)
FeAvg 0.37 (0.07) 1.44 (0.41) 0.62 (0.25) 1.83 (0.52) 0.63 (0.27) 1.84 (0.54) 0.62 (0.25) 1.87 (0.52)
M1 0.33 (0.06) 1.34 (0.31) 0.57 (0.21) 1.68 (0.52) 0.57 (0.20) 1.70 (0.53) 0.57 (0.20) 1.70 (0.52)
M2 0.37 (0.09) 1.07 (0.26) 0.66 (0.48) 1.35 (0.39) 0.66 (0.49) 1.34 (0.41) 0.67 (0.49) 1.36 (0.34)
MAvg 0.35 (0.08) 1.22 (0.31) 0.61 (0.36) 1.53 (0.49) 0.61 (0.37) 1.53 (0.51) 0.61 (0.37) 1.54 (0.49)
Average 0.36 (0.08) 1.33 (0.38) 0.62 (0.32) 1.67 (0.53) 0.62 (0.32) 1.68 (0.54) 0.62 (0.32) 1.69 (0.53)
plane, particularly due to tongue twitching and the manner in which the sensor is placed. Salivation often results in detachment of the sensor, which exacerbates the problem. Also, due to the invasive nature of the EMA recording, more sensors result in more discomfort to the subject during speaking. In this work, we did not include a factor for the subjective discomfort level in finding the optimal sensor locations. According to the proposed optimization, as the number of sensors increases, the reconstruction error decreases. However, considering the invasive nature of the EMA recording, the choice of the number of sensors should be determined by jointly considering the objective metric (RMSE) as well as the subjective metric (discomfort in speaking). Another practical constraint in determining the optimal sensor locations would be the required minimum distance between two sensors. For example, it is recommended that two sensors in an EMA recording be placed at a minimum distance of 8 mm to avoid inter-sensor interference (AG500, 2017). Such a constraint is critical when one plans to find the optimal locations of a relatively large number of sensors.
5. Conclusions
In this work, we propose an algorithm for finding optimal sensor locations for EMA recording by formulating it as a problem of optimal point selection on the air-tissue boundaries for minimizing the reconstruction error in the rtMRI video frames. Air-tissue boundaries are reconstructed using two types of interpolation functions, namely linear and data-driven. We have considered four different speakers to examine how the algorithm performs in predicting optimal sensor locations in VTs with varying morphology and articulation styles. We have considered rtMRI frames covering different vocal tract shapes corresponding to most of the phonemes of American English. The RMSE of the reconstructed boundary has a range of 0.33–0.39 mm and 1.07–1.48 mm when optimal sensor locations are used for reconstruction in the lower and upper VT, respectively. When averaged over all four subjects, the proposed data-driven interpolation reveals that, for minimizing the reconstruction error of the lower VT boundary, one sensor should be placed at the lower incisor at a distance of 26.37(±8.08) mm from the lower lip and three sensors on the tongue, at TT (19.93(±11.45) mm from the tongue base) and at 38.2(±11.52) mm and 80.51(±13.51) mm away from TT. Similarly, for minimal reconstruction error of the upper VT boundary, one sensor should be placed at a distance of 21.41(±25.54) mm from the velum tip. This leads to average reconstruction RMSEs of 0.62 mm and 1.69 mm for the lower and upper VT boundaries, respectively.
In the current work, we have optimized the sensor locations on the VT boundary based on each frame independently. Optimal sensor locations could also be found by minimizing the reconstruction error over all the frames jointly. It would also be interesting to observe how the locations of the optimal sensors vary when considering the multi-slice data in an rtMRI frame, which adds information from the coronal plane. Modeling pharyngeal constriction based on optimally placed sensors in the anterior tract is also a problem worth investigating. These are parts of our future work.
Acknowledgments
We thank all the annotators who participated in marking the air-tissue boundaries in the rtMRI video frames.
Appendix A. Converting a two-dimensional contour with start and end points [x1, y1] and [x2, y2], respectively, to a new contour with start and end points [x3, y3] and [x4, y4], respectively, using an affine transformation
Given two locations [x1, y1] and [x2, y2], a contour between them can be transformed to a new one starting from [x3, y3] and ending at [x4, y4] by performing an affine transformation

$$\begin{bmatrix} \tilde{x} \\ \tilde{y} \end{bmatrix} = \begin{bmatrix} a_1 & -a_2 \\ a_2 & a_1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} b_1 \\ b_2 \end{bmatrix},$$

where $[x, y]$ and $[\tilde{x}, \tilde{y}]$ denote points on the contour before and after transformation, respectively. The affine parameters $a_1, a_2, b_1, b_2$ are computed by solving the following equations, obtained by equating the relations of the start and end points before and after transformation:

$$\begin{bmatrix} x_1 & -y_1 & 1 & 0 \\ y_1 & x_1 & 0 & 1 \\ x_2 & -y_2 & 1 & 0 \\ y_2 & x_2 & 0 & 1 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ b_1 \\ b_2 \end{bmatrix} = \begin{bmatrix} x_3 \\ y_3 \\ x_4 \\ y_4 \end{bmatrix} \qquad (1)$$
The above equation is of the form $Ax = b$, and by performing elementary row operations on $A$, a row-equivalent form turns out to be

$$\begin{bmatrix} x_1 - x_2 & -(y_1 - y_2) & 0 & 0 \\ y_1 - y_2 & x_1 - x_2 & 0 & 0 \\ x_2 & -y_2 & 1 & 0 \\ y_2 & x_2 & 0 & 1 \end{bmatrix}.$$

Hence, the solution of the above equation exists if $(x_1 - x_2)^2 + (y_1 - y_2)^2 \neq 0$. In other words, the begin and end points of the contour should not be identical for the solution to exist. This is true for the optimization we have considered for optimal sensor placement.
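This transformation can be sketched as follows, solving the 4×4 system of Eq. (1) directly; the function name and default targets (the (0,0)/(1,0) normalization used in Section 4.1) are illustrative:

```python
import numpy as np

def affine_normalize(contour, target_start=(0.0, 0.0), target_end=(1.0, 0.0)):
    """Map a contour so that its endpoints land on target_start/target_end
    via the rotation-scaling-translation transform of Appendix A."""
    (x1, y1), (x2, y2) = contour[0], contour[-1]
    (x3, y3), (x4, y4) = target_start, target_end
    # system of Eq. (1): rows relate the start/end points before and after
    A = np.array([[x1, -y1, 1, 0],
                  [y1,  x1, 0, 1],
                  [x2, -y2, 1, 0],
                  [y2,  x2, 0, 1]], dtype=float)
    a1, a2, b1, b2 = np.linalg.solve(A, [x3, y3, x4, y4])
    R = np.array([[a1, -a2], [a2, a1]])
    return np.asarray(contour, float) @ R.T + np.array([b1, b2])
```

As the appendix notes, the system is solvable whenever the contour's endpoints are distinct.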
Appendix B. Resampling a two-dimensional contour of N (non-equi-spaced) points with Nd equi-spaced points

Suppose the contour P of length L contains N points, starting from p1 = [x1, y1] and ending at pN = [xN, yN]. The steps outlined in Algorithm 2 describe the resampling of the same contour with Nd points which are equally spaced at a distance d = L/Nd. The estimated ordered points on the contour are p1, p̂2, p̂3, ..., p̂(Nd−1), pN.
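A compact version of this resampling, using linear interpolation along the cumulative arc length as in Algorithm 2 (a sketch, not the algorithm verbatim; here the Nd points span the full length with spacing L/(Nd − 1) so that both endpoints are retained exactly):

```python
import numpy as np

def resample_contour(points, n_out):
    """Resample a polyline at n_out equi-spaced arc-length positions,
    keeping the original first and last points."""
    points = np.asarray(points, dtype=float)
    # cumulative arc length at each original point
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)
    arc = np.concatenate([[0.0], np.cumsum(seg)])
    targets = np.linspace(0.0, arc[-1], n_out)   # equi-spaced positions
    out = np.empty((n_out, 2))
    out[:, 0] = np.interp(targets, arc, points[:, 0])
    out[:, 1] = np.interp(targets, arc, points[:, 1])
    return out
```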
References
TaggedPCarstens Medizinelektronik Gmbh, AG500 Manual. 2017. http://www.ag500.de/manual/ag500/AG500_manual.pdf. (Accessed:15/2/2017).
TaggedPUCLA phonetics lab 2017. http://www.linguistics.ucla.edu/faciliti/facilities/physiology/ema.html#Where_Sensors. (Accessed:13/4/2017).
TaggedPBennett, J.W., Van Lieshout, P., Steele, C.M., 2007. Tongue control for speech and swallowing in healthy younger and older subjects. Int. J. Oro-
fac. Myol. 33, 5–18.
TaggedPBerger, A., 2002. How does it work? Magnetic resonance imaging. BMJ: Br. Med. J. 324 (7328), 35.
TaggedPBombien, L., Mooshammer, C., Hoole, P., Rathcke, T., K€uhnert, B., 2007. Articulatory strengthening in initial German /kl/ clusters under prosodic
variation. In: Proceedings of the Sixteenth International Congress of Phonetic Sciences. Saarbr€ucken, Germany, pp. 457–460.
TaggedPBotta, M., 2000. Second coordination sphere water molecules and relaxivity of gadolinium (III) complexes: implications for MRI contrast agents.
Eur. J. Inorg. Chem. 2000 (3), 399–407.
TaggedPBresch, E., Kim, Y.-C., Nayak, K., Byrd, D., Narayanan, S., 2008. Seeing speech: capturing vocal tract shaping using real-time magnetic reso-
nance imaging. IEEE Signal Process. Mag. 25 (3), 123–132.
TaggedPBrown, R.W., Cheng, Y.-C. N., Haacke, E.M., Thompson, M.R., Venkatesan, R., 2014. Magnetic Resonance Imaging: Physical Principles and
Sequence Design. John Wiley & Sons.
TaggedPCho, T., 2004. Prosodically conditioned strengthening and vowel-to-vowel coarticulation in English. J. Phon. 32 (2), 141–176.
TaggedPDemolin, D., Metens, T., Soquet, A., 1996. Three-dimensional measurement of the vocal tract by MRI. 1, 272–275.
TaggedPDemolin, D., Metens, T., Soquet, A., 2000. Real time MRI and articulatory coordinations in vowels. In: Proceedings of the Fifth Seminar on
Speech Production: Models and Data, pp. 86–93.
TaggedPDuran, D., Bruni, J., Dogil, G., 2013. Acoustic and articulatory information as joint factors coexisting in the context sequence model of speech pro-
duction. 19 (1), 060091.
TaggedPEngwall, O., 2003. Combining MRI, EMA and EPG measurements in a three-dimensional tongue model. Speech Commun. 41 (2), 303–329.
TaggedPFeng, Y., 2008. Dissociating the Role of auditory and Somatosensory Feedback in Speech Production: Sensorimotor Adaptation to Formant Shifts
and Articulatory Perturbations.
TaggedPFrankel, J., Richmond, K., King, S., Taylor, P., 2000. An automatic speech recognition system using neural networks and linear dynamic models to
recover and model articulatory traces. In: Proceedings of the International Conference on Spoken Language Processing (ICSLP), pp. 254–257.
TaggedPGhosh, P.K., Narayanan, S., 2010. A generalized smoothness criterion for acoustic-to-articulatory inversion. J. Acoust. Soc. Am. 128 (4), 2162–
2172.
TaggedPGhosh, P.K., Narayanan, S.S., 2011. A subject-independent acoustic-to-articulatory inversion. In: Proceedings of the IEEE International Confer-
ence on Acoustics, Speech and Signal Processing, pp. 4624–4627.
TaggedPHardcastle, W., Vaxelaire, B., Gibbon, F., Hoole, P., Nguyen, N., 1996. EMA/EPG study of lingual coarticulation in /kl/ clusters. In: Proceedings
of the Speech Production Seminar, pp. 53–56.
TaggedPHoole, P., Gfoerer, S., 1990. Electromagnetic articulography as a tool in the study of lingual coarticulation. J. Acoust. Soc. Am. 87 (S1), S123.
TaggedPHoole, P., Nguyen, N., 1997. Electromagnetic articulography in coarticulation research. Forschungsberichte des Instituts f€ur Phonetik und
Sprachliche Kommunikation der Universit€at M€unchen 35, 177–184.
TaggedPHoole, P., Nguyen-Trong, N., Hardcastle, W., 1993. A comparative investigation of coarticulation in fricatives: electropalatographic, electromag-
netic, and acoustic data. Lang. Speech 36 (2�3), 235–260.
TaggedPHuettel, S.A., Song, A.W., McCarthy, G., 2004. Functional magnetic resonance imaging. 1. Sinauer Associates, Sunderland.
TaggedPKatz, W., Machetanz, J., Orth, U., Sch€onle, P., 1990. A kinematic analysis of anticipatory coarticulation in the speech of anterior aphasic subjects
using electromagnetic articulography. Brain Lang. 38 (4), 555–575.
TaggedPKim, J., Lammert, A.C., Ghosh, P.K., Narayanan, S.S., 2014. Co-registration of speech production datasets from electromagnetic articulography
and real-time magnetic resonance imaging. J. Acoust. Soc. Am. 135 (2), EL115–EL121.
TaggedPKing, S., Wrench, A., 1999. Dynamical system modelling of articulator movements. In: Proceedings of the International Congress of Phonetic
Sciences, pp. 2259–2262.
TaggedPKoos, B., Horn, H., Schaupp, E., Axmann, D., Berneburg, M., 2013. Lip and tongue movements during phonetic sequences: analysis and definition
of normal values. Eur. J. Orthod. 35 (1), 51–58.
TaggedPKroos, C., 2008. Measurement accuracy in 3D electromagnetic articulography (Carstens AG500). In: Proceedings of the Eight International
Seminar on Speech Production, pp. 61–64.
TaggedPKroos, C., 2012. Evaluation of the measurement precision in three-dimensional electromagnetic articulography (Carstens AG500). J. Phon. 40 (3),
453–465.
TaggedPLadefoged, P., Harshman, R., Goldstein, L., Rice, L., 1978. Generating vocal tract shapes from formant frequencies. J. Acoust. Soc. Am. 64 (4),
1027–1035.
TaggedPLing, Z.-H., Richmond, K., Yamagishi, J., 2010. HMM-based text-to-articulatory-movement prediction and analysis of critical articulators. In:
Proceedings of the InterspeecH, pp. 2194–2197.
TaggedPLingala, S.G., Zhu, Y., Kim, Y.-C., Toutios, A., Narayanan, S., Nayak, K.S., 2017. A fast and flexible MRI system for the study of dynamic vocal
tract shaping. Magn. Reson. Med. 77 (1), 112–125.
TaggedPLiu, F.-H., 1994. Environmental Adaptation for Robust Speech Recognition. Carnegie Mellon University, Pittsburgh (Ph.D. thesis).
TaggedPMaassen, B., Kent, R., Peters, H., 2007. Speech Motor Control: In Normal and Disordered Speech. Oxford University Press.
TaggedPMaurer, D., Gr€one, B., Landis, T., Hoch, G., Sch€onle, P., 1993. Re-examination of the relation between the vocal tract and the vowel sound with
electromagnetic articulography (EMA) in vocalizations. Clin. Linguist. Phon. 7 (2), 129–143.
A.K. Pattem et al. / Computer Speech & Language 47 (2018) 157�174 173
McClean, M.D., Runyan, C.M., 2000. Variations in the relative speeds of orofacial structures with stuttering severity. J. Speech Lang. Hear. Res. 43 (6), 1524–1531.
Mooshammer, C., Hoole, P., 1993. Articulation and coarticulation in velar consonants. Forschungsberichte-Institut für Phonetik und Sprachliche Kommunikation der Universität München 31, 249–262.
Mooshammer, C., Schiller, N.O., 1996. Coarticulatory effects on kinematic parameters of rhotics in German. In: Proceedings of the First ESCA Tutorial and Research Workshop on Speech Production Modeling: From Control Strategies to Acoustics. Autrans, pp. 25–28.
Mücke, D., Nam, H., Hermes, A., Goldstein, L., 2012. Coupling of tone and constriction gestures in pitch accents. In: Hoole, P. (Ed.), Consonant Clusters and Structural Complexity. Mouton de Gruyter, Berlin, pp. 205–230.
Namasivayam, A.K., Van Lieshout, P.H.H.M., 2001. Compensation and adaptation to static perturbations in people who stutter. In: Speech Motor Control in Normal and Disordered Speech: 4th International Speech Motor Conference. Nijmegen, Netherlands, pp. 253–257.
Namasivayam, A.K., Van Lieshout, P., 2008. Investigating speech motor practice and learning in people who stutter. J. Fluen. Disord. 33 (1), 32–51.
Narayanan, S., Bresch, E., Ghosh, P.K., Goldstein, L., Katsamanis, A., Kim, Y., Lammert, A.C., Proctor, M.I., Ramanarayanan, V., Zhu, Y., 2011. A multimodal real-time MRI articulatory corpus for speech research. In: Proceedings of the Interspeech, pp. 837–840.
Narayanan, S., Nayak, K., Lee, S., Sethy, A., Byrd, D., 2004. An approach to real-time magnetic resonance imaging for speech production. J. Acoust. Soc. Am. 115 (4), 1771–1776.
Narayanan, S., et al., 2014. Real-time magnetic resonance imaging and electromagnetic articulography database for speech production research (TC). J. Acoust. Soc. Am. 136 (3), 1307–1311.
Ouni, S., Laprie, Y., 2009. Studying pharyngealization using an articulograph. In: Proceedings of the International Workshop on Pharyngeals and Pharyngealisation. Newcastle.
Parthasarathy, V., Prince, J.L., Stone, M., Murano, E.Z., NessAiver, M., 2007. Measuring tongue motion from tagged cine-MRI using harmonic phase (HARP) processing. J. Acoust. Soc. Am. 121 (1), 491–504.
Payan, Y., Perrier, P., 1997. Synthesis of V-V sequences with a 2D biomechanical tongue model controlled by the equilibrium point hypothesis. Speech Commun. 22 (2), 185–205.
Perkell, J.S., Cohen, M.H., Svirsky, M.A., Matthies, M.L., Garabieta, I., Jackson, M.T.T., 1992. Electromagnetic midsagittal articulometer systems for transducing speech articulatory movements. J. Acoust. Soc. Am. 92 (6), 3078–3096.
Peters, H.F.M., Hulstijn, W., Van Lieshout, P., 2000. Recent developments in speech motor research into stuttering. Folia Phoniatrica et Logopaedica 52 (1–3), 103–119.
Recasens, D., 2002. An EMA study of VCV coarticulatory direction. J. Acoust. Soc. Am. 111 (6), 2828–2841.
Richardson, M., Bilmes, J., Diorio, C., 2003. Hidden-articulator Markov models for speech recognition. Speech Commun. 41 (2), 511–529.
Rubin, P., Vatikiotis-Bateson, E., 1998. Measuring and modeling speech production. In: Animal Acoustic Communication. Springer Berlin Heidelberg, pp. 251–290.
Rudzicz, F., Namasivayam, A.K., Wolff, T., 2012. The TORGO database of acoustic and articulatory speech from speakers with dysarthria. Lang. Res. Eval. 46 (4), 523–541.
Schulz, G., Sulc, S., Leon, S., Gilligan, G., 2000. Speech motor learning in Parkinson disease. J. Med. Speech Lang. Pathol. 8 (4), 243–247.
Serrurier, A., Barney, A., Badin, P., Boë, L.-J., Savariaux, C., 2008. Comparative articulatory modelling of the tongue in speech and feeding. In: Proceedings of the International Seminar on Speech Production (ISSP).
Slørdahl, S.A., Bjærum, S., Amundsen, B.H., Støylen, A., Heimdal, A., Rabben, S.I., Torp, H., 2001. High frame rate strain rate imaging of the interventricular septum in healthy subjects. Eur. J. Ultrasound 14 (2), 149–155.
Steele, C.M., Van Lieshout, P., 2004. Use of electromagnetic midsagittal articulography in the study of swallowing. J. Speech Lang. Hear. Res. 47 (2), 342–352.
Steele, C.M., Van Lieshout, P., 2005. Does barium influence tongue behaviors during swallowing? Am. J. Speech Lang. Pathol. 14 (1), 27–39.
Steele, C.M., Van Lieshout, P., 2009. Tongue movements during water swallowing in healthy young and older adults. J. Speech Lang. Hear. Res. 52 (5), 1255–1267.
Steiner, I., Richmond, K., Ouni, S., 2013. Speech animation using electromagnetic articulography as motion capture data. In: Proceedings of the Twelfth International Conference on Auditory-Visual Speech Processing. France, pp. 55–60.
Stella, M., Bernardini, P., Sigona, F., Stella, A., Grimaldi, M., Gili Fivela, B., 2012. Numerical instabilities and three-dimensional electromagnetic articulography. J. Acoust. Soc. Am. 132 (6), 3941–3949.
Stella, M., Stella, A., Sigona, F., Bernardini, P., Grimaldi, M., Fivela, B.G., 2013. Electromagnetic articulography with AG500 and AG501. In: Proceedings of the Interspeech, pp. 1316–1320.
Stevens, K.N., 1972. The quantal nature of speech: evidence from articulatory-acoustic data. In: David, E.E., Denes, P.B. (Eds.), Human Communication: A Unified View. McGraw-Hill, New York, pp. 51–56.
Stevens, K.N., 1989. On the quantal nature of speech. J. Phon. 17 (1), 3–45.
Stevens, K.N., 2002. Toward a model for lexical access based on acoustic landmarks and distinctive features. J. Acoust. Soc. Am. 111 (4), 1872–1891.
Stone, M., Lundberg, A., 1996. Three-dimensional tongue surface shapes of English consonants and vowels. J. Acoust. Soc. Am. 99 (6), 3728–3737.
Toda, T., Black, A.W., Tokuda, K., 2004. Acoustic-to-articulatory inversion mapping with Gaussian mixture model. In: Proceedings of the Interspeech, pp. 1129–1132.
Toda, T., Black, A.W., Tokuda, K., 2004. Mapping from articulatory movements to vocal tract spectrum with Gaussian mixture model for articulatory speech synthesis. In: Proceedings of the Fifth ISCA Speech Synthesis Workshop. Pittsburgh, pp. 31–36.
Toutios, A., Margaritis, K., 2003. Acoustic-to-articulatory inversion of speech: a review. In: Proceedings of the 12th International TAINN. https://pdfs.semanticscholar.org/c756/3df3ecb34774f661d6681263874353d58119.pdf.
Toutios, A., Ouni, S., Laprie, Y., 2011. Estimating the control parameters of an articulatory model from electromagnetic articulograph data. J. Acoust. Soc. Am. 129 (5), 3245–3257.
Uchida, H., Wakamiya, K., Kaburagi, T., 2016. Improvement of measurement accuracy for the three-dimensional electromagnetic articulograph by optimizing the alignment of the transmitter coils. Acoust. Sci. Technol. 37 (3), 106–114.
Uria, B., Renals, S., Richmond, K., 2011. A deep neural network for acoustic-articulatory speech inversion. In: Proceedings of the NIPS Workshop on Deep Learning and Unsupervised Feature Learning.
Van Lieshout, P., 2001. Coupling dynamics of motion primitives in speech movements and its potential relevance for fluency. Soc. Chaos Theory Psychol. Life Sci. Newslett. 8 (4), 18.
Van Lieshout, P., 2007. Dynamical systems theory and its application in speech. In: Speech Motor Control in Normal and Disordered Speech. Oxford University Press, chapter 3, pp. 51–82.
Van Lieshout, P., Bose, A., Square, P.A., Steele, C.M., 2007. Speech motor control in fluent and dysfluent speech production of an individual with apraxia of speech and Broca's aphasia. Clin. Linguist. Phon. 21 (3), 159–188.
Van Lieshout, P., Rutjens, C., Spauwen, P., 2002. The dynamics of interlip coupling in speakers with a repaired unilateral cleft-lip history. J. Speech Lang. Hear. Res. 45 (1), 5–19.
Wang, Y.K., Nash, M.P., Pullan, A.J., Kieser, J.A., Röhrle, O., 2013. Model-based identification of motion sensor placement for tracking retraction and elongation of the tongue. Biomech. Model. Mechanobiol. 12 (2), 383–399.
Watkin, K.L., Rubin, J.M., 1989. Pseudo-three-dimensional reconstruction of ultrasonic images of the tongue. J. Acoust. Soc. Am. 85 (1), 496–499.
West, P., 2000. Long-distance coarticulatory effects of British English /l/ and /r/: an EMA, EPG and acoustic study. In: Proceedings of the Fifth Seminar on Speech Production: Models and Data. Kloster Seeon, Bavaria, Germany, pp. 105–108.
Westbury, J., 1994. X-ray Microbeam Speech Production Database Users Handbook. Madison.
Westbury, J., Milenkovic, P., Weismer, G., Kent, R., 1990. X-ray microbeam speech production database. J. Acoust. Soc. Am. 88 (S1), S56.
Wong, M.N., Murdoch, B.E., Whelan, B.-M., 2011. Lingual kinematics in dysarthric and nondysarthric speakers with Parkinson's disease. Parkinsons Dis. 2011, 352838, 8 pages. doi:10.4061/2011/352838.
Wrench, A., Richmond, K., 2000. Continuous speech recognition using articulatory data. In: Proceedings of the International Conference on Spoken Language Processing, pp. 145–148.
Wrench, A.A., 2000. A multichannel articulatory database and its application for automatic speech recognition. In: Proceedings of the Fifth Seminar on Speech Production.
Yunusova, Y., Green, J.R., Mefferd, A., 2009. Accuracy assessment for AG500, electromagnetic articulograph. J. Speech Lang. Hear. Res. 52 (2), 547–555.
Zhang, L., Renals, S., 2008. Acoustic-articulatory modeling with the trajectory HMM. IEEE Signal Process. Lett. 15, 245–248.