
Vanessa Sofia Martins Lopes

Master of Science

A Computer-Based Therapy Game with a

Dynamic Difficulty Adjustment Model for

Childhood Dysphonia

Dissertation submitted in partial fulfillment

of the requirements for the degree of

Master of Science in

Computer Science and Informatics Engineering

Adviser: Prof. Dr. Sofia Cavaco,

Assistant Professor, Faculdade de Ciências e Tecnologia da

Universidade Nova de Lisboa

Examination Committee

Chairperson: Prof. Dr. Pedro Medeiros

Rapporteur: Prof. Dr. Aníbal Ferreira

Member: Prof. Dr. Sofia Cavaco

December, 2018

A Computer-Based Therapy Game with a Dynamic Difficulty Adjustment Model

for Childhood Dysphonia

Copyright © Vanessa Sofia Martins Lopes, Faculty of Sciences and Technology, NOVA

University Lisbon.

The Faculty of Sciences and Technology and the NOVA University Lisbon have the right,

perpetual and without geographical boundaries, to file and publish this dissertation

through printed copies reproduced on paper or on digital form, or by any other means

known or that may be invented, and to disseminate through scientific repositories and

admit its copying and distribution for non-commercial, educational or research purposes,

as long as credit is given to the author and editor.

This document was created using the (pdf)LaTeX processor, based on the “novathesis” template [1], developed at the Dep. Informática of FCT-NOVA [2]. [1] https://github.com/joaomlourenco/novathesis [2] http://www.di.fct.unl.pt

Acknowledgements

This work was supported by the Portuguese Foundation for Science and Technology under the projects BioVisualSpeech (CMUP-ERI/TIC/0033/2014) and NOVA-LINCS (PEest/UID/CEC/04516/2013).

First, I would like to express my gratitude to my advisor, Prof. Dr. Sofia Cavaco,

for her knowledge, support, and motivation during the last year. For the exceptional

supervision that helped me achieve the work presented in this thesis. It was a great

pleasure to work with her.

I also want to thank Ines Jorge for the fantastic designs produced. The final result

is perfect and would not be the same without her commitment.

A special thanks to all the therapists that participated directly or indirectly in the

furtherance of this project. Specifically, to Diana Lança, Cátia Pedroso, Sónia de Jesus

Lima and Nuno Silva for their availability, guidance, and knowledge during this last year.

I want to highlight the availability of the nursery school Alfredo da Mota in Castelo

Branco and the therapist Liliana for allowing me to validate the game platform with

children, which contributed a lot to the accomplishment of this work.

I also want to thank my lab colleagues, David, Flavio, Ivo, Gustavo and André for helping

me during the last year, for the great moments, philosophical discussions and all the

kindness and fun brought to the office. That lab has no identity without them!

To my special friends Daniela, Joana Silva, Joana Tavares, Joana Lopes, Frederico and

Catarina. For their friendship during the past 5 years and for being always there for me.

Also, to my nerd friends Pedro, Luis, Eduardo, and Daniel for the companionship, support, and friendly advice.

To my beloved, crazy and handsome friends André Pontes and Gonçalo Marcelino...

It was a long journey that would not have been so fun and memorable without them. To whom I would like to say, um bem haja.

To my best friend, Iana Lyckho, who was always ready with support, advice... and

everything. For giving me so many good moments not only during the last year but also

through the remaining ones. We will always be partners in "crime."

To my uncle, a special thanks for his guidance and inspiration to surpass myself. For always being supportive throughout the past 5 years.

Lastly, to my dad, my mom and my brother for being my home, my support and, especially, for their effort along the past 5 years. This would never have been possible without them.


Abstract

Problems in vocal quality are present in 4 to 12-year-old children and may affect their health as well as their social interactions and development process. Speech therapy has a central role in their recovery and vocal re-education. Throughout the therapy sessions with children, it is essential to keep them motivated and willing to learn. With the current digital advances, characterized by the increasing consumption of computer devices, we seek new ways to practice the exercises included in traditional therapy sessions. These exercises should be adapted to the capabilities of each child so that their experience follows a course without frustration or boredom.

For this purpose, we propose a computer-based therapy game that offers a new, powerful and engaging way of practicing the sustained vowel exercise. This interactive tool was developed taking into account a set of scenarios and characters with a childhood theme, coupled with a gamification strategy to reward a player's success. Additionally, to automatically adapt the difficulty of the challenges in response to the child's performance, we created a novel dynamic difficulty adjustment model. To measure the child's performance, the model uses parameters that are relevant to the therapy treatment.

Moreover, to allow intensive training outside sessions, we developed an automatic recognition system for the Portuguese vowels. The model is composed of the best combination of sound feature extraction algorithms and classification algorithms. The merging of these game components endeavors to challenge the child to practice the exercises with higher performance and to prompt, in the long term, a healthy and stimulating therapy process.

Keywords: Dysphonia, Sustained vowel exercise, Automatic sound recognition, Loud-

ness, Maximum phonation time, Dynamic difficulty adjustment model.


Resumo

Problems in vocal quality are present mainly in children between 4 and 12 years old and affect their social interactions and development process, in addition to their health. Speech therapy plays a central role in vocal recovery and re-education, both for voice and for speech pathologies. Throughout the therapy sessions with a child, it is important to keep the child motivated and receptive to learning. In a technological world characterized by the growing consumption of mobile devices and computers, it is essential to find alternative exercises that complement traditional therapy sessions and that can draw on current technological advances for that purpose. In turn, these exercises should be adapted to the difficulties of each child, so that the child neither feels frustrated by the inability to solve the tasks nor gets bored by how easy they are.

Accordingly, we propose a serious game to be used as a complement to speech therapy, offering a new and challenging way of practicing the sustained vowel exercise. This interactive tool was developed taking into consideration a set of scenarios and characters within a childhood theme, associated with a gamification strategy in which rewards are earned for each challenge successfully completed. Additionally, in order to automatically adapt the difficulty of the challenges to the child's performance, we developed a new dynamic difficulty adjustment model. The measurement of the child's performance is based on variables that are relevant in the therapy context.

Furthermore, in order to allow intensive training outside the therapy sessions, we also developed a recognition system for the vowels of European Portuguese. This model is composed of the best combination of features extracted from the sound with classification algorithms. Bringing these functionalities together in a single game makes it possible to stimulate the child to practice the exercise with higher performance and to improve, in the long term, the results of the treatment.

Keywords: Dysphonia, Sustained vowel exercise, Automatic sound recognition, Frequency, Amplitude, Dynamic difficulty adjustment model.


Contents

List of Figures

List of Tables

Acronyms

1 Introduction
1.1 Overview
1.2 Objectives
1.3 Proposed Solution
1.4 Contributions
1.5 Document structure

2 Fundamental concepts
2.1 Speech Therapy
2.1.1 The sound
2.1.2 The voice
2.1.3 Classification of voice disorders
2.1.4 Treatments for voice disorders
2.2 Speech processing and Machine Learning
2.2.1 Spectrum Analysis
2.2.2 Speech processing and extraction
2.2.3 Additional Sound Features
2.2.4 Classification algorithms
2.3 Player-adaptability models

3 State of Art
3.1 Tools for speech therapy
3.1.1 Without sound recognition
3.1.2 With unspecified phoneme recognition
3.1.3 With identification of specific phonemes
3.2 Tools with a DDA model
3.3 Tools comparison

4 Game and Architecture
4.1 Proposed game
4.1.1 The sustained vowel exercise
4.1.2 Game scenarios and gamification strategy
4.1.3 Visual feedback
4.1.4 Game parametrization
4.2 Platform architecture, design and structure
4.2.1 System architecture
4.2.2 Game's storyboard

5 A Novel Dynamic Difficulty Adjustment model
5.1 The DDA model
5.1.1 Maximum phonation time
5.1.2 Speech intensity level

6 Automatic Sound Recognition System
6.1 Data set characterization
6.2 Automatic recognition system of vowels
6.2.1 Feature extraction techniques
6.2.2 Data preprocessing and analysis
6.2.3 Data visualization and feature analysis
6.2.4 Model estimation methodology
6.3 Evaluation
6.3.1 Comparison between different classifiers
6.3.2 Effect of varying the number of MFCCs
6.3.3 Effect of varying the train and test sets
6.4 Conclusions

7 Feedback and Validation
7.1 Feedback from SLP(s) and heterogeneous audiences
7.2 Validation
7.2.1 User testing sessions
7.2.2 Questionnaire to SLTs
7.2.3 Validation conclusions

8 Conclusion and future work
8.1 Conclusion
8.2 Future work

Bibliography

List of Figures

1.1 A child practicing an exercise from our proposed solution.
1.2 Proposed game platform.
2.1 Sinusoidal wave. Source: [43]
2.2 Main places of articulation in the vocal tract. Source: [21]
2.3 Broadband spectrograms of nine standard EP oral vowels produced by a female speaker. Source: [31]
2.4 Mel filters in an 8000 Hz signal. Source: [14]
2.5 Flow model proposed by Csikszentmihalyi [42].
3.1 Training game with phonemes for articulation problems. Source: [51]
3.2 Falar a brincar game. Source: [23]
3.3 Scenarios of the robust game with voice exercises for speech therapy. Source: [10]
3.4 Scene from the serious game for the sustained vowel. Source: [29]
3.5 Screenshot from the sPeAK-MAN interface. Source: [48]
3.6 Tool with virtual therapist for aphasia treatment. Source: [39]
3.7 Screenshot from the Interactive Game for the training of Portuguese vowels. Source: [7]
4.1 The figure illustrates the interaction between the character and the target in a scene context.
4.2 Scenarios available for the exercise page.
4.3 Set of characters available, representing both genders and four different ethnicities.
4.4 Available rewards.
4.5 Add child basic info scene.
4.6 Character's falling feedback.
4.7 Client-server architecture.
4.8 Activity diagram of the game platform.
4.9 Start page scenario.
4.10 Choose characters (left) and see rewards (right) scenarios.
4.11 Add child basic info scene.
4.12 Treatment editable parameters.
4.13 SLP and children's game options.
5.1 Scheme for updating MPTe.
5.2 Allowed intensity level variation intervals, ∆L. The figure illustrates a case with Lm = 50 dB, LM = 70 dB, and Le = 60 dB. The orange line illustrates the time-varying speech intensity level achieved by the child, La(t).
5.3 Scheme for updating ∆L, with the influence of the MPT variable.
6.1 Comparative samples with 100 ms from the sustained phonemes /a/, /i/ and /u/, with pitch and formants marked in blue and red, respectively.
6.2 Steps in the development of our vowel ASR system.
6.3 Comparative samples from the sustained phonemes /a/, /i/ and /u/, with 40 filter banks and 13 MFCCs.
6.4 Radial visualization of data set 1.
6.5 Comparative dimensionality reduction for two features, with PCA and LDA techniques.
6.6 Classifiers' performance comparison regarding different train and test splitting methods.
6.7 Classifier performance for the kernel SVM and the random split, regarding the number of MFCCs and different data sets.
6.8 Classifier performance for the kernel SVM with different data sets.
6.9 Comparative feature distributions with radial visualization for each data set with FB.
6.10 Vowel detection confusion matrices.
7.1 Game presentation at the European Congress of Speech and Language Therapy, May 2018.
7.2 Basic information regarding the participants.
7.3 The setup used for the recordings.
7.4 Children's performance during the experiment.
7.5 Results regarding the SLPs' and children's interactions with the game platform.
7.6 Answers regarding question Q10.
7.8 Answers regarding question Q15.


List of Tables

2.1 Ages for each gender of children in the study. Source: [50]
2.2 MPT for children between 4 and 12 years producing the sustained vowel /a/. Source: [50]
2.3 Parameters of the GRBAS scale. Source: [36]
3.1 Comparative table that summarizes the described tools for Speech Therapy.
5.1 Allowed intensity levels and intensity interval sizes in dB (SPL).
5.2 Evolution of the child's performance during four trials.
6.1 Number of samples for each vowel.
6.2 Total number of children that performed the recordings in both data sets.
6.3 Number of samples in data sets 1-6.


Acronyms

APQ Amplitude Perturbation Quotient.

ASR Automatic Sound Recognition.

BVS BioVisualSpeech.

CPP Cepstral Peak Prominence.

DCT Discrete Cosine Transform.

DDA Dynamic Difficulty Adjustment.

EP European Portuguese.

ETMPT Emission Technique in Maximum Phonation Time.

FB Filter Bank.

FFT Fast Fourier Transform.

HNR Harmonic-Noise Ratio.

LDA Linear Discriminant Analysis.

LSVT Lee Silverman Voice Treatment.

MFCC Mel-Frequency Cepstral Coefficients.

MPT Maximum Phonation Time.

PCA Principal Component Analysis.

PhoRTE Phonation Resistance Training Exercise.

PLVT Pitch Limiting Voice Treatment.

PPQ Pitch Perturbation Quotient.

QDA Quadratic Discriminant Analysis.

RF Random Forest.


SLP Speech and Language Pathologists.

SOVT Semi-Occluded Vocal Tract.

SSD Speech Sound Disorders.

SVE Sustained Vowel Exercise.

SVM Support Vector Machine.


Chapter 1

Introduction

We speak not only to tell other people what we think, but to tell ourselves what we think. Speech

is a part of thought.

- Oliver Sacks

1.1 Overview

Speech is one of the most important ways to communicate in current societies. Many

children have speech sound disorders (SSD) that may affect not only their health but also

their social interactions and development process [19].

Deviations in the quality of an individual’s voice are known as dysphonia. These can

be identified through vocal quality parameters, such as the perception of the frequency

produced (pitch) or the intensity of the sound emitted (loudness) [21]. Childhood dyspho-

nia cases can occur as a result of an inappropriate vocal behavior or due to neurological,

physiological or social factors, among others. Studies on vocal analysis with children

between the ages of 2 and 12 years report that voice disorders affect approximately 4

to 38% of children, with hoarseness and breathy voice as the most frequent problems [11,

36, 49].

Dysphonia occurs more often in boys than in girls [36], possibly because of the vocal effort and their personality traits. Until the entrance to primary education, the parameters of vocal quality (Chapter 2 addresses this concept) in children are very similar, regardless of gender, and only tend to diverge when vocal changes occur in boys.

On the other hand, a speech disorder is associated with a problem in the articulation of the sound [21], through the incorrect use of several articulators: throat, teeth, tongue, lips, among other muscles and organs. These failures might be expressed in the exchange of some sounds, the omission of phonemes in words, and other more or less explicit disturbances.

In some cases, voice and speech pathologies can be naturally corrected while children

grow up [5, 21]. In other cases, the child may need to attend speech therapy for recovery

and vocal re-education, both concerning voice and speech pathologies. To detect and

treat dysphonia symptoms, speech and language pathologists (SLPs) in therapy sessions

with children commonly focus on pitch or loudness training, as well as on the maximum phonation time, through the use of the sustained vowel exercise (SVE) [3, 12, 33,

47]. The goal of this exercise is to say a vowel for as long as possible while maintaining

the voice intensity level stable. The SVE is widely used in therapy sessions to evaluate the

patient’s voice quality, detect the existence of dysphonia, the severity of the pathology,

as well as to complement the treatment for dysphonia. For instance, it may be used to

correct hoarse voices. Additionally, this exercise is used with voice professionals like

actors and journalists, who make a constant vocal effort and need to learn how to put

the voice correctly. This exercise is also commonly used in therapy with patients with

Parkinson’s disease [12, 47].

In traditional therapy, dysphonic children usually attend speech therapy sessions only once per week, and as a consequence they might have a slow progress curve, given that they do not repeat the exercises with the desired frequency and cumulative intervention intensity [19, 52]. With a portable solution to practice the exercises without the need for

supervision, it would be possible to perform therapy more often. With more frequent

sessions per week, which is known as intensive training, the results are considerably im-

proved. Repeating the vocal exercises used to correct voice problems may be monotonous

and tiring [5, 13]. Therefore, SLPs usually try to create more appealing sessions through

the use of several techniques, such as board games. Some SLPs even build PowerPoint an-

imations or try to adapt computer games that can be controlled manually: when the child

does the therapy exercise correctly, the SLP uses the PowerPoint animations or makes the game progress to motivate the child to do the exercises.

The possibility of developing this type of tool, combined with a strategy of gamifica-

tion with rewards, introduces a positive stimulus for the child [13]. It induces the child

to practice and to follow the regular training program with a stronger will and improve their results. Such games should fit heterogeneous groups, where each child has capacities and needs that grow while she is performing the exercises within the treatment. Moreover, since different children have different needs, computer games and challenges for speech and language therapy should adapt the difficulty of the tasks to the children's needs and capabilities.

Moreover, children are naturally motivated to use interactive displays. Thus, taking advantage of the current technological advances, several computer and mobile games have been developed to complement traditional speech therapy techniques [13, 19]. Some of these games can assist SLPs in keeping the children motivated to do the therapy exercises, such as the set of applications from LittleBeeSpeech [23], Falar a Brincar [51], and sPeAK-MAN [48], which focus on articulation problems. Alternatively, Flappy Voice [28] and the Interactive Game for the training of Portuguese vowels [7] focus on problems like apraxia and vowel recognition, respectively.

Here we propose a platform that uses the sustained vowel exercise, usually performed

in the traditional therapy sessions. The tool incorporates a player-adaptable system,

to automatically adjust the exercise’s difficulty according to the player’s performance.

Additionally, the tool includes a gamification strategy with UI elements set in a childhood theme. Lastly, we present an automatic sound recognition system to identify

the produced vowel.

1.2 Objectives

This dissertation is part of the BioVisualSpeech project, which includes the partnership of

the Faculty of Sciences and Technologies of NOVA University, Carnegie Mellon University,

INESC-ID, the company Voice Interaction, as well as institutes specialized in speech, the

School of Health of Alcoitão, and the Hospital Center of Lisbon. The project BioVisual-

Speech aims to investigate interaction mechanisms that aid speech therapy with children

and complement traditional therapy sessions with exercises tailored to the children's needs. Therefore, the project has developed a game focused on the treatment of voice and speech pathologies, based on the European Portuguese (EP) language. Our solution aims to contribute a tool for the treatment of voice disorders and seeks to improve the voice quality of children between 5 and 9 years old, balancing the values of maximum phonation time (MPT) and loudness. As a contribution to improving the motivation of children in performing the SVE, we have developed a serious computer game for this exercise.

Given the children's heterogeneous situations, therapists might use different types of age-appropriate exercises so that therapy is as effective as possible. It is also important

to consider the level of difficulty appropriate to the patient. Easy activities may not

sufficiently challenge the child, while overly strenuous activities can become frustrating

for the patient and limit their progress in treatment. Establishing the appropriate level

to the child’s abilities results in better performance and better treatment outcomes. Thus,

the principles of this treatment should begin with a level that allows the child to be

successful, gradually increasing the level of difficulty until the approximate results of

natural communication are achieved, with the minimization or total correction of the

disorder.

Additionally, it is important to include visual feedback and a gamification strategy

with prizes which can contribute to the creation of an interactive game environment,

with a motivational impact and reinforcement of the player’s focus and performance. The

game should allow intensive training, with a platform that automatically recognizes the

utterances produced by the child outside the therapy environment. Figure 1.1 presents a child practicing our solution at home, where she can perform the therapy task with the SVE outside the traditional therapy sessions. The aspects reported so far lead to an engaging and challenging problem to deal with in a dissertation context and are

described in the following chapters.

Figure 1.1: A child practicing an exercise from our proposed solution.

1.3 Proposed Solution

This dissertation presents a game for the SVE practice, an exercise used in therapy sessions

with dysphonic children. Here we propose a new and motivating tool to perform therapy. This game includes the primary characteristics described in Figure 1.2.

Two simpler versions of this game platform have previously been proposed [10, 29].

Diogo [10] developed a solution with a set of scenarios within a child theme, without

an automatic sound recognition system or support for parametrization. The solution by Lopes et al. [29] included two scenes where the child's voice controls the game character and allows manual parametrization by the SLP. Both versions had some limitations, so we developed a fully new version, reusing the scenarios from the previous work.

The game’s current version includes (1) a set of different scenarios andmain characters

which are controlled in real time with the child’s voice. In this way, we try to offer a

game platform interesting to children with different tastes and interests. On the other

hand, (2) the real-time feedback allows the child to become self-aware of the quality of

her utterance, to see her progress while she self-corrects her phoneme productions to

accomplish the game goals. Additionally, the game presents (3) gamification elements to

prompt a healthy and stimulating therapy process.

An essential characteristic of the game is that it can be adjusted to each child’s needs

through a set of customization parameters. We also propose (4) a dynamic difficulty

adjustment (DDA) model that allows the game to meet the child’s changing needs as

well as capabilities, so that the child neither feels frustrated by the inability to solve the tasks nor loses motivation when the tasks are too easy. For that, the game automatically adapts the difficulty of the challenge based

on the child’s performance during the previous trials. Moreover, we implemented (5)

a client-server architecture, to guarantee that the game can be played on the child's mobile device. This architecture spares the load on the device side (client) and forwards it to the server, so that the audio can be appropriately analyzed and accurate feedback returned in real time during the child's trial.

Figure 1.2: Proposed game platform.

1.4 Contributions

• Research concerning the speech therapy area, specifically solutions with the SVE for childhood dysphonia;

• A speech therapy tool with a gamification strategy, with multiple scenes, characters and rewards for children;

• A novel DDA model for children with dysphonia, based on the use of the therapy variables MPT and loudness;

• An ASR system for vowels from the EP language, with the testing of distinct machine learning algorithms and the analysis of the combination of different features;

• A validation process including the target audiences of our game, children and therapists.


Moreover, the dynamic difficulty adjustment model was presented in an accepted

scientific paper [30].

1.5 Document structure

For a description of the work performed during the dissertation, we propose the following

organization of this document:

Chapter 2 (Fundamental Concepts) This chapter discusses the introductory concepts involving speech therapy, from the acoustic and auditory-perceptual characteristics of sound and voice, through the analysis of the degree of severity of dysphonia in children, to the treatments for these disorders. Additionally, we include a description of the machine learning models explored, as well as the dynamic difficulty adjustment concept, in order to support the developed work.

Chapter 3 (State of Art) Chapter 3 covers work related to our problem, specifically tools that have already been developed for speech therapy, with and without identification of specific sounds and phonemes. Moreover, we present tools that include a DDA model and a final comparison between the described tools.

Chapter 4 (Game and Architecture) This chapter focuses on all the details of the game, from the type of exercises included and the story that interconnects the scenarios, to the architecture of the game.

Chapter 5 (A Novel Dynamic Difficulty Adjustment model) In this chapter we discuss the DDA model used in our game, which measures the child's performance with two parameters and decides how to increase the game's difficulty based on this performance measure.

Chapter 6 (Automatic Sound Recognition System) This chapter focuses on the sound recognition system adopted for the classification of sustained vowels. It includes the extraction and preprocessing of the data to be sent to the classifier, prepared through different training and test sets. Lastly, we combine the features with different classifiers and compare the respective results.

Chapter 7 (Feedback and validation) Here we present the feedback received from heterogeneous audiences for the game developed. Furthermore, we describe the validation methodologies applied with the target audiences, children and SLPs.

Chapter 8 (Conclusion and future work) Here we present the final conclusions and a few ideas for future work.


Chapter 2

Fundamental concepts

In this chapter, we present the definitions and terminologies on the theme of voice disor-

ders, concepts that were fundamental to support the elaboration of this dissertation. More

specifically, it was necessary to study sound, the acoustic and perceptual-auditory analysis of voice, as well as the analysis of voice problems and their solutions.

Section 2.1 presents the definition of the main notions of sound and voice and its articulators, the spectral analysis of sound, in particular of the vowels, and techniques for treating voice problems with the use of sustained vowels, since the exercises to be developed are based on this type of exercise. In the following section (Section 2.2), we introduce the feature extraction techniques that support the implementation performed, as well as the classification algorithms that define our ASR system. In the last section, Section 2.3, we present some related work regarding player-adaptable models that we intend to follow.

2.1 Speech Therapy

For each child's therapy process, it is necessary to understand the problem to be addressed and its sources, so that they can be approached with the appropriate tools for each situation. With this purpose, we present some introductory concepts regarding the area. Note that, in this dissertation, we are exclusively focused on the treatment of voice disorders. Thus, any reference to speech therapy problems and their solutions will be associated with that focus.

2.1.1 The sound

The sound can be understood as the result of a mechanical disturbance caused by the

vibration of an object [31]. However, the sensation of hearing is not directly derived from


this same vibration. The vibration of an object may cause the formation of a wave in

air or another medium, provided that the medium is elastic and has inertia, with properties that allow the oscillation of the particles in that atmosphere [53]. When an object vibrates in a medium, take the example of air, the particles tend to move in the direction along

which the object moves. Beginning with the particles closest to the object, energy and

motion are propagated to the adjacent particles and thereafter, from the vibrating source

to the receivers.

In this way, as long as an object has the properties of inertia and elasticity, it can vibrate and, with this, be a sound source. When an object with mass begins its vibration, and if there is no friction opposing the movement, the oscillation continues indefinitely. Applying a force to the object triggers its movement and consequently displaces it from its equilibrium point, up to a given maximum distance, A. Therefore, the distance d(t) to the equilibrium point, θ, exhibits an oscillatory motion between -A and A, following a sinusoid, as shown in Figure 2.1 and in the equation below.
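As a worked formulation (the symbol for the start phase is our own, since the text only names the parameters): denoting by A the maximum displacement, f the frequency and φ the start phase discussed below, the motion can be written as

\[ d(t) = A \sin(2\pi f t + \varphi), \]

which oscillates between -A and A around the equilibrium point.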

Figure 2.1: Sinusoidal wave. Source: [43]

A sinusoidal wave can be characterized by two types of analysis: physical or percep-

tual. The parameters concerning the physical description of the wave are: amplitude,

frequency and start phase or equilibrium point. The amplitude of a sound wave is deter-

mined by the distance between the maximum pressure point and the minimum pressure

point of the wave [53]. Note that the larger the amplitude, the greater the amount of

energy that is carried. Frequency, another physical characteristic of the wave, is measured in hertz (Hz) and is defined as the number of times a full cycle of vibration repeats in one second. Lastly, the start phase is the relative position of an object at the moment the

movement or vibration begins. It is generally defined in terms of degrees of an angle.

The vibratory motion of an object is defined by the previous properties. As already

mentioned, it is also possible to characterize the physical stimulus, that is, the sine wave,

concerning the human perception of the previous parameters. Changes in the amplitude

of a sinusoidal are associated with the loudness. Moreover, the frequency variations

are associated with the pitch. The differences in the perception of the initial phase of

the wave by the two ears result in changes in the perceptual location of the beginning


of the stimulus. The loudness concept corresponds to the individual perception of the sound amplitude: the higher the intensity, the louder the sound is perceived, and vice versa [44]. Additionally, pitch is the human perception of the sound frequency [45]. The higher the frequency, the higher the perceived pitch; the lower the frequency, the lower the pitch of the sound.
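For reference, intensity levels such as the dB (SPL) values used later in this document follow the standard sound pressure level definition (a textbook formula, not one stated in the text):

\[ L = 20 \log_{10}\!\left(\frac{p}{p_0}\right)\ \text{dB (SPL)}, \qquad p_0 = 20\ \mu\text{Pa}, \]

where p is the effective sound pressure and p_0 is the reference pressure near the threshold of hearing.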

Fourier derived the theorem that shows that any vibration can be reduced to the sum of

several sinusoidal waves with their own amplitude, frequency and initial phases [53]. The

wave that results from this sum is called a complex wave. The wave can be represented graphically by means of a magnitude spectrum, in which the magnitude of each sinusoid is represented as a function of its frequency. The graphical representation

of the initial phase of each sinusoidal component as a function of frequency is called

the phase spectrum. The graphical representation of the temporal domain exposes the

amplitude-frequency relation as a function of time. The graphical description according

to these relations results from the application of the Fourier theorem.
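To make the decomposition concrete, the sketch below (illustrative only; the sampling rate, frequencies and amplitudes are made-up values) builds a complex wave from three sinusoids and recovers its magnitude and phase spectra with a discrete Fourier transform:

import numpy as np

sr = 16000                      # sampling rate (Hz)
t = np.arange(sr) / sr          # one second of samples
# A complex wave: the sum of three sinusoids, each with its own
# amplitude, frequency and initial phase (Fourier's theorem).
x = (1.00 * np.sin(2 * np.pi * 200 * t)
     + 0.50 * np.sin(2 * np.pi * 400 * t + np.pi / 4)
     + 0.25 * np.sin(2 * np.pi * 600 * t + np.pi / 2))

spectrum = np.fft.rfft(x)
freqs = np.fft.rfftfreq(len(x), d=1 / sr)
magnitude = np.abs(spectrum) / (len(x) / 2)   # magnitude spectrum
phase = np.angle(spectrum)                    # phase spectrum
# The magnitude spectrum peaks at 200, 400 and 600 Hz, with values
# close to 1.00, 0.50 and 0.25, recovering the three components.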

2.1.2 The voice

The voice can be understood as the sound produced from laryngeal activity, more specif-

ically, the result of the relation between the pressure and velocity of the exhaled air

flow, with the interaction between the articulators of the human respiratory system [21].

Through this interaction, the quality and suitability of the voice can be affected. Voice

quality is associated with common standards, while voice adequacy is associated with a

deviation / variation of these standards, without affecting quality.

2.1.2.1 Voice Measures

For diagnosis and choice of the appropriate treatment, it is necessary to evaluate the vocal

quality of the individual to perceive the degree of pathology, if it exists. The analysis of

these parameters can also be performed for the detection of a possible dysphonia or

laryngeal pathology.

Maximum phonation time (MPT) Maximum phonation time is the maximum time (in

seconds) a person can sustain a vocal sound (for example a vowel), after taking a

deep breath and producing the sound with a comfortable level of intensity [46]. The

values resulting from this measurement express the individual’s ability to control

their respiration [8]. To calculate the sound emission time, the following steps must

be followed:

1. Ask the patient to breathe deeply, and then say the sustained vowel /a/ for as

long as possible, at a comfortable vocal intensity and height. Use a stopwatch

(in seconds) during the exercise to measure the duration of the sustained vowel.

2. Repeat step 1 and record the results as step 2 for the same sustained vowel.


3. Repeat step 1 and record the results again for the same vowel.

4. The MPT is the maximum vowel duration obtained in steps 1, 2 and 3.

The use of the high-pitched vowel /i/, the low-pitched /u/ and the medium /a/ is recommended. The vocal register of each individual influences the MPT, with higher pitched voices typically presenting lower MPT values [21, 46]. Note that very low MPT values, outside normal age and gender patterns, may indicate a vocal dysfunction or laryngeal pathology (these disorders are analyzed in Section 2.1.3).

Table 2.1: Ages for each gender of children in the study. Source: [50]

Age (y:mo)    Boy (n, %)       Girl (n, %)      Total (n, %)
4:0-6:11      185  47.56 aA    204  52.44 aB    389   100
7:0-9:11      483  50.52 aA    473  49.48 aB    956   100
10:0-12:00    156  49.52 aA    159  50.48 aB    315   100
Total         824  49.64       836  50.36       1660  100

Table 2.2: MPT for children between 4 and 12 years producing the sustained vowel /a/. Source: [50]

Age (y:mo)    MPT /a/ Boy       MPT /a/ Girl      MPT /a/ Total
4:0-6:11      6.02 ± 1.77 aA    6.22 ± 1.99 aA    6.12 ± 1.89 a
7:0-9:11      8.05 ± 1.98 bA    7.90 ± 1.98 bA    7.98 ± 1.98 b
10:0-12:00    9.22 ± 2.33 cA    9.05 ± 2.02 cA    9.14 ± 2.18 c

The values presented in Tables 2.1 and 2.2 can be used as normative references for comparison with the MPT samples obtained from the children in the study concerning this dissertation.
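As an illustration of how this measurement can be automated (a simplified sketch under assumed parameters, not the platform's actual implementation), the phonation time can be estimated from a recording by counting the frames whose energy rises above a threshold, and the MPT taken as the maximum over the three trials:

import numpy as np

def phonation_time(x, sr, frame_ms=25, rel_threshold=0.1):
    # Estimate the duration (in seconds) of a sustained phonation:
    # count the frames whose RMS energy exceeds a fraction of the
    # recording's peak frame RMS (both values here are assumptions).
    frame = int(sr * frame_ms / 1000)
    n = len(x) // frame
    rms = np.array([np.sqrt(np.mean(x[i * frame:(i + 1) * frame] ** 2))
                    for i in range(n)])
    voiced = rms > rel_threshold * rms.max()
    return voiced.sum() * frame / sr

# trials: hypothetical list with the three recordings of the sustained
# vowel /a/; the MPT is the maximum duration over the three trials.
# mpt = max(phonation_time(x, sr) for x in trials)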

Fundamental frequency (F0) The fundamental vocal frequency is defined as the fre-

quency of the sound produced by the vocal folds, i.e., it represents the vibration

of the vocal folds per unit of time [8]. Changes in F0 have been associated with

human growth and development. In general, there is a marked decrease in the first

two to three years of life, and from there a gradual decline to puberty. The study

cited in [21] has shown variation in F0 between vowels, more specifically:

• Vowel /u/ - 177,4Hz;

• Vowel /i/ - 174,9Hz;

• Vowel /a/ - 160,9Hz;


The same authors put /u/ and /i/ in the category of the [+high] vowels and /a/ in the category of the [+low] vowels. Moreover, a variability can be identified between the F0 of sustained vowels and the F0 resulting from speech production, which is lower in the latter case [21]. Regarding the analysis of F0 in voices with or without dysphonia, the measurements obtained do not significantly distinguish the majority of individuals with pathological voices from individuals with common voice patterns.

Frequency perturbation quotient (PPQ) The PPQ is a method of extraction of jitter

(fundamental frequency disturbance) and is calculated by averaging the frequency

perturbations for each cycle [36].

Amplitude perturbation quotient (APQ) The APQ is a method of extracting the shim-

mer (amplitude perturbation) and is calculated by averaging the amplitude pertur-

bations for each cycle [36].
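As a concrete instance (a common five-point formulation in the style of the Praat definitions; the exact window length used in [36] is not stated here), with T_i the period of cycle i, A_i its peak amplitude and N the number of cycles:

\[ \mathrm{PPQ}_5 = \frac{\frac{1}{N-4}\sum_{i=3}^{N-2}\bigl|T_i - \frac{1}{5}\sum_{j=i-2}^{i+2} T_j\bigr|}{\frac{1}{N}\sum_{i=1}^{N} T_i}, \qquad \mathrm{APQ}_5 = \frac{\frac{1}{N-4}\sum_{i=3}^{N-2}\bigl|A_i - \frac{1}{5}\sum_{j=i-2}^{i+2} A_j\bigr|}{\frac{1}{N}\sum_{i=1}^{N} A_i}. \]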

Spectral noise measurements The Harmonic-Noise Ratio (HNR) is a measure of the noise disturbance, calculated as the proportion of harmonics present in a spectrum relative to the proportion of noise in the same spectrum [36]. A more detailed description of the vocal spectra is given in Section 2.1.2.5.

GRBAS scale - Subjective method of vocal analysis The GRBAS scale [36] measures the degree of dysphonia through a scale that varies between the values presented in the first column of Table 2.3. These measures are analyzed subjectively by the SLP, regarding four vocal alteration parameters.

Table 2.3: Parameters of the GRBAS scale. Source: [36]

Overall degree of severity    Perception-auditory characteristics
0 - normal or absent          Roughness (R)
1 - slight                    Breathy voice (B)
2 - moderate                  Asthenia (weak voice) (A)
3 - severe                    Strain (S)

An example is the result G1R2B2A0S1, which translates to the following dysphonia: slight global degree, moderate roughness, moderate breathiness, absent asthenia and slight strain.

Cepstral peak prominence (CPP) The CPP is an acoustic measure that has been shown to be quite promising for measuring the degree of severity of disturbances in the voice [6, 17, 24]. This measure helps define the level of harmonicity in the voice: voices with greater signal periodicity, i.e., with higher CPP values, are considered more harmonious. The great advantage of CPP is that it does not depend on the quality of the recorded sound or on its volume differences for measuring the parameters of interest. Additionally, it is not necessary to analyze periodic or extended sound samples for a valid production of the CPP result.

2.1.2.2 Phonation

When speech is produced, air is exhaled from the lungs, passing through the throat [53].

It is at the top of the throat, more specifically in the larynx, that the vocal folds are located; they vibrate in response to air expiration and under the control of the muscles. The vibration of the folds in sound wave form causes resonance 1 in the vocal tract. Voiced sounds are produced with vibration of the vocal folds, while unvoiced sounds are produced without vocal fold vibration.

2.1.2.3 Articulation

The articulatory system is composed of several organs responsible for the production

of speech and are mostly located in the oral cavity [31]. The vocal tract is divided into

two zones: the anterior zone, which lies between the lips and the hard palate, and a

posterior zone, which encompasses the remaining articulators, represented in figure 2.2,

up to the posterior wall of the pharynx. The intervention of these articulators allows

the production of speech, characterized by distinct articulation modes that give rise to

the production of different sounds in the communication process.

Figure 2.2: Main places of articulation in the vocal tract. Source: [21]

2.1.2.4 Articulatory classification of the European Portuguese (EP) vowels

The articulatory classification is defined as a form of categorization of speech sounds 2 [31].

The tongue is the main articulator responsible for defining the vowels. More specifically,

the height of the dorsum of tongue (high, medium or low relative to the neutral position

in the oral tract) and the relation to the point of articulation (the dorsum of the tongue advances, maintains or retreats from the neutral position 3) are forms of differentiation and distinction between the vowels.

1 The resonance can be understood as an acoustic phenomenon, since the vibration originating in the vocal folds is transmitted to the adjacent cavities by the agitation of the air particles between the structures [21].
2 All the dialectal phonetic variants of the vowels [31], which are the result, for example, of regionalisms, were excluded from this analysis.

It is necessary to take into account the role of other articulators for the classification

of the articulatory point of consonants. However, this dissertation does not focus on the

study of the production of EP consonants, only on the vowels.

2.1.2.5 Vowels Spectrum

A spectrogram is a representation of the acoustic signal, since it reflects in the spectrum

the properties of the sound produced [21]: (1) Time on the horizontal axis; (2) The

frequency on the vertical axis; (3) The amplitude perceptible by the intensity of the

horizontal bars produced by the spectrogram. As a representation of the complex sound

wave, its frequencies are represented separately: the fundamental frequency (first harmonic) and its multiple frequencies (harmonics) [31]. For this decomposition to be possible at the spectral level, narrowband and broadband filters are used. Broadband filters allow a higher temporal resolution, while narrowband filters increase the frequency resolution.

For the distinction of phonemes, namely the vowels presented in the figure 2.3, we can

analyze the spectral patterns, called formants, which are dependent on the physical char-

acteristics of the supraglottal cavities [53]. The resonance properties of each component

of the vocal tract might intensify or weaken the acoustic signal, which is why the spectral

image produced by the respective vibrations denote different phonemes. Note that the

vowels have a well-defined pattern of formants [31]. Thus, they are easily distinguished

at the spectral level, as we can see in Figure 2.3.

The application of this technique to the analysis of the vocal signal, and of the spectral noise visible between the harmonics, can be used to analyze the perturbations present there [21]. Noise in spectral images of sustained vowels allows the measurement of different levels of dysphonia severity.
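For illustration, a broadband spectrogram of the kind shown in Figure 2.3 can be computed as follows (a sketch with assumed parameters; the pure tone merely stands in for a recorded vowel):

import numpy as np
from scipy import signal

sr = 16000                                        # sampling rate (Hz)
x = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)  # stand-in for a vowel

# Broadband analysis uses short windows (high temporal resolution);
# a narrowband spectrogram would use longer windows instead.
f, t, Sxx = signal.spectrogram(x, fs=sr, window="hamming",
                               nperseg=int(0.005 * sr))  # 5 ms frames
# Sxx[i, j] is the power at frequency f[i] and time t[j]; in real vowel
# recordings, the formants appear as intense horizontal bands whose
# positions distinguish the vowels.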

2.1.3 Classification of voice disorders

The incorrect and violent use of phonation structures, for instance, vocal abuse 4 can

cause organic changes in the vocal folds and attached musculature [21]. A voice dis-

turbance, dysphonia, occurs when voice quality, pitch or loudness vary inappropriately

from normal patterns for an individual of a given age, gender, cultural background, or

geographic location [1].

3 Rest position of the tongue [31], usually in a central position in the oral tract.
4 Vocal abuse [21] encompasses a set of behaviors that impair vocal health, such as smoking habits, medications or drugs, poor hydration, prolonged use of excessive vocal volume, or even the type of personality (anxiety or stress).


Figure 2.3: Broadband spectograms of nine standard EP oral vowels produced by a femalespeaker. Source: [31]

2.1.3.1 Functional versus organic dysphonia

Dysphonia can be classified as organic when it has physiological origin, either due to a

disturbance in breathing or in the mechanisms / components of the vocal tract [2, 21].

Within this category, dysphonia may be structural or neurological. The first case applies

when there are physical changes in the mechanisms of the vocal tract, such as localized

mass lesions or tissue changes. A neurological dysphonia refers to problems with the

central and / or peripheral nervous system which affects the nerve connected to the

larynx, conditioning the voice normal function. For instance, those problems might be

visible through the trembling in the voice, spasmodic dysphonia or even paralysis of the

vocal folds. On the other hand, functional disorders are idiopathic, e.g., they are diseases

whose cause is unknown and for which no explanation can be found [2, 21]. In these

cases, the use of vocal folds and vocal tract mechanisms tends to be inappropriate or even

inefficient, even without structural changes.

2.1.3.2 Dysphonia based on the perceptual vs. acoustic phenomenon

The vocal disturbances can be perceived based on the human perception and acoustic

levels of the voice, by analyzing the pitch, loudness or resonance of the voice [21]. Pitch problems are perceived as breathy or harsh voices, or as hoarseness (the combination of the two).


Physiologically, the vocal folds present an inefficient behavior compared to normal, with low vibration, which can be caused by several laryngeal diseases. Concerning audible loudness problems, the cause is primarily a hearing or learning deficit, and the voice is typically monotone, with no variation in the intensity and speed of speech. Additionally, the perception of a disturbance in the resonance can be a consequence of incorrect tongue postures, the dimensions of the tract, or problems of nasal assimilation.

2.1.4 Treatments for voice disorders

Depending on the type of disturbance in the child's speech, different treatment categories may be applied [3]: voice therapy at the physiological or symptom level. Since the area encompasses numerous types of treatments, only treatments based on the use of sustained vowels are specified in this document. Within physiological voice therapy, the phonation resistance training exercise, the Lee Silverman technique (LSVT) treatment and the pitch limitation treatment may be used.

Lee Silverman Voice Treatment (LSVT) This treatment was originally created for the

treatment of Parkinson’s diseases [3, 12, 47]. Although it is already being used for

therapy of other pathologies in the voice, for instance, dysfunctions in the breathing

and in the larynx. The LSVT may use exercises with the sustained vowel or with

small phrases. During treatment, patients should:

1. think loud / think shout,

2. initiate sound production through a maximally sustained vowel, and

3. repeat the exercise at least once more.

This treatment should be intensive so that it has a long-term effect and allows

patients to recalibrate the loudness.

Pitch Limiting Voice Treatment (PLVT) PLVT is very similar to LSVT. This treatment

uses the same exercises for the proposed effect [47]. Additionally, it also allows lim-

iting the pitch increase and thus preventing vocal pressure. The phrase that serves

as a motto for patients is "speak loud and low", so that sustained sound production

follows this pattern.

Phonation Resistance Training Exercise (PhoRTE) PhoRTE is another type of treatment,

which combines the treatment of loudness and pitch and has been applied to im-

prove vocal quality and decrease phonation effort [3]. The treatment should include

the following steps:

1. the sustained vowel production /a/ with the maximum intensity of the sus-

tained phoneme,


2. the sustained vowel production /a/ with loudness increase and pitch along

sound production,

3. production of sentences with high loudness and high pitch, and

4. finally, production of the same sentences with high loudness and low pitch.

Emission Technique in Maximum Phonation Time (ETMPT) This technique was tested

in the field of spasmodic conduction dysphonia, a type of disturbance in the voice of

neurological origin [33]. This technique aims to promote glottic resistance, improve

phonatory stability and adjust glottic coaptation. The treatment uses the sustained

vowel /a/, whose steps are similar to the execution of the LSVT.

Concerning symptomatic voice therapy with sustained vowel exercise, the straw

phonation exercise of the semi-occluded vocal tract (SOVT) exercise set can be per-

formed [3]. Therapy with SOVTs aims to maximize the interaction between the vocal

folds and the vocal tract, in order to facilitate the production of a resonant sound. The

straw phonation exercise is intended to increase the pressure on the vocal folds by keep-

ing them separate during the phonation time with the sustained vowel with the aid of a

straw or tube.

2.2 Speech processing and Machine Learning

2.2.1 Spectrum Analysis

The vowels can be classified based on the analysis of segments of the sound spectrum taken in a stable state of the signal [15]. As already mentioned in 2.1.2.5, the formants present in the respective spectra of each vowel define patterns that easily distinguish the vowels from each other [4, 15, 35, 37]. These patterns can be represented by the low-frequency resonance peaks, namely the first two formants, f1 and f2, which might be complemented with the information in the third formant. However, there are limitations in the spectral analysis of the formants, due to some particular characteristics of the child's speech in the phonemes where the pitch is higher [27, 37]. In these situations, the distance between the harmonics of the sound produced tends to be larger and, consequently, the harmonics are more likely to coincide with the central frequency of the formants.
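As an illustration of this spectral approach, the sketch below estimates the first two formants of a voiced frame from the roots of an LPC polynomial, a common technique for formant analysis. It is shown only as an example and is not the method adopted in this dissertation; librosa is assumed to be available, and the model order is an illustrative choice.

import numpy as np
import librosa

def first_formants(frame, sr, order=12):
    # Fit an LPC (all-pole) model to a voiced frame of audio samples.
    a = librosa.lpc(frame.astype(float), order=order)
    # Keep one root of each complex-conjugate pair (positive imaginary part).
    roots = [r for r in np.roots(a) if np.imag(r) > 0]
    # Convert root angles (radians/sample) to frequencies in Hz and sort them.
    freqs = sorted(np.angle(r) * sr / (2 * np.pi) for r in roots)
    return freqs[:2]  # rough estimates of f1 and f2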

2.2.2 Speech processing and extraction

As mentioned previously in section 2.1.1, the sound can be described based on physical or acoustic parameters. The acoustic analysis is related to the way the human being perceives the sound, which does not follow a linear scale. MFCCs are the sound features most used today; they provide robustness to the linguistic content produced and an attenuation of the noise present in the signal. The MFCCs are a parametrized representation of the acoustic signal, with the noise present reduced, based on the application of the Fourier transform to each segment of the signal [20]. In addition, their computation involves mapping the energy of each segment through the mel-scale frequency filter bank 5. The result is a compressed and equalized spectrum of short duration.

Setup and Pre-emphasis The pre-emphasis filter on the signal amplifies the high frequencies [14]. It balances the frequency spectrum, considering that lower frequencies usually have higher magnitudes than higher ones. The goal of pre-emphasis is to compensate for the high-frequency part that was suppressed during the human sound production mechanism.

Framing In a sound signal, the frequencies change over time [20]. By slicing the signal

into frames, we can obtain the frequency contours of the signal. If the frame is too

short, it might not have enough samples to get a reliable spectral estimate. On the

other hand, if the frame is too long, the signal changes too much throughout the

frame.

Window After slicing, a window function is applied to deal with FFT limitations [34]. When the FFT is applied, it assumes that the data set is a continuous spectrum, one period of a periodic signal. However, the frame might not contain a continuous time signal; it might include sharp transitions and discontinuities, which cause spectral leakage. Windowing reduces the amplitude of these discontinuities at the boundaries of the frame. Each frame is multiplied by a Hamming or Hanning window, to keep the continuity between the first and last points in the frame.

Fourier-Transform and Power spectrum The Fourier transform deconstructs a time domain representation of a signal into frequency domain components with discrete values, the bins. The computation of the power spectrum generates a periodogram, which allows the identification of the frequencies present in the frame.

Filter Banks The mel frequency scale was developed based on observations of human perception of stimuli with variations in frequency tones [7]. The mel scale is applied to simulate the nonuniform sensitivity of human hearing to different frequency tones: the filters become more widely spaced with increasing frequency.

Take the logarithm of the filter banks A logarithmic transformation is applied to the filter bank vectors because loudness is not perceived on a linear scale [14]. Using a logarithm function also allows us to use cepstral mean subtraction as a normalization technique.

Take the Discrete Cosine Transform (DCT) This function is normally used for data com-

pression since it concentrates the amount of information in the first few points [20].

5 The mel scale of frequencies is based on pitch perception. Since the human auditory system does not interpret pitch in a linear way, the mel scale follows the human perception of frequencies (linear in the 0-1000 Hz range and logarithmic above 1000 Hz) [20].


Figure 2.4: Mel filters in an 8000 Hz signal. Source: [14]

Therefore, this step takes the 26 log filter bank energies from the previous step and transforms them into 26 cepstral coefficients, the Mel Frequency Cepstral Coefficients. Usually, for ASR purposes, only the lower 12-16 coefficients are used.

Mean normalization The mean of each coefficient across all frames might be subtracted in order to balance the spectrum and improve the signal-to-noise ratio (SNR) 6.
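The following is a minimal sketch of the pipeline just described, implemented with NumPy/SciPy; each numbered comment maps to one of the steps above. The frame length, the 26 filters and the 13 kept coefficients follow the values mentioned in the text, while the remaining constants (512-point FFT, 0.97 pre-emphasis) are common choices and not necessarily the ones used in this work. The sketch assumes the signal is at least one frame long.

import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sample_rate=16000, frame_len=0.025, frame_step=0.01,
         n_filters=26, n_ceps=13, nfft=512, pre_emphasis=0.97):
    # 1. Pre-emphasis: amplify the high frequencies of the signal.
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])

    # 2. Framing: slice the signal into short overlapping frames.
    flen, fstep = int(frame_len * sample_rate), int(frame_step * sample_rate)
    n_frames = 1 + max(0, (len(emphasized) - flen) // fstep)
    frames = np.stack([emphasized[i * fstep:i * fstep + flen]
                       for i in range(n_frames)])

    # 3. Windowing: a Hamming window reduces leakage at the frame boundaries.
    frames = frames * np.hamming(flen)

    # 4. FFT and power spectrum (periodogram) of each frame.
    power = (np.abs(np.fft.rfft(frames, nfft)) ** 2) / nfft

    # 5. Mel filter bank: triangular filters equally spaced on the mel scale.
    high_mel = 2595 * np.log10(1 + (sample_rate / 2) / 700)   # Hz -> mel
    mel_pts = np.linspace(0.0, high_mel, n_filters + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)               # mel -> Hz
    bins = np.floor((nfft + 1) * hz_pts / sample_rate).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    energies = np.dot(power, fbank.T)
    energies = np.where(energies == 0, np.finfo(float).eps, energies)

    # 6. Logarithm of the filter bank energies (loudness is not linear).
    log_fb = np.log(energies)

    # 7. DCT compresses the 26 log energies; keep the lower coefficients.
    ceps = dct(log_fb, type=2, axis=1, norm='ortho')[:, :n_ceps]

    # 8. Mean normalization to balance the spectrum and improve the SNR.
    return ceps - np.mean(ceps, axis=0)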

2.2.3 Additional Sound Features

Delta - Differential Coefficients Represents the changes in coefficients between consec-

utive frames and the returned matrix has the same size and data type as the original

coefficients array.

Double Delta - Acceleration Coefficients Represents the changes in delta values from

one frame to another. The returned matrix has the same size and data type as the

original coefficients array.
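A small sketch of the standard regression formula commonly used to compute these differential coefficients follows; the window width N = 2 is a usual default, and the function name and the edge padding are illustrative choices, not taken from this dissertation.

import numpy as np

def delta(feat, N=2):
    # Differential coefficients: a weighted slope over a window of +/- N frames.
    # The output matrix has the same shape as the input, as described above.
    denom = 2 * sum(n * n for n in range(1, N + 1))
    padded = np.pad(feat, ((N, N), (0, 0)), mode='edge')  # repeat edge frames
    return np.array([
        sum(n * (padded[t + N + n] - padded[t + N - n]) for n in range(1, N + 1))
        for t in range(len(feat))
    ]) / denom

# Double delta (acceleration) coefficients are the delta of the delta:
# ddelta = delta(delta(mfcc_matrix))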

2.2.4 Classification algorithms

The classification algorithms to be applied were chosen based on previous studies involving the classification of vowels [7, 10], in which high classification results were obtained.

Quadratic discriminant analysis (QDA) The QDA is a classic classifier [38]. In particular, it can learn quadratic decision boundaries, being more flexible than linear approaches. For the QDA application, the covariance matrices of the extracted acoustic parameters are used, and it is assumed that these matrices are different for each category.

Support Vector Machine (SVM) The SVM is a supervised learning algorithm and can be used for regression and for classification between two classes [38]. When used for classification, the algorithm seeks to maximize the margin between the classes of interest, using support vectors to define these margins. In the scope of this dissertation, it is used as a classifier to separate the vowels /a/, /e/, /i/, /o/ and /u/, based on the feature vectors that distinguish them.

6 The signal-to-noise ratio is defined as the ratio between the power of the signal and the power of the background noise, expressed in decibels [20].

In order for the algorithm to be able to classify sets that are not linearly separable, a kernel function can be used, so that the training vectors are mapped to higher dimensions. In the study that included the classification of vowels in children [10], the same algorithm was used with the Gaussian radial basis kernel function, with high and reliable results.

Random Forest Classifier (RF) The RF algorithm works as a large collection of decorrelated decision trees, where N-tree is the number of estimators chosen [38]. Each decision tree is created according to hierarchical splits, where each split originates new nodes of the tree. In turn, each split tries to minimize the entropy of the data. Thus, the optimal split maximizes the separation of the data points between the resulting leaves.

The RF is based on ensemble learning: for a new data point, each one of the N-tree trees predicts the category to which the data point belongs, and the new data point is assigned to the category that wins the majority vote.
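As an illustration, the three classifiers can be compared with scikit-learn as sketched below, assuming a feature matrix X with one MFCC-based vector per utterance and a label vector y with the five vowels; the hyperparameter values shown are illustrative defaults, not the settings used in this work.

from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def compare_classifiers(X, y):
    classifiers = {
        'QDA': QuadraticDiscriminantAnalysis(),          # one covariance matrix per class
        'SVM': SVC(kernel='rbf'),                        # Gaussian radial basis kernel
        'RF': RandomForestClassifier(n_estimators=100),  # N-tree = 100 estimators
    }
    for name, clf in classifiers.items():
        scores = cross_val_score(clf, X, y, cv=5)        # 5-fold cross-validation
        print(f'{name}: {scores.mean():.3f} +/- {scores.std():.3f}')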

2.3 Player-adaptability models

There are several strategies that can be followed in order to implement a serious game

adaptability model [26]. These include approaches that control the state of the game by

varying global resources or specific exercise variables. The Rubber Banding technique is

generally used in racing games, such as the Mario Kart game [26, 41]. The idea behind this technique is based on the manipulation of the available resources in the game, so that the performance of a player starts very limited, within a certain threshold. In the beginning, the system offers a limited set of resources to the player, so he can progress in the game, with forward and backward movements towards achieving success. This technique challenges the player to overcome new tasks until he reaches a maximum level, where the resources are fully available. What happens in these cases is that the game presents itself rather less accessible for novice players than for experienced players.

In addition to this approach, the flow model tends to be widely used (figure 2.5) [42].

This model controls the resources and variables of the game according to the player’s

experience with the game platform. More specifically, this control is achieved by bal-

ancing the proposed challenge with the player’s skills. In order for the user to maintain

interest in the game he must remain in a state limited to the flow channel as illustrated

in figure 2.5. The figure shows a repeating cycle of increasing challenges, until a thresh-

old is reached and the player receives a reward or some new resources to motivate him

to keep on playing. This state is followed by a less challenging period, until the game's variables change again, taking the challenge to new heights. The flow model was defined


Figure 2.5: Flow model proposed by Csikszentmihalyi [42].

as a generalized scheme. It is important to understand the game variables that affect the

player’s experience to define how to make the game progress.


Chapter 3

State of Art

Computational therapeutic interventions have shown to be essential tools to complement

traditional therapy sessions since they can be used in an informal and comfortable learn-

ing environment. Among the tools available for therapeutic use, some focus on problems

in speech articulation and others on voice disturbance, and we present them throughout

this chapter. Since we want to develop a dynamic difficulty adjustment model, we also analyze other platforms with this functionality. Lastly, we summarize the main differences between these systems.

3.1 Tools for speech therapy

Some of the tools we are presenting do not use acoustic analysis, and so, only offer visual interaction with the game. These games are briefly described in section 3.1.1. There are also sound-aware tools that do not detect any specific phonemes, which are introduced in section 3.1.2. Lastly, systems with specific phoneme detection are briefly presented in section 3.1.3.

3.1.1 Without sound recognition

Some tools available for speech therapy do not include acoustic analysis and offer other types of exercises to complement speech treatment. The Little Bee Speech website offers a range of articulation training applications in English or Spanish [51].

More specifically, the Articulation Station, as shown in figure 3.1, provides exercises

for practicing isolated words or phrases in the context of different stories, with the pos-

sibility of including optional exercises with reproduction of the presented sounds. It

allows the player to train any sound through an interactive and childlike environment.

However, since the app does not detect sound, it cannot give the child feedback on whether the requested sound reproduction was correct or not. This app does not focus on a particular disturbance.

Figure 3.1: Training game with phonemes for articulation problems. Source: [51]

The tool Falar a brincar, illustrated in figure 3.2, provides an interactive interface

without sound feedback, whose exercises allow syllables to be counted and identified in

the word [23]. Given that there is no ASR system to recognize the sounds produced, the practice of the exercises must be performed within the sessions with an SLT. Unlike the previous game, Falar a brincar is intended for the EP language.

Figure 3.2: Falar a brincar game. Source: [23]

The robust game with voice exercises in the field of speech therapy is included in

the BVS project and is focused on the practice of phonemes, based on a treasure map

theme, with gifts to conquer in each exercise [10]. For further mention, we call it BVS

tool 1. Figure 3.3 reveals some of the scenarios included in the game. The possibility of winning rewards challenges the child to practice the set of available exercises, which is an exciting gamification strategy. However, this tool can only be used in a session environment since, through a specific key, the therapist must give feedback in the game according to the child's performance in producing the requested sound. Otherwise, the system cannot display a response to the child's behavior.


Figure 3.3: Scenarios of the robust game with voice exercises for speech therapy. Source: [10]

The limitation of many of these games and systems, such as the set of the Little Bee Speech exercises and Falar a Brincar, is the lack of automatic phoneme recognition.

They depend on the support of an adult to judge the child’s speech productions and

manually make the game progress. On the other hand, our proposed game automatically

responds to the child’s voice, and so, it overcomes this limitation.

3.1.2 With unspecified phoneme recognition

The serious game with the sustained vowel and adaptive difficulty for speech therapy is included in the BVS project and focuses on the practice of sustained phonemes [29]. For further mention, we call it BVS tool 2. The game includes two scenarios involved in a childhood theme, as we present in figure 3.4. In this case, it is possible to identify if the sound is being produced in a sustained way. Additionally, it allows the therapist's parametrization of the difficulty within each trial.

The Lee Silverman Voice Treatment (LSVT) companion system is a computer tool that uses the SVE to complement treatment sessions and speech therapy assessment [16, 22]. The tool focuses on patients with Parkinson's disease and other neurological pathologies, including children with dysarthria or cerebral palsy. The crucial feature in this therapy program is the intensive voice treatment to improve voice quality, specifically vocal loudness.

The tool allows practicing the SVE and other continuous speech exercises, while it

records the speech parameters of interest. Although children can use this software, it

does not offer an attractive interactive interface for them. It was designed to be a tool, not a game. Moreover, the tool does not allow real-time customization, and so, the tasks' parameters must be chosen manually before each session starts.

Figure 3.4: Scene from the serious game for sustained vowels. Source: [29]

3.1.3 With identification of specific phonemes

Tools without sound analysis lack the flexibility to practice the exercises outside the ther-

apy session and, consequently, do not allow the strengthening of the exercises practiced

during the session. The following examples introduce sound recognition and solve part of

the problem. sPeAK-MAN, illustrated in figure 3.5, uses a popular and well-known game,

Pac-Man, to motivate the player to practice the vocalization of words usually performed

in a therapeutic environment [48]. The feedback of the player’s performance is delivered

in real time for each sound production.

Figure 3.5: Screenshot from the sPeAK-MAN interface. Source: [48]

In the case of VITHEA, illustrated in figure 3.6, a virtual therapist is available for the treatment of aphasia, especially for cases with difficulty in producing some words [39]. It includes automatic word recognition and prompts the user of the tool to correctly pronounce the visual or audible stimulus presented to them. The disease under treatment, aphasia, does not usually occur in children, and so, the tool does not offer other types of exercises focused on them.

Figure 3.6: Tool with virtual therapist for aphasia treatment. Source: [39]

The game Flappy Voice is adapted from a popular game, Flappy Bird, but in this version, the player's voice controls the bird's movement [28]. The initial position of the bird is mapped according to the intensity of the child's voice, and the child has to vary the sound intensity so that the character does not collide with the obstacles. This tool allows the repetition of the exercise in terms of time and loudness thresholds. The therapist can also define different levels of difficulty. The game offers an assisted mode, which limits the user's skills according to the therapeutic settings, and an advanced mode, where this limitation does not exist.

Figure 3.7: Screenshot from the Interactive Game for the training of Portuguese vowels. Source: [7]

The Interactive Game for the training of Portuguese vowels uses a simple car race

theme [7]. This game offers an interactive application, as we present in figure 3.7, entirely

controlled with the pronunciation of isolated vowels (a, e, i, o, u) which are classified


with an ASR system. This game does not allow the therapist’s parameterization and does

not include a player-adaptable model to fit the child’s needs.

3.2 Tools with a DDAmodel

As we have mentioned, it is essential to keep the child motivated during the in-game

experience. Thus, a gaming platform should prepare different challenging scenarios, for instance, through difficulty levels, to stimulate the player to improve his performance and continue playing. In a therapy context, this appealing environment should promote the evolution of the therapy variables and, therefore, contribute to the treatment's progress.

Some of the previous systems implement a basic parametrization method. Falar a Brincar and sPeAK-MAN include predefined difficulty levels. The passage to the next level involves completing a set of tasks which become harder, level after level. Moreover, there are other systems where the player-adaptable concept can be parameterized by the therapist, although they incorporate a simplistic approach.

For instance, the Articulation Station allows the therapist to customize the list of

words to use in the exercise and so, the level of difficulty, by choosing more complex

words. Flappy Voice can adapt the game to the needs of each child. Specifically, the SLP

can create new scenarios with an arbitrary number of obstacles, or adapt the difficulty of the game by changing two parameters: the reaction time of the bird to the input sound and the vertical distance between the obstacles, which allows the bird to cross. The BVS tool 2 includes a manual parametrization approach with two variables: the MPT variable and the intensity level. However, regardless of the intensity level chosen, there is no possibility to change the intensity interval. So, if the child's intensity production oscillates, the system assumes it as correct.

Alternatively, Yun et al. present a less simple methodology that automatically adjusts the game difficulty using a profile-based adaptive difficulty system (PADS) [54]. They want to improve the gaming experience by using player profiles to determine the best difficulty level for each player. To create a player profile, they use the player's prior gaming experience and his preferences. Then, with these parameters, they set the game difficulty adjustment thresholds, and the PADS uses a performance-based algorithm to adjust the difficulty settings to the player. To do this, they transform the player's performance data into a point scale (these points are calculated using a predefined threshold system depending on the player profile). The difficulty level changes whenever the thresholds are crossed: if the output is greater than the positive threshold, they increase the difficulty level; on the other hand, if the output is less than the negative threshold, they decrease the difficulty level.

Another approach to give an appropriate challenge level to each player is presented

by Demediuk and colleagues [9]. They developed an adaptive training framework to

construct an opponent, whose strategies and behavior adapt to the progress of the player.

Their goal is to alter the level of challenge of the opponent according to the changes in


the player's proficiency. More specifically, the player competes against the AI opponent, which adapts its level of challenge by using Dynamic Difficulty Adjustment, which means changing strategies based on real-time interaction with the player. They present a

comparison between the behavior of the opponent against the player, and against fixed

difficulty levels. After that, it is necessary to relate player proficiency to the difficulty

level of the opponent. Finally, it monitors the player’s proficiency level and adjusts the

adaptive AI opponent when necessary.

With our DDA proposal, we offer a dynamic strategy to parametrize the game difficulty, and we allow the initial choice of the MPT value, the intensity level, and the intensity variation, from an easier/larger interval to a harder/reduced interval, allowing the child to stabilize her production in phases.

3.3 Tools comparison

Table 3.1: Comparative table that summarizes the described tools for Speech Therapy.

Tool                                                    | Pathology                        | Idiom | Platform                 | ASR | DDA
Articulation Station                                    | Articulation                     | En    | iPad, iPhone             | No  | Yes
Falar a brincar                                         | Phonological awareness           | Pt    | Android                  | Yes | Yes
BVS tool 1                                              | Articulation and voice disorders | Pt    | Computer                 | No  | No
BVS tool 2                                              | Dysphonia                        | Pt    | Computer                 | No  | No
LSVT                                                    | Parkinson                        | En    | Computer                 | No  | No
sPeAK-MAN                                               | Articulation                     | En    | Computer + Kinect sensor | Yes | Yes
VITHEA                                                  | Aphasia                          | Pt    | Online                   | Yes | No
Flappy Voice                                            | Apraxia                          | En    | Mobile                   | Yes | Yes
Interactive Game for the training of Portuguese Vowels | Vowel recognition                | Pt    | Computer                 | Yes | No
Proposed solution                                       | Dysphonia                        | Pt    | Computer, mobile         | Yes | Yes

Table 3.1 summarizes the main differences between the therapy tools previously described in section 3.1. These systems focus on different pathologies, idioms, and platforms. We also highlight the existence of an ASR system, as well as of a DDA model. However, none of the tools fulfills the requisites we desire. Specifically, our proposed solution focuses on voice disorders for children with dysphonia; it should run on both computer and mobile devices and incorporate an ASR system and a DDA scheme. Therefore, the following chapters cover a detailed explanation of the game functionalities, including a novel DDA model, and the complete description of the methodology to prepare the ASR system.


Chapter 4

Game and Architecture

In this chapter, we present in greater detail our game-based approach, which we previously introduced in chapter 1. We start with a detailed description of the primary exercise,

the game functionalities, main theme, scenes, and transitions, the game architecture and

other details concerning the game’s implementation.

4.1 Proposed game

The solution addressed in this dissertation focuses on the practice of the sustained vowel

exercise towards the treatment of voice disorders. As mentioned before, it is essential

to develop a tool that helps speech therapists, in the session and outside the therapy

environment. In order to keep the children interested and to stimulate learning, the

therapy should be a motivating and relaxed process, appropriate to the age, gender and tastes of each child.

Computer-based sessions with interactive interfaces might fulfill these requirements,

with a stimulus that can be represented through gifts, collected throughout the practice

of the exercises. Additionally, depending on the characteristics of the child in therapy,

numerous parameters must be adapted appropriately to each situation that might arise

during the course of the therapy. The intensity of therapy sessions can affect the child’s

performance. A more significant number of sessions per week strengthens the positive

results of the exercises and tends to accelerate the child’s progress. In this way, having

a portable tool that can be used in different spaces and outside the traditional therapy

environment adds training moments to the treatment besides the advantages mentioned

above.


4.1.1 The sustained vowel exercise

In this dissertation project, we developed a game with the SVE, destined for the treatment

of childhood dysphonia. To perform the exercise, the child has to produce one of the

vowels /a/, /e/, /i/, /o/ or /u/ for as much time as possible. This time duration is

associated with the MPT, one of the variables in therapy. Moreover, the platform goal is to

move the character within the exercise scene, from a starting point to a final position. The

character movement is illustrated in Figure 4.1, where a target object represents the final

position. Through this feedback, the game instructs the child to continuously produce the

sustained vowel, until the character reaches the target. The initial distance between these

two scene’s elements is associated with the MPT expected duration. Thus, to challenge

the practice of the MPT, the game can change the initial distance between the character

and the target.

Figure 4.1: The figure illustrates the interaction between the character and the target in a scene context.

Additionally, the character’s movement depends on two aspects: loudness of the

sound produced and the sound itself. Specifically, the loudness is another variable in

therapy and must stand between specific thresholds. These thresholds and further infor-

mation regarding the variables in therapy are explained in chapter 5. On the other hand,

the sound produced must follow the SLP's choice of one of the vowels /a/, /e/, /i/, /o/ or /u/, whose recognition is the responsibility of the ASR system introduced in chapter 6.

4.1.2 Game scenarios and gamification strategy

Since the game is intended for children, the scenarios focus on an infant theme, without

too much detail, and with appealing colors. Figure 4.2 represents the scenes for the SVE

exercise. These scenarios were created with the help of a visual artist and with images

from Freepik [18].

We decided to focus the game on a journey for the discovery of treasures, which are

gifts that the players, i.e., the children, conquer at the end of each task. Additionally, the children can choose a character that reflects their preferences and tastes. Since tendencies differ according to age, gender and culture, the game offers four different ethnic options for both genders, so as to fulfill different children's preferences. Figure 4.3 presents the


Figure 4.2: Scenarios available for the exercise page.


Figure 4.3: Set of characters available, representing both genders and four different ethnicities.

Figure 4.4: Available rewards.

available characters and figure 4.4 shows the possible gifts. In total, we offer 15 rewards that the child can collect after a successful trial.

The primary purpose of these game UI elements lies in the importance of keeping the child engaged and satisfied, so that she sees therapy not as a hassle but as a fun and challenging experience. With this stimulus, our game tries to prompt the child to practice the exercise continuously and with motivation. After the child completes a task, she can choose a reward and start her rewards collection. In this way, these interactive elements are combined in a gamification strategy that intends to keep the child engaged in playing.

4.1.3 Visual feedback

(a) Try again message. (b) Congratulations message.

Figure 4.5: Visual feedback messages for the exercise scene.

(a) Fall of the flying carpet. (b) Fall of the train.

Figure 4.6: Character's falling feedback.

The child's motivation to continue playing stems from an interactive game platform where the child can see her progress. Thus, with the right visual feedback from the exercise scene, the child can recognize her failures and improve her utterance production until she accomplishes the exercise goal. During the trial, the game shows distinct feedback according to the child's performance. Figure 4.5b presents the game visual

feedback in the case of success. Otherwise, if the character reaches the game margins,

the game presents the message illustrated in figure 4.5a.

Moreover, if the child exhibits a correct production, the game reacts through the character's movement to the right. In the opposite situation, the character stops the movement and begins to fall, until her production reaches the expected values. These movement variations are supposed to be intuitive to the child and encourage her to improve her performance. However, since the exercise scenarios have different characteristics, the same movement may be appropriate in some scenarios and inadequate in others. Figures 4.6 show different scenes for the SVE. In figure 4.6a, the fall of the flying carpet is adequate. Nonetheless, in scene 4.6b, it is almost unbearable to let the train fall. To avoid these situations, we could easily adapt the character's animation. Nevertheless, we consider it best to have similar behaviors for the sake of intuitiveness. Besides, children have a vast imagination, and so, it should not be a problem.

4.1.4 Game parametrization

Since each child has a unique pathological situation, their treatment should focus on their

needs. Thus, before the child interacts with the platform, a set of parameters must be

established to ensure the ideal level of difficulty. Note that different children deal in different ways with a new stimulus. According to information from therapists, for an

autistic child, for example, the platform should include a reduced number of scenarios,

since these children are more comfortable with repetitive exercises. On the other hand,

hyperactive children prefer a game with more appealing and challenging UI elements.

Therefore, the therapist is able to choose the scenes in the "treasure hunt".

Each child can have a different performance throughout the treatment, and thus, the difficulty level should fit her abilities. For instance, the current level may be harder than the child's capacities and frustrate her, or be extremely simple and prove to be less challenging and boring. Thus, each exercise must be adapted to the current level of treatment and to the child, to find a balance between the degree of motivation and challenge.

The game difficulty variables can be chosen manually by the therapist or automatically

with our DDA scheme, further described in chapter 5.

For this purpose, the therapist can choose the corresponding parameters in therapy,

the MPT and the loudness variables. The therapist is also able to select the scenarios that

she considers most appropriate for the child. On the other hand, when the child interacts

with the game, the custom map should already be available with the parametrization

of the therapist. Since each platform focuses on one particular child, the SLP can also

introduce the child’s basic information. This information includes the name, age, gender

and relevant additional description.

To sum up, the therapist can set the following information and parametrization (a sketch of this configuration as a data structure follows the list):

• child’s basic info;

• the scenarios that are available for the child's treatment;

• the intensity level - low, medium or high - regarding the purpose of the therapy;

• difficulty adaptation mode (manual or automatic);

• the established time for the child to complete the exercise (MPT);

• the loudness expected intervals;

• the vowel to identify during the SVE.
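A minimal sketch of this parametrization as a simple data structure; all field names, types and default values are illustrative and do not reflect the platform's actual implementation.

from dataclasses import dataclass, field

@dataclass
class SessionConfig:
    child_name: str
    age: int
    gender: str
    notes: str = ''                                 # e.g., pathology description
    scenarios: list = field(default_factory=list)   # scenes chosen by the SLP
    intensity_level: str = 'medium'                 # 'low' | 'medium' | 'high'
    automatic_difficulty: bool = True               # manual or automatic (DDA) mode
    mpt_seconds: int = 6                            # expected MPT, from 2 to 10 s
    loudness_interval: tuple = (50.0, 70.0)         # expected [Lm, LM] in dB
    vowel: str = 'a'                                # vowel to identify during the SVE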

4.2 Platform architecture, design and structure

4.2.1 System architecture

The game is designed to run on computer or mobile devices. As we previously mentioned,

our platform includes an ASR system, responsible for the preprocessing and analysis of

the child’s utterances produced in real time, during the SVE. Depending on the device on

which the executable will run, resource limitation may be an obstacle to processing the

data received. Regarding the type of algorithm adopted for an ASR system, it is necessary

to assure its real-time response in the interaction with the child.


To ensure the best performance of the software, we used a client-server architecture,

in which the system forwards the most complex processing to the server. In this way, the

tool is not limited by its resources and its future use is much more flexible. The processing

on the server must be performed in real time so that the feedback can be returned to the

child during her attempt. The data exchange between client and server is presented in

figure 4.7 and is followed by a brief description of the responsibilities of each part of

the system.

Figure 4.7: Client-server architecture.

The client represents the part of the system responsible for the game actions, graphics

organization, recording the child's utterances and sending them to the server. After the server's response, the client generates the right feedback to the child through the game elements.

The server is responsible for all processing in the ASR system. Therefore, it receives

the segments of sound produced from the client side and deals with the proper extrac-

tion and data processing so that they can be analyzed and classified using the trained

algorithm. For that, it is necessary to consider which algorithm to choose, so that the

processing time is as low as possible without penalizing the correct classification of the

phoneme.
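As an illustration of this client-server exchange, the sketch below shows a possible server loop that receives length-prefixed audio chunks over TCP and answers with a classification result; the port, the framing protocol and the classify_vowel placeholder are assumptions of this sketch, not the platform's actual protocol.

import socket
import struct

def classify_vowel(pcm_bytes):
    # Placeholder: extract features from the chunk and run the trained
    # classifier; here it always answers 'a'.
    return b'a'

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('0.0.0.0', 9000))
server.listen(1)
conn, _ = server.accept()
while True:
    header = conn.recv(4)                    # 4-byte big-endian length prefix
    if not header:
        break
    size = struct.unpack('!I', header)[0]
    chunk = b''
    while len(chunk) < size:                 # read the full audio chunk
        data = conn.recv(size - len(chunk))
        if not data:
            break
        chunk += data
    conn.sendall(classify_vowel(chunk))      # immediate feedback to the game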

For the development of the platform, we used the Unity game engine, given its

advantages. First, Unity is cross-platform, which means that through a single develop-

ment process and the same code, we can launch it on different platforms. On the other

hand, it is free, quite intuitive and has an extended community of platform users which

facilitates and supports the rapid implementation of the tool.

4.2.2 Game’s storyboard

In order to integrate the functionalities for both therapist and child, the game platform

presents the following structure: (1) initial page, (2) choose character page, (3) fill child

basic info page, (4) choose scenarios page, (5) add exercise parameters page, (6) SLP

choose options page, (7) map page, (8) prizes page, and (9) the exercise page (The char-

acter, the scene, and the target vary according to the scenario chosen). These scenes


represent the set of activities available in the evolution of the game status. The game presents different flow options concerning the player's type (whether the player is an SLP or a child). These functionalities are performed within our game platform through a set of activities. In figure 4.8, we present a flowchart that describes these activities and the flow between the user and the system.

Figure 4.8: Activity diagram of the game platform.

For a better comprehension of the platform interface, we describe each page in detail

with the corresponding scenes, function and relevant implementation details.


1. Initial page

The game starts with the presentation of this page, as illustrated in figure 4.9, and it offers the player two possible flows, one if the player is a therapist and another if the player is a child, with a different path for each case.

Figure 4.9: Start page scenario.

2. Choose character page

The choose character page is the following scene in the child's flow, if we choose the corresponding button on the start page. Figure 4.10a illustrates the page responsible for changing the available characters from figure 4.3.

3. Fill child basic info page

The game introduces a set of three pages for the steps in the game configuration

process, to add information about the child in treatment. The first parametrization

page - fill child basic info page - allows the therapist to add the basic information about the child, as shown in figure 4.11a. More specifically, the name, gender,

age and a field to fill with extra information, for example, the description of the

pathology to treat. This scene concerns the therapist flow. These parameters are

validated before the user proceeds to the next page, as shown in figure 4.11b. Note

that, if the therapist already added the child’s information, the game will not show

this page. Instead, the platform will present the choose options scene 4.13a.

4. Choose scenarios page

The choose scenarios page is the second parametrization step and allows the SLP

to choose the most appropriate scenarios to the child in treatment, as illustrated in

figure 4.12a.

5. Parameters page

Regarding the child’s needs, it is essential to include the possibility to parametrize


(a) Choose character page scenario. (b) Rewards pages.

Figure 4.10: Choose characters (left) and see rewards (right) scenarios.

(a) Add basic info page (b) Basic info validation

Figure 4.11: Add child basic info scene.


(a) Choose scenes page (b) Choose parameters page

Figure 4.12: Treatment editable parameters.

the exercise variables, as presented in figure 4.12b. This scene corresponds to the

third step in the configuration process.

The therapist is allowed to choose the adequate parameters for each case: besides the scenes, the intensity level (low, medium or high) and a variable that indicates whether the difficulty adjustment is automatic or manual. If the SLP chooses the manual parametrization option, then the options MPT level (from 2 s to 10 s) and intensity index (easy for a larger loudness interval, moderate for an intermediate range, or hard for a small interval) are appended.


(a) Therapist game options page. (b) Map scenes available to the child.

Figure 4.13: SLP and children’ game options.

6. Therapist choose options page

After the therapist adds the child’s first parametrization, the therapist will see the

choose options page, as illustrated in figure 4.13a. From there, the therapist can edit

the child's profile and the treatment information, as seen in figure 4.12. Otherwise,

the SLP can see the map with the scenes previously chosen.

7. Map page

In the map scene, the child is allowed to choose the session with the therapist’s

configuration. Otherwise, without previous parametrization, the child can perform a random exercise. Besides selecting the task, the kid can see his conquered prizes or exit the game. In this last case, the platform saves the current status in a serializable object, which contains the player information, his performance concerning his trials, as well as the treatment situation. Figure 4.13b illustrates the map scene.

8. Rewards page

The page to choose a reward is reachable when the child completes the exercise and selects the rewards button. Figure 4.10b presents the page where the child can interact, through drag and drop, with the rewards presented in figure 4.4.

9. Exercise page

This scene has three main elements: the character, the scenario, and the target; they vary according to the scene selected among the map's available scenarios and the player chosen in the choose character page. This page focuses on the practice of the SVE.

Overall, the game’s goal is to make the main character reach a target, for each of the

proposed SVE scenarios. To make the main character move, the child has to produce a

sustained vowel and achieve the values expected in the speech parameters of interest to

the therapy exercise: the phonation time and the intensity level. In the course of the following chapter 5, we discuss how the therapy variables are measured in the game during the SVE, and how we intend to update them regarding the child's performance, which results in the development of a dynamic difficulty adjustment (DDA) model.


Chapter 5

A Novel Dynamic Difficulty Adjustment Model

Our SVE serious game is controlled by the child’s voice in real-time and offers several

scenarios based on an infant theme that aims to keep the child interested. During the

therapy sessions, the SLP can update the treatment variables so that the game’s difficulty

follows the child’s performance. However, without the presence of a therapist, if the

child is struggling with the task, she may feel frustrated by the incapacity to succeed.

Otherwise, she would be bored with an effortless task. Thus, here we propose a player

adaptable difficulty model, so that she sees the therapy not as a hassle but as a fun and

challenging experience.

The models described in chapter 3 may be appropriate to the respective game, but

inadequate for other applications. For our problem, we suggest an innovative scheme,

where the player's experience should result in a balance between challenge and enjoyment,

through the influence of the variables relevant to therapy.

5.1 The DDAmodel

Some of the scenarios for the SVE are illustrated in Figure 4.2. During the SVE, the child

has to produce a vowel for as long as possible. The maximum phonation time (MPT) is an

important measure of voice quality [8]. It helps to evaluate the child's ability to control

the breathing while producing a sound at the requested intensity level. The analysis

of this variable can be used both for assessing the individual’s aerobic capacity and for

vocal treatment, for example, for stabilizing the intensity of the sound produced [3]. The

repetition of the SVE will help the child improve her MPT and control her voice intensity

and stability.

The proposed game allows the SLP to parameterize the expected MPT, that is, the


Table 5.1: Allowed intensity levels and intensity interval sizes in dB (SPL).

intensity level | Le | L1m | L1M | L2m  | L2M | L3m | L3M
high            | 85 | 80  | 90  | 70   | 100 | 50  | 100
medium          | 60 | 50  | 70  | 47.5 | 75  | 45  | 80
low             | 40 | 35  | 45  | 30   | 50  | 30  | 55

MPT that the child should reach, and which we call MPTe. In the game’s graphics, MPTe

is the time that the main character needs to move (walk, fly, swim, etc.) to reach the game

target. We use the expression MPTe(r) to refer to MPTe of trial run r.

Depending on the pathology, the child may need to train the SVE with different

intensity levels. Children who usually speak very low, must train speaking at higher

intensities, whereas children who tend to increase the volume and, as a consequence,

strain their vocal cords, must practice speaking at softer intensities. In order to correct

these behaviors, the child can practice the SVE according to her needs. The game allows

the SLP to choose the intensity level to be practiced from three possible values (one value

for low intensity, one for medium and another for high intensity) [29]. We call this the

expected intensity level (or expected loudness), Le.

The child may have difficulties in stabilizing the requested intensity and obviously, it

is not expected that the child achieves a perfectly constant intensity level. Thus, small variations in intensity should be allowed, and it is essential to establish the allowed variation

interval. The algorithm uses a minimum and maximum threshold, Lm and LM , around Le

to define this interval:

$$\Delta L = [L_m, L_M], \tag{5.1}$$

where Le ∈ [Lm,LM ]. We use the expressions ∆L(r) and Le(r) to refer to the intensity

level interval and expected intensity of trial run r, but we will often drop the variable

r for simplicity. Table 5.1 shows the possible expected intensity levels and the allowed

intensity intervals. The SLP has the possibility of manually adjusting these values. To

choose these values we consulted an SLP and related work [16]. The values were adjusted

empirically with children in the aimed age group.

The game's difficulty depends on MPTe and ∆L. The game offers five different possible values for MPTe: 2, 4, 6, 8 and 10 seconds, as the MPT estimated for children is 10 seconds [40]. On the other hand, while the game's first version used a fixed ∆L size [29], we now defined three intensity level intervals for ∆L, such that different difficulty levels use different intensity interval sizes ($\Delta L_n = [L_{nm}, L_{nM}]$ with $n \in \{1,2,3\}$). The lowest difficulty level allows the widest ∆L, while the highest difficulty level allows the narrowest ∆L. Combining the different possibilities for the values of MPTe and ∆L sizes, the game offers 15 different difficulty levels (for each Le, that is, for the low, medium and high intensity values). Note that the expression ∆L(r) = ∆Ln means that trial run r uses the n-th intensity level interval size.

Here we propose a new dynamic difficulty adjustment model that aims to keep the


child motivated on playing. The game’s current version offers the option of adapting

the game’s difficulty manually or automatically. In the latter case, the SLP chooses the

initial difficulty level (defined through an initial value for MPTe and an initial ∆L), and

afterwards, the game runs an algorithm for adapting the difficulty level before each new

trial, that is, the game adapts the values of MPTe and ∆L.

Difficulty adjustments should take into account the player’s performance, and the

player’s performance should be a measure of the parameters to be improved in therapy:

the maximum phonation time and the voice’s intensity and stability. If the child achieves

the expected values for these parameters, the child is ready to access a more demanding

level, with more ambitious expected values. Otherwise, if the values achieved are lower

than what is expected, it means that the child had a poor performance in the game and

the challenge difficulty should be decreased.

Below we first discuss a simpler adaptation model that measures the child’s perfor-

mance only in terms of the MPT achieved by the child (section 5.1.1), and then we

discuss the proposed DDA model, which is a more complete model that measures per-

formance both in terms of the achieved MPT and speech production intensity stability

(section 5.1.2).

5.1.1 Maximum phonation time

When the child starts playing, the SLP should parameterize the expected MPT, that is,

MPTe. The actual MPT achieved by the child, MPTa(r), is measured at each trial run r. It

is intended that during the task the child obtains MPTa(r) = MPTe(r). For simplicity, we sometimes refer to these functions simply as MPTa and MPTe.

If the child is not able to reach the expected MPT (because MPTa <MPTe), the main

character will not reach the target. If this happens for several trials, the child can feel

frustrated with the game. In these situations, the child’s achieved performance is below

the expected performance. Thus, it is important to define a lower value forMPTe. On the

other hand, if the child is achieving a positive performance, that is, obtaining MPTa =

MPTe in successive trials, she may be ready for a higher difficulty level, since she has

already stabilized the aerobic capacity requested for the respective degree of difficulty.

Figure 5.1: Scheme for updating MPTe.

It is important to define when to decrease or increase MPTe. There are situations in


which the child achieves MPTe and we should not increase the expected MPT immediately.

A difficulty level enhancement should be difficult to achieve in order for the child to

stabilize his performance with the current degree of difficulty. On the other hand, a lower

level should be easier to achieve to avoid frustration when the level is not appropriate to

the child’s ability. Also, if the child fails once, it is not necessarily good to decrease MPTe

immediately. In some cases, we should give the child another chance and let her try again.

For instance, if the child does not reach MPTe but MPTe −MPTa is small, we should let

her try again. However, if MPTe −MPTa is large, we should soon decrease the value of

MPTe. We use a threshold of $\frac{1}{3}MPT_e$. This scheme is represented in figure 5.1 and can be

summarized as follows:

1. If $MPT_e < MPT_a$, we will shortly increase MPTe, but first we let the child play a few

more trials with this MPTe so that the child stabilizes his performance with the

current MPTe.

2. If $\frac{1}{3}MPT_e < MPT_a \le MPT_e$, we let the child try again a few more trial runs before

changing MPTe.

3. If $MPT_a \le \frac{1}{3}MPT_e$, we will soon decrease MPTe. The achieved MPT was much smaller than the expected MPT, which means that the exercise is too difficult for the

child with this MPTe value.

In order to define when and how to change MPTe, we use a cumulative value that

measures the time-evolving child’s performance in terms of the MPT, which we call

PMPT (r), and where r represents the trial run. The value of PMPT increases (that is,

PMPT (r) > PMPT (r − 1)) when the child has a good performance, and decreases otherwise.

We increase or decrease the value of MPTe by 2 seconds depending on the value of PMPT .

PMPT of the current trial r is updated as follows:

$$P_{MPT}(r) =
\begin{cases}
P_{MPT}(r-1) + MPT_a(r) & \text{if } MPT_e(r) \le MPT_a(r) \\[4pt]
P_{MPT}(r-1) - \dfrac{MPT_e(r)}{MPT_a(r)} & \text{if } \frac{1}{3}MPT_e(r) < MPT_a(r) \le MPT_e(r) \\[4pt]
-\frac{1}{3}MPT_e(r) & \text{if } MPT_a(r) \le \frac{1}{3}MPT_e(r)
\end{cases} \tag{5.2}$$

with PMPT (0) = 0 (the initial value of PMPT before the start of the first trial).

In addition to defining the function’s behavior, it is necessary to establish the limits

between which the PMPT may vary before there is an update of the value of MPTe. The

interval $\left]-\frac{1}{3}MPT_e(r),\; 2MPT_e(r)\right[$ determines the possible variation for $P_{MPT}(r)$. Thus,

$$MPT_e(r+1) =
\begin{cases}
MPT_e(r) - 2 & \text{if } P_{MPT}(r) \le -\frac{1}{3}MPT_e(r) \wedge MPT_e(r) \ge 4 \\[4pt]
MPT_e(r) + 2 & \text{if } P_{MPT}(r) \ge 2\,MPT_e(r) \wedge MPT_e(r) \le 8 \\[4pt]
MPT_e(r) & \text{otherwise}
\end{cases} \tag{5.3}$$


Note that when the level change is reached, the current performance is reset, that is,

it is reduced to 0. Thus, we add the following first line to equation 5.2:

$$P_{MPT}(r) =
\begin{cases}
0 & \text{if } MPT_e(r-1) \neq MPT_e(r) \vee r = 0 \\[4pt]
P_{MPT}(r-1) + MPT_a(r) & \text{if } MPT_e(r) \le MPT_a(r) \\[4pt]
P_{MPT}(r-1) - \dfrac{MPT_e(r)}{MPT_a(r)} & \text{if } \frac{1}{3}MPT_e(r) < MPT_a(r) \le MPT_e(r) \\[4pt]
-\frac{1}{3}MPT_e(r) & \text{if } MPT_a(r) \le \frac{1}{3}MPT_e(r)
\end{cases} \tag{5.4}$$

Let us see a few examples.

Example 1 Let us suppose that $MPT_e(1) = 6$ s. Then, while $P_{MPT} \in\; ]-2, 12[$ s, $MPT_e$ will remain with the same value. If $MPT_a(1) = MPT_a(2) = MPT_a(3) = 4$ s, then $P_{MPT}(1) = -6/4 = -1.5$, $P_{MPT}(2) = -6/4 \times 2 = -3$, and $MPT_e(3)$ decreases, that is, $MPT_e(3) = 4$ s.

Example 2 Let us suppose that we still have $MPT_e(1) = 6$ s but $MPT_a(1) = 2$ s. In this case, the difficulty will decrease faster: $P_{MPT}(1) = -6/2 = -3$, and $MPT_e$ decreases immediately, that is, $MPT_e(2) = 4$ s.

Example 3 Now let us suppose that the child can achieve the expected MPT with $MPT_e(1) = 6$ s. That is, if $MPT_a(1) = MPT_a(2) = 6$ s, then $P_{MPT}(1) = 6$ s and $P_{MPT}(2) = 12$ s. Thus, the level increases after two trial runs, that is, $MPT_e(3) = 8$ s.
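A small sketch of equations 5.3 and 5.4 in code may help: the variable names mirror the symbols above, and the 2-second step within the allowed 2-10 s range follows the five possible MPTe values. The trace at the end reproduces example 1.

def update_p_mpt(p_mpt, mpt_e, mpt_a):
    # Equation 5.4 (the reset to 0 on a level change is done in update_mpt_e).
    if mpt_e <= mpt_a:
        return p_mpt + mpt_a               # expected MPT reached
    if mpt_a > mpt_e / 3:
        return p_mpt - mpt_e / mpt_a       # close miss: small penalty
    return -mpt_e / 3                      # large miss: forces a decrease soon

def update_mpt_e(p_mpt, mpt_e):
    # Equation 5.3: move MPTe in 2 s steps within the allowed 2-10 s range.
    if p_mpt <= -mpt_e / 3 and mpt_e >= 4:
        return mpt_e - 2, 0                # easier level, performance reset
    if p_mpt >= 2 * mpt_e and mpt_e <= 8:
        return mpt_e + 2, 0                # harder level, performance reset
    return mpt_e, p_mpt

# Example 1: MPTe = 6 s and MPTa = 4 s on every trial.
p, mpt_e = 0, 6
for _ in range(2):
    p = update_p_mpt(p, mpt_e, 4)
    mpt_e, p = update_mpt_e(p, mpt_e)
print(mpt_e)  # 4: the expected MPT decreased after two trials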

5.1.2 Speech intensity level

While it is important to consider the MPTa achieved by the child at trial run r to decide the value

of MPTe of the next trial, it is also important to consider how the speech production

intensity varies during the trial. When the game starts, the SLP chooses an appropriate

expected intensity level, Le. While performing the SVE, one must keep the intensity

level as stable as possible and as close to Le as possible. Thus, the sound intensity level

achieved by the child, La, is one of the variables measured by our algorithm.

The intensity of the speech production, La, is allowed to fluctuate within the interval

∆L. The speech intensity level achieved by the child is a time-varying function, La(t)

(figure 5.2). During the game, the main character moves towards the target, exclusively,

when the speech production intensities (La(t0), La(t1), La(t2), . . .) are within the defined

thresholds, that is, within ∆L.
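As an illustration, La(t) can be measured per frame as an RMS level mapped to decibels, as sketched below; the calibration offset that relates digital RMS to dB (SPL) is device dependent, and the value used here is only an assumption of this sketch.

import numpy as np

def intensity_db(frame, calibration_db=90.0):
    # RMS level of one audio frame, mapped to a dB scale; the device-dependent
    # calibration offset approximates dB (SPL).
    rms = np.sqrt(np.mean(np.square(frame.astype(float))))
    return 20 * np.log10(max(rms, 1e-10)) + calibration_db

def within_interval(la, lm, lM):
    # The character moves only while La(t) stays inside [Lm, LM].
    return lm <= la <= lM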

As explained above, different difficulty levels use different intensity interval sizes ∆L.

Figure 5.2 illustrates the three allowed interval sizes. The first trial run always starts with

the widest intensity interval size, ∆L(1) = ∆L3.

Now let us analyze how the difficulty of the game changes in reaction to La, that is,

how MPTe and ∆L of the next trial are updated. Like in the previous section, here we

also measure the child’s performance to determine when to change the game’s difficulty.


Figure 5.2: Allowed variation intensity level intervals, ∆L. The figure illustrates a case with Lm = 50 dB, LM = 70 dB, and Le = 60 dB. The orange line illustrates the time-varying speech intensity level achieved by the child, La(t).

There are several situations that we must take into account. (1) If the child is able to keep

La within the expected limits during the whole trial run r, that is, if MPTe(r) ≤ MPTa(r)

and La(t) ∈ ∆L(r), for all t, then the difficulty of the next trial run can increase by reducing

the size of ∆L or increasing MPTe. (1.1) Let us suppose that ∆L(r) is too wide (for instance,

∆L(r) = ∆L3). In this case, before increasing MPTe for the next trial, we should reduce

the size of the intensity level interval, that is, if ∆L(r) = ∆Ln then ∆L(r +1) = ∆Ln−1. (1.2)

However, if ∆L(r) = ∆L1, the narrowest interval size, then since the intensity interval size

cannot be reduced anymore, to increase the difficulty of the next trial run, we can increase

the size of MPTe. (1.3) Another possibility is when we have a wide ∆L(r) = ∆Ln but the

child achieves an intensity variation within ∆Ln−i , with i > 1. In this case, we can make a

bigger reduction on the size of ∆L.

On the other hand, (2) if the child registers a variation that exceeds the limits ($L_a(t) \notin \Delta L$, for a few $t$), her intensity levels are not stable. Note that, if the intensity variation is

too discrepant, there is no point in having the child trying to achieve a long MPT and

maintain or increase ∆L. It is preferable to reduce the expected MPT and have the child

learn how to stabilize his voice for a shorter time. In this case, PMPT (r) will assume the

value $-\frac{1}{3}MPT_e(r)$. The expression for $P_{MPT}(r)$ now reflects all these cases:

$$P_{MPT}(r) =
\begin{cases}
0 & \text{if } MPT_e(r-1) \neq MPT_e(r) \vee r = 0 \\[4pt]
P_{MPT}(r-1) + MPT_a(r) & \text{if } MPT_e(r) \le MPT_a(r) \wedge \forall t\; L_a(t) \in \Delta L_1 \\[4pt]
P_{MPT}(r-1) - \dfrac{MPT_e(r)}{MPT_a(r)} & \text{if } \frac{1}{3}MPT_e(r) < MPT_a(r) \le MPT_e(r) \\[4pt]
-\frac{1}{3}MPT_e(r) & \text{if } MPT_a(r) \le \frac{1}{3}MPT_e(r) \vee \exists t\; L_a(t) \notin \Delta L_{n+1},\; n < 3
\end{cases} \tag{5.5}$$

where $\Delta L(r) = \Delta L_n$ in the last case.

MPTe(r) is still defined by equation 5.3 but PMPT (r) is now defined by equation 5.5.

Note that there are slight differences between the old definition of PMPT (r) (equation 5.4)

and its new definition (equation 5.5).

In order to decide when to update ∆L we use another measure of performance that

takes into account both the achieved MPT and intensity variation. We call this measure

P∆L:

$$P_{\Delta L}(r) =
\begin{cases}
0 & \text{if } \Delta L(r-1) \neq \Delta L(r) \vee r = 0 \vee MPT_e(r-1) \neq MPT_e(r) \\[4pt]
P_{\Delta L}(r-1) + \dfrac{1}{|\Delta L_a(r)|} & \text{if } MPT_e(r) \le MPT_a(r) \wedge \forall t\; L_a(t) \in \Delta L(r) \\[4pt]
P_{\Delta L}(r-1) - \dfrac{|\Delta L_a(r)|}{|\Delta L(r)|} & \text{if } |\Delta L_a(r)| > |\Delta L_n| \text{ where } \Delta L(r) = \Delta L_n \\[4pt]
P_{\Delta L}(r-1) & \text{otherwise}
\end{cases} \tag{5.6}$$

where $|\Delta L_a|$ measures the achieved intensity variation: $|\Delta L_a| = L_{a\max} - L_{a\min}$, where $L_{a\max} \ge L_a(t)$ and $L_{a\min} \le L_a(t)$ for every $L_a(t)$ in trial run r. Note that if the difficulty level changes (because there is a reduction in the size of the allowed intensity interval or in the value of MPTe), the performance P∆L is reset to 0. The performance P∆L increases for situation (1) above. On the other hand, it should decrease for situation (2), when ∆La(r) exceeds ∆Ln. Considering a trial where the MPTe increased, ∆L(r) = ∆L1. If the child experienced a bad performance during consecutive trials, the game should give her the chance to try with the next larger ∆L. However, if the child's performance remains low, the PMPT decrements and the MPTe will decrease.

As mentioned above, the game starts with the widest interval size. Once the child

starts to achieve a good performance with this interval size, the interval size decrements.

As mentioned above, it is possible to have big decrements on the size of ∆L when the

intensity variation achieved by the child is much smaller than |∆L|. On the other hand,

once the child has reached the narrowest interval size (∆L1), it is possible to increase the

difficulty level by increasing MPTe. If with this new value of MPTe the child’s intensity

variation is wider than |∆L1|, then we increase the interval size a bit, to give the child


some more time to stabilize his voice for this longer MPT. In this case, we increase the

interval to ∆L2 but we will not increase it further than that, because the child has already

achieved a smaller interval size for a shorter MPT. The following expression reflects how

and when to change ∆L:

\[
\Delta L(r+1) =
\begin{cases}
\Delta L_2 & \text{if } \Delta L(r) = \Delta L_1 \,\wedge\, P_{\Delta L}(r) \leq 0 \\
\Delta L_{n-i} & \text{if } \Delta L(r) = \Delta L_n \,\wedge\, P_{\Delta L}(r) > \dfrac{2}{|\Delta L(r)|} \,\wedge\, n > 1 \\
\Delta L(r) & \text{otherwise}
\end{cases}
\tag{5.7}
\]

where i determines the size of the decrement on ∆L and is defined as follows: 1 ≤ i ≤ n − 1 and i = n − n′, where |∆La| ≤ |∆Ln′| and (n′ = 1 ∨ |∆La| > |∆Ln′−1|), with n′ < n. For instance, if ∆L(r) = ∆L3 and the child is able to make a correct sound production with |∆La| ≤ |∆L1|, then ∆L(r + 1) can be updated to ∆L1. The combined behavior of our variables in therapy is presented in figure 5.3.

Figure 5.3: Scheme for updating ∆L, with the influence of the MPT variable.
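Before the worked example below, equation 5.7 and the computation of i can be condensed into a minimal Python sketch; the ascending list of interval widths and the fallback used when no n′ satisfies the condition are assumptions made for illustration only.

```python
def update_dL_index(n, p_dL, widths, dLa):
    """Sketch of equation 5.7: choose the interval index for trial r + 1.

    n      -- current 1-based index, dL(r) = dL_n
    p_dL   -- performance P_dL(r) of the current trial run
    widths -- interval widths |dL_1|, |dL_2|, ... in dB, narrowest first
    dLa    -- achieved intensity variation |dL_a(r)| in dB
    """
    if n == 1 and p_dL <= 0:
        return 2                        # relax to dL2 after consecutive poor trials
    if n > 1 and p_dL > 2 / widths[n - 1]:
        # i = n - n', where n' is the tightest interval containing |dL_a|:
        # |dL_a| <= |dL_{n'}| and (n' = 1 or |dL_a| > |dL_{n'-1}|), with n' < n
        for k in range(1, n):
            if dLa <= widths[k - 1] and (k == 1 or dLa > widths[k - 2]):
                return k
        return n - 1                    # fallback assumption: smallest reduction
    return n                            # otherwise keep the same interval
```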

Example 4 Suppose that MPTe(1) = 2 s. Then, while PMPT ∈ ]−2/3, 4[ s, MPTe will keep the same value.

Focusing on the ∆L variable, we start the first trial r = 1 with ∆L(1) = ∆L4, which means that the largest ∆L is expected.

Consider that, for each trial r ∈ {1, 2, 3, 4}, MPTe(r) = 2 s and that MPTa(r) ≥ MPTe(r). PMPT will not be updated until the player achieves ∆L1. An example of this behavior is presented in table 5.2.

After trial r = 4, MPTe will be updated and, consequently, the difficulty will be increased. Additionally, ∆L will remain the same, unless the child's performance decreases and she cannot achieve the target for the expected ∆L thresholds.

The proposed model was developed taking into consideration all the variables in the therapy process and our main goal: finding the right balance between the child's skills and the game challenges, with a correlation that must stand within the flow channel.


Table 5.2: Evolution of the child's performance during four trials.

                 Trial 1      Trial 2      Trial 3      Trial 4
MPTa(r)          2            2            2            2
PMPT(r)          0            0            2            4
∆L(r)            ∆L3          ∆L3          ∆L1          ∆L1
∆L(r) interval   [52, 69]     [54, 65]     [53, 64]     [55, 64]
P∆L(r)           0 + 0.054    0.145        0 + 0.091    0.2022
2/|∆L(r)|        0.057        0.057        0.1          0.1

Thus, it was imperative to analyze how to manipulate the game difficulty according to the performance in the task's variables, whose behavior was presented in equations 5.1 to 5.7. Whenever the player's performance is low for a specific variable, the corresponding performance measure is decremented. Otherwise, if the player performs the task correctly, the variables' performance measures are positively updated.

Additionally, when the performance of a variable reaches a lower or upper boundary, the variable's parameters become easier or harder, respectively. These thresholds were carefully defined to avoid that, at any moment, the child's experience moves her out of the flow channel or triggers a feeling of anxiety or boredom with the game-play experience.


Chapter 6

Automatic Sound Recognition System

As we previously discussed, the game platform should detect the child's failures in producing the desired phoneme for the task. In order to be able to identify the sound produced by the child, it is necessary to implement an ASR system. This system requires a process of extracting the features of each received sample and processing them through a machine learning algorithm.

In this chapter, we discuss the development of an ASR system for vowels. Therefore, we start with a description of the data set preprocessing and the feature extraction

techniques to provide the right information of each sound and create a robust classifier.

Furthermore, we describe the results of the combination of the classification algorithms

to improve the accuracy test results. Lastly, we propose a final solution to build the model

that best fits our data.

6.1 Data set characterization

For the practice of the game’s exercises, the child will be asked to produce the sustained

vowel, which can be /a/, /e/, /i/, /o/ or /u/ sustained. To train our ASR model, we used

2 data sets, (A) vowels /a/, /e/, /i/, /o/ and /u/ from the data set created by Aníbal J.

S. Ferreira, and presented in [15]. These records have a 32kHz sampling of the speech

sound with 100 ms, corresponding to the most common EP vowels. Additionally, we

used (B) BioVisualSpeech’s sustained vowels data set, which includes files with 48kHz

sampling and around 4 seconds of the speech sound. The data set includes exclusively

the sustained vowels /a/, /i/ and /u/ [10].

The description of each data set is presented in tables 6.1 and 6.2. Due to the differences between the two data sets, namely the sampling frequency and the file length, we needed to perform a few changes in the original data sets.


Table 6.1: Number of samples for each vowel.

Phoneme   Nr. samples in (A)   Nr. samples in (B)   Total nr. samples
          with 100 ms          with 100 ms          after feature extraction
/a/       27                   21                   222
/e/       27                   0                    149
/i/       27                   19                   215
/o/       27                   0                    148
/u/       27                   19                   215
Total     135                  59                   949

Table 6.2: Total number of children that performed the recordings in both data sets.

(a) Data set A

Age     Boys   Girls   Both
4       1      0       1
5       5      4       9
6       1      0       1
9       5      6       11
10      2      3       5
Total   14     13      27

(b) Data set B

Age                   Boys   Girls   Both
8 and 9 years old     12     9       21

Given the differences in sampling rates, we performed a sampling-frequency conversion on data set A, from 32 kHz to 48 kHz. With respect to the file length, we split the files from data set B into 100 ms samples. Before the split, the recordings were analyzed to ensure that none of the files included silence regions. Otherwise, we would have 100 ms samples containing moments of silence while representing a specific vowel.

Note that the sustained phoneme samples have a total length of around four seconds each. Initially, we split each file into 100 ms sub-recordings and concatenated all of them to data set A. However, this approach would influence our results, since more than one sample of each vowel /a/, /i/ and /u/ would be associated with the same child, and so the number of samples per child would be disproportionate. Hence, the following sections describe our data with exclusively one representation of each vowel from a specific child.
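Both preprocessing operations can be sketched with the Librosa library as follows; the file paths are hypothetical and the exact segmentation code used in the project may differ.

```python
import librosa

# Data set A: convert the 32 kHz recordings to the 48 kHz rate of data set B.
y_a, sr_a = librosa.load("datasetA/vowel_a_child01.wav", sr=None)  # hypothetical path
y_a_48k = librosa.resample(y_a, orig_sr=sr_a, target_sr=48000)

# Data set B: split a ~4 s sustained-vowel recording into 100 ms segments.
y_b, sr_b = librosa.load("datasetB/vowel_a_child01.wav", sr=None)  # hypothetical path
seg_len = int(0.1 * sr_b)                                          # 100 ms in samples
segments = [y_b[i:i + seg_len]
            for i in range(0, len(y_b) - seg_len + 1, seg_len)]
```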

Although we do not know the conditions and the equipment used to record data set A, for the recording of data set B the sustained vowel exercise was performed in a small room, with a table and three chairs, in the presence of an SLP and a member of our team. The computer and the microphone were strategically placed in front of the child. The microphone used was a Fame audio MS-1800S. Given the lack of isolation in the room, there is noise in a few recordings. This fact is an advantage, since it helps in the creation of a robust algorithm that is ready to receive samples with environmental noise during future sessions with the game.

The SLP asked the child to take a deep breath and to sustain the phoneme for as long as possible. During the recording, three cards were shown to each


Figure 6.1: Comparative samples with 100 ms from the sustained phonemes (a) /a/, (b) /i/ and (c) /u/, with pitch and formants marked in blue and red, respectively.

child, with the words "Ave", "Iva" and "Uva", so they could perform the MPT exercise without any adulterating factor. Additionally, for each child, the sequence of the MPT exercises was randomly selected. Figure 6.1 shows three samples representing each of the sustained vowels recorded. As observed in figure 6.1, each vowel presents different spectral characteristics. When observing the formant plot, we are completely able to separate the vowels shown. However, as previously analyzed, the children's pitch tends to affect the formant pattern, and so, in some cases, these features are not capable of completely separating the vowels [15, 27, 37].

6.2 Automatic recognition system of vowels

In order to develop the ASR system, we combined several techniques to find the solution that best fits our purpose, considering the possible solutions analyzed in section 2.2. The development of our system is described in the following subsections and illustrated in figure 6.2.

6.2.1 Feature extraction techniques

The ASR system needs to train the classifier with information that characterizes each sound of interest, in order to adequately distinguish the phonemes produced. Besides analyzing the data set characteristics, we had to choose the features that represent each phoneme and capture the important perceptual cues that allow our model to distinguish them. In chapter 2, section 2.2, we show the most popular features in sound data analysis. The previous spectral analysis approach was replaced by the MFCC analysis.

Among other details, the mel frequency cepstral coefficients are obtained by applying a mel filter bank to the signal frames and then applying the DCT function. Despite the popularity of MFCCs, some studies have tried to analyze whether the process of decorrelating the mel filter banks through the DCT method is necessary to improve performance. Otherwise, it may be an unnecessary step if it discards important information


Figure 6.2: Steps in the development of our vowel ASR system.

from the original speech signal [25]. Consequently, we performed the extraction of the MFCC features and complemented their analysis with the filter bank features. Figure 6.3 shows a comparison between the FB and MFCCs from the sustained phonemes /a/, /i/ and /u/.

The MFCCs are produced in the format of feature vectors. Each vector defines a frame from one input signal, with 26 mel-filter cepstral coefficients in it. Usually, for ASR, only coefficients 2-13 are kept. Depending on the problem, i.e., the sound to decompose in the ASR system, the number of coefficients may change. For instance, other vowel recognition researchers used 12 and 16 MFCCs [10, 15]. In our project, we needed to find the best number of coefficients to use, and so we trained our classifier using from 5 to 16 coefficients. For the extraction process, we used the Librosa Python library, which performs audio and music signal analysis [32]. This library includes, among other functionalities, the computation of the filter banks, MFCCs with customized parameters and derivative features (delta and double delta).

The Librosa solution includes the steps described in section 2.2. These steps and the corresponding parameters are described below and sketched in the code example that follows the list:

1. Preprocessing: In the beginning, we convert ".wav" files into float arrays.

2. Framing and Windowing: For this step, we had to choose the size with which to split our sample into a stack of frames. It is usual to take 20-40 ms frames with partial overlap between consecutive frames. In our implementation, we used 20 ms frames with 10 ms of overlap. For example, for data set A with a sampling rate of 32 kHz, the first 640-sample frame starts at sample 0, the next one starts at sample 320, and so on until the end of the speech is reached. A Hamming window is applied to each frame.


3. Compute the STFT: Here we compute the N-point FFT on each frame to calculate the frequency spectrum. In our case, N is 1024. The computation of the power spectrum follows this step.

4. Compute Mel-Filterbank energy features: In this step, we used 40 filters, which

is the standard choice in the ASR context.

5. Take the log of the features: Here we calculate the logarithm of the features, which

gives us the filter banks features.

6. Apply DCT: We apply the Discrete Cosine Transform of type 2, which gives us the

cepstral coefficients.

7. Normalization: We normalize the final feature vectors obtained.

After this step, we compute the delta and double delta features.
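The following sketch condenses steps 1 to 7 into a single Librosa call, under the parameters stated above (20 ms Hamming windows with a 10 ms hop, a 1024-point FFT and 40 mel filters); the normalization step and the function name are simplifying assumptions.

```python
import librosa
import numpy as np

def extract_features(path, n_mfcc=13):
    """Sketch of the MFCC extraction pipeline (steps 1-7 above) plus deltas."""
    y, sr = librosa.load(path, sr=None)               # 1. ".wav" file -> float array
    win = int(0.02 * sr)                              # 2. 20 ms frames ...
    hop = int(0.01 * sr)                              #    ... with 10 ms of overlap
    mfcc = librosa.feature.mfcc(                      # 3-6. STFT, power spectrum,
        y=y, sr=sr, n_mfcc=n_mfcc,                    #      40 mel filters, log and
        n_fft=1024, win_length=win, hop_length=hop,   #      DCT of type 2
        window="hamming", n_mels=40)
    mfcc = (mfcc - mfcc.mean(axis=1, keepdims=True)) \
           / mfcc.std(axis=1, keepdims=True)          # 7. normalize the vectors
    delta = librosa.feature.delta(mfcc)               # delta features
    delta2 = librosa.feature.delta(mfcc, order=2)     # double delta features
    return np.vstack([mfcc, delta, delta2])           # (3 * n_mfcc, n_frames)
```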

6.2.2 Data preprocessing and analysis

To use the computed features to train the classifier, we created multiple data sets with different combinations of features. We established 6 possible data sets, combining filter banks or MFCCs in the following cases (a construction sketch follows the list):

1. MFCCs for all frames;

2. MFCCs with delta and double delta for all frames;

3. MFCCs with delta and double delta for the mean and standard deviation of all

frames;

4. MFCCs for the mean and standard deviation of all frames;

5. Filterbanks for all frames;

6. Filterbanks for the mean and standard deviation of all frames.
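A minimal sketch of how the six combinations can be assembled for one recording, assuming the feature matrices produced by the extraction sketch above (one column per frame); the dictionary layout is an illustrative choice, not the project's actual data structure.

```python
import numpy as np

def pool(mat):
    """Collapse all frames into a single mean/std vector per recording."""
    return np.concatenate([mat.mean(axis=1), mat.std(axis=1)])

def build_datasets(mfcc, mfcc_dd, fb):
    """mfcc: (13, n_frames); mfcc_dd: (39, n_frames) with deltas; fb: (40, n_frames)."""
    return {
        1: mfcc.T,         # one sample per frame, 13 MFCCs each
        2: mfcc_dd.T,      # one sample per frame, MFCCs + D + DD (39 features)
        3: pool(mfcc_dd),  # single mean/std vector (78 features)
        4: pool(mfcc),     # single mean/std vector (26 features)
        5: fb.T,           # one sample per frame, 40 filter banks each
        6: pool(fb),       # single mean/std vector (80 features)
    }
```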

Moreover, after defining our data sets, we need to train the model while finding the right balance, without overfitting or underfitting the data. To adjust the relationship between the predictions of our model and the correct values, and to assess the generalization of our model, we need to train the model with a substantial portion of the data set, multiple times and with different train and test set distributions. For this purpose, we apply the following techniques (sketched in the code example after this list):

• Random split

For this option, we use a random split to create both the training and test sets. Note that, with this splitting method, it is more likely to have samples from the same child in both subsets, which can bias our classification process. Thus, randomizing the data might not be enough;


Figure 6.3: Comparative samples from the sustained phonemes /a/, /i/ and /u/, with 40 filter banks and 13 MFCCs: (a, b) FB and MFCCs from a file with label /a/; (c, d) FB and MFCCs from a file with label /i/; (e, f) FB and MFCCs from a file with label /u/.

• One Child Out experiment

Instead of using the previous naive approach, we separate the data set into a specific number of subsets, leaving one child out for the test set. So,


Figure 6.4: Radial visualization of data set 1.

we run n tests, where n is the number of children in our data set. In each test, the test data must only contain the data from one child, while the training set uses the remaining data. With this technique, we ensure that our model is not biased by having samples from the same child in both the train and test sets, since in future predictions it will classify unseen data;

• Cross validation split

For the CV-split, we separate the children using stratified k-folds. Using stratified folds, we ensure that each subset includes an equal proportion of samples from each class. Considering that we have a slight class imbalance, a randomly selected fold might not represent a minority class adequately. Moreover, using the cross-validation technique, we process our training and test sets multiple times, with different distributions.
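The three strategies map directly onto scikit-learn utilities, as in the following sketch; the arrays are dummy stand-ins for our real features, labels and child identifiers.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, LeaveOneGroupOut, StratifiedKFold

# Dummy stand-ins for the real feature matrix, vowel labels and child ids.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 13))                   # 60 samples, 13 MFCCs
y = np.repeat(["a", "i", "u"], 20)              # vowel labels
child_ids = np.tile(np.arange(10), 6)           # 10 children, 6 samples each

# Random split: samples of one child may land in both subsets (possible bias).
random_split = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)

# One Child Out: every fold holds out all the samples of a single child.
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=child_ids):
    pass  # train on X[train_idx], evaluate on the held-out child's samples

# CV-split: stratified folds preserve the class proportions in every subset.
cv_split = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
```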

6.2.3 Data visualization and feature analysis

In order to analyze the problem complexity and verify whether the features' correlation with each label is appropriate for our problem, we tried to visualize our data. Given that our data is multidimensional, we use radial viz, which is a multi-dimensional visualization technique. Figure 6.4 presents the result of this first approach, without any data processing. Our data instances of /a/, /e/, /i/, /o/ and /u/ are plotted relative to the variable anchor points: if they are close to a set of variable anchors, they have higher values for these variables than for the others. However, as we can see in figure 6.4, this projection is neither perceptible nor useful.

Therefore, we apply a methodology of dimensionality reduction and multidimensional data preprocessing using the Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) techniques, and we perform the visualization through scatter plots. Specifically, PCA allows the mapping of the feature information into n principal


Figure 6.5: Comparative dimensionality reduction to two features, with the (a) PCA and (b) LDA techniques.

components that summarize the data representation, where n is the number of components chosen to represent the data. The PCA algorithm uses the training data exclusively, without taking the dependent variable into account. Thus, the components chosen are not suitable for discriminating between the different classes. As we can see in figure 6.5a, this technique induces a mixture of the samples, which is not an adequate representation in this context. On the other hand, LDA is a supervised technique, since it considers the training data with the class labels and tries to maximize the linear separability between classes. With this technique, we obtain a mapping from a 12-dimensional space to a 2-dimensional space, as shown in figure 6.5b. Contrary to the PCA representation, with LDA we can observe a partial separation of the classes into clusters and, given that we are almost able to separate each class, we expect a good accuracy test score.
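A sketch of the two projections with scikit-learn, again using dummy data of the same shape as ours; the plotting details are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 12))                   # dummy 12-MFCC features
y = np.repeat(["a", "i", "u"], 20)              # dummy vowel labels

X_pca = PCA(n_components=2).fit_transform(X)                            # unsupervised
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # supervised

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, Z, title in [(axes[0], X_pca, "PCA"), (axes[1], X_lda, "LDA")]:
    for vowel in np.unique(y):
        mask = y == vowel
        ax.scatter(Z[mask, 0], Z[mask, 1], label=f"/{vowel}/", s=12)
    ax.set_title(title)
axes[0].legend()
plt.show()
```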

6.2.4 Model estimation methodology

Besides preparing our training and test sets, we followed a set of steps until we attained a robust model. The data sets created have a large number of features that represent different aspects of our data. Considering that we have a high dimensionality, we chose algorithms that can work properly in this context. As discussed in the literature review in chapter 2, we use the Quadratic Discriminant Classifier, the SVM classifier with the Gaussian kernel function and the Random Forest. In particular, SVM and RF have hyperparameters that must be tuned to find the optimal parameters and, consequently, the model that best fits our data and maximizes the accuracy test score. Our methodology is as follows (a sketch of the hyperparameter search follows this list):


• Standardize the data;

• Choose the best parameters for each classifier algorithm, if needed. For the SVM with RBF kernel, the corresponding parameters C and gamma were carefully chosen with grid-search cross-validation. In the case of the Random Forest classifier, we tested different combinations of the number of estimators and the maximum depth. This function iterates over the given parameters and indicates, after a cross-validation process, the optimal parameters for our problem;

• Compare the accuracy test score of each classifier, for each data set, combined with the different training and test set techniques;

• Select the best model, i.e., the one with the best accuracy results, which will predict the future samples.
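The hyperparameter search can be sketched with scikit-learn's GridSearchCV as follows; the candidate grids shown are illustrative assumptions, not the exact values searched in the project.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# SVM with RBF kernel: standardize the data, then search C and gamma.
svm_search = GridSearchCV(
    make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    param_grid={"svc__C": [0.1, 1.0, 10.0], "svc__gamma": [0.01, 0.1, 1.0]},
    cv=5)

# Random Forest: search the number of estimators and the maximum depth.
rf_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 200, 500], "max_depth": [4, 8, 16]},
    cv=5)

# After svm_search.fit(X, y), svm_search.best_params_ reports the optimal values.
```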

6.3 Evaluation

6.3.1 Comparison between different classifiers

After defining the model preparation steps, we start our evaluation by analyzing the different classifiers' performance, according to the train and test splitting methods previously mentioned. Figure 6.6 presents this comparison, where the SVM classifier has the highest accuracy test score for all training and test sets. On the other hand, although QDA can create quadratic boundaries in our data, it performs the worst when compared with the remaining classifiers. Concerning the Random Forest classifier, the performance is slightly lower, around 2% for all training and test sets. When comparing the training and test sets, we observe in figure 6.6 that the One Child Out approach has a worse performance, which is due to the fact that it uses one child's data as the test set. Thus, if the chosen child's data represents an outlier, it will negatively influence the classifier results. This behavior is further analyzed in section 6.3.3.

The three classifiers are inherently multiclass. To choose the hyperparameters for the SVM and RF classifiers, we used grid-search with cross-validation, a method that performs an exhaustive search over the parameters provided for our estimators. This method output the values 1.0 and 0.1 as the optimal values for the C and gamma parameters, respectively, for the SVM with RBF kernel. In the case of the Random Forest classifier, we tuned the number of estimators and the maximum depth, for which we obtained the optimal values 200 and 8, respectively. QDA does not need any parameter fine-tuning.

The QDA performance is lower than that of the remaining algorithms. This classifier is based on the assumption that the data follows a Gaussian distribution. In the presence of outliers that do not follow this normal distribution, the model has a worse performance. The performance degrades even more with the One Child Out approach. Additionally, since the model does not have hyperparameters to tune, there is no possibility of adjusting the model to the data characteristics.


Figure 6.6: Classifiers’ performance comparison regarding different train and test split-ting methods.

The RF results from the bagging of multiple decision trees. These trees are created based on sub-optimal splits that are made by introducing randomness. Given the high dimensionality of our data, it is plausible that less relevant features are selected at some splits, and so the result can be slightly worse. The hyperparameters can, to some extent, be tuned to better fit our data and to generate soft linear boundaries at the model's decision surface.

Nevertheless, the SVM with the RBF kernel maximizes the margins and generates a curve as the non-linear boundary, which depends on the values of C and gamma. C is a regularization parameter: lower values make the decision surface smoother, while higher values aim at classifying all training examples correctly. The gamma value, on the other hand, controls how far the influence of a single training example reaches, and therefore how sensitive the model is to outliers: with low values, distant points influence the creation of the decision boundary. With the values 1.0 and 0.1 for C and gamma, respectively, our model produces a high accuracy score without overfitting, while taking a few outliers into account. Since the SVM with RBF kernel produces the best results, we decided to use it throughout the remaining experiments performed in the context of this thesis.

6.3.2 Effect of varying the number of MFCCs

Besides analyzing the classifiers' performance, we need to choose the number of MFCCs to use, as it influences the algorithm's results. In the fundamental concepts chapter 2, particularly in Carvalho et al., the original data set 6.2a was tested with 12 MFCCs. In our classification problem, after testing the range from 5 to 25 coefficients, we reached different results. The extracted MFCCs were combined with the SVM estimator, for each proposed


data set. As we can see in figure 6.7, even starting with 5 MFCCs we obtain results higher than 90%. Starting at 9 MFCCs, the results are higher than or equal to 97%. The increase in score stagnates at 12 MFCCs with 98%. We decided to choose 12 coefficients, a number supported by most of the speech processing literature, which tends to use between 12 and 16 coefficients.

Figure 6.7: Classifier performance for the kernel SVM and the random split, regarding the number of MFCCs and the different data sets.

6.3.3 Effect of varying the train and test sets

From the last two sections, we verified that the kernel SVM produced the highest accuracy test scores, and unanimously higher scores with 16 MFCCs for all the data sets. Thus, in this section, we compare the options described in section 6.2.2, regarding the different train and test split strategies, with the SVM and 16 MFCCs for data sets 1-4.

From the results displayed in figure 6.8, we can observe that data sets 1, 2 and 5 have high values. In the case of the Random split and the CV-split, the score was 98% in all cases, whereas the One Child Out approach performed marginally worse. Additionally, we can verify that the results produced for data sets 4 and 6 were considerably lower, independently of the train and test split method chosen.

6.3.3.1 Random split vs. CV-split conclusions

In the random split approach, we performed 10 iterations, with a shuffled data distribution and a fixed value of 25% for the test samples. Furthermore, we tried to increase the test ratio


Figure 6.8: Classifier performance for the kernel SVM with different data sets.

to 50%, where we obtained similar results, which indicates that our model can be robust even if the training set is smaller. In the CV-split, we divided the training and test sets into five stratified folds, with a shuffled data distribution. We performed these steps for 10 iterations.

Although both methods are relatively similar, the fact that the data in the CV-split was carefully subdivided into stratified folds, where each subset is represented with the same proportion of samples from each class, leads to less bias and less overfitting. Yet, it slightly decreases the accuracy test score.

6.3.3.2 One Child Out approach conclusions

The results with the One Child Out approach are always lower than the other methods' scores. To understand and justify this situation, we analyzed the scores obtained for each child separately, considering that each n-th iteration corresponds to the n-th child held out for the test set. Consequently, iterations with lower scores corresponded to the children with the worst recordings.

These particular recordings barely represent the vowels labeled, and they are also difficult to recognize based on human perception. These cases represent outliers in our original data set and, as observed, negatively influence the final score, computed as the mean of the n iterations. The recording problem is related to the fact that, when we split the sustained vowel samples into 100 ms files, the retained segments from the recordings with less quality were even harder to recognize, given the segment size used.

To improve the results, we could (1) remove these outliers from the original data set or (2) stick to the original data with the 135 samples, as presented in table 6.2a, instead


of including the data set of table 6.2b with the sustained vowels. Still, neither option is good. We know that a robust machine learning model needs a large data set, otherwise the model will not have enough data to learn from. On the other hand, if we remove these few outliers from the data set, our model has no idea how to adjust to such cases and will fail the task when the recorded samples have a lower quality. Considering this, we chose to include these outliers, since training the model with variations in the data set prepares it for future predictions.

6.3.3.3 Comparative analysis between data sets

Each data set has particular characteristics, regarding the features and the number of

features in each data set:

• The number of samples is summarized in table 6.3:

Table 6.3: Number of samples in data sets 1-6.

Vowel   Data sets 1, 2 and 5   Data sets 3, 4 and 6
/a/     222                    48
/e/     149                    27
/i/     215                    46
/o/     148                    27
/u/     215                    46

• Data set 1 only has 13 MFCCs, while data set 2 has 13 MFCCs, 13 Deltas (D) and 13 Double Deltas (DD), which results in 39 features. Since the accuracy test score is the same for both data sets, we conclude that, in this particular case, the extra features, D and DD, do not contribute to increasing the classification results;

• Data sets 3 and 4 contain the same features as data sets 1 and 2, respectively, although we applied the mean and standard deviation over all frames, which reduces the total number of samples for each vowel. When we compare the results shown in figure 6.8, we see that with data set 3 the model achieved better results than with data set 4. This suggests that the derivative features of the MFCCs provided important information, which increased the accuracy test scores by 10%;

• Data set 5 has 40 FB, while data set 6 has 40 FB with the frames reduced to their mean and standard deviation. Thus, the number of samples for each vowel is lower than the total size of data set 5. As we can see in figure 6.9, the feature distribution and shape of each data set is the same, although the number of samples is significantly lower in data set 6. This difference is reflected in the classification results; hence, the lower number of samples in data set 6 drastically penalizes the final score;


Figure 6.9: Comparative feature distribution with radial visualization for each data set with FB: (a) data set 5 and (b) data set 6.

• Data sets 1 and 5 include MFCCs and FB, respectively, which clearly does not change the score. We know that these features provide different detailed information, since the FB does not include the step with the DCT function, where the data is decorrelated and converted to the chosen coefficients. Nevertheless, given that the results with the MFCCs are already high, there is no need to seek further information.

Analyzing figure 6.8, we can conclude that the data sets with the smaller number of samples, namely sets 3, 4 and 6, performed drastically worse, independently of the features. Note that the mean and standard deviation may also remove relevant information from each frame, causing a negative influence on the model's learning process. The confusion matrices (CMs), in figure 6.10, show the prediction results for our classification problem. The information in the CMs of data sets 3, 4 and 6 indicates that the model failed to recognize the underlying trend in the corresponding data.

On the other hand, the results concerning data sets 1, 2 and 5 show that our model is highly effective regardless of the features computed. The extended features, delta and double delta, included in data set 2 do not seem to add value to the model's predictions. We draw the same conclusion for the FB features included in data set 5, since the result is similar. Moreover, if we recall the information analyzed in section 6.3.2, we observe that the results are higher than 95% starting from coefficient 9, which suggests that some of the features up to this coefficient can already perform the task with success.

6.4 Conclusions

In this chapter we presented the model building process used to develop an automatic vowel recognition system, with the purpose of identifying the sustained vowels that children produce while they play the game. During this process, we started by analyzing our data, given that the overall amount of data is restricted to 48 children and


Figure 6.10: Vowel detection confusion matrices for (a-f) data sets 1 to 6.

the number of recordings of /e/ and /o/ is about half the number of the remaining classes. Moreover, part of the samples had just 100 ms, which includes a small amount of information about the corresponding vowel. Therefore, we sought to extract features that represent our data well enough to distinguish the classes of interest, and so we prepared the training of 6 different solutions. As we concluded in the last subsection, part of the features extracted are less significant for this classification problem, and data set 1 is adequate and sufficient to train our model.

Additionally, we presented the results of each classifier. QDA produced results lower than 94%. The Random Forest classifier achieved a maximum accuracy test score of 97%, for which we used the optimal values of 200 estimators and a maximum depth of 8. The SVM algorithm with RBF kernel produced accuracy test scores of 98%, with the hyperparameter C equal to 1 and gamma with the value of 0.1. As we previously concluded, the SVM with RBF kernel obtained the best accuracy test results, independently of the train and test sets.

At the same time, each classifier was tested varying the number of MFCCs (in the case of data sets 1-4) and considering the different train and test splitting methods. Concerning the number of MFCCs, even with a reduced number of features it is possible to achieve accuracy test results higher than 90%. After testing from 5 to 25 MFCCs, we verified that 9 MFCCs and 12 MFCCs produced 97% and 98%, respectively, with the Random split technique and for data sets 1, 2 and 5. After the 12th coefficient, the accuracy test score


stagnates. Thus, 12 MFCCs were chosen as a component of our final solution.

Moreover, if we considered exclusively the accuracy test score achieved with each train and test split method, we would choose the Random split. With this technique, we used different test sizes, from 25 to 50%, with which the algorithm maintained the high accuracy score of 98%. Therefore, this method does not overfit our data set and provides low bias. On the other hand, we experimented with the CV-split, which produces good results as well and performs more iterations, over the 5 stratified folds created in each iteration, which ensures the same class distribution in each fold. Nevertheless, neither method prevents samples from the same child from appearing in both the train and test sets, and so they bias the model.

This problem is solved with the One Child Out approach, although it produced lower results, around 2% below what was observed in the previous experiments, which is related to the quality of the samples used in the test set. The lack of quality is due to the fact that the sustained vowel recordings contain silence zones, low volume segments or noise. The segments with these problems were causing incorrect classifications and so had a negative influence on the final score. We could remove these outliers and then produce a score as high as with the previous techniques. However, these cases allow the model to capture small variations in the data set for future predictions, considering that we will certainly have future recordings with noise and children's productions with a lower volume.

Therefore, we created our best model according to our previous conclusions. The SVM is the classifier chosen, with the parameter C equal to 1.0 and gamma equal to 0.1. For the train and test split method, we opted for the One Child Out approach. Lastly, we chose data set 1, with 12 MFCCs, considering that it is adequate to generate a high accuracy test score.
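Putting these conclusions together, the final classifier can be sketched as a scikit-learn pipeline; the variable names are illustrative and the training data is assumed to be the 12-MFCC feature matrix of data set 1.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Final configuration: data set 1 with 12 MFCCs, standardized features, and an
# SVM with RBF kernel (C = 1.0, gamma = 0.1), validated with One Child Out.
final_model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma=0.1))
# final_model.fit(X_train, y_train); final_model.predict(new_frames)
```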


Chapter 7

Feedback and Validation

Before the proposed game can be used to complement speech therapy sessions, it is important to receive the SLPs' feedback and test the implemented functionalities with the target audiences. During the platform development, we received feedback from therapists, students and teachers to whom we were able to demonstrate our platform.

Besides that, after the game development process, to evaluate whether the functionalities fulfill the game's objectives, we applied user tests with children. During the course of this chapter, we present the experiment procedure and the conclusions drawn from the results observed. Moreover, in order to collect information regarding the SLPs' opinion, we also applied questionnaires to them. The questionnaires' results are further analyzed. Lastly, we draw the overall conclusions from the application of both validation methodologies.

7.1 Feedback from SLP(s) and heterogeneous audiences

During the development of our solution, we were able to demonstrate the game at Expo FCT NOVA 2018 and at the European Congress of Speech and Language Therapy 2018. Each event focused on a different target audience, which allowed us to receive diversified feedback from both groups. Expo FCT NOVA 2018 is an open day at FCT NOVA which receives high school students and their teachers from all over the country, to see what each department offers, with demonstrations of the work developed at the university. Therefore, we were there to demonstrate the BioVisualSpeech project, which includes the work performed in this dissertation. At the time of the demonstration, the game platform was already developed with the main scenes and functionalities, but without the ASR system. The players, both students and teachers, produced sustained phonemes that had to stand within the defined intensity thresholds. We received their feedback with


respect to the interface intuitiveness and usability, as well as the game functionalities.

Figure 7.1: Game presentation at the European Congress of Speech and Language Therapy, May 2018.

Besides the demonstration described previously, the BioVisualSpeech team members were present at the 10th CPLOL Congress - European Congress of Speech and Language Therapy in Estoril, Portugal. The congress involved the participation of researchers and SLPs from all over the world, to whom we were able to demonstrate our game, as illustrated in figure 7.1. At the time of that demonstration, the complete game as presented in this dissertation was not finished: we used a simpler version in which only the ASR system was not yet integrated with the game platform. The participants practiced the SVE and tested the main functionalities, to which they gave their total approval. In addition to the feedback received during those events, we received professional opinions from voice disorders specialists.

7.2 Validation

After the game development, it is essential to perform an evaluation of the platform to ensure that the game fulfills the proposed objectives. Therefore, we performed two types of evaluation: user testing sessions with children and questionnaires with SLPs.

In this study, we intend to analyze the following aspects:

• The feasibility and intuitiveness of the platform and its adaptability to different children's needs;

• The software design, specifically the scenarios, scene transitions, characters and rewards;

• The software functionalities;

• The impact of the gamification strategy;

• The children’s opinion regarding the user tests;

• The SLPs’s opinion according to the information collected through the question-

naires.


7.2.1 User testing sessions

In order to ensure the quality of our system, we deployed a testing methodology based on field experiments with children. This study was not designed to determine whether the game platform improved the children's conditions. Instead, we evaluated whether the game is appealing, intuitive and feasible for the children.

7.2.1.1 Participants

This experiment took place in the nursery school Alfredo da Mota, in Castelo Branco. The program included the SLP from the institution and 14 children between 4 and 5 years old. As we can see in figure 7.2, most of the children are 5 years old and female.

Figure 7.2: Basic information regarding the participants: (a) age of the children, (b) gender of the children, and (c) number of children with dysphonia.

Concerning the children's voice analysis, we had three 5-year-old participants with diagnosed dysphonia. To further identify and analyze the results of each child, we numbered them. Child 14 presents a rough voice, child 3 has asthenia (weak voice) and child 9 has abnormal loudness/unsteady volume. The SLP performed the auditory-perceptual quality evaluation of these children.

7.2.1.2 Experiment procedure

We performed the user test sessions individually, for each child, in a quiet room, in the presence of the SLP. The equipment used is illustrated in figure 7.3.

The children were subjected to the same conditions, with the following steps:

1. We introduced the idea of the game to the child: a journey with different characters where they can be rewarded with gifts if they conclude the tasks.

2. The SLP asks the child to perform the SVE for as long as possible. During the child's production, we recorded the sound and measured the MPT achieved. For the recording we used Audacity 2.3.0. At this point, the children did not see the game;


Figure 7.3: The setup used for the recordings.

3. The SLP starts the game, inserts the basic information concerning the child and chooses the scenes and the exercise parameters;

4. Afterwards, we show the game to the child, allowing her to choose the character and the scenario;

5. When the child enters the scene, the SLP tells her to produce the chosen vowel until the character reaches the target. In this way, the child practices the SVE;

6. At the end of the trial, we measure the MPT achieved and the child chooses a reward;

7. We repeat the process one more time for each child, in order to register two performances during the in-game experience. For the next run, we update the exercise parameters in step 3 if the child performed the exercise without breaks. Otherwise, we restart the process from step 4.

For the first game trial, the therapist adds 4 s¹ to the expected MPT, according to the MPT obtained in the recording. For the second game trial, if the child performed the exercise with success, the therapist increments the expected MPT by another 4 s¹, up to the maximum (10 s). Otherwise, she maintains the previous parametrization.

Note that, for the recording and the first game trial, the SLP instructed the child to produce the vowel /a/. In the second game trial, she asked the child to choose another vowel. We start with the vowel /a/ since the SLP considered /a/ the easiest vowel with which to perform the SVE.

7.2.1.3 Experiment results

Figure 7.4 presents the results of children 1 to 14. If we compare each child's performance, we realize that most of the children improved progressively until trial 2. During the first recording, they were less motivated and consequently achieved a lower MPT

¹ Note that, for a child with dysphonia, the increment is just 2 s.


Figure 7.4: Children’s performance during the experiment.

value. Afterwards, when we presented them the game, they showed interest in finishing the exercise to prevent the character from stopping and starting to fall. The children even understood that they had not performed the task well when the character stopped, and asked to repeat it. As verified, this interactive environment prompts them to give their best. For example, children 9 and 10 produced short MPT values in the first recording and increased drastically during the following performances.

In chapter 2, table 2.2, we present the normative values for children without voice pathologies. From 4 to 6 years old, the MPT normative values stand at 6.12 +/- 1.89 seconds. Regarding this information, as we verify in the figure, children without diagnosed dysphonia had no problems exceeding this value, and most of them were able to perform the maximum requested. However, children 3, 9 and 14 had weak performances, with MPTs lower than the normative value for their age.

For instance, child 3’ performance is lower or equals than 3 seconds. After the record-

ing trial, the SLP parametrize the exercise MPT with 4 seconds. During child 3’s attempt

at performing the task, she was not increasing the MPT value. After two trials without

achieving the goal, we asked her if the trial was being hard, and she agreed. Consequently,

the SLP decreased the MPT parameter to 2 seconds, with which she was able to conclude


the task with success. In this situation, we used the manual parametrization option to adjust the game's difficulty to the child's capacities. With an automatic parametrization, this would not have been necessary. In the case of child 9's performance, while she was trying to reach the target with the requested MPT, she was also dealing with an unsteady loudness. On the other hand, child 14 has a rough voice, which made it difficult for him to perform the task.

With respect to the game’ UI elements, the children choose the character most similar

to them. On the other hand, the rewards’ choice was random. In the case of the SVE

scenarios, the approval was unanimously. Overall, they asked to keep playing and to

conquer more rewards.

7.2.1.4 User tests conclusions

• In most cases, children improved their MPT immediately, as soon as they started the game trials, which indicates that the game UI elements, with the character moving towards a target, motivate them to undertake the task with commitment;

• At the end of each user test, the children's feedback was positive and unanimous. We concluded that the UI elements suit our purpose. The children wanted to keep on playing to conquer rewards and play with different scenarios;

• A key aspect of keeping the child engaged in playing is to provide challenges according to her needs. In the case of child 3, her experience was moving her out of the flow channel, which can trigger feelings of frustration. Thus, manual or automatic parametrization in the game proved to be an essential functionality. Moreover, automatic parametrization allows the child to play the game at home without going through this problem;

• Children with dysphonia struggled more to achieve the requested MPT than children without this pathology. Child 14 increased his performance in trial 2, and child 9 increased hers after the recording task. Nevertheless, child 3 needed to stabilize the MPT value at 2 seconds before the increment. Otherwise, she would be frustrated.

7.2.2 Questionnaire to SLPs

The questionnaire focuses on the evaluation of the following aspects:

• Interaction of the child with the game, in which the child practices the exercises

with difficulty parameterized manually:

1. Maximum phonation time predefined in the therapist’s area;

2. Predefined intensity interval for voice pathologies intervention;

3. Control the character with the keys A (forward), S (stop + fall).


Figure 7.5: Results regarding the SLPs' and children's interactions with the game platform: (a) answers to question Q1; (b) answers to question Q2.

• Interaction of the child with the character and the reward system to assess whether

the interactive elements of the game capture the child’s interest;

• Integration of scenarios, scenario transition, and child theme adequacy;

• Usability, clarity, and goals of the game;

• Global game feedback.

7.2.2.1 Participants

The questionnaire involved the participation of three SLPs, one of whom is specialized in voice disorders. Before answering the questionnaire, the SLPs tested the platform with at least one patient, a child between 4 and 10 years old with speech sound disorders.

7.2.2.2 Software design: scenarios and transitions

In order to evaluate the game’s scenarios and transitions between scenes, we prepared the

following questions:

• Q1: (Concerning interaction with the therapist) From 1 to 5 how clear and practical

is the game?

• Q2: (Concerning interaction with the child) From 1 to 5 how clear and practical is

the game?

• Q3: Do you consider the transition of scenarios clear and objective, from the initial

page toward the exercise page?

• Q4: Do you consider the scenarios appealing and suitable for children?

• Q5: Do you consider the scenarios inappropriate for the SVE?

• Q6: Are there any scenarios in the therapist settings that are not appropriate?

Through the questions’ results shown in figures 7.5a and 7.5b, we can understand if

the platform structure and design fits our goals. The SLPs considered, unanimously, that


Figure 7.6: Answers regarding the question Q10.

the game transitions are intuitive and appealing for children, and none of them rejected the scenes of the exercise page. Besides, they considered that the set of operations for the therapist is adequate and useful, and so we consider these UI elements appropriate for therapists and children.

7.2.2.3 Software design: Characters, rewards and other interactive UI elements

In order to evaluate the game’s UI elements, we introduced the following questions:

• Q7: Do you consider the rewards system to be capable of engaging the child in the game?

• Q8: Has the child ever been disinterested in the presented rewards?

• Q9: Do the characters’ variety, with different ethnicities for each gender, contribute

to the child’s satisfaction?

Considering both characters’ and rewards’ diversity, we can suit different children

tastes and, as confirmed by the SLPs, it contributes to the child’s satisfaction. Specifically,

the rewards stimulate and involve the children in the task, boosting their therapeutic

effects while they keep on playing.

• Q10: Did you find the visual cues in BioVisualSpeech adequate? (E.g.: the character stops and begins to fall when the child produces the sound with a different intensity than the requested one.)

Nevertheless, according to the answers presented in chart 7.6 about the character's movement, one of the three SLPs acknowledged that the falling of the figure does not suit the theme of specific scenarios. In an open question, the same therapist proposed an adaptation of the movement for different scenes. In the case of the train scene, the SLP suggested a backward movement to substitute the fall of the figure, although the fall is still appropriate in the bird and desert scenes.


Figure 7.7: Answers about questions (a) Q13 and (b) Q14.

These results oppose the feedback we received during the game development process. We tried to find a visual cue that would be clear enough to represent the child's wrong behavior. According to the feedback received, the movement should be consistent across all the scenarios for the sake of intuitiveness.

7.2.2.4 Software functionalities

Besides the platform design, we will further detail the results concerning the game functionalities:

• Q11: Did you find the therapist's environment for adding the child's basic information to have appropriate fields (name, age, gender, and pathology description)?

• Q12: Did you consider the therapist's editing parameters appropriate for therapy with children (expected sound intensity - low, medium or high -, MPT and the intensity range expected for the manual parametrization of difficulty)?

Regarding the child’s basic information and the treatment parameters, the SLPs col-

lectively agree that the fields are relevant to complement the therapy task. On the other

hand, concerning the editable parameters within the treatment, the therapist of voice

disorders suggests the addition of the pitch parameter.

• Q13: In the manual parametrization state, did you have to change the parameters

in therapy until the child can complete the exercise without stopping?

• Q14: In the manual parametrization state, did you have to change the parameters

in therapy at least once so that the child would feel challenged to complete the

exercise?

• Q15: In the manual parameterization state, did you have to use the A key so the

child could complete the exercise?

• Q16: (In case of using manual control, keys A and S) Did the child realize that the

therapist could control the characters?


Figure 7.8: Answers about the question Q15.

According to the manual parametrization questions, we see from charts 7.7a and 7.7b that the possibility of parametrizing the variables in therapy is essential, given that children may have different capabilities and the game should fit them. On the other hand, none of the SLPs had to increase the game difficulty to keep the child challenged. Note that the difficulty parametrization depends on the progress of the player. Thus, during one or two trials the child may not yet be comfortable with the obstacles in the game, and so we cannot collect enough data to draw reliable conclusions.

Observing chart 7.8, one SLP needed to use the keys A and S to manipulate the character's movement. Consensually, the child did not realize when the therapist used them.

• Q17: Did the child improve his or her performance in the exercises thanks to the feedback given by the movement of the character (moving forward, stopping or falling)?

• Q18: From 1 to 5, how difficult was it for the child to perform the exercise successfully?

• Q19: From 1 to 5, how do you rate the use of this game to captivate the child's interest in therapy?

• Q20: From 1 to 5, how useful do you consider the use of this platform for voice therapy with children?

Overall, the results showed that the children improved their performance in response to the movement of the character, which can capture the children's interest. Moreover, we can observe that in the evaluation scenarios of two SLPs the children considered the task of medium difficulty, while in the remaining one the child considered it hard.

7.2.3 Validation conclusions

From the previous assessments, we can draw the following conclusions:

• The software UI elements are adequate and appealing for children;


• The professionals unanimously agreed that the parametrizable functionalities are beneficial and useful for promoting a healthy and challenging therapy environment;

• Besides the exercise parameters, MPT and loudness, we received the suggestion to add the pitch variable, given its equal importance in the treatment of voice disorders;

• In specific scenarios, the visual cue representing the child's error is not adequate. However, establishing different character responses according to each scenario's theme may affect the intuitiveness of the exercise, since such user feedback introduces non-consistent behaviors;

• In both validation methods, we verified that the SVE with the character moving towards the target challenges the children to increase their performance and thus improve their therapy results;

• The gamification strategy with rewards and the game's integration in a childhood theme stimulate the children to keep on playing with different scenarios to conquer new rewards;

• In regard to the questionnaires, the SLPs considered this game a valuable speech

therapy tool to complement therapy sessions with children.


Chapter 8

Conclusion and future work

8.1 Conclusion

With the current advances in technology, computer-based therapy games have shown a high potential to help deliver therapy to children in a sophisticated and encouraging way. These computer-assisted technologies must be designed to serve not only a therapeutic purpose but also to be appealing and appropriate to the children's cognitive and emotional abilities, as well as their age, gender and culture. Furthermore, these platforms might increase the accessibility of speech therapy services for children with SSDs, specifically voice disorders, and extend the time that they spend practicing the addressed therapy tasks. The development of a speech therapy tool is a long and complex journey inside the speech therapy field, where we can easily lose our way without proper guidance from the SLPs. Thus, the development of these solutions requires an interdisciplinary approach between computer engineers and clinicians.

Therefore, in this dissertation, we propose a computer platform that unites game-like features for children with a beneficial therapy exercise focused on voice disorders, the SVE. In order to perform the exercise, the child must produce the sustained vowel /a/, /e/, /i/, /o/ or /u/ from the European Portuguese language. Meanwhile, the child's utterances must stand between the thresholds associated with the variables in therapy. The movement of a character towards a specified target in the scene represents the platform's feedback on the child's performance. Thus, when the child correctly performs the vowel, the character moves to the right. Otherwise, the character stops and starts falling until the system receives a correct production.

Different children can have different levels of a specific pathology, and children of different ages achieve singular performances during their in-game experience. Therefore, it is desirable that therapy games can be adjusted to each child's situation. In


In response to their needs, we provide the possibility of manually adapting the difficulty, by allowing the SLP to choose the desired maximum phonation time and intensity level. Therefore, the SLP can customize the variables' thresholds according to the child's needs.
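As an illustration, this per-child parametrization can be viewed as a small configuration record; the field names and example values below are hypothetical, not the ones used in the platform.

    # Hypothetical sketch of the manual parametrization set by the SLP.
    from dataclasses import dataclass

    @dataclass
    class TherapyParameters:
        max_phonation_time: float   # target MPT, in seconds
        min_intensity_db: float     # lower loudness threshold, in dB
        max_intensity_db: float     # upper loudness threshold, in dB

    # Example: a profile the therapist might choose for a beginner.
    params = TherapyParameters(5.0, 55.0, 75.0)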

Nevertheless, if the child is practicing the therapy task at home, this static parametrization model cannot deal with the child's improvements or struggles, since it does not implement a real-time adaptation. Therefore, to extend our parametrization functionality to an effective intervention at home, we introduce a novel dynamic difficulty adjustment (DDA) model that evaluates the child's performance in real time and changes the state of the game variables dynamically. This model relies on the basic principles of the flow model: it measures performance based on specific parameters of interest to speech therapy and allows the fluctuation between tense and release moments, keeping the child engaged and playing while she increases her skills. This technique allows different children to practice repetitive therapy exercises within a flow channel that challenges them, and supports the creation of a fun and relaxing environment.
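A minimal sketch of such a flow-style update is given below; the thresholds, step size, and bounds are illustrative assumptions, not the values used by our model.

    # Sketch of a flow-style DDA step: raise the difficulty when the
    # child performs well (tension), lower it when she struggles
    # (release). All constants are illustrative only.
    def adjust_difficulty(target_mpt, performance,
                          step=0.5, low=0.4, high=0.8,
                          mpt_min=2.0, mpt_max=15.0):
        """performance is a per-trial score in [0, 1]; target_mpt is the
        current maximum phonation time goal, in seconds."""
        if performance >= high:
            target_mpt += step      # challenge the child a bit more
        elif performance <= low:
            target_mpt -= step      # release the pressure
        return min(max(target_mpt, mpt_min), mpt_max)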

Besides this novel DDA scheme, we also implemented an ASR system for vowel recognition. This system identifies the vocal utterances produced during the in-game experience. With this functionality, the game is controlled with the child's voice and can be used outside the traditional therapy session, without the professional's supervision to validate the child's utterances. To create the optimal model for our data, we tested different model estimation methodologies. We built data sets with distinct feature combinations, arranged in different train and test sets, as input to different classification algorithms. In the end, our best model achieves a test accuracy of 96%. This solution results from the data set with 12 MFCCs, to which we applied the One Child Out approach to prepare the data to train the SVM classifier with the RBF kernel.
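This pipeline can be sketched with librosa and scikit-learn, the libraries used in this work; the data loading, the averaging of the MFCCs over time, and the placeholder arrays are assumptions made for the sake of a self-contained example.

    # Sketch of the winning pipeline: 12 MFCCs per utterance, a One
    # Child Out split (each child's recordings held out in turn), and
    # an SVM with an RBF kernel. Placeholder data stands in for the
    # real corpus.
    import numpy as np
    import librosa
    from sklearn.svm import SVC
    from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

    def mfcc_features(path, n_mfcc=12):
        """One feature vector per recording: 12 MFCCs averaged over time."""
        y, sr = librosa.load(path, sr=None)
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)

    # X: one MFCC vector per recording; y: vowel labels; groups: child ids.
    X = np.random.randn(50, 12)
    y = np.random.choice(list("aeiou"), size=50)
    groups = np.repeat(np.arange(10), 5)

    clf = SVC(kernel="rbf")
    scores = cross_val_score(clf, X, y, cv=LeaveOneGroupOut(), groups=groups)
    print("mean One Child Out accuracy: %.2f" % scores.mean())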

This solution does not reflect the maximum score achieved during the evaluation phase, due to the lack of quality of a few segments of our data recordings. However, these outliers support the creation of a robust algorithm considering that, in future predictions, we will certainly receive samples with noise or low volume. In this way, we are preparing our model to deal with small variations in the data set and to improve future predictions. Furthermore, even if the system produces a wrong classification for a correct child production, given the nature of the exercise (repeatedly sustaining the vowel) and the low error ratio of our model, that mistake should not have a noticeable consequence in the user feedback.

With both the DDA and ASR models, we can provide a fulfilling experience that ensures that the child's sustained vowels comply with the therapist's previous parametrization; consequently, the child cannot perform the long task inefficiently and undermine the treatment's progress. This system requires extensive background processing. Thus, to load the game on low-end devices, we created a client-server architecture: we forward the complex ASR computation to the server side, while the client displays the game platform.
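This split can be sketched as a small HTTP service; Flask and the /recognize endpoint are assumptions for illustration, as the dissertation does not prescribe a specific framework.

    # Server side: receives raw audio from the low-end client and runs
    # the heavy ASR step. Flask and the endpoint name are assumptions.
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    def run_asr(audio_bytes):
        # Placeholder for the server-side MFCC + SVM recognition pipeline.
        return "a"

    @app.route("/recognize", methods=["POST"])
    def recognize():
        audio_bytes = request.data          # raw utterance from the client
        vowel = run_asr(audio_bytes)        # heavy computation, server side
        return jsonify({"vowel": vowel})

    # Client side (e.g., with requests):
    #   r = requests.post("http://server:5000/recognize", data=wav_bytes)
    #   r.json()["vowel"]

    if __name__ == "__main__":
        app.run()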

Moreover, we focused on the validation of another main component of this project:


the game UI for children and their SLPs, the target audiences of this game. This validation process started with user tests with children and questionnaires with therapists. Overall, according to the validation results, the children improved their performances in response to the game feedback. Specifically, the character movement proved to be an easily interpretable visual cue: when the character stops, the children realize they failed to move the character without breaks as requested, and so they seek to improve in their next trial. Moreover, throughout the user testing session, the children's feedback was positive. We verified the success of our gamification strategy, confirming that it keeps children motivated, since they asked to keep playing, to collect rewards and to try different scenarios. Finally, in the SLPs' opinion, the game was considered a valuable speech therapy tool to integrate in therapy sessions for children with dysphonia.

Given the previous conclusions, we can highlight the following contributions:

• A game platform with the SVE for children;

• A gamification strategy with multiple scenes, characters and rewards;

• A novel DDA model for children with dysphonia;

• An ASR system for vowels from the EP language;

• A validation process with children and therapists.

There are also a few limitations of our game platform that must be taken into

account:

The development of a speech therapy tool is a complex journey. Numerous exercises can be implemented for each SSD, although not all of them can be designed in a computer-based format. For example, dysphonia treatment is not efficient with the SVE practice alone: in many dysphonia situations, besides the SVE, other voice-producing mechanisms, including phonation, respiration, and musculoskeletal function, must be trained for a healthy vocal production. Thus, a limitation of our system concerns the small set of exercises that our platform offers, and therefore its utility across multiple applications.

Moreover, in the current implementation, the global game data is saved locally on the client device. So, if the game is set up on the therapist's device, the child cannot perform the tasks at home. On the other hand, if the game is installed on the child's device, during the in-game experience both the child and the parents may enter the therapist mode and easily change the treatment parameters without the SLP's approval.

8.2 Future work

The area of therapeutic computer-based games is promising, despite the need for rigorous

outcome studies and applications. Besides the current functionalities of our tool, we can


introduce new features, given the potential of the SVE. The proposed solution uses the sound parameters maximum phonation time and vocal intensity, which are the most relevant variables for the treatment of many dysphonia cases. However, there are plenty of other sound features that may be extracted, analyzed and carefully included in a future extension of the model. The right correlation between all variables must be ensured, as well as the evolution of an efficient and enthusiastic learning curve for the child in each of them, independently.

Ensuring that the player's experience stays within the flow channel involves a complex testing process. Our validation approach must include a long-term experiment of the gameplay. As future work, we should create a rigorous experimental design with control groups and multiple baseline groups, to verify the efficacy of the different platform functionalities.
