
© 2011 The University of Sheffield

Developmental Speech Recognition, Bielefeld: 17-18 Feb. 2011 slide 1

Discovering the Particulate Structure of Speech

Prof. Roger K. Moore

Chair of Spoken Language Processing

Dept. Computer Science, University of Sheffield, UK

(Visiting Prof., Dept. Phonetics, University College London)

(Visiting Prof., Bristol Robotics Laboratory)


Overview

• Human versus machine speech recognition

• Developmentally-inspired ASR

• Research conducted in the EU-FP6 ACORNS FET project

• The particulate structure of speech

• Phylogenetic and ontogenetic perspectives

• The role of the production system

• Relevant research at USFD


Human SR vs. Machine SR

[Figure: word error rate (%, log scale from 0.001 to 100) for ASR and for human listeners on six tasks: Connected Digits, Alphabet Letters, Resource Management, Wall Street Journal, Business News, Switchboard.]

Taken from Lippmann, R. P. (1997). Speech recognition by machines and humans. Speech Communication, 22, 1-16.


Human SR vs. Machine SR

• What’s going on here?

• The definition of ‘recognition’ in machine SR is fundamentally correct …
– “the most likely explanation of the incoming data given a model of how it was produced”

• Any shortfalls in performance must therefore be due to ...
– insufficient fidelity of the data
– having the wrong model

• ASR researchers have been investigating both for ~60 years
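
One common reading of "the most likely explanation of the incoming data given a model of how it was produced" is the rule: choose the hypothesis W that maximises P(X | W) P(W). A minimal sketch of that rule follows, assuming a toy two-word vocabulary; the function name and all probability values are hypothetical and are not taken from the talk.

```python
import math

def most_likely_explanation(log_likelihood, log_prior):
    """Return the hypothesis W maximising log P(X|W) + log P(W)."""
    return max(log_likelihood, key=lambda w: log_likelihood[w] + log_prior[w])

# Made-up scores for a single observed utterance X.
log_likelihood = {"duck": math.log(0.02), "book": math.log(0.05)}  # model of the data, P(X|W)
log_prior = {"duck": math.log(0.9), "book": math.log(0.1)}         # prior over words, P(W)

# "book" fits the data better, but the prior favours "duck" strongly enough to win.
best = most_likely_explanation(log_likelihood, log_prior)
```

Under this reading, the two sources of shortfall listed above map onto the two inputs: poor data degrades the observation X, and a wrong model distorts P(X | W) or P(W).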


Human SR vs. Machine SR

[Figure: word error rate (%, 0 to 70) against hours (log scale, 0 to 10,000,000) for three conditions: Supervised, Unsupervised, and Unsupervised (reduced LM training), with a ‘Human’ reference. Annotations: 2 year-old, 10 year-old, 80 year-old, >70 lifetimes.]

Moore, R. K. (2003). A comparison of the data requirements of automatic speech recognition systems and human listeners, EUROSPEECH03. Geneva.


Human SR vs. Machine SR

• What’s going on here?

• From an ML perspective …

– wrong type of data?

– underusing the data?

– lack of suitable priors?

• Answer = all three!


Human SR vs. Machine SR

             Human                             Machine
Learning     incremental                       one-shot
Context      rich (situated & embodied)        poor (domain-specific)
Style        conversational & communicative    formal & performed
Priors       acquisition device                AM & LM structure
Structure    constructed                       calibrated
Memory       dynamic (episodic & semantic)     static (probabilistic)


Developmentally-Inspired ASR

• These key differences have inspired a number of investigations into the possibility of an artificial embodied agent acquiring spoken language through incremental learning in a situated environment

• The classic study was published by Deb Roy in 1998

• In December 2006 the EU funded a 3-year Future and Emerging Technology project called ‘ACORNS’ (Acquisition of COmmunication and RecogNition Skills)

http://www.acorns-project.org/


‘Little ACORNS’ (LA)


ACORNS Memory Architecture

ten Bosch, L., Van hamme, H., Boves, L., & Moore, R. K. (2009). A computational model of language acquisition: the emergence of words. Fundamenta Informaticae, 90, 229-249.


ACORNS Pattern Discovery Algorithms

Acoustic DP-Ngrams


Acoustic DP-Ngrams

Aimetti, G., & Moore, R. K. (2009). Discovering keywords from cross-modal input: ecological vs. engineering methods for enhancing acoustic repetitions, INTERSPEECH. Brighton, UK.


Acoustic DP-Ngrams

[Figure: the utterance “Ewan sits on the couch” plotted against the utterance “Ewan is shy”, with a colour scale from 0 to 15.]

Aimetti, G., & Moore, R. K. (2009). Discovering keywords from cross-modal input: ecological vs. engineering methods for enhancing acoustic repetitions, INTERSPEECH. Brighton, UK.
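
The idea of using dynamic programming to find a stretch that two utterances share can be sketched with a generic local-alignment recurrence (Smith-Waterman style). This is not the acoustic DP-Ngram algorithm of the cited paper: the function name, the scoring values, and the use of characters in place of acoustic feature frames are all assumptions made for illustration.

```python
def longest_shared_stretch(a, b, match=2, mismatch=-1, gap=-1):
    """Return (score, fragment_of_a, fragment_of_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_cell = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at zero.
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_cell = score[i][j], (i, j)
    # Trace back from the best cell until the accumulated score drops to zero.
    end_i, end_j = best_cell
    i, j = best_cell
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, a[i:end_i], b[j:end_j]

# Orthographic stand-ins for the two utterances in the figure.
result = longest_shared_stretch("ewan sits on the couch", "ewan is shy")
```

With real speech, the equality test would be replaced by a distance between feature frames, which is one reason the sketch should be read as an analogy only.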


Episodic Traces

[Figure: dendrogram of exemplar units within internal class DUCK. Horizontal axis: exemplar index (30 exemplars); vertical axis: min-cost distance (0 to 30). Clusters labelled “duck”, “theduck”, “the” and “is”.]

Exemplar Units


Pattern Discovery (after 100 utterances)

‘objects’ emerging from audio-visual pattern discovery:

“nappy”, “book”, “shoe”, “bath”, “daddy”, “car”, “telephone”, “mummy”, “Ewan”, “bottle”


Word Recognition


Epigenetic Landscape

Aimetti, G., ten Bosch, L., & Moore, R. K. (2009). Modelling early language acquisition with a dynamic systems perspective, 9th Int. Conf. on Epigenetic Robotics. Venice.


Effect of Fetal Hearing

Aimetti, G., & Moore, R. K. (2009). Discovering keywords from cross-modal input: ecological vs. engineering methods for enhancing acoustic repetitions, INTERSPEECH. Brighton, UK.


Whole Words → Sub-Words

Time-frequency ‘patches’ derived using ‘non-negative matrix factorisation’ (NMF)

Van Segbroeck, M., & Van hamme, H. (2009). Unsupervised learning of time-frequency patches as a noise-robust representation of speech. Speech Communication, 51(11), 1124-1138.
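
Non-negative matrix factorisation approximates a non-negative matrix V as a product W H of two smaller non-negative matrices, so that the columns of W act as additive ‘parts’. The sketch below uses the Euclidean-cost multiplicative updates of Lee and Seung on a tiny made-up matrix; the cost function, the matrix, and all names are assumptions and may differ from what the cited work uses.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factorise non-negative v (n x m) as w (n x rank) times h (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h

def reconstruction_error(v, w, h):
    """Sum of squared differences between v and w times h."""
    approx = matmul(w, h)
    return sum((x - y) ** 2 for rv, ra in zip(v, approx) for x, y in zip(rv, ra))

# Toy 'spectrogram': 4 frequency bands x 6 time frames, built from two patterns.
spectrogram = [
    [1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 1],
    [0, 0, 2, 2, 0, 0],
    [0, 0, 2, 2, 0, 0],
]
parts, activations = nmf(spectrogram, rank=2)
error = reconstruction_error(spectrogram, parts, activations)
```

Because every factor stays non-negative, parts can only be added together and never cancel, which is what makes the learned components interpretable as patches.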


Whole Words → Sub-Words

Parsing words using NMF-based sub-word structure

Van Segbroeck, M., & Van hamme, H. (2009). Unsupervised learning of time-frequency patches as a noise-robust representation of speech. Speech Communication, 51(11), 1124-1138.


Towards a General Principle

• It is not enough simply to ‘decompose’ speech into a hierarchy of seemingly arbitrary units

• There needs to be an underlying driving principle for the existence (and hence learning) of such structure

• One candidate is ‘the particulate principle of self-diversifying systems’ (Abler, 1989)


Self-Diversifying Systems

[Figure: two panels, each showing two constituents combining (+ →) into a product: ‘Blending’ constituents and ‘Particulate’ constituents.]

Abler, W. L. (1989). On the particulate principle of self-diversifying systems. Journal of Social and Biological Structures, 12, 1-13.


Self-Diversifying Systems

• Examples …
– chemical interaction
– biological inheritance
– human language

• All such systems “make infinite use of finite means” (Humboldt, 1836)

• Properties
– multidimensional
– hierarchical
– periodic

Abler, W. L. (1989). On the particulate principle of self-diversifying systems. Journal of Social and Biological Structures, 12, 1-13.
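
The contrast between blending and particulate combination can be illustrated with a small counting sketch. Here blending is modelled as averaging, and particulate combination as sequencing discrete units; both modelling choices, the three-item inventory, and the function names are assumptions made for illustration and do not come from Abler's paper.

```python
from itertools import product

def blend(a, b):
    """Blending: the result is an average, so the constituents lose their identity."""
    return (a + b) / 2

def distinct_forms(particles, length):
    """Particulate: every sequence of the given length built from intact particles."""
    return {"".join(seq) for seq in product(particles, repeat=length)}

# Blending two values can only yield something in between them.
blended = blend(0.0, 1.0)

# A hypothetical inventory of three particles.
particles = ("b", "a", "t")

# The number of distinct forms grows as 3 ** length.
counts = [len(distinct_forms(particles, n)) for n in range(1, 5)]
```

A fixed inventory of particles therefore yields an unbounded set of forms as the permitted length grows, which is one way to read "infinite use of finite means".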


The Particulate Structure of Speech

• Grounded in …
– sensorimotor channels
– drives and intentions

• Structure is constructed …
– phylogenetically
– ontogenetically

• Emergent structures …
– pragmatic
– semantic
– syntactic
– lexico-morphemic
– phonological
– articulatory

Abler, W. L. (1989). On the particulate principle of self-diversifying systems. Journal of Social and Biological Structures, 12, 1-13.


The Particulate Structure of Speech

• Abler noted that the physical basis of human speech is fundamentally different from that of biological inheritance or chemical systems

• Consecutive speech gestures and their consequent acoustic signals exhibit blending

• This increases the length of time during which information concerning any one speech sound is present in the speech signal, thus giving the speech signal resistance to interference

• However, if blending ran to completion, it would obliterate most of the communicative power

• Abler concluded that the psychophysical thresholds of human speech perception superimpose a particulate structure over a blending structure

Abler, W. L. (1989). On the particulate principle of self-diversifying systems. Journal of Social and Biological Structures, 12, 1-13.


A Phylogenetic Perspective

• Spoken language also appears to differ from other particulate systems in that it is driven by ‘contrast’

• This is because it is a behaviour exhibited by living organisms, and it has evolved as a consequence of managing ‘energetics’

• In fact, the structure of all particulate systems is the result of constraints/attractors in …
– energy
– entropy
– time

• Living systems have solved the ‘persistence’ problem by actively managing these dimensions

Moore, R. K. (2007). Spoken language processing: piecing together the puzzle. Speech Communication, 49, 418-435.


A Phylogenetic Perspective

• Dependencies exist between many living organisms, and some actively manage such dependencies

• Managing inter-organism dependencies represents a ‘communication’ system

• Many communication systems have evolved which exploit …
– information transfer
– manipulation

• Human speech has emerged as the highest information-rate system (probably because of the high DoF of the vocal articulators)


A Phylogenetic Perspective

“The evolution of the active management of communication under energetic, informational and temporal constraints leads to an efficient contrastive particulate system with a structure and complexity that is a direct consequence of the degrees-of-freedom of the available signalling apparatus and the discriminability supported by the sensory inputs.”

Roger K. Moore, Feb 2011


An Ontogenetic Perspective

• So, what are the implications for an organism/system that has to acquire communicative skills?

• Is the particulate structure …
– pre-programmed?
– inferred/acquired from the signal?
– an emergent consequence?

• “Ontogeny recapitulates phylogeny” (Haeckel, 1866)

• Learning proceeds through a process of differentiation and factorisation, rather than clustering and segmentation (Karmiloff-Smith, 1992; Hendriks-Jansen, 1996)


The Role of Production

• The child is an active participant, not a passive observer

• Meaning is grounded in doing (Rizzolatti & Arbib, 1998)

• Speech understanding (and hence speech recognition) arises from inferring ‘communicative intentions’

• I.e. it is an ‘inverse’ problem (that can be solved computationally using ‘analysis-by-synthesis’)

• This is equivalent to invoking generative processes in perceptual interpretation (by recruiting information from the actual motor system)

• Production and perception develop hand-in-hand
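
The ‘inverse problem’ framing can be sketched as a search over candidate intentions: each candidate is passed through a forward (generative) model, and the candidate whose predicted signal best matches the observation is taken as the interpretation. The intentions, the template vectors, and the squared-error measure below are all hypothetical stand-ins for a real production model.

```python
def synthesise(intention):
    """Hypothetical forward model: map an intention to a predicted signal."""
    templates = {
        "greet": [0.1, 0.8, 0.3],
        "request": [0.7, 0.2, 0.9],
        "refuse": [0.9, 0.9, 0.1],
    }
    return templates[intention]

def mismatch(predicted, observed):
    """Squared error between the synthesised and the observed signal."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))

def analysis_by_synthesis(observed, intentions):
    """Return the intention whose synthesis lies closest to the observation."""
    return min(intentions, key=lambda i: mismatch(synthesise(i), observed))

# A made-up observation that lies near the "request" template.
observed = [0.65, 0.25, 0.85]
inferred = analysis_by_synthesis(observed, ["greet", "request", "refuse"])
```

The point of the design is that recognition reuses the production model rather than a separate perceptual model, so improvements in production carry over to perception.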


Relevant Research at USFD

• Incremental learning of particulate phonological structure
– acquisition of phonemic contrast in word pairs

• Speech energetics
– biomimetic/animatronic model of the human tongue and vocal tract (AnTon)

Hofe, R., & Moore, R. K. (2008). Towards an investigation of speech energetics using 'AnTon': an animatronic model of a human tongue and vocal tract. Connection Science, 20(4), 319–336.

Aimetti, G., Moore, R. K., & ten Bosch, L. (2010). Discovering an optimal set of minimally contrasting acoustic speech units: a point of focus for whole-word pattern matching, INTERSPEECH. Makuhari, Japan.


Relevant Research at USFD

• Vocabulary growth
– no evidence for the ‘vocabulary spurt’

[Figure: number of words (0 to 40,000) against age (1 to 16 years).]

Moore, R. K., & ten Bosch, L. (2009). Modelling vocabulary growth from birth to young adulthood, INTERSPEECH. Brighton, UK.

• PRESENCE
– predictive sensorimotor control and emulation

[Figure: PRESENCE architecture diagram. Node labels: S:i, S:m, S:n, S:E(U:i), S:E(U:m), S:E(U:n), S:E(U:E(S:i)), S:E(U:E(S:m)). Component labels: needs, motivation, feeling, sensitivity, attention, interpretation, intention, action. External influences: noise, distortion, reaction, disturbance.]

Moore, R. K. (2007). PRESENCE: A human-inspired architecture for speech-based human-machine interaction. IEEE Trans. Computers, 56(9), 1176-1188.


Relevant Research at USFD

vocal interactivity in and between humans, animals and robots


Summary

• Human versus machine speech recognition

• Developmentally-inspired ASR

• Research conducted in the EU-FP6 ACORNS FET project

• The particulate structure of speech

• Phylogenetic and ontogenetic perspectives

• The role of the production system

• Relevant research at USFD


Thanks to … The ACORNS team


Thank You

http://www.dcs.shef.ac.uk/~roger