
Singing Voice Synthesis: History, Current Work, and Future Directions
Author: Perry R. Cook
Source: Computer Music Journal, Vol. 20, No. 3 (Autumn 1996), pp. 38-46
Published by: The MIT Press
Stable URL: http://www.jstor.org/stable/3680822


Perry R. Cook
Department of Computer Science and Department of Music
Princeton University
Princeton, New Jersey, USA
[email protected]

This article will briefly review the history of singing voice synthesis, and will highlight some currently active projects in this area. It will survey and discuss the benefits and trade-offs of using different techniques and models. Performance control, some attractions of composing with vocal models, and exciting directions for future research will be highlighted.

Basic Vocal Acoustics

The voice can be characterized as consisting of one or more sources, such as the oscillating vocal folds or turbulence noise, and a system of filters whose properties are controlled by the shape of the vocal tract. By moving various articulators, we change the ways the sources and filters behave. The spectrum of the voice is characterized by resonant peaks called formants. The locations and shapes of these resonances are strong perceptual cues that humans use to differentiate and identify vowels and consonants. For a system to generate speech-like sounds, it should allow for manipulation of the resonant peaks of the spectrum, and also for manipulation of source parameters (voice pitch, noise level, etc.) independent of the resonances of the vocal tract. Voice pitch is commonly denoted as f0, and the formant frequencies are commonly denoted as f1, f2, f3, etc. Figure 1 shows a vocal tract cross-section forming the vowel /i/ (as in "beet"), where the quasi-periodic oscillations of the vocal folds are shaped by the resonant filter of the vocal tract tube. The spectrum of the vowel shows the harmonics of the voice source outlining the peaks and valleys of the vocal tract response. Figure 2 shows the vocal tract cross-section for forming the consonant


/ʃ/ ("shh"), where the "source" is not the vocal folds, but turbulence noise formed by forcing air through a constriction. Also shown is the noise-like spectrum of the consonant, showing two principal formant peaks corresponding to the resonances of the vocal tract upstream from the noise source.
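To make the source-filter description concrete, here is a minimal sketch (not from the article; all frequencies, bandwidths, and the 16-kHz sample rate are illustrative assumptions) that passes a quasi-periodic pulse train, and separately a noise source, through a cascade of second-order resonators standing in for formants:

```python
import numpy as np
from scipy.signal import lfilter

SR = 16000  # sample rate in Hz (illustrative)

def resonator(f_center, bandwidth):
    """Second-order all-pole resonator standing in for one formant."""
    r = np.exp(-np.pi * bandwidth / SR)        # pole radius sets bandwidth
    theta = 2 * np.pi * f_center / SR          # pole angle sets center frequency
    return [1.0], [1.0, -2 * r * np.cos(theta), r * r]

def apply_formants(source, formants):
    y = source
    for fc, bw in formants:                    # cascade one resonator per formant
        b, a = resonator(fc, bw)
        y = lfilter(b, a, y)
    return y

n = SR  # one second of audio
# Voiced vowel: quasi-periodic pulse train at f0 = 110 Hz, shaped by f1 and f2
f0 = 110.0
pulses = np.zeros(n)
pulses[::int(SR / f0)] = 1.0
vowel = apply_formants(pulses, [(270, 60), (2300, 100)])     # rough /i/ formants

# Unvoiced consonant: turbulence noise shaped by two upstream resonances
noise = np.random.randn(n)
consonant = apply_formants(noise, [(2000, 300), (3500, 400)])
```

Because the source (pulse rate or noise) and the filter (formant positions) are separate parameters, pitch and vowel quality can be varied independently, which is exactly the property called for above.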

A Brief History of Digital Singing (Speech) Synthesis

The earliest computer music project at Bell Labs in the late 1950s yielded a number of speech synthesis systems capable of singing, one being the acoustic tube model of Kelly and Lochbaum (1962). This model was actually an early physical model. At that time it was considered too computationally expensive for commercialization as a speech synthesizer, and too expensive to be practical for musical composition. Max Mathews worked with Kelly and Lochbaum to generate some early examples of singing synthesis (Computer Music Journal 1995; Wergo 1995).

Other techniques to arise from the early legacy of speech signal processing include the channel vocoder (VOice CODER) (Dudley 1939) and linear predictive coding (LPC) (Atal 1970; Makhoul 1975). In the vocoder, the spectrum is broken into sections called sub-bands, and the information in each sub-band is analyzed, then parameters are stored or transmitted for reconstruction at another time or site. The parametric data representing the information in each sub-band can be manipulated, yielding transformations such as pitch or time shifting, or spectral shaping. The vocoder does not strictly assume that the signal is speech, and thus generalizes to other sounds. The phase vocoder, implemented using the discrete Fourier transform, has found extensive use in computer music (Moorer 1978; Dolson 1986).


Figure 1. Vocal tract shape and spectrum of vowel /i/ (as in "beet"), showing formants and harmonics of periodic voice source.

Figure 2. Vocal tract shape (left) and spectrum (right) of consonant /ʃ/ ("shh"), showing a noisy spectrum with two formants.

The introduction of linear predictive coding (Atal 1970) revolutionized speech technology, and had a great impact on musical composition as well (Moorer 1979; Steiglitz and Lansky 1981; Lansky 1989). With LPC, a time-varying filter is automatically designed that predicts the next value of the signal, based on past samples. An error signal is produced which, if fed back through the time-varying filter, will yield exactly the original signal. The filter models linear correlations in the signal, which correspond to spectral features such as formants. The error signal models the input to the formant filter, and typically is periodic and impulsive for voiced speech, and noise-like for unvoiced speech. The success of LPC in speech coding is largely due to the similarity between the source/filter decomposition yielded by the mathematics of linear prediction, and the source/filter model of the human vocal tract. The power of LPC as a speech compression technique (Spanias 1994) stems from its ability to parametrically code and compress the source and filter parameters. The effectiveness of LPC as a compositional tool emerges from its ability to modify the parameters before resynthesis. There are weaknesses, however, in LPC, which are related to the assumption of linearity inherent in the filter model. Also, all spectral properties are modeled in the filter. In actuality the voice has multiple possible sources of non-linear behavior, including source-tract coupling, non-linear wall vibration losses, and aerodynamic effects. Due to these deviations from the ideal source-filter model, the result of analysis/modification/resynthesis using LPC or a sub-band vocoder often sounds "buzzy."


Cross-Synthesis and Other Compositional Attractions of Vocal Models

The compositional interest in vocal analysis/synthesis has at least three foundations. The first is rooted in the human as a linguistic organism, for it seems in the nature of humans to find interest in voice-like sounds. Any technique or device that allows independent control over pitch and spectral peaks tends to produce sounds that are vocal in nature, and such sounds catch the interest of humans. The second compositional interest in using systems that decompose sounds in a source/filter paradigm is to allow for cross-synthesis. Cross-synthesis involves the analysis of two instruments, typically a voice and a non-voice instrument, with the parameters exchanged and modified on resynthesis. This allows the resonances of the voice to be imposed on the source of a non-voice instrument. The third interest comes from the fact that once pitch and resonance structure are analyzed as they evolve in time, these three dimensions are independently available to some extent for manipulation on resynthesis. The elusive goals of being able to stretch time without changing pitch, to change pitch without changing timbral quality, etc., are all of high interest to computer music composers.
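Cross-synthesis as just described can be sketched using the `lpc` helper from the LPC example above: whiten the instrument with its own inverse filter, then re-color the result with the voice's formant filter (frame size and order are arbitrary placeholder choices):

```python
import numpy as np
from scipy.signal import lfilter

FRAME = 512  # illustrative frame length in samples

def cross_synthesize(voice, instrument, order=12):
    """Impose the voice's resonances on the instrument's source."""
    n = min(len(voice), len(instrument))
    out = np.zeros(n)
    for start in range(0, n - FRAME + 1, FRAME):
        v = voice[start:start + FRAME]
        s = instrument[start:start + FRAME]
        a_voice = lpc(v, order)                    # voice formant filter
        a_inst = lpc(s, order)                     # instrument's own envelope
        excitation = lfilter(a_inst, [1.0], s)     # whiten the instrument
        out[start:start + FRAME] = lfilter([1.0], a_voice, excitation)
    return out
```

A real implementation would overlap-add windowed frames and carry filter state across frame boundaries; this is only the skeleton of the parameter exchange.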

Other Popular Synthesis Techniques

Frequency modulation (FM) proved successful for singing synthesis (Chowning 1981, 1989) as well as the synthesis of other sounds. As described in communications literature, FM involves the modulation of the frequency of one oscillator with the output of another to create a spread spectrum consisting of side-bands surrounding the original carrier (oscillator that is modulated) frequency. In FM sound synthesis, both the carrier and modulator oscillators typically store a sinusoidal waveform, and operate in the audio band. By controlling the amount of modulation, and using multiple carrier/modulator pairs, spectra of somewhat arbitrary shape can be constructed. This technique proved efficient yet sufficiently flexible for music composition, and became the basis for the most successful commercial music synthesizers in history. In vocal modeling, carriers placed near formant locations in the spectrum are modulated by a common modulator oscillator operating at the voice fundamental frequency.
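A sketch of the vocal FM configuration just described, with one carrier per formant and a common modulator at the fundamental (this follows the general scheme, not Chowning's actual patch; the formant values and modulation index are illustrative):

```python
import numpy as np

SR = 16000

def fm_vowel(f0, formants, amps, index=1.5, dur=1.0):
    """One carrier per formant, all phase-modulated by a common
    modulator at f0; each carrier sits on the harmonic of f0 nearest
    its formant center so the result stays harmonic."""
    t = np.arange(int(SR * dur)) / SR
    mod = np.sin(2 * np.pi * f0 * t)               # common modulator
    out = np.zeros_like(t)
    for fc, amp in zip(formants, amps):
        carrier = round(fc / f0) * f0              # nearest harmonic to formant
        out += amp * np.sin(2 * np.pi * carrier * t + index * mod)
    return out / max(np.abs(out).max(), 1e-9)

# rough /a/-like tone; formant frequencies are placeholder values
tone = fm_vowel(f0=110.0, formants=[700, 1200, 2500], amps=[1.0, 0.5, 0.25])
```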

Sinusoidal speech modeling (McAulay and Quatieri 1986) has been improved and applied to music synthesis by Julius Smith and Xavier Serra (Smith and Serra 1987; Serra and Smith 1990), Xavier Rodet and Philippe Depalle (1992), and others. These techniques use Fourier analysis to locate and track individual sinusoidal partials. Individual trajectories (tracks) of sinusoidal amplitude, frequency, and phase as a function of time are extracted from the time-varying peaks in a series of short-time Fourier transforms. To help define tracks, heuristics regarding physical systems and the voice in particular are used, such as the fact that a sinusoid should not appear, disappear, or change frequency or phase instantaneously. The sinusoids can be resynthesized from the track parameters, after modification or coding, by additive synthesis. Noise can be treated as rapidly varying sinusoids, or explicitly as a non-sinusoidal component.
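The core of such a tracker can be sketched in a few lines: pick spectral peaks in each short-time frame, then extend a track only when a peak lies close in frequency to the track's last value, encoding the "no instantaneous change" heuristic. Everything here (thresholds, window sizes, the greedy matching) is a deliberately crude placeholder for the published algorithms:

```python
import numpy as np
from scipy.signal import stft, find_peaks

def track_partials(x, sr, max_jump=50.0, n_fft=2048, hop=256):
    """Greedy sinusoidal peak tracking over a series of STFT frames."""
    freqs, _, X = stft(x, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    tracks = []  # each track is a list of (frame_index, frequency, amplitude)
    for j in range(X.shape[1]):
        mag = np.abs(X[:, j])
        peaks, _ = find_peaks(mag, height=mag.max() * 0.01)
        for p in peaks:
            f, a = freqs[p], mag[p]
            for tr in tracks:                      # try to continue a track
                last_j, last_f, _ = tr[-1]
                if last_j == j - 1 and abs(last_f - f) < max_jump:
                    tr.append((j, f, a))
                    break
            else:
                tracks.append([(j, f, a)])         # otherwise a track is born
    return tracks
```

Resynthesis is then additive: each track drives one oscillator whose amplitude and frequency are interpolated between frames.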

Formant wave functions (FOFs in French) were pioneered by Xavier Rodet (1984) at the Institut de Recherche et Coordination Acoustique/Musique (IRCAM). An FOF is a time-domain waveform model of the impulse response of individual formants, characterized as a sinusoid at the formant center frequency with an amplitude that rises rapidly upon excitation and decays exponentially. By describing a spectral region as a windowed sinusoidal oscillation in the time domain, an FOF can be viewed as a special type of wavelet. The control parameters define the center frequency and bandwidth of the formant being modeled, and the rate at which the FOFs are generated and added determines the base frequency of the voice. The synthesis system for using FOFs was dubbed CHANT, and found application in general synthesis (Rodet, Potard, and Barriere 1984). Gerald Bennett and Xavier Rodet used CHANT to produce a number of impressive singing examples and compositions (Bennett and Rodet 1989).
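The FOF idea translates almost directly into code: one grain is a sinusoid at the formant center with a fast raised-cosine attack and an exponential decay, and grains are overlap-added once per fundamental period. Grain length, rise time, and bandwidth below are illustrative values, not CHANT's:

```python
import numpy as np

SR = 16000

def fof_grain(fc, bw, rise=0.003, dur=0.02):
    """One formant wave function: the exponential decay sets the formant
    bandwidth, the short raised-cosine attack smooths the onset."""
    t = np.arange(int(SR * dur)) / SR
    env = np.exp(-np.pi * bw * t)
    attack = t < rise
    env[attack] *= 0.5 * (1 - np.cos(np.pi * t[attack] / rise))
    return env * np.sin(2 * np.pi * fc * t)

def fof_voice(f0, formants, amps, dur=1.0):
    """Overlap-add one grain per formant every fundamental period;
    the grain rate determines the perceived pitch."""
    out = np.zeros(int(SR * dur))
    period = int(SR / f0)
    for start in range(0, len(out), period):
        for fc, amp in zip(formants, amps):
            g = amp * fof_grain(fc, bw=80.0)
            end = min(start + len(g), len(out))
            out[start:end] += g[:end - start]
    return out
```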

Formant synthesizers, in which individual formants


are modeled by second-order resonant filters, have been investigated by many speech researchers (Rabiner 1968; Klatt 1980). An attractive feature of formant synthesizers is that Fourier or LPC analysis can be used to automatically extract formant frequencies and source parameters from recorded speech. Charles Dodge used such techniques in a composition in 1973 (Dodge 1989). The group that has accomplished the most in the domain of singing synthesis using formant models is the Speech Transmission Laboratory (STL) of the Royal Institute of Technology (KTH), Stockholm. The STL MUSSE DIG (MUsic and Singing Synthesis Equipment, DIGital version) synthesizer (Carlson and Neovius 1990) has been used in singing synthesis (Zera, Gauffin, and Sundberg 1984), for studying performance synthesis-by-rule (Sundberg 1989), and has been adapted for real-time control in performance (Carlson et al. 1991). The KTH has conducted and published extensively on speech, and has arguably produced the largest body of research on singing (Sundberg 1987) and music, both acoustics and performance. Robert C. Maher (1995) recently demonstrated singing synthesis using modified forms of the second-order resonant filter which lend themselves to parallel implementation.
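A sketch of the parallel arrangement such results point at: each formant is a second-order bandpass section fed by the same source, and the branch outputs are summed, so per-formant amplitudes are directly controllable. The coefficient formulas and all parameter values are illustrative textbook choices, not Maher's filters:

```python
import numpy as np
from scipy.signal import lfilter

SR = 16000

def formant_section(fc, bw, gain):
    """Two-pole bandpass section for one formant; the (1 - r*r) factor
    roughly normalizes the peak gain."""
    r = np.exp(-np.pi * bw / SR)
    a = [1.0, -2 * r * np.cos(2 * np.pi * fc / SR), r * r]
    b = [gain * (1 - r * r), 0.0, 0.0]
    return b, a

def parallel_formant_synth(source, formants):
    """Each section filters the raw source; outputs are summed."""
    return sum(lfilter(*formant_section(fc, bw, g), source)
               for fc, bw, g in formants)

src = np.zeros(SR)
src[::SR // 110] = 1.0                             # 110-Hz pulse train source
y = parallel_formant_synth(src, [(600, 80, 1.0), (1200, 90, 0.6), (2400, 120, 0.3)])
```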

Acoustic Tube Models of the Vocal Tract

Acoustic tube models solve the wave equation, usually in one dimension, inside a smoothly varying tube. The one-dimensional approximation is justified by noting that the length of the vocal tract is significantly larger than any width dimension, and thus the longitudinal modes dominate the resonance structure up to about 4,000 Hz. Modal standing waves in an acoustic tube correspond to the formants. The basic Kelly and Lochbaum model (Kelly and Lochbaum 1962) critically samples space and time by approximating the smooth vocal tract tube with cylindrical segments equal in length to the distance traveled by a sound wave in one time sample.
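A compact sketch of such a critically sampled tube follows. The sign conventions, the lossy end reflections of 0.9, and the use of pressure rather than velocity waves are all assumptions of this sketch, not a transcription of any of the systems below:

```python
import numpy as np

def kl_vocal_tract(areas, excitation):
    """Kelly-Lochbaum line: one cylindrical section per sample of wave
    travel; the reflection coefficient at each junction comes from the
    adjacent cross-sectional areas: k = (A[i] - A[i+1]) / (A[i] + A[i+1])."""
    areas = np.asarray(areas, dtype=float)
    k = (areas[:-1] - areas[1:]) / (areas[:-1] + areas[1:])
    n = len(areas)
    right = np.zeros(n)                      # right-going traveling waves
    left = np.zeros(n)                       # left-going traveling waves
    out = np.zeros(len(excitation))
    for t, glottal in enumerate(excitation):
        new_r, new_l = np.zeros(n), np.zeros(n)
        for i in range(n - 1):               # scattering at each junction
            new_r[i + 1] = (1 + k[i]) * right[i] - k[i] * left[i + 1]
            new_l[i] = k[i] * right[i] + (1 - k[i]) * left[i + 1]
        new_r[0] = glottal + 0.9 * left[0]   # nearly closed glottis end
        new_l[-1] = -0.9 * right[-1]         # open (sign-inverting) lip end
        out[t] = right[-1]                   # take output at the lips
        right, left = new_r, new_l
    return out
```

Changing the area profile `areas` moves the standing-wave resonances, i.e., the formants.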

The SPASM and Singer systems (Cook 1992) are based on a physical model of the vocal tract filter, developed using the waveguide formulation (Smith 1987). This model is a direct descendent of the Kelly and Lochbaum model, but with many enhancements, such as a nasal tract, modeling of radiation through the throat wall, various steady and pulsed noise sources (Chafe 1990), and real-time controls. Shinji Maeda's (1982) model numerically integrates the wave equation using the rectangular method in space, and the trapezoidal rule in time. Wall losses are also modeled, and an articulatory layer of control modifies the basic tube shape from higher-order descriptions like tongue and jaw position. René Carré's (1992) model is based on distinctive regions (DR) arising from sensitivity analysis, noting that movements in particular regions of the vocal tract affect formant frequencies more than movements in others. Hill, Manzara, and Taube-Schock (1995) have implemented a synthesis-by-rule system using a model based on distinctive regions, with libraries and examples that include singing synthesis. Liljencrants (1985) investigated an undersampled acoustic tube model and derived rules for modifying the shape without adding unnaturally to the energy contained within the vocal tract. The computer music research group in Helsinki (Välimäki and Karjalainen 1994) has used fractional sample interpolation and truncated conical tube segments to derive an improved version of the Kelly and Lochbaum model.

Other Active Singing Synthesis Projects

Pabon (1993) has constructed a singing synthesizer, with real-time formant control via spectrogram-like displays called phonetograms, and source waveform synthesis using FOF-like controls. Titze and Story (1993) have produced a super-computer tenor called "Pavarobotti" that sings duets with Titze, and is used for studying many aspects of the voice, including advanced physical models of normal and pathological vocal folds. Howard and Rossiter (Howard and Rossiter 1993; Rossiter and Howard 1994) have studied source parameters for more natural singing synthesis, as well as interactive singing analysis software for pedagogical applications.


Spectral Models vs. Physical Models

Synthesis models can be loosely broken into two groups: spectral models, which can be viewed as based on perceptual mechanisms, and physical models, which can be viewed as based on production mechanisms. Of the models and techniques discussed above, the spectrally based models include FM, FOFs, vocoders, and sinusoidal models. Acoustic tube models are physically based, while formant synthesizers are spectral models, but could be classified as pseudo-physical because of the source/filter decomposition. It's possible to interpret LPC three ways: as a least-squares linear prediction in the time domain, as a least-squares matching process on the spectrum, and as a source-filter decomposition. Therefore, LPC is both spectral and pseudo-physical, but not strictly a physical model, because wave variables are not propagated directly, and no articulation parameters go into the basic model. Since LPC can be mapped to a filter related to the acoustic tube model (Markel and Gray 1976), it may be brought into the physical camp.
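That mapping is concrete enough to sketch: the step-down (reverse Levinson) recursion converts the LPC polynomial into reflection coefficients, which relate to area ratios of a lossless tube. Sign and ordering conventions differ across texts; the ones below are one common choice, and `lip_area` is an arbitrary normalization:

```python
import numpy as np

def lpc_to_reflection(a):
    """Step-down recursion: LPC polynomial [1, a1, ..., ap] -> k1..kp."""
    cur = np.asarray(a, dtype=float).copy()
    ks = []
    for p in range(len(cur) - 1, 0, -1):
        k = cur[p]
        if abs(k) >= 1.0:
            raise ValueError("unstable LPC filter")
        ks.append(k)
        cur = (cur[:p] - k * cur[p:0:-1]) / (1 - k * k)  # peel off stage p
    return ks[::-1]

def reflection_to_areas(ks, lip_area=1.0):
    """Tube section areas from k = (A[i+1] - A[i]) / (A[i+1] + A[i]),
    working inward from the lips."""
    areas = [lip_area]
    for k in reversed(ks):
        areas.append(areas[-1] * (1 - k) / (1 + k))
    return areas[::-1]
```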

Both physical and spectral models have merit, and one or another might be more suitable given a specific goal and set of computational resources. The main attraction of physical models is that most of the control parameters are those that a human uses to control his/her own vocal system. As such, some intuition can be brought into the design and composition processes. Another motivation is that time-varying model parameters can be generated by the model itself, if the model is constructed so that it sufficiently matches the physical system. Disadvantages of physical models are that the number of control parameters can be large, and while some parameters might have intuitive significance for humans (jaw drop), others might not (specific muscles controlling the vocal folds). Further, parameters often interact in non-obvious ways. In general there exist no exact methods for analysis/resynthesis using physical models. Parameter estimation techniques have been investigated, but for physical models of reasonable complexity, especially those involving any non-linear component, identity analysis/resynthesis is a practical and often theoretical impossibility (Cook 1991b; Scavone and Cook 1994).

Model Extensions and Future Work

Work remains to be done in refining techniques for spectral analysis and synthesis of the voice. For example, a spectral envelope estimation technique like that of Galas and Rodet (1990) allows more accurate formant tracking on even high female tones, which because of the large inter-harmonic spacing have proven difficult for analysis systems in the past.

There are far more directions for research to proceed in improving physical models and source models for pseudo-physical models of the voice. Most of them involve some significant component of non-linearity, and/or higher dimensional models. The main research areas involve modeling of airflow in the vocal tract, development of more exact models of the inner shape of the vocal tract tube, physical models of the tongue and other articulators, more accurate models of the vocal folds, and facial animation coupled to voice synthesis.

The modeling of flow is a difficult but important task, and until recently it has been confined to theoretical explorations, occasionally verified experimentally with hot-wire anemometry or other flow measurement techniques (Teager 1980). Mico Hirschberg has begun to make advances in actually photographing flow in constructed models of musical instruments and the vocal tract (Pelorson et al. 1994). These techniques, combined with classical and new theories, should yield greater understanding about air flow and how it affects vocal acoustics.

Along with more exact solutions to the flow-physics problems, efficient means for calculating the flow simulations must also emerge, allowing the inclusion of these non-linear effects in practical synthesis models (Chafe 1995; Verge 1995).

Constructing a physical model that includes more detailed simulations of the dynamics of the tongue and articulators would allow the model to calculate the time-varying parameters, rather than


having the shape, etc. explicitly specified or calculated. Wilhelms-Tricarico (1995) has developed a set of models of soft tissue, and has used these to construct a tongue model. Such models can be calibrated from the results of articulation studies using X-ray pellets, magnetic resonance imaging, and other techniques. All of this can combine to yield models that "behave" correctly in a dynamical sense, and give a better picture of the fine structure of the space inside the vocal tract. This latter information is critical if flow simulations are to be accurate.

Vocal fold models continue to be the target of much research, and, like the case of airflow, theories are difficult to conclusively prove or disprove. More elaborate models of the vocal fold tissue are being developed (Story and Titze 1995), and theoretical and experimental studies revisiting and comparing the classic models are being conducted (Rodet 1995).

Facial animation coupled with speech synthesis is important for a number of reasons. One reason is for pedagogy, where speech synthesizers with animated displays could be used as teaching and rehabilitation tools. Another important reason involves speech perception in general, because humans use a significant amount of lip reading in understanding speech. Work has been done by Massaro (1987) and Hill, Pearce, and Wyvill (1988), employing facial animation to study coupling of visual and auditory information in human speech understanding (McGurk and MacDonald 1976). Musically, we know that the face of the singer can carry even more information about the meaning of music than the actual text being sung (Scotto Di Carlo and Guaitella 1995), further motivating the combination of facial animation with singing synthesis.

Modeling Performance

One of the distinguishing features of the voice is the continuous nature of pitch control, both intentional and uncontrolled. Research in random and periodic pitch deviations (Sundberg 1987; Chowning 1989; Ternstrom and Friberg 1989; Prame 1994; Cook 1995), and the synthesis and perception of short vibrato tones (d'Allessandro and Castellengo 1993), has provided data and models for more natural sounding voice synthesis.
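A small sketch of a pitch contour that mixes periodic vibrato with a slow random drift, in the spirit of the deviation studies cited above (the 5.5-Hz rate, the extents in cents, and the one-pole smoother are illustrative numbers, not measured data):

```python
import numpy as np

SR = 16000

def pitch_contour(f0, dur, vib_rate=5.5, vib_cents=30.0, drift_cents=6.0, seed=0):
    """f0 contour = sinusoidal vibrato + low-pass-filtered random drift."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(SR * dur)) / SR
    vibrato = vib_cents * np.sin(2 * np.pi * vib_rate * t)
    drift = np.zeros_like(t)
    state = 0.0
    for i in range(len(t)):                # one-pole smoother on white noise
        state += 0.001 * (rng.standard_normal() - state)
        drift[i] = state
    drift *= drift_cents / max(np.abs(drift).max(), 1e-9)
    return f0 * 2.0 ** ((vibrato + drift) / 1200.0)   # cents -> Hz

freqs = pitch_contour(220.0, dur=2.0)
tone = np.sin(2 * np.pi * np.cumsum(freqs) / SR)      # integrate f0 to phase
```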

On the macro scale, rule systems for vocal performance and phrasing (Berndtsson 1995), and composition (Rodet and Cointe 1984; Barriere, Iovino, and Laurson 1991) have been constructed. The Stockholm KTH rule system is available on the compact disc Information Technology and Music (KTH 1994). These important areas of research shall remain a topic for a future survey paper.

Extended Singing and Language Systems

Investigations into singing styles, traditions, and acoustics other than Western traditional Bel Canto include studies of overtone singing (Bloothooft et al. 1992), traditional Scandinavian shepherd singing (Johnson, Sundberg, and Willbrand 1983), a highly structured system of funeral laments (Ross and Lehiste 1993), and even castrati singing (Depalle, Garcia, and Rodet 1994). Language systems for the SPASM/Singer instruments include an Ecclesiastical Latin system called LECTOR (Cook 1991a), and a system for modern Greek called IGDIS (Cook et al. 1993). The IGDIS system includes support for arbitrary tuning systems, and common vocal ornaments can be called up by name, allowing traditional folk songs and Byzantine chants to be synthesized quickly.

Real-Time Voice Processing and Interactive Karaoke

Recently, commercial products have been introduced that allow for real-time "smart harmonies" to be added to a vocal signal, or implement real-time score following with accompaniment. Vocoders and LPC, by virtue of being analysis/synthesis systems, allow potential for real-time modification of voice signals under the control of rules or real-time computer processes. We will soon see systems that integrate pitch detection, score following, and


sophisticated voice processing algorithms into a new generation of interactive karaoke systems. This will remain a topic for a future review paper.
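One building block of such systems, a plain autocorrelation pitch detector, can be sketched as follows (the search range and the bare-bones voicing test are illustrative; a real-time tracker would add smoothing and better voicing decisions):

```python
import numpy as np

def detect_pitch(frame, sr, fmin=80.0, fmax=800.0):
    """Return the frequency of the strongest autocorrelation peak in
    the allowed lag range, or 0.0 for (crudely judged) unvoiced frames."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), min(int(sr / fmin), len(ac) - 1)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag if ac[lag] > 0 else 0.0
```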

    References

Atal, B. 1970. "Speech Analysis and Synthesis by Linear Prediction of the Speech Wave." Journal of the Acoustical Society of America 47:65(A).
Barriere, J. B., F. Iovino, and M. Laurson. 1991. "A New CHANT Synthesizer in C and its Control Environment in Patchwork." In Proceedings of the 1991 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 11-14.
Bennett, G., and X. Rodet. 1989. "Synthesis of the Singing Voice." In Mathews, M., and J. Pierce, eds., Current Directions in Computer Music Research. Cambridge, Massachusetts: The MIT Press, pp. 19-44.
Berndtsson, G. 1995. "The KTH Rule System for Singing Synthesis." Computer Music Journal 20(1):76-91.
Bloothooft, G., et al. 1992. "Acoustics and Perception of Overtone Singing." Journal of the Acoustical Society of America 92(4):1827-1836.
Carlson, G., and L. Neovius. 1990. "Implementations of Synthesis Models for Speech and Singing." STL-Quarterly Progress and Status Report. Stockholm: KTH, pp. 2/3:63-67.
Carlson, G., et al. 1991. "A New Digital System for Singing Synthesis Allowing Expressive Control." In Proceedings of the 1991 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 315-318.
Carre, R. 1992. "Distinctive Regions in Acoustic Tubes." Journal d'Acoustique 5(141):141-159.
Chafe, C. 1990. "Pulsed Noise in Self-Sustained Oscillations of Musical Instruments." In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. New York: IEEE Press, pp. 1157-1160.
Chafe, C. 1995. "Adding Vortex Noise to Wind Instrument Physical Models." In Proceedings of the 1995 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 57-60.
Chowning, J. 1981. "Computer Synthesis of the Singing Voice." In Research Aspects on Singing. Stockholm: KTH, pp. 4-13.
Chowning, J. 1989. "Frequency Modulation Synthesis of the Singing Voice." In Mathews, M., and J. Pierce, eds., Current Directions in Computer Music Research. Cambridge, Massachusetts: The MIT Press, pp. 57-64.
Computer Music Journal. 1995. Computer Music Journal Volume 19 Compact Disc. Cambridge, Massachusetts: The MIT Press.
Cook, P. 1991a. "LECTOR: An Ecclesiastical Latin Control Language for the SPASM/Singer Instrument." In Proceedings of the 1991 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 319-321.
Cook, P. 1991b. "Non-Linear Periodic Prediction for On-Line Identification of Oscillator Characteristics in Woodwind Instruments." In Proceedings of the International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 157-160.
Cook, P. 1992. "SPASM: A Real-Time Vocal Tract Physical Model Editor/Controller and Singer: The Companion Software Synthesis System." Computer Music Journal 17(1):30-44.
Cook, P. 1995. "A Study of Pitch Deviation in Singing as a Function of Pitch and Dynamics." In Proceedings of the 13th International Congress of Phonetic Sciences. Stockholm: KTH, pp. 1:202-205.
Cook, P., et al. 1993. "IGDIS: A Modern Greek Text to Speech/Singing Program for the SPASM/Singer Instrument." In Proceedings of the International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 387-389.
d'Allessandro, C., and M. Castellengo. 1993. "The Pitch of Short-Duration Vibrato Tones: Experimental Data and Numerical Model." In Proceedings of the Stockholm Music Acoustics Conference. Stockholm: KTH, pp. 25-30.
Depalle, P., G. Garcia, and X. Rodet. 1994. "A Virtual Castrato (!?)." In Proceedings of the 1994 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 357-360.
Dodge, C. 1989. "On Speech Songs." In Mathews, M., and J. Pierce, eds., Current Directions in Computer Music Research. Cambridge, Massachusetts: The MIT Press, pp. 9-18.
Dolson, M. 1986. "The Phase Vocoder: A Tutorial." Computer Music Journal 10(4):14-27.
Dudley, H. 1939. "The Vocoder." Bell Laboratories Record, December.
Galas, T., and X. Rodet. 1990. "An Improved Cepstral Method for Deconvolution of Source-Filter Systems with Discrete Spectra: Application to Musical Sound Signals." In Proceedings of the 1990 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 82-84.

Hill, D., L. Manzara, and C. Taube-Schock. 1995. "Real-Time Articulatory Speech-Synthesis-By-Rules." AVIOS. San Jose, California.
Hill, D., A. Pearce, and B. Wyvill. 1988. "Animating Speech: An Automated Approach Using Speech Synthesized by Rules." The Visual Computer 3(5):277-289.
Howard, D., and D. Rossiter. 1993. "Real-Time Visual Displays for Use in Singing Training: An Overview." In Proceedings of the Stockholm Music Acoustics Conference. Stockholm: KTH, pp. 191-196.
Johnson, A., J. Sundberg, and H. Willbrand. 1983. "Kölning: A Study of Phonation and Articulation in a Type of Swedish Herding Song." In Proceedings of the Stockholm Music Acoustics Conference. Stockholm: KTH, pp. 187-202.
Kelly, J., and C. Lochbaum. 1962. "Speech Synthesis" (paper G42). In Proceedings of the Fourth International Congress on Acoustics, pp. 1-4.
Klatt, D. 1980. "Software for a Cascade/Parallel Formant Synthesizer." Journal of the Acoustical Society of America 67(3):971-995.
KTH. 1994. Information Technology and Music (a compact disc to celebrate the 75th anniversary of the Royal Swedish Academy of Engineering Science). Stockholm: KTH.
Lansky, P. 1989. "Compositional Applications of Linear Predictive Coding." In Mathews, M., and J. Pierce, eds., Current Directions in Computer Music Research. Cambridge, Massachusetts: The MIT Press, pp. 5-8.
Liljencrants, J. 1985. Speech Synthesis With a Reflection-Type Line Analog. DS Dissertation, Speech Communication and Music Acoustics. Stockholm: KTH.
Maeda, S. 1982. "A Digital Simulation Method of the Vocal Tract System." Speech Communication 1:199-299.
Maher, R. 1995. "Tunable Bandpass Filters in Music Synthesis" (paper 4098 L2). In Proceedings of the Audio Engineering Society Conference.
Makhoul, J. 1975. "Linear Prediction: A Tutorial Review." Proceedings of the IEEE 63:561-580.
Markel, J., and A. Gray. 1976. Linear Prediction of Speech. New York: Springer.
Massaro, D. 1987. Speech Perception by Ear and Eye. Hillsdale, New Jersey: Erlbaum Associates.
Mathews, M., and J. Pierce, eds. 1989. Current Directions in Computer Music Research. Cambridge, Massachusetts: The MIT Press.
McAulay, R., and T. Quatieri. 1986. "Speech Analysis/Synthesis Based on a Sinusoidal Representation." IEEE Transactions on Acoustics, Speech, and Signal Processing 34(4):744-754.
McGurk, H., and J. MacDonald. 1976. "Hearing Lips and Seeing Voices." Nature 264:746-748.
Moorer, A. 1978. "The Use of the Phase Vocoder in Computer Music Applications." Journal of the Audio Engineering Society 26(1/2):42-45.
Moorer, A. 1979. "The Use of Linear Prediction of Speech in Computer Music Applications." Journal of the Audio Engineering Society 27(3):134-140.
Pabon, P. 1993. "A Real-Time Singing Voice Synthesizer." In Proceedings of the Stockholm Music Acoustics Conference. Stockholm: KTH, pp. 288-293.
Pelorson, X., et al. 1994. "Theoretical and Experimental Study of Quasi-Steady Flow Separation Within the Glottis During Phonation. Applications to a Modified Two-Mass Model." Journal of the Acoustical Society of America 96(6):3416-3431.
Prame, E. 1994. "Measurements of the Vibrato Rate of Ten Singers." Journal of the Acoustical Society of America 96(4):1979-1984.
Rabiner, L. 1968. "Digital Formant Synthesizer." Journal of the Acoustical Society of America 43(4):822-828.
Rodet, X. 1984. "Time-Domain Formant-Wave-Function Synthesis." Computer Music Journal 8(3):9-14.
Rodet, X. 1995. "One and Two Mass Model Oscillations for Voice and Instruments." In Proceedings of the 1995 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 207-210.
Rodet, X., and P. Cointe. 1984. "FORMES: Composition and Scheduling of Processes." Computer Music Journal 8(3):32-50.
Rodet, X., and P. Depalle. 1992. "Spectral Envelopes and Inverse FFT Synthesis" (paper 3393 H3). In Proceedings of the Audio Engineering Society Conference. New York: AES.
Rodet, X., Y. Potard, and J. B. Barriere. 1984. "The CHANT Project: From the Synthesis of the Singing Voice to Synthesis in General." Computer Music Journal 8(3):15-31.
Ross, J., and I. Lehiste. 1993. "Estonian Laments: A Study of Their Temporal Structure." In Proceedings of the Stockholm Music Acoustics Conference. Stockholm: KTH, pp. 244-248.
Rossiter, D., and D. Howard. 1994. "Voice Source and Acoustic Output Qualities for Singing Synthesis." In Proceedings of the 1994 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 191-196.


Scavone, G., and P. Cook. 1994. "Combined Linear and Non-Linear Periodic Prediction in Calibrating Models of Musical Instruments to Recordings." In Proceedings of the 1994 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 433-434.
Scotto Di Carlo, N., and I. Guaitella. 1995. "Facial Expressions in Singing." In Proceedings of the 13th International Congress of Phonetic Sciences. Stockholm: KTH, pp. 1:226-229.
Serra, X., and J. Smith. 1990. "Spectral Modeling Synthesis: A Sound Analysis/Synthesis System Based on a Deterministic plus Stochastic Decomposition." Computer Music Journal 14(4):12-24.
Smith, J. 1987. "Musical Applications of Digital Waveguides." Technical report STAN-M-39. Stanford University Center for Computer Research in Music and Acoustics.
Smith, J., and X. Serra. 1987. "PARSHL: An Analysis/Synthesis Program for Non-Harmonic Sounds Based on a Sinusoidal Representation." In Proceedings of the 1987 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 290-297.
Spanias, A. 1994. "Speech Coding: A Tutorial Review." Proceedings of the IEEE 82(10):1541-1582.
Steiglitz, K., and P. Lansky. 1981. "Synthesis of Timbral Families by Warped Linear Prediction." Computer Music Journal 5(3):45-49.
Story, B., and I. Titze. 1995. "Voice Simulation With a Body-Cover Model of the Vocal Folds." Journal of the Acoustical Society of America 97(2):3416-3431.
Sundberg, J. 1987. The Science of the Singing Voice. DeKalb, Illinois: Northern Illinois University Press.
Sundberg, J. 1989. "Synthesis of Singing by Rule." In Mathews, M., and J. Pierce, eds., Current Directions in Computer Music Research. Cambridge, Massachusetts: The MIT Press, pp. 45-56.
Teager, H. 1980. "Some Observations on Oral Air Flow During Phonation." IEEE Transactions on Acoustics, Speech, and Signal Processing 28(5):599-601.
Ternstrom, S., and A. Friberg. 1989. "Analysis and Simulation of Small Variations in the Fundamental Frequency of Sustained Vowels." STL-Quarterly Progress and Status Report 3:1-14.
Titze, I., and B. Story. 1993. "The Iowa Singing Synthesis." In Proceedings of the Stockholm Music Acoustics Conference. Stockholm: KTH, p. 294.
Välimäki, V., and M. Karjalainen. 1994. "Improving the Kelly-Lochbaum Vocal Tract Model Using Conical Tube Sections and Fractional Delay Filtering Techniques." In Proceedings of the 1994 International Conference on Spoken Language Processing. Yokohama, Japan, pp. 18-22.
Verge, M. 1995. Aeroacoustics of Confined Jets, with Applications to the Physics of Recorder-Like Instruments. Thesis, Technical University of Eindhoven (also available from IRCAM).
Wergo. 1995. The Historical CD of Digital Sound Synthesis. WER 2033-2.
Wilhelms-Tricarico, R. 1995. "Physiological Modeling of Speech Production: Methods for Modeling Soft-Tissue Articulators." Journal of the Acoustical Society of America 97(5):3085-3098.
Zera, J., J. Gauffin, and J. Sundberg. 1984. "Synthesis of Selected VCV-Syllables in Singing." In Proceedings of the 1984 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 83-86.
