
Singing Voice Synthesis: History, Current Work, and Future Directions
Author: Perry R. Cook
Source: Computer Music Journal, Vol. 20, No. 3 (Autumn 1996), pp. 38-46
Published by: The MIT Press
Stable URL: http://www.jstor.org/stable/3680822


Perry R. Cook
Department of Computer Science and Department of Music
Princeton University
Princeton, New Jersey, USA
[email protected]

This article will briefly review the history of singing voice synthesis, and will highlight some currently active projects in this area. It will survey and discuss the benefits and trade-offs of using different techniques and models. Performance control, some attractions of composing with vocal models, and exciting directions for future research will be highlighted.

Basic Vocal Acoustics

The voice can be characterized as consisting of one or more sources, such as the oscillating vocal folds or turbulence noise, and a system of filters whose properties are controlled by the shape of the vocal tract. By moving various articulators, we change the ways the sources and filters behave. The spectrum of the voice is characterized by resonant peaks called formants. The locations and shapes of these resonances are strong perceptual cues that humans use to differentiate and identify vowels and consonants. For a system to generate speech-like sounds, it should allow for manipulation of the resonant peaks of the spectrum, and also for manipulation of source parameters (voice pitch, noise level, etc.) independent of the resonances of the vocal tract. Voice pitch is commonly denoted as f0, and the formant frequencies are commonly denoted as f1, f2, f3, etc. Figure 1 shows a vocal tract cross-section forming the vowel /i/ (as in "beet"), where the quasi-periodic oscillations of the vocal folds are shaped by the resonant filter of the vocal tract tube. The spectrum of the vowel shows the harmonics of the voice source outlining the peaks and valleys of the vocal tract response. Figure 2 shows the vocal tract cross-section for forming the consonant


/ʃ/ ("shh"), where the "source" is not the vocal folds, but turbulence noise formed by forcing air through a constriction. Also shown is the noise-like spectrum of the consonant, showing two principal formant peaks corresponding to the resonances of the vocal tract upstream from the noise source.
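To make the source-filter description concrete, here is a minimal sketch (not from the article; all frequencies, bandwidths, and the 16-kHz sample rate are illustrative assumptions) that passes a quasi-periodic pulse train, and separately a noise source, through a cascade of second-order resonators standing in for formants:

```python
import numpy as np
from scipy.signal import lfilter

SR = 16000  # sample rate in Hz (illustrative)

def resonator(f_center, bandwidth):
    """Second-order all-pole resonator standing in for one formant."""
    r = np.exp(-np.pi * bandwidth / SR)        # pole radius sets bandwidth
    theta = 2 * np.pi * f_center / SR          # pole angle sets center frequency
    return [1.0], [1.0, -2 * r * np.cos(theta), r * r]

def apply_formants(source, formants):
    y = source
    for fc, bw in formants:                    # cascade one resonator per formant
        b, a = resonator(fc, bw)
        y = lfilter(b, a, y)
    return y

n = SR  # one second of audio
# Voiced vowel: quasi-periodic pulse train at f0 = 110 Hz, shaped by f1 and f2
f0 = 110.0
pulses = np.zeros(n)
pulses[::int(SR / f0)] = 1.0
vowel = apply_formants(pulses, [(270, 60), (2300, 100)])     # rough /i/ formants

# Unvoiced consonant: turbulence noise shaped by two upstream resonances
noise = np.random.randn(n)
consonant = apply_formants(noise, [(2000, 300), (3500, 400)])
```

Because the source (pulse rate or noise) and the filter (formant positions) are separate parameters, pitch and vowel quality can be varied independently, which is exactly the property called for above.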

A Brief History of Digital Singing (Speech) Synthesis

The earliest computer music project at Bell Labs in the late 1950s yielded a number of speech synthesis systems capable of singing, one being the acoustic tube model of Kelly and Lochbaum (1962). This model was actually an early physical model. At that time it was considered too computationally expensive for commercialization as a speech synthesizer, and too expensive to be practical for musical composition. Max Mathews worked with Kelly and Lochbaum to generate some early examples of singing synthesis (Computer Music Journal 1995; Wergo 1995).

Other techniques to arise from the early legacy of speech signal processing include the channel vocoder (VOice CODER) (Dudley 1939) and linear predictive coding (LPC) (Atal 1970; Makhoul 1975). In the vocoder, the spectrum is broken into sections called sub-bands, and the information in each sub-band is analyzed, then parameters are stored or transmitted for reconstruction at another time or site. The parametric data representing the information in each sub-band can be manipulated, yielding transformations such as pitch or time shifting, or spectral shaping. The vocoder does not strictly assume that the signal is speech, and thus generalizes to other sounds. The phase vocoder, implemented using the discrete Fourier transform, has found extensive use in computer music (Moorer 1978; Dolson 1986).


Figure 1. Vocal tract shape and spectrum of vowel /i/ (as in "beet"), showing formants and harmonics of periodic voice source.

Figure 2. Vocal tract shape (left) and spectrum (right) of consonant /ʃ/ ("shh"), showing a noisy spectrum with two formants.

The introduction of linear predictive coding (Atal 1970) revolutionized speech technology, and had a great impact on musical composition as well (Moorer 1979; Steiglitz and Lansky 1981; Lansky 1989). With LPC, a time-varying filter is automatically designed that predicts the next value of the signal, based on past samples. An error signal is produced which, if fed back through the time-varying filter, will yield exactly the original signal. The filter models linear correlations in the signal, which correspond to spectral features such as formants. The error signal models the input to the formant filter, and typically is periodic and impulsive for voiced speech, and noise-like for unvoiced speech. The success of LPC in speech coding is largely due to the similarity between the source/filter decomposition yielded by the mathematics of linear prediction, and the source/filter model of the human vocal tract. The power of LPC as a speech compression technique (Spanias 1994) stems from its ability to parametrically code and compress the source and filter parameters. The effectiveness of LPC as a compositional tool emerges from its ability to modify the parameters before resynthesis. There are weaknesses, however, in LPC, which are related to the assumption of linearity inherent in the filter model. Also, all spectral properties are modeled in the filter. In actuality the voice has multiple possible sources of non-linear behavior, including source-tract coupling, non-linear wall vibration losses, and aerodynamic effects. Due to these deviations from the ideal source-filter model, the result of analysis/modification/resynthesis using LPC or a sub-band vocoder often sounds "buzzy."


Cross-Synthesis and Other Compositional Attractions of Vocal Models

The compositional interest in vocal analysis/synthesis has at least three foundations. The first is rooted in the human as a linguistic organism, for it seems in the nature of humans to find interest in voice-like sounds. Any technique or device that allows independent control over pitch and spectral peaks tends to produce sounds that are vocal in nature, and such sounds catch the interest of humans. The second compositional interest in using systems that decompose sounds in a source/filter paradigm is to allow for cross-synthesis. Cross-synthesis involves the analysis of two instruments, typically a voice and a non-voice instrument, with the parameters exchanged and modified on resynthesis. This allows the resonances of the voice to be imposed on the source of a non-voice instrument. The third interest comes from the fact that once pitch and resonance structure are analyzed as they evolve in time, these three dimensions are independently available to some extent for manipulation on resynthesis. The elusive goals of being able to stretch time without changing pitch, to change pitch without changing timbral quality, etc., are all of high interest to computer music composers.
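Cross-synthesis as just described can be sketched using the `lpc` helper from the LPC example above: whiten the instrument with its own inverse filter, then re-color the result with the voice's formant filter (frame size and order are arbitrary placeholder choices):

```python
import numpy as np
from scipy.signal import lfilter

FRAME = 512  # illustrative frame length in samples

def cross_synthesize(voice, instrument, order=12):
    """Impose the voice's resonances on the instrument's source."""
    n = min(len(voice), len(instrument))
    out = np.zeros(n)
    for start in range(0, n - FRAME + 1, FRAME):
        v = voice[start:start + FRAME]
        s = instrument[start:start + FRAME]
        a_voice = lpc(v, order)                    # voice formant filter
        a_inst = lpc(s, order)                     # instrument's own envelope
        excitation = lfilter(a_inst, [1.0], s)     # whiten the instrument
        out[start:start + FRAME] = lfilter([1.0], a_voice, excitation)
    return out
```

A real implementation would overlap-add windowed frames and carry filter state across frame boundaries; this is only the skeleton of the parameter exchange.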

Other Popular Synthesis Techniques

Frequency modulation (FM) proved successful for singing synthesis (Chowning 1981, 1989) as well as the synthesis of other sounds. As described in communications literature, FM involves the modulation of the frequency of one oscillator with the output of another to create a spread spectrum consisting of side-bands surrounding the original carrier (oscillator that is modulated) frequency. In FM sound synthesis, both the carrier and modulator oscillators typically store a sinusoidal waveform, and operate in the audio band. By controlling the amount of modulation, and using multiple carrier/modulator pairs, spectra of somewhat arbitrary shape can be constructed. This technique proved efficient yet sufficiently flexible for music composition, and became the basis for the most successful commercial music synthesizers in history. In vocal modeling, carriers placed near formant locations in the spectrum are modulated by a common modulator oscillator operating at the voice fundamental frequency.
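A sketch of the vocal FM configuration just described, with one carrier per formant and a common modulator at the fundamental (this follows the general scheme, not Chowning's actual patch; the formant values and modulation index are illustrative):

```python
import numpy as np

SR = 16000

def fm_vowel(f0, formants, amps, index=1.5, dur=1.0):
    """One carrier per formant, all phase-modulated by a common
    modulator at f0; each carrier sits on the harmonic of f0 nearest
    its formant center so the result stays harmonic."""
    t = np.arange(int(SR * dur)) / SR
    mod = np.sin(2 * np.pi * f0 * t)               # common modulator
    out = np.zeros_like(t)
    for fc, amp in zip(formants, amps):
        carrier = round(fc / f0) * f0              # nearest harmonic to formant
        out += amp * np.sin(2 * np.pi * carrier * t + index * mod)
    return out / max(np.abs(out).max(), 1e-9)

# rough /a/-like tone; formant frequencies are placeholder values
tone = fm_vowel(f0=110.0, formants=[700, 1200, 2500], amps=[1.0, 0.5, 0.25])
```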

Sinusoidal speech modeling (McAulay and Quatieri 1986) has been improved and applied to music synthesis by Julius Smith and Xavier Serra (Smith and Serra 1987; Serra and Smith 1990), Xavier Rodet and Philippe Depalle (1992), and others. These techniques use Fourier analysis to locate and track individual sinusoidal partials. Individual trajectories (tracks) of sinusoidal amplitude, frequency, and phase as a function of time are extracted from the time-varying peaks in a series of short-time Fourier transforms. To help define tracks, heuristics regarding physical systems and the voice in particular are used, such as the fact that a sinusoid should not appear, disappear, or change frequency or phase instantaneously. The sinusoids can be resynthesized from the track parameters, after modification or coding, by additive synthesis. Noise can be treated as rapidly varying sinusoids, or explicitly as a non-sinusoidal component.
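The core of such a tracker can be sketched in a few lines: pick spectral peaks in each short-time frame, then extend a track only when a peak lies close in frequency to the track's last value, encoding the "no instantaneous change" heuristic. Everything here (thresholds, window sizes, the greedy matching) is a deliberately crude placeholder for the published algorithms:

```python
import numpy as np
from scipy.signal import stft, find_peaks

def track_partials(x, sr, max_jump=50.0, n_fft=2048, hop=256):
    """Greedy sinusoidal peak tracking over a series of STFT frames."""
    freqs, _, X = stft(x, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    tracks = []  # each track is a list of (frame_index, frequency, amplitude)
    for j in range(X.shape[1]):
        mag = np.abs(X[:, j])
        peaks, _ = find_peaks(mag, height=mag.max() * 0.01)
        for p in peaks:
            f, a = freqs[p], mag[p]
            for tr in tracks:                      # try to continue a track
                last_j, last_f, _ = tr[-1]
                if last_j == j - 1 and abs(last_f - f) < max_jump:
                    tr.append((j, f, a))
                    break
            else:
                tracks.append([(j, f, a)])         # otherwise a track is born
    return tracks
```

Resynthesis is then additive: each track drives one oscillator whose amplitude and frequency are interpolated between frames.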

Formant wave functions (FOFs in French) were pioneered by Xavier Rodet (1984) at the Institut de Recherche et Coordination Acoustique/Musique (IRCAM). An FOF is a time-domain waveform model of the impulse response of individual formants, characterized as a sinusoid at the formant center frequency with an amplitude that rises rapidly upon excitation and decays exponentially. By describing a spectral region as a windowed sinusoidal oscillation in the time domain, an FOF can be viewed as a special type of wavelet. The control parameters define the center frequency and bandwidth of the formant being modeled, and the rate at which the FOFs are generated and added determines the base frequency of the voice. The synthesis system for using FOFs was dubbed CHANT, and found application in general synthesis (Rodet, Potard, and Barriere 1984). Gerald Bennett and Xavier Rodet used CHANT to produce a number of impressive singing examples and compositions (Bennett and Rodet 1989).
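The FOF idea translates almost directly into code: one grain is a sinusoid at the formant center with a fast raised-cosine attack and an exponential decay, and grains are overlap-added once per fundamental period. Grain length, rise time, and bandwidth below are illustrative values, not CHANT's:

```python
import numpy as np

SR = 16000

def fof_grain(fc, bw, rise=0.003, dur=0.02):
    """One formant wave function: the exponential decay sets the formant
    bandwidth, the short raised-cosine attack smooths the onset."""
    t = np.arange(int(SR * dur)) / SR
    env = np.exp(-np.pi * bw * t)
    attack = t < rise
    env[attack] *= 0.5 * (1 - np.cos(np.pi * t[attack] / rise))
    return env * np.sin(2 * np.pi * fc * t)

def fof_voice(f0, formants, amps, dur=1.0):
    """Overlap-add one grain per formant every fundamental period;
    the grain rate determines the perceived pitch."""
    out = np.zeros(int(SR * dur))
    period = int(SR / f0)
    for start in range(0, len(out), period):
        for fc, amp in zip(formants, amps):
            g = amp * fof_grain(fc, bw=80.0)
            end = min(start + len(g), len(out))
            out[start:end] += g[:end - start]
    return out
```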

Formant synthesizers, in which individual formants


are modeled by second-order resonant filters, have been investigated by many speech researchers (Rabiner 1968; Klatt 1980). An attractive feature of formant synthesizers is that Fourier or LPC analysis can be used to automatically extract formant frequencies and source parameters from recorded speech. Charles Dodge used such techniques in a composition in 1973 (Dodge 1989). The group that has accomplished the most in the domain of singing synthesis using formant models is the Speech Transmission Laboratory (STL) of the Royal Institute of Technology (KTH), Stockholm. The STL MUSSE DIG (MUsic and Singing Synthesis Equipment, DIGital version) synthesizer (Carlson and Neovius 1990) has been used in singing synthesis (Zera, Gauffin, and Sundberg 1984), for studying performance synthesis-by-rule (Sundberg 1989), and has been adapted for real-time control in performance (Carlson et al. 1991). The KTH has conducted and published extensively on speech, and has arguably produced the largest body of research on singing (Sundberg 1987) and music, both acoustics and performance. Robert C. Maher (1995) recently demonstrated singing synthesis using modified forms of the second-order resonant filter which lend themselves to parallel implementation.
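A sketch of the parallel arrangement such results point at: each formant is a second-order bandpass section fed by the same source, and the branch outputs are summed, so per-formant amplitudes are directly controllable. The coefficient formulas and all parameter values are illustrative textbook choices, not Maher's filters:

```python
import numpy as np
from scipy.signal import lfilter

SR = 16000

def formant_section(fc, bw, gain):
    """Two-pole bandpass section for one formant; the (1 - r*r) factor
    roughly normalizes the peak gain."""
    r = np.exp(-np.pi * bw / SR)
    a = [1.0, -2 * r * np.cos(2 * np.pi * fc / SR), r * r]
    b = [gain * (1 - r * r), 0.0, 0.0]
    return b, a

def parallel_formant_synth(source, formants):
    """Each section filters the raw source; outputs are summed."""
    return sum(lfilter(*formant_section(fc, bw, g), source)
               for fc, bw, g in formants)

src = np.zeros(SR)
src[::SR // 110] = 1.0                             # 110-Hz pulse train source
y = parallel_formant_synth(src, [(600, 80, 1.0), (1200, 90, 0.6), (2400, 120, 0.3)])
```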

Acoustic Tube Models of the Vocal Tract

Acoustic tube models solve the wave equation, usually in one dimension, inside a smoothly varying tube. The one-dimensional approximation is justified by noting that the length of the vocal tract is significantly larger than any width dimension, and thus the longitudinal modes dominate the resonance structure up to about 4,000 Hz. Modal standing waves in an acoustic tube correspond to the formants. The basic Kelly and Lochbaum model (Kelly and Lochbaum 1962) critically samples space and time by approximating the smooth vocal tract tube with cylindrical segments equal in length to the distance traveled by a sound wave in one time sample.
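A compact sketch of such a critically sampled tube follows. The sign conventions, the lossy end reflections of 0.9, and the use of pressure rather than velocity waves are all assumptions of this sketch, not a transcription of any of the systems below:

```python
import numpy as np

def kl_vocal_tract(areas, excitation):
    """Kelly-Lochbaum line: one cylindrical section per sample of wave
    travel; the reflection coefficient at each junction comes from the
    adjacent cross-sectional areas: k = (A[i] - A[i+1]) / (A[i] + A[i+1])."""
    areas = np.asarray(areas, dtype=float)
    k = (areas[:-1] - areas[1:]) / (areas[:-1] + areas[1:])
    n = len(areas)
    right = np.zeros(n)                      # right-going traveling waves
    left = np.zeros(n)                       # left-going traveling waves
    out = np.zeros(len(excitation))
    for t, glottal in enumerate(excitation):
        new_r, new_l = np.zeros(n), np.zeros(n)
        for i in range(n - 1):               # scattering at each junction
            new_r[i + 1] = (1 + k[i]) * right[i] - k[i] * left[i + 1]
            new_l[i] = k[i] * right[i] + (1 - k[i]) * left[i + 1]
        new_r[0] = glottal + 0.9 * left[0]   # nearly closed glottis end
        new_l[-1] = -0.9 * right[-1]         # open (sign-inverting) lip end
        out[t] = right[-1]                   # take output at the lips
        right, left = new_r, new_l
    return out
```

Changing the area profile `areas` moves the standing-wave resonances, i.e., the formants.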

The SPASM and Singer systems (Cook 1992) are based on a physical model of the vocal tract filter, developed using the waveguide formulation (Smith 1987). This model is a direct descendent of the Kelly and Lochbaum model, but with many enhancements, such as a nasal tract, modeling of radiation through the throat wall, various steady and pulsed noise sources (Chafe 1990), and real-time controls. Shinji Maeda's (1982) model numerically integrates the wave equation using the rectangular method in space, and the trapezoidal rule in time. Wall losses are also modeled, and an articulatory layer of control modifies the basic tube shape from higher-order descriptions like tongue and jaw position. René Carré's (1992) model is based on distinctive regions (DR) arising from sensitivity analysis, noting that movements in particular regions of the vocal tract affect formant frequencies more than movements in others. Hill, Manzara, and Taube-Schock (1995) have implemented a synthesis-by-rule system using a model based on distinctive regions, with libraries and examples that include singing synthesis. Liljencrants (1985) investigated an undersampled acoustic tube model and derived rules for modifying the shape without adding unnaturally to the energy contained within the vocal tract. The computer music research group in Helsinki (Välimäki and Karjalainen 1994) has used fractional sample interpolation and truncated conical tube segments to derive an improved version of the Kelly and Lochbaum model.

Other Active Singing Synthesis Projects

Pabon (1993) has constructed a singing synthesizer, with real-time formant control via spectrogram-like displays called phonetograms, and source waveform synthesis using FOF-like controls. Titze and Story (1993) have produced a super-computer tenor called "Pavarobotti" that sings duets with Titze, and is used for studying many aspects of the voice, including advanced physical models of normal and pathological vocal folds. Howard and Rossiter (Howard and Rossiter 1993; Rossiter and Howard 1994) have studied source parameters for more natural singing synthesis, as well as interactive singing analysis software for pedagogical applications.


Spectral Models vs. Physical Models

Synthesis models can be loosely broken into two groups: spectral models, which can be viewed as based on perceptual mechanisms, and physical models, which can be viewed as based on production mechanisms. Of the models and techniques discussed above, the spectrally based models include FM, FOFs, vocoders, and sinusoidal models. Acoustic tube models are physically based, while formant synthesizers are spectral models, but could be classified as pseudo-physical because of the source/filter decomposition. It's possible to interpret LPC three ways: as a least-squares linear prediction in the time domain, as a least-squares matching process on the spectrum, and as a source-filter decomposition. Therefore, LPC is both spectral and pseudo-physical, but not strictly a physical model, because wave variables are not propagated directly, and no articulation parameters go into the basic model. Since LPC can be mapped to a filter related to the acoustic tube model (Markel and Gray 1976), it may be brought into the physical camp.
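That mapping is concrete enough to sketch: the step-down (reverse Levinson) recursion converts the LPC polynomial into reflection coefficients, which relate to area ratios of a lossless tube. Sign and ordering conventions differ across texts; the ones below are one common choice, and `lip_area` is an arbitrary normalization:

```python
import numpy as np

def lpc_to_reflection(a):
    """Step-down recursion: LPC polynomial [1, a1, ..., ap] -> k1..kp."""
    cur = np.asarray(a, dtype=float).copy()
    ks = []
    for p in range(len(cur) - 1, 0, -1):
        k = cur[p]
        if abs(k) >= 1.0:
            raise ValueError("unstable LPC filter")
        ks.append(k)
        cur = (cur[:p] - k * cur[p:0:-1]) / (1 - k * k)  # peel off stage p
    return ks[::-1]

def reflection_to_areas(ks, lip_area=1.0):
    """Tube section areas from k = (A[i+1] - A[i]) / (A[i+1] + A[i]),
    working inward from the lips."""
    areas = [lip_area]
    for k in reversed(ks):
        areas.append(areas[-1] * (1 - k) / (1 + k))
    return areas[::-1]
```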

Both physical and spectral models have merit, and one or another might be more suitable given a specific goal and set of computational resources. The main attraction of physical models is that most of the control parameters are those that a human uses to control his/her own vocal system. As such, some intuition can be brought into the design and composition processes. Another motivation is that time-varying model parameters can be generated by the model itself, if the model is constructed so that it sufficiently matches the physical system. Disadvantages of physical models are that the number of control parameters can be large, and while some parameters might have intuitive significance for humans (jaw drop), others might not (specific muscles controlling the vocal folds). Further, parameters often interact in non-obvious ways. In general there exist no exact methods for analysis/resynthesis using physical models. Parameter estimation techniques have been investigated, but for physical models of reasonable complexity, especially those involving any non-linear component, identity analysis/resynthesis is a practical and often theoretical impossibility (Cook 1991b; Scavone and Cook 1994).

Model Extensions and Future Work

Work remains to be done in refining techniques for spectral analysis and synthesis of the voice. For example, a spectral envelope estimation technique like that of Galas and Rodet (1990) allows more accurate formant tracking on even high female tones, which because of the large inter-harmonic spacing have proven difficult for analysis systems in the past.

There are far more directions for research to proceed in improving physical models and source models for pseudo-physical models of the voice. Most of them involve some significant component of non-linearity, and/or higher dimensional models. The main research areas involve modeling of airflow in the vocal tract, development of more exact models of the inner shape of the vocal tract tube, physical models of the tongue and other articulators, more accurate models of the vocal folds, and facial animation coupled to voice synthesis.

The modeling of flow is a difficult but important task, and until recently it has been confined to theoretical explorations, occasionally verified experimentally with hot-wire anemometry or other flow measurement techniques (Teager 1980). Mico Hirschberg has begun to make advances in actually photographing flow in constructed models of musical instruments and the vocal tract (Pelorson et al. 1994). These techniques, combined with classical and new theories, should yield greater understanding about air flow and how it affects vocal acoustics.

Along with more exact solutions to the flow-physics problems, efficient means for calculating the flow simulations must also emerge, allowing the inclusion of these non-linear effects in practical synthesis models (Chafe 1995; Verge 1995).

Constructing a physical model that includes more detailed simulations of the dynamics of the tongue and articulators would allow the model to calculate the time-varying parameters, rather than


having the shape, etc. explicitly specified or calculated. Wilhelms-Tricarico (1995) has developed a set of models of soft tissue, and has used these to construct a tongue model. Such models can be calibrated from the results of articulation studies using X-ray pellets, magnetic resonance imaging, and other techniques. All of this can combine to yield models that "behave" correctly in a dynamical sense, and give a better picture of the fine structure of the space inside the vocal tract. This latter information is critical if flow simulations are to be accurate.

Vocal fold models continue to be the target of much research, and, like the case of airflow, theories are difficult to conclusively prove or disprove. More elaborate models of the vocal fold tissue are being developed (Story and Titze 1995), and theoretical and experimental studies revisiting and comparing the classic models are being conducted (Rodet 1995).

Facial animation coupled with speech synthesis is important for a number of reasons. One reason is for pedagogy, where speech synthesizers with animated displays could be used as teaching and rehabilitation tools. Another important reason involves speech perception in general, because humans use a significant amount of lip reading in understanding speech. Work has been done by Massaro (1987) and Hill, Pearce, and Wyvill (1988), employing facial animation to study coupling of visual and auditory information in human speech understanding (McGurk and MacDonald 1976). Musically, we know that the face of the singer can carry even more information about the meaning of music than the actual text being sung (Scotto Di Carlo and Guaitella 1995), further motivating the combination of facial animation with singing synthesis.

Modeling Performance

One of the distinguishing features of the voice is the continuous nature of pitch control, both intentional and uncontrolled. Research in random and periodic pitch deviations (Sundberg 1987; Chowning 1989; Ternstrom and Friberg 1989; Prame 1994; Cook 1995), and the synthesis and perception of short vibrato tones (d'Allessandro and Castellengo 1993), has provided data and models for more natural sounding voice synthesis.
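A small sketch of a pitch contour that mixes periodic vibrato with a slow random drift, in the spirit of the deviation studies cited above (the 5.5-Hz rate, the extents in cents, and the one-pole smoother are illustrative numbers, not measured data):

```python
import numpy as np

SR = 16000

def pitch_contour(f0, dur, vib_rate=5.5, vib_cents=30.0, drift_cents=6.0, seed=0):
    """f0 contour = sinusoidal vibrato + low-pass-filtered random drift."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(SR * dur)) / SR
    vibrato = vib_cents * np.sin(2 * np.pi * vib_rate * t)
    drift = np.zeros_like(t)
    state = 0.0
    for i in range(len(t)):                # one-pole smoother on white noise
        state += 0.001 * (rng.standard_normal() - state)
        drift[i] = state
    drift *= drift_cents / max(np.abs(drift).max(), 1e-9)
    return f0 * 2.0 ** ((vibrato + drift) / 1200.0)   # cents -> Hz

freqs = pitch_contour(220.0, dur=2.0)
tone = np.sin(2 * np.pi * np.cumsum(freqs) / SR)      # integrate f0 to phase
```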

On the macro scale, rule systems for vocal performance and phrasing (Berndtsson 1995), and composition (Rodet and Cointe 1984; Barriere, Iovino, and Laurson 1991) have been constructed. The Stockholm KTH rule system is available on the compact disc Information Technology and Music (KTH 1994). These important areas of research shall remain a topic for a future survey paper.

Extended Singing and Language Systems

Investigations into singing styles, traditions, and acoustics other than Western traditional Bel Canto include studies of overtone singing (Bloothooft et al. 1992), traditional Scandinavian shepherd singing (Johnson, Sundberg, and Willbrand 1983), a highly structured system of funeral laments (Ross and Lehiste 1993), and even castrati singing (Depalle, Garcia, and Rodet 1994). Language systems for the SPASM/Singer instruments include an Ecclesiastical Latin system called LECTOR (Cook 1991a), and a system for modern Greek called IGDIS (Cook et al. 1993). The IGDIS system includes support for arbitrary tuning systems, and common vocal ornaments can be called up by name, allowing traditional folk songs and Byzantine chants to be synthesized quickly.

Real-Time Voice Processing and Interactive Karaoke

Recently, commercial products have been introduced that allow for real-time "smart harmonies" to be added to a vocal signal, or implement real-time score following with accompaniment. Vocoders and LPC, by virtue of being analysis/synthesis systems, allow potential for real-time modification of voice signals under the control of rules or real-time computer processes. We will soon see systems that integrate pitch detection, score following, and


sophisticated voice processing algorithms into a new generation of interactive karaoke systems. This will remain a topic for a future review paper.
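One building block of such systems, a plain autocorrelation pitch detector, can be sketched as follows (the search range and the bare-bones voicing test are illustrative; a real-time tracker would add smoothing and better voicing decisions):

```python
import numpy as np

def detect_pitch(frame, sr, fmin=80.0, fmax=800.0):
    """Return the frequency of the strongest autocorrelation peak in
    the allowed lag range, or 0.0 for (crudely judged) unvoiced frames."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), min(int(sr / fmin), len(ac) - 1)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag if ac[lag] > 0 else 0.0
```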

    References

Atal, B. 1970. "Speech Analysis and Synthesis by Linear Prediction of the Speech Wave." Journal of the Acoustical Society of America 47:65(A).
Barriere, J. B., F. Iovino, and M. Laurson. 1991. "A New CHANT Synthesizer in C and its Control Environment in Patchwork." In Proceedings of the 1991 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 11-14.
Bennett, G., and X. Rodet. 1989. "Synthesis of the Singing Voice." In Mathews, M., and J. Pierce, eds., Current Directions in Computer Music Research. Cambridge, Massachusetts: The MIT Press, pp. 19-44.
Berndtsson, G. 1995. "The KTH Rule System for Singing Synthesis." Computer Music Journal 20(1):76-91.
Bloothooft, G., et al. 1992. "Acoustics and Perception of Overtone Singing." Journal of the Acoustical Society of America 92(4):1827-1836.
Carlson, G., and L. Neovius. 1990. "Implementations of Synthesis Models for Speech and Singing." STL-Quarterly Progress and Status Report. Stockholm: KTH, pp. 2/3:63-67.
Carlson, G., et al. 1991. "A New Digital System for Singing Synthesis Allowing Expressive Control." In Proceedings of the 1991 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 315-318.
Carre, R. 1992. "Distinctive Regions in Acoustic Tubes." Journal d'Acoustique 5(141):141-159.
Chafe, C. 1990. "Pulsed Noise in Self-Sustained Oscillations of Musical Instruments." In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. New York: IEEE Press, pp. 1157-1160.
Chafe, C. 1995. "Adding Vortex Noise to Wind Instrument Physical Models." In Proceedings of the 1995 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 57-60.
Chowning, J. 1981. "Computer Synthesis of the Singing Voice." In Research Aspects on Singing. Stockholm: KTH, pp. 4-13.
Chowning, J. 1989. "Frequency Modulation Synthesis of the Singing Voice." In Mathews, M., and J. Pierce, eds., Current Directions in Computer Music Research. Cambridge, Massachusetts: The MIT Press, pp. 57-64.
Computer Music Journal. 1995. Computer Music Journal Volume 19 Compact Disc. Cambridge, Massachusetts: The MIT Press.
Cook, P. 1991a. "LECTOR: An Ecclesiastical Latin Control Language for the SPASM/Singer Instrument." In Proceedings of the 1991 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 319-321.
Cook, P. 1991b. "Non-Linear Periodic Prediction for On-Line Identification of Oscillator Characteristics in Woodwind Instruments." In Proceedings of the International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 157-160.
Cook, P. 1992. "SPASM: A Real-Time Vocal Tract Physical Model Editor/Controller and Singer: The Companion Software Synthesis System." Computer Music Journal 17(1):30-44.
Cook, P. 1995. "A Study of Pitch Deviation in Singing as a Function of Pitch and Dynamics." In Proceedings of the 13th International Congress of Phonetic Sciences. Stockholm: KTH, pp. 1:202-205.
Cook, P., et al. 1993. "IGDIS: A Modern Greek Text to Speech/Singing Program for the SPASM/Singer Instrument." In Proceedings of the International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 387-389.
d'Allessandro, C., and M. Castellengo. 1993. "The Pitch of Short-Duration Vibrato Tones: Experimental Data and Numerical Model." In Proceedings of the Stockholm Music Acoustics Conference. Stockholm: KTH, pp. 25-30.
Depalle, P., G. Garcia, and X. Rodet. 1994. "A Virtual Castrato (!?)." In Proceedings of the 1994 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 357-360.
Dodge, C. 1989. "On Speech Songs." In Mathews, M., and J. Pierce, eds., Current Directions in Computer Music Research. Cambridge, Massachusetts: The MIT Press, pp. 9-18.
Dolson, M. 1986. "The Phase Vocoder: A Tutorial." Computer Music Journal 10(4):14-27.
Dudley, H. 1939. "The Vocoder." Bell Laboratories Record, December.
Galas, T., and X. Rodet. 1990. "An Improved Cepstral Method for Deconvolution of Source-Filter Systems with Discrete Spectra: Application to Musical Sound Signals." In Proceedings of the 1990 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 82-84.

Hill, D., L. Manzara, and C. Taube-Schock. 1995. "Real-Time Articulatory Speech-Synthesis-By-Rules." AVIOS. San Jose, California.
Hill, D., A. Pearce, and B. Wyvill. 1988. "Animating Speech: An Automated Approach Using Speech Synthesized by Rules." The Visual Computer 3(5):277-289.
Howard, D., and D. Rossiter. 1993. "Real-Time Visual Displays for Use in Singing Training: An Overview." In Proceedings of the Stockholm Music Acoustics Conference. Stockholm: KTH, pp. 191-196.
Johnson, A., J. Sundberg, and H. Willbrand. 1983. "Kölning: A Study of Phonation and Articulation in a Type of Swedish Herding Song." In Proceedings of the Stockholm Music Acoustics Conference. Stockholm: KTH, pp. 187-202.
Kelly, J., and C. Lochbaum. 1962. "Speech Synthesis" (paper G42). In Proceedings of the Fourth International Congress on Acoustics, pp. 1-4.
Klatt, D. 1980. "Software for a Cascade/Parallel Formant Synthesizer." Journal of the Acoustical Society of America 67(3):971-995.
KTH. 1994. Information Technology and Music (a compact disc to celebrate the 75th anniversary of the Royal Swedish Academy of Engineering Science). Stockholm: KTH.
Lansky, P. 1989. "Compositional Applications of Linear Predictive Coding." In Mathews, M., and J. Pierce, eds., Current Directions in Computer Music Research. Cambridge, Massachusetts: The MIT Press, pp. 5-8.
Liljencrants, J. 1985. Speech Synthesis With a Reflection-Type Line Analog. DS Dissertation, Speech Communication and Music Acoustics. Stockholm: KTH.
Maeda, S. 1982. "A Digital Simulation Method of the Vocal Tract System." Speech Communication 1:199-299.
Maher, R. 1995. "Tunable Bandpass Filters in Music Synthesis" (paper 4098 L2). In Proceedings of the Audio Engineering Society Conference.
Makhoul, J. 1975. "Linear Prediction: A Tutorial Review." Proceedings of the IEEE 63:561-580.
Markel, J., and A. Gray. 1976. Linear Prediction of Speech. New York: Springer.
Massaro, D. 1987. Speech Perception by Ear and Eye. Hillsdale, New Jersey: Erlbaum Associates.
Mathews, M., and J. Pierce, eds. 1989. Current Directions in Computer Music Research. Cambridge, Massachusetts: The MIT Press.
McAulay, R., and T. Quatieri. 1986. "Speech Analysis/Synthesis Based on a Sinusoidal Representation." IEEE Transactions on Acoustics, Speech, and Signal Processing 34(4):744-754.
McGurk, H., and J. MacDonald. 1976. "Hearing Lips and Seeing Voices." Nature 264:746-748.
Moorer, A. 1978. "The Use of the Phase Vocoder in Computer Music Applications." Journal of the Audio Engineering Society 26(1/2):42-45.
Moorer, A. 1979. "The Use of Linear Prediction of Speech in Computer Music Applications." Journal of the Audio Engineering Society 27(3):134-140.
Pabon, P. 1993. "A Real-Time Singing Voice Synthesizer." In Proceedings of the Stockholm Music Acoustics Conference. Stockholm: KTH, pp. 288-293.
Pelorson, X., et al. 1994. "Theoretical and Experimental Study of Quasi-Steady Flow Separation Within the Glottis During Phonation. Applications to a Modified Two-Mass Model." Journal of the Acoustical Society of America 96(6):3416-3431.
Prame, E. 1994. "Measurements of the Vibrato Rate of Ten Singers." Journal of the Acoustical Society of America 96(4):1979-1984.
Rabiner, L. 1968. "Digital Formant Synthesizer." Journal of the Acoustical Society of America 43(4):822-828.
Rodet, X. 1984. "Time-Domain Formant-Wave-Function Synthesis." Computer Music Journal 8(3):9-14.
Rodet, X. 1995. "One and Two Mass Model Oscillations for Voice and Instruments." In Proceedings of the 1995 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 207-210.
Rodet, X., and P. Cointe. 1984. "FORMES: Composition and Scheduling of Processes." Computer Music Journal 8(3):32-50.
Rodet, X., and P. Depalle. 1992. "Spectral Envelopes and Inverse FFT Synthesis" (paper 3393 H3). In Proceedings of the Audio Engineering Society Conference. New York: AES.
Rodet, X., Y. Potard, and J. B. Barriere. 1984. "The CHANT Project: From the Synthesis of the Singing Voice to Synthesis in General." Computer Music Journal 8(3):15-31.
Ross, J., and I. Lehiste. 1993. "Estonian Laments: A Study of Their Temporal Structure." In Proceedings of the Stockholm Music Acoustics Conference. Stockholm: KTH, pp. 244-248.
Rossiter, D., and D. Howard. 1994. "Voice Source and Acoustic Output Qualities for Singing Synthesis." In Proceedings of the 1994 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 191-196.


Scavone, G., and P. Cook. 1994. "Combined Linear and Non-Linear Periodic Prediction in Calibrating Models of Musical Instruments to Recordings." In Proceedings of the 1994 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 433-434.
Scotto Di Carlo, N., and I. Guaitella. 1995. "Facial Expressions in Singing." In Proceedings of the 13th International Congress of Phonetic Sciences. Stockholm: KTH, pp. 1:226-229.
Serra, X., and J. Smith. 1990. "Spectral Modeling Synthesis: A Sound Analysis/Synthesis System Based on a Deterministic plus Stochastic Decomposition." Computer Music Journal 14(4):12-24.
Smith, J. 1987. "Musical Applications of Digital Waveguides." Technical report STAN-M-39. Stanford University Center for Computer Research in Music and Acoustics.
Smith, J., and X. Serra. 1987. "PARSHL: An Analysis/Synthesis Program for Non-Harmonic Sounds Based on a Sinusoidal Representation." In Proceedings of the 1987 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 290-297.
Spanias, A. 1994. "Speech Coding: A Tutorial Review." Proceedings of the IEEE 82(10):1541-1582.
Steiglitz, K., and P. Lansky. 1981. "Synthesis of Timbral Families by Warped Linear Prediction." Computer Music Journal 5(3):45-49.
Story, B., and I. Titze. 1995. "Voice Simulation With a Body-Cover Model of the Vocal Folds." Journal of the Acoustical Society of America 97(2):3416-3431.
Sundberg, J. 1987. The Science of the Singing Voice. DeKalb, Illinois: Northern Illinois University Press.
Sundberg, J. 1989. "Synthesis of Singing by Rule." In Mathews, M., and J. Pierce, eds., Current Directions in Computer Music Research. Cambridge, Massachusetts: The MIT Press, pp. 45-56.
Teager, H. 1980. "Some Observations on Oral Air Flow During Phonation." IEEE Transactions on Acoustics, Speech, and Signal Processing 28(5):599-601.
Ternstrom, S., and A. Friberg. 1989. "Analysis and Simulation of Small Variations in the Fundamental Frequency of Sustained Vowels." STL-Quarterly Progress and Status Report 3:1-14.
Titze, I., and B. Story. 1993. "The Iowa Singing Synthesis." In Proceedings of the Stockholm Music Acoustics Conference. Stockholm: KTH, p. 294.
Välimäki, V., and M. Karjalainen. 1994. "Improving the Kelly-Lochbaum Vocal Tract Model Using Conical Tube Sections and Fractional Delay Filtering Techniques." In Proceedings of the 1994 International Conference on Spoken Language Processing. Yokohama, Japan, pp. 18-22.
Verge, M. 1995. Aeroacoustics of Confined Jets, with Applications to the Physics of Recorder-Like Instruments. Thesis, Technical University of Eindhoven (also available from IRCAM).
Wergo. 1995. The Historical CD of Digital Sound Synthesis. WER 2033-2.
Wilhelms-Tricarico, R. 1995. "Physiological Modeling of Speech Production: Methods for Modeling Soft-Tissue Articulators." Journal of the Acoustical Society of America 97(5):3085-3098.
Zera, J., J. Gauffin, and J. Sundberg. 1984. "Synthesis of Selected VCV-Syllables in Singing." In Proceedings of the 1984 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 83-86.
