Computer Audio and Music - Princeton University

Computer Audio and Music

Perry R. Cook

Princeton Computer Science

(also Music)

Perry R. CookPerry R. Cook

Princeton Computer SciencePrinceton Computer Science

(also Music)(also Music)

First, Some Questions

What is Music?What is Music?What is Music?


What is Sound?What is Sound?What is Sound?


How are sound and music represented on the computer?

How are sound and How are sound and music represented music represented on the computer?on the computer?

Music/Sound Overview

• Basic Audio storage/playback (sampling)

• Human Audio Perception

• Digital Sound and Music Compression and Representation

• Sound Synthesis

• Music Control and Expression

•• Basic Audio storage/playback (sampling)Basic Audio storage/playback (sampling)

•• Human Audio PerceptionHuman Audio Perception

•• Digital Sound and Music Digital Sound and Music Compression Compression and Representationand Representation

•• Sound SynthesisSound Synthesis

•• Music Control and ExpressionMusic Control and Expression

Waveform Sampling and Playback

• Sample and Hold (rate vs. Aliasing)

• Quantize

Word Size vs. Quantization Noise

• Reconstruct: Hold and Smooth (filter)

•• Sample and Hold Sample and Hold (rate (rate vsvs . Aliasing). Aliasing)

•• QuantizeQuantize

Word Size vs. Quantization NoiseWord Size vs. Quantization Noise

•• Reconstruct: Hold and Smooth (filter)Reconstruct: Hold and Smooth (filter)

Waveform Sampling: Quantization

Quantization

Introduces

Noise

Quantization Quantization

Introduces Introduces

NoiseNoise

Compression and Parametric Representation (Why Bother??)So Many Bits, So Little Time (Space)

• CD audio rate: 2 * 2 * 8 * 44100 = 1,411,200 bps

• CD audio storage: 10,584,000 bytes / minute

• A CD holds only about 70 minutes of audio

• An ISDN line can only carry 128,000 bps

• Even a cable modem might carry only 1Mbps

Security: Best representation removes all

recognizable about the original sound

Graphics people get all the bandwidth, cycles, memo ry

Expression, composition, interaction wanted too!

So Many Bits, So Little Time (Space)So Many Bits, So Little Time (Space)

•• CD audio rate: 2 * 2 * 8 * 44100 = 1,411,200 bpsCD audio rate: 2 * 2 * 8 * 44100 = 1,411,200 bps

•• CD audio storage: 10,584,000 bytes / minuteCD audio storage: 10,584,000 bytes / minute

•• A CD holds only about 70 minutes of audioA CD holds only about 70 minutes of audio

•• An ISDN line can only carry 128,000 bpsAn ISDN line can only carry 128,000 bps

•• Even a cable modem might carry only 1MbpsEven a cable modem might carry only 1Mbps

Security: Best representation removes all Security: Best representation removes all

recognizable about the original soundrecognizable about the original sound

Graphics people get all the bandwidth, cycles, memo ryGraphics people get all the bandwidth, cycles, memo ry

Expression, composition, interaction wanted too!Expression, composition, interaction wanted too!

Views of Sound

– Sound is Perceived: Perception-BasedPsychoacoustically Motivated Compression

– Sound is Produced: Production-Based Physics/Source Model Motivated Compression

– Music(Sound) is Performed/Published/Represented: Event-Based Compression

– Sound is a Waveform / Statistical Distribution / et c.(these are not very good ideas in general,

unless we get lucky (LPC))

–– Sound is Perceived: PerceptionSound is Perceived: Perception --BasedBasedPsychoacoustically Psychoacoustically Motivated CompressionMotivated Compression

–– Sound is Produced: ProductionSound is Produced: Production --Based Based Physics/Source Model Motivated CompressionPhysics/Source Model Motivated Compression

–– Music(Sound) is Performed/Published/Represented: Music(Sound) is Performed/Published/Represented: EventEvent --Based CompressionBased Compression

–– Sound is a Waveform / Statistical Distribution / et c.Sound is a Waveform / Statistical Distribution / et c.(these are not very good ideas in general, (these are not very good ideas in general,

unless we get lucky (LPC))unless we get lucky (LPC))

Psychoacoustics

Human sound perception:Human sound perception:Human sound perception:

Auditory cortex: Auditory cortex: further refine further refine time & frequency time & frequency informationinformation

Cochlea: Cochlea: convert to convert to frequency frequency dependent dependent nerve firingsnerve firings

Ear: Ear: receive receive 11--D D waveswaves

Brain:Brain:Higher level Higher level cognition, cognition, object object formation, formation, interpretationinterpretation

Perceptual Models

Exploit masking, etc., to discard

perceptually irrelevant information.

• Example: Quantize soft sounds more accurately, loud sounds less accurately

Benefits: Generic, does not require assumpt ions

about what produced the sound

Drawbacks: Highest compression is difficult to a chieve

Exploit masking, etc., to discardExploit masking, etc., to discard

perceptually irrelevant information.perceptually irrelevant information.

•• Example: Quantize soft sounds more accurately, Example: Quantize soft sounds more accurately, loud sounds less accuratelyloud sounds less accurately

Benefits: Generic, does not require assumpt ionsBenefits: Generic, does not require assumpt ions

about what produced the soundabout what produced the sound

Drawbacks: Highest compression is difficult to a chieveDrawbacks: Highest compression is difficult to a chieve

Production Models

Build a model of the sound production system, then fit the parameters

• Example: If signal is speech, then a well parameterized vocal model can yield highest quality and compression ratio

Benefits: Highest possible compression

Drawbacks: Signal source(s) must be

assumed, known, or identified

Build a model of the sound production system, Build a model of the sound production system, then fit the parametersthen fit the parameters

•• Example: If signal is speech, then a well Example: If signal is speech, then a well parameterized vocal model can yield parameterized vocal model can yield highest quality and compression ratiohighest quality and compression ratio

Benefits: Highest possible compressionBenefits: Highest possible compression

Drawbacks: Signal source(s) must be Drawbacks: Signal source(s) must be

assumed, known, or identifiedassumed, known, or identified

Audio Compression

Classical Data Compression View:

Take advantage of

• Redundancy/Correlation

• Statistics (Local/Global)

• Assumptions / Models

Problem: Much of this doesn’t work directly on sound waveform data

Classical Data Compression View:Classical Data Compression View:

Take advantage ofTake advantage of

•• Redundancy/CorrelationRedundancy/Correlation

•• Statistics (Local/Global)Statistics (Local/Global)

•• Assumptions / ModelsAssumptions / Models

Problem: Much of this doesnProblem: Much of this doesn ’’ t work t work directly on sound waveform datadirectly on sound waveform data

Transform (Subband) Coders

Split signal into frequency subbands, then allocate bits to regions adaptively,

based on where ear is most sensitive

Lossless (variable bit rate & comp. ratio)

Lossy (fixed rate and ratio) MP3

Split signal into frequencySplit signal into frequency subbandssubbands , then , then allocate bits to regions adaptively, allocate bits to regions adaptively,

based on where ear is most sensitivebased on where ear is most sensitive

Lossless (variable bit rate Lossless (variable bit rate & comp. ratio)& comp. ratio)

Lossy Lossy (fixed rate and ratio) MP3(fixed rate and ratio) MP3

Production Models

Build a parametric model of the production system, then either

Fit the parameters to a given signal

Use signal processing techniques to extract parameters

Drive the parameters directly (no encode/decode)

Examples: Rule system to drive speech synthesize r

MIDI file to drive music synthesizer

Build a parametric model of the Build a parametric model of the production system, then eitherproduction system, then either

Fit the parameters to a given signalFit the parameters to a given signal

Use signal processing techniques to Use signal processing techniques to extract parametersextract parameters

Drive the parameters directly (no encode/decode)Drive the parameters directly (no encode/decode)

Examples: Rule system to drive speech synthesize r Examples: Rule system to drive speech synthesize r

MIDI file to drive music synthesizerMIDI file to drive music synthesizer

Speech Coders (production)

Assume speech is produced by a source-filter system (vocal folds/noise + vocal tract tube)

Identify filter, type of source, then code paramete rs

Takes advantage of slowly varying nature of vocal t ract shape and other speech parameters

Assume speech is produced by a sourceAssume speech is produced by a source --filter system filter system (vocal folds/noise + vocal tract tube)(vocal folds/noise + vocal tract tube)

Identify filter, type of source, then code paramete rsIdentify filter, type of source, then code paramete rs

Takes advantage of slowly varying nature of vocal t ract Takes advantage of slowly varying nature of vocal t ract shape and other speech parametersshape and other speech parameters

Future: Multi-Model Parametric Compressors?

Analysis front end identifies source(s)

Audio is (separated and) sent to optimal model(s)

Benefits:

High compression

Other knowledge

Drawbacks:

We don’t know how

to do all this yet

Analysis front end identifies source(s)Analysis front end identifies source(s)

Audio is (separated and) sent to optimal model(s)Audio is (separated and) sent to optimal model(s)

Benefits: Benefits:

High compressionHigh compression

Other knowledgeOther knowledge

Drawbacks: Drawbacks:

We donWe don ’’ t know howt know how

to do all this yetto do all this yet

More Questions:

What can be (musically or sonically)

computed?

What can be What can be (musically or sonically)(musically or sonically)

computed?computed?

MIDI and Other ‘Event’ Models

Musical I nstrument D igital I nterface

Represents Music as Notes and Eventsand uses a synthesis engine to “render” it.

An Edit D ecision L ist (EDL) is another example.

A history of source materials, transformations, and processing steps is kept. Operations can be undone or recreated easily. Intermediate non-parametric files are not saved.

MMusical usical IInstrument nstrument DDigital igital IInterfacenterface

Represents Music as Notes and EventsRepresents Music as Notes and Eventsand uses a synthesis engine to and uses a synthesis engine to ““ renderrender ”” it.it.

An An EEdit dit DDecision ecision LList (EDL) is another example.ist (EDL) is another example.

A history of source materials, transformations, A history of source materials, transformations, and processing steps is kept. Operations can and processing steps is kept. Operations can be undone or recreated easily. Intermediate be undone or recreated easily. Intermediate nonnon --parametric files are not saved.parametric files are not saved.

Event Based Music Representation

MIDI and Other Scorefiles

• A Musical Score is a very compact representation of music

• Even the score itself can be compressed further

Benefits: Highest possible compression

Encodes “expression”

Drawbacks: Cannot guarantee the “performance”

Cannot assure the quality of the sounds

Cannot make arbitrary sounds (yet)

MIDI and OtherMIDI and Other ScorefilesScorefiles

•• A Musical Score is a very compact A Musical Score is a very compact representation of musicrepresentation of music

•• Even the score itself can be compressed furtherEven the score itself can be compressed further

Benefits: Highest possible compressionBenefits: Highest possible compression

Encodes Encodes ““ expressionexpression ””

Drawbacks: Cannot guarantee the Drawbacks: Cannot guarantee the ““ performanceperformance ””

Cannot assure the quality of the soundsCannot assure the quality of the sounds

Cannot make arbitrary sounds (yet)Cannot make arbitrary sounds (yet)

MIDI

QuickTime™ and a TIFF (Uncompressed) decompressor are needed to see this picture.

Event Based Representation

Enter General MIDI

• Guarantees a base set of instrument sounds,

• and a means for addressing them,

• but doesn’t guarantee any quality

Better Yet, Downloadable Sounds

• Download samples for instruments

• Benefits: Does more to guarantee quality

• Drawbacks: Samples aren’t reality

Enter General MIDIEnter General MIDI

•• Guarantees a base set of instrument sounds,Guarantees a base set of instrument sounds,

•• and a means for addressing them,and a means for addressing them,

•• but doesnbut doesn ’’ t guarantee any qualityt guarantee any quality

Better Yet, Downloadable SoundsBetter Yet, Downloadable Sounds

•• Download samples for instrumentsDownload samples for instruments

•• Benefits: Does more to guarantee qualityBenefits: Does more to guarantee quality

•• Drawbacks: Samples arenDrawbacks: Samples aren ’’ t realityt reality

Event Based Representation

Downloadable Algorithms

• Specify the algorithm,the synthesis engine runs it,

and we just send parameter changes

• Part of “Structured Audio” (MPEG4)

Benefits: Can upgrade algorithms laterCan implement scalable synthesis

Drawbacks: Different algorithm for each class of s ounds(but can always fall back on samples)

Downloadable AlgorithmsDownloadable Algorithms

•• Specify the algorithm,Specify the algorithm,the synthesis engine runs it,the synthesis engine runs it,

and we just send parameter changesand we just send parameter changes

•• Part of Part of ““ Structured AudioStructured Audio ”” (MPEG4)(MPEG4)

Benefits: Can upgrade algorithms laterBenefits: Can upgrade algorithms laterCan implement scalable synthesisCan implement scalable synthesis

Drawbacks: Different algorithm for each class of s oundsDrawbacks: Different algorithm for each class of s ounds(but can always fall back on samples)(but can always fall back on samples)

Physical Modeling for Music

Strings (plucked, struck, bowed)

Winds (clarinet, flute, brass), voice

Plates, membranes, bar percussion

Shakers, scrapers

The Voice

Sounds Effects (PhOLISE)

Strings (plucked, struck, bowed)Strings (plucked, struck, bowed)

Winds (clarinet, flute, brass), voiceWinds (clarinet, flute, brass), voice

Plates, membranes, bar percussionPlates, membranes, bar percussion

Shakers, scrapersShakers, scrapers

The VoiceThe Voice

Sounds Effects (Sounds Effects ( PhOLISEPhOLISE ))

Physical Modeling: the “Real World”

Synthesizing Solids

QuickTime™ and a YUV420 codec decompressor are needed to see this picture.

O’Brien, Cook, and Essl

SIGGRAPH 01

OO’’Brien, Cook, Brien, Cook, and and EsslEssl

SIGGRAPH 01SIGGRAPH 01

QuickTime™ and a DV/DVCPRO - NTSC decompressor are needed to see this picture.

Composition and Creation

Garton“Rough

Raga Riffs”

Lansky“mild

und leise”

GartonGarton““ Rough Rough

Raga RiffsRaga Riffs ””

LanskyLansky ““ mild mild

und und leiseleise ””

Music for Unprepared PianoMusic for Unprepared PianoBargarBargar ,, ChoiChoi , Betts, Cook, Betts, Cook

QuickTime™ and a Sorenson Video decompressor are needed to see this picture.

Expression and Control

Trueman: BoSSATruemanTrueman :: BoSSABoSSA

Cook/Morrill TrumpetCook/Morrill Trumpet

Other ControllersOther Controllers

PICOs (musical and “real-world” sonic controllers)K-FrogJ-MugP-PedalPhilGlasP-GrinderT-shoe T-bourinePico GloveP-Ray’s Cafe

KK--FrogFrogJJ--MugMugPP--PedalPedalPhilGlasPhilGlasPP--GrinderGrinderTT--shoe shoe TT--bourinebourinePico GlovePico GlovePP--RayRay’’s Cafes Cafe

Sound Analysis and Classification

Cochlear Modeling

Multi-feature analysis(Tzanetakis)

Segmentation, Classification, Annotation, Thumbnail s

Cochlear ModelingCochlear Modeling

MultiMulti --feature analysisfeature analysis ((TzanetakisTzanetakis ) )

Segmentation, Classification, Annotation, Thumbnail sSegmentation, Classification, Annotation, Thumbnail s

Music, AI, andNeuroscienceReplace (model) the Performer (Garton, Miranda, RagaMatic)

Replace (model) the Composer (Cope (EMI), Tom & Andy (the “Brain”))

Replace (model) the Listener?? (MoodLogic.com, other Recommenders)

Replace (model) the Performer Replace (model) the Performer ((GartonGarton , Miranda, , Miranda, RagaMaticRagaMatic ))

Replace (model) the Composer Replace (model) the Composer (Cope (EMI), Tom & Andy (the (Cope (EMI), Tom & Andy (the ““ BrainBrain ”” )) ))

Replace (model) the Listener?? Replace (model) the Listener?? ((MoodLogicMoodLogic .com, other Recommenders).com, other Recommenders)

Music (Art) and Technology

COS: Human-Computer Interfacing, Pervasive Information Systems, Transforming Reality

FRS: TechnoMusic I: 100,000 BC - 1999

FRS/414: Princeton Laptop Orchestra (PLOrk)

MUS 539: Technology and Voice

Broad view of Technology: “Any intentionally fashioned tool or technique”

Broad view of Music: Organized Sound

COS: COS: HumanHuman --Computer Interfacing, Pervasive Computer Interfacing, Pervasive Information Systems, Transforming RealityInformation Systems, Transforming Reality

FRS:FRS: TechnoMusic TechnoMusic I: 100,000 BC I: 100,000 BC -- 19991999

FRS/414: FRS/414: Princeton Laptop Orchestra (Princeton Laptop Orchestra ( PLOrkPLOrk ))

MUS 539: Technology and VoiceMUS 539: Technology and Voice

Broad view of Technology: Broad view of Technology: ““ Any intentionally fashioned tool or techniqueAny intentionally fashioned tool or technique ””

Broad view of Music: Organized SoundBroad view of Music: Organized Sound

Audio and Computer Music

Documents

Computer Audio and Music - Princeton University