Upload
others
View
7
Download
0
Embed Size (px)
Citation preview
Computer Audio and Music
Perry R. Cook
Princeton Computer Science
(also Music)
Perry R. CookPerry R. Cook
Princeton Computer SciencePrinceton Computer Science
(also Music)(also Music)
First, Some Questions
What is Music?What is Music?What is Music?
First, Some Questions
What is Sound?What is Sound?What is Sound?
First, Some Questions
How are sound and music represented on the computer?
How are sound and How are sound and music represented music represented on the computer?on the computer?
Music/Sound Overview
• Basic Audio storage/playback (sampling)
• Human Audio Perception
• Digital Sound and Music Compression and Representation
• Sound Synthesis
• Music Control and Expression
•• Basic Audio storage/playback (sampling)Basic Audio storage/playback (sampling)
•• Human Audio PerceptionHuman Audio Perception
•• Digital Sound and Music Digital Sound and Music Compression Compression and Representationand Representation
•• Sound SynthesisSound Synthesis
•• Music Control and ExpressionMusic Control and Expression
Waveform Sampling and Playback
• Sample and Hold (rate vs. Aliasing)
• Quantize
Word Size vs. Quantization Noise
• Reconstruct: Hold and Smooth (filter)
•• Sample and Hold Sample and Hold (rate (rate vsvs . Aliasing). Aliasing)
•• QuantizeQuantize
Word Size vs. Quantization NoiseWord Size vs. Quantization Noise
•• Reconstruct: Hold and Smooth (filter)Reconstruct: Hold and Smooth (filter)
Waveform Sampling: Quantization
Quantization
Introduces
Noise
Quantization Quantization
Introduces Introduces
NoiseNoise
Compression and Parametric Representation (Why Bother??)So Many Bits, So Little Time (Space)
• CD audio rate: 2 * 2 * 8 * 44100 = 1,411,200 bps
• CD audio storage: 10,584,000 bytes / minute
• A CD holds only about 70 minutes of audio
• An ISDN line can only carry 128,000 bps
• Even a cable modem might carry only 1Mbps
Security: Best representation removes all
recognizable about the original sound
Graphics people get all the bandwidth, cycles, memo ry
Expression, composition, interaction wanted too!
So Many Bits, So Little Time (Space)So Many Bits, So Little Time (Space)
•• CD audio rate: 2 * 2 * 8 * 44100 = 1,411,200 bpsCD audio rate: 2 * 2 * 8 * 44100 = 1,411,200 bps
•• CD audio storage: 10,584,000 bytes / minuteCD audio storage: 10,584,000 bytes / minute
•• A CD holds only about 70 minutes of audioA CD holds only about 70 minutes of audio
•• An ISDN line can only carry 128,000 bpsAn ISDN line can only carry 128,000 bps
•• Even a cable modem might carry only 1MbpsEven a cable modem might carry only 1Mbps
Security: Best representation removes all Security: Best representation removes all
recognizable about the original soundrecognizable about the original sound
Graphics people get all the bandwidth, cycles, memo ryGraphics people get all the bandwidth, cycles, memo ry
Expression, composition, interaction wanted too!Expression, composition, interaction wanted too!
Views of Sound
– Sound is Perceived: Perception-BasedPsychoacoustically Motivated Compression
– Sound is Produced: Production-Based Physics/Source Model Motivated Compression
– Music(Sound) is Performed/Published/Represented: Event-Based Compression
– Sound is a Waveform / Statistical Distribution / et c.(these are not very good ideas in general,
unless we get lucky (LPC))
–– Sound is Perceived: PerceptionSound is Perceived: Perception --BasedBasedPsychoacoustically Psychoacoustically Motivated CompressionMotivated Compression
–– Sound is Produced: ProductionSound is Produced: Production --Based Based Physics/Source Model Motivated CompressionPhysics/Source Model Motivated Compression
–– Music(Sound) is Performed/Published/Represented: Music(Sound) is Performed/Published/Represented: EventEvent --Based CompressionBased Compression
–– Sound is a Waveform / Statistical Distribution / et c.Sound is a Waveform / Statistical Distribution / et c.(these are not very good ideas in general, (these are not very good ideas in general,
unless we get lucky (LPC))unless we get lucky (LPC))
Psychoacoustics
Human sound perception:Human sound perception:Human sound perception:
Auditory cortex: Auditory cortex: further refine further refine time & frequency time & frequency informationinformation
Cochlea: Cochlea: convert to convert to frequency frequency dependent dependent nerve firingsnerve firings
Ear: Ear: receive receive 11--D D waveswaves
Brain:Brain:Higher level Higher level cognition, cognition, object object formation, formation, interpretationinterpretation
Perceptual Models
Exploit masking, etc., to discard
perceptually irrelevant information.
• Example: Quantize soft sounds more accurately, loud sounds less accurately
Benefits: Generic, does not require assumpt ions
about what produced the sound
Drawbacks: Highest compression is difficult to a chieve
Exploit masking, etc., to discardExploit masking, etc., to discard
perceptually irrelevant information.perceptually irrelevant information.
•• Example: Quantize soft sounds more accurately, Example: Quantize soft sounds more accurately, loud sounds less accuratelyloud sounds less accurately
Benefits: Generic, does not require assumpt ionsBenefits: Generic, does not require assumpt ions
about what produced the soundabout what produced the sound
Drawbacks: Highest compression is difficult to a chieveDrawbacks: Highest compression is difficult to a chieve
Production Models
Build a model of the sound production system, then fit the parameters
• Example: If signal is speech, then a well parameterized vocal model can yield highest quality and compression ratio
Benefits: Highest possible compression
Drawbacks: Signal source(s) must be
assumed, known, or identified
Build a model of the sound production system, Build a model of the sound production system, then fit the parametersthen fit the parameters
•• Example: If signal is speech, then a well Example: If signal is speech, then a well parameterized vocal model can yield parameterized vocal model can yield highest quality and compression ratiohighest quality and compression ratio
Benefits: Highest possible compressionBenefits: Highest possible compression
Drawbacks: Signal source(s) must be Drawbacks: Signal source(s) must be
assumed, known, or identifiedassumed, known, or identified
Audio Compression
Classical Data Compression View:
Take advantage of
• Redundancy/Correlation
• Statistics (Local/Global)
• Assumptions / Models
Problem: Much of this doesn’t work directly on sound waveform data
Classical Data Compression View:Classical Data Compression View:
Take advantage ofTake advantage of
•• Redundancy/CorrelationRedundancy/Correlation
•• Statistics (Local/Global)Statistics (Local/Global)
•• Assumptions / ModelsAssumptions / Models
Problem: Much of this doesnProblem: Much of this doesn ’’ t work t work directly on sound waveform datadirectly on sound waveform data
Transform (Subband) Coders
Split signal into frequency subbands, then allocate bits to regions adaptively,
based on where ear is most sensitive
Lossless (variable bit rate & comp. ratio)
Lossy (fixed rate and ratio) MP3
Split signal into frequencySplit signal into frequency subbandssubbands , then , then allocate bits to regions adaptively, allocate bits to regions adaptively,
based on where ear is most sensitivebased on where ear is most sensitive
Lossless (variable bit rate Lossless (variable bit rate & comp. ratio)& comp. ratio)
Lossy Lossy (fixed rate and ratio) MP3(fixed rate and ratio) MP3
Production Models
Build a parametric model of the production system, then either
Fit the parameters to a given signal
Use signal processing techniques to extract parameters
Drive the parameters directly (no encode/decode)
Examples: Rule system to drive speech synthesize r
MIDI file to drive music synthesizer
Build a parametric model of the Build a parametric model of the production system, then eitherproduction system, then either
Fit the parameters to a given signalFit the parameters to a given signal
Use signal processing techniques to Use signal processing techniques to extract parametersextract parameters
Drive the parameters directly (no encode/decode)Drive the parameters directly (no encode/decode)
Examples: Rule system to drive speech synthesize r Examples: Rule system to drive speech synthesize r
MIDI file to drive music synthesizerMIDI file to drive music synthesizer
Speech Coders (production)
Assume speech is produced by a source-filter system (vocal folds/noise + vocal tract tube)
Identify filter, type of source, then code paramete rs
Takes advantage of slowly varying nature of vocal t ract shape and other speech parameters
Assume speech is produced by a sourceAssume speech is produced by a source --filter system filter system (vocal folds/noise + vocal tract tube)(vocal folds/noise + vocal tract tube)
Identify filter, type of source, then code paramete rsIdentify filter, type of source, then code paramete rs
Takes advantage of slowly varying nature of vocal t ract Takes advantage of slowly varying nature of vocal t ract shape and other speech parametersshape and other speech parameters
Future: Multi-Model Parametric Compressors?
Analysis front end identifies source(s)
Audio is (separated and) sent to optimal model(s)
Benefits:
High compression
Other knowledge
Drawbacks:
We don’t know how
to do all this yet
Analysis front end identifies source(s)Analysis front end identifies source(s)
Audio is (separated and) sent to optimal model(s)Audio is (separated and) sent to optimal model(s)
Benefits: Benefits:
High compressionHigh compression
Other knowledgeOther knowledge
Drawbacks: Drawbacks:
We donWe don ’’ t know howt know how
to do all this yetto do all this yet
More Questions:
What can be (musically or sonically)
computed?
What can be What can be (musically or sonically)(musically or sonically)
computed?computed?
MIDI and Other ‘Event’ Models
Musical I nstrument D igital I nterface
Represents Music as Notes and Eventsand uses a synthesis engine to “render” it.
An Edit D ecision L ist (EDL) is another example.
A history of source materials, transformations, and processing steps is kept. Operations can be undone or recreated easily. Intermediate non-parametric files are not saved.
MMusical usical IInstrument nstrument DDigital igital IInterfacenterface
Represents Music as Notes and EventsRepresents Music as Notes and Eventsand uses a synthesis engine to and uses a synthesis engine to ““ renderrender ”” it.it.
An An EEdit dit DDecision ecision LList (EDL) is another example.ist (EDL) is another example.
A history of source materials, transformations, A history of source materials, transformations, and processing steps is kept. Operations can and processing steps is kept. Operations can be undone or recreated easily. Intermediate be undone or recreated easily. Intermediate nonnon --parametric files are not saved.parametric files are not saved.
Event Based Music Representation
MIDI and Other Scorefiles
• A Musical Score is a very compact representation of music
• Even the score itself can be compressed further
Benefits: Highest possible compression
Encodes “expression”
Drawbacks: Cannot guarantee the “performance”
Cannot assure the quality of the sounds
Cannot make arbitrary sounds (yet)
MIDI and OtherMIDI and Other ScorefilesScorefiles
•• A Musical Score is a very compact A Musical Score is a very compact representation of musicrepresentation of music
•• Even the score itself can be compressed furtherEven the score itself can be compressed further
Benefits: Highest possible compressionBenefits: Highest possible compression
Encodes Encodes ““ expressionexpression ””
Drawbacks: Cannot guarantee the Drawbacks: Cannot guarantee the ““ performanceperformance ””
Cannot assure the quality of the soundsCannot assure the quality of the sounds
Cannot make arbitrary sounds (yet)Cannot make arbitrary sounds (yet)
MIDI
QuickTime™ and a TIFF (Uncompressed) decompressor are needed to see this picture.
Event Based Representation
Enter General MIDI
• Guarantees a base set of instrument sounds,
• and a means for addressing them,
• but doesn’t guarantee any quality
Better Yet, Downloadable Sounds
• Download samples for instruments
• Benefits: Does more to guarantee quality
• Drawbacks: Samples aren’t reality
Enter General MIDIEnter General MIDI
•• Guarantees a base set of instrument sounds,Guarantees a base set of instrument sounds,
•• and a means for addressing them,and a means for addressing them,
•• but doesnbut doesn ’’ t guarantee any qualityt guarantee any quality
Better Yet, Downloadable SoundsBetter Yet, Downloadable Sounds
•• Download samples for instrumentsDownload samples for instruments
•• Benefits: Does more to guarantee qualityBenefits: Does more to guarantee quality
•• Drawbacks: Samples arenDrawbacks: Samples aren ’’ t realityt reality
Event Based Representation
Downloadable Algorithms
• Specify the algorithm,the synthesis engine runs it,
and we just send parameter changes
• Part of “Structured Audio” (MPEG4)
Benefits: Can upgrade algorithms laterCan implement scalable synthesis
Drawbacks: Different algorithm for each class of s ounds(but can always fall back on samples)
Downloadable AlgorithmsDownloadable Algorithms
•• Specify the algorithm,Specify the algorithm,the synthesis engine runs it,the synthesis engine runs it,
and we just send parameter changesand we just send parameter changes
•• Part of Part of ““ Structured AudioStructured Audio ”” (MPEG4)(MPEG4)
Benefits: Can upgrade algorithms laterBenefits: Can upgrade algorithms laterCan implement scalable synthesisCan implement scalable synthesis
Drawbacks: Different algorithm for each class of s oundsDrawbacks: Different algorithm for each class of s ounds(but can always fall back on samples)(but can always fall back on samples)
Physical Modeling for Music
Strings (plucked, struck, bowed)
Winds (clarinet, flute, brass), voice
Plates, membranes, bar percussion
Shakers, scrapers
The Voice
Sounds Effects (PhOLISE)
Strings (plucked, struck, bowed)Strings (plucked, struck, bowed)
Winds (clarinet, flute, brass), voiceWinds (clarinet, flute, brass), voice
Plates, membranes, bar percussionPlates, membranes, bar percussion
Shakers, scrapersShakers, scrapers
The VoiceThe Voice
Sounds Effects (Sounds Effects ( PhOLISEPhOLISE ))
Physical Modeling: the “Real World”
Synthesizing Solids
QuickTime™ and a YUV420 codec decompressor are needed to see this picture.
O’Brien, Cook, and Essl
SIGGRAPH 01
OO’’Brien, Cook, Brien, Cook, and and EsslEssl
SIGGRAPH 01SIGGRAPH 01
QuickTime™ and a DV/DVCPRO - NTSC decompressor are needed to see this picture.
Composition and Creation
Garton“Rough
Raga Riffs”
Lansky“mild
und leise”
GartonGarton““ Rough Rough
Raga RiffsRaga Riffs ””
LanskyLansky ““ mild mild
und und leiseleise ””
Music for Unprepared PianoMusic for Unprepared PianoBargarBargar ,, ChoiChoi , Betts, Cook, Betts, Cook
QuickTime™ and a Sorenson Video decompressor are needed to see this picture.
Expression and Control
Trueman: BoSSATruemanTrueman :: BoSSABoSSA
Cook/Morrill TrumpetCook/Morrill Trumpet
Other ControllersOther Controllers
PICOs (musical and “real-world” sonic controllers)K-FrogJ-MugP-PedalPhilGlasP-GrinderT-shoe T-bourinePico GloveP-Ray’s Cafe
KK--FrogFrogJJ--MugMugPP--PedalPedalPhilGlasPhilGlasPP--GrinderGrinderTT--shoe shoe TT--bourinebourinePico GlovePico GlovePP--RayRay’’s Cafes Cafe
Sound Analysis and Classification
Cochlear Modeling
Multi-feature analysis(Tzanetakis)
Segmentation, Classification, Annotation, Thumbnail s
Cochlear ModelingCochlear Modeling
MultiMulti --feature analysisfeature analysis ((TzanetakisTzanetakis ) )
Segmentation, Classification, Annotation, Thumbnail sSegmentation, Classification, Annotation, Thumbnail s
Music, AI, andNeuroscienceReplace (model) the Performer (Garton, Miranda, RagaMatic)
Replace (model) the Composer (Cope (EMI), Tom & Andy (the “Brain”))
Replace (model) the Listener?? (MoodLogic.com, other Recommenders)
Replace (model) the Performer Replace (model) the Performer ((GartonGarton , Miranda, , Miranda, RagaMaticRagaMatic ))
Replace (model) the Composer Replace (model) the Composer (Cope (EMI), Tom & Andy (the (Cope (EMI), Tom & Andy (the ““ BrainBrain ”” )) ))
Replace (model) the Listener?? Replace (model) the Listener?? ((MoodLogicMoodLogic .com, other Recommenders).com, other Recommenders)
Music (Art) and Technology
COS: Human-Computer Interfacing, Pervasive Information Systems, Transforming Reality
FRS: TechnoMusic I: 100,000 BC - 1999
FRS/414: Princeton Laptop Orchestra (PLOrk)
MUS 539: Technology and Voice
Broad view of Technology: “Any intentionally fashioned tool or technique”
Broad view of Music: Organized Sound
COS: COS: HumanHuman --Computer Interfacing, Pervasive Computer Interfacing, Pervasive Information Systems, Transforming RealityInformation Systems, Transforming Reality
FRS:FRS: TechnoMusic TechnoMusic I: 100,000 BC I: 100,000 BC -- 19991999
FRS/414: FRS/414: Princeton Laptop Orchestra (Princeton Laptop Orchestra ( PLOrkPLOrk ))
MUS 539: Technology and VoiceMUS 539: Technology and Voice
Broad view of Technology: Broad view of Technology: ““ Any intentionally fashioned tool or techniqueAny intentionally fashioned tool or technique ””
Broad view of Music: Organized SoundBroad view of Music: Organized Sound
Audio and Computer Music