The candidate confirms that the work submitted is their own and that appropriate credit has been given where reference has been made to the work of others. I understand that failure to attribute material which is obtained from another source may be considered as plagiarism.
(Signature of student)
Automatic Musical Pitch Correction
James Kirkbright
BSc. Computer Science
Session 2003/2004
Summary
The purpose of this project was to design and implement an Automatic Pitch Correction System that
is capable of detecting and correcting pitch errors within music. The initial stages of the project
involved research to investigate a range of existing methods currently used for pitch detection and
pitch correction. A system was then designed, based on selected existing algorithms, which is capable
of identifying incorrect notes within an audio sample, determining the pitch that the note should be,
and shifting the pitch of the note by the appropriate amount. The system was implemented within the
MATLAB environment, and operates on both monophonic and polyphonic wav files.
Following the implementation stages, analytical and qualitative evaluation was carried out in order to
assess the system’s performance over a range of different musical input. Analytical evaluation
involved inputting signals of known frequency and observing the performance of both the pitch
detection and the pitch correction within the Automatic Pitch Correction System. Qualitative evaluation involved assembling a group of judges, so that user evaluation could be carried out and a subjective opinion of the system's performance obtained.
Acknowledgments
I would like to thank my project supervisor, James Handley, for his constant advice and guidance
throughout this project.
Special thanks also go to those who participated in the group of judges during the user evaluation (see Appendix D).
Table of Contents
1 Introduction
   1.1 Problem Definition
   1.2 Project Aim and Objectives
   1.3 Minimum Requirements
   1.4 Possible Extensions
   1.5 Project Schedule
2 Background Research
   2.1 Introduction
   2.2 The Fourier Transform
   2.3 Pitch Detection
   2.4 Pitch Correction
   2.5 Formant Correction
3 Methodology
   3.1 Time Domain vs. Frequency Domain
   3.2 Design Approach
   3.3 Real-time Considerations
   3.4 Modelling Environment
4 Design
   4.1 Converting to Frequency Domain
   4.2 Pitch Detection
   4.3 Error Detection
   4.4 Pitch Correction
   4.5 Signal Reconstruction
5 Evaluation
   5.1 Evaluation Criteria
   5.2 Module Performance Results
   5.3 User Evaluation
6 Conclusions
   6.1 Evaluation of Minimum Requirements
   6.2 Evaluation of Possible Extensions
   6.3 Suggestions for Further Work
7 Alternative Methods
   7.1 Note Detection
   7.2 Modified Phase Vocoder
   7.3 Alternative Phase Alignment Techniques
References
Appendix A: Reflection on Project Experience
Appendix B: Project Schedule Gantt Chart
Appendix C: External Devices
Appendix D: User Evaluation Results
1 Introduction

1.1 Problem Definition
The standard of modern-day musical recordings is higher than ever, and the cost of studio time is ever increasing, often running to several hundred pounds for a few days' recording. Many
hours are wasted in the recording studio, redoing vocal takes or fixing instrument tracks that would
otherwise be perfect except for “that wrong note”. The availability of a system that could
automatically identify and correct imperfect notes could save hours of valuable studio time, avoiding
the frustration caused by constant retakes or the tedious process of correcting pitch errors by hand.
Used to improve the quality of tracks recorded by less experienced performers, or simply to provide more time to
focus on the creative aspects of recording music, an automatic pitch correction system would be a
valuable addition to any recording studio setup.
Although fixing mistakes in the studio is tedious and time-consuming, it is at least possible. No such luxury exists in live situations, which call for an automatic pitch correction system that runs in real time, detecting and correcting wrong notes as they are performed. The system would need to output the corrected signal with minimal delay, accurately correcting any errors in pitch without producing noticeable coloration or distortion of the original signal.
The aim of this project is to create a system (note - from here on the system shall be referred to as the
Automatic Pitch Correction System) that will automatically detect and correct the pitch on a single
instrument track without introducing distortion, phase errors, or other artefacts, ideally in real time.
The system shall function for a variety of different recorded instruments and shall aim to produce a
sound signal of the same quality and timbre as the original sound.
The use of computers to provide algorithmic solutions to problems once tackled only through traditional analogue circuitry is becoming increasingly important. The number of digital domain
recording and processing tools now available is greater than ever and many applications used within
the musical industry rely upon digital signal processing techniques. The use of computers within the
musical field is therefore highly relevant and such modules as PS23 – Introduction to Scientific
Computing and AR21 – Speech, Audio and Image Processing, provide a basis of knowledge from
which digital signal processing tools such as the Automatic Pitch Correction System may be created.
1.2 Project Aim and Objectives
The objectives of the project are to:
• Conduct a thorough investigation into existing methods of pitch detection and pitch correction, and evaluate the possibility of such methods being implemented in real time.
• Create a system that will accept an audio signal as input, detect any errors in pitch and correct them to the nearest note in the chromatic scale, outputting the corrected signal in real time.
• Ensure that the output signal is not only free from pitch defects but also maintains the characteristics and timbre of the original input sound.
• Evaluate the system, assessing its performance with respect to its ability to correct pitch defects, maintain the original characteristics of the input sound, and run in real time.
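As an illustration of the second objective, snapping a detected frequency to the nearest note of the equal-tempered chromatic scale takes only a few lines of arithmetic. The sketch below is in Python purely for illustration (the system described in this report was built in MATLAB), and it assumes A4 = 440 Hz as the reference pitch, which is a convention rather than a requirement stated here.

```python
import math

def nearest_chromatic(freq_hz, ref_hz=440.0):
    """Return (snapped_freq, cents_error): the nearest equal-tempered
    chromatic pitch to freq_hz, and how far sharp/flat the input was
    in cents. ref_hz = 440.0 (A4) is an illustrative assumption."""
    semitones = 12.0 * math.log2(freq_hz / ref_hz)   # distance from A4
    nearest = round(semitones)                       # nearest chromatic step
    snapped = ref_hz * 2.0 ** (nearest / 12.0)       # target frequency
    cents_error = 100.0 * (semitones - nearest)      # residual error in cents
    return snapped, cents_error
```

A pitch corrector would then shift the note by the ratio `snapped / freq_hz`.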
1.3 Minimum Requirements
The minimum requirements are:
• Discuss current methods and implementations used for detecting and correcting pitch errors
both in real time and in batch processing.
• Create a system that is capable of detecting and correcting pitch errors on a single, simple
instrument track, or tuning fork, without introducing distortion, phase errors, or other
artefacts.
• Evaluate system against current implementations already used for automatic pitch correction.
1.4 Possible Extensions
The possible extensions are:
• Perform automatic pitch-correction of input track in real-time, allowing for application in live
performances and real-time monitoring.
• Implement a feature that will allow a specific key to be selected before signal input. This
would enable out of pitch notes to be shifted to the nearest note in the selected key, as
opposed to the nearest note in the chromatic scale.
• The system could possibly be extended to allow pitch-correction of polyphonic material, such
as a de-tuned instrument playing a chord.
1.5 Project Schedule
For an undergraduate studying many other modules, each with exams and coursework deadlines, it is important to make the best use of the available time. In order to ensure good project
management, a project schedule is required. The table below details the specific tasks that require
completion in order to fulfil the requirements of this project. Each task is assigned a start and an end
date, outlining the proposed time schedule and the relative length of time required for each
deliverable.
Task no. Start Date End Date Objective/Deliverable
1 17/10/03 24/10/03 Identify Aims and Minimum Requirements
2 07/11/03 28/11/03 Background Research
3 14/11/03 12/12/03 Mid Project Report
4 12/12/03 26/12/03 Implementation of Pitch Detection Module – stage 1
5 23/01/04 06/02/04 Implementation of Pitch Detection Module – stage 2
6 06/02/04 19/03/04 Implementation of Pitch Correction Module
7 27/02/04 19/03/04 Analytical Evaluation
8 19/03/04 16/04/04 Qualitative Evaluation
9 19/03/04 28/04/04 Final Report
(Gantt Chart for this schedule is available in Appendix B)
The project objectives and deliverables were in most cases completed on schedule. However, the
original project schedule did not allocate a specific time period to implement the Error Detection
module required for the Automatic Pitch Correction System. As a result, task no. 5 – Implementation
of Pitch Detection module encompassed both Pitch Detection and Error Detection, and thus required
extra time. The table below illustrates the revisions made to the project schedule (altered dates are in
italics). Note that task no. 5 and task no. 6 now overlap; this is due to the requirement of
compatibility between the Error Detection and Pitch Correction modules; the Error Detection module
must produce an error measurement that is usable by the Pitch Correction module.
The original Project Schedule also did not allocate enough time to implement the Pitch Correction
module. Coursework deadlines reduced the available time and unforeseen implementation issues (see
7 Alternative Methods) extended the required amount of work. However, the time allocated for the Evaluation stages and for writing the Final Report offered some flexibility, which absorbed the extra time required by the preceding implementation stages.
Task no. Start Date End Date Objective/Deliverable
1 17/10/03 24/10/03 Identify Aims and Minimum Requirements
2 07/11/03 28/11/03 Background Research
3 14/11/03 12/12/03 Mid Project Report
4 12/12/03 26/12/03 Implementation of Pitch Detection Module – stage 1
5 23/01/04 17/02/04 Implementation of Pitch Detection Module – stage 2
6 06/02/04 26/03/04 Implementation of Pitch Correction Module
7 27/02/04 26/03/04 Analytical Evaluation
8 26/03/04 16/04/04 Qualitative Evaluation
9 26/03/04 28/04/04 Final Report
(Gantt Chart for this schedule is available in Appendix B)
2 Background Research
2.1 Introduction
Automatic pitch correction of an audio signal involves two separate operations: pitch detection and pitch correction. Both operations present many potential problems and
difficulties. Jehan [29] discusses the difficulties associated with detecting the pitch of musically
interesting sounds, since musical sounds are often harmonically rich and have extremely large
frequency ranges. Pitch detection algorithms must be designed to cope with a very large bandwidth,
and must determine pitch during the attack of a note, where amplitude is greatest and harmonic
complexity is at a maximum. Other complications may also arise from the existence of ambiguously
pitched sounds such as multiphonics or un-pitched sounds.
During pitch correction of an audio signal, most problems arise as a result of phase propagation errors
[13]. When performing pitch-modification of a given signal, it is non-trivial to alter the frequency at a given time instant without adversely affecting the phase of the signal. Algorithms that maintain or
restore phase coherence are required to avoid unwanted defects that are not present in the original
sound. Similarly, harmonics of a signal can be adversely affected by pitch-modification. Formant
information (see 2.5 Formant Correction) particularly can have a significant effect on the nature of a
sound, so it is important that appropriate techniques are employed to preserve formant information
from the original spectrum [28].
2.2 The Fourier Transform
The Fourier Transform [8] may be applied to any sampled signal in order to obtain a representation of
that signal as a group of sinusoidal waves. This allows us to perform complex spectral analysis of the
signal and perform many modifications, such as filtering out certain frequencies, shifting phase, or
even pitch scaling. However, direct evaluation of the Fourier Transform is often extremely
computationally expensive. For this reason, many algorithms exist that allow the Fourier Transform of
a signal to be computed with considerably fewer computations. The set of these algorithms are known
as Fast Fourier Transforms (FFT) [8]. These algorithms work by exploiting the inherent symmetry
present within the expression for the Fourier Transform and contribute significantly to the availability
of real-time signal processing.
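A minimal sketch of how such an algorithm exploits that symmetry is the radix-2 Cooley-Tukey decimation-in-time scheme, one of the standard FFT variants: the transform is split into even- and odd-indexed halves, and each twiddle factor serves two output bins at once. This Python version is for illustration only and requires a power-of-two input length.

```python
import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT (input length must be a power
    of two). Splitting into even/odd halves and reusing each twiddle
    factor for bins k and k + n/2 reduces the O(N^2) direct DFT
    to O(N log N)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t            # X[k]
        out[k + n // 2] = even[k] - t   # X[k + n/2] shares the same product
    return out
```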
The Short Time Fourier Transform (STFT) is an important and powerful tool used for spectral
analysis of time-varying signals [8]. When performing spectrum analysis of a time-varying signal,
simply taking the Fourier Transform of the whole signal will not yield useful or meaningful results.
By using a windowing function applied at various time points along the signal, single “frames” of the
signal may be considered to be almost stationary, although the signal is changing over time. Spectral
analysis using the Fourier Transform may then be applied at each of these frames, providing a series
of Fourier Transforms that represent both time-domain and frequency-domain properties of the signal.
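The windowed-frame analysis described above can be sketched as follows. A naive O(N^2) DFT is used per frame for clarity, and the Hann window, frame length, and hop size are illustrative choices, not values taken from this report.

```python
import cmath
import math

def stft(signal, frame_len=64, hop=32):
    """Short Time Fourier Transform sketch: slide a Hann window along
    the signal and take a DFT of each (approximately stationary) frame.
    Returns one half-spectrum per frame; a real implementation would
    use an FFT instead of this direct DFT."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = [sum(frame[n] * cmath.exp(-2j * cmath.pi * k * n / frame_len)
                        for n in range(frame_len))
                    for k in range(frame_len // 2)]  # bins up to Nyquist
        frames.append(spectrum)
    return frames
```

Each inner list is one vocoder "frame"; tracking how a bin's magnitude and phase evolve across frames gives the time-varying spectral picture described above.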
2.3 Pitch Detection
Jehan [29] explains the two ways of classifying pitch detection and pitch tracking. The first is "spectral-domain" pitch detection, whereby estimates of the pitch period (the duration of one cycle of the waveform at the fundamental frequency) are obtained by applying a Fourier transform to successive samples of an input signal. The second is "time-domain" pitch detection, where determination of the Glottal Closure Instant [5] and measurement of the time between "events" within the input signal allow the pitch period to be estimated. However, the latter approach is often unsuited to musical input signals due to the inherently wide range of fundamental frequencies present.
Godsill et al. [27] discuss a spectral-domain method used to detect deviation of pitch over a long time scale. This method is employed in the "smoothing" of audio signals that contain defects
such as “wow” or “flutter” (time-varying pitch defects not present in the original recording), common
in many old musical recordings. The method works by using “frequency tracking”, a process whereby
the input data is converted into a time frequency "map" that can be used to detect the pitch of principal frequency components. Once this stage is complete, pitch variations can be analysed and any
variations that affect all tones present within the music may be attributed to the defects caused by
“wow” or “flutter”. Other variations present may be attributed to genuine note changes or
progressions within the music and therefore can be ignored.
The frequency tracking uses the discrete Fourier Transform [8] to estimate as many tonal frequency
components present within the data as possible. When sampling the input audio, window lengths are
chosen to be short enough such that the signal within a single block is almost constant, and therefore
non-time-varying. This then allows blocks of data sharing similar frequency and amplitude to be
placed together along the same “frequency track”. The evolution of these frequency tracks may then
be used to estimate a pitch variation curve and then through subsequent processing, the defects may
be removed (see 2.4 Pitch Correction).
Jehan [29] proposes a multi-resolution, multi-scale analysis approach to pitch detection using
mathematical functions known as “wavelets” [11]. Wavelets work by separating data into different
frequency components and then applying a windowing length appropriate to the present frequency.
For example, long window lengths are used at low frequencies, whilst short window lengths are used
at high frequencies. This approach is advantageous over traditional Fourier methods since input
signals may contain such features as sharp peaks that require analysis at a greater resolution. It is
considered desirable to perform analysis in this way since human hearing works in a similar way [29].
Noll [1] introduces a frequency domain based pitch determination technique that may be used for
human speech known as the Harmonic Product Spectrum (HPS). The algorithm works by analysing
the short-term frequency content of a signal obtained using the STFT. The algorithm is
computationally efficient and is capable of running in real-time [21]. HPS works on the theory that the
spectrum of a musical note consists of a series of peaks, where one peak corresponds to the
fundamental frequency and all remaining peaks correspond to harmonic components at integer
multiples of the fundamental frequency. To obtain the fundamental frequency, the spectrum is compressed (downsampled) several times by successive integer factors and compared against the original, unaltered spectrum. Multiplying the spectra together produces strong peaks where the harmonics line up, and the largest of these peaks corresponds to the fundamental frequency.
There are two main drawbacks to the Harmonic Product Spectrum algorithm. Firstly, the accuracy of
the calculated fundamental frequency depends on the size of the Fourier transform used; a larger
Fourier transform corresponds to a larger number of frequency bins and therefore a higher accuracy in
identifying the fundamental frequency, whilst a smaller Fourier Transform corresponds to a smaller
number of frequency bins and therefore reduced accuracy; if the fundamental frequency of the input
signal falls between two frequency bins, the frequency has to be approximated. The second problem
occurs when multiplying the spectra together results in more than one major harmonic peak in the
power spectrum [20]. This almost always results in detecting the fundamental frequency one octave
too high [21]. The latter problem may be overcome by performing a post-processing algorithm
whereby amplitude peaks in the power spectrum are compared and if a lower peak exists that is of
sufficiently large amplitude then the lower octave peak is selected as the fundamental frequency.
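A compact sketch of the compress-and-multiply step, operating on a precomputed magnitude spectrum, is given below. The number of harmonics used (three) and the exclusion of bin 0 are illustrative choices; the octave-error post-processing described above is omitted for brevity.

```python
def harmonic_product_spectrum(mags, num_harmonics=3):
    """Harmonic Product Spectrum sketch: downsample (compress) the
    magnitude spectrum by factors 2..num_harmonics and multiply against
    the original, so that harmonics at integer multiples of the
    fundamental reinforce the fundamental's bin. Returns the index of
    the strongest product peak as the fundamental-bin estimate."""
    limit = len(mags) // num_harmonics        # keep all products in range
    hps = list(mags[:limit])
    for r in range(2, num_harmonics + 1):
        for k in range(limit):
            hps[k] *= mags[k * r]             # compressed spectrum, factor r
    return max(range(1, limit), key=lambda k: hps[k])
```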
A popular time-domain solution for pitch detection is to use the autocorrelation function [9, 17, 22]. The formulation used here sums the absolute differences between points along two copies of a signal over a given interval (strictly, this difference-based variant is known as the Average Magnitude Difference Function; the autocorrelation proper sums products rather than differences). To detect the fundamental frequency of a signal, windowed samples are taken whose length is at least twice the longest period to be detected. For
each windowed sample, a copy of the signal is shifted and compared with the original. Since all
periodic signals remain similar from one period to the next, as the shift amount approaches the
fundamental period of the windowed signal, the pointwise difference between the two signals will
decrease and therefore so will the autocorrelation function. To calculate the fundamental frequency
therefore requires finding the first minimum within the autocorrelation function, which corresponds to
the fundamental period of the signal, from which the fundamental frequency may be calculated. This
minimum may be located by differentiating the function and finding the points at which the sign of the derivative changes from negative to positive.
Although the above method is robust to noise and is capable of accurate results, using the
autocorrelation function to detect fundamental frequency is computationally expensive and requires a
high sampling rate in order to achieve high-resolution pitch detection. A lower sampling rate restricts
the amount by which signals may be shifted for comparison and therefore limits the resolution of
frequencies that may be detected.
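The difference-function search described above can be sketched as a minimal Python illustration. For simplicity it takes the minimum over a bounded lag range rather than locating the first derivative sign change; the caller must choose a range spanning less than two periods, otherwise integer multiples of the period (octave errors) tie for the minimum.

```python
import math

def detect_period(signal, min_lag, max_lag):
    """Estimate the fundamental period (in samples) by the
    sum-of-absolute-differences function: shift a copy of the windowed
    signal and find the lag at which it best matches itself. The search
    range [min_lag, max_lag] should span less than two periods to avoid
    octave ambiguity."""
    n = len(signal) - max_lag                 # usable comparison length
    best_lag, best_diff = min_lag, float("inf")
    for lag in range(min_lag, max_lag + 1):
        diff = sum(abs(signal[i] - signal[i + lag]) for i in range(n))
        if diff < best_diff:
            best_lag, best_diff = lag, diff   # deepest match so far
    return best_lag
```

Dividing the sampling rate by the returned period gives the fundamental frequency, as described above.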
2.4 Pitch Correction
There exist two fundamental methods for altering the pitch of a signal [28]. The first is “Frequency
Shifting”, a process whereby the input signal is shifted in frequency by modulating an analytical
signal by a complex exponential. However, this tends to lead to unwanted distortion of the original
sound signal, creating a metallic, inharmonic sound that bears little resemblance to the input. The second method is "Time/Pitch Scaling", where a change in pitch is achieved by altering the length of a sound and then applying a sample rate conversion technique to change the frequency, thereby preserving the harmonic qualities of the input signal.
Bernsee [28] introduces a popular technique used for time/pitch scaling, known as Time Domain
Harmonic Scaling (TDHS). Based on a method proposed by Rabiner and Schafer [18], TDHS works
by estimating the basic pitch period [27, 18] of the input signal. The fundamental frequency is then
calculated using the Short Time Average Magnitude Difference [18]. An output signal can then be
created by copying the input signal in an “overlap-and-add” fashion, whilst simultaneously
incrementing the input pointer relative to the fundamental frequency. This method of “Synchronised
Overlap and Add” results in the input signal being traversed at a different speed, thus creating a
change in pitch. Then by using the pitch period estimate, the signal may be aligned such that the time
base is unchanged, resulting in a pitch corrected signal of unchanged length.
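The overlap-and-add idea can be sketched as below. This is a plain, non-pitch-synchronous version for illustration only: TDHS additionally sizes and aligns the copied segments to the estimated pitch period to keep waveforms coherent, a step omitted here, and the grain and hop sizes are arbitrary assumptions.

```python
import math

def ola_timescale(signal, speed, grain=256, synth_hop=64):
    """Plain overlap-and-add time scaling: Hann-windowed grains are read
    from the input at one hop (synth_hop * speed) and written to the
    output at another (synth_hop), so the input is traversed at a
    different rate. Output length is roughly len(signal) / speed."""
    ana_hop = synth_hop * speed
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / grain)
              for n in range(grain)]
    num_grains = int((len(signal) - grain) // ana_hop) + 1
    out = [0.0] * ((num_grains - 1) * synth_hop + grain)
    pos = 0.0
    for g in range(num_grains):
        start = int(pos)
        write = g * synth_hop
        for n in range(grain):
            out[write + n] += signal[start + n] * window[n]
        pos += ana_hop
    return out
```

Resampling the time-scaled output back to the original length then converts the time change into a pitch change, as described above.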
However, time domain based techniques such as TDHS are often unsuited to polyphonic material and
suffer from high complexity due to the fact that estimation of fundamental frequency is required
before pitch scaling may be performed. To overcome these drawbacks, time/pitch scaling may be
performed in the frequency domain, where no estimation of the fundamental frequency is required at all, resulting in lower complexity and fewer calculations than time-domain methods [16].
The Phase Vocoder [4, 14, 15, 16, 18] is a well-established frequency domain method used for
time/pitch scaling of audio signals. The Phase Vocoder is an algorithm that allows either the timescale
or pitch of a signal to be modified, without adversely affecting the other. For example, timescale
modification on a signal may be performed without altering pitch, or pitch modification may be
performed whilst retaining the original time base of the signal.
Phase Vocoder based techniques accomplish pitch modification by a sequence of analysis,
modification and re-synthesis. During analysis, an STFT is applied to the input signal. The calculation
obtained at each time point within the STFT corresponds to a vocoder “channel”. Following this,
individual channels may then be altered accordingly to create the desired pitch-modification. Re-
synthesis is then performed using the Inverse Fast Fourier Transform [8]. The Phase Vocoder is
considered a powerful tool, due mainly to its efficient implementation using the Fast Fourier
Transform (FFT) [14].
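One core calculation inside the modification stage is estimating each channel's true frequency from the phase advance between consecutive analysis frames, since a partial rarely sits exactly on a bin centre. A sketch of that phase-unwrapping step follows; the frame and hop sizes are illustrative, and the exact modification scheme varies between phase vocoder implementations.

```python
import math

def channel_frequency(phase_prev, phase_cur, k, frame_len, hop, sample_rate):
    """Per-channel instantaneous-frequency estimate: compare the measured
    phase advance between two analysis frames with the advance expected
    for bin k's centre frequency, wrap the deviation to [-pi, pi), and
    convert the unwrapped advance back to Hz. Valid while the true
    frequency lies within sample_rate / (2 * hop) of the bin centre."""
    bin_omega = 2.0 * math.pi * k / frame_len          # rad/sample at bin k
    expected = bin_omega * hop                         # expected phase advance
    deviation = (phase_cur - phase_prev) - expected
    deviation = (deviation + math.pi) % (2.0 * math.pi) - math.pi  # wrap
    true_omega = bin_omega + deviation / hop
    return true_omega * sample_rate / (2.0 * math.pi)
```

During re-synthesis, the same quantity is used in reverse: the channel's phase is advanced by `true_omega` times the (possibly modified) synthesis hop.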
Laroche et al [14] explain that pitch scaling is achieved using the Phase Vocoder by altering the time
base of an input signal, thus creating a change in pitch. The signal is then resampled at an appropriate
sampling rate in order to restore the original replay rate of the signal whilst maintaining the change in
pitch. An important drawback of this scheme, highlighted by Laroche et al [14] is that only linear
frequency-alterations may be made to the input signal. To overcome this lack of flexibility and the
restrictions it imposes, Laroche et al [14] introduce two alternative implementations of the phase
vocoder, both aiming to improve flexibility by using a two-stage system of peak detection and peak
translation to achieve a change in frequency. A subsequent phase-adjustment is then required to avoid
phasing errors and maintain phase coherence between frames. Since each peak may be “shifted”
individually, non-linear frequency modifications may be obtained.
Garas et al [16] suggest an improved implementation of the phase vocoder inspired by the human
auditory system, where spectral analysis is performed in a non-uniform manner, thus simulating the
non-uniform way in which humans decode audio signals. To achieve this, a warping function is
applied to modify the spectral resolution produced by the use of the Fast Fourier Transform [4, 8, 18,
19]. The modification is made so that the constant-bandwidth resolution becomes a constant-Q
resolution, otherwise known as a percentage bandwidth. For example, resolution is decreased at lower
frequency and is increased at higher frequencies. This method of percentage bandwidth spectral
analysis is similar to the multi-resolution, multi-scale pitch detection algorithm discussed by Jehan
[29] (see 4.2 Pitch Detection).
This concept of a “constant-Q phase vocoder” overcomes the issue of lower signal quality resulting
from working in the frequency domain, whilst still maintaining relatively low complexity, comparable
to the implementation of the traditional phase vocoder.
The phase vocoder does however have its drawbacks. Laroche et al. [13] discuss how, without proper
pre-processing, unwanted artefacts such as “transient smearing” and “phase incoherence” can occur in
the output signal. Transient smearing manifests itself as a reduction in the percussive nature of a
signal - notes lose their “attack”, whilst phase incoherence results in a loss of “presence” in the output
signal. For example, a vocalist or solo instrument may appear to be further away from the microphone
than in the original recording.
Use of the STFT is widely considered to be the underlying reason for phase propagation errors such as
transient smearing and phase incoherence [13]. This is because the STFT ensures phase consistency within each channel over time, but does not ensure phase consistency across all channels (known as vertical phase coherence). Although this problem may be avoided by using only integer modification factors, the use of non-integer modification factors can often lead to
significant defects within the output signal. Laroche et al. [13] explain how the application of a
“Phase-Locked Phase Vocoder” may help to eliminate phase incoherence, thus creating a more
desirable output signal. The vocoder works by using a peak detection algorithm to detect peak
channels within the input signal, then by only allowing the phase of peak channels to be updated, the
phase of all other channels may be “locked”, therefore maintaining vertical phase coherence across
channels.
2.5 Formant Correction
Whenever pitch-modification of an audio signal is performed, not only is the pitch of the signal
changed, but formants present within the signal are moved as well [28]. Since the position and
frequency of formants very much determine the character and nature of a sound, it is important to
apply some formant-correction or formant-preservation technique in order to achieve desirable results.
Bernsee [28] explains how formant-correction may be achieved when using the phase vocoder. The
technique works by removing any newly generated formants within the output signal and
superimposing the original formant information from the input signal. This is achieved by
normalizing the spectral amplitude envelope of the output signal and multiplying it by the original
non-pitch scaled version. Since this is an amplitude-only frequency domain based method, the
additional computational costs involved are minimal.
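A hedged sketch of this amplitude-only correction (Python/NumPy; the moving-average envelope estimator is an assumption, as [28] does not prescribe a particular estimator):

```python
import numpy as np

def correct_formants(shifted_mag, original_mag, win=8):
    """Replace the spectral envelope of a pitch-shifted magnitude spectrum
    with the envelope of the original spectrum. The envelope estimate
    (a simple moving average) is illustrative only."""
    kernel = np.ones(win) / win
    env_shifted = np.convolve(shifted_mag, kernel, mode='same')
    env_original = np.convolve(original_mag, kernel, mode='same')
    eps = 1e-12   # guard against division by zero in silent regions
    # Normalise out the new envelope, then superimpose the original one.
    return shifted_mag / (env_shifted + eps) * env_original
```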
Laroche [15] proposes a similar method, whereby pitch and formant modification are performed
simultaneously. The method is primarily designed to work on monophonic sound sources, achieving
results comparable with that of time-domain based methods, whilst maintaining the flexibility of
working in the frequency domain. Pitch correction is performed by translating peaks within the signal
to a new “target frequency” then rotating the phases of peaks and (surrounding bins) by an amount
relative to the phase-increment as a result of the change in pitch. The formant correction is
implemented by locating individual harmonics within the original spectrum. Selected harmonic
regions with frequency closest to that of the output harmonic are then pasted into the output signal at
the desired frequency, thus preserving the original formant information.
Within the time domain, pitch and formant information may be manipulated independently. Bernsee
[28] discusses a formant-preservation technique where Time Domain Harmonic Scaling is
implemented as a granular synthesis, where grains of length equivalent to one cycle of the
fundamental frequency are output at a new destination frequency rate. Pitch modification is then
achieved simply by altering the output rate of grains, whilst discarding some grains in the process to
maintain the length of the original sample. Since no transposition actually takes place during this
process, formants are not moved.
3 Methodology

3.1 Time Domain vs. Frequency Domain
The first decision to be made regarding the proposed design methodology to be used for the
Automatic Pitch Correction System involved comparison between time domain and frequency domain
based techniques. As previously discussed (see 2 Background Research), frequency domain based
techniques for both Pitch Detection and Pitch Correction are more computationally efficient than time
domain based techniques. Time domain based techniques such as the autocorrelation function (used
for pitch detection) and Time Domain Harmonic Scaling (used for pitch correction) both suffer from
large computational expense whilst their frequency domain counterparts, such as Harmonic Product
Spectrum, and the Phase Vocoder require fewer calculations and are more computationally efficient.
For this reason, the Automatic Pitch Correction system implements both pitch detection and pitch
correction in the frequency domain.
3.2 Design Approach
The Automatic Pitch Correction System requires three separate operations to perform automatic
correction: Pitch Detection, Error Detection and Pitch Correction. For this reason it was decided that a
pipeline process model should be used where each of these stages is implemented as a separate
“stand-alone” module. Two more modules that dealt with the conversion between time domain and
frequency domain were also used in order to maintain the modular construction of the system
throughout. Thus the resulting pipeline involves five separate processes: Conversion to Frequency
Domain, Pitch Detection, Error Detection, Pitch Correction and Reconstruction of the Waveform.
Implementation as a modular system offers greater flexibility and allows for improvements and
modifications to be made to each module individually without affecting other modules present within
the system. For example, the use of a different Pitch Detection process would not require any changes
to be made to the Error Detection and Pitch Correction modules.
Fig 3.2 demonstrates the interaction between the separate modules, detailing the flow of data and
information required by each module. The system architecture is designed in a linear fashion, without
the need for feedback or recursive loops.
Fig 3.2 Visualisation of the flow of data between separate modules
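The pipeline can be skeletonised as follows (Python for illustration; the stage functions are placeholders for the modules described above, not the project's actual MATLAB code):

```python
# Placeholder stages: identity transforms standing in for the real modules.
def to_frequency_domain(signal, sr): return signal
def detect_pitch(frames, sr): return frames
def detect_errors(pitch): return pitch
def correct_pitch(frames, errors): return frames
def reconstruct_waveform(frames, sr): return frames

def automatic_pitch_correction(signal, sr):
    """Linear pipeline: each stage consumes the previous stage's output;
    no feedback or recursion is required."""
    frames = to_frequency_domain(signal, sr)   # STFT
    pitch = detect_pitch(frames, sr)           # fundamental per section
    errors = detect_errors(pitch)              # % deviation from target note
    shifted = correct_pitch(frames, errors)    # phase-vocoder pitch shift
    return reconstruct_waveform(shifted, sr)   # inverse STFT + overlap-add
```

Because each stage only touches its own inputs, any module can be swapped out without disturbing the rest, which is the point made above.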
3.3 Real-time Considerations
When considering real-time processing of audio signals, it is important to note that audio processing
on a PC can never be instantaneous; an unavoidable delay will always be present due to the necessary
reading and writing of data into buffers and transferring data to and from memory [23]. A real time
application is therefore not described as a system that may perform a task instantaneously, but as a
system that is able to perform a specific task within certain time constraints:
"Real-time audio processing for PCs can be performed if the audio input and
output can keep up with each other, without interruption, allowing some
finite delay between recording and playback” [23]
In terms of signal processing, a real-time process should in theory be executed “on the fly” and
therefore regarded as a sequential process, whereby the signal input is split up into consecutive
discrete sections that may be operated on individually, in a sequential fashion. Each section is
operated on separately and therefore processing of a single section should be completed before the
following section is reached.
3.4 Modelling environment
The choice of development environment for the Automatic Pitch Correction System is of significant
importance and may greatly affect the potential capabilities of the resulting software. A number of
different options are available such as C, Java, Maple or MATLAB. MATLAB is a comprehensive
programming development environment with a large library of existing functions and therefore highly
suited to the task of developing the Automatic Pitch Correction System. The decision to use
MATLAB was influenced by several key features:
• Large library of useful existing functions
• Scripts and programs can be created without the need for compilation
• Interactive displays and debugging capabilities
• Array based computation allows for very fast processing times
• MATLAB scripts may be embedded in C code, allowing for further development as a
real-time application
The Automatic Pitch Correction system is implemented as a frequency domain based system, and
therefore requires a Fourier representation of audio signals. MATLAB allows multiple file formats,
including WAV files, to be read and written, and is therefore highly suitable for manipulating audio
data. MATLAB has built-in functions for the Fast Fourier Transform and the Inverse Fourier
Transform, thereby simplifying the task of converting between the time domain and the frequency
domain. Many functions within MATLAB’s existing libraries operate on complex numbers, and
functions are provided that allow the real and imaginary parts of complex numbers to be manipulated
separately – an important consideration when manipulating the phase angles of an audio signal in its
frequency domain representation.
4 Design
4.1 Converting to Frequency Domain
The first stage is to convert the signal waveform from the time domain to the frequency domain. This
is done using the Short Time Fourier Transform [8]. The sampling rate sr of the input signal and the
desired length (in ms) of each window w determine the size of each Fourier Transform:

ftsize = (w / 1000) * sr
The Short Time Fourier Transform returns a series of overlapping short-term Fast Fourier Transform
(FFT) frames, each one corresponding to an analysis window within the input signal. The reason for
choosing to overlap successive FFT frames is twofold: Firstly, the process of overlapping helps to
create a cross-fade effect between frames, giving a smoother transition when re-assembling the
spectrogram back to a waveform. Secondly, a greater number of frames may be retrieved over a given
time period, which, as will be discussed later, allows for greater resolution during the pitch shifting
process (see 4.4 Pitch Correction).
The choice of windowing function is important when performing a Fourier transform and can
have a significant effect on the accuracy of the results that may be obtained [25]. The windowing
function used throughout the Automatic Pitch Correction System is the Hanning function. Using a
Hanning window allows for a 75% overlap, which allows a greater number of FFTs to be calculated
over a shorter section of signal. This is desirable since the resolution of the Pitch correction used
depends on the number of FFT frames calculated (see 4.4 Pitch correction). The more frames used,
the greater the resolution of the pitch shifting and for real-time application, these FFT frames need to
be calculated in as short a time as possible. The Hanning window also reduces leakage, or smearing
between frequencies where successive frames overlap [25], and therefore offers a superior frequency
representation, increasing the potential accuracy of the pitch detection that follows.
The Hanning function h(n) is applied to each analysis window, allowing for a 75% overlap between
successive FFT frames (see Fig 4.1). The function is applied to all points in each analysis window
( n = 1,…,ftsize ) and is defined as

h(n) = 0.5 * ( 1 – cos(2πn / ftsize) )
The Hanning window does have the disadvantage of introducing some amplitude error in the power
spectrum of a signal. However, this does not present a problem in this application since the exact
amplitudes of peaks within the spectrum are not of importance, only the relative amplitudes are taken
into consideration (see 4.3 Pitch Detection).
Fig 4.1 Hanning Windows with a 75% Overlap
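As an illustration of this analysis stage, a minimal STFT with Hanning windowing and a 75% overlap might be written as follows (Python/NumPy; the thesis implementation itself is in MATLAB):

```python
import numpy as np

def stft(x, ftsize):
    """STFT with a Hanning window and 75% overlap (hop = ftsize/4).
    Only the first ftsize//2 bins of each frame are kept, exploiting
    the symmetry of the FFT of a real-valued signal."""
    hop = ftsize // 4
    n = np.arange(1, ftsize + 1)          # n = 1..ftsize, as defined above
    window = 0.5 * (1.0 - np.cos(2 * np.pi * n / ftsize))
    starts = range(0, len(x) - ftsize + 1, hop)
    return np.array([np.fft.fft(window * x[s:s + ftsize])[:ftsize // 2]
                     for s in starts])

# Example sizing: a 50 ms window at sr = 44100 gives ftsize = 2205 samples.
```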
The signal is now represented as a series of overlapping FFT frames, each corresponding to an
analysis window of length L. If (x0,...,xn-1) is a sequence of complex numbers, the FFT of that
sequence may be defined as

Xk = Σ (n = 0 … ftsize–1) xn e^(–2πjkn / ftsize) ,  k = 0,…,ftsize–1

which returns an array of complex values. The modulus of each value corresponds to the amplitude
at a given frequency, and its argument corresponds to the phase shift in radians.
One of the key properties of the FFT of a real-valued signal is that the values returned are symmetric
around the centre of the FFT frame: the second half of the spectrum is the complex conjugate mirror
of the first half and carries no additional information. This may be explained by the fact that the FFT
is essentially a more efficient implementation of the DFT, which may be defined as

Xk = Σ (n = 0 … N–1) xn [ cos(2πkn / N) – j sin(2πkn / N) ]

and since cos is an even function and sin is an odd function, the values returned for a real-valued
input must be conjugate-symmetric. For this reason, efficiency is improved in the Automatic Pitch
Correction System by only calculating and storing the first (ftsize/2) values returned by the FFT for
each analysis window.
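The symmetry can be verified numerically (Python/NumPy, with an arbitrary real-valued test signal):

```python
import numpy as np

# For a real-valued input, bin N-k of the FFT is the complex conjugate of
# bin k, so only the first half of the spectrum needs to be stored.
x = np.sin(2 * np.pi * 3 * np.arange(16) / 16)   # a real-valued signal
X = np.fft.fft(x)
for k in range(1, 8):
    assert np.allclose(X[16 - k], np.conj(X[k]))
```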
As discussed previously, in order to achieve non-linear frequency alterations, it is required that the
input signal is divided into separate sections of finite length, each of which may be manipulated
individually. To make this possible, the corresponding spectrogram representation of the signal must
be separated into sections, each containing an equal number of FFT frames. Since individual FFT
frames are overlapping, the separate sections must also overlap. Given a 75% overlap, the hop size
between two successive frames is equal to ftsize/4 whilst the actual overlapping region is equal to
3*ftsize/4. Therefore, given two separate sections of contiguous FFT frames, the overlapping region
will consist of three frames from each section (see Fig 4.10). This frame overlap is important in
ensuring a smooth transition between successive sections upon reconstruction of the output signal.
However, overlaps consisting of modified sections require special treatment since the waveform
characteristics have been altered and therefore phase re-alignment will need to be performed. This
will be discussed in a later chapter (see 4.4 Pitch Correction).
4.2 Pitch Detection
Pitch detection in the Automatic Pitch Correction System is implemented as a frequency domain
based technique. Based on the Harmonic Product Spectrum [1], the algorithm requires analysis of the
power spectrum of the input signal, whereby the magnitude and position of peaks detected within the
spectrum are used to determine the fundamental frequency. However, unlike the Harmonic Product
Spectrum, where the power spectrum is multiplied and downsampled multiple times, the algorithm
used in the Automatic Pitch Correction System is capable of returning the same results but without the
multiplication/downsampling stages and therefore is more computationally efficient.
Firstly, the input signal is sampled using a fixed window size. A Hanning window function is applied
to each window and the fast Fourier transform is applied across all points in the windowed sample in
order to obtain the power spectrum.
Fig 4.2 Power Spectrum of a 50ms sample of a piano note C3 (130.8 Hz)
As can be seen from Fig 4.2, the power spectrum displays multiple peaks. Each of these peaks
corresponds to either the fundamental frequency, or an integer multiple of the fundamental frequency.
Intuitively, it would appear that the largest of these peaks (centered at 261.6 Hz) corresponds to the
fundamental frequency of the signal, whilst all others simply represent additional harmonics.
However, this is not the case: harmonics are integer multiples of the fundamental frequency and
must therefore always be located at higher frequencies than the fundamental itself. No harmonic
peak can be positioned at a lower frequency than the peak centered at the fundamental frequency,
so the fundamental frequency must be represented by the lowest-frequency peak present within the
power spectrum, centered at 130.8 Hz. Pitch detection now becomes a case of simply locating the
position of the first peak within the power spectrum (see Fig 4.3).
Fig 4.3 Power Spectrum indicating fundamental frequency (130.8 Hz)
Peak detection is implemented by simply comparing the magnitudes located at each frequency bin. If
the magnitude at a frequency bin is larger than that of its neighbours on both sides, the bin is
considered a peak. Starting with the lowest-frequency bins, the first such peak that occurs in the
power spectrum is taken as the fundamental frequency. In order to prevent spurious peaks created by
noise from being detected as the fundamental frequency, a tolerance is introduced such that any peak
with magnitude below the given tolerance is identified as noise and ignored. The tolerance is
calculated relative to the magnitude of the largest peak present within the spectrum, for example

T = Pmax / 4

where Pmax is the magnitude of the largest peak present. If the value of T is set too high and an
excessively large harmonic peak exists, the peak corresponding to the fundamental frequency may
itself be classified as noise (see Fig 4.4).
Fig 4.4 Power Spectrum demonstrating noise tolerance level set too high.
Fundamental frequency at 110Hz is classified as noise; large harmonic peak at 220 Hz is detected as
fundamental frequency
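A minimal sketch of this peak search with noise tolerance (Python/NumPy; the function and parameter names are illustrative):

```python
import numpy as np

def detect_fundamental(power, freqs, rel_tol=4.0):
    """Return the frequency of the first (lowest-frequency) peak whose
    magnitude exceeds Pmax / rel_tol; smaller peaks are treated as noise.
    `power` is the magnitude spectrum, `freqs` the bin centre frequencies."""
    tol = power.max() / rel_tol     # T = Pmax / 4 by default
    for k in range(1, len(power) - 1):
        if power[k] > power[k - 1] and power[k] > power[k + 1] and power[k] >= tol:
            return freqs[k]
    return None                     # no peak above the tolerance
```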
Similar to the octave errors encountered when using the Harmonic Product Spectrum [21], when
detection fails in this way the calculated frequency almost always lies exactly one octave above the
true fundamental (see Fig 4.4). This is not a problem, however, since the purpose of pitch detection
within the Automatic Pitch Correction System is not to track frequency precisely, but to detect errors
in pitch within the input signal. Error detection is performed on a relative scale, with the error
calculated as a percentage, and the percentage error at the fundamental frequency is identical to the
error measured one octave higher. For example, given a fundamental frequency of 107 Hz, the target
frequency would be 110 Hz (A2) and the error percentage 2.8% (see 4.3 Error Detection). If the
harmonic one octave higher is detected instead, the detected frequency would be 214 Hz, the target
frequency 220 Hz (A3), and the error percentage again 2.8%. The pitch correction module also works
on a relative scale (see 4.4 Pitch Correction) and is only concerned with the error percentage, not
absolute frequencies, so the octave error has no effect on the resulting pitch shift.
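The octave invariance follows directly from the relative error formula used in 4.3 (a small Python check):

```python
# The relative error is invariant under octave doubling: doubling both the
# detected and the target frequency leaves the percentage unchanged.
def error_pct(target, detected):
    return 100.0 * (target - detected) / detected

assert round(error_pct(110.0, 107.0), 1) == 2.8
assert round(error_pct(220.0, 214.0), 1) == 2.8   # one octave higher
```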
Performing pitch detection in this way has both advantages and disadvantages. Firstly, the
technique is computationally very efficient, requiring fewer computations than the Harmonic Product
Spectrum, which is itself capable of running in real time [21]. The algorithm is reasonably robust to noise
and works for a variety of different inputs. The disadvantage of using any technique that uses the
power spectrum to determine fundamental frequency is that resolution is dependent on the length of
the FFT used. A short and fast FFT results in a limited number of frequency bins and therefore a
lower resolution in pitch determination. For a greater number of frequency bins and therefore a higher
resolution, a longer window must be used to calculate the FFT, which requires a greater amount of
time.
4.3 Error Detection
4.3.1 Pitch Perception and Frequency of Musical Notes
In order to implement error detection correctly, it is important to understand the difference between
pitch and frequency. ‘Pitch’ is a description of the subjective sound of a signal whilst frequency is an
actual representation of the sound’s physical structure. The difference between pitch and frequency is
demonstrated by the fact that polyphonic sounds involving more than one frequency are often
perceived as a single pitch [2].
In terms of pitch, musical notes are separated on a linear scale where adjacent notes are divided by a
“semitone”. However, when describing musical notes in terms of their frequency, the scale becomes
logarithmic: the frequency of a given note is double the frequency of the note one octave
below. Thus, a given difference in pitch does not always correspond to a fixed difference in frequency.
For example, the difference in pitch between C4 (middle C) and C5 is one octave (12 semitones),
which is equal to the difference in pitch between C3 and C4. However, the difference in frequency
between C4 and C5 is not the same as the difference between C3 and C4. In fact the frequency
difference between C4 and C5 is twice the magnitude of the frequency difference between C3 and C4
(see Fig 4.5).
Note      Frequency (Hz)   No. of semitones from C4
C3        130.81           -12
C#3/Db3   138.59           -11
D3        146.83           -10
D#3/Eb3   155.56           -9
E3        164.81           -8
F3        174.61           -7
F#3/Gb3   185.00           -6
G3        196.00           -5
G#3/Ab3   207.65           -4
A3        220.00           -3
A#3/Bb3   233.08           -2
B3        246.94           -1
C4        261.63           0
C#4/Db4   277.18           1
D4        293.66           2
D#4/Eb4   311.13           3
E4        329.63           4
F4        349.23           5
F#4/Gb4   369.99           6
G4        392.00           7
G#4/Ab4   415.30           8
A4        440.00           9
A#4/Bb4   466.16           10
B4        493.88           11
C5        523.25           12
Fig 4.5 Frequencies of Musical notes on the chromatic scale
(Taken from http://www.phy.mtu.edu/~suits/notefreqs.html)
Hence, one octave is not a fixed frequency difference but may be described as a frequency ratio of
2:1. Similarly, the size of each semitone (in Hz) is also not fixed; the higher up the musical scale, the
larger each semitone becomes. For example, the difference between C3 and C#3/Db3 is 7.78 Hz,
corresponding to one semitone, whilst the difference between C4 and C#4/Db4 is 15.55 Hz, also
corresponding to a single semitone.
Since higher pitched notes are separated by a greater amount on the frequency scale than lower notes,
frequency resolution is also increased as notes move up the musical scale. This is even demonstrated
in human hearing. A human is far more capable of discerning pitch at higher frequencies since a
difference in pitch corresponds to a larger difference in frequency and therefore is more recognizable
to the human ear.
4.3.2 Implementation
Since pitch correction in the Automatic Pitch Correction System is implemented to shift pitch on a
relative scale, error detection in the Automatic Pitch Correction System is also implemented on a
relative scale. The difference between the detected note and the calculated “target note” is returned as
an error percentage. As discussed earlier, higher musical notes are separated by larger frequencies
than those lower down in the musical scale and therefore the frequency resolution will be higher for
higher frequencies, and lower for lower frequencies.
Error detection is implemented by storing an array in memory containing the frequency values for C0
to B1. Repeatedly multiplying each of these values by 2 then creates a lookup table containing
frequency values for C0 to C5. For each windowed sample, the calculated frequency of the input signal
is compared with the values stored in the lookup table. The “target note” for the windowed sample is
calculated using a “nearest neighbour” approach implemented in Matlab using the DSEARCH
function, which uses an algorithm known as Delaunay triangulation [6]. The closest match (i.e.
nearest neighbour) contained within the lookup table is returned as the target note/frequency. Given
the target frequency, the error percentage is then calculated as
e = 100 * (T – f) / f
where e is the relative percentage error, T is the target frequency (in Hz) and f is the calculated
frequency (in Hz) of the windowed sample. The error e corresponds to the relative increase/decrease
in frequency required to reach the target frequency. As an example, Fig 4.6 shows a section of the
lookup table, containing frequency values for C3 to F3. Given a calculated frequency of 150.0Hz, the
returned target note would be D3 (146.83Hz) since this yields the smallest difference in frequency.
The target frequency is therefore 146.83, giving an error percentage of -2.11%. Thus, if the windowed
sample’s frequency is decreased by 2.11%, the new frequency will be equal to the target frequency.
C3       C#3/Db3   D3       D#3/Eb3   E3       F3
130.81   138.59    146.83   155.56    164.81   174.61
Fig 4.6 A section of the lookup table containing frequency values for the chromatic scale
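This lookup-and-error step might be sketched as follows (Python/NumPy; a plain one-dimensional nearest-neighbour search stands in for MATLAB's DSEARCH, and the base-octave values are the standard equal-tempered frequencies):

```python
import numpy as np

# Equal-tempered frequencies for the base octave C0..B0 (Hz); the thesis
# stores C0..B1 and doubles, which yields the same table.
C0_OCTAVE = np.array([16.35, 17.32, 18.35, 19.45, 20.60, 21.83,
                      23.12, 24.50, 25.96, 27.50, 29.14, 30.87])

def build_lookup(octaves=6):
    """Repeatedly double the base octave to build the note-frequency table."""
    return np.concatenate([C0_OCTAVE * 2 ** k for k in range(octaves)])

def pitch_error(f, table):
    """Nearest-neighbour match and the relative error e = 100 * (T - f) / f."""
    target = table[np.argmin(np.abs(table - f))]
    return target, 100.0 * (target - f) / f
```

For a detected frequency of 150.0 Hz, the nearest table entry is D3 (about 146.8 Hz) and the error is roughly -2.1%, matching the worked example above up to rounding of the base-octave values.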
4.4 Pitch Correction
4.4.1 Underlying Idea
The underlying technique used to obtain a pitch shift in the Automatic Pitch Correction System is
based on the Phase Vocoder. The Phase Vocoder is a high quality frequency domain solution to pitch
alteration that works equally well for both monophonic and polyphonic material and therefore is well
suited for the purposes of the Automatic Pitch Correction System. The Phase Vocoder does however
have its drawbacks; the most important is that in its standard form it is fundamentally only capable of
linear frequency alterations [14]. To explain this, the standard technique obtains a shift in pitch by
first modifying the time base of the signal and then altering the sampling rate of playback (i.e.
resampling) to effect a pitch change and restore the signal to its original length. Since the resampling
stage cannot be implemented until the entire signal has been processed, any shift in pitch that is
created applies to the entire signal and therefore non-linear frequency alterations are not possible.
For the Automatic Pitch Correction System, a modified implementation of the Phase Vocoder is used,
in which sequential processing makes non-linear frequency alterations possible. As the input signal is
read, it is separated into multiple smaller discrete sections and the time-scale/resample Phase
Vocoder technique is applied to each section individually. The implementation
of the resampling stage differs from that used in the standard Phase Vocoder in that instead of actually
playing the signal back at a different sampling rate, the waveform itself is modified using
interpolation to restore the signal back to its original time base. This form of “resampling” allows
each section to be played back at the same sampling rate, regardless of the pitch shift required.
The final stage involves the construction of the output array, produced through reconstruction of the
separate sections. The sections are joined back together and played back at the same sampling rate as
the input signal. Those sections that have been modified by the time-scale/resample operation will
play back at the same speed as in the input signal, but at a different pitch whilst all other sections will
remain unchanged. This reconstruction stage may also be considered as a sequential process since it is
possible to add each section to the output array as soon as it has been processed. Therefore, given that
small enough section lengths are chosen, possibilities are created for the pitch shifting process to be
implemented as a real-time application.
4.4.2 Time Scaling
Time scaling of each section of the signal is implemented in the frequency domain. As discussed in
the previous section, the STFT returns a series of FFT frames, each corresponding to an analysis
window within the input signal. The number of frames N therefore corresponds to the length (i.e. the
time base) of the signal. Given an FFT size ftsize a window hop size hop and a sampling rate sr
L = ( ftsize + ( (N –1) * hop ) ) * (1000/sr)
where N represents the number of FFT frames present and L represents the length (in ms) of the
corresponding output signal. Therefore, in order to alter the time base of a signal, the number of
frames N must be altered. This is achieved by interpolating between successive FFT frames, thus
introducing or removing frames such that the new value for N corresponds to the new time base. The
modification to the time base may therefore be calculated simply as the number of FFT frames to be
added or removed. Given that the pitch of a section of signal is to be scaled by a factor β, the change
in number of FFT frames ∆N may be calculated as
∆N = ( β * N ) – N
It is important to note that only an integer number of FFT frames may be added or removed and
therefore the degree to which pitch may be shifted is dependent upon the size of N. For example, to
effect a 1% increase in pitch (i.e. β = 1.01), it is required that N ≥ 100. The resolution of pitch shift is
therefore a function of the size of N, where a larger size of N results in higher resolution pitch shifting.
In fact, the minimum frequency alteration fmin (% relative to the original frequency) may be described as

fmin = 100 / N

The choice of N is therefore an important aspect of the Automatic Pitch Correction system. Although
a larger N corresponds to a higher frequency resolution, it also results in a lower time resolution.
Similarly, a smaller N corresponds to a higher time resolution but a lower frequency resolution.
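The frame-count arithmetic can be checked numerically (Python; rounding ∆N to the nearest whole frame is an assumption, as the text does not specify a rounding rule):

```python
def delta_frames(beta, n_frames):
    """Change in FFT frame count needed to scale pitch by factor beta.
    Only whole frames can be added or removed, so the result is rounded."""
    return round(beta * n_frames) - n_frames

def f_min(n_frames):
    """Minimum achievable frequency alteration, % of original frequency."""
    return 100.0 / n_frames

# A 1% shift (beta = 1.01) requires N >= 100 frames, as stated above.
```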
Each FFT frame yields an array of complex values of the form

X(Ω, t) = H(Ω) e^(jφ)

where H(Ω) represents the Fourier transform of the analysis window h(n) at time t corresponding to
the frequency Ω. For each FFT value, the modulus H(Ω) represents the magnitude, whilst the
argument represents the phase angle, denoted by φ. To perform interpolation between FFT frames,
the magnitude and phase components of the FFT values must be treated separately.
The first step of the time scaling process is to extract the complex modulus H(Ω) (i.e. the magnitude)
of each FFT value. The number of FFT frames that must be added ∆N is calculated according to the
required pitch scale factor β. The magnitudes of successive FFT frames are interpolated so that the
number of frames is equal to N + ∆N (see Fig 4.7). The newly interpolated FFT frames represent the
magnitude spectrogram of the original section of signal, but since the length of each frame remains
constant, the timebase of the signal must change.
Fig 4.7 Interpolation of a sequence of frames (top) in order to:
a) Increase the number of FFT frames by two (middle), thus increasing the timebase of the signal (N = 8, ∆N = 2)
b) Decrease the number of frames by two (bottom), thus decreasing the timebase of the signal (N = 8, ∆N = -2)
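The interpolation of Fig 4.7 might be sketched as follows (Python/NumPy; linear interpolation along the time axis is an assumption, as the interpolation scheme is not specified here):

```python
import numpy as np

def interpolate_frames(mags, delta_n):
    """Resample an (N, bins) magnitude spectrogram along the time axis so
    that it contains N + delta_n frames, altering the timebase while the
    per-bin content is preserved."""
    n, bins = mags.shape
    old_t = np.arange(n)
    new_t = np.linspace(0, n - 1, n + delta_n)
    return np.column_stack([np.interp(new_t, old_t, mags[:, b])
                            for b in range(bins)])
```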
4.4.3 Phase Alignment
Phase alignment between successive frames is achieved by first extracting the phase angle φ for all
values within the first FFT frame in the series. The phase advance for each of these values is then
calculated as
∆φ = α − φ
where α represents the phase angle of the corresponding FFT value in the second FFT frame in the
series. The phase advance ∆φ may then be used to increment phase angle values for each successive
FFT frame, thus the phase angle for each FFT value within an FFT frame may be calculated as
X(Ω, t) = H(Ω) exp j(φ + ∆φ)
where φ represents the phase angle of the corresponding FFT value in the previous FFT frame in the
series. Since phase is modulo 2π, special consideration must be given where φ + ∆φ does not fall
within the range -2π : 2π. In such a case, the phase angle is “wrapped” around by simply adding or
subtracting 4π accordingly in order to ensure it falls within the range -2π : 2π. As an example, given
that φ + ∆φ returns a value of -3π, adding 4π brings the phase angle to π, which is within the
range -2π : 2π. Similarly, a value for φ + ∆φ that is calculated as 3π may be reduced to -π by
simply subtracting 4π.
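A minimal Python version of this wrapping step (implementing the ±4π rule exactly as described; inputs are assumed to lie within (-4π, 4π), which holds when summing two angles that each lie within ±2π):

```python
import math

def wrap_phase(phi):
    """Fold an accumulated phase angle back into the range [-2*pi, 2*pi]
    by adding or subtracting 4*pi, as described above."""
    if phi < -2 * math.pi:
        phi += 4 * math.pi
    elif phi > 2 * math.pi:
        phi -= 4 * math.pi
    return phi
```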
Although this procedure ensures phase coherence within each section of signal, it does not ensure
phase coherence between adjacent sections of signal. This causes a region of phase-cancellation
where two sections of signal overlap, which in turn creates a momentary reduction in amplitude (see
Fig 4.8) that manifests itself as an audible “blip” in the output signal.
Fig 4.8 Reduction in amplitude caused by phase cancellation
This may be dealt with crudely by simply increasing the amplitude of the waveform over each region
of phase cancellation, or through various other techniques aimed at removing the inflection
caused by phase cancellation errors entirely (see 7.3 Alternative Phase Alignment Techniques).
However, the effect may be reduced significantly through an appropriate choice of window size. As
discussed earlier (see 4.1 Converting to Frequency Domain), the use of a Hanning window provides a
“cross-fade” effect that, with a large enough window size, reduces the effects of phase cancellation
considerably and therefore removes the need for any extra signal processing.
4.5 Signal Reconstruction
Once time scaling within the frequency domain is complete, the next step is to convert each section of
the signal back to a waveform using the Inverse Fourier Transform [8]. The Inverse Fourier
Transform is near-identical to the Fourier Transform (see 4.1 Converting To Frequency Domain), differing
only in the sign of the exponent and a normalisation factor of 1/N. Following this stage, the Fourier representation of the signal is
converted back into the time domain and thus is no longer represented as an array of complex
numbers, but as an array of real values.
As discussed earlier, the original time base of each section is restored through interpolation of the data
values representing the signal’s waveform. This creates the effect of “resampling” each section, whilst
retaining the same sampling rate, thus allowing individual sections to be resampled by different
amounts. It is important that resampling is performed in this way since conventional resampling
would affect the entire waveform and therefore non-linear pitch shifts would not be possible. The
interpolation is performed by first identifying the correct length T for each section of the input signal.
This is calculated using the following equation
T = ( ftsize + ( (N –1) * hop ) )
where N represents the number of FFT frames used in each section, hop is the window hop size
between successive FFT frames, and ftsize is the size of each FFT frame. The value T corresponds to
the number of data values that should be used to represent each section of the signal. Interpolation is
then carried out, either reducing or increasing the size of the array so that the number of data values is
equal to T, thus ensuring that any modified sections are restored to their original timebase and that all
sections within the output signal are of equal length.
Sections whose timebase has been reduced will require interpolation to increase the number of data
values, whilst sections whose timebase has been extended will require interpolation to decrease the
number of data values. Given that the timebase of a section has been either increased or decreased, the
interpolation of the data array will result in a change in the wavelength of the waveform, which in turn
results in a change in frequency, and therefore a shift in pitch will be created (see Fig 4.9)
Fig 4.9 A) A simple waveform
B) Timebase of waveform is increased by a factor of two
C) “Resampled” waveform – frequency is now increased by a factor of two
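The timebase restoration described above can be sketched as follows. This is an illustrative Python fragment (the system itself is implemented in MATLAB, and the helper names are invented for this example): the section length T is computed from the equation above, and linear interpolation stretches or squeezes the section's samples onto T evenly spaced points, shifting its pitch as in Fig 4.9.

```python
def section_length(ftsize, N, hop):
    # T = ftsize + (N - 1) * hop: the number of samples each section
    # should occupy once restored to the original timebase
    return ftsize + (N - 1) * hop

def resample_to(x, T):
    """Linearly interpolate the samples in x onto T evenly spaced points,
    stretching or squeezing the waveform and thereby shifting its pitch."""
    if T == 1:
        return [float(x[0])]
    out = []
    scale = (len(x) - 1) / (T - 1)
    for i in range(T):
        pos = i * scale          # fractional position in the source array
        j = int(pos)
        frac = pos - j
        nxt = x[j + 1] if j + 1 < len(x) else x[j]
        out.append(x[j] * (1 - frac) + nxt * frac)
    return out
```

A section whose timebase was doubled and is then resampled back to T ends up with half as many samples per cycle, doubling its frequency, which is the effect shown in Fig 4.9.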
Following the data array interpolation stage, successive sections of the signal must be joined back
together to create the output signal. Since the sections were created using overlapping Hanning
windows, the sections in the output signal must also overlap. The position P at which each section
should be placed in the output array is calculated as
P = P0 + ( (N + 1) * hop )
where P0 is the position of the previous section, N is the number of frames in each section and hop is
the window hop size being used. This results in an overlap between consecutive sections of (3*hop),
which is equal to the overlap between sections in the input signal (see Fig 4.10).
Fig 4.10 Overlapping sections with (3*hop) overlap
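The overlap-add reassembly can be sketched as follows. This is an illustrative Python fragment (the function name is invented), using the step P = P0 + ( (N + 1) * hop ) given in the text; overlapping samples are summed, as required for the Hanning-windowed sections to reconstruct smoothly.

```python
def overlap_add(sections, N, hop):
    """Place equal-length sections into one output array, each starting
    (N + 1) * hop samples after the previous, summing where they overlap."""
    step = (N + 1) * hop
    T = len(sections[0])
    out = [0.0] * ((len(sections) - 1) * step + T)
    for k, sec in enumerate(sections):
        for i, v in enumerate(sec):
            out[k * step + i] += v   # overlapping regions accumulate
    return out
```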
5 Evaluation
5.1 Evaluation Criteria

5.1.1 Analytical Evaluation

Evaluation of the Automatic Pitch Correction System was divided into two separate stages. The first
stage used analytical testing within the MATLAB environment to test the accuracy of the individual
modules. Testing of the Pitch Detection and Pitch correction modules involved inputting a signal of
known frequency and observing the output of each module. Throughout this procedure, different
source signals with varying amounts of harmonics and overtones were required in order to ensure
each module could cope with a wide range of input material. Graphical output within MATLAB, and
frequency values obtained through testing on an external machine (see Appendix C), provided
comprehensive results that enabled error calculations to be made.
5.1.2 Qualitative Evaluation

The ultimate goal of an automatic pitch-correction system is to achieve a result that is pleasing to the
human ear and therefore the method of evaluation for such a system should reflect this. For this
reason, the second stage of evaluation involved subjective human testing in order to assess the output
qualitatively, where the results depend solely on how the pitch correction sounds to the human ear.
Qualitative evaluation was implemented by assembling a group of external judges (chosen from a
selection of musicians and non-musicians). The system’s performance was then evaluated with
respect to its ability to recreate the characteristics of the original sound, focussing on the importance
of creating a desirable output that is pleasing to the human ear.
5.1.3 Tolerance Analysis

Throughout evaluation of the Automatic Pitch Correction System, every effort was made to ensure
that the input test files contained as diverse a range of audio material as possible. Different
instruments produce waveforms that behave very differently and contain vastly differing numbers of
harmonics and overtones. Another important consideration is that pitch maps onto frequency
logarithmically (see 4.3.1 Pitch Perception and Frequency of Musical Notes); adjacent notes higher up
the scale are separated by more Hertz than those lower down, so lower frequencies require greater
absolute accuracy to produce an acceptable result. Each module was tested for accuracy with many
different instruments over a wide range of frequencies, varying from very low notes played on a bass
guitar, to very high notes played on a violin. This enabled the identification of a frequency threshold
value, below which the accuracy of the system was found to deteriorate unacceptably.
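The logarithmic relationship above can be made concrete with a short calculation (an illustrative Python fragment; the function name is invented, the formula is standard equal temperament): the width in Hz of one semitone above a frequency f.

```python
def semitone_gap_hz(f):
    # width in Hz of the semitone above frequency f (equal temperament):
    # the next note up is f * 2^(1/12)
    return f * (2 ** (1 / 12) - 1)

# Around 55 Hz (A1, bass guitar) a semitone spans only ~3.3 Hz, so a
# detector must resolve well under 2 Hz there; around 440 Hz it spans ~26 Hz.
```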
5.2 Module Performance Results
5.2.1 Pitch Detection
Assessment of the Pitch Detection module involved inputting a series of input files of known
fundamental frequency and directly comparing the observed results. This procedure was carried out
using two different window sizes for the conversion to frequency domain process (see 4.1 Conversion
to Frequency Domain). The two window sizes tested spanned 512 and 128 Fourier Transform points.
The table below details the input files used and the results obtained. Further to this, Fig 5.1 shows the
relative error percentage for each input file using the two different window sizes.
Input  File description         Fundamental      Calculated Frequency (Hz)
file   (wav - 44100 Hz)         Frequency (Hz)   FT size = 512   FT size = 128
 1     B0 – Bass guitar          30.87            30.54           33.94
 2     A1 – Bass guitar          55.00            50.91           54.30
 3     G1 – Grand Piano          49.00            49.22           47.51
 4     F2 – Grand Piano          87.31            86.55           88.24
 5     B2 – Male vocal no.1     123.47           122.19          122.18
 6     C3 – Piano (midi)        130.81           129.83          129.83
 7     E3 – Electric Guitar     164.82           164.42          162.90
 8     F3 – Male vocal no. 2    174.61           173.10          176.48
 9     C4 – Cello               261.63           260.50          261.32
10     C4 – Grand Piano         261.63           257.93          261.35
11     D4 – Cello               293.66           293.60          291.87
12     Eb4 – Saxophone          311.13           310.57          305.45
13     E4 – Male vocal no.2     329.63           322.45          325.81
14     F4 – Female Vocal        349.23           349.57          349.57
15     A#4 – Oboe               466.16           933.42*         929.95*
16     B4 – Flute               493.88           491.32          492.12
17     C5 – Clarinet            523.25           523.70          522.65
18     C#5 – Violin             554.37           554.78          554.10
19     D#5 – Xylophone          622.25           622.84          624.47
20     G#5 – Violin             830.61           707.48          834.84

* Note – these values equate to 466.71 and 464.98 respectively (see 4.2 Pitch Detection)
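The error figures discussed below follow from a simple relative-error calculation (sketched here in Python for illustration; the function name is invented):

```python
def rel_error_pct(measured, true):
    # relative error of a detected frequency, as a percentage of the true value
    return abs(measured - true) / true * 100.0

# e.g. input file 1 (B0, 30.87 Hz) detected at 30.54 Hz with FT size 512:
# an error of roughly 1.1%, which at that low pitch is already a
# substantial fraction of a semitone
```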
Results were found to be consistently within an error tolerance of 1-2%. However, certain
circumstances prevented satisfactory performance. As discussed previously (see 4.3.1 Pitch
Perception and Frequency of Musical Notes), lower pitched notes are far closer together on the
frequency scale and therefore require much greater accuracy. The effects of this can be seen in the
results table: the pitch detection performed on the two input files containing notes played by a bass
guitar produced errors in excess of a single semitone (highlighted in bold). Although the accuracy
of the Pitch Detection module remains constant throughout the frequency spectrum, the same is not
true throughout the scale of musical notes. Fig 5.1 demonstrates how accuracy of pitch detection
increases higher up in the musical scale and decreases lower down.
Performance was also affected considerably by the choice of window size. Generally, the larger
window size of 512 outperformed the smaller window size of 128, producing more consistent and
accurate results (see Fig 5.1). However, the danger of using the larger window size is demonstrated by
the result from the final input file in the results table (highlighted in bold italic). This particular input
file involved several notes played fairly close together and the larger window size was unable to
extract the desired note from the input signal; the window instead spanned two adjacent notes,
producing a spurious result with an error of almost three semitones.
Fig 5.1 Relative Error Percentage Values for Pitch Detection Module
It is clear from these results that given an input note of frequency in the range 0-55Hz, the
performance of the Pitch Detection module becomes unreliable. Further testing exposed the threshold
value for acceptable performance to be within the range 60-65Hz. Although accurate results below
this range are possible, errors typically grew from within the range 0-2% to within the range 0-10%,
creating potential errors greater than a semitone, which is clearly unacceptable.
5.2.2 Pitch Correction
Assessment of the performance of the Pitch Correction module involved two stages. Firstly, input
files containing notes of known frequency were pitch-shifted on an external machine (see Appendix C)
to an arbitrary frequency not more than one semitone away, and the new fundamental frequency of
each file was then measured on the same machine for accuracy. The Pitch Correction module was
then tested using these files, with knowledge of both the fundamental and target frequencies for each
note. The second stage involved verification of the newly shifted pitch, which, again for accuracy,
was measured on the external machine (see Appendix C).
Tests were run using two different values for N (see 4.4 Pitch Correction), where N represents the
number of FFT frames present for each pitch-shifted section of signal. The first series of tests used N
= 100, whilst the second used N = 200. These values for N allow for 1% and 0.5% pitch shifts
respectively (see 4.4 Pitch Correction). The table below displays the results obtained.
Input  File description         Fundamental      Target             Pitch-shifted Frequency (Hz)
file   (wav - 44100 Hz)         Frequency (Hz)   Frequency (Hz)     100 Frames   200 Frames
 1     B0 – Bass guitar          30               30.87 (B0)         31           31
 2     A1 – Bass guitar          54               55.00 (A1)         55           55
 3     G1 – Grand Piano          50               49.00 (G1)         49           49
 4     F2 – Grand Piano          85               87.31 (F2)         88           87
 5     B2 – Male vocal no.1     120              123.47 (B2)        124          123
 6     C3 – Piano (midi)        134              130.81 (C3)        132          131
 7     E3 – Electric Guitar     170              164.82 (E3)        165          165
 8     F3 – Male vocal no. 2    172              174.61 (F3)        174          174
 9     C4 – Cello               255              261.63 (C4)        261          261
10     C4 – Grand Piano         265              261.63 (C4)        262          262
11     D4 – Cello               300              293.66 (D4)        293          294
12     Eb4 – Saxophone          305              311.13 (Eb4)       311          311
13     E4 – Male vocal no.2     335              329.63 (E4)        328          329
14     F4 – Female Vocal        340              349.23 (F4)        347          349
15     A#4 – Oboe               475              466.16 (A#4)       465          465
16     B4 – Flute               505              493.88 (B4)        495          495
17     C5 – Clarinet            510              523.25 (C5)        520          523
18     C#5 – Violin             570              554.37 (C#5)       553          556
19     D#5 – Xylophone          610              622.25 (D#5)       622          622
20     G#5 – Violin             850              830.61 (G#5)       833          829
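The shift resolution implied by the choice of N can be sketched directly (an illustrative Python fragment; the function name is invented, and it assumes the 1/N granularity stated in 4.4 Pitch Correction, whereby altering a section's timebase by one hop out of N frames changes its length, and hence pitch, by 1/N):

```python
def min_shift_pct(N):
    # smallest achievable pitch shift, as a percentage, for a section
    # built from N FFT frames: one hop out of N is a 1/N relative change
    return 100.0 / N

# N = 100 -> 1.0% steps; N = 200 -> 0.5% steps, matching the values above
```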
Fig 5.2 demonstrates the performance of the Pitch Correction module with respect to the relative
errors between Target Frequencies and actual Pitch-Shifted Frequencies. Since pitch correction is
implemented to shift pitch on a relative scale (see 4.4 Pitch Correction), the relative errors remain
consistent throughout the frequency spectrum, falling in most cases below 0.8%.
Fig 5.2 Relative Error Percentage Values for Pitch Correction Module
However, as can be seen from the results table, the external machine used to verify both the
fundamental frequency and the pitch-shifted frequency provided accuracy to the order of 1Hz only.
Since both the fundamental frequency and the pitch-shifted frequency may vary by as much as 0.5Hz,
there is a potential rounding error within the results displayed in Fig 5.2. Given the worst-case
scenario, the rounding error could be as great as 1Hz. Fig 5.3 displays the corrected relative error
values for the Pitch Correction module, allowing for the maximum potential rounding error.
Fig 5.3 Relative Error Percentage Values for Pitch Correction Module, with maximum rounding error
The rounding error has little effect on the results corresponding to notes higher up in the musical
scale. However, a rounding error of 1Hz has a significant effect on notes lower down in the musical
scale. In order to investigate this further, the same testing procedure was repeated using input files
no. 1-7: the files were shifted to new arbitrary frequencies and the relative errors plotted again. This
was repeated several times, and the observed relative errors consistently remained below 1%,
thereby verifying the results demonstrated in Fig 5.2.
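The worst-case bound applied in Fig 5.3 can be expressed as a short calculation (an illustrative Python fragment; the function name is invented, and the 1 Hz figure is the combined rounding error discussed above, since each of the two measured frequencies may be off by up to 0.5 Hz):

```python
def rel_error_bounds_pct(shifted, target, resolution=1.0):
    """Best- and worst-case relative error (%) between a measured
    pitch-shifted frequency and its target, when the measurements carry
    a combined rounding error of up to `resolution` Hz."""
    nominal = abs(shifted - target)
    best = max(nominal - resolution, 0.0) / target * 100.0
    worst = (nominal + resolution) / target * 100.0
    return best, worst

# Input file 1 (target 30.87 Hz, measured 31 Hz): the nominal error of
# ~0.4% could in the worst case be ~3.7%, showing why the 1 Hz meter
# resolution matters far more for low notes than for high ones.
```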
5.3 User Evaluation

User evaluation involved assembling a group of judges (see Appendix D), containing five musicians
and five non-musicians, and running a series of tests. The first involved playing a number of audio
samples, some of which contained no errors in pitch, and some that had been subject to pitch-
correction through the Automatic Pitch Correction System. The judges were then asked if they could
clearly identify which signals had been subject to pitch-correction. Following this, the judges were
played the original uncorrected versions of the pitch-corrected audio samples and then asked to grade
the performance of the system on a scale from 1-10. A grade of 5 indicated no improvement at all,
whilst a grade of 1 indicated a severe degradation in quality and a grade of 10 indicated the maximum
possible improvement.
In order to obtain continuity throughout the evaluation process, the same input files that were used to
perform analytical evaluation of the Pitch Correction module were also used to perform qualitative
evaluation. Throughout the user testing, a window size of 512 was used and the number of frames N
was set to N = 200.
Input  File description         Fundamental      Target           Judges who        Performance Rating Average (1-10)
file   (wav - 44100 Hz)         Frequency (Hz)   Frequency (Hz)   identified        Musicians   Non-musicians
                                                                  pitch-correction
 1     B0 – Bass guitar          30               30.87 (B0)       10                5.6         5.4
 2     A1 – Bass guitar          54               55.00 (A1)       10                4.8         5.0
 3     G1 – Grand Piano          50               49.00 (G1)        4                7.6         8.0
 4     F2 – Grand Piano          85               87.31 (F2)        6                8.2         8.0
 5     B2 – Male vocal no.1     120              123.47 (B2)        9                5.8         6.2
 6     C3 – Piano (midi)        134              130.81 (C3)        3                8.0         8.2
 7     E3 – Electric Guitar     170              164.82 (E3)        2                8.8         9.0
 8     F3 – Male vocal no. 2    172              174.61 (F3)        8                5.4         5.6
 9     C4 – Cello               255              261.63 (C4)        7                6.4         6.4
10     C4 – Grand Piano         265              261.63 (C4)        3                7.8         8.0
11     D4 – Cello               300              293.66 (D4)        9                8.0         8.2
12     Eb4 – Saxophone          305              311.13 (Eb4)       6                6.8         6.8
13     E4 – Male vocal no.2     335              329.63 (E4)        8                7.2         7.6
14     F4 – Female Vocal        340              349.23 (F4)       10                6.2         6.2
15     A#4 – Oboe               475              466.16 (A#4)       7                7.8         8.2
16     B4 – Flute               505              493.88 (B4)        9                5.8         6.0
17     C5 – Clarinet            510              523.25 (C5)        8                6.2         6.2
18     C#5 – Violin             570              554.37 (C#5)       6                7.4         7.6
19     D#5 – Xylophone          610              622.25 (D#5)       1                9.2         9.0
20     G#5 – Violin             850              830.61 (G#5)      10                5.6         5.6
The above table shows the average results obtained from the user evaluation (see Appendix D for
judges' individual grades). As can be seen, the presence of pitch-correction proved to be noticeable in
the majority of input files. However, the grades awarded by the judges suggest that although the pitch-
correction was noticeable, in most cases it proved beneficial to the quality of the audio.
The majority of judges commented that the pitch-corrections appeared accurate and shifted regions of
signal were only noticeable at the “edges”. As discussed in a previous chapter (see 4.4.3 Phase
Alignment), a region of phase-cancellation exists at either end of a pitch-shifted section of signal and
therefore a slight dip in amplitude exists. In percussive signals, such as piano or xylophone, this dip in
amplitude proved to be negligible since it often went unnoticed by the listener. However, with input
signals that include smoother transitions between notes, such as vocals, the dip in amplitude was more
apparent.
Judges also found that for some signals, pitch-correction was apparent due to notes either side of a
corrected note also being shifted. This resulted from using either too large a window size, or too large
a value for N. Given an input signal where the notes are too close together, the smallest region that
could be pitch-shifted was larger than the length of an individual note and therefore incorrect notes
could not be corrected without adversely affecting small regions of the signal either side of the note.
Perhaps somewhat surprisingly, the results obtained from the judges were similar for both musicians
and non-musicians (see Appendix D for grade distribution graph). Hearing depends on skill and
experience [3] and therefore it would be expected that a musician would have a more sensitive ear to
errors in pitch than non-musicians. The similarity in grades between the two groups suggests that the
actual shift in pitch created by the Automatic Pitch Correction System is sufficiently accurate.
However, the fact that musicians and non-musicians alike were able to identify the presence of
pitch-correction demonstrates that the pitch-corrections made are not transparent.
The final stage of user evaluation required the judges to evaluate pitch corrections made by the
Automatic Pitch Correction System against those made by other implementations used for automatic
pitch correction. Due to the limited availability of such systems, only a few test files were used
throughout this section of evaluation (see Appendix C). The comparisons were made against three
separate systems:
The Antares AVP-1 is an industry-standard vocal pitch corrector utilising "Auto-Tune" technology,
which the manufacturer claims corrects the pitch of vocals (or solo instruments) in real time, without
distortion or artefacts, while preserving all of the expressive nuances of the original performance.
The RBC Audio Voice Tweaker Lite is a pitch transposer plug-in for Digital Audio Workstations or
Sound Editing Software. An automatic pitch correction feature is provided that is designed for use on
monophonic signals such as voice or solo instruments. It is also capable of transposing the pitch and
formants of a signal independently.
The third automatic pitch corrector was developed as a “project on pitch detection and correction for
the solo human voice” for Connexions - Rice University, Texas. The implementation uses an
autocorrelation function (see 2.3 Pitch Detection) to detect the pitch of a note and a PSOLA algorithm
(see 2.4 Pitch Correction) to perform the resulting pitch correction.
The test files obtained for all three systems contained male vocals. Judges were played the original,
uncorrected version of each test file, followed by the pitch-corrected version. They were then asked to
grade the performance of each system, again on a scale of 1-10. Fig 5.4 demonstrates the results
obtained, along with the average results obtained from the user evaluation of the Automatic Pitch
Correction System (a table of these results is available in Appendix D).
Fig 5.4 Performance ratings for various automatic pitch correction systems
As can be seen, the Antares AVP-1 performs very well. Rated very highly by all the judges, the pitch
correction provided was almost entirely transparent, introducing no defects or artefacts into the output
signal. Some of the non-musicians commented that they could not hear a difference between the
“before” and “after” test files.
However, the RBC Audio and Rice University Project systems did not perform as well. The RBC
Audio Voice Tweaker Lite introduced a noticeable amount of modulation to the signal, resulting in a
less than desirable output. Judges remarked that the result sounded more like an effect had been
added, rather than pitch correction had been applied. Similarly, the Connexions project at Rice
University produced an output that sounded like a chorus effect had been added, most likely caused
by phase propagation errors.
With the exception of the Antares AVP-1, the results demonstrate that the Automatic Pitch Correction
System performed well in comparison to similar systems designed to perform pitch correction. It is
however important to note that the results shown for the Automatic Pitch Correction System represent
averages for test files including a range of instruments and not just vocals. The Automatic Pitch
Correction System is primarily designed to function on instrument tracks, whilst the three
implementations it is tested against are designed for vocal tracks. However, Fig 5.5 shows that even
when comparisons are made only using the average for test files containing vocals (test files no. 5, 8,
13 and 14), the Automatic Pitch Correction System still performs favourably in comparison to the
RBC Audio Voice Tweaker Lite and the Connexions project at Rice University.
Fig 5.5 Performance ratings for various automatic pitch correction systems for vocals only
6 Conclusions
6.1 Evaluation of Minimum Requirements
• Discuss current methods and implementations used for detecting and correcting pitch errors
both in real time and in batch processing
Many different techniques were researched and studied. These included both frequency and time
domain techniques; the advantages and disadvantages of each method were identified and the possible
application of each method within the Automatic Pitch Correction System was considered. A range of
extra material including areas such as human pitch perception and formant correction was also
covered in order to gain a deeper understanding of the subject area.
• Create a system that is capable of detecting and correcting pitch errors on a single, simple
instrument track, or tuning fork, without introducing distortion, phase errors, or other
artefacts
The system created is able to automatically detect errors in pitch, determine an appropriate “target
note”, and shift the pitch of the appropriate section of signal to the desired frequency. Analytical
evaluation revealed that the system is able to detect pitch within an error margin of 1-2% and perform
pitch correction to within 0.8% of the target note. The maximum cumulative error that may
occur from the combined process of pitch detection and pitch correction is therefore 2.8%. The human
ear can detect differences of as little as 1 Hz in sustained notes [12], and therefore, given an input note
above “Middle C” (C4), the resulting pitch shift is known to be accurate within human hearing
capabilities. However, this is given the worst-case scenario; qualitative evaluation demonstrated that
even below Middle C, the pitch correction proved to be sufficiently accurate for a group of both
musicians and non-musicians.
Further qualitative evaluation showed that the system was able to perform automatic pitch correction
on simple instrument tracks without introducing audible loss or coloration of signal quality. The pitch
corrections performed on instrument tracks that included piano, xylophone and electric guitar were in
most cases undetectable by the majority of judges. The system was found to perform well on more
percussive tracks where note accents are quite apparent and note changes are not too closely spaced.
This is due to the choice of window size and value for N (no. of frames per pitch shifted section); a
larger window size and value for N increases the accuracy of the system, but a trade-off is made
between time and frequency resolution. A smaller window size and value for N prevent the
unwanted pitch shifting of notes that may occur with larger values (see 5.3
User Evaluation), but accuracy is lost due to a decreased number of frequency bins available for pitch
detection (see 4.2 Pitch Detection) and a decreased accuracy with which pitch-shifting can be
performed (see 4.4 Pitch Correction).
• Evaluate system against current implementations already used for automatic pitch correction
During user evaluation, the performance of the system was directly compared to three existing
implementations (see 5.3 User Evaluation). These implementations included a professional industry
standard vocal pitch corrector, another university research project and a pitch transposer plug-in for
computer recording software. Further evaluation was carried out to assess the system from both an
analytical and a qualitative point of view, obtaining results that demonstrated both the error tolerance
of the system and the subjective opinions of a group of judges.
6.2 Evaluation of Possible Extensions
• Perform automatic pitch-correction of input track in real-time, allowing for application in
live performances and real-time monitoring
Although this particular extension to the minimum requirements was not realised, the design and
implementation of the Automatic Pitch Correction System throughout the project was focussed on
producing a solution that could potentially be created as a real-time application. The system has
therefore been designed to run sequentially and requires no post-processing or feedback loops.
Although a small delay would be required to obtain the best results (allowing for a larger window
size and therefore increased accuracy), the system does form the basis from which a real-time
application may be implemented.
• Implement a feature that will allow a specific key to be selected before signal input. This
would enable out of pitch notes to be shifted to the nearest note in the selected key, as
opposed to the nearest note in the chromatic scale
Error detection is performed using a lookup table to determine the target note for each given
frequency. The musical scale from which target notes are chosen is purely dictated by the content of
the lookup table. By including only the frequencies of notes within a given musical scale (e.g. C
major), all notes can be shifted only to those frequencies and hence to that musical scale. Similarly,
only frequencies corresponding to notes within a given melody may be included in the lookup table
and therefore pitch-shifting may be made to a specific tune or melody.
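The lookup-table idea can be illustrated with a small sketch (Python for illustration; the table name and the single-octave C major subset are chosen for the example, and the frequencies are standard equal-temperament values): the target note is the table entry nearest on a log-frequency, i.e. pitch, scale.

```python
import math

# Lookup table restricted to one octave of C major; a chromatic table
# would simply contain all twelve notes per octave
C_MAJOR = {"C4": 261.63, "D4": 293.66, "E4": 329.63, "F4": 349.23,
           "G4": 392.00, "A4": 440.00, "B4": 493.88}

def target_note(freq, table):
    # nearest table entry measured on a log-frequency (pitch) scale
    name = min(table, key=lambda n: abs(math.log2(freq / table[n])))
    return name, table[name]
```

For example, a note detected at 305 Hz snaps to D4 (293.66 Hz) under the C major table, whereas a chromatic table would pull it to the nearer Eb4 (311.13 Hz).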
• The system could possibly be extended to allow pitch-correction of polyphonic material, such
as a de-tuned instrument playing a chord
Both the Pitch Detection and Pitch Correction modules used within the Automatic Pitch Correction
System are capable of functioning on polyphonic material. The Pitch Correction module is waveform
independent and therefore is able to shift the pitch of monophonic and polyphonic material equally
well. The Pitch Detection module is also able to extract the fundamental frequency from a polyphonic
input signal and therefore automatic pitch correction on polyphonic material may be performed. It is
important to note that it is not possible to perform corrections on single notes within polyphonic
material; every note within a pitch-shifted section of signal is affected equally. The system therefore
performs very well on polyphonic material that contains mistuned instruments or incorrect chords, but
is unable to function on material where an incorrect note coincides with other correct notes.
6.3 Suggestions for Further Work

6.3.1 Improved Polyphonic Pitch Detection and Correction
Pitch correction of polyphonic material is possible using the Automatic Pitch Correction System.
However, this is only under the condition that any errors in pitch are uniform throughout the
spectrum. The pitch detection module used within the Automatic Pitch Correction system is only
capable of detecting the fundamental frequency of a sound, and the pitch correction module is only
capable of shifting the pitch of an entire sound and therefore accurate corrections may only be made if
all components of the sound at a given time are out of tune by the same amount e.g. a chord that is
played out of key.
A possible area for further investigation is the possibility of creating an Automatic Pitch Correction
System that is able to detect and correct non-uniform errors within a polyphonic sound source, for
example, a chord that is played on a guitar where only one string is out of tune. The system should be
able to correct the single incorrect note whilst the correct notes within the chord remain unchanged.
This would involve more intelligent pitch detection, where harmonics present in the spectrum would
also need to be considered in order to identify the number and frequency of notes present. Similarly,
pitch correction would need to be performed to discrete sections of the spectrum as opposed to the
existing method that manipulates the spectrum at each window as a whole.
6.3.2 Real-time Automatic Pitch Correction

Although the Automatic Pitch Correction System has been designed as a sequential process and
computational expense has been considered at every stage, the system still functions primarily as a
post-processor. The implementation of the Automatic Pitch Correction as a real-time application
would require implementing the solution as a true streaming process, as well as evaluating possible
optimisations and adjustments to the existing algorithms.
The current implementation of the Automatic Pitch Correction System is based entirely in MATLAB,
and therefore any further work involving real-time processing will require implementation in a
language such as C or Java, which, unlike MATLAB, allows for real-time streaming of input data.
6.3.3 Pitch Correction with Formant Preservation

Further work could be carried out, implementing a similar solution to automatic pitch correction that
also considers the position of formants. The existing technique used for pitch correction shifts formant
information as well as pitch. However, formants are consistent in structure and independent of pitch
[23] and therefore some form of formant preservation should be incorporated into the system.
The Automatic Pitch Correction System shifts pitch by at most half a semitone, correcting to a
chromatic scale (or one semitone, correcting to a given musical scale), and therefore for these small
shifts in pitch, the effect of shifting formants may not be apparent. However, for a more flexible
implementation that may be used to shift pitch by larger amounts, for example, the transposition of a
vocal track by more than one key, formant correction or preservation would be required.
6.3.4 Intelligent Error Correction

The current implementation simply uses a "nearest neighbour" approach to error correction, whereby
a frame is shifted to either the nearest semitone (given chromatic tuning) or the nearest note in a given
musical scale. However, this may not always lead to the correct choice of target note, since an
instrument may be out of tune or a singer may sing off key by more than half a semitone. An
intelligent system could be developed that would calculate a target note that is based on the
frequencies of previous frames as well as the frequency of the current frame. One option would be to
incorporate the use of a Markov Model that would be able to predict standard note progressions and
musical scales. This would remove the need to hard code the desired musical scale to correct to (given
the chromatic scale is not to be used) and would cope with the problem of “accidentals” being
undesirably corrected.
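One way such a model might operate is sketched below (Python for illustration; the note names, frequencies, and bigram probabilities are invented purely to show the idea, and a real model would be trained on a corpus of melodies): each candidate target is scored by its pitch proximity weighted by the probability of it following the previous note.

```python
import math

# Hypothetical bigram probabilities for note progressions (invented
# numbers; a trained Markov model would supply these)
BIGRAM = {("C", "D"): 0.5, ("C", "E"): 0.3, ("C", "C#"): 0.01}
NOTE_FREQS = {"C": 261.63, "C#": 277.18, "D": 293.66, "E": 329.63}

def pick_target(freq, prev_note):
    """Score each candidate note by progression probability divided by
    its pitch distance from the detected frequency; return the best."""
    def score(name):
        dist = abs(math.log2(freq / NOTE_FREQS[name]))   # pitch distance
        prob = BIGRAM.get((prev_note, name), 0.05)       # progression prior
        return prob / (dist + 1e-3)
    return max(NOTE_FREQS, key=score)
```

With these numbers, a note detected at 285 Hz after a C is corrected to D even though C# is marginally closer in pitch, illustrating how the progression prior can override a plain nearest-neighbour choice.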
7 Alternative Methods
This section covers alternative methods that were explored in an attempt to solve the problem of
real-time automatic pitch correction. Each method was implemented and tested, but rejected due to
its limitations or ineffectiveness.
7.1 Note Detection
A method that was initially proposed in order to perform pitch detection involved the detection of
individual notes within the input signal. Through the detection of peaks in amplitude over a given
time, the transient (or “attack”) of each note can be identified and the length of each note can be
calculated. The peaks are identified by first calculating the power spectrum for windowed sections of
the input signal. The maximum peak within each power spectrum is identified as the amplitude value
for the signal at the given time point. A plot of consecutive amplitude values is then created, thus
allowing for transients to be detected through detection of peaks within the amplitude plot.
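A simplified sketch of this scheme follows (Python for illustration; here the mean power of each window stands in for the maximum spectral peak described above, and the function names are invented):

```python
def amplitude_envelope(x, win, hop):
    # one amplitude value per windowed section: mean power of the frame
    return [sum(v * v for v in x[s:s + win]) / win
            for s in range(0, len(x) - win + 1, hop)]

def transients(env, thresh):
    # local maxima of the envelope above a threshold mark note onsets;
    # >= on the left side lets the last point of a flat plateau qualify
    return [i for i in range(1, len(env) - 1)
            if env[i] > thresh and env[i] >= env[i - 1] and env[i] > env[i + 1]]
```

Applied to a signal containing two distinct bursts of energy, this yields one detected transient per burst, from which note boundaries can be derived.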
The benefit of such an approach is that the windowing can be dynamically altered. The detection and
correction of a note need only be performed between the boundaries of the note, improving the
performance of both the pitch detection and pitch correction modules.
The pitch detection algorithm (see 4.2 Pitch Detection) is optimised for each note since the pitch
detection is performed between the boundaries of each note and therefore a maximum window size is
being used. Also, windows that overlap consecutive notes within an input signal are eliminated, thus
avoiding confusing results. The “timing” of the pitch correction (see 4.4 Pitch Correction) is also
improved since the pitch shifting is guaranteed to start with the onset of each note and end before the
onset of the next. Pitch shifting that overlaps note boundaries is prevented and therefore unwanted
pitch shifts are avoided.
An important issue that would need to be addressed is the different waveforms exhibited by various
instruments. Certain instruments such as the piano or xylophone create waveforms with very distinct
peaks in amplitude that accurately represent note transients (see Fig 7.1). However, many other
instruments such as the cello or the flute produce waveforms with amplitudes that oscillate and thus a
single note may be detected as several and note boundaries may be missed since no clear peak can be
detected (see Fig 7.2).
Fig 7.1 Plot showing four notes played on a piano.
Peaks representing each note transient are clearly visible and can be easily detected
Fig 7.2 Plot showing four notes played on a cello.
Multiple peaks are shown, which creates ambiguity with regards to detecting note transients
The major disadvantage of such an approach is that it is purely a post-processing algorithm. A note
cannot be detected until the transient representing the following note is reached. This requirement
would cause a significant delay in the output signal, which prevents any possibility of real-time
application.
7.2 Modified Phase Vocoder
An alternative method for pitch correction, proposed by Laroche and Dolson [14], achieves pitch
shifting through translation of peaks within the frequency-domain representation of a signal. Peaks
are identified within the power spectrum through a simple peak detection process. A “region of
influence” is then defined around each peak, the limit of which is set halfway between successive
peaks. A shift in pitch is achieved by translating each peak, along with its region of influence, to the
desired frequency location (see Fig 7.3 & Fig 7.4). The peak corresponding to the fundamental
frequency f is shifted by the required frequency shift ∆f; all other peaks are shifted by the same
relative amount in order to maintain the harmonic structure of the power spectrum. For example, the
peak corresponding to the second harmonic of a sound is shifted by 2∆f, whilst the third harmonic is
shifted by 3∆f.
Fig 7.3 Power spectrum of a signal before peaks are translated to new frequencies
Fig 7.4 Power Spectrum after peaks are translated to new frequency locations.
Note that higher harmonics are shifted by a greater degree
Phase correction is implemented by rotating the phase angles relative to the amount of pitch shift
required. Given that the frequency ω has been shifted to ω + ∆ω, the phase angles are rotated by ∆ωR,
where R is the hop size between consecutive windows.
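The peak translation and phase rotation described above can be sketched in Python (an illustration, not the MATLAB implementation used in this project; function names, the integer-bin shift, and the toy spectrum are simplifying assumptions):

```python
import cmath

# Illustrative sketch of the peak-translation idea from Laroche and
# Dolson [14]: pick peaks in a complex spectrum, let each region of
# influence extend halfway to the neighbouring peak, then move each
# region by a whole number of bins so the p-th harmonic moves p times
# the fundamental's shift. Each moved bin's phase is rotated by the
# frequency change times the hop size R.
def translate_peaks(spectrum, delta_bins, hop_size=256, fft_size=1024):
    mags = [abs(c) for c in spectrum]
    # Simple peak picking: any bin larger than both of its neighbours.
    peaks = [i for i in range(1, len(mags) - 1)
             if mags[i] > mags[i - 1] and mags[i] > mags[i + 1]]
    if not peaks:
        return spectrum[:]
    f0 = peaks[0]  # treat the first peak as the fundamental
    # Region boundaries fall halfway between successive peaks.
    bounds = [0] + [(a + b) // 2 for a, b in zip(peaks, peaks[1:])] + [len(spectrum)]
    out = [0j] * len(spectrum)
    for k, p in enumerate(peaks):
        shift = round(p / f0) * delta_bins  # harmonic number times the shift
        # Phase rotation by delta_omega * R for the shifted bins.
        rot = cmath.exp(1j * 2 * cmath.pi * shift / fft_size * hop_size)
        for i in range(bounds[k], bounds[k + 1]):
            j = i + shift
            if 0 <= j < len(out):
                out[j] += spectrum[i] * rot
    return out

# Toy spectrum: fundamental at bin 10, second harmonic at bin 20.
spec = [0j] * 64
spec[10], spec[20] = 1 + 0j, 0.5 + 0j
shifted = translate_peaks(spec, 2)
print([i for i, c in enumerate(shifted) if abs(c) > 0.1])  # prints [12, 24]
```

Note how the fundamental moves by two bins while the second harmonic moves by four, preserving the harmonic structure.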
The method has a number of advantages, the most significant being that arbitrary frequency shifts can
be made: peaks and their regions of influence may be shifted to any frequency location within the
power spectrum, requiring only a simple integer shift to a new frequency bin. Another advantage of
this method, as opposed to the method implemented in the Automatic Pitch Correction System (see
4.4 Pitch Correction), is the ability to shift pitch using only a single windowed sample of the input
signal, making it highly suitable for real-time application.
However, despite the many advantages, the output signal created suffers from phase propagation
errors and transient smearing; sounds lose their presence and notes lose their attack. Since the aim of
this project is to create a sound signal of the same quality and timbre as the original sound (see 1.1
Introduction), the use of such a pitch correction technique is unacceptable.
7.3 Alternative Phase Alignment Techniques
One of the main issues that needed to be addressed during the implementation of the Automatic Pitch
Correction System is the alignment of phase angles between pitch-shifted sections of the signal. A
number of different techniques were applied with the aim of removing the phase propagation errors
that cause audible “blips” in the output signal.
7.3.1 Rectangular windowing
The effects of phase cancellation where two sections of signal overlap (see 4.4 Pitch Correction) are
the cause of the audible blips. This overlap is a result of using a Hanning function on each analysis
frame. A potential solution to this problem is to use rectangular windowing, where no overlap is used
and no function is applied to individual analysis frames. Although rectangular windowing offers less
accurate frequency representations [25], the individual frames are not used by the Pitch Detection
module (see 4.2 Pitch Detection) and may therefore be utilised to remove any overlapping sections.
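The spectral trade-off cited from [25] can be demonstrated with a small Python sketch (illustrative only; the DFT is written out directly and the tone frequency is chosen as an assumed worst case, halfway between two bins):

```python
import math

# A sinusoid that does not fall exactly on an FFT bin leaks energy into
# neighbouring bins; a Hanning window concentrates that leakage far
# better than a rectangular (no) window.
def dft_mags(samples):
    n = len(samples)
    return [abs(sum(samples[t] * complex(math.cos(2 * math.pi * k * t / n),
                                         -math.sin(2 * math.pi * k * t / n))
                    for t in range(n))) / n
            for k in range(n // 2)]

N = 64
freq = 10.5  # halfway between bins 10 and 11: worst-case leakage
tone = [math.sin(2 * math.pi * freq * t / N) for t in range(N)]
hann = [0.5 - 0.5 * math.cos(2 * math.pi * t / N) for t in range(N)]

rect_mags = dft_mags(tone)                               # rectangular window
hann_mags = dft_mags([s * w for s, w in zip(tone, hann)])  # Hanning window

# Energy landing more than 3 bins away from the true frequency.
def far_energy(mags):
    return sum(m * m for k, m in enumerate(mags) if abs(k - 10.5) > 3)

print(far_energy(rect_mags) > 10 * far_energy(hann_mags))  # prints True
```

The rectangular window spreads substantially more energy into distant bins, which is why it degrades the frequency representation even though it removes the overlap.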
However, using rectangular windowing presents other problems. Firstly, since the Pitch Correction
module (see 4.4 Pitch Correction) requires a large number of analysis frames in order to obtain an
accurate pitch shift, the use of rectangular windowing will greatly increase the length of signal
required for each pitch shifted section (see Fig 7.5).
Fig 7.5 Top – Eight consecutive analysis frames with a 75% overlap
Bottom – Three consecutive analysis frames with no overlap, spanning a greater time period than eight overlapping frames
Secondly, as mentioned earlier (see 4.1 Converting to Frequency Domain), the overlap between
successive frames is important to create a “cross-fade” effect. The use of rectangular frames creates
audible distortion between pitch-shifted sections of signal as a result of inconsistent waveforms (see
Fig 7.6). Similar to the phase cancellation problems associated with using overlapping sections, this is
a phase alignment issue. However, in the case of rectangular windowing, the phase propagation errors
are manifested in a far more obtrusive fashion. Consequently, the use of rectangular windowing is not
a feasible option.
Fig 7.6 A) A signal split into two sections where the second section is to be pitch shifted.
B) The shift in pitch results in both a change in frequency and in phase and therefore a discontinuity
in the signal is introduced where the two sections meet.
7.3.2 Interpolation within the Frequency Domain
Many techniques exist that allow reconstruction of missing samples, or removal of clicks and pops in
audio signals [10, 24, 26]. The majority of these techniques are autoregressive, whereby the data
present at a point within the signal is used to predict the data that follows. This may be
implemented in the frequency domain through interpolation of consecutive Fourier analysis frames.
A possible solution to the phase alignment problem is to use such an autoregressive technique and
simply remove the region where consecutive sections overlap and perform reconstruction through
interpolation. Since a 75% overlap is utilised throughout the Automatic Pitch Correction System, the
overlap between two successive sections of signal is three frames (see 4.4 Pitch Correction). The
Fourier analysis windows either side of this overlapping region can be interpolated in an attempt to
remove the blip and “blend” the two pitch-shifted sections on either side together.
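A heavily simplified sketch of this idea follows (Python for illustration; real autoregressive schemes [10, 24, 26] are far more sophisticated, and the linear interpolation of frame values here, with phase handling omitted, is an assumption):

```python
# The frames covering the overlap region are discarded and rebuilt by
# linearly interpolating between the frame before the gap and the frame
# after it. Frames are plain lists of per-bin values.
def interpolate_gap(frames_before, frames_after, gap_len):
    left, right = frames_before[-1], frames_after[0]
    filled = []
    for g in range(1, gap_len + 1):
        t = g / (gap_len + 1)  # interpolation position in (0, 1)
        filled.append([(1 - t) * a + t * b for a, b in zip(left, right)])
    return frames_before + filled + frames_after

# A 3-frame gap, matching the 75%-overlap case described above.
before = [[1.0, 0.0], [0.8, 0.2]]
after = [[0.0, 1.0], [0.0, 1.0]]
result = interpolate_gap(before, after, 3)
print(result[2])  # first reconstructed frame, one quarter of the way across
```

In practice many more frames on either side must participate, which is what makes the approach too slow for the real-time aim of the system.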
Although this technique is capable of successfully removing the blips, it requires interpolation of a
large number of FFT frames on either side of the overlapping region in order to obtain a smooth
transition between sections. Interpolation of a smaller number of frames results in an audible jump
between consecutive sections, which offers no improvement over the blips that existed previously. Since
the Automatic Pitch Correction System is fundamentally aimed at being the basis of a real-time
system, such post-processing techniques (that potentially require a considerable amount of time) are
not suitable. Avoiding the cause of such blips entirely is a much more desirable approach, where any
phase alignment problems that may occur are prevented through appropriate rotation of phase angles.
7.3.3 Phase Angle Rotation
A final method involved simply storing the phase angles corresponding to the last frame of each
section of signal. These values are then used as the “starting point” for the next section of signal,
effectively resulting in the phase angles of a section being rotated to align with the preceding section.
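The phase-carrying idea can be sketched minimally (a Python illustration under assumed names; frames are lists of complex bin values, which is a simplification of the actual frame representation):

```python
import cmath

# Store the phase angles of the previous section's final frame, then
# rotate every frame of the next section so that its first frame starts
# from those stored angles.
def align_section(prev_last_frame, next_frames):
    # Per-bin rotation offsets: the angles mapping the next section's
    # first frame onto the stored phases of the previous section.
    offsets = [cmath.phase(p) - cmath.phase(n)
               for p, n in zip(prev_last_frame, next_frames[0])]
    return [[c * cmath.exp(1j * off) for c, off in zip(frame, offsets)]
            for frame in next_frames]

prev_last = [cmath.exp(0.5j)]      # one frequency bin, phase 0.5 rad
section = [[1 + 0j], [0 + 1j]]     # next section starts at phase 0
out = align_section(prev_last, section)
print(round(cmath.phase(out[0][0]), 6))  # prints 0.5
```

The rotation removes the discontinuity at the join, but because every bin in every subsequent frame carries the same fixed offset, the bins drift out of their natural phase relationships, which is the phase incoherence [13] described above.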
Although this technique does help to alleviate the problem of phase cancellation between successive
frames, the output signal begins to suffer from phase incoherence [13], where there is a loss of
“presence” in the output signal and a subtle reverb effect is introduced. A major requirement of the
Automatic Pitch Correction System is the ability to reproduce input audio signals without loss or
coloration of signal and therefore this technique is not suitable.
References
[1] A. M. Noll. 1969. “Pitch determination of human speech by the harmonic product spectrum,
the harmonic sum spectrum, and a maximum likelihood estimate”. Proc. of the Symposium on
Computer Processing in Communication, pages 779–798, April 1969.
[2] Aniruddh D. Patel, Evan Balaban. 2001. “Human pitch perception is reflected in the timing of
stimulus-related cortical activity”. The Neurosciences Institute. California, USA.
[3] B. Espinoza-Varas, C. S. Watson. 1989. “Perception of complex auditory patterns by humans”.
In R.J. Dooling and S.H. Hulse (Eds.) The Comparative Psychology of Audition: Perceiving
Complex Sounds, Lawrence Erlbaum Associates, Hillsdale, NJ.
[4] Ben Gold, Nelson Morgan. 2000. “Speech and Audio Signal Processing: Processing and
Perception of Speech and Music”. John Wiley & Sons, Inc. New York
[5] Cheol-Woo Jo, Ho-Gyun Bang, William A. Ainsworth. “Improved Glottal Closure Instant
Detector Based on Linear Prediction and Standard Pitch Concept”. Dept. of Control and
Instrumentation Engineering, Changwon University, Korea.
[6] David Claus. 2004. “Nearest Neighbour, Condensing and Editing”. Computer Vision Reading
Group. Oxford.
[7] Edward A. Lee. 1998. “Design Methodology for DSP”. Department of Electrical Engineering
and Computer Science University of California, Berkeley.
[8] Glenn Zelniker, Fred J. Taylor. 1994. “Advanced Digital Signal Processing: Theory and
Applications”. Marcel Dekker, inc. New York.
[9] Goangshiuan S. Ying, Leah H. Jamieson, Carl D. Michell. 1994. “A Probabilistic Approach to
AMDF Pitch Detection”. School of Electrical and Computer Engineering, Purdue University.
[10] I. Potamitis, N. Fakotakis. 2001. “Autoregressive Time-Frequency Interpolation in the Context
of Missing Data Theory for Impulsive Noise Compensation”. Wire Communications
Laboratory, Electrical and Computer Engineering Dept, University of Patras, 261 10 Rion,
Patras, Greece.
[11] Jaideva C. Goswami, Andrew K. Chan. 1999. “Fundamentals of Wavelets: Theory, Algorithms
and Applications”. John Wiley & Sons, Inc. New York.
[12] James Athey. 2003. “Eartrainer: Cross-Platform Eartraining Program for Musicians”.
Retrieved February 12th, 2004, from http://zoo.cs.yale.edu/classes/cs490/02-03b/james.athey/Final_Report.html
[13] Jean Laroche, Mark Dolson. 1999. “Improved Phase Vocoder Time-Scale Modification of
Audio”. IEEE Transactions on Speech and Audio Processing, vol. 7, no. 3, pp. 323-332.
[14] Jean Laroche, Mark Dolson. 1999. “New Phase-Vocoder Techniques for Pitch Shifting,
Harmonizing and Other Exotic Effects”. Proc. IEEE Workshop on Applications of Signal
Processing to Audio and Acoustics, New Paltz. New York. pp. 91-94.
[15] Jean Laroche. 2003. “Frequency Domain Techniques for High-Quality Voice Modification”.
Proc. of the 6th int. Conference on Digital Audio Effects, London, UK, pp. dafx72.
[16] John Garas, Piet C. W. Sommen. 1998. “Time/Pitch Scaling Using the Constant-Q Phase
Vocoder”. Eindhoven University of Technology.
[17] L. K. Saul, D. D. Lee, C. L. Isbell, and Y. LeCun. 2003. “Real time voice processing with
audiovisual feedback: toward autonomous agents with perfect pitch”. In S. Becker, S. Thrun,
and K. Obermayer (eds.), Advances in Neural Information Processing Systems 15, pages 1205-
1212.
[18] L. R. Rabiner, R. W. Schafer. 1978. “Digital Processing of Speech Signals”. Prentice Hall, Inc.
New Jersey.
[19] Lawrence J. Ziomek. 1995. “Fundamentals of Acoustic Field Theory and Space-Time Signal
Processing”. CRC Press, Inc. Boca Raton.
[20] Masahiro Furukawa, Yusuke Hioka, Takuro Ema, Nozomu Hamada. 2003.
“Introducing New Mechanism in the Learning Process of FDICA-Based Speech Separation”.
International Workshop on Acoustic Echo and Noise Control, Sept. 2003, Kyoto, Japan.
[21] Patricio de la Cuadra, Aaron Master, Craig Sapp. 2001. “Efficient Pitch Detection Techniques
for Interactive Music”. Center for Computer Research in Music and Acoustics, Stanford
University.
[22] Paul Boersma. 1993. “Accurate Short-Term Analysis of the Fundamental Frequency and the
Harmonics-to-Noise Ratio of a Sampled Sound”. Institute of Phonetic Sciences, University of
Amsterdam, Proceedings 17 (1993), 97-110.
[23] Paul L. Browning. 1997. “Audio Digital Signal Processing in Real Time”. Computer Science
Dept. West Virginia University.
[24] Paulo A. A. Esquef, Vesa Välimäki, Kari Roth, Ismo Kauppinen. 2003. “Interpolation of
Long Gaps in Audio Signals Using the Warped Burg’s Method”. Proc. of the 6th Int.
Conference on Digital Audio Effects (DAFx-03), London, UK, September 08-11, 2003.
[25] Pierre Wickramarachi. 2003. “Effects of Windowing on the Spectral Content of a Signal”.
Sound and Vision January 2003. Data Physics Corporation, San Jose, California.
[26] R. N. J. Veldhuis. 1990. “Restoration of Lost Samples in Digital Signals”. Prentice-Hall.
[27] Simon J. Godsill and Pete J. W. Rayner. 1998. “Digital Audio Restoration: A Statistical Model
Based Approach”. Springer. London.
[28] Stephan M. Bernsee. 1995. “Time Stretching and Pitch Shifting of Audio Signals”. Retrieved:
November 20th, 2003, from http://www.dspdimension.com/html/timepitch.html
[29] Tristan Jehan. 1997. “Pitch Detection”. CNMAT. Berkeley, California. Retrieved: November
11th, 2003, from http://www.cnmat.berkeley.edu/~tristan/Report/node4.html
Appendix A
Reflection on Project Experience

I found the project to be both an enjoyable and challenging experience. The subject area covered a
range of different topics and involved a considerable amount of learning throughout the process of
creating both the solution and the report. The following paragraphs, mainly concerning methodology
and time management, describe the lessons I learnt throughout the project:
Background reading and thorough research within the subject area is essential. A lot of time can be
wasted embarking on the implementation stages of a project if an in-depth understanding and
knowledge of the subject area has not already been achieved. Extra time spent in the early stages of a
project can save a lot of time during the following stages and help to produce a higher quality
solution. Finding appropriate material proved to be very challenging in the early stages of the project
since very little is written about “automatic pitch correction”. However, once a greater understanding
of the subject area had been achieved, searching for and locating appropriate literature became
increasingly easier, with many useful references discovered through the use of the Internet.
Creating a time schedule is a very worthwhile process and the importance of allocating adequate time
for each project stage should be taken into consideration. Most significantly during the
implementation stages, the actual amount of time required often exceeds the predicted time and
therefore the time required for each stage should be overestimated where possible. Throughout this
project, it was found that the appropriateness of an algorithm or technique could often not be assessed
until implementation of that particular algorithm or technique was complete. This increased the time
spent implementing the solution considerably and reduced the available amount of time for other
project stages.
It is also important to identify appropriate evaluation criteria early on during the project. Without
such, the project may lack focus or direction. If the performance of a system cannot be determined
through appropriate evaluation, the reasons for selecting a particular methodology and
implementation become ambiguous, most likely resulting in an inadequate solution.
Finally, it is important to set realistic minimum requirements and to ensure focus is directed towards
meeting the needs of the project. Time spent thinking carefully about the minimum requirements early
on in the project can prevent a lot of wasted time and ensure focus is maintained on the correct aspects
of the project.
Appendix B Project Schedule Gantt Chart
Fig A Original Project Schedule
Fig B Revised Project Schedule
Appendix C
External Devices
Evaluation of the Pitch Correction module involved shifting the pitch of individual notes within
various audio samples. To implement this, an external track-editing machine was utilised, specifically,
the ZOOM MRS1608. This machine provides full editing capabilities for musical tracks and was
therefore ideally suited to introducing pitch errors into the audio samples. The MRS1608 also provides a built-in
chromatic tuner, which was also used throughout the evaluation of the Pitch Correction module.
More information can be found at www.zoom.co.jp
Below is a list of the existing automatic pitch correction implementations that were used as part of
the user evaluation. Included are the URLs from which the test files were obtained:
• Antares Vocal Producer AVP-1
http://onstagemag.com/ar/performance_online_extras_january_2/
• RBC Audio Voice Tweaker Lite
http://www.rbcaudio.com/html/vt_lite.html
• “Project on pitch detection and correction for the solo human voice”
Connexions - Rice University, Texas
http://cnx.rice.edu/content/m11716/latest/
Appendix D
User Evaluation Results
The group of judges selected to perform the user evaluation comprised five experienced musicians
(Grade 7 – Grade 8) and five non-musicians. The people involved are listed below with their
respective instruments:
Musicians -
Michael Connolly Guitar
Simon Stevens Bass Guitar, Clarinet
Helen Jackson Piano, Flute
Helen Kirkbright Piano
Nick Parva Guitar, Drums
Non-Musicians -
Leon Savidis
Rouzbeh Safaie
Peter Coleman
Sean Matthews
Sharon Davidson
Performance Ratings -
The following table details the individual performance ratings awarded by each judge for each test
file during user evaluation. The average grade for musicians and non-musicians for each test input file
is also shown (marked as Avg). Further to this, Fig D displays the grade distribution between average
performance ratings for both the musicians and the non-musicians.
Performance Rating (1-10)

                                            Musicians              Non-Musicians
No.  File description (wav – 44100 Hz)  MC SS HJ HK NP  Avg    LS RS PC SM SD  Avg
 1   B0 – Bass guitar                    6  6  5  6  5  5.6     5  5  5  7  5  5.4
 2   A1 – Bass guitar                    4  6  4  5  5  4.8     4  5  5  6  5  5.0
 3   G1 – Grand Piano                    8  8  7  8  7  7.6     8  7  8  9  8  8.0
 4   F2 – Grand Piano                    9  8  8  9  7  8.2     8  8  8  8  8  8.0
 5   B2 – Male vocal no.1                5  7  5  6  6  5.8     6  6  7  7  5  6.2
 6   C3 – Piano (midi)                   8  8  7  9  8  8.0     8  7  8  9  9  8.2
 7   E3 – Electric Guitar                8  8  9 10  9  8.8     8  8  9 10 10  9.0
 8   F3 – Male vocal no.2                6  5  5  6  5  5.4     6  5  5  6  6  5.6
 9   C4 – Cello                          6  6  6  6  6  6.4     5  5  6  7  7  6.4
10   C4 – Grand Piano                    8  8  7  9  7  7.8     7  8  8  9  8  8.0
11   D4 – Cello                          9  8  7  8  8  8.0     7  8  8  9  9  8.2
12   Eb4 – Saxophone                     7  7  6  7  7  6.8     6  6  7  8  7  6.8
13   E4 – Male vocal no.2                7  7  7  8  7  7.2     8  7  7  8  8  7.6
14   F4 – Female Vocal                   6  6  6  7  6  6.2     5  7  7  6  6  6.2
15   A#4 – Oboe                          8  8  7  9  7  7.8     8  8  8  8  9  8.2
16   B4 – Flute                          6  6  5  6  6  5.8     6  6  6  7  5  6.0
17   C5 – Clarinet                       6  5  6  7  7  6.2     5  7  6  7  6  6.2
18   C#5 – Violin                        7  8  7  8  7  7.4     7  8  7  8  8  7.6
19   D#5 – Xylophone                     9  9  9 10  9  9.2     9  9  9 10  8  9.0
20   G#5 – Violin                        6  6  5  6  5  5.6     5  6  5  7  5  5.6
Fig D Grade distribution graph between Musicians and Non-Musicians
Below is a table detailing the individual results given by each judge during the user evaluation stage
involving comparison with other existing implementations used for automatic pitch correction.
Also given is the average grade (marked as Avg) awarded for each system by both the musicians and
the non-musicians:
Performance Rating (1-10)

                                                    Musicians                      Non-Musicians
No.  Pitch Correction System                 MC    SS    HJ    HK    NP    Avg     LS    RS    PC    SM    SD    Avg
 1   Antares AVP-1                           10    10    10    10     9    9.8     10    10    10    10    10    10.0
 2   RBC Audio Voice Tweaker Lite             5     5     4     5     4    4.6      5     5     4     5     4    4.6
 3   Connexions – Rice University             6     5     6     5     6    5.6      6     7     6     6     6    6.2
 4   Automatic Pitch Correction System      6.95   7.0   6.4   7.5   6.7   6.91   6.55   6.8  6.95   7.8   7.1   7.04
 5   Automatic Pitch Correction System
     (vocals only)                           6.0  6.25  5.75  6.75   6.0   6.15   6.25  6.25   6.5  6.75  6.25   6.40