UNIVERSITY OF OKLAHOMA
GRADUATE COLLEGE
POSITION CONTROL USING PITCH FEEDBACK
A THESIS
SUBMITTED TO THE GRADUATE FACULTY
in partial fulfillment of the requirements for the
Degree of
MASTER OF SCIENCE
By
MUSTAFA A. GHAZI
Norman, Oklahoma
2013
POSITION CONTROL USING PITCH FEEDBACK
A THESIS APPROVED FOR THE
SCHOOL OF AEROSPACE AND MECHANICAL ENGINEERING
BY
Dr. David Miller, Chair
Dr. Andrew Fagg
Dr. Peter Attar
© Copyright by MUSTAFA A. GHAZI 2013
All Rights Reserved.
DEDICATION
To my mother, Ammi, for motivating me and pushing me through school all these years. Without you, I would not be where I am today.
Acknowledgements
First of all, I would like to thank Dr. David Miller, my advisor and mentor,
for having faith in me and encouraging me, despite my lack of experience with
robotics. I am particularly grateful for his patience with me. I would also like
to thank my committee members: Dr. Peter Attar, for teaching me not to
be afraid of math, and Dr. Andrew Fagg, for my formal training in practical
robotics.
I would like to thank Dr. John Fagan and Dr. Miller for teaching me so
much about robot construction, Dr. Dean Hougen for my formal introduction to
robotics, Amber Walker for her advice and guidance on everything from school
work to life in the US, Matthew Walker for introducing me to microcontrollers,
Michael Nash for his help with 8-bit microcontrollers and programming tips,
Andrew Kooiman for teaching me about robot testing and redesigning, Jacob
Henderson for his advice on electronics, and Clayton Stich for his help with
machining parts. I am also grateful for cooperation from Billy Mays and Greg
Williams at the AME machine shop.
Finally, I would like to thank Ammi and Baba, my parents, for their contin-
uing support, encouragement and love, keeping me motivated day after day.
Contents
1 Introduction
  1.1 Overview of the Theremin
  1.2 Motivation
  1.3 Objectives
  1.4 Overview of Thesis
2 Related Work
  2.1 Pitch Feedback in Robots and Humans
    2.1.1 Automated systems controlling pitch
    2.1.2 Automated systems: summary
    2.1.3 Human motor control
    2.1.4 Human motor control: summary & preliminary hypothesis
  2.2 Pitch Detection Methods
    2.2.1 Pitch detection methods: summary
  2.3 Human Robot Interfaces in Musical Robots
    2.3.1 Human Robot Interfaces: summary
3 Development of a Thereminist Robot System
  3.1 Pitch Sensor
    3.1.1 Pitch sensor design requirements
    3.1.2 Pitch sensor design
    3.1.3 Pitch sensor implementation
  3.2 Mechanics
    3.2.1 Mechanical design requirements
    3.2.2 Mechanical design & implementation
  3.3 Control Scheme
    3.3.1 Control scheme design requirements
    3.3.2 Control scheme design & implementation
  3.4 Robot Controller
    3.4.1 Robot controller performance requirements
    3.4.2 Available robot controller options
    3.4.3 Selection of robot controller
  3.5 Human-Robot Interface
    3.5.1 HRI performance requirements
  3.6 Music playing algorithm
4 Melody Extraction Scheme (MES) for Human-Robot Interface
  4.1 MES Performance Requirements
  4.2 MES Design & Implementation
    4.2.1 Pitch detection
    4.2.2 Highest octave identification
    4.2.3 Domain reduction and shifting
    4.2.4 Note extraction
    4.2.5 Melody extraction
5 Testing and Performance
  5.1 Testing of Feedback Control Scheme
    5.1.1 Response to step commands
    5.1.2 Effect of environmental disturbance
  5.2 Testing of Melody Extraction Scheme
    5.2.1 Pure tone
    5.2.2 Whistling
    5.2.3 Piano recording
  5.3 Testing of Overall System
6 Conclusion
  6.1 Discussion
  6.2 Conclusion
  6.3 Future Work
Abstract
An interactive robot has been developed to play a musical instrument called the
theremin. The robot has 4 degrees of freedom (DOF) and uses zero crossing
rate (ZCR) for pitch detection. A proportional-derivative (PD) feedback control
scheme is used to control pitch and play music. A user interface converts human
whistling sounds to a commanded melody for the robot. A melody extraction
scheme (MES) filters and shifts the input pitch to fit in the pitch generation do-
main of the theremin. In contrast to other thereminist robots, this robot is a low
cost platform using ‘hobby’ electronics. The interface enables non-musicians and
non-programmers to command the robot to play musical pieces of their choice.
Like some other thereminist robots, it can adapt to a changing environment,
and can be easily re-programmed to test different control schemes. Unlike other
thereminist robots which are set up to test predominantly feedforward control
schemes, our robot is set up to explore feedback as well as feedforward control
schemes. The development of this robot is a first step in creating tools for explo-
ration of aspects of human aural preferences, and human motor control during
musical performances. The benefits of understanding human motor control in
this context can extend beyond musical robots. This understanding can impact
human-robot interaction applications in general, leading to greater potential for
human acceptance of robotic systems.
Chapter 1
Introduction
A theremin is an electronic musical instrument that has a physical interface
amenable to manipulation by a simple robot. Unlike most other musical instru-
ments, the absence of physical contact in playing a theremin makes it difficult
to control using tactile feedback. The variability of the surrounding capacitive
field makes proprioceptive feedback inadequate. Together, these make aural
feedback a requirement. As such, it is a good vehicle for exploring issues having
to do with robots and music. This chapter presents the operating principles of
the theremin, and introduces the thesis project, providing some of the motiva-
tions and objectives. These are followed by an outline of the remainder of the
thesis.
1.1 Overview of the Theremin
Figure 1.1: A theremin with the pitch and volume antennae labeled.

A theremin (Fig. 1.1) is a musical instrument with a pitch control antenna and a
volume control antenna. These antennae act as capacitors for two LC oscillator circuits
[6]. One antenna is part of the frequency oscillator circuit, and the other is
part of the volume oscillator circuit. Placing a foreign object within range of
an antenna, e.g. a hand, detunes its LC oscillator. This changes the frequency
of oscillation. Typically this change is very small: less than 1%. For a pitch
circuit oscillating at 500 kHz, for example, this is a variation of about 5 kHz.
The 500-505 kHz signal is then converted to the audible 0-5 kHz range by
heterodyning¹ with a 500 kHz signal. Typically, a theremin is designed to give
a pitch range of 3-5 octaves. A different LC oscillator circuit controls the volume
such that the more it is detuned, the weaker the signal gets, and the lower the
volume. Similarly, the more the pitch antenna is detuned, the higher the output
frequency (Fig. 1.2). Therefore, two hands are simultaneously used to play a
theremin, one for each antenna.
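
As a rough numerical illustration of the heterodyning step described above, the following sketch takes the difference between a variable oscillator frequency and a fixed 500 kHz reference. The function name and the sample oscillator frequencies are illustrative assumptions; they are not taken from the Etherwave circuit.

```python
def heterodyne_hz(variable_hz, reference_hz):
    """Audible beat frequency: magnitude of the difference of two oscillator frequencies."""
    return abs(variable_hz - reference_hz)


REFERENCE_HZ = 500_000.0  # fixed oscillator, as in the 500 kHz example in the text

# Hypothetical detuned pitch-oscillator frequencies (illustrative values only).
for variable_hz in (500_000.0, 500_440.0, 505_000.0):
    detuning_percent = 100.0 * (variable_hz - REFERENCE_HZ) / REFERENCE_HZ
    audible_hz = heterodyne_hz(variable_hz, REFERENCE_HZ)
    print(f"{variable_hz:9.0f} Hz -> {audible_hz:6.0f} Hz ({detuning_percent:.3f} % detuning)")
```

In this example a detuning of only 1% of the 500 kHz oscillator covers the whole 0-5 kHz output range, which illustrates why a very small change in oscillator frequency is enough to span several octaves.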
Interestingly, the hand movement at one antenna does not interfere with
the other antenna. This is because the two antennae produce electric fields with
orthogonal polarizations [21]. A monopole antenna is used for the pitch and
a loop antenna is used for the volume. Furthermore, theremins are nonlinear.
The distance between two consecutive notes is different in each octave [21].
¹ Taking the difference with another frequency.

Figure 1.2: Variation of pitch (frequency) with distance from the pitch antenna of the theremin. Note the jump in frequencies further away from the antenna in the presence of a disturbing body. The disturbing body here is a 150 lb person standing about 30 cm (12 in) from the pitch antenna.
[Plot: frequency (Hz) versus motor position (dimensionless); curves: without disturbing body, with disturbing body.]

Lastly, the pitch antenna can easily be interfered with. Theremin players
have reported interference effects in a radius of about 3 m (118 in) [21]. Any
moving object near the theremin will alter the pitch (Fig. 1.2). Since the
volume antenna is controlled by vertical movements, it is much more difficult to
interfere with. There is a large number of different materials, sizes, shapes, and
even environmental conditions that may modify the electrostatic fields around
the antennae, and hence the capacitances of the oscillators. Therefore, it is very
difficult to develop a generalized model of the theremin which accounts for all
the variables.
Our theremin is a Moog standard Etherwave model. It has 4 rotary control
knobs for controlling volume, pitch, wave form, and brightness. The volume
control knob controls the response of the volume antenna. Along with the out-
put volume, it also controls the location of the point at which the volume goes
down to zero. The location of this point can be set so that it is on the antenna
(requiring contact) or some distance away from the antenna (not requiring con-
tact) depending on how the user wants to set the volume to zero during play.
The pitch knob controls the response of the pitch antenna, i.e. the size of the
useful pitch control workspace. The wave form knob controls the shape of the
audio wave form. The brightness knob controls the “character” of the sound.
For our purposes, we set the volume knob to get the lowest possible signal am-
plitude with nothing near the volume antenna. The pitch control knob was
set to get the largest possible pitch workspace. The wave form and brightness
knobs can be set to any pleasant sounding positions as determined by the user.
Testing indicated a reliable frequency variation from approximately 132 Hz to
2000 Hz, which corresponds to the range C♯3 to B♮6 on the musical scale.
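
The note range quoted above can be checked with a short calculation. The sketch below assumes twelve-tone equal temperament with A4 = 440 Hz and the MIDI note numbering convention; neither assumption is stated in the thesis, and the function names are illustrative.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
A4_HZ = 440.0   # assumed concert pitch
A4_MIDI = 69    # MIDI note number of A4


def note_frequency_hz(midi_note):
    """Equal-tempered frequency of a MIDI note number."""
    return A4_HZ * 2.0 ** ((midi_note - A4_MIDI) / 12.0)


def note_name(midi_note):
    """Scientific pitch name, e.g. 60 -> 'C4'."""
    return f"{NOTE_NAMES[midi_note % 12]}{midi_note // 12 - 1}"


def notes_in_range(low_hz, high_hz):
    """MIDI numbers of all notes whose frequency lies within [low_hz, high_hz]."""
    lowest = math.ceil(A4_MIDI + 12.0 * math.log2(low_hz / A4_HZ))
    highest = math.floor(A4_MIDI + 12.0 * math.log2(high_hz / A4_HZ))
    return list(range(lowest, highest + 1))


playable = notes_in_range(132.0, 2000.0)
print(note_name(playable[0]), note_name(playable[-1]), len(playable))  # C#3 B6 47
```

Under these assumptions C3 (about 130.8 Hz) falls just below the 132 Hz limit and C7 (about 2093 Hz) just above the 2000 Hz limit, so the usable range is C♯3 to B6, a little under four octaves.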
1.2 Motivation
Humans can have a tendency to prefer slightly distorted music over perfectly
produced music. In an article on music production equipment, Poss [30] points
out that distortion in music is like seasoning in food, and that it makes music
palatable. He argues that most guitarists use equipment based on older technology
because such equipment has a certain distortion which is pleasing to the
human ear. Another article on the resurgence of vinyl records makes a similar
point. Vinyl records produce a warmer and more nuanced sound as compared
to CDs and digitally stored audio [9]. Quoting a music collector, “Most things
sound better on vinyl, even with the crackles and pops and hisses.” Similarly,
Yochim et al. [50] conclude that vinyl is viewed as human in its sonic imperfec-
tions. Vinyl enthusiasts seem to describe its sound as “somehow more alive.”
Record store owners have claimed that the scratches and pops often asso-
ciated with the vinyl sound are all part of the “warmth” that vinyl offers [28].
An article looking at the recent growth in the vinyl industry in the UK mentions a
similar preference of the warmth of vinyl, and the “being there” quality. It is
said to provide a unique listening experience [16].
It seems that there is something about the distortion and imperfection that
is attractive. Perfection feels less natural and less attractive to humans. Perhaps
this has something to do with being human and being able to relate better to
what we perceive as human-like. Unlike those of precision manufacturing robots, our
movements are not perfectly accurate. Our speech does not sound exactly the same,
even if we repeatedly utter the same phrase. When we write, any particular
letter of the alphabet is slightly different every time it is written down. When
repeated movements are too precise, monotonous, or “unnatural”, we
label them as “robotic.” We are frequently able to distinguish between the output
of a “robotic” text-to-speech converting program and a “natural” recording
from a human. What is it that distinguishes “natural” from “robotic”? Can we
identify these distinguishing features and build them into robots to make them
more “natural” and acceptable to humans? Is it really a difference between
“natural” and “robotic” actions, or is our perception biased? If so, then is it
possible to manipulate this bias in perception? Creating a testbed to explore
these questions is one motivation behind this thesis.
A second motivation for this thesis is to explore human-robot interfaces
(HRI). As robots continue to assume an ever increasing number of roles in our
daily lives, there has been a surge in demand for more intuitive interfaces. It is
fascinating to see the variety of creative HRI solutions now available. The Nao
[2] humanoid robot, for example, has voice recognition and speech capabilities
to communicate in 8 languages. Another example is an interface that allows
a musical robot to understand non-verbal communication from a human flute
player [17]. The robot starts playing a theremin when it senses a visual cue from
the human. During play, the robot uses beat tracking and gesture recognition
to predict instantaneous tempo, and synchronizes its own play on the theremin.
HRI for musical robots can be interesting because of the variety of ways in
which music can be produced, perceived, and stored. We are interested to see
how music can be used as a mode of interaction with robots.
Another motivation is based on Science, Technology, Engineering, and Math (STEM)
education. STEM education is a vital tool for developing and grooming the
innovators, inventors, and professionals of tomorrow. Meanwhile, as robotics-related
technologies continue to become more affordable and ubiquitous, many
STEM activities based on these technologies have come into being. Most of
these activities actively engage and challenge the imagination and resourcefulness
of young minds. These minds get educated in multiple disciplines at once:
mechanical engineering, electrical engineering, computer science, mathematics,
management, and even aesthetics. Unfortunately, despite the widespread inter-
est, there are still not as many students attracted to STEM fields as would be
desirable. In 2008-09, no STEM discipline made it to the four most popular
majors in the US [38].
Music has long been a source of entertainment and attraction for adults and
youngsters alike. If music can be incorporated in STEM education activities,
then an additional number of otherwise uninterested youngsters can be motivated
to become involved. An additional motivation for this research is therefore
to add the flavor of music to robotics-related STEM activities.
1.3 Objectives
The primary objective of this thesis is to develop a set of tools to facilitate
progress in exploring the areas of motivation outlined in the previous section.
This maps on to the following specific objectives:
1. Develop a set of robotic manipulators to play a theremin.
2. Develop a pitch detection system for feedback control.
3. Develop a robot system that can be used to test different control schemes.
4. Implement a control scheme to demonstrate the robot system’s ability to
be programmed to employ a given control scheme to play a theremin.
5. Develop a user interface for commanding a music playing task to the robot
system.
To make the robot practical for the STEM objective, making the robot
system as low cost as possible is a secondary objective. Keeping the system low
cost makes it more readily accessible for schools and STEM activity platforms.
1.4 Overview of Thesis
The layout of this thesis is as follows. Chapter 1 introduces the topic of the
thesis. In Chapter 2, some related work on control systems and pitch detection is
reviewed. Chapter 3 covers the design and development of the thereminist robot
system. This includes the control scheme, pitch detection sensor, mechanics, and
the human-robot interface. Chapter 4 outlines the melody extraction scheme
(MES) used to convert human audio input to a melody, i.e. a set of music playing
commands for the robot. Design requirements for the various subsystems are
presented in Chapters 3 and 4. Chapter 5 reports results from experiments
with the robot system. These experiments are intended to verify the objectives
identified in Chapter 1, and to validate that system performance meets the
requirements presented in Chapters 3 and 4. Chapter 6 discusses the significance
of the results and outlines possible future work.
Chapter 2
Related Work
One of the objectives of this thesis is to implement a control scheme for the
theremin-playing robot. It can be helpful to review control schemes which
have been implemented in similar scenarios. Since we are dealing with a music
playing task, systems dealing with music and pitch will be considered. We will
seek inspiration from current theories on how humans perform similar tasks,
and perhaps in the process gather some more insight on the human processes
involved. Some terms used in the following sections are defined as follows.
Settling time is defined as the time taken to converge to a pre-defined pitch
error. Overshoot is defined as the maximum deviation (or pitch error) from the
intended pitch, measured from the instant when the error is first reduced to
zero. Error is defined by magnitude, and not by direction or sign. At a given
instant, it is typically defined relative to the target pitch at that instant, i.e. a
fraction or percentage.
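
The definitions above can be made concrete with a small sketch that computes both quantities from a sampled pitch trace. Two interpretations here are assumptions rather than statements from the thesis: settling time is taken as the first instant after which the relative error stays within the chosen tolerance, and the instant at which the error is "first reduced to zero" is approximated by the first sample that reaches or crosses the target. All names and sample values are hypothetical.

```python
def relative_error(observed_hz, target_hz):
    """Error magnitude as a fraction of the target pitch (sign is ignored)."""
    return abs(observed_hz - target_hz) / target_hz


def settling_time(times_s, pitches_hz, target_hz, tolerance):
    """First time after which the relative error stays within tolerance, else None."""
    settled_at = None
    for t, pitch in zip(times_s, pitches_hz):
        if relative_error(pitch, target_hz) <= tolerance:
            if settled_at is None:
                settled_at = t
        else:
            settled_at = None  # left the tolerance band, so start over
    return settled_at


def overshoot(pitches_hz, target_hz):
    """Largest relative error from the first sample that reaches or crosses the target."""
    rising = target_hz >= pitches_hz[0]
    for i, pitch in enumerate(pitches_hz):
        reached = pitch >= target_hz if rising else pitch <= target_hz
        if reached:
            return max(relative_error(p, target_hz) for p in pitches_hz[i:])
    return 0.0  # the target was never reached, so no overshoot is defined


# Hypothetical step response toward a 440 Hz target, sampled every 0.1 s.
times_s = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
pitches_hz = [400.0, 430.0, 446.0, 442.0, 440.0, 440.0]
print(settling_time(times_s, pitches_hz, 440.0, tolerance=0.01))
print(f"{100.0 * overshoot(pitches_hz, 440.0):.2f} %")
```

With a 1% tolerance the sample trace settles at 0.3 s, and its overshoot is 6 Hz relative to 440 Hz, or about 1.36%.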
2.1 Pitch Feedback in Robots and Humans
2.1.1 Automated systems controlling pitch
Three control approaches have been used in systems involving music generation or
frequency control: feedforward, feedback, and hybrid feedforward-feedback control.
We will look at examples of each of these in order to select one suitable
approach to demonstrate on our thereminist robot system.
Alford et al. [4] used a hybrid feedforward-feedback control system with
their thereminist robot. The robot manipulators were controlled using artificial
pneumatic muscles, whose response is much slower and less accurate than that of
electromechanical systems [4]. The hybrid control scheme involved
moving to a precalibrated position and then using a nonlinear pitch feedback
control loop to make corrections. Performance was compared to that of hu-
man subjects trying to match a note on the theremin with a note being played
to them from another source. Without any disturbance affecting the calibra-
tion, the robot demonstrated a settling time comparable to the best human
performances, and a much lower mean squared error. The ‘best’ performers in
this case were subjects with some musical experience, rather than professional
thereminists. Alford et al. [4] defined settling time as the time taken to reach a
given note, without specifying whether this had to be exact or within a certain
error margin. When the calibration was disturbed, the mean squared error ap-
proached that of human performance. Their results also indicated that with the
calibration disturbed, the overshoot became greater than that of humans. We
note that using a hybrid scheme did not prevent a degradation in performance
when a disturbance changed the feedforward model.
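
The general shape of such a hybrid scheme can be sketched as follows. This is not the controller of Alford et al. [4], who used a nonlinear pitch feedback loop; it is a minimal illustration assuming a hypothetical linear pitch-position relation and a simple proportional correction. The plant functions, gain, and calibration are all invented for the example.

```python
def hybrid_response(target_hz, feedforward, plant, gain, steps):
    """Jump to the precalibrated position, then correct it using pitch feedback."""
    position = feedforward(target_hz)  # feedforward move from the calibration
    pitches_hz = []
    for _ in range(steps):
        pitch_hz = plant(position)     # pitch actually produced at this position
        pitches_hz.append(pitch_hz)
        # Feedback correction proportional to the relative pitch error.
        position += gain * (target_hz - pitch_hz) / target_hz
    return pitches_hz


def calibrated_position(pitch_hz):
    """Inverse of the undisturbed plant, as recorded during a hypothetical calibration."""
    return (pitch_hz - 200.0) / 100.0


def undisturbed_plant(position):
    return 200.0 + 100.0 * position


def disturbed_plant(position):
    """Same plant after a disturbance shifts every pitch up by 20 Hz."""
    return 220.0 + 100.0 * position


clean = hybrid_response(440.0, calibrated_position, undisturbed_plant, gain=2.2, steps=20)
shifted = hybrid_response(440.0, calibrated_position, disturbed_plant, gain=2.2, steps=20)
print(f"undisturbed: first {clean[0]:.1f} Hz, last {clean[-1]:.1f} Hz")
print(f"disturbed:   first {shifted[0]:.1f} Hz, last {shifted[-1]:.1f} Hz")
```

In this toy example the feedforward move lands exactly on the note while the calibration holds. Once the disturbance is introduced, the same move starts 20 Hz off target and the feedback loop has to remove the error, which mirrors the degradation noted above.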
Using feedback control alone is another way of controlling pitch. Mizumoto
et al. [24] used a proportional-integral (PI) feedback controller for their thereminist
robot. According to Mizumoto [22], feedback control easily produces unintended
vibrations, so they switched to the feedforward scheme outlined below. However,
the fact that they had partial success in using PI control to play a theremin
indicates that there is some potential in using classical feedback control schemes
for our purpose.
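
For reference, a discrete PI controller acting on pitch error might look like the sketch below. It is not the implementation of Mizumoto et al. [24]; the plant model (pitch varying linearly with position), the gains, and the time step are hypothetical values chosen only so that the example converges.

```python
class PIController:
    """Discrete proportional-integral controller acting on pitch error (Hz)."""

    def __init__(self, kp, ki, dt):
        self.kp = kp
        self.ki = ki
        self.dt = dt
        self.integral = 0.0

    def update(self, target_hz, measured_hz):
        """Return a velocity command from the current pitch error."""
        error = target_hz - measured_hz
        self.integral += error * self.dt
        return self.kp * error + self.ki * self.integral


def simulate(controller, target_hz, steps):
    """Drive a hypothetical plant (pitch = 200 + 100 * position) with velocity commands."""
    position = 0.0
    pitch_hz = 200.0
    for _ in range(steps):
        velocity = controller.update(target_hz, pitch_hz)
        position += velocity * controller.dt
        pitch_hz = 200.0 + 100.0 * position
    return pitch_hz


final_pitch = simulate(PIController(kp=0.1, ki=0.05, dt=0.01), target_hz=440.0, steps=2000)
print(f"pitch after 20 s of simulated time: {final_pitch:.2f} Hz")
```

The integral term lets the controller remove a steady offset without any model of the pitch-position relation, which is the property that makes feedback attractive for an instrument whose response drifts with its surroundings.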
Another example of pitch feedback control is an automated guitar tuner.
Rahnamai et al. [32] used a fuzzy logic controller on their PC-based auto-
mated guitar tuner. Software detected pitch from the guitar and controlled
an actuator to physically tune the guitar. The controller design was based on
hardware characteristics and knowledge-base gathered from interviewing expert
musicians. The musicians’ knowledge was used to create fuzzy rule sets. The
pitch error of the tuner was about 5 cent², small enough to be “acceptable by
most professional musicians”. In another application, Cohen et al. [8] developed
a frequency control system for vibration motors in vibrotactile feedback devices.
They used a proportional (P) feedback control scheme. A steady frequency was
reported to be reached in under 1 s. Both these works indicate that controlling
frequency using feedback control schemes is plausible.

² A logarithmic measure of musical interval. On a 12-note musical scale, each note is 100 cent (about 5.9%) higher than the previous note. 5 cent is about 0.3%.

Now we will look at some examples of feedforward control schemes. Mizumoto
et al. [24] proposed two models for feedforward control for their thereminist
robot: parametric and non-parametric. The parametric model used a
4-parameter nonlinear relation to fit a curve to measured pitch-position data
points. The non-parametric model described the pitch-position relation using a
large number of linear interpolations. The parametric model was not affected
by the changing environment, used only 12 data points, but was relatively less
accurate than the non-parametric model. Although the non-parametric model
was relatively more accurate, it was limited in that it required many more data
points (40-80) which had to be measured again every time the environment
changed. Mean absolute error³ was used as the performance metric to determine
accuracy. This is not a useful quantitative metric since this error depends
on the total number of notes, the playing time duration, and the relative
positions of the commanded notes. In later work, Mizumoto et al. [23] proposed
an improved feedforward model which used pitch detection for updating model
parameters while playing. The root mean square of the pitch error⁴ for this
model while playing a musical piece was 72.9 cent. A qualitative look at their
results indicates significant oscillations about the desired note. So although the
idea of a feedforward model being updated in real-time is attractive and possible
to implement, it may be very difficult to have it perform reasonably well.
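
Because several of the results above are quoted in cent, a short sketch of the conversion may be useful. It uses the standard base-2 definition of the cent, under which 1200 cent is one octave; the function names and sample pitches are illustrative assumptions.

```python
import math


def cents(observed_hz, desired_hz):
    """Signed pitch error in cent: 1200 * log2(observed / desired)."""
    return 1200.0 * math.log2(observed_hz / desired_hz)


def rms_cents(observed_hz, desired_hz):
    """Root mean square of the pitch error in cent over paired samples."""
    errors = [cents(p, q) for p, q in zip(observed_hz, desired_hz)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


semitone_ratio = 2.0 ** (1.0 / 12.0)
print(f"100 cent is a frequency ratio of {semitone_ratio:.4f}")
print(f"5 cent is a frequency ratio of {2.0 ** (5.0 / 1200.0):.4f}")

# Hypothetical samples: one semitone sharp, then one semitone flat, of a 440 Hz target.
observed = [440.0 * semitone_ratio, 440.0 / semitone_ratio]
print(f"RMS error: {rms_cents(observed, [440.0, 440.0]):.1f} cent")
```

On this scale an RMS error of 72.9 cent is close to three quarters of a semitone, which is consistent with the significant oscillations noted above.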
In a different attempt at using a feedforward control scheme, Wu et al. [49]
used pitch detection to build a pitch map of the theremin. They were able to
develop a linear model based on the following observation: for a one degree of
freedom (rotational) robot arm, the arm rotation angle maps non-linearly to
distance from the theremin’s antenna, and maps linearly to the pitch produced.
They used this to develop a pitch-angle model which was approximately linear.
This model was limited in that it was valid for a specific arm length for the
robot. They also assumed that the environment would not change drastically
during the calibration and playing phases. However, for practical purposes,
changes in the pitch model due to the environment are normal and are to be
expected.

³ Mean of the error magnitude recorded at all the different instants during playing. The interval at which error was noted was not indicated by the authors.
⁴ 1200*log2(p/q), where p and q denote the observed and desired pitch in Hz. According to their definition of cent, an error of 100 cent represents a semitone (half step) error.
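
A linear pitch-angle model of this kind lends itself to a calibrate-then-invert procedure, sketched below. The thesis does not say how Wu et al. [49] fitted their model, so the ordinary least-squares fit, the function names, and the calibration samples are all assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def angle_for_pitch(pitch_hz, slope, intercept):
    """Invert the linear model to get the arm angle that should produce pitch_hz."""
    return (pitch_hz - intercept) / slope


# Hypothetical calibration sweep: arm angle (degrees) against detected pitch (Hz).
angles_deg = [0.0, 10.0, 20.0, 30.0]
pitches_hz = [200.0, 300.0, 400.0, 500.0]

slope, intercept = fit_line(angles_deg, pitches_hz)
print(f"pitch = {slope:.1f} * angle + {intercept:.1f}")
print(f"angle for 440 Hz: {angle_for_pitch(440.0, slope, intercept):.1f} degrees")
```

The fitted line is only as good as the calibration sweep it came from. If the environment changes afterwards, the slope and intercept no longer describe the instrument, which is the limitation discussed above.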
Yet another example of feedforward control is a miniature pianist robot by
Batula et al. [7]. They used pitch detection to create and update a pitch-pose
map of the piano keys. The miniature robot had a tendency to drift slightly
during play, so an update to the model was required during play. Batula et
al. [7] reported a worst case of playing 1 wrong note after every 70 correct
notes. Such a scheme would perform worse with a thereminist robot because
the pitch domain of a theremin is translated as well as transformed non-linearly
(see Fig. 1.2) so the relative distances of the pitch-producing positions are also
changed. With a piano playing robot, the domain (keyboard) may move but
the keys are still the same size and at the same relative positions. Also, pianos
produce the wrong note only after the position error reaches a critical point,
while theremins have a pitch error whenever there is any amount of position
error. Again, we see a music playing robot taking advantage of the fact that the
feedforward model is expected or assumed to change very little over time. In
a similar approach to building a feedforward model, Nakamura et al. [25] used
pitch detection to calibrate a motor-pitch map for a humanoid robotic mouth.
In contrast to feedforward schemes reviewed thus far, some automated sys-
tems have used feedforward models without using any pitch detection. These
are music playing robots which include a bagpipe playing robot [27], a musician
robot playing the piano [33], a saxophonist robot [39], and a violin⁵ playing
robot [36]. This indicates that for some instruments, the pitch-position relation
is so clearly defined that feedforward models can be developed without detecting
pitch.
⁵ This study used a specialized electronic violin.
Many musical instruments seem to maintain their pitch generation proper-
ties. Once a feedforward model is developed, it can be assumed to remain the
same, and the instrument can be controlled using a feedforward scheme. In
general, a feedforward scheme can be more accurate than a feedback scheme,
and has the potential to respond faster. However, knowing that the theremin’s
acoustic properties change with the environment, and from looking at feedforward
control attempts with other thereminist robots [4, 24, 22, 49], it is apparent
that a feedback control scheme would be most suitable for our purpose.
2.1.2 Automated systems: summary
To sum up the review of automated systems, we saw examples of different types
of control schemes: feedforward, feedback, and hybrid feedforward-feedback.
Given that most musical instruments more or less maintain their acoustic prop-
erties in a given playing session, feedforward control should be a logical choice.
In some cases, such a model can be developed without even detecting pitch.
However, if the acoustic properties change with time, then some element of
feedback control becomes a requirement. Since we are only looking to select a
control scheme for demonstration purposes, it is not important which scheme
we select. However, we do know that a theremin’s pitch properties can change
with time. Therefore, a feedback control scheme seems to be more suitable for
our thereminist robot, and this is the scheme we will use.
2.1.3 Human motor control
Thus far, we have reviewed control schemes primarily in music playing auto-
mated systems, i.e. “robotic” systems. Playing a theremin is a “natural” human
motor control task. We also know that humans prefer “natural” to “robotic”
actions. Therefore, we will now review human motor control strategies with the
intention of gaining at least an inspiration for our control system. If we can
learn enough to add “natural” characteristics to the control of our thereminist
robot, its performance might be more acceptable to humans. At the very least,
it will be interesting to compare actual human motor control schemes to robot
musician control schemes.
MacKenzie [18] deduced that motor program (efferent) and feedback (affer-
ent) aspects play a mutually facilitative role in movement control. An efferent
pathway (or motor neuron) transmits an impulse out of the spinal cord. An
afferent pathway (sensory neuron) transmits an impulse to the spinal cord.
This sensorimotor interaction is representative of a feedback control loop.
Wadman et al. [48] conducted experiments with fast arm movements, measuring
electromyography (EMG) activity of the agonist⁶ and antagonist⁷ muscles.
There were three major outcomes of their study. The first was that the
agonist activity consisted of two bursts of activity separated by a period of
depressed activity. The antagonist muscles were active during this period of
depressed agonist activity. The second outcome was that the time duration of
the initial burst of agonist activity increased with the distance which had to
be covered. This increasing initial burst of agonist activity may well represent
feedback control. Alternatively, this response could also be a preset feedforward
action, with its time duration being a function of some perception information;
in this case, the distance to move. The third finding was that if the movement
was mechanically blocked without the subjects’ knowledge, then the pattern of
the EMG activity over at least the first 100 ms was the same as in the case
of the undisturbed movement. According to Wadman et al. [48], this indicated
that the muscle activation patterns were preset over this period and
not immediately modified by proprioceptive information. This seems to indicate
feedforward control. Even if 100 ms was the reaction time of subjects, the muscles
were still activated while the perception process was still going on. These
are only weak theories which may very well be incorrect, since data was not
collected on how the perception of the blocked movement made its way to the
subject’s brain. We can see here that monitoring muscle activity alone does not
provide much information as to what control scheme is being used.

⁶ Agonist muscles produce a movement.
⁷ Antagonist muscles are those which oppose a movement.
Rothwell et al. [34] investigated the relationship between automatic and
voluntary phases of human motor control in a reflex action. They defined vol-
untary as the ability to abolish at will or greatly increase by extra effort. They
conducted experiments in which subjects were asked to maintain thumb position
in response to external disturbances. Thumb position and EMG were recorded.
They found that the magnitudes of the position responses varied for the automatic
and voluntary phases, even for the same individual. In one set of trials, a
“saturation effect” was observed: as the size of the disturbance was increased,
the EMG magnitude of the automatic response remained constant. The varia-
tion in responses seems to indicate that multiple control strategies may be at
work. The saturation effect observed may be an indication of an underlying
feedback control mechanism, whereby there is an upper limit to the magnitude
of the control commands, either by design, or due to hardware limitations.
Some works have attempted to find evidence to support hypotheses on hu-
man motor control. One such hypothesis is the equilibrium-point (EP) hypoth-
esis cited by Ghafouri et al. [15]. The EP hypothesis suggests that changing
the static component of the torque-position relation forces the reflex action to
kick in, and to either settle into a new equilibrium pose or establish a new level
of static torques. This seems to be identical to a higher level controller chang-
ing the set point for a lower level controller (the lower level controller being
the reflex mechanism). In a thereminist robot system, this would be equivalent
to a higher level controller changing the commanded pitch to play (set point)
for a lower level pitch control system (reflexive in that it reactively tries to
minimize the pitch error at all times). Ghafouri et al. [15] concluded through
experiments that fast control movements may be completed without continuous
control guidance, since during the latter part of the movement, the control
signals responsible for the equilibrium shift remained constant. This agreed
with their hypothesis that in unobstructed movements, the equilibrium shifts
end at approximately the same time as peak velocity is reached. This seems
to be similar to saturation of higher level control commands, since the control
systems responsible for equilibrium shift remained constant. For a thereminist
robot, this is identical to the saturation of actuator velocity command once peak
velocity is achieved. Alternatively, this could just be representative of a control
hierarchy; the higher level controller successfully changing the set point of the
lower level controller, and then waiting for the lower level controller to complete
the motion. In a thereminist robot this would be analogous to a higher level
controller changing the commanded pitch to play (set point) and then a lower
level controller using that pitch command to actively control the actuator(s).
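The control hierarchy described above can be illustrated with a minimal sketch. Everything in it is hypothetical: the linear pitch model, the gain, the step count, and the function names are assumptions chosen for illustration, not the control system developed in this thesis. The higher level only changes the set point; the lower level reactively reduces the pitch error.

```python
def low_level_step(position, set_point_hz, pitch_model, gain):
    """One reflex-like step: move so as to reduce the pitch error."""
    error_hz = set_point_hz - pitch_model(position)
    return position + gain * error_hz


def play(notes_hz, pitch_model, gain=0.002, steps_per_note=200):
    """Higher level: change the set point, then wait for the lower level."""
    position = 0.0
    reached_hz = []
    for set_point_hz in notes_hz:        # higher level picks the set point
        for _ in range(steps_per_note):  # lower level runs until it settles
            position = low_level_step(position, set_point_hz, pitch_model, gain)
        reached_hz.append(pitch_model(position))
    return reached_hz


def toy_pitch_model(position):
    """Hypothetical pitch model: 500 Hz at position 0, +100 Hz per unit."""
    return 500.0 + 100.0 * position


melody_hz = [523.25, 659.26, 783.99]
final_pitches = play(melody_hz, toy_pitch_model)
```

With these assumed values the pitch error shrinks by a constant factor on every step, so each note settles well before the next set point is issued.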
Reviewing work on music perception and production, Zatorre et al. [52]
cited work hypothesizing the involvement of two regions of the brain with mo-
tor control: the ventral premotor cortex (vPMC) and the dorsal premotor cor-
tex (dPMC). It has been hypothesized that the vPMC is involved in direct
visuomotor transformations and the dPMC is involved in indirect visuomotor
transformations.
Direct transformations represent a one-to-one matching of sensory features
with motor acts, e.g. matching properties of the visual object with an appropri-
ate motor gesture. For a thereminist robot, this could be identical to matching
a commanded pitch to a movement to a specific position in space which would
produce that pitch.
Indirect transformations involve motor information instructed by sensory
cues, e.g. the selection of a motor plan, and movement parameters such as
direction and amplitude. Sensory cues according to Zatorre et al. [52] represent
conditional rules as to which response to select among different alternatives.
In the context of a thereminist robot, this is identical to selecting from a set
of different arm trajectories, given the current and desired position in space to
move to.
If these hypotheses prove to be true, then both vPMC and dPMC indicate
different levels of feedforward control involved in musical performance. This
is identical to feedforward control models used by many musical robots and
automated systems as discussed in Section 2.1.1. However, even if true, we do
not know if these hypotheses also apply to musical instruments with changing
pitch models, such as a theremin. From the examples of theremin playing robots
we reviewed in Section 2.1.1, feedforward schemes seemed to be limited in their
success.
Niu et al. [26] concluded from experiments that the control of reaching
movements had three phases: suppression of proprioceptive feedback control,
followed by velocity feedback control, and then position feedback control late
in the movement. Drawing from these findings, one could hypothesize that
the first phase might very well be feedforward control. Alternatively, it could
also be saturated feedback control. The presence of velocity and position con-
trol in succession suggests that not attempting to control position earlier on
during the movement might be more efficient, or at least “natural.” These
experiments may not be entirely relevant for our purpose, since they were for
reaching movements rather than movements in a music playing task. But they
offer an interesting insight in that not only are both feedback and feedforward
control schemes used for the same task, different variables are controlled at dif-
ferent stages of the motion. As with other studies we have reviewed thus far,
we cannot deduce the details of the control schemes at work.
2.1.4 Human motor control: summary & preliminary
hypothesis
Based on the limited number of studies which we have reviewed, there is much
evidence indicating that human motor control involves varying degrees of feed-
back and feedforward control. There are some schools of thought and some facts
indicating how feedback control responds. But details, such as the actual feed-
back control algorithms employed by the human motor control system, seem to
be lacking. Furthermore, these studies in general seem to focus on motor con-
trol with either visual or proprioceptive feedback. In contrast, motor actions
with musical instruments also involve aural feedback. Only Wadman et al. [48]
seems to have looked at violin and trombone playing tasks. Other than that,
Alford et al. [4] (Section 2.1.1) recorded human performance on a theremin just
to prove that their thereminist robot performed better. Before attempting to
draw any conclusions about how human motor control functions when playing
musical instruments and using aural feedback, it is necessary to conduct a more
in-depth literature review. It may even be helpful to conduct our own studies
in this area.
We can present hypotheses, however. Based on work by Wadman et al. [48],
the feedforward part of motor control is a preset action that is some function
of perception information. Similarly, from results by Niu et al. [26], not con-
trolling position earlier on during the feedback control phase may be one factor
that distinguishes “natural” from “robotic,” whereby a classical control system
may attempt to control position for the entire duration of the phase. These
hypotheses may mark the start of a scientific study. But for the purpose of this
thesis, we can opt to use them as inspiration for the development of the control
system in the next chapter.
2.2 Pitch Detection Methods
Another objective of this thesis is to develop a pitch detection system for feed-
back control. We know from the previous section that at least some control
schemes (human as well as robotic) make use of feedback.
In order to perform feedback control we must have some means of observing
the system output. Since this will be a music playing robot, the system output
will be the tones, i.e. pitch at different intensities. The development of a pitch
detection system is therefore critical to the task of feedback control. Therefore,
in this section, we will review pitch detection methods.
In general, pitch detection algorithms can be broadly divided into three
categories: time domain based, frequency domain based, and hybrid time and
frequency domain based. Rabiner et al. [31] compared representative algorithms
from all 3 categories. They concluded that in terms of computation time, the
time domain based methods performed the fastest. A faster performance means
lower computation time, or lower sampling time, or both.
The frequency domain based Fast Fourier Transform (FFT) seems to be
a popular method of choice for robot systems. Music playing robots such as
thereminists [49, 4], a miniature pianist [7], and a saxophonist [39] have used
FFT for pitch detection. Various other systems such as a dancing robot [51], an
automatic guitar tuner [32], a music interaction system [11], a music perception
system [43], and a humanoid robotic mouth [25] have also used FFT. Of these,
Wu et al. [49] and Alford et al. [4] reported a frequency resolution of 3.91 Hz at a
sampling frequency of 8 kHz. Batula and Kim [7] reported a 99.6 % success rate
in pitch identification for their piano playing robot, correct to the nearest note.
Many systems have made use of real-time pitch feedback [49, 4, 7, 32, 43, 25].
Where identified, FFT was implemented using a computer [49, 4, 7, 32, 43].
From this it seems that FFT is widely used for pitch feedback and the popular
platform for implementation is a computer.
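The reported 3.91 Hz resolution is consistent with the usual relation between FFT bin spacing, sampling frequency, and transform length. The sketch below assumes a 2048-point transform; that size is inferred here from the reported numbers and is not stated in the text above. It also shows how long it takes just to acquire one such block of samples, which matters for feedback control.

```python
def fft_bin_spacing_hz(sample_rate_hz, fft_size):
    """Spacing between adjacent FFT bins."""
    return sample_rate_hz / fft_size


def acquisition_time_s(sample_rate_hz, fft_size):
    """Time needed to collect one block of fft_size samples."""
    return fft_size / sample_rate_hz


ASSUMED_FFT_SIZE = 2048  # assumption: inferred, not reported by the cited works

spacing_hz = fft_bin_spacing_hz(8000, ASSUMED_FFT_SIZE)
block_time_s = acquisition_time_s(8000, ASSUMED_FFT_SIZE)
```

Under this assumption, each pitch estimate needs roughly a quarter of a second of audio before any computation begins.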
Another frequency domain-based method is the Auto-correlation (AC) pitch
detection method. Mizumoto et al. [24] used an AC-based method for a therem-
inist robot system.
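As a minimal sketch of the autocorrelation idea, the signal can be compared with delayed copies of itself; the delay (lag) at which the match is strongest is taken as the pitch period. This is a generic illustration computed directly on sampled values, with assumed function names and test values, and is not the method of Mizumoto et al. [24].

```python
import math


def autocorrelation_pitch(samples, sample_rate_hz, min_hz, max_hz):
    """Return the frequency whose lag maximizes the autocorrelation."""
    min_lag = int(sample_rate_hz / max_hz)
    max_lag = int(sample_rate_hz / min_hz)
    usable = len(samples) - max_lag  # samples that have a partner at every lag
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[i] * samples[i + lag] for i in range(usable))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate_hz / best_lag


SAMPLE_RATE_HZ = 8000
tone = [math.sin(2 * math.pi * 500 * n / SAMPLE_RATE_HZ) for n in range(800)]
estimate_hz = autocorrelation_pitch(tone, SAMPLE_RATE_HZ, min_hz=300, max_hz=1000)
```

Because only whole-sample lags are searched, the estimate is quantized; a finer result would need interpolation between lags.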
Yet another popular frequency domain based method is the Cepstrum Anal-
ysis (CA) [14]. This has been used on an organ playing robot [33], a flutist
robot [40], a music interaction robot [41, 42], an interactive music generation
system [44], an interactive musical system [45] and a hearing aid [35]. Of these,
Roads [33], Suzuki et al. [41, 42], Taki et al. [45] and Sakajiri et al. [35] could
detect pitch in human speech. Where mentioned, the implementation was on
computers [41, 42, 44, 45, 35] or a network of computers [33]. Therefore, CA
is another method which is widely used and traditionally implemented on a
computer.
Zero crossing (ZCR) is a time domain based pitch detection method. Some
examples of ZCR implementation include a vibro-tactile device control system
[8], a hobbyist frequency measurement instrument [19], and a manufacturer
application note [5]. More than 99 % accuracy was reported [8] or claimed
[5]. All three are implemented on microcontrollers. Being a time domain-based
method, it should perform the fastest (based on conclusions by Rabiner et al.
[31]). But the simplicity of ZCR limits it only to signals with pure frequencies
without any noise. Even if noise is taken care of, most musical instruments
generate multiple overtones, which ZCR is unable to deal with. Therefore, ZCR
is a desirable option for pure frequencies without noise or overtones.
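The limitation described above can be seen in a short simulation. The sketch below is an illustration under assumed values (sampling rate, test frequencies, overtone amplitude), not the sensor implementation: it estimates pitch from the spacing of rising zero crossings, first for a pure tone and then for the same tone with a strong third harmonic added.

```python
import math


def zero_crossing_pitch(samples, sample_rate_hz):
    """Estimate pitch from the mean spacing of rising zero crossings."""
    crossing_times = []
    for i in range(1, len(samples)):
        if samples[i - 1] < 0.0 <= samples[i]:
            # Interpolate linearly to locate the crossing between two samples.
            fraction = -samples[i - 1] / (samples[i] - samples[i - 1])
            crossing_times.append((i - 1 + fraction) / sample_rate_hz)
    if len(crossing_times) < 2:
        return None
    span = crossing_times[-1] - crossing_times[0]
    return (len(crossing_times) - 1) / span


SAMPLE_RATE_HZ = 44100
times = [n / SAMPLE_RATE_HZ for n in range(4410)]  # 0.1 s of signal

pure_tone = [math.sin(2 * math.pi * 440 * t) for t in times]
with_overtone = [math.sin(2 * math.pi * 440 * t)
                 + 1.5 * math.sin(2 * math.pi * 1320 * t) for t in times]

pure_estimate_hz = zero_crossing_pitch(pure_tone, SAMPLE_RATE_HZ)
overtone_estimate_hz = zero_crossing_pitch(with_overtone, SAMPLE_RATE_HZ)
```

For the pure tone the estimate stays close to 440 Hz. With the overtone, extra zero crossings appear within each period and the estimate is pulled far above the fundamental.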
2.2.1 Pitch detection methods: summary
To summarize, we saw examples of frequency domain based FFT, CA and
AC methods for pitch detection. We also saw some examples of the time do-
main based ZCR detection method. FFT and CA implementations have been
computer-based, while the ZCR implementation has been microcontroller-based.
If limited computation power is not an issue, then FFT and CA seem to be the
best choice. If computation power is limited, then ZCR is the better choice. For
ZCR to work, the pitch signal should be free of noise and without overtones.
For the purpose of playing a theremin, we know that there are no overtones or
noise (more on absence of noise in Fig. 3.2, Section 3.1.1) in the pitch produced.
This makes it possible to use ZCR. Also, since we are trying to make this a low
cost pitch detection sensor, the least computationally demanding method, ZCR, is
preferable to implement on an inexpensive microcontroller which has limited
computation power.
2.3 Human Robot Interfaces in Musical Robots
The last objective in Section 1.3 is to develop an interface to command a music
playing task to the robot system. Such an interface can be as natural as talking
or singing to the robot, or non-intuitive like providing MIDI format files. It could
also require specific skills, such as the ability to play a musical instrument. In
this section, we will review previously implemented interfaces for musical robots.
Using aural cues is one way of commanding musical robots. Petersen et
al. [29] proposed an audio-based HRI for their flutist robot which enabled
it to detect the tempo and harmony from music being played by a human
musician. Similarly, Taki et al. [45] presented an audio-based HRI algorithm
which detected pitch, tempo, and volume. Tempo and volume were detected
to recognize instances of initiative exchange between a human musician and a
robot. Pitch was used to detect the current location on the musical score, which
was provided in advance. They defined the term “initiative” as the authority
to vary the performance tempo and to make another performer follow. Neither
of the two interfaces actually commanded a melody as a music playing task.
Some other systems have used aural cues in combination with additional cues
from humans. Suzuki et al. [41] developed an audio-based HRI system which
detected volume, pitch, and tempo from sounds produced by humans. This
included singing and clapping. It also detected forces and torques as induced
by the user, such as through pushing or shoving. In addition to that, it detected
color information through cameras. All this information was used to generate
music based on predefined schemes. This was an interactive music generation
system. So there was no way for the users to command it to play a specific
melody.
Similarly, Lim et al. [17] commanded their robot accompanist system through
visual and aural cues. By observing a human flutist, the robot could detect the
tempo using both visual and aural information. Using visual cues alone, it could
detect performance start and stop times. This system too was not commanded
by users to play a specific melody.
Another system which used multiple cues was a pianist robot which used
vision to read printed musical scores and then play them out [33]. It could
also understand verbal requests from the audience. To achieve this, speech
recognition was performed by using linear predictive and cepstrum analysis.
Another form of HRI which it demonstrated was that it could track a human
singing and adjust its own playing tempo to match it. It used 5 narrowly tuned
band pass filters to detect fundamental pitch from the singer’s voice.
Esnaola and Smithers [11] presented a music based HRI which did not require
speech recognition, or printed musical scores; playing a musical instrument was
optional. They developed a musical language in which the alphabet consisted of
10 musical notes which could either be whistled or played from an instrument.
It used FFT for pitch detection and kept track of the ambient noise levels. That,
coupled with a very limited set of expected musical notes, proved to be very
reliable. The system was tested successfully with classical music as ambient
noise, as well as in environments with large groups of people.
Using MIDI notes is another way of commanding robots to play musical
pieces. Alford et al. [4] interfaced with their thereminist robot by playing
on a keyboard, which transferred note information in the General MIDI (GM)
format. Another thereminist robot by Wu et al. [49] was commanded by simply
providing it with a MIDI file. Similarly, Batula and Kim [7] commanded their
miniature pianist robot using a MIDI format customized for the limitations of
the robot.
2.3.1 Human Robot Interfaces: summary
To summarize, there are many HRI implementations which require musical
instrument playing skills from a user. Some interfaces make use of simple visual
cues or physical interactions. Speech recognition and using vision to read musical
scores is another option; it does not require specific skills on part of a user.
Another interface offers both skilled and unskilled interfaces by accepting musical
instrument playing as well as whistling as commands. Yet another way is to
provide MIDI format files, either directly or through a musical instrument.
Vision and speech-based interfaces can require a lot of developmental time,
which is beyond the scope of this project. Physical interactions like touching,
bumping, tapping, etc. may not be the fastest methods of commanding a musi-
cal piece to a robot. Directly creating MIDI format files is not intuitive. Using
a musical instrument with MIDI or some other format is feasible, but would
add to hardware costs.
This leaves whistling as the least demanding interface. Pitch detection is
relatively easier to implement, and the speed of interaction can be as fast as
whistling the tune to the robot. Whistling is more intuitive as compared to
writing MIDI files, and does not require a musical instrument or MIDI
interfacing hardware. Therefore, an interface which accepts whistling as input is
most suitable for our purpose.
Chapter 3
Development of a Thereminist
Robot System
This chapter covers the development of the individual subsystems for the
thereminist robot system (Fig. 3.1). The ZCR scheme has been used to provide
pitch sensing for pitch control. Two arms, 2 degrees-of-freedom (DOF) each,
were developed to control the pitch and volume by interacting with the two
theremin antennae. A proportional-derivative (PD) control scheme is used to
control the pitch of the theremin. A human robot interface (HRI) accepts com-
mands from a user in the form of a whistled musical piece, and converts it to
a melody data structure recognized by the control system. The music playing
algorithm has been implemented as a higher level control system. The pitch
sensor has been implemented on an 8-bit microcontroller, the Atmel ATmega
32U4. The control system has been implemented on a 32-bit microcontroller,
the mbed LPC1768. The HRI has been implemented on a laptop computer
using Processing language.
Figure 3.1: Thereminist robot system with its various subsystems.
3.1 Pitch Sensor
One of the design objectives mentioned in Section 1.3 is to develop a pitch de-
tection system good enough for realtime feedback control. This section presents
the design of a pitch sensor for this purpose.
3.1.1 Pitch sensor design requirements
Factors that dictate the design of the pitch sensor are range, accuracy, response
time, and the physical characteristics of the signal. Range is important be-
cause we need to sense all the possible pitches for this application. Accuracy
is important because we will use this sensor to control pitch. Response time is
important because we need to control pitch in realtime. The faster the pitch can
be measured, the better the realtime performance. Lastly, physical characteris-
tics of the pitch signal dictate how the pitch signal is interpreted and whether
any signal conditioning is required.
Ideally, the sensor range should cover the entire audible frequency range,
which is about 20 - 20,000 Hz. But this is unnecessary if the theremin cannot
produce pitches in such a large range. Our theremin indicated a frequency range
of about 132 - 2000 Hz, or C♯3 to B6 (Fig. 1.2). With a typical disturbance⁸, this
range decreased to about 330 - 2000 Hz, or E4 to B6 (Fig. 1.2). Given that there
is no defined set of possible musical pitches for the sensor to work with, we con-
sidered only pitches available on the theremin. Since the disturbances change
the pitch domain size, we only considered the pitches consistently available on
the theremin. Although we observed the range E4 - B6 (Fig. 1.2) to be
consistently available, even with a disturbance, this was one very specific disturbance
in a particular environmental condition. Under different conditions, the lowest
pitch on the available range can vary. It was undesirable to allow the robot
to attempt to produce pitches which were not available on the theremin; this
has potential to damage the robot. So we did not risk designing for the lowest
available note. To maintain a safety margin, we limited the lowest pitch for the
playing domain to the start of the lowest full octave available. This resulted in
a pitch domain spanning octaves 5 and 6: a pitch range of 523.25 Hz through
1975.53 Hz (C5 through B6). Physically, this pitch range typically spanned a
distance of about 4.5 inches (11.5 cm) for the robot’s pitch manipulator⁹.
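The end points of this pitch domain follow from the standard equal-tempered tuning formula. The sketch below assumes A4 = 440 Hz and the common MIDI note numbering (A4 = 69); both are conventions introduced here for illustration, although they reproduce the frequencies quoted above.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def note_frequency(name, octave, a4_hz=440.0):
    """Equal-tempered frequency of a note such as ("C", 5)."""
    midi_number = 12 * (octave + 1) + NOTE_NAMES.index(name)
    return a4_hz * 2.0 ** ((midi_number - 69) / 12.0)


c5_hz = note_frequency("C", 5)
b6_hz = note_frequency("B", 6)

# All 24 pitches of octaves 5 and 6, lowest first.
playing_domain_hz = [note_frequency(n, o) for o in (5, 6) for n in NOTE_NAMES]
```

Rounded to two decimals, `c5_hz` and `b6_hz` match the 523.25 Hz and 1975.53 Hz limits of the playing domain.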
The second requirement is the accuracy. For this, we look at humans for
inspiration. The pitch discrimination threshold in humans seems to be about
1 %¹⁰. Specifically, when actively listening, a less than 1 % mistuned note
can be detected 10 % of the time by non-musicians, and 80 % of the time by
8 Typical disturbance is defined as a scenario in which a person approaches the pitch antenna. This reduces the pitch domain (see Fig. 1.2).
9 As determined experimentally with the manipulator developed in Section 3.2.2.
10 The % error in frequency in the context of this thesis is defined as $\frac{|\text{frequency difference}|}{\text{frequency intended}} \times 100$.
musicians [46]. Pitch discrimination depends on several factors, and there are
many studies which have covered this topic. For example, Demany and Semal
[10] investigated the influence of memory on auditory perception, Marmel et al.
[20] investigated how tonal expectations affect pitch perception, and Tervaniemi
et al. [47] investigated how musicians and non-musicians differed in their pitch
perception. We are not developing this robot to entertain an audience with
what they perceive as a professional musical performance. Therefore, we select
1 % error as the design requirement for our robot system. To control pitch using
the system, the sensor must be more accurate than 1 %. We select 0.5 % sensor
accuracy to allow for some margin of error for other parts of the system.
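To put these thresholds in perspective, the percentage error used in this thesis can be computed directly and compared with the spacing between adjacent notes. In the sketch below, the conversion to cents (1200 cents per octave) is a standard musical unit added here for context; it is not used elsewhere in this thesis, and the example reading of 525 Hz is hypothetical.

```python
import math


def percent_error(measured_hz, intended_hz):
    """Percentage error in frequency, relative to the intended frequency."""
    return abs(measured_hz - intended_hz) / intended_hz * 100.0


def cents(measured_hz, intended_hz):
    """Signed pitch offset in cents (100 cents per equal-tempered semitone)."""
    return 1200.0 * math.log2(measured_hz / intended_hz)


one_percent_in_cents = cents(1.01, 1.0)                 # 1 % sharp
semitone_in_percent = percent_error(2 ** (1 / 12), 1.0)  # one semitone sharp

SENSOR_REQUIREMENT_PERCENT = 0.5
example_error = percent_error(525.0, 523.25)  # hypothetical reading near C5
meets_requirement = example_error <= SENSOR_REQUIREMENT_PERCENT
```

A 1 % error is roughly a sixth of a semitone, so the 0.5 % sensor requirement is far finer than the distance between neighboring notes.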
Figure 3.2: Theremin audio signal as viewed on an oscilloscope. It is quite similar to a sine wave.
The third requirement is the response time. For our purpose, this does not
mean a specific time constraint, but that given the choice, the method with a
faster response time is preferable. Feedback control performance is sensitive to
the response time of the pitch sensor, i.e., the computation time of the pitch
detection algorithm. This translates to a preference for spending the minimum
feasible time on measuring and computing. This preference affects the selection
of the pitch detection algorithm.
The last requirement is the ability to deal with the physical characteristics
of the theremin’s audio signal. It is not symmetric about the horizontal axis, but
closely resembles a sinusoidal waveform (see Fig. 3.2). At a typical low volume
knob setting¹¹ without anything close to the volume antenna, the pitch signal
cycles between -320 mV and 640 mV. This signal is clearly audible. When
a physical object is close to the antenna, the volume is lowered. The lowest
amplitude signal that can be reliably obtained cycles between about -80 mV
and 100 mV. This lowest amplitude signal is barely audible. Since we are
playing music, only the clearly audible signal is relevant. Therefore, the sensor
must be able to detect at least the audible voltage signal (-320 mV to 640 mV).
To summarize, design constraints are as follows: a frequency range of at least
523.25 - 1975.53 Hz (C5 - B6), at least 0.5 % accuracy, and ability to process
an audio sinusoidal-like signal (-320 to 640 mV) shown in Fig. 3.2. Given the
choice, a pitch detection algorithm with the shortest response time is preferred.
3.1.2 Pitch sensor design
An early attempt while exploring design ideas was to use a guitar tuner as a pitch
sensor (Fig. 3.3). This tuner featured two arrays of LEDs as indicators: the
pitch indicator, and the tune indicator. The pitch indicator showed the pitch,
correct to the nearest standard pitch on a 12-note musical scale. The pitch
information was incomplete in that only prime and accidental information was
provided. For example, if presented with C5 and C6 in succession, it would only
indicate that C was detected, without distinguishing between C5 and C6. The
11 The lowest possible audible volume setting. See Section 1.1 for details on theremin control knobs and settings used.
Figure 3.3: Guitar tuner which was hacked for use as a first pitch sensor
second array of LEDs (the tune indicator) indicated whether the pitch was in
tune with the nearest identified pitch. It had 3 indicator LEDs, which could
indicate whether a pitch was in tune, flat, or sharp. Additional information was
provided by the flat and sharp indicator LEDs. These LEDs cycled between on
and off states. The rate of cycling indicated the degree of flatness or sharpness.
We adapted the tuner as a pitch sensor by hacking into the control signal
lines driving the LEDs in the pitch and tune indicators. Each signal line drove
a single LED. When an LED lit up, indicating a signal, the potential difference
on the signal line with reference to the common line dropped to -4.3 V. For
each of the two indicators, only a single LED could light up at a given instant.
Adapting this system as a sensor was done using two stages. The first stage
converted the signals to positive voltages using inverting Op-Amps. The second
stage was a potential divider that was designed to produce voltage as a linear
function of the signal/LED number. Fig. 3.4 shows the measured voltage output
for the pitch indicator. A similar voltage divider was used to monitor the tune
indicator.
There were 3 issues that forced us to abandon this approach. First, this
Figure 3.4: Output of first pitch sensor after the voltage divider stage indicating the pitch detection (voltage, 0 to 4, plotted against signal number, 1 to 12). Voltage is a function of the signal number activated, i.e. one of the 12 indicator LEDs which light up. Each LED corresponds to one of 12 notes on a 12-note musical scale. A similar voltage divider was used to monitor the tuning indicator LED array.
set up did not indicate any octave information; a time history of the pitch
detected and direction of pitch arm motion had to be recorded to keep track of
transitions between octaves. Secondly, the tune indicator signal cycled between
on and off states when not in tune. This cycle rate was low enough to be noticeable
by the human eye, and could be as low as approximately 1 Hz. This means that
to get pitch information for non-standard pitches, there was a variable delay
in measuring the tuning information from the sensor. This was undesirable for
feedback control as it could hold up a potential control scheme waiting for sensor
data by several hundred milliseconds. Thirdly, the tradeoff between resolution
and response time was poor in every aspect. Using only the tune indicator
provided limited resolution: only 2 measurement points between 2 standard
pitches. For example, between C5 and C♯5, we could only detect C5-too-sharp,
and C♯5-too-flat. To increase resolution, if we tried to measure the cycling rate of
the tune indicator signal, then that required multiple readings and increased the
measurement time delay even further. To decrease the response time, if we only
detected the in-tune signal from the tune indicator, then we would have been
left with only the ability to detect exact frequencies, essentially eliminating all
resolution information between two standard pitches. Therefore it was decided
that this idea was not feasible for feedback control. If we need to measure time
intervals for high-low transitions to increase accuracy, then we might as well
directly measure the time period of an oscillating audio signal. This led us to
explore alternative pitch detection methods and develop our own sensor.
With reference to implementing Zero Crossing Rate (ZCR), the audio signal
from the theremin has a negative part to it, lacks a clean “edge,” and has
a very small amplitude (see Fig. 3.2). In contrast, microprocessors and
microcontrollers typically tolerate voltage inputs from 0 V through 5 V (or 3.3 V
which is becoming more common now). Microcontrollers and integrated circuits
(ICs) performing counting/timing operations typically require step changes in
the input signal in order to positively register quick voltage (logic) level changes.
Typically, these changes must be on the order of 1 V. Therefore, a signal condi-
tioning circuit was designed to convert the audio signal to a square wave varying
from 0 - 5 V and centered around 2.5 V. The circuit schematic is shown in Fig.
3.5. The first stage positively clamps the audio signal so that it oscillates around
2.5 V. The second stage is a voltage comparator with a reference set to 2.5 V, so
that it gives a high (5 V) when the signal is higher than 2.5 V, and a low (0 V)
when the signal is lower. The last stage provides failsafe protection for any
microcontroller or processor input port. In the event of overvoltage or overcurrent,
this stage would break down, protecting the components after it.
Figure 3.5: Filter for pitch sensor. Converts low amplitude audio signal to positive square waves centered around 2.5 V.
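The effect of the first two stages can be approximated numerically. The sketch below is an idealized model, not a simulation of the circuit in Fig. 3.5: the clamp is treated as a pure shift that re-centers the signal on 2.5 V, the comparator is treated as ideal, and the sampling rate and 1 kHz test tone are assumed values.

```python
import math


def theremin_like(t, freq_hz):
    """Sinusoid swinging between -0.32 V and 0.64 V (Section 3.1.1)."""
    return 0.16 - 0.48 * math.cos(2 * math.pi * freq_hz * t)


def condition(samples_v, center_v=2.5, reference_v=2.5):
    """Idealized clamp (re-center on center_v) followed by a comparator."""
    mean_v = sum(samples_v) / len(samples_v)
    shifted = [v - mean_v + center_v for v in samples_v]
    return [5.0 if v > reference_v else 0.0 for v in shifted]


def count_rising_edges(square):
    """Count low-to-high transitions in the comparator output."""
    return sum(1 for a, b in zip(square, square[1:]) if a < b)


SAMPLE_RATE_HZ = 1_000_000
N_SAMPLES = 10_000                        # 10 ms of signal
duration_s = N_SAMPLES / SAMPLE_RATE_HZ

audio = [theremin_like(n / SAMPLE_RATE_HZ, 1000.0) for n in range(N_SAMPLES)]
square = condition(audio)
edges = count_rising_edges(square)
estimated_hz = edges / duration_s
```

The output takes only the two logic levels, and its rising edges occur once per period of the input, which is what a counter or timer input needs.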
There is a choice between frequency domain and time domain-based pitch
detection methods. There is a requirement to perform realtime pitch feedback
control, i.e., the fastest practical pitch detection algorithm is desirable. As
discussed in Chapter 2, the time domain based methods are faster but limited
to working with pure frequencies without noise. The theremin audio signal does
not have any noticeable noise (see Fig. 3.2). Therefore, a time domain-based
method is the best choice. The Zero Crossing Rate (ZCR) detection principle
was the time domain based method discussed in Section 2.2. It was also the
method which was found to be most suitable in that section. Therefore, this is
the method we selected to use for the sensor.
A simple method of error rejection was added to the pitch detection algo-
rithm: abnormally high readings (above 5000 Hz) were rejected and the reading
was repeated.
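This rejection rule can be sketched as a small retry loop. The function and parameter names below are hypothetical, and the limit on the number of attempts is an assumption added so that the loop cannot run forever; the text above only states that the reading was repeated.

```python
def read_pitch(read_raw_hz, max_valid_hz=5000.0, max_attempts=10):
    """Repeat the reading until it is not abnormally high."""
    for _ in range(max_attempts):
        reading_hz = read_raw_hz()
        if reading_hz <= max_valid_hz:
            return reading_hz
    return None  # assumed behavior: give up after max_attempts bad readings


# Example with a fake sensor that glitches once before settling.
fake_readings = iter([12000.0, 880.0])
pitch_hz = read_pitch(lambda: next(fake_readings))
```

Here the first reading is discarded because it exceeds 5000 Hz, and the second is accepted.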
3.1.3 Pitch sensor implementation
There are two ways of implementing ZCR. One way is to measure the number
of signal oscillations for a fixed time interval. Another is to measure the period
of a fixed number of signal oscillations. We implemented ZCR using this second
way; measuring the period for a single oscillation is faster than waiting for a
fixed time duration for a number of oscillations.
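The difference between the two approaches can be quantified with a rough estimate. The sketch below assumes that counting whole oscillations in a window can be off by one cycle, which is a simplification introduced here, and asks how long the window must be to meet a given accuracy. It then compares that with the duration of a single period at the bottom of the playing domain.

```python
def window_needed_s(freq_hz, percent_accuracy):
    """Window length for which one miscounted cycle stays within accuracy."""
    allowed_error_hz = freq_hz * percent_accuracy / 100.0
    return 1.0 / allowed_error_hz


def single_period_s(freq_hz):
    """Time to observe one full oscillation."""
    return 1.0 / freq_hz


LOWEST_PITCH_HZ = 523.25  # C5, the lowest pitch in the playing domain

window_s = window_needed_s(LOWEST_PITCH_HZ, 0.5)
period_s = single_period_s(LOWEST_PITCH_HZ)
speedup = window_s / period_s
```

Under this assumption, the fixed-window approach needs about 0.38 s per reading at C5 to reach 0.5 % accuracy, whereas a single period lasts about 1.9 ms.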
Initially, ZCR was implemented using a 555 timer, binary coded decimal
counters, and shift registers. A microcontroller was used to read the clock from
this circuit and transmit the clock value over 5 V TTL serial. Although it
worked reliably, it took up a lot of space, was difficult to debug and modify, and
it seemed inefficient to use a microcontroller just for serial transmission.
Eventually, ZCR was implemented on an 8-bit microcontroller (ATmega
32U4) running at 16 MHz. Using a 2 MHz clock, the microcontroller counts the
number of clock pulses between two consecutive rising edges of the incoming
signal. The count is stored in a 16-bit integer and transmitted over 5 V TTL
serial at 57600 bps. The transmission packet consists of 3 bytes: a header
byte and the two data bytes. A 1 ms delay follows each packet transmission.
With a 3 byte packet (30 bits at 8N1, counting start and stop bits) per reading,
it takes about 0.5 ms to transmit 1
packet. With a 1 ms delay, this can ideally provide pitch feedback at a rate of
667 Hz. This was deemed more than sufficient for this project. The frequency
is computed at the receiving end using the count and the clock frequency.
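A minimal sketch of the receiving-end computation is given below. The 2 MHz clock and the 3-byte packet come from the description above; the header value (0xFF) and the byte order (high byte first) are assumptions made for illustration, since they are not specified here. The sketch also estimates how much a single miscounted clock pulse changes the computed frequency.

```python
TIMER_CLOCK_HZ = 2_000_000  # 2 MHz counting clock (from the text)
ASSUMED_HEADER = 0xFF       # hypothetical header value


def decode_packet(packet, header=ASSUMED_HEADER):
    """Return the 16-bit count from [header, high byte, low byte], else None."""
    if len(packet) != 3 or packet[0] != header:
        return None
    return (packet[1] << 8) | packet[2]  # assumed byte order: high byte first


def count_to_frequency(count):
    """Frequency of a signal whose period spans `count` clock pulses."""
    return TIMER_CLOCK_HZ / count


def one_count_error_percent(freq_hz):
    """Relative frequency change caused by miscounting a single clock pulse."""
    count = round(TIMER_CLOCK_HZ / freq_hz)
    exact = count_to_frequency(count)
    off_by_one = count_to_frequency(count + 1)
    return abs(exact - off_by_one) / exact * 100.0


count = decode_packet(bytes([0xFF, 0x0E, 0xEE]))    # 0x0EEE = 3822 pulses
frequency_hz = count_to_frequency(count)            # close to C5
worst_case_percent = one_count_error_percent(1975.53)  # top of playing domain
lowest_measurable_hz = count_to_frequency(0xFFFF)   # limit of a 16-bit count
```

At the top of the playing domain a one-pulse miscount changes the result by just under 0.1 %, which leaves margin against the 0.5 % requirement. A 16-bit count at this clock rate also covers frequencies well below the playing domain.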
The sensor system was tested with a range of sine wave frequencies using a
frequency generator. Since only octaves 5 and 6 were expected to be played,
only that frequency range was tested. The test signal minimum and maximum
Figure 3.6: Pitch sensor validation test results at various frequencies, showing the mean sensor reading error (%) with 2σ bounds against the actual frequency (Hz). Number of samples is 5992, 10206, 10563, 10617, 10618, and 10618, for test frequencies of 400 Hz through 2400 Hz.
voltages were -300±20 mV and 640±20 mV, respectively. This corresponds to
the audible signal mentioned in Section 3.1.1. Results are illustrated in Fig.
3.6. Mean absolute error was always less than 0.2 %. For each frequency, about
6,000 to 10,000 samples were taken. Informal tests were carried out with the
low amplitude form of the audio signal 12 and the sensor was able to provide
reliable results.
3.2 Mechanics
One of the objectives mentioned in Section 1.3 is to develop a set of robotic
manipulators with adequate degrees of freedom to play a theremin. This section
12Cycling from -80 mV to 100 mV. See Section 3.1.1
36
covers the design of a robot platform with such a set of manipulators.
3.2.1 Mechanical design requirements
As mentioned in Section 3.1.1, the pitch range of interest can be controlled
over a distance of about 11.5 cm (4.5 in). A manipulator or arm is therefore
required for pitch control. It can use a linear or rotational degree of
freedom (DOF). Since the arm needs to be pointed towards the pitch antenna
regardless of the placement of the robot, a rotational degree of freedom is
required for this adjustment. Similarly, a volume control arm is required to
control the volume during play. Just like the pitch control arm, it must be
pointed correctly relative to the theremin, depending on the location of the
antenna.
Instead of using these additional DOFs to point the arms toward the antennae,
we could have used a moving base with a differential or omni-wheel drive system.
But in such a system, positioning would be more difficult since the pose of both
arms would have to be considered simultaneously; this complicates the kinematics.
By using additional arm DOFs to control arm pose, each arm can be controlled
independently, which simplifies the kinematics involved.
With reference to pointing the arms correctly, it must be emphasized that
this design is not intended to be optimized to reach the antennae from the
largest possible number of poses. Rather, the intention is to handle minor
variations in pose.
Experiments with earlier versions of the robot involved using pull-pull cables
to control the rotational DOFs on the arms (Fig. 3.7) because such a system
offered a rapid response time. Those arms proved very difficult to control, and
their movement was very non-linear. There was very noticeable backlash and
hysteresis. In some poses, the loads on the actuators exceeded their
capacities. These factors required us to design an additional control scheme
for the kinematics, demanding additional computation resources. Also, there
was no guarantee that the actuator loads would not be excessive. This
discouraged us from pursuing an approach involving a system of cables, and
biased us towards systems with simpler kinematic models and no apparent
excessive loads on the actuators. However, we did retain the idea of not loading
any actuator with the weight of another actuator.
Figure 3.7: An early version of the robot using pull-pull cables to control the
arms. This system proved very difficult to control and was not pursued further.
However, the idea of not loading any actuator with another actuator was retained.
In terms of space requirements, the robot needed to house the controller
and sensor circuits. Finally, easy access was required for the controller/sensor
circuits for troubleshooting and upgrade purposes.
3.2.2 Mechanical design & implementation
First, we needed to decide whether the different DOFs should be linear or
rotational. For pointing the arms toward the antennae, either could have been
used. A rotational DOF can rotate an arm to align it with the antenna. A
linear DOF can translate the arm to align it with the antenna. This latter
mechanism would have required more space for the arm to physically move
around, which would tend to increase the size of the robot. In contrast, for the
rotational DOF, only mounting space for an actuator is required. Therefore, we
selected rotational DOFs for pointing the arms towards the antennae.
Figure 3.8: Illustration of the volume control arm kinematics.
Again, for the control (pitch and volume) DOFs for either arm, both
rotational and linear options were possible. The volume antenna of the theremin
lies approximately in a horizontal plane (Fig. 1.1). As mentioned in Section 1.1,
it is controlled by relative motion orthogonal to this horizontal plane, i.e., by
vertical movements. If a linear DOF were used for volume control, the volume
arm would have to be directly above the antenna (moving vertically). A
rotational DOF would not have this restriction. A linear arm moving vertically
would also make the robot bigger, while a rotational DOF would only require
mounting space for the actuator. Therefore a rotational DOF was selected for
volume control. The volume arm kinematics are illustrated in Fig. 3.8.
Figure 3.9: Illustration of the pitch control arm kinematics.
For the pitch control DOF of the pitch arm, distance from the antenna
needs to be controlled. Initially, we tried using a rotational DOF, since it was
easier to implement. However, the pitch-position relation is already non-linear.
The angle-position relation is also non-linear, and this introduced an additional
non-linear characteristic to the control trajectory of the arm. To simplify the
kinematics and focus just on the theremin control dynamics, a linear DOF was
selected for the pitch control. The pitch arm kinematics are illustrated in Fig.
3.9.
The next step after DOF selection is the selection of actuators. Originally,
we were using the KIPR CBC V2 robot controller (see Table 3.2). So the
selection was limited to actuators supported by the built-in motor controllers
and power supply of the CBC. This included the SG5010 and the CS-60. Both
are available in servo and motor configurations. Their specifications are listed
in Table 3.1.
Table 3.1: Performance specifications for actuator options. All specifications
are listed for 6 V operation.
Actuator                  CS-60              SG5010
maximum speed (deg/s)     375                429
maximum torque (oz-in)    49                 152.8
option                    servo, DC motor    servo, DC motor
position encoder          no                 no
Testing with earlier versions of the robot indicated that the motion of the
SG5010 actuators was not as smooth as that of the CS-60 actuators, and
they occasionally jittered. The SG5010 actuators did offer higher torque and
speed. This was an important advantage since the weight of the volume arm
had to be supported, and its rotational inertia had to be overcome as fast as
possible. Therefore, for the volume DOFs we selected the higher torque SG5010
actuators. The occasional jitter is acceptable for the volume antenna, as long as
the approximate position is maintained. To maintain position, the easier and
cheaper solution was to use the servo option for the SG5010. For the pitch
arm, predictable operation was more important, so the CS-60 option was
selected. Its lower torque was acceptable since the weight of the pitch arm did
not have to be supported by either actuator. For the rotational DOF that points
the pitch arm toward the antenna, a CS-60 servo was used, as again, this was the
easier and cheaper solution to maintain position. For the linear DOF for pitch
control, a servo would have been easy to control, but there is a lag due to the
internal position controller. Also, with this servo, there is no way of knowing
whether the commanded position has been reached or not. Therefore, for direct
control (rather than merely controlling the set point for an internal servo
feedback controller), we selected a CS-60 motor. Control with a motor is more
accurate since we can use pitch detection to determine whether the commanded
position (for a commanded pitch) has been reached or not. Also, with direct
control, the lag in using a
servo actuator is eliminated. All these actuators are typically used at 6 V, but
they can be safely used at 7.2 V. This improves the performance slightly. We
opted to supply them with 7.2 V during use.
The next step is building the arms. For the pitch arm, we selected a 3.5
mm (0.14 in) carbon fiber tube, which was the lightest reasonably rigid option
available, i.e. it did not flex under its own weight. To support the tube, a
Delrin base with an identically sized groove was used (see Fig. 3.10a). Delrin
was used because it has a very low coefficient of friction; with the lower torque
CS-60 motor, we needed to minimize friction. To drive the tube back and forth,
a LEGO wheel and tire system of approximately the same thickness was used (see
Fig. 3.10a). KIPR motors are available with an attachment for connecting a LEGO
shaft. The LEGO drive wheel was therefore attached to the motor with a LEGO
axle. The linear motion is illustrated in Fig. 3.10b and 3.10c. An
aluminum foil ‘hand’ was mounted at the end of the arm to act as a grounding
plate and hence control the pitch. The foil was connected to ground. With a
wheel diameter of 1.21 inches and a top motor speed of 375 deg/s[1], the pitch
arm can theoretically travel at 3.96 in/s (at 6 V). The actual speed was later
found to be about 3.46 in/s (at 7.2 V). This was an average speed from a test
trajectory that included the velocity profiles to accelerate from rest and then
decelerate and come to a stop: the pitch arm travelled a total distance of 4.5
inches in 1.3 s. This actual speed of 3.46 in/s is less than the theoretical 3.96
in/s because friction reduces speed, and the acceleration-deceleration profiles
reduce the total distance travelled as compared to a trajectory where the motor
travels at maximum speed at all times.
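The two speed figures quoted above can be reproduced with a short calculation.
The sketch below assumes the tube moves at the rim speed of the drive wheel
without slipping; the function name is hypothetical, and the numbers are the
ones given in the text.

```python
import math


def rim_speed_in_per_s(wheel_diameter_in: float, motor_speed_deg_per_s: float) -> float:
    """Linear speed at the wheel rim: circumference times revolutions per second."""
    revolutions_per_s = motor_speed_deg_per_s / 360.0
    return math.pi * wheel_diameter_in * revolutions_per_s


# Theoretical top speed from the wheel diameter and the rated motor speed at 6 V.
theoretical_speed = rim_speed_in_per_s(1.21, 375.0)

# Measured average speed: 4.5 inches travelled in 1.3 s (at 7.2 V).
measured_speed = 4.5 / 1.3

print(f"theoretical: {theoretical_speed:.2f} in/s")
print(f"measured:    {measured_speed:.2f} in/s")
```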
Finally, the pitch arm was mounted on a CS-60 servo to allow for correct
orientation with reference to the theremin (Fig. 3.10d). The pitch arm length
was designed based on the required playing length of about 4 inches, plus 200 %
extra to easily allow for improper positioning¹³, plus 1 inch of safe length to be
left behind the drive wheel axle. Some extra length was also required to allow
for the fact that the axle was mounted away from the front of the robot. With the
axle mounted about 5 inches from the front of the robot, the total arm length
came out to be 18 inches.
Figure 3.10: Illustration of pitch arm mechanics. (a) Linear motion mechanism.
(b) A forward position. (c) A backward position. (d) Mounted with servo for
correctly orienting the arm in the direction of the antenna.
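The arm-length budget described above adds up as shown in this small sketch.
The function and parameter names are hypothetical; the default values are the
approximate figures quoted in the text.

```python
def pitch_arm_length_in(
    playing_length_in: float = 4.0,   # required playing length
    extra_fraction: float = 2.0,      # 200 % extra for improper positioning
    safe_length_in: float = 1.0,      # length left behind the drive wheel axle
    axle_offset_in: float = 5.0,      # axle distance from the front of the robot
) -> float:
    """Total pitch arm length as the sum of the allowances listed in the text."""
    reach_in = playing_length_in * (1.0 + extra_fraction)
    return reach_in + safe_length_in + axle_offset_in


print(pitch_arm_length_in())  # total length in inches
```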
The volume arm design is more straightforward. A cardboard tube was used
as an arm. This was mounted onto the volume control servo actuator, which
was in turn mounted on the direction control servo actuator (see Fig. 3.11a).
A cardboard tube was easier to attach to the servo horn than a thin carbon
fiber tube. Also, unlike the carbon fiber, the cardboard tube did not flex when
rotated from one end in this configuration. The arm is rotated up to the ‘high’
position for a high, audible volume (Fig. 3.11a) and rotated down close to the
antenna in the ‘low’ position for an almost inaudible volume (Fig. 3.11b). The
length of the tube was selected as follows: when the right side of the robot was
aligned with the theremin, and the robot was 3 inches away from the theremin,
the downward projection of the volume arm, when rotated horizontally, passed
approximately through the center of the circular segment of the antenna. A
distance of 3 inches from the theremin is a typical placement which allows the
robot to be close enough to operate the theremin, but far enough so that the
robot does not collide with any of the plugs connected to the front panel of
the theremin. The design point selected is typically where thereminists place
their hand while controlling the volume. This design point is illustrated in Fig.
3.11c. This resulted in an arm length of 8.5 inches from pivot point to the end.
¹³ So that the robot does not have to be placed at exactly the same relative pose
each time. Allowance for improper positioning makes setting up the robot easier.
Figure 3.11: Design and implementation of volume arm. The workspace boundary
is illustrated. (a) A typical volume up position. (b) A typical volume down
position. (c) Design point for the volume arm.
Figure 3.12: Front view of the robot with the theremin. The workspaces of the
pitch control arm (left) and volume control arm (right) are indicated.
The volume arm was tested with the servo configuration, and it could easily lift
the cardboard arm; there was no noticeable difference in the servo performance
with and without the weight of the volume arm.
To summarize the design and implementation for the mechanics, there are 2
arms, one each for pitch and volume control. Each arm has 2 DOF. The pitch arm
has 1 translational DOF for playing and 1 rotational DOF for pointing the arm
in the correct direction prior to playing. The volume arm has 1 rotational DOF
to move between two possible volume positions and another DOF to point the
arm correctly before initiating play. The translational DOF is provided by a
motor, while the other 3 DOF are provided by servo actuators. The workspace
for the pitch arm is limited to part of a horizontal plane (Fig. 3.12 and 3.13).
The workspace of the volume arm spans the surface of a hemisphere (Fig. 3.12
and 3.13).
To fulfill the space requirement, the platform base was designed somewhat
like a table top with arms attached on either side. The base is about as long as
the theremin and wide enough to provide adequate prototyping space.
Figure 3.13: Isometric view of the robot with the theremin. The workspaces of
the pitch control arm and volume control arm are indicated.
3.3 Control Scheme
One of the objectives mentioned in Section 1.3 is to implement a control scheme
to demonstrate the robot system’s ability to be programmed to employ a given
control scheme to play a theremin. Therefore, this section presents the devel-
opment of such a control scheme.
3.3.1 Control scheme design requirements
Design requirements for a control system typically include settling time, max-
imum overshoot and a maximum steady state error. The primary concern for
our application is to ensure that humans do not perceive any overshoot, steady
state errors, or very slow transition between musical notes.
As mentioned in Section 3.2.2, it was experimentally determined that the
pitch arm moves at an average speed of 3.46 in/s (at 7.2 V), covering a distance
of 4.5 inches in 1.3 s, including acceleration and deceleration times. We also
noted in Section 3.1 that the desired pitch domain for playing spans about 4.5
inches. Therefore the system can move across the entire pitch domain, i.e.
across 24 notes, in about 1.3 s. This is a limitation of the dynamics of the
hardware, so at this point no stricter settling time can be specified. The
requirement is to reach the desired pitch and settle on it as quickly as possible.
As discussed in Section 3.1.1, based on human perception ability, a pitch
error of 1 % was selected for the robot system. Therefore, we select 1 % as the
design requirement for maximum steady-state error of the control system.
In playing music, ideally, there should be no perceptible overshoot. Drawing
from the steady-state requirement, we also select 1 % as our design constraint
for the maximum overshoot.
To summarize, the design constraints are as follows: a maximum overshoot
of 1 %, and a maximum steady-state error of 1 %. As for settling time, the
requirement is to reach the desired pitch and settle on it as quickly as possible.
3.3.2 Control scheme design & implementation
As discussed in Chapter 2, a controller can be feedforward, feedback, or a
hybrid of both. Since the acoustic properties of a theremin do not remain
constant with time, feedforward control is not useful unless it adapts to the
changing environment, as was eventually done by Mizumoto et al. [23] for their
theremin playing robot. For completeness, we tried a feedforward control
model using a pitch-position lookup table. But for adequate performance,
measurements for the lookup table had to be taken again every time anything
in the environment changed, which is a cumbersome process. Faced with this
scenario, we could have proceeded to develop a feedforward model that adapted
to the changing environment, following the strategy adopted by Mizumoto et
al. [23]. However, given the potential levels of computation that would have
been involved, and the fact that pitch detection would still have to be used, we
decided not to pursue this approach. Instead, we opted to use pitch detection
for feedback control. Without an elaborate feedforward model to modify at
run-time, we expect a feedback control approach to be computationally
less demanding.
We propose a classical proportional-derivative (PD) controller, with
percentage frequency error as feedback. Initially, we tried using linear PD
control, but that only worked for a limited range of pitches. This limitation was
expected, given the non-linear variation of pitch as a function of distance from
the antenna, as illustrated in Fig. 1.2. To deal with the non-linearity, we tried
a non-linear PD control where the controller gains were functions of frequency.
These functions were derived from curve fitting the pitch-position mapping of
the theremin, as illustrated in Fig. 1.2. This scheme did not perform very well.
One reason was that the pitch domain is slightly different under different
environmental conditions. Another reason was that our feedback error was still
the absolute frequency error, which is non-linear. To mitigate the dependence
on an absolute term such as frequency error, we tried using a relative error
term: percentage frequency error (relative to target frequency). We started off
by testing it with a simple PD control scheme. Preliminary testing showed
promising results, so we opted to use this strategy without any modifications.
Hence the proposed PD controller uses percentage frequency error as feedback.
Below is the algorithm for the control loop (Algorithm 1).
Algorithm 1 PD feedback control
while timeElapsed < timeout do
    error = (fCmd − fCurr)/fCmd ∗ 100
    powerCmd = pGain ∗ error + dGain ∗ (error − errorPrev)/(timeCurr − timePrev)
end while
The above controller generates a power command as a percentage of the
maximum voltage supplied to the motor: -100 represents full voltage in reverse,
and 100 represents full voltage forward. The power command generated by the
controller is offset by ±24 % (depending on the direction) to compensate for
the motor deadband. For example, if the power command is -30 %, it is offset
to -54 %. Any power command beyond ±76 % is saturated to ±100 %.
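One pass of Algorithm 1, followed by the deadband offset and saturation
described above, might look like the sketch below. The function names, the gain
values, and the time step used in the example are hypothetical. Leaving a zero
command at zero is also an assumption, since the text does not say how a zero
command is handled. Updating errorPrev and timePrev between passes is left to
the caller.

```python
import math

DEADBAND_OFFSET = 24.0       # percent, from the text
SATURATION_THRESHOLD = 76.0  # percent, from the text


def pd_step(f_cmd, f_curr, error_prev, dt, p_gain, d_gain):
    """One pass of the loop body: returns (power command, percentage error)."""
    error = (f_cmd - f_curr) / f_cmd * 100.0
    power_cmd = p_gain * error + d_gain * (error - error_prev) / dt
    return power_cmd, error


def compensate_deadband(power_cmd):
    """Offset the command past the motor deadband and saturate at +/-100 %."""
    if power_cmd == 0.0:
        return 0.0  # assumption: a zero command leaves the motor stopped
    if abs(power_cmd) > SATURATION_THRESHOLD:
        return math.copysign(100.0, power_cmd)
    return power_cmd + math.copysign(DEADBAND_OFFSET, power_cmd)


# Example with made-up gains: target 1000 Hz, current reading 990 Hz.
power, error = pd_step(1000.0, 990.0, error_prev=0.0, dt=0.01, p_gain=2.0, d_gain=0.0)
print(error, power, compensate_deadband(power))
print(compensate_deadband(-30.0))  # the worked example from the text
```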
3.4 Robot Controller
One of the objectives mentioned in Section 1.3 is to have a robot system on
which different control schemes can be tested. This can be met by using a
programmable controller on which different control schemes can be programmed.
This section outlines the selection of a suitable robot controller.
3.4.1 Robot controller performance requirements
The primary requirements for the controller relate to input, output, memory
and programmability. In terms of input, a 5 V TTL serial port is required to
receive data from the pitch sensor. This is based on the output of the pitch
sensor (see Section 3.1.3). In terms of output, control of 3 servos and one
motor is required. This is based on the selection of actuators for the robot
manipulators in Section 3.2.2. In terms of data storage, some way of logging
performance data for later retrieval and analysis is required. This is crucial for a
platform that is to implement different control systems so that their performance
can be compared. In terms of programmability, the input and output data
rates should be customizable. This comes from preliminary testing with the
CBC robot controller in which the input and output data rates are, by design,
fixed at 50 Hz. Even though the pitch sensor could provide data at higher
rates, we were limited by this data rate, and could neither sense nor control
at higher rates. Another programmability requirement is that the controller be
relatively easy to develop on, i.e. the supporting libraries should be available
and relatively reliable/stable.
3.4.2 Available robot controller options
Three options were readily available. The CBC robot controller by the KISS
Institute for Practical Robotics (KIPR) is a previous generation educational
robotics controller. It is a complete solution with built-in motor controllers,
servo control, and servo power supply. It has a 3.3 V serial port (5 V tolerant).
The Link is the latest controller by KIPR. It was released at the time of this
selection (January 2013) and has similar features. The mbed LPC1768 is a
prototyping platform based on a 32-bit microcontroller. It lacks the built-in
motor controllers and servo power supplies. It does have PWM output which
can be used to send control signals to external motor controllers and externally-
powered servos. The relevant features of all three systems are summarized in
Table 3.2.
3.4.3 Selection of robot controller
The CBC V2 is limited to accessing the serial port and controlling a motor at
up to 50 Hz, so it is not suitable for this project. The Link and mbed do not have
these limitations. Motor control in the Link is limited to 300 Hz, but that limit
is high enough to be acceptable for the purposes of this project. Although
the Link is superior in terms of features, the firmware and native libraries for the
Link were still under development at the time of this evaluation, so there was no
guarantee that all of the proposed features would be available and be reasonably
bug-free. Therefore the somewhat limited but developmentally stable mbed
microcontroller was the choice for this project.
Table 3.2: Summary of robot controller performance parameters.
Controllers          CBC V2             Link                 mbed
clock speed (MHz)    350                800                  96
motor control        built-in, 50 Hz    built-in, 300 Hz     none
servo control        PWM outputs        PWM outputs          PWM outputs
servo power          built-in           built-in             none
TTL serial           1, @ 50 Hz         1                    3
voltage level        3.3 V              3.3 V                3.3 V
5 V tolerant         yes                yes                  yes
data-logging         USB stick          USB stick            2 MB onboard
firmware/IDE         stable             under development    stable
To complete the controller requirement, the VEX Robotics Motor Controller
29 was selected. This is a low cost motor controller which can supply a maximum
4 A of current and can be controlled by a single PWM pin. Later a TTL serial
controlled servo driver module was added for controlling the servos when it was
discovered that the mbed did not support multiple PWM frequency generation.
Servos are controlled at 50-60 Hz which is lower than the PWM frequency used
for the pitch motor control. In the current implementation (see Section 3.6)
control of one volume servo is required for adjusting volume during play.
3.5 Human-Robot Interface
The last specific objective in Section 1.3 is to develop a user interface for enabling
a human to command a music playing task to the robot system. This section
covers the design of such a Human-Robot Interface (HRI).
3.5.1 HRI performance requirements
The only performance requirement for the HRI is that the interface be intuitive
for people outside the realm of computer science and music. Keeping this in
perspective, the easiest way for an individual to command the robot system to
play a musical piece or melody would be to simply play it to the system. This
can be done in the form of a simulated musical instrument, or it can simply be
hummed or whistled to the robot. The latter is a typical and quick method of
exchange between humans when they do not have the name or recording of a
melody. Therefore it was decided that the HRI would accept a musical piece in
the form of whistling.
The HRI is designed as follows. As a human whistles to the system, a
recording is made. This recording is used to generate a musical piece by the
Melody Extraction Scheme (MES) which is outlined in the next chapter. The
generated musical piece is then written to the local file system of the mbed
microcontroller. After the file write operation is complete, the robot receives a
command signal to begin playing.
Partially processed MES results are played back to the user to indicate the
frequencies detected. The HRI is implemented on a laptop running the Windows 7
operating system. The built-in speakers and microphone are used for audio
input/output.
Figure 3.14: Human Robot Interface
3.6 Music playing algorithm
A simple music playing scheme was used to play melodies. For each musical
note, the volume arm servo is commanded to raise the volume, and immediately
after that, the pitch feedback control loop is activated for the desired pitch, for
a time duration equal to the note duration. To protect¹⁴ the hardware, at the
end of the note duration time interval, a stop command is issued to the pitch
motor in case it has not reached its target pitch. Following this, the volume arm
is moved to a low volume position and the pitch control loop is engaged for the
same note for a time duration equal to the note rest duration. Again, a stop
command is issued at the end of the pitch control loop.
¹⁴ This is to protect against short circuiting the pitch arm with the pitch
antenna, which causes the microcontroller to reset. The highest pitch location is
about 0.25 inch away from the antenna. A limit switch is not suitable because the
“limit” changes depending on the placement of the theremin.
To avoid overstressing the volume control servo with very small rest
durations, the algorithm is designed to skip the volume down command if the note
rest duration is less than 150 ms. This time duration was somewhat arbitrarily
selected after observing the behavior of the servo with several musical pieces.
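The playing scheme described in this section can be sketched as follows. The
hardware is represented by three caller-supplied functions (set_volume,
run_pitch_loop, and stop_motor), which are hypothetical names and not part of
the robot's actual code. The sketch also assumes that the pitch loop still runs
for the rest duration when the volume down command is skipped.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

MIN_REST_MS = 150  # rests shorter than this skip the volume down command


@dataclass
class Note:
    frequency: float   # target pitch in Hz
    duration_ms: int   # playing duration
    rest_ms: int       # rest duration after the note


def play_melody(
    notes: Iterable[Note],
    set_volume: Callable[[str], None],
    run_pitch_loop: Callable[[float, int], None],
    stop_motor: Callable[[], None],
) -> None:
    """Play each note, then hold the same pitch during its rest."""
    for note in notes:
        set_volume("high")
        run_pitch_loop(note.frequency, note.duration_ms)
        stop_motor()  # protect the hardware if the target was not reached
        if note.rest_ms >= MIN_REST_MS:
            set_volume("low")
        run_pitch_loop(note.frequency, note.rest_ms)
        stop_motor()


# Usage with logging stand-ins for the hardware (made-up notes).
log = []
play_melody(
    [Note(880.0, 500, 100), Note(988.0, 250, 200)],
    set_volume=lambda position: log.append(("volume", position)),
    run_pitch_loop=lambda freq, ms: log.append(("pitch", freq, ms)),
    stop_motor=lambda: log.append(("stop",)),
)
print(log)
```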
Chapter 4
Melody Extraction Scheme
(MES) for Human-Robot
Interface
As mentioned in Section 3.5, the human-robot interface (HRI) for the robot
takes as input a human sound, extracts the melody, and transmits it to the
robot controller to play. At the heart of the HRI is the MES which performs
this melody extraction. It starts by using FFT to extract a time-varying pitch
trajectory from the human audio input. This pitch trajectory is then shifted to
the pitch domain of the theremin. Finally, a melody comprising musical notes
is assembled for transmission to the robot controller. These stages are covered
in detail in the following sections.
In the traditional sense, a melody is defined as a data set of musical notes
along with their rest time durations. A musical note is defined as a specific
pitch on the music scale accompanied by its playing duration. This is different
from the note data object used in the MES (see Table 4.1), which also
includes rest information. Therefore a melody in this context is a set of note
objects. So although the data structure is different, the information contained
is the same for either definition of the term ‘melody’.
Table 4.1: The note data structure used by the MES
field         data type    definition                            unit
index         float        position on standard musical scale    -
frequency     float        pitch                                 Hz
prime         char         prime                                 -
accidental    char         accidental                            -
octave        int          octave                                -
duration      int          duration                              ms
rest          int          rest                                  ms
position      int          for manipulator                       -
time          float        timestamp                             s
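For illustration only, the note object of Table 4.1 could be mirrored in Python
as shown below. The MES itself is written in Processing, so this is not the
author's implementation. The class name and the example values are
hypothetical, as is the reading of 'prime' as the note letter and the use of a
blank character for a natural note.

```python
from dataclasses import dataclass, fields


@dataclass
class NoteRecord:
    index: float       # position on the standard musical scale
    frequency: float   # pitch in Hz
    prime: str         # single character (assumed here to be the note letter)
    accidental: str    # single character (assumed: ' ' for natural)
    octave: int
    duration: int      # playing duration in ms
    rest: int          # rest duration in ms
    position: int      # value for the manipulator
    time: float        # timestamp in s


# Hypothetical example: A5 played for 500 ms, followed by a 200 ms rest.
example = NoteRecord(
    index=69.0, frequency=880.0, prime="A", accidental=" ",
    octave=5, duration=500, rest=200, position=0, time=0.0,
)
print([f.name for f in fields(NoteRecord)])
```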
4.1 MES Performance Requirements
The main requirement is to extract a melody from an audio recording. This is
based on the design of the HRI in Section 3.5.1 where the HRI records audio
input, and calls upon the MES to process it (see Fig. 3.14). Another require-
ment is to ensure that the extracted melody is compatible with the capabilities
of the theremin. From Section 3.1.1 on pitch sensor requirements, we know
that the theremin only consistently covers two full octaves, i.e. octaves 5 and 6.
Therefore, we need to ensure that all extracted melodies fit within this range.
4.2 MES Design & Implementation
The melody extraction scheme (MES) was implemented using the Processing[3]
language. Input and processing of audio was done using the Minim[12] library
which has a number of useful audio related functions. The MES extracts a
melody in 5 stages listed below:
• Detect pitch
• Identify the highest octave
• Delete lower frequencies, and shift the remaining frequencies to the theremin’s
playing domain by octave
• Extract set of notes
• Assemble a melody
These stages are outlined in the following subsections.
4.2.1 Pitch detection
As concluded in Section 2.2, to detect pitch, one can use the frequency domain
based FFT, CA, or AC methods, or the time domain based ZCR method. We
have already used ZCR for the pitch detection sensor (Section 3.1.3) because of
its low computational demand (faster response time) and because it is suitable
for use with the theremin’s audio output. We discussed in Section 3.1.1 that
a low computation time was desired because feedback control is sensitive to
the computation time of the pitch detection algorithm. The theremin’s audio
output is relatively noise-free and does not have any overtones. It would be
possible to use the same scheme or even the pitch sensor itself for pitch
detection in the MES. However, with a live recording, background noise will
always be present, and there is no guarantee that overtones will be absent. ZCR
cannot be reliably used to detect pitch in such a situation. Furthermore, melody
extraction need not be performed in real time and is hence not sensitive to the
computation time of the pitch detection algorithm.
This leads to consideration of the three frequency domain based methods:
FFT, CA, and AC. While it is quite possible to implement these from scratch us-
ing basic principles, it is a time-consuming process. Fortunately, the Minim[12]
audio library for Processing and Java has a built-in FFT implementation[13].
This is a quick solution to implement a frequency domain based pitch detection
method. Therefore, FFT is the method of choice for pitch detection for the
MES.
FFT is performed on the audio recording made by the HRI. The audio
sampling rate is 44.1 kHz. For the transform, the buffer size is 2048 samples,
which divides the frequency range considered into 1024 equally sized pitch
intervals (or bands). When an FFT is performed, exact pitch values cannot be
identified; only the intervals can be identified. The pitch value in the center of
each interval is therefore assumed to be the pitch detected for that interval. In
this implementation, for a sampling rate of 44.1 kHz the maximum pitch
considered is the Nyquist frequency, i.e. about 22.1 kHz. This gives a pitch
resolution of about 21.6 Hz¹⁵ for the entire range considered, i.e., 0 - 22.1 kHz.
This means that pitch can be detected to the nearest 21.6 Hz.
For every instantaneous transform, a number of pitches with varying am-
plitudes (or loudness) are detected. It is assumed that the pitch produced by
the user, i.e. from whistling, humming, etc., is much louder than the pitches in
the background noise. Based on this assumption, for every time window, the
loudest pitch (one with the greatest amplitude) is taken to be the one produced
by the user.
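The selection of the loudest pitch can be sketched as below. This is not the
Minim code used in the MES. It works on a plain list of band amplitudes such as
an FFT would return, and it assumes band i is centered at i times the band
width. The spectrum values are made up for the example. With the exact 44.1 kHz
sampling rate the band width comes out at about 21.5 Hz, close to the 21.6 Hz
quoted above.

```python
SAMPLE_RATE_HZ = 44_100  # audio sampling rate stated in the text
BUFFER_SIZE = 2048       # FFT buffer size stated in the text


def band_width_hz(sample_rate_hz=SAMPLE_RATE_HZ, buffer_size=BUFFER_SIZE):
    """Width of one frequency band: sampling rate divided by buffer size."""
    return sample_rate_hz / buffer_size


def loudest_pitch_hz(magnitudes, sample_rate_hz=SAMPLE_RATE_HZ, buffer_size=BUFFER_SIZE):
    """Return the center frequency of the band with the greatest amplitude."""
    if not magnitudes:
        raise ValueError("spectrum is empty")
    loudest_band = max(range(len(magnitudes)), key=lambda i: magnitudes[i])
    return loudest_band * band_width_hz(sample_rate_hz, buffer_size)


# Synthetic spectrum: quiet background plus two peaks (hypothetical values).
spectrum = [0.01] * (BUFFER_SIZE // 2 + 1)
spectrum[10] = 0.2   # background hum
spectrum[46] = 1.0   # whistle
detected = loudest_pitch_hz(spectrum)
print(f"band width: {band_width_hz():.2f} Hz, detected pitch: {detected:.1f} Hz")
```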
¹⁵ 2/2048 × 22100 Hz = 21.6 Hz; based on implementation notes from the Minim
Manual [13]
4.2.2 Highest octave identification
The highest pitch is identified and its octave is computed for use in the next
stage. While computing the octave, pitch data are rounded off to the near-
est standard pitch on the musical scale. The rounding off procedure used is
explained as follows.
A table of standard musical note pitch values is stored in a 1 dimensional
lookup table. This table is based on the reference A4 (440 Hz). Instead of
searching through the table for the closest reference pitch, one can also use a
mathematical relation. For a standard musical note scale know that:
fn = fo × 2n12 where,
• fn is the frequency of the note n half steps away
• fo is the frequency of a fixed reference note. Here, it is the pitch of A4,
i.e. 440 Hz
• n is the number of half steps above or below. This is positive for half steps
above the reference note, and negative for half steps below. A half step of
+1 is the next highest note, +2 is two higher, and so on.
Rearranging the above and accounting for the fact that A4 is at index 57
(counting from C0) of our reference array, we get this relation to determine the
index from the pitch detected:
$\text{index} = 12 \times \frac{\log(f_n/f_o)}{\log(2)} + 57.0$
For a non-standard pitch value, the above relation gives a fractional index.
The index is rounded off and the corresponding standard pitch is obtained from
a lookup table. To identify the octave, the rounded index is used in a similar
lookup table which has the octaves listed. This pitch rounding process is also
employed in the next stage.
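The rounding procedure above can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, and instead of the two lookup tables described in the text it recomputes the standard pitch and the octave directly from the rounded index, which should be equivalent under the assumption that the table starts at C0 with 12 notes per octave.

```python
import math

A4_HZ = 440.0   # reference pitch from the text
A4_INDEX = 57   # index of A4 when counting notes from C0


def fractional_index(freq_hz):
    # index = 12 * log(f_n / f_o) / log(2) + 57
    return 12.0 * math.log(freq_hz / A4_HZ) / math.log(2.0) + A4_INDEX


def nearest_note(freq_hz):
    # Round to the nearest standard note. Python's round() sends exact
    # halves to the even neighbour, which is an arbitrary tie-break here.
    index = round(fractional_index(freq_hz))
    standard_hz = A4_HZ * 2.0 ** ((index - A4_INDEX) / 12.0)
    octave = index // 12  # C0..B0 -> 0, C1..B1 -> 1, and so on
    return index, standard_hz, octave
```

For example, `nearest_note(440.0)` yields index 57 in octave 4, and a slightly sharp or flat whistle near 523 Hz maps to index 60 (C5).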
4.2.3 Domain reduction and shifting
Ideally, to be playable on the theremin, frequencies in a musical piece must fall
within octaves 5 and 6. As concluded from Section 3.1.1 these are the only
two full octaves consistently available. However, human whistling may contain
frequencies above octave 6, may lie entirely below octave 6 or even below
octave 5, and may span more than two octaves. This calls for a
domain reduction and frequency shift so that the piece may be played on octaves 5
and 6.
The first step is the domain reduction step. We can select two consecutive
octaves of frequencies. If we assume that the background noise is at a lower
frequency than the frequency of the user’s whistling, then we can select the
highest two octaves. This assumption should hold true, as long as the user
is not in a high frequency noise environment like an orchestra or a jet engine
testbed. As for the frequencies in the lower octaves, we could either ignore
them, or shift them to these highest two octaves. Shifting the lower octave
frequencies to higher octaves can create interesting and often confusing acoustic
effects. Also, all the noise would make its way into the musical piece, unless
some method of distinguishing between noise and music is defined. If we assume
that the typical whistled musical piece does not span more than two octaves,
then we can just ignore the lower octaves. Given that whistling is only a single
‘musical instrument’ and not all people are capable of whistling complex musical
pieces spanning 3 or more octaves, this assumption should hold true in most
Figure 4.1: Domain reduction by retaining the highest 2 octaves. The highest pitch is in octave 7. The domain is reduced to pitches from octaves 7 and 6 (highlighted in blue).
cases. Ignoring the lower octaves also takes care of the low frequency noise.
Therefore, we choose to retain the top two octaves and ignore the rest of the
frequencies. This is illustrated in Fig. 4.1.
The next step is the shifting step which is quite straightforward. We know
the highest octave recorded and we know that frequencies must be adjusted so
that the highest octave is 6. Instead of multiplying or dividing by a factor of 2,
we simply increment/decrement the frequency index (defined in Section 4.2.2)
by multiples of 12.¹⁶ There are 12 notes in each octave. This implies that the
indices of corresponding notes in consecutive octaves differ by 12. For example,
¹⁶ The formula used is: index = index + 12*(6 - highest note octave)
Figure 4.2: MES pitch shifting. Pitches from octaves 7 and 6 (Fig. 4.1) are shifted to octaves 6 and 5 respectively, i.e., theremin’s pitch domain.
for the pitches in Fig. 4.1, the shifted pitches would be as shown in Fig. 4.2.
As a result of the shifting step, frequencies for all data points below the
second highest octave are set to zero (see Fig. 4.1). An instant with a frequency
of zero represents a rest instant in the musical piece. This domain reduction
ensures that input data span two octaves to match the two octaves available
on the theremin. It also eliminates the low amplitude background noise which
makes its way into the recording during rest periods. The frequencies are then
shifted so that the highest frequency is in octave 6, which is the highest complete
octave available on the theremin (Fig. 4.2). During this step, frequency data are
rounded off to the nearest standard frequency according to the special rounding
off procedure defined in Section 4.2.2.
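The two steps of this stage can be sketched together on rounded note indices. This is a hedged sketch rather than the thesis code: the function name is hypothetical, and a rest is represented by `None` in the index domain, standing in for the zero frequency used in the text.

```python
NOTES_PER_OCTAVE = 12
TARGET_TOP_OCTAVE = 6  # highest complete theremin octave, per the text


def reduce_and_shift(indices):
    # indices: one rounded note index (counted from C0) per time window,
    # or None for a window that is already a rest.
    voiced = [i for i in indices if i is not None]
    if not voiced:
        return list(indices)
    highest_octave = max(voiced) // NOTES_PER_OCTAVE
    # Domain reduction: keep only the highest two octaves.
    lowest_kept = (highest_octave - 1) * NOTES_PER_OCTAVE
    # Shifting: index = index + 12 * (6 - highest note octave).
    shift = NOTES_PER_OCTAVE * (TARGET_TOP_OCTAVE - highest_octave)
    return [
        i + shift if i is not None and i >= lowest_kept else None
        for i in indices
    ]
```

With the highest pitch in octave 7, as in Fig. 4.1, indices in octaves 7 and 6 move down by 12 and everything lower becomes a rest.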
4.2.4 Note extraction
Note extraction is based on the idea that for any given note, pitch detection will
detect the same frequency for as long as the note is being played. Therefore,
data points which are consecutive in time belong to the same note if their
frequencies are the same.
Accordingly, discrete data points are grouped into notes. This is achieved by
assigning consecutive data points with the same frequency to the same group.
Therefore, there is one note for one group of consecutive frequency-time data
points. Each note consists of a frequency field and a duration field (see Table
4.1). This greatly reduces the number of data points representing the frequency-
time trajectory. The result is a number of notes with zero and non-zero fre-
quencies.
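The grouping described above amounts to run-length encoding of the frequency-time data. A minimal sketch, assuming equally spaced time windows of a hypothetical length `window_s` and a dictionary per note in place of the record layout of Table 4.1:

```python
from itertools import groupby


def extract_notes(frequencies, window_s):
    # frequencies: one rounded frequency per time window (0 marks a rest).
    # Consecutive equal frequencies are merged into a single note.
    notes = []
    for freq, run in groupby(frequencies):
        count = sum(1 for _ in run)
        notes.append({"frequency": freq, "duration": count * window_s})
    return notes
```

Six data points such as `[440.0, 440.0, 0.0, 523.25, 523.25, 523.25]` collapse into three notes, one of which has zero frequency.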
4.2.5 Melody extraction
The last stage is to recognize the note rest durations. Every zero frequency note
is in fact the rest period after the previous note. Each note also has a rest field.
Therefore, the duration field value of each zero-frequency note is assigned to
the rest field of the note immediately before it. The zero-frequency notes are
then discarded.
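The rest assignment can be sketched as a single pass over the notes. The function name and the dictionary layout are assumptions for illustration; a zero-frequency note with no preceding note (a leading silence) is simply dropped here, which the text does not specify.

```python
def extract_melody(notes):
    # notes: dictionaries with "frequency" and "duration" fields,
    # where a frequency of 0 marks a rest period.
    melody = []
    for note in notes:
        if note["frequency"] == 0:
            if melody:
                # The rest belongs to the note immediately before it.
                melody[-1]["rest"] += note["duration"]
        else:
            melody.append({
                "frequency": note["frequency"],
                "duration": note["duration"],
                "rest": 0.0,
            })
    return melody
```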
Chapter 5
Testing and Performance
The control system and MES were put through a series of tests to evaluate
their design effectiveness. First, the control system’s response to various input
commands was checked. This was followed by a repeat of a subset of these tests
but with a disturbing body in the presence of the theremin. Subsequently, the
MES was tested with input from pure tones, human whistling, and piano keys.
Finally, the overall system was tested with a whistled musical piece as input.
This chapter presents all these tests and their results.
5.1 Testing of Feedback Control Scheme
First, the PD feedback control system was tested. Frequency trajectories were
provided in the melody input format defined in Chapter 4. Two tests were
performed.
[Plot: frequency (Hz) versus time (s); series: command, actual.]
Figure 5.1: A typical step command test. Step commands were issued across the entire playing domain, starting from the lowest frequency, up to the highest frequency, and then back down to the lowest frequency. This figure shows the trajectory for 4-note steps, i.e., 4 notes skipped per step command.
5.1.1 Response to step commands
The commanded frequency was changed in steps starting from the lowest playing
frequency of 523.25 Hz (C5) up to the highest playing frequency of 1975.53 Hz
(B6) and then back down again (Fig. 5.1). For each step, the controller was given 1 s
to respond. For reference, the design requirements decided in Section 3.3.1 called
for a maximum steady-state error of 1 %. This was the error used to determine
when the controller had settled. This test was performed for commanded steps
corresponding to 1, 2, 4, 8 and 16 note transitions. A note transition of 1
corresponds to all notes being commanded, a transition of 2 corresponds to
every second note commanded, and so on. The mean response times are shown
in Fig. 5.2. As expected, the settling time increases for larger steps. The
response time of the 16-note step could not be recorded because the controller
[Plot: average settling time (s) versus number of note transitions per step.]
Figure 5.2: Average settling times for commanded frequency steps with 1, 2, 4, and 8 note transitions per step. As expected, the settling time increases for larger steps. The controller was unable to complete its response for the 16-note step commands within the given time of 1 s.
could not reach the commanded frequency within the allotted 1 second (per
step) time limit for this test. It was also noted that the maximum overshoot
remained below the 1% threshold decided in Section 3.3.1.
The controller seems to speed up for bigger note transitions. Fig. 5.3 shows
the settling time per note, which indicates that the average time spent per note
transition actually decreases for larger step sizes. This is expected, given the
control law. Also, a qualitative look at the data seemed to indicate that the
controller responded faster when stepping from a higher frequency down to a
lower frequency. However, the response time of the controller was different at
different frequencies, so not much could be deduced without further analysis.
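One way to compute a settling time from a recorded step response, using the 1 % error band mentioned above, is sketched below. The function name, the sampled-trajectory format, and the choice to measure time from the first sample are assumptions; the thesis does not describe its analysis code.

```python
def settling_time(times, actual, command, tolerance=0.01):
    # Returns the time (relative to the first sample) after which the
    # actual frequency stays within tolerance * command of the command
    # until the end of the record, or None if it never settles.
    settled_at = None
    for t, value in zip(times, actual):
        if abs(value - command) <= tolerance * abs(command):
            if settled_at is None:
                settled_at = t
        else:
            settled_at = None  # left the band again, so not yet settled
    if settled_at is None:
        return None
    return settled_at - times[0]
```

Dividing each mean settling time by the number of note transitions in the step gives the per-note figure discussed above.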
[Plot: average settling time per note transition (s/note) versus number of note transitions per step.]
Figure 5.3: Settling time per note transition. This indicates that the average time spent per note transition decreases for larger step sizes, i.e., the controller responds faster for larger step commands.
5.1.2 Effect of environmental disturbance
This test was used to observe how the control system was affected by distur-
bances. A 1-note step test similar to the one in Fig. 5.1 was used with a fixed
disturbance¹⁷ within range of the antenna.
A disturbance affects the pitch produced as follows. Section 1.1 described
the use of heterodyning frequencies from two LC oscillators to produce pitch.
A theremin’s pitch antenna acts as a capacitor for one of the two oscillators.
A disturbance in the vicinity of the antenna increases the capacitance of the
antenna. This increase in capacitance changes the oscillator frequency such
that the difference in frequencies of the two oscillators is increased. This in
turn increases the pitch produced by the theremin. Additional disturbances,
¹⁷ This disturbance was in the form of a circular aluminum plate (13 inch diameter, 1/16 inch thick) placed approximately 14 inches away from the antenna, facing the antenna.
e.g. a pitch controlling arm, tend to add to the capacitance and hence increase
the pitch even more. Skeldon et al. [37] provide a mathematical treatment of
the physics of the disturbance and its relation to the pitch produced.
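The qualitative argument above can be illustrated with the ideal LC resonance formula, f = 1/(2π√(LC)). The sketch below is a simplified model and not taken from the thesis or from [37]: the component values are hypothetical, the two oscillators are assumed to be tuned to the same frequency when no disturbance is present, and the disturbance is modeled as a small capacitance added in parallel to the antenna oscillator.

```python
import math


def lc_frequency(inductance_h, capacitance_f):
    # Resonant frequency of an ideal LC oscillator.
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance_h * capacitance_f))


def theremin_pitch(fixed_hz, inductance_h, base_capacitance_f, added_capacitance_f):
    # Heterodyned pitch: difference between the fixed oscillator and the
    # antenna oscillator, whose capacitance grows with the disturbance.
    variable_hz = lc_frequency(inductance_h, base_capacitance_f + added_capacitance_f)
    return abs(fixed_hz - variable_hz)


# Hypothetical values: 1 mH and 100 pF give roughly 500 kHz oscillators.
L_H = 1e-3
C_F = 100e-12
fixed = lc_frequency(L_H, C_F)
pitch_small = theremin_pitch(fixed, L_H, C_F, 0.1e-12)  # small disturbance
pitch_large = theremin_pitch(fixed, L_H, C_F, 0.2e-12)  # larger disturbance
```

Under these assumptions a larger added capacitance lowers the antenna oscillator's frequency further, so the difference frequency, and hence the pitch, rises.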
As shown previously in Fig. 1.2, a disturbance “stretches” out the pitch
domain in physical space. The physical distance between pitch positions in
space increases. Consequently, the time taken for the robot arm to move between
pitch positions should also increase. As expected, the 1-note step test showed
that the mean settling time increased from 0.1920 s to 0.2414 s.
5.2 Testing of Melody Extraction Scheme
This section covers tests conducted for the MES. Live audio was provided
through the HRI. These tests were intended to demonstrate that the MES was
functional and to investigate possible limitations.
5.2.1 Pure tone
Audio with a set of pure frequencies (no overtones) was provided using a fre-
quency generator application from a cell phone. Although this was an ideal
input without noise, it tested the basic functionality of converting pitch input
to a melody. The input frequencies and the output are shown in Fig. 5.4.
5.2.2 Whistling
A section of Pachelbel’s Canon was input in the form of whistling. This is a
typical input that the MES should expect and this test can demonstrate if the
MES can process such an input. Results are shown in Fig. 5.5.
[Plot: frequency (Hz), with the corresponding music scale, versus time (s); series: pitch detected, extracted melody.]
Figure 5.4: Plot of pure tones as processed by the MES. Audio input to the MES (red) is used to extract a melody (blue). The audio input is represented by a frequency-time trajectory comprising non-zero frequencies. The melody comprises a series of pitch values in time marking the start or end of a note. This melody can then be transmitted to the robot to be played on the theremin. The jump in pitch at 3.8 s is due to a pitch detected at that point.
5.2.3 Piano recording
A simulated piano application was used to play the notes C4, D4, E4, F4, G4,
A4, B4, C5, and then B4 through C4. This was intended to test the MES beyond
its design domain for another potentially useful form of input. Notes from a
piano comprise multiple overtones. In contrast, the MES was not designed for
pitch input with overtones. Results are shown in Fig. 5.6. Overtones in the
audio input result in erroneous recognition of pitch. This is a limitation of the
pitch detection algorithm (Section 4.2.1). However, based on the erroneously
[Plot: frequency (Hz), with the corresponding music scale, versus time (s); series: pitch detected, extracted melody.]
Figure 5.5: Plot of melody extracted from human whistling. Audio input to the MES (red) is used to extract a melody (blue). Note that only the highest two octaves (5 and 6) are retained in the extracted melody. The audio input is represented by a frequency-time trajectory comprising non-zero frequencies. The melody comprises a series of pitch values in time marking the start or end of a note.
detected pitches, the note extraction takes place as expected.
5.3 Testing of Overall System
Finally, the complete system was tested. Part of the folk song Brahms’ Lullaby
was whistled to the HRI. This was processed by the MES (Fig. 5.7) and the
resulting melody was transmitted to the control system to play (Fig. 5.8).
In Fig. 5.7, the pitch detected (red) does not correspond to exact notes on
[Plot: frequency (Hz), with the corresponding music scale, versus time (s); series: pitch detected, extracted melody.]
Figure 5.6: Plot of notes from a piano as recognized by the MES. Notes played were C4, D4, E4, F4, G4, A4, B4, C5, and then B4 through C4. Overtones in the audio input result in erroneous recognition of pitch (red). This pitch information is then correctly used to extract a melody (blue). The highest frequency is in octave 7. This is adjusted to octave 6. All the other frequencies are adjusted/discarded accordingly. The time scales have been aligned after the domain was reduced to 2 octaves.
the 12-note musical scale. The extracted melody (blue) discretizes the pitch
detected to the nearest standard pitch on the 12-note musical scale. The MES
recognizes and stores only the start and end points of each note. This way, fewer
data points can be used to represent the melody as compared to the number of
data points used to represent the pitch detected.
The pitch trajectory for the melody is shown in Fig. 5.8a. For very short
duration notes, the commanded pitch (red) is rarely achieved by the robot’s
pitch arm (blue). For longer duration notes, the commanded pitch is typically
achieved with no noticeable overshoot. This is more apparent in Fig. 5.8b which
shows the percentage error with time for the pitch trajectory. The overshoot for
the longer duration notes typically remains under 1 %. Fig. 5.8b also shows that
once the actual pitch settles down at the commanded pitch, the error remains
below 1%.
[Plot: frequency (Hz), with the corresponding music scale, versus time (s); series: pitch detected, extracted melody.]
Figure 5.7: Results of full system test with a whistling input of Brahms’ Lullaby. Pitch detected by the MES (red) is used to extract a melody (blue). Note how the extracted melody is a discretized representation of the input pitch detected. The rising note between 7 and 8 s is discretized to 4 different standard pitches. Pitch is discretized to the nearest note on the 12-note musical scale. The MES recognizes and stores only the start and end of each note. Ideally, this, together with discretization, saves memory. This saving on memory is best illustrated by the two notes between 8 and 10 s.
[Plot (a) Trajectory: frequency (Hz), with the corresponding music scale, versus time (s); series: command, actual.]
[Plot (b) Error: frequency error (%) versus time (s). Percentage difference between commanded and actual pitch.]
Figure 5.8: Control system performance results for full system test with a whistling input of Brahms’ Lullaby.
Chapter 6
Conclusion
The goal of this work was to create a theremin playing robot that could be used
to test various control schemes, was easy for a user to program musically, and
that was inexpensive enough to be used for educational purposes. Section 1.3
provides the objectives in detail. Based on these objectives, we developed a
programmable robot system comprising a pitch sensor, a set of robot manipulators
to play a theremin, and a human robot interaction system to
command the robot (Chapters 3 and 4). A control scheme was programmed to
demonstrate that the robot is indeed programmable. Evaluation of some of the
subsystems was covered in Chapters 3 and 5. This chapter presents a discus-
sion drawing from the initial objectives and subsequent evaluations as to how
the objectives were achieved or not achieved. This is followed by a conclusion
summarizing the outcome of the discussion, and an outline of possible future
work for this project.
6.1 Discussion
The first primary objective in Section 1.3 was to develop a set of robotic ma-
nipulators to play the theremin. To that end, two sets of manipulators were
developed. The first set (Fig. 3.7) was quick to respond, but difficult to control
owing to backlash, hysteresis, and nonlinearity in kinematics. It also had the
tendency to excessively load the actuators. Therefore, a second set of manipula-
tors (Fig. 3.13) was developed with simpler and more linear kinematics. It did
not overload the actuators, and was much easier to control. However, it seemed
that this was not as fast as the earlier version of the manipulators. Nevertheless,
the robot was able to reach and control the antennae of the theremin, and play
commanded musical pieces. Therefore, this objective was successfully achieved.
The second objective in Section 1.3 was to develop a pitch detection system
for feedback control, i.e., one with an accuracy of at least 0.5 % (Section 3.1.1). A
pitch sensor was thus designed. It was successfully able to process the audio
signal of the theremin. Testing results from Section 3.1.3 indicated a mean
absolute error of 0.1 %. This is very accurate pitch information, given that
pitch discrimination ability in 90 % of non-musicians is 1 %. Therefore, our
second objective was also achieved.
The third objective in Section 1.3 was to develop a robot system which can
be used to test different control schemes. The selection of a programmable robot
controller (Section 3.4) along with development of pitch sensor and theremin
playing manipulators was a step towards achieving this objective.
The fourth objective in Section 1.3 was to implement a control scheme to
demonstrate that the previous three objectives have been met: manipulators
to play the theremin, sensor to detect pitch, and a system on which a control
scheme can be implemented/programmed. To fulfill it, a PD control scheme
was implemented. It was proposed in Section 3.3.1 that it should be able to
perform with maximum settling time of 1.1 s, which was the time taken to move
from the lowest to highest pitch. Design requirements for the control system
included a maximum overshoot of 1 %, and a maximum steady-state error of 1
%.
A step response test in Section 5.1 indicated mean settling times of 0.19
s to 0.75 s for step sizes ranging from a single note to eight notes. Although
the controller was able to achieve the 1 % or less overshoot and steady-state
error objectives, no conclusion can be drawn for the settling time. We observed a faster
response time for larger note steps, and faster response times when stepping
down from a higher frequency to a lower frequency. Also, the response time of
the controller was different at different frequencies. All these factors, coupled
with the fact that different musical pieces require different controller response
times for different note transitions, suggest that much more needs to be done to
analyze and benchmark the responsiveness of the controller. Nevertheless, we
were able to implement a control scheme. While trying to develop a suitable
controller, we also tried some other control schemes (see Section 3.3.2). This
fulfills the fourth objective of using a control scheme to verify that we have a
platform on which we can test control schemes. In addition, we have developed
some simple tools to evaluate controller performance.
The fifth objective from Section 1.3 was to develop a user interface for com-
manding a music playing task to the robot system. The HRI and MES were
developed to achieve this objective. The HRI was designed so that users would
find it easy to command the robot using whistling. We do not know whether it is an
easy interface, but we do have an interface for commanding the robot.
The MES was designed so that it would work very well with input
involving pure frequencies. This capability was demonstrated in the pure tone
test in Section 5.2. The whistling input tests in Sections 5.2 and 5.3 showed
output melodies fairly close to the input. This was partly true for the piano
input in Section 5.2. The pitch trajectory recognized by the FFT stage resulted
in a very similar melody output. However, the FFT stage itself performed very
poorly due to the strong overtones present in the piano notes.
Finally, the domain reduction stage of the MES was designed to eliminate
the low amplitude noise during the note rest period. The pitch detection graphs
in Sections 5.2 and 5.3 indicate low frequencies. From actual audio recordings,
we know that those are low amplitude background noise. These are clearly
eliminated in each of the MES output graphs in Sections 5.2 and 5.3.
Therefore, the fifth objective of developing a user interface was also achieved.
It was not the best choice given a general reluctance of users to whistle, but
it worked as it was expected to. Overall, all 5 of the primary objectives from
Section 1.3 were achieved.
6.2 Conclusion
A robot system was developed to fulfill 5 specific objectives (see Section 1.3).
All of them were achieved. We now have a thereminist robot system which can
serve as a testbed for evaluating different motor control schemes. It can accept
human playing commands, it can record the theremin’s music output, and it
can give a musical performance based on any control scheme which we choose
to program it with.
We have implemented a control scheme for demonstration purposes and it
performs reasonably well. We tried looking at past work on human motor control
for inspiration. We learned that in general, human motor control involves both
feedback and feedforward, often for the same movement. Due to the inherent
difficulties in developing a feedforward model for the theremin, we adopted a
feedback control scheme only.
Figure 6.1: Thereminist robot system.
6.3 Future Work
At this stage, we have a device that can be used to explore control schemes for
robots in a musical environment. This can eventually be used to explore some
theories in human motor control for continuous pitch musical instruments¹⁸
such as theremins and trombones. A starting point is to make humans play the
theremin and observe their frequency-time trajectories. This leads to several
research questions. What can we learn about human motor control algorithms
from observing human musical performance on the theremin? Does being a
¹⁸ Musical instruments capable of producing continuously varying pitch. Such instruments are not limited to playing notes at discrete levels.
musician play a role in this performance? Can performance on a theremin
be related to performance on a trombone? What does it take to develop a
control scheme whose response is similar to the response of human motor control
in a musical environment? Based on human perception, how does one define
similarity to human motor control? Can a thereminist robot be used to do pitch
training for non-musicians?
Another area to investigate is how professional thereminists coordinate mul-
tiple antenna control while playing the theremin. Do they move to a new note
while the volume is low, or do they do it as the volume is going up again? Is
that feedback, feedforward, or hybrid control? In the process of investigating
the above questions, it may also be useful to measure EMG and EEG¹⁹ activity.
These can provide some information on muscle and brain activity. By observing
which regions of the brain are active during different parts of the musical task,
we might be able to learn more about the motor control scheme(s) used.
Other than that, the current system could use some improvements in the
software as well as hardware side. The FFT stage of the MES could be changed
to a smarter algorithm that can deal with overtones. More whistled notes can
be retained by the MES by shifting the highest whistled pitch to the highest
theremin pitch, rather than just using the highest octave. We could also add a
different form of user input for the HRI; something along the lines of a simulated
musical instrument²⁰. At this point the robot has additional DOFs to point
the arms at the antenna, but currently this pointing action is not automated.
We could automate this process by using a camera or proximity sensors.
¹⁹ Electroencephalography (EEG) is the recording of electrical activity along the scalp to monitor brain activity.
²⁰ A graphical user interface or a physical replica that directly generates audio signals suitable for the MES.
Moreover, a laptop is currently required for interaction with the robot. A shift
to the Link robot controller (which is a lot more stable now) can enable us to
do away with the laptop and make the robot system more portable. The Link
has a touch screen for interaction and we just have to enable it to perform audio
processing like the Minim library for Processing/Java.
On the hardware side of things, we noted that the pitch manipulator was
limited in its response time. An improved design with a faster response time
could be something to work on in the future. One direction for achieving this
is to use an arm with multiple degrees of freedom similar to those employed
by humans playing the theremin. Something like a wrist or even fingers could
be added, so that the control movement is divided across different degrees of
freedom. These additional degrees of freedom may or may not be translational
like the pitch arm. This could allow a faster response time using the same motor
as currently being used.
Bibliography
[1] Hobbico CS-60 Servo Specifications and Reviews. http://www.servodatabase.com/servo/hobbico/cs-60.
[2] Nao Key Features - Audio. http://www.aldebaran-robotics.com/en/Discover-NAO/Key-Features/audio.html.
[3] Processing. http://www.processing.org/.
[4] A. Alford, S. Northrup, K. Kawamura, K-W. Chan, and J. Barile. Music Playing Robot. In Proceedings of the International Conference on Field and Service Robotics (FSR ’99), pages 174–178, Pittsburgh, PA, August 1999.
[5] Atmel. AVR205: Frequency Measurement Made Easy with Atmel tinyAVR and Atmel megaAVR. http://www.atmel.com/Images/doc8365.pdf, February 2011. Application Note, rev. A.
[6] Max Baars. The theremin - How it works. In Thereminvox, Milano, Italy, December 2004. Thereminvox.
[7] Alyssa M. Batula and Youngmoo E. Kim. Development of a Mini-Humanoid Pianist. In IEEE-RAS International Conference on Humanoid Robots, pages 192–197, Nashville, TN, USA, December 2010.
[8] Justin Cohen, Masataka Niwa, Robert W. Lindeman, Haruo Noma, Yasuyuki Yanagida, and Kenichi Hosaka. A Closed-Loop Tactor Frequency Control System for Vibrotactile Feedback. In (Interactive Poster) Extended Abstracts, ACM CHI 2005, pages 1296–1299, Portland, Oregon, USA, April 2005.
[9] Kristina Dell. Vinyl Gets Its Groove Back. Time, 171(3):55, Jan 21 2008.
[10] Laurent Demany and Catherine Semal. The Role of Memory in Auditory Perception. In William A. Yost, Arthur N. Popper, and Richard R. Fay, editors, Auditory Perception of Sound Sources, volume 29 of Springer Handbook of Auditory Research, pages 77–113. Springer US, 2007.
[11] Urko Esnaola and Tim Smithers. MiReLa: A Musical Robot. In Proceedings of the IEEE International Symposium on Computational Intelligence in Robotics and Automation, pages 67–72, 2005.
[12] Damien Di Fede. Minim. http://code.compartmental.net/tools/minim/.
[13] Damien Di Fede. Minim Manual: FFT. http://code.compartmental.net/tools/minim/manual-fft/.
[14] David Gerhard. Pitch Extraction and Fundamental Frequency: History and Current Techniques. Technical Report TR-CS 2003-06, University of Regina, Regina, Saskatchewan, Canada, 2003.
[15] M. Ghafouri and A. G. Feldman. The timing of control signals underlying fast point-to-point arm movements. Experimental Brain Research, 137:411–423, 2001.
[16] Tony James and Abi Grogan. Back in the groove. Engineering Technology, 6(11):50–53, December 2011.
[17] Angelica Lim, Takeshi Mizumoto, Louis-Kenzo Cahier, Takuma Otsuka, Toru Takahashi, Kazunori Komatani, Tetsuya Ogata, and Hiroshi G. Okuno. Robot musical accompaniment: Integrating audio and visual cues for real-time synchronization with a human flutist. In Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on, pages 1964–1969, 2010.
[18] Christine L. MacKenzie. Motor skill in music performance: Comments on Sidnell. Psychomusicology: A Journal of Research in Music Cognition, 6(1-2):25–28, 1986.
[19] MadLab. Schematics and Source Code for Digital Frequency Counter. http://www.apogeekits.com/counter_article.htm, 2012.
[20] F. Marmel, B. Tillmann, and W.J. Dowling. Tonal expectations influence pitch perception. Perception & Psychophysics, 70(5):841–852, 2008.
[21] Carmen Bachiller Martín, Jorge Sastre Martínez, Amelia Ricchiuti, Hector Esteban Gonzalez, and Carlos Hernandez Franco. Study of the Interference Affecting the Performance of the Theremin. International Journal of Antennas and Propagation, 2012, 2012.
[22] Takeshi Mizumoto. Re: Question on thereminist robot. Personal e-mail, March 2012.
[23] Takeshi Mizumoto, Toru Takahashi, Tetsuya Ogata, and Hiroshi G. Okuno. Adaptive pitch control for robot thereminist using unscented kalman filter. In Wei Ding, He Jiang, Moonis Ali, and Mingchu Li, editors, Modern Advances in Intelligent Systems and Tools, volume 431 of Studies in Computational Intelligence, pages 19–24. Springer Berlin Heidelberg, 2012.
[24] Takeshi Mizumoto, Hiroshi Tsujino, Toru Takahashi, Tetsuya Ogata, and Hiroshi G. Okuno. Thereminist Robot: Development of a Robot Theremin Player with Feedforward and Feedback Arm Control based on a Theremin’s Pitch Model. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2297–2302, October 2009.
[25] Mitsuhiro Nakamura and Hideyuki Sawada. Talking Robot and the Analysis of Autonomous Voice Acquisition. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Beijing, China, October 2006.
[26] C. Minos Niu, Daniel M. Corcos, and Mark B. Shapiro. Temporal Shift From Velocity to Position Proprioceptive Feedback Control During Reaching Movements. Journal of Neurophysiology, 104(5):2512–2522, 2010.
[27] Hirohisa Ohta, Hiroshi Akita, and Motomu Ohtani. The Development of an Automatic Bagpipe Playing Device. In Proceedings of the International Computer Music Conference, pages 430–431, Tokyo, Japan, 1993.
[28] Brian Passey. Vinyl records spin back into vogue–2010 was top year for record sales since 1991. USA Today, 15, 2011.
[29] Klaus Petersen, Jorge Solis, and Atsuo Takanishi. Development of a Aural Real-Time Rhythmical and Harmonic Tracking to Enable the Musical Interaction with the Waseda Flutist Robot. In Intelligent Robots and Systems, 2009. IROS 2009. IEEE/RSJ International Conference on, pages 2303–2308, 2009.
[30] Robert M. Poss. Distortion Is Truth. Leonardo Music Journal, 8:45–48, 1998.
[31] Lawrence R. Rabiner, Michael J. Cheng, Aaron E. Rosenberg, and Carol A. McGonegal. A Comparative Performance Study of Several Pitch Detection Algorithms. IEEE Transactions on Acoustics, Speech and Signal Processing, 24:399–418, October 1976.
[32] Kourosh Rahnamai, Brian Cox, and Kevin Gorman. Fuzzy Automatic Guitar Tuner. In Annual Meeting of the North American Fuzzy Information Processing Society, pages 195–199, San Diego, CA, June 2007.
[33] Curtis Roads. The Tsukuba Musical Robot. Computer Music Journal, 10(2):39–43, 1986.
[34] J.C. Rothwell, M.M. Traub, and C.D. Marsden. Automatic and Voluntary Responses Compensating for Disturbances of Human Thumb Movements. Brain Research, 248(1):33–41, 1982.
[35] Masatsugu Sakajiri, Kenryu Nakamura, Satoshi Fukushima, Shigeki Miyoshi, and Tohru Ifukube. Voice Pitch Control Ability of Hearing Persons With or Without Tactile Feedback Using a Two-Dimensional Tactile Display System. In 2011 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 1069–1073, Anchorage, AK, October 2011.
[36] Koji Shibuya, Shoji Matsuda, and Akira Takahara. Toward Developing a Violin Playing Robot - Bowing by Anthropomorphic Robot Arm and Sound Analysis. In Robot and Human interactive Communication, 2007. RO-MAN 2007. The 16th IEEE International Symposium on, pages 763–768, 2007.
[37] Kenneth D. Skeldon, Lindsay M. Reid, Viviene McInally, Brendan Dougan, and Craig Fulton. Physics of the Theremin. American Journal of Physics, 66(11):945, November 1998.
[38] Thomas D. Snyder and Sally A. Dillow. Digest of Education Statistics 2010, April 2011.
[39] Jorge Solis, Klaus Petersen, Tetsuro Yamamoto, Masaki Takeuchi, Shimpei Ishikawa, Atsuo Takanishi, and Kunimatsu Hashimoto. Implementation of an Overblowing Correction Controller and the Proposal of a Quantitative Assessment of the Sounds Pitch for the Anthropomorphic Saxophonist Robot WAS-2. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1943–1948, Taipei, Taiwan, October 2010.
[40] Jorge Solis, Koichi Taniguchi, Takeshi Ninomiya, Tetsuro Yamamoto, and Atsuo Takanishi. Development of Waseda flutist robot WF-4RIV: Implementation of auditory feedback system. In IEEE International Conference on Robotics and Automation, pages 3654–3659, Pasadena, CA, USA, May 2008.
[41] Kenji Suzuki, Takeshi Ohashi, and Shuji Hashimoto. Interactive Multimodal Mobile Robot for Musical Performance. In Proceedings of the International Computer Music Conference, pages 407–410, Beijing, China, October 1999.
[42] Kenji Suzuki, Keishiro Tabe, and Shuji Hashimoto. A Mobile Robot Platform for Music and Dance Performance. In Proceedings of the International Computer Music Conference, Berlin, Germany, August 2000.
[43] Kenji Suzuki, Yoichiro Taki, Hisaki Konagaya, Pitoyo Hartono, and Shuji Hashimoto. Machine Listening for Autonomous Musical Performance Systems. In Proceedings of the International Computer Music Conference, ICMA, pages 61–64, San Francisco, September 2002.
[44] Shogo Takahashi, Kenji Suzuki, Hideyuki Sawada, and Shuji Hashimoto. Music Creation from Moving Image and Environmental Sound. In Proceedings of the International Computer Music Conference, Beijing, China, October 1999.
[45] Yoichiro Taki, Kenji Suzuki, and Shuji Hashimoto. Real-time Initiative Exchange Algorithm for Interactive Music System. In Proceedings of the International Computer Music Conference, Berlin, Germany, August 2000.
[46] Mari Tervaniemi. Musical Sound Processing in the Human Brain. Annals of the New York Academy of Sciences, 930(1):259–272, 2001.
[47] Mari Tervaniemi, Viola Just, Stefan Koelsch, Andreas Widmann, and Erich Schröger. Pitch discrimination accuracy in musicians vs nonmusicians: an event-related potential and behavioral study. Experimental Brain Research, 161(1):1–10, 2005.
[48] WJ Wadman, JJ Denier Van der Gon, RH Geuze, and CR Mol. Control of fast goal-directed arm movements. Journal of Human Movement Studies, 5(1):3–17, 1979.
[49] Yan Wu, Polake Kuvinichkul, Peter Y.K. Cheung, and Yiannis Demiris. Towards Anthropomorphic Robot Thereminist. In Proceedings of the IEEE International Conference on Robotics and Biomimetics, pages 235–240, 2010.
[50] Emily Chivers Yochim and Megan Biddinger. ‘It kind of gives you that vintage feel’: vinyl records and the trope of death. Media, Culture & Society, 30(2):183–195, 2008.
[51] Kazuyoshi Yoshii, Kazuhiro Nakadai, Toyotaka Torii, Yuji Hasegawa, Hiroshi Tsujino, Kazunori Komatani, Tetsuya Ogata, and Hiroshi G. Okuno. A Biped Robot that Keeps Steps in Time with Musical Beats while Listening to Music with Its Own Ears. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1743–1750, San Diego, CA, 2007.
[52] Robert J. Zatorre, Joyce L. Chen, and Virginia B. Penhune. When the brain plays music: auditory-motor interactions in music perception and production. Nature Reviews Neuroscience, 8:547–558, July 2007.