
UNIVERSITY OF OKLAHOMA

GRADUATE COLLEGE

POSITION CONTROL USING PITCH FEEDBACK

A THESIS

SUBMITTED TO THE GRADUATE FACULTY

in partial fulfillment of the requirements for the

Degree of

MASTER OF SCIENCE

By

MUSTAFA A. GHAZI

Norman, Oklahoma

2013

POSITION CONTROL USING PITCH FEEDBACK

A THESIS APPROVED FOR THE

SCHOOL OF AEROSPACE AND MECHANICAL ENGINEERING

BY

Dr. David Miller, Chair

Dr. Andrew Fagg

Dr. Peter Attar

© Copyright by MUSTAFA A. GHAZI 2013

All Rights Reserved.

DEDICATION

To my mother, Ammi, for motivating me and pushing me through school all these years. Without you, I would not be where I am today.

Acknowledgements

First of all, I would like to thank Dr. David Miller, my advisor and mentor,

for having faith in me and encouraging me, despite my lack of experience with

robotics. I am particularly grateful for his patience with me. I would also like

to thank my committee members: Dr. Peter Attar, for teaching me not to

be afraid of math, and Dr. Andrew Fagg, for my formal training in practical

robotics.

I would like to thank Dr. John Fagan and Dr. Miller for teaching me so

much about robot construction, Dr. Dean Hougen for my formal introduction to

robotics, Amber Walker for her advice and guidance on everything from school

work to life in the US, Matthew Walker for introducing me to microcontrollers,

Michael Nash for his help with 8-bit microcontrollers and programming tips,

Andrew Kooiman for teaching me about robot testing and redesigning, Jacob

Henderson for his advice on electronics, and Clayton Stich for his help with

machining parts. I am also grateful for cooperation from Billy Mays and Greg

Williams at the AME machine shop.

Finally, I would like to thank Ammi and Baba, my parents, for their contin-

uing support, encouragement and love, keeping me motivated day after day.


Contents

1 Introduction
  1.1 Overview of the Theremin
  1.2 Motivation
  1.3 Objectives
  1.4 Overview of Thesis

2 Related Work
  2.1 Pitch Feedback in Robots and Humans
    2.1.1 Automated systems controlling pitch
    2.1.2 Automated systems: summary
    2.1.3 Human motor control
    2.1.4 Human motor control: summary & preliminary hypothesis
  2.2 Pitch Detection Methods
    2.2.1 Pitch detection methods: summary
  2.3 Human Robot Interfaces in Musical Robots
    2.3.1 Human Robot Interfaces: summary

3 Development of a Thereminist Robot System
  3.1 Pitch Sensor
    3.1.1 Pitch sensor design requirements
    3.1.2 Pitch sensor design
    3.1.3 Pitch sensor implementation
  3.2 Mechanics
    3.2.1 Mechanical design requirements
    3.2.2 Mechanical design & implementation
  3.3 Control Scheme
    3.3.1 Control scheme design requirements
    3.3.2 Control scheme design & implementation
  3.4 Robot Controller
    3.4.1 Robot controller performance requirements
    3.4.2 Available robot controller options
    3.4.3 Selection of robot controller
  3.5 Human-Robot Interface
    3.5.1 HRI performance requirements
  3.6 Music playing algorithm

4 Melody Extraction Scheme (MES) for Human-Robot Interface
  4.1 MES Performance Requirements
  4.2 MES Design & Implementation
    4.2.1 Pitch detection
    4.2.2 Highest octave identification
    4.2.3 Domain reduction and shifting
    4.2.4 Note extraction
    4.2.5 Melody extraction

5 Testing and Performance
  5.1 Testing of Feedback Control Scheme
    5.1.1 Response to step commands
    5.1.2 Effect of environmental disturbance
  5.2 Testing of Melody Extraction Scheme
    5.2.1 Pure tone
    5.2.2 Whistling
    5.2.3 Piano recording
  5.3 Testing of Overall System

6 Conclusion
  6.1 Discussion
  6.2 Conclusion
  6.3 Future Work


Abstract

An interactive robot has been developed to play a musical instrument called the

theremin. The robot has 4 degrees of freedom (DOF) and uses zero crossing

rate (ZCR) for pitch detection. A proportional-derivative (PD) feedback control

scheme is used to control pitch and play music. A user interface converts human

whistling sounds to a commanded melody for the robot. A melody extraction

scheme (MES) filters and shifts the input pitch to fit in the pitch generation do-

main of the theremin. In contrast to other thereminist robots, this robot is a low

cost platform using ‘hobby’ electronics. The interface enables non-musicians and

non-programmers to command the robot to play musical pieces of their choice.

Like some other thereminist robots, it can adapt to a changing environment,

and can be easily re-programmed to test different control schemes. Unlike other

thereminist robots which are set up to test predominantly feedforward control

schemes, our robot is set up to explore feedback as well as feedforward control

schemes. The development of this robot is a first step in creating tools for explo-

ration of aspects of human aural preferences, and human motor control during

musical performances. The benefits of understanding human motor control in

this context can extend beyond musical robots. This understanding can impact

human-robot interaction applications in general, leading to greater potential for

human acceptance of robotic systems.


Chapter 1

Introduction

A theremin is an electronic musical instrument that has a physical interface

amenable to manipulation by a simple robot. Unlike most other musical instru-

ments, the absence of physical contact in playing a theremin makes it difficult

to control using tactile feedback. The variability of the surrounding capacitive

field makes proprioceptive feedback inadequate. Together, these make aural

feedback a requirement. As such, it is a good vehicle for exploring issues having

to do with robots and music. This chapter presents the operating principles of

the theremin, and introduces the thesis project, providing some of the motiva-

tions and objectives. These are followed by an outline of the remainder of the

thesis.

1.1 Overview of the Theremin

A theremin (Fig. 1.1) is a musical instrument with a pitch control and a volume

control antenna. These antennae act as capacitors for two LC oscillator circuits

[6]. One antenna is part of the frequency oscillator circuit, and the other is


Figure 1.1: A theremin with the pitch and volume antennae labeled.

part of the volume oscillator circuit. Placing a foreign object within range of

an antenna, e.g. a hand, detunes its LC oscillator. This changes the frequency

of oscillation. Typically this change is very small; less than 1 %. For a pitch

circuit oscillating at 500 kHz, for example, this is a variation of about 5 kHz.

The 500-505 kHz signal is then converted to the audible 0-5 kHz range by

heterodyning¹ with a 500 kHz signal. Typically, a theremin is designed to give

a pitch range of 3-5 octaves. A different LC oscillator circuit controls the volume

such that the more it is detuned, the weaker the signal gets, and the lower the

volume. Similarly, the more the pitch antenna is detuned, the higher the output

frequency (Fig. 1.2). Therefore, two hands are simultaneously used to play a

theremin, one for each antenna.
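The frequency arithmetic above can be checked with a short sketch. The function name and the detuning steps are illustrative assumptions. Only the 500 kHz reference and the "less than 1 %" detuning figure come from the text.

```python
def heterodyne(f_pitch_hz: float, f_ref_hz: float) -> float:
    """Return the audible difference frequency of two oscillators, in Hz."""
    return abs(f_pitch_hz - f_ref_hz)


F_REF = 500_000.0  # fixed reference oscillator in Hz (value from the text)

# Detune the pitch oscillator by 0 %, 0.5 % and 1 % (illustrative steps).
for detune in (0.0, 0.005, 0.01):
    f_pitch = F_REF * (1.0 + detune)
    print(f"{detune:.1%} detuning -> {heterodyne(f_pitch, F_REF):.0f} Hz audio")
```

A 1 % detuning of a 500 kHz oscillator therefore maps to the top of the 0-5 kHz audible band, while an undisturbed oscillator produces no audible tone.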

Interestingly, the hand movement at one antenna does not interfere with

the other antenna. This is because both antennas produce electric fields with

orthogonal polarizations [21]. A monopole antenna is used for the pitch and

a loop antenna is used for the volume. Furthermore, theremins are nonlinear.

The distance between two consecutive notes is different in each octave [21].

Lastly, the pitch antenna can easily be interfered with. Theremin players

¹ Taking the difference with another frequency.

[Figure 1.2 plot. x-axis: motor position (dimensionless), −350 to 0; y-axis: frequency (Hz), 0 to 2000; curves: without disturbing body, with disturbing body.]

Figure 1.2: Variation of pitch (frequency) with distance from the pitch antenna of the theremin. Note the jump in frequencies further away from the antenna in the presence of a disturbing body. The disturbing body here is a 150 lb person standing about 30 cm (12 in) from the pitch antenna.

have reported interference effects in a radius of about 3 m (118 in) [21]. Any

moving object near the theremin will alter the pitch (Fig. 1.2). Since the

volume antenna is controlled by vertical movements, it is much more difficult to

interfere with. There is a large number of different materials, sizes, shapes, and

even environmental conditions that may modify the electrostatic fields around

the antennae, and hence the capacitances of the oscillators. Therefore, it is very

difficult to develop a generalized model of the theremin which accounts for all

the variables.

Our theremin is a Moog standard Etherwave model. It has 4 rotary control

knobs for controlling volume, pitch, wave form, and brightness. The volume

control knob controls the response of the volume antenna. Along with the out-

put volume, it also controls the location of the point at which the volume goes


down to zero. The location of this point can be set so that it is on the antenna

(requiring contact) or some distance away from the antenna (not requiring con-

tact) depending on how the user wants to set the volume to zero during play.

The pitch knob controls the response of the pitch antenna, i.e. the size of the

useful pitch control workspace. The wave form knob controls the shape of the

audio wave form. The brightness knob controls the “character” of the sound.

For our purposes, we set the volume knob to get the lowest possible signal am-

plitude with nothing near the volume antenna. The pitch control knob was

set to get the largest possible pitch workspace. The wave form and brightness

knobs can be set to any pleasant sounding positions as determined by the user.

Testing indicated a reliable frequency variation from approximately 132 Hz to

2000 Hz, which corresponds to the range C♯3 to B♮6 on the musical scale.
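As a rough check on this range, the sketch below finds the lowest and highest notes that lie inside a frequency interval. It assumes twelve-tone equal temperament with A4 = 440 Hz and MIDI-style note numbering. The thesis does not state its tuning reference, and the function names are hypothetical.

```python
import math
from typing import Tuple

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def freq_to_midi(freq_hz: float) -> float:
    """Fractional MIDI note number, assuming A4 = 440 Hz = note 69."""
    return 69.0 + 12.0 * math.log2(freq_hz / 440.0)


def midi_to_name(midi: int) -> str:
    """Scientific pitch name, e.g. 60 -> 'C4'."""
    return f"{NOTE_NAMES[midi % 12]}{midi // 12 - 1}"


def playable_notes(f_low_hz: float, f_high_hz: float) -> Tuple[str, str]:
    """Lowest and highest equal-tempered notes lying inside [f_low, f_high]."""
    lowest = math.ceil(freq_to_midi(f_low_hz))
    highest = math.floor(freq_to_midi(f_high_hz))
    return midi_to_name(lowest), midi_to_name(highest)


print(playable_notes(132.0, 2000.0))  # ('C#3', 'B6')
```

Under these assumptions, 132 Hz falls just above C3, so C♯3 is the lowest note fully inside the range, and B6 is the highest note below 2000 Hz.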

1.2 Motivation

Humans can have a tendency to prefer slightly distorted music over perfectly

produced music. In an article on music production equipment, Poss [30] points

out that distortion in music is like seasoning in food, and that it makes music

palatable. He argues that most guitarists use equipment based on older technology

because such equipment has a certain distortion which is pleasing to the

human ear. Another article on the resurgence of vinyl records makes a similar

point. Vinyl records produce a warmer and more nuanced sound as compared

to CDs and digitally stored audio [9]. Quoting a music collector, “Most things

sound better on vinyl, even with the crackles and pops and hisses.” Similarly,

Yochim et al. [50] conclude that vinyl is viewed as human in its sonic imperfec-

tions. Vinyl enthusiasts seem to describe its sound as “somehow more alive.”


Record store owners have claimed that the scratches and pops often asso-

ciated with the vinyl sound are all part of the “warmth” that vinyl offers [28].

An article looking at the recent growth of the vinyl industry in the UK mentions a

similar preference for the warmth of vinyl, and its “being there” quality. It is

said to provide a unique listening experience [16].

It seems that there is something about the distortion and imperfection that

is attractive. Perfection feels less natural and less attractive to humans. Perhaps

this has something to do with being human and being able to relate better to

what we perceive as human-like. Unlike those of precision manufacturing robots, our

movements are not perfectly accurate. Our speech does not sound exactly the same,

even if we repeatedly utter the same phrase. When we write, any particular

letter of the alphabet is slightly different every time it is written down. We label

repeated movements which are too precise, monotonous, or “unnatural” as

“robotic.” We are frequently able to distinguish between the output

of a “robotic” text-to-speech converting program and a “natural” recording

from a human. What is it that distinguishes “natural” from “robotic”? Can we

identify these distinguishing features and build them into robots to make them

more “natural” and acceptable to humans? Is it really a difference between

“natural” and “robotic” actions, or is our perception biased? If so, then is it

possible to manipulate this bias in perception? Creating a testbed to explore

these questions is one motivation behind this thesis.

A second motivation for this thesis is to explore human-robot interfaces

(HRI). As robots continue to assume an ever increasing number of roles in our

daily lives, there has been a surge in demand for more intuitive interfaces. It is

fascinating to see the variety of creative HRI solutions now available. The Nao

[2] humanoid robot, for example, has voice recognition and speech capabilities


to communicate in 8 languages. Another example is an interface that allows

a musical robot to understand non-verbal communication from a human flute

player [17]. The robot starts playing a theremin when it senses a visual cue from

the human. During play, the robot uses beat tracking and gesture recognition

to predict instantaneous tempo, and synchronizes its own play on the theremin.

HRI for musical robots can be interesting because of the variety of ways in

which music can be produced, perceived, and stored. We are interested to see

how music can be used as a mode of interaction with robots.

Another motivation is based on Science, Technology, Engineering, and Math (STEM)

education. STEM education is a vital tool for developing and grooming the

innovators, inventors, and professionals of tomorrow. Meanwhile, as robotics-related

technologies continue to become more affordable and ubiquitous, many

STEM activities based on these technologies have come into being. Most of

these activities actively engage and challenge the imagination and resourceful-

ness of young minds. These minds get educated in multiple disciplines at once;

mechanical engineering, electrical engineering, computer science, mathematics,

management, and even aesthetics. Unfortunately, despite the widespread inter-

est, there are still not as many students attracted to STEM fields as would be

desirable. In 2008-09, no STEM discipline made it to the four most popular

majors in the US [38].

Music has long been a source of entertainment and attraction for adults and

youngsters alike. If music can be incorporated in STEM education activities,

then an additional number of otherwise uninterested youngsters can be motivated

to become involved. This is an additional motivation for this research:

to add the flavor of music to robotics-related STEM activities.


1.3 Objectives

The primary objective of this thesis is to develop a set of tools to facilitate

progress in exploring the areas of motivation outlined in the previous section.

This maps on to the following specific objectives:

1. Develop a set of robotic manipulators to play a theremin.

2. Develop a pitch detection system for feedback control.

3. Develop a robot system that can be used to test different control schemes.

4. Implement a control scheme to demonstrate the robot system’s ability to

be programmed to employ a given control scheme to play a theremin.

5. Develop a user interface for commanding a music playing task to the robot

system.

To make the robot practical for the STEM objective, making the robot

system as low cost as possible is a secondary objective. Keeping the system low

cost makes it more readily accessible for schools and STEM activity platforms.

1.4 Overview of Thesis

The layout of this thesis is as follows. Chapter 1 introduces the topic of the

thesis. In Chapter 2, some related work on control systems and pitch detection is

reviewed. Chapter 3 covers the design and development of the thereminist robot

system. This includes the control scheme, pitch detection sensor, mechanics, and

the human-robot interface. Chapter 4 outlines the melody extraction scheme

(MES) used to convert human audio input to a melody, i.e. a set of music playing


commands for the robot. Design requirements for the various subsystems are

presented in Chapters 3 and 4. Chapter 5 reports results from experiments

with the robot system. These experiments are intended to verify the objectives

identified in Chapter 1, and to validate that system performance meets the

requirements presented in Chapters 3 and 4. Chapter 6 discusses the significance

of the results and outlines possible future work.


Chapter 2

Related Work

One of the objectives of this thesis is to implement a control scheme for the

theremin-playing robot. It is helpful to review control schemes which

have been implemented in similar scenarios. Since we are dealing with a music

playing task, systems dealing with music and pitch will be considered. We will

seek inspiration from current theories on how humans perform similar tasks,

and perhaps in the process gather some more insight on the human processes

involved. Some terms used in the following sections are defined as follows.

Settling time is defined as the time taken to converge to a pre-defined pitch

error. Overshoot is defined as the maximum deviation (or pitch error) from the

intended pitch, measured from the instant when the error is first reduced to

zero. Error is defined by magnitude, and not by direction or sign. At a given

instant, it is typically expressed relative to the target pitch at that instant, i.e. as a

fraction or percentage of the target pitch.
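These definitions can be made concrete with a minimal sketch that computes both metrics from a sampled pitch trace. Reading "converge" as "enter the tolerance band and stay inside it" is our assumption, as are the function names and the sample data.

```python
from typing import Optional, Sequence


def relative_error(pitch_hz: float, target_hz: float) -> float:
    """Error magnitude as a fraction of the target pitch."""
    return abs(pitch_hz - target_hz) / target_hz


def settling_time(times: Sequence[float], pitches: Sequence[float],
                  target_hz: float, tol: float) -> Optional[float]:
    """First time after which the relative error stays within tol, else None."""
    settled_at = None
    for t, p in zip(times, pitches):
        if relative_error(p, target_hz) <= tol:
            if settled_at is None:
                settled_at = t
        else:
            settled_at = None  # left the band again, so restart
    return settled_at


def overshoot(pitches: Sequence[float], target_hz: float) -> float:
    """Largest relative error after the pitch first reaches or crosses the target."""
    started_below = pitches[0] < target_hz
    for i, p in enumerate(pitches):
        if p == target_hz or (p < target_hz) != started_below:
            return max(relative_error(q, target_hz) for q in pitches[i:])
    return 0.0  # target never reached, so no overshoot by this definition


# Illustrative trace (not measured data): a step toward 440 Hz.
times = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
pitches = [220.0, 330.0, 430.0, 460.0, 445.0, 441.0]
print(settling_time(times, pitches, 440.0, tol=0.02))  # 0.4
print(overshoot(pitches, 440.0))
```

In the example trace, the pitch first crosses 440 Hz at the 460 Hz sample, so the overshoot is 20/440 of the target, and the error stays within 2 % from t = 0.4 onward.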


2.1 Pitch Feedback in Robots and Humans

2.1.1 Automated systems controlling pitch

There have been many approaches in systems where music generation or fre-

quency control is involved: feedforward, feedback, and hybrid feedforward-feedback control. We will look at examples of each of these in order to select one suitable

approach to demonstrate on our thereminist robot system.

Alford et al. [4] used a hybrid feedforward-feedback control system with

their thereminist robot. The robot manipulators were controlled using artificial

pneumatic muscles, whose response is much slower and less accurate than that of

electromechanical systems [4]. The hybrid control scheme involved

moving to a precalibrated position and then using a nonlinear pitch feedback

control loop to make corrections. Performance was compared to that of hu-

man subjects trying to match a note on the theremin with a note being played

to them from another source. Without any disturbance affecting the calibra-

tion, the robot demonstrated a settling time comparable to the best human

performances, and a much lower mean squared error. The ‘best’ performers in

this case were subjects with some musical experience, rather than professional

thereminists. Alford et al. [4] defined settling time as the time taken to reach a

given note, without specifying whether this had to be exact or within a certain

error margin. When the calibration was disturbed, the mean squared error ap-

proached that of human performance. Their results also indicated that with the

calibration disturbed, the overshoot became greater than that of humans. We

note that using a hybrid scheme did not prevent a degradation in performance

when a disturbance changed the feedforward model.

Using feedback control alone is another way of controlling pitch. Mizumoto


et al. [24] used a proportional-integral (PI) feedback controller for their thereminist

robot. According to Mizumoto, feedback control easily produces unintended

vibrations [22], so they switched to the feedforward scheme outlined below. However,

the fact that they had partial success in using PI control to play a theremin

indicates that there is some potential in using classical feedback control schemes

for our purpose.
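To illustrate what a classical PI pitch loop looks like, the following is a minimal simulation sketch. It is not the controller of Mizumoto et al. [24]. The exponential pitch map, the gains, and the time step are all hypothetical values chosen so that this toy loop converges.

```python
import math

F_MIN, F_MAX = 132.0, 2000.0  # reliable pitch range reported in Chapter 1


def toy_theremin(position: float) -> float:
    """Hypothetical pitch map: position 0..1 -> Hz (NOT a measured model)."""
    return F_MIN * (F_MAX / F_MIN) ** position


def pi_track(target_hz: float, kp: float = 0.05, ki: float = 5.0,
             dt: float = 0.01, steps: int = 200) -> list:
    """Drive the toy theremin toward target_hz; return the pitch trace in Hz."""
    integral = 0.0
    position = 0.0
    trace = []
    for _ in range(steps):
        pitch = toy_theremin(position)           # "measure" the current pitch
        trace.append(pitch)
        error = math.log2(target_hz / pitch)     # pitch error in octaves
        integral += error * dt
        # PI law: command = Kp * e + Ki * integral(e), clamped to the workspace
        position = min(1.0, max(0.0, kp * error + ki * integral))
    return trace


trace = pi_track(440.0)
print(f"final pitch: {trace[-1]:.2f} Hz")
```

The integral term is what removes the steady-state pitch error here. With larger gains the same loop oscillates about the target, which is consistent with the vibration problem reported for feedback control above.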

Another example of pitch feedback control is an automated guitar tuner.

Rahnamai et al. [32] used a fuzzy logic controller on their PC-based auto-

mated guitar tuner. Software detected pitch from the guitar and controlled

an actuator to physically tune the guitar. The controller design was based on

hardware characteristics and knowledge-base gathered from interviewing expert

musicians. The musicians’ knowledge was used to create fuzzy rule sets. The

pitch error of the tuner was about 5 cent²; small enough to be “acceptable by

most professional musicians”. In another application, Cohen et al. [8] developed

a frequency control system for vibration motors in vibrotactile feedback devices.

They used a proportional (P) feedback control scheme. A steady frequency was

reported to be reached in under 1 s. Both these works indicate that controlling

frequency using feedback control schemes is plausible.
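The cent values quoted in this section can be related to frequency ratios with a short sketch. It assumes the standard definition of the cent, 1200 times the base-2 logarithm of the frequency ratio. The function names are illustrative.

```python
import math


def cents(observed_hz: float, desired_hz: float) -> float:
    """Signed interval from the desired to the observed pitch, in cent."""
    return 1200.0 * math.log2(observed_hz / desired_hz)


def cents_to_percent(c: float) -> float:
    """Frequency change, in percent, corresponding to an interval of c cent."""
    return (2.0 ** (c / 1200.0) - 1.0) * 100.0


print(round(cents_to_percent(100.0), 2))  # 5.95
print(round(cents_to_percent(5.0), 2))    # 0.29
```

A 5 cent tuning error therefore corresponds to a frequency error of roughly 0.3 %, which at 440 Hz is a little over 1 Hz.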

Now we will look at some examples of feedforward control schemes. Mizu-

moto et al. [24] proposed two models for feedforward control for their therem-

inist robot: parametric and non-parametric. The parametric model used a

4-parameter nonlinear relation to fit a curve to measured pitch-position data

points. The non-parametric model described the pitch position relation using a

large number of linear interpolations. The parametric model was not affected

² A logarithmic measure of musical interval. On a 12-note musical scale, each note is 100 cent (about 5.9 %) higher than the previous note; 5 cent is about 0.3 %.


by the changing environment, used only 12 data points, but was relatively less

accurate than the non-parametric model. Although the non-parametric model

was relatively more accurate, it was limited in that it required many more data

points (40-80) which had to be measured again every time the environment

changed. Mean absolute error3 was used as the performance metric to deter-

mine accuracy. This is not a useful quantitative metric since this error depends

on the total number of notes, the playing time duration, and the relative po-

sitions of the commanded notes. In later work, Mizumoto et al. [23] proposed

an improved feedforward model which used pitch detection for updating model

parameters while playing. The root mean square of the pitch error⁴ for this

model while playing a musical piece was 72.9 “cent”. A qualitative look at their

results indicates significant oscillations about the desired note. So although the

idea of a feedforward model being updated in real-time is attractive and possible

to implement, it may be very difficult to have it perform reasonably well.
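A non-parametric feedforward model of the kind described above can be sketched as a piecewise-linear inverse lookup over calibration points, together with the RMS pitch error in cent. This is a generic illustration, not the implementation of Mizumoto et al. The calibration values and function names are hypothetical, and the cent is taken as 1200 times the base-2 logarithm of the frequency ratio.

```python
import bisect
import math
from typing import List, Sequence, Tuple


def position_for_pitch(target_hz: float,
                       calib: List[Tuple[float, float]]) -> float:
    """Piecewise-linear inverse lookup: pitch (Hz) -> motor position.

    calib holds (pitch_hz, position) pairs sorted by increasing pitch.
    """
    pitches = [p for p, _ in calib]
    if not pitches[0] <= target_hz <= pitches[-1]:
        raise ValueError("target pitch outside calibrated range")
    i = bisect.bisect_left(pitches, target_hz)
    if pitches[i] == target_hz:
        return calib[i][1]
    (p0, x0), (p1, x1) = calib[i - 1], calib[i]
    return x0 + (x1 - x0) * (target_hz - p0) / (p1 - p0)


def rms_cents(observed: Sequence[float], desired: Sequence[float]) -> float:
    """Root mean square pitch error in cent over paired samples."""
    sq = [(1200.0 * math.log2(o / d)) ** 2 for o, d in zip(observed, desired)]
    return math.sqrt(sum(sq) / len(sq))


# Hypothetical calibration table: (pitch in Hz, dimensionless motor position).
CALIB = [(200.0, -300.0), (400.0, -150.0), (800.0, -60.0), (1600.0, -10.0)]
print(position_for_pitch(600.0, CALIB))  # -105.0
```

The limitation noted in the text is visible in the structure: every entry of the calibration table must be measured again whenever the environment shifts the pitch-position relation.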

In a different attempt at using a feedforward control scheme, Wu et al. [49]

used pitch detection to build a pitch map of the theremin. They were able to

develop a linear model based on the following observation: for a one degree of

freedom (rotational) robot arm, the arm rotation angle maps non-linearly to

distance from the theremin’s antenna, and maps linearly to the pitch produced.

They used this to develop a pitch-angle model which was approximately linear.

This model was limited in that it was valid for a specific arm length for the

robot. They also assumed that the environment would not change drastically

during the calibration and playing phases. However, for practical purposes,

³ Mean of the error magnitude recorded at all the different instances during playing. The interval at which error was noted was not indicated by the authors.

⁴ 1200 × log2(p/q), where p and q denote the observed and desired pitch in Hz. According to their definition of cent, an error of 100 cent represents a half-step (one semitone) error.


changes in the pitch model due to the environment are normal and are to be

expected.

Yet another example of feedforward control is a miniature pianist robot by

Batula et al. [7]. They used pitch detection to create and update a pitch-pose

map of the piano keys. The miniature robot had a tendency to drift slightly

during play, so an update to the model was required during play. Batula et

al. [7] reported a worst case of playing 1 wrong note after every 70 correct

notes. Such a scheme would perform worse with a thereminist robot because

the pitch domain of a theremin is translated as well as transformed non-linearly

(see Fig. 1.2) so the relative distances of the pitch-producing positions are also

changed. With a piano playing robot, the domain (keyboard) may move but

the keys are still the same size and at the same relative positions. Also, pianos

produce the wrong note only after the position error reaches a critical point,

while theremins have a pitch error whenever there is any amount of position

error. Again, we see a music playing robot taking advantage of the fact that the

feedforward model is expected or assumed to change very little over time. In

a similar approach to building a feedforward model, Nakamura et al. [25] used

pitch detection to calibrate a motor-pitch map for a humanoid robotic mouth.

In contrast to feedforward schemes reviewed thus far, some automated sys-

tems have used feedforward models without using any pitch detection. These

are music playing robots which include a bagpipe playing robot [27], a musician

robot playing the piano [33], a saxophonist robot [39], and a violin⁵ playing

robot [36]. This indicates that for some instruments, the pitch-position relation

is so clearly defined that feedforward models can be developed without detecting

pitch.

⁵ This study used a specialized electronic violin.


Many musical instruments seem to maintain their pitch generation proper-

ties. Once a feedforward model is developed, it can be assumed to remain the

same, and the instrument can be controlled using a feedforward scheme. In

general, a feedforward scheme can be more accurate than a feedback scheme,

and has the potential to respond faster. However, knowing that the theremin’s

acoustic properties change with the environment, and from looking at feedfor-

ward control attempts with other thereminist robots [4, 24, 22, 49] it is apparent

that a feedback control scheme would be most suitable for our purpose.

2.1.2 Automated systems: summary

To sum up the review of automated systems, we saw examples of different types

of control schemes: feedforward, feedback, and hybrid feedforward-feedback.

Given that most musical instruments more or less maintain their acoustic prop-

erties in a given playing session, feedforward control should be a logical choice.

In some cases, such a model can be developed without even detecting pitch.

However, if the acoustic properties change with time, then some element of

feedback control becomes a requirement. Since we are only looking to select a

control scheme for demonstration purposes, it is not important which scheme

we select. However, we do know that a theremin’s pitch properties can change

with time. Therefore, a feedback control scheme seems to be more suitable for

our thereminist robot, and this is the scheme we will use.

2.1.3 Human motor control

Thus far, we have reviewed control schemes primarily in music playing auto-

mated systems, i.e. “robotic” systems. Playing a theremin is a “natural” human


motor control task. We also know that humans prefer “natural” to “robotic”

actions. Therefore, we will now review human motor control strategies with the

intention of gaining at least an inspiration for our control system. If we can

learn enough to add “natural” characteristics to the control of our thereminist

robot, its performance might be more acceptable to humans. At the very least,

it will be interesting to compare actual human motor control schemes to robot

musician control schemes.

MacKenzie [18] deduced that motor program (efferent) and feedback (affer-

ent) aspects play a mutually facilitative role in movement control. An efferent

pathway (or motor neuron) transmits an impulse out of the spinal cord. An

afferent pathway (sensory neuron) transmits an impulse to the spinal cord.

This sensorimotor interaction is representative of a feedback control loop.

Wadman et al. [48] conducted experiments with fast arm movements, measuring

electromyography (EMG) activity of the agonist⁶ and antagonist⁷ muscles.

There were three major outcomes of their study. The first was that the

agonist activity consisted of two bursts of activity separated by a period of

depressed activity. The antagonist muscles were active during this period of

depressed agonist activity. The second outcome was that the time duration of

the initial burst of agonist activity increased with the distance which had to

be covered. This increasing initial burst of agonist activity may well represent

feedback control. Alternatively, this response could also be a preset feedforward

action, with its time duration being a function of some perception information;

in this case, the distance to move. The third finding was that if the movement

was mechanically blocked without the subjects’ knowledge, then the pattern of

⁶ Agonist muscles produce a movement.

⁷ Antagonist muscles are those which oppose a movement.


the EMG activity over at least the first 100 ms was the same as in the case

of the undisturbed movement. According to Wadman et al. [48], this indicated

that the muscle activation patterns were preset over this period and

not immediately modified by proprioceptive information. This seems to indicate

feedforward control. Even if 100 ms was the reaction time of subjects, the muscles

were still activated while the perception process was still going on. These

are only weak theories which may very well be incorrect, since data was not

collected on how the perception of the blocked movement made its way to the

subject’s brain. We can see here that monitoring muscle activity alone does not

provide much information as to what control scheme is being used.

Rothwell et al. [34] investigated the relationship between automatic and

voluntary phases of human motor control in a reflex action. They defined vol-

untary as the ability to abolish at will or greatly increase by extra effort. They

conducted experiments in which subjects were asked to maintain thumb position

in response to external disturbances. Thumb position and EMG were recorded.

They found that the magnitudes of the position responses varied for the automatic

and voluntary phases, even for the same individual. In one set of trials, a

“saturation effect” was observed: as the size of the disturbance was increased,

the EMG magnitude of the automatic response remained constant. The varia-

tion in responses seems to indicate that multiple control strategies may be at

work. The saturation effect observed may be an indication of an underlying

feedback control mechanism, whereby there is an upper limit to the magnitude

of the control commands, either by design, or due to hardware limitations.

Some works have attempted to find evidence to support hypotheses on hu-

man motor control. One such hypothesis is the equilibrium-point (EP) hypoth-

esis cited by Ghafouri et al. [15]. The EP hypothesis suggests that changing

the static component of the torque-position relation forces the reflex action to

kick in, and to either settle into a new equilibrium pose or establish a new level

of static torques. This seems to be identical to a higher level controller chang-

ing the set point for a lower level controller (the lower level controller being

the reflex mechanism). In a thereminist robot system, this would be equivalent

to a higher level controller changing the commanded pitch to play (set point)

for a lower level pitch control system (reflexive in that it reactively tries to

minimize the pitch error at all times). Ghafouri et al. [15] concluded through

experiments that fast control movements may be completed without continuous

control guidance, since during the latter part of the movement, the control

signals responsible for the equilibrium shift remained constant. This agreed

with their hypothesis that in unobstructed movements, the equilibrium shifts

end at approximately the same time as peak velocity is reached. This seems

to be similar to saturation of higher level control commands, since the control

signals responsible for equilibrium shift remained constant. For a thereminist

robot, this is identical to the saturation of actuator velocity command once peak

velocity is achieved. Alternatively, this could just be representative of a control

hierarchy; the higher level controller successfully changing the set point of the

lower level controller, and then waiting for the lower level controller to complete

the motion. In a thereminist robot this would be analogous to a higher level

controller changing the commanded pitch to play (set point) and then a lower

level controller using that pitch command to actively control the actuator(s).

Reviewing work on music perception and production, Zatorre et al. [52]

cited work hypothesizing the involvement of two regions of the brain with mo-

tor control: the ventral premotor cortex (vPMC) and the dorsal premotor cor-

tex (dPMC). It has been hypothesized that the vPMC is involved in direct

visuomotor transformations and the dPMC is involved in indirect visuomotor

transformations.

Direct transformations represent a one-to-one matching of sensory features

with motor acts, e.g. matching properties of the visual object with an appropri-

ate motor gesture. For a thereminist robot, this could be identical to matching

a commanded pitch to a movement to a specific position in space which would

produce that pitch.

Indirect transformations involve motor information instructed by sensory

cues, e.g. the selection of a motor plan, and movement parameters such as

direction and amplitude. Sensory cues according to Zatorre et al. [52] represent

conditional rules as to which response to select among different alternatives.

In the context of a thereminist robot, this is identical to selecting from a set

of different arm trajectories, given the current and desired position in space to

move to.

If these hypotheses prove to be true, then both vPMC and dPMC indicate

different levels of feedforward control involved in musical performance. This

is identical to feedforward control models used by many musical robots and

automated systems as discussed in Section 2.1.1. However, even if true, we do

not know if these hypotheses also apply to musical instruments with changing

pitch models, such as a theremin. From the examples of theremin playing robots

we reviewed in Section 2.1.1, feedforward schemes seemed to be limited in their

success.

Niu et al. [26] concluded from experiments that the control of reaching

movements had three phases: suppression of proprioceptive feedback control,

followed by velocity feedback control, and then position feedback control late

in the movement. Drawing from these findings, one could hypothesize that

the first phase might very well be feedforward control. Alternatively, it could

also be saturated feedback control. The presence of velocity and position con-

trol in succession suggests that not attempting to control position earlier on

during the movement might be more efficient, or at least “natural.” These experiments

may not be entirely relevant for our purpose, since they were for

reaching movements rather than movements in a music playing task. But they

offer an interesting insight in that not only are both feedback and feedforward

control schemes used for the same task, different variables are controlled at dif-

ferent stages of the motion. As with other studies we have reviewed thus far,

we cannot deduce the details of the control schemes at work.

2.1.4 Human motor control: summary & preliminary hypothesis

Based on the limited number of studies which we have reviewed, there is much

evidence indicating that human motor control involves varying degrees of feed-

back and feedforward control. There are some schools of thought and some facts

indicating how feedback control responds. But details, such as the actual feed-

back control algorithms employed by the human motor control system, seem to

be lacking. Furthermore, these studies in general seem to focus on motor con-

trol with either visual or proprioceptive feedback. In contrast, motor actions

with musical instruments also involve aural feedback. Only Wadman et al. [48]

seem to have looked at violin and trombone playing tasks. Other than that,

Alford et al. [4] (Section 2.1.1) recorded human performance on a theremin just

to prove that their thereminist robot performed better. Before attempting to

draw any conclusions about how human motor control functions when playing

musical instruments and using aural feedback, it is necessary to conduct a more

in-depth literature review. It may even be helpful to conduct our own studies

in this area.

We can present hypotheses, however. Based on work by Wadman et al. [48],

the feedforward part of motor control is a preset action that is some function

of perception information. Similarly, from results by Niu et al. [26], not con-

trolling position earlier on during the feedback control phase may be one factor

that distinguishes “natural” from “robotic,” whereby a classical control system

may attempt to control position for the entire duration of the phase. These

hypotheses may mark the start of a scientific study. But for the purpose of this

thesis, we can opt to use them as inspiration for the development of the control

system in the next chapter.

2.2 Pitch Detection Methods

Another objective of this thesis is to develop a pitch detection system for feedback

control. We know from the previous section that at least some control

schemes (human as well as robotic) make use of feedback.

In order to perform feedback control we must have some means of observing

the system output. Since this will be a music playing robot, the system output

will be the tones, i.e. pitch at different intensities. The development of a pitch

detection system is therefore critical to the task of feedback control. Accordingly,

in this section, we will review pitch detection methods.

In general, pitch detection algorithms can be broadly divided into three

categories: time domain based, frequency domain based, and hybrid time and

frequency domain based. Rabiner et al. [31] compared representative algorithms

from all 3 categories. They concluded that in terms of computation time, the

time domain based methods performed the fastest. A faster performance means

lower computation time, or lower sampling time, or both.

The frequency domain based Fast Fourier Transform (FFT) seems to be

a popular method of choice for robot systems. Music playing robots such as

thereminists [49, 4], a miniature pianist [7], and a saxophonist [39] have used

FFT for pitch detection. Various other systems such as a dancing robot [51], an

automatic guitar tuner [32], a music interaction system [11], a music perception

system [43], and a humanoid robotic mouth [25] have also used FFT. Of these,

Wu et al. [49] and Alford et al. [4] reported a frequency resolution of 3.91 Hz at a

sampling frequency of 8 kHz. Batula and Kim [7] reported a 99.6 % success rate

in pitch identification for their piano playing robot, correct to the nearest note.

Many systems have made use of real-time pitch feedback [49, 4, 7, 32, 43, 25].

Where identified, FFT was implemented using a computer [49, 4, 7, 32, 43].

From this it seems that FFT is widely used for pitch feedback and the popular

platform for implementation is a computer.
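As a rough check on the resolution figure quoted above, the bin spacing of an N-point FFT is the sampling frequency divided by N. The sketch below assumes a 2048-point transform; that length is not stated in the text, but it is the power-of-two length consistent with 3.91 Hz at 8 kHz. The function name is illustrative.

```python
def fft_bin_spacing(sample_rate_hz, n_points):
    """Frequency spacing (Hz) between adjacent bins of an n_points FFT."""
    return sample_rate_hz / n_points


# Assumed transform length (inferred from the reported figures, not stated).
N_POINTS = 2048
SAMPLE_RATE_HZ = 8000

spacing = fft_bin_spacing(SAMPLE_RATE_HZ, N_POINTS)
# Time needed to collect one full frame of samples at this rate.
frame_time_s = N_POINTS / SAMPLE_RATE_HZ

print(f"bin spacing = {spacing:.2f} Hz, frame length = {frame_time_s * 1000:.0f} ms")
```

Under this assumption, one frame spans roughly a quarter of a second of audio, which suggests why sampling and computation time matter when the pitch estimate is used for feedback.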

Another frequency domain-based method is the Auto-correlation (AC) pitch

detection method. Mizumoto et al. [24] used an AC-based method for a therem-

inist robot system.

Yet another popular frequency domain based method is the Cepstrum Anal-

ysis (CA) [14]. This has been used on an organ playing robot [33], a flutist

robot [40], a music interaction robot [41, 42], an interactive music generation

system [44], an interactive musical system [45] and a hearing aid [35]. Of these,

Roads [33], Suzuki et al. [41, 42], Taki et al. [45] and Sakajiri et al. [35] could

detect pitch in human speech. Where mentioned, the implementation was on

computers [41, 42, 44, 45, 35] or a network of computers [33]. Therefore, CA

is another method which is widely used and traditionally implemented on a

computer.

Zero crossing (ZCR) is a time domain based pitch detection method. Some

examples of ZCR implementation include a vibro-tactile device control system

[8], a hobbyist frequency measurement instrument [19], and a manufacturer

application note [5]. More than 99 % accuracy was reported [8] or claimed

[5]. All three are implemented on microcontrollers. Being a time domain-based

method, it should perform the fastest (based on conclusions by Rabiner et al.

[31]). But the simplicity of ZCR limits it only to signals with pure frequencies

without any noise. Even if noise is taken care of, most musical instruments

generate multiple overtones, which ZCR is unable to deal with. Therefore, ZCR

is a desirable option for pure frequencies without noise or overtones.

2.2.1 Pitch detection methods: summary

To summarize, we saw examples of frequency domain based FFT, CA and

AC methods for pitch detection. We also saw some examples of the time do-

main based ZCR detection method. FFT and CA implementations have been

computer-based, while the ZCR implementation has been microcontroller-based.

If limited computation power is not an issue, then FFT and CA seem to be the

best choice. If computation power is limited, then ZCR is the better choice. For

ZCR to work, the pitch signal should be free of noise and without overtones.

For the purpose of playing a theremin, we know that there are no overtones or

noise (more on absence of noise in Fig. 3.2, Section 3.1.1) in the pitch produced.

This makes it possible to use ZCR. Also, since we are trying to make this a low

cost pitch detection sensor, the least computationally demanding ZCR is the

preferable method to implement on an inexpensive microcontroller which has limited

computation power.

2.3 Human Robot Interfaces in Musical Robots

The last objective in Section 1.3 is to develop an interface to command a music

playing task to the robot system. Such an interface can be as natural as talking

or singing to the robot, or non-intuitive like providing MIDI format files. It could

also require specific skills, such as the ability to play a musical instrument. In

this section, we will review previously implemented interfaces for musical robots.

Using aural cues is one way of commanding musical robots. Petersen et

al. [29] proposed an audio-based HRI for their flutist robot which enabled

it to detect the tempo and harmony from music being played by a human

musician. Similarly, Taki et al. [45] presented an audio-based HRI algorithm

which detected pitch, tempo, and volume. Tempo and volume were detected

to recognize instances of initiative exchange between a human musician and a

robot. Pitch was used to detect the current location on the musical score, which

was provided in advance. They defined the term “initiative” as the authority

to vary the performance tempo and to make another performer follow. Neither

of the two interfaces actually commanded a melody as a music playing task.

Some other systems have used aural cues in combination with additional cues

from humans. Suzuki et al. [41] developed an audio-based HRI system which

detected volume, pitch, and tempo from sounds produced by humans. This

included singing and clapping. It also detected forces and torques as induced

by the user, such as through pushing or shoving. In addition to that, it detected

color information through cameras. All this information was used to generate

music based on predefined schemes. This was an interactive music generation

system. So there was no way for the users to command it to play a specific

melody.

Similarly, Lim et al. [17] commanded their robot accompanist system through

visual and aural cues. By observing a human flutist, the robot could detect the

tempo using both visual and aural information. Using visual cues alone, it could

detect performance start and stop times. This system too was not commanded

by users to play a specific melody.

Another system which used multiple cues was a pianist robot which used

vision to read printed musical scores and then play them out [33]. It could

also understand verbal requests from the audience. To achieve this, speech

recognition was performed by using linear predictive and cepstrum analysis.

Another form of HRI which it demonstrated was that it could track a human

singing and adjust its own playing tempo to match it. It used 5 narrowly tuned

band pass filters to detect fundamental pitch from the singer’s voice.

Esnaola and Smithers [11] presented a music based HRI which did not require

speech recognition, or printed musical scores; playing a musical instrument was

optional. They developed a musical language in which the alphabet consisted of

10 musical notes which could either be whistled or played from an instrument.

It used FFT for pitch detection and kept track of the ambient noise levels. That,

coupled with a very limited set of expected musical notes, proved to be very

reliable. The system was tested successfully with classical music as ambient

noise, as well as in environments with large groups of people.

Using MIDI notes is another way of commanding robots to play musical

pieces. Alford et al. [4] interfaced with their thereminist robot by playing

on a keyboard, which transferred note information in the General MIDI (GM)

format. Another thereminist robot by Wu et al. [49] was commanded by simply

providing it with a MIDI file. Similarly, Batula and Kim [7] commanded their

miniature pianist robot using a MIDI format customized for the limitations of

the robot.

2.3.1 Human Robot Interfaces: summary

To summarize, there are many HRI implementations which require musical instrument

playing skills from a user. Some interfaces make use of simple visual

cues or physical interactions. Speech recognition and using vision to read musical

scores are another option; they do not require specific skills on the part of a user.

Another interface offers both skilled and unskilled interfaces by accepting musical

instrument playing as well as whistling as commands. Yet another way is to

provide MIDI format files, either directly or through a musical instrument.

Vision and speech-based interfaces can require a lot of developmental time,

which is beyond the scope of this project. Physical interactions like touching,

bumping, tapping, etc. may not be the fastest methods of commanding a musi-

cal piece to a robot. Directly creating MIDI format files is not intuitive. Using

a musical instrument with MIDI or some other format is feasible, but would

add to hardware costs.

This leaves whistling as the least demanding interface. Pitch detection is

relatively easy to implement, and the speed of interaction can be as fast as

whistling the tune to the robot. Whistling is more intuitive as compared to

writing MIDI files, and does not require a musical instrument or MIDI interfacing

hardware. Therefore, an interface which accepts whistling as input is

most suitable for our purpose.

Chapter 3

Development of a Thereminist Robot System

This chapter covers the development of the individual subsystems for the

thereminist robot system (Fig. 3.1). The ZCR scheme has been used to provide

pitch sensing for pitch control. Two arms, 2 degrees-of-freedom (DOF) each,

were developed to control the pitch and volume by interacting with the two

theremin antennae. A proportional-derivative (PD) control scheme is used to

control the pitch of the theremin. A human robot interface (HRI) accepts com-

mands from a user in the form of a whistled musical piece, and converts it to

a melody data structure recognized by the control system. The music playing

algorithm has been implemented as a higher level control system. The pitch

sensor has been implemented on an 8-bit microcontroller, the Atmel ATmega

32U4. The control system has been implemented on a 32-bit microcontroller,

the mbed LPC1768. The HRI has been implemented on a laptop computer

using the Processing language.

Figure 3.1: Thereminist robot system with its various subsystems.

3.1 Pitch Sensor

One of the design objectives mentioned in Section 1.3 is to develop a pitch de-

tection system good enough for realtime feedback control. This section presents

the design of a pitch sensor for this purpose.

3.1.1 Pitch sensor design requirements

Factors that dictate the design of the pitch sensor are range, accuracy, response

time, and the physical characteristics of the signal. Range is important be-

cause we need to sense all the possible pitches for this application. Accuracy

is important because we will use this sensor to control pitch. Response time is

important because we need to control pitch in realtime. The faster the pitch can

be measured, the better the realtime performance. Lastly, physical characteris-

tics of the pitch signal dictate how the pitch signal is interpreted and whether

any signal conditioning is required.

Ideally, the sensor range should cover the entire audible frequency range,

which is about 20 - 20,000 Hz. But this is unnecessary if the theremin cannot

produce pitches in such a large range. Our theremin indicated a frequency range

of about 132 - 2000 Hz, or C♯3 to B6 (Fig. 1.2). With a typical disturbance⁸, this

range decreased to about 330 - 2000 Hz, or E4 to B6 (Fig. 1.2). Given that there

is no defined set of possible musical pitches for the sensor to work with, we con-

sidered only pitches available on the theremin. Since the disturbances change

the pitch domain size, we only considered the pitches consistently available on

the theremin. Although we observed the range E4 - B6 (Fig. 1.2) to be consistently

available, even with a disturbance, this was one very specific disturbance

in a particular environmental condition. Under different conditions, the lowest

pitch on the available range can vary. It was undesirable to allow the robot

to attempt to produce pitches which were not available on the theremin; this

has potential to damage the robot. So we did not risk designing for the lowest

available note. To maintain a safety margin, we limited the lowest pitch for the

playing domain to the start of the lowest full octave available. This resulted in

a pitch domain spanning octaves 5 and 6: a pitch range of 523.25 Hz through

1975.53 Hz (C5 through B6). Physically, this pitch range typically spanned a

distance of about 4.5 inches (11.5 cm) for the robot’s pitch manipulator⁹.
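The end points of this playing domain can be reproduced from the standard equal-temperament relation. The sketch below is illustrative only: it assumes A4 = 440 Hz and the common MIDI numbering in which C4 is note 60, neither of which is stated explicitly in the text, and the function names are hypothetical.

```python
A4_HZ = 440.0   # assumed reference tuning
A4_MIDI = 69    # MIDI number of A4 under the C4 = 60 convention (assumed)

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def note_frequency(midi_note):
    """Equal-temperament frequency in Hz for a MIDI note number."""
    return A4_HZ * 2.0 ** ((midi_note - A4_MIDI) / 12.0)


def note_name(midi_note):
    """Scientific pitch name, e.g. 72 -> 'C5'."""
    octave = midi_note // 12 - 1
    return f"{NOTE_NAMES[midi_note % 12]}{octave}"


# Octaves 5 and 6: MIDI notes 72 (C5) through 95 (B6), 24 semitones in total.
playing_domain = [(note_name(n), round(note_frequency(n), 2)) for n in range(72, 96)]

print(playing_domain[0], playing_domain[-1])
```

Under these assumptions the first and last entries match the 523.25 Hz and 1975.53 Hz limits given above.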

The second requirement is the accuracy. For this, we look at humans for

inspiration. The pitch discrimination threshold in humans seems to be about

1 %¹⁰. Specifically, when actively listening, a less than 1 % mistuned note

can be detected 10 % of the time by non-musicians, and 80 % of the time by

⁸ Typical disturbance is defined as a scenario in which a person approaches the pitch antenna. This reduces the pitch domain (see Fig. 1.2).

⁹ As determined experimentally with the manipulator developed in Section 3.2.2.

¹⁰ % error in frequency in the context of this thesis is defined as $\frac{|\text{frequency difference}|}{\text{frequency intended}} \times 100$.

musicians [46]. Pitch discrimination depends on several factors, and there are

many studies which have covered this topic. For example, Demany and Semal

[10] investigated the influence of memory on auditory perception, Marmel et al.

[20] investigated how tonal expectations affect pitch perception, and Tervaniemi

et al. [47] investigated how musicians and non-musicians differed in their pitch

perception. We are not developing this robot to entertain an audience with

what they perceive as a professional musical performance. Therefore, we select

1 % error as the design requirement for our robot system. To control pitch using

the system, the sensor must be more accurate than 1 %. We select 0.5 % sensor

accuracy to allow for some margin of error for other parts of the system.
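To make the two thresholds concrete, the following sketch evaluates the percent-error definition used in this thesis for a couple of made-up readings around C5. It also converts an error to cents, a unit the text does not use, purely for orientation. The function names and sample readings are illustrative assumptions.

```python
import math

SYSTEM_LIMIT_PCT = 1.0   # design requirement for the whole system
SENSOR_LIMIT_PCT = 0.5   # tighter requirement selected for the sensor


def percent_error(measured_hz, intended_hz):
    """|frequency difference| / frequency intended * 100."""
    return abs(measured_hz - intended_hz) / intended_hz * 100.0


def cents(measured_hz, intended_hz):
    """Signed pitch deviation in cents (1200 cents per octave)."""
    return 1200.0 * math.log2(measured_hz / intended_hz)


C5_HZ = 523.25
for measured in (525.0, 530.0):   # hypothetical readings
    err = percent_error(measured, C5_HZ)
    verdict = "within" if err <= SYSTEM_LIMIT_PCT else "outside"
    print(f"{measured:.1f} Hz: {err:.2f} % error, {verdict} the 1 % requirement")

print(f"a 1 % sharp note is about {cents(1.01, 1.0):.1f} cents")
```

A 1 % deviation corresponds to roughly a sixth of a semitone, which gives a feel for how tight the selected requirement is.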

Figure 3.2: Theremin audio signal as viewed on an oscilloscope. It is quite similar to a sine wave.

The third requirement is the response time. For our purpose, this does not

mean a specific time constraint, but that given the choice, the method with a

faster response time is preferable. Feedback control performance is sensitive to

the response time of the pitch sensor, i.e., the computation time of the pitch

detection algorithm. This translates to a preference for spending the minimum

feasible time on measuring and computing. This preference affects the selection

of the pitch detection algorithm.

The last requirement is the ability to deal with the physical characteristics

of the theremin’s audio signal. It is not symmetric about the horizontal axis and

closely resembles a sinusoidal wave form (see Fig. 3.2). At a typical low volume

knob setting¹¹ without anything close to the volume antenna, the pitch signal

cycles between -320 mV and 640 mV. This signal is clearly audible. When

a physical object is close to the antenna, the volume is lowered. The lowest

amplitude signal that can be reliably obtained cycles between about -80 mV

and 100 mV. This lowest amplitude signal is barely audible. Since we are

playing music, only the clearly audible signal is relevant. Therefore, the sensor

must be able to detect at least the audible voltage signal (-320 mV to 640 mV).

To summarize, design constraints are as follows: a frequency range of at least

523.25 - 1975.53 Hz (C5 - B6), at least 0.5 % accuracy, and ability to process

an audio sinusoidal-like signal (-320 to 640 mV) shown in Fig. 3.2. Given the

choice, a pitch detection algorithm with the shortest response time is preferred.

3.1.2 Pitch sensor design

An early attempt while exploring design ideas was to use a guitar tuner as a pitch

sensor (Fig. 3.3). This tuner featured two arrays of LEDs as indicators: the

pitch indicator, and the tune indicator. The pitch indicator showed the pitch,

correct to the nearest standard pitch on a 12-note musical scale. The pitch

information was incomplete in that only prime and accidental information was

provided. For example, if presented with C5 and C6 in succession, it would only

indicate that C was detected, without distinguishing between C5 and C6. The

¹¹ The lowest possible audible volume setting. See Section 1.1 for details on theremin control knobs and settings used.

Figure 3.3: Guitar tuner which was hacked for use as a first pitch sensor

second array of LEDs (the tune indicator) indicated whether the pitch was in

tune with the nearest identified pitch. It had 3 indicator LEDs, which could

indicate whether a pitch was in tune, flat, or sharp. Additional information was

provided by the flat and sharp indicator LEDs. These LEDs cycled between on

and off states. The rate of cycling indicated the degree of flatness or sharpness.

We adapted the tuner as a pitch sensor by hacking into the control signal

lines driving the LEDs in the pitch and tune indicators. Each signal line drove

a single LED. When an LED lit up, indicating a signal, the potential difference

on the signal line with reference to the common line dropped to -4.3 V. For

each of the two indicators, only a single LED could light up at a given instant.

Adapting this system as a sensor was done using two stages. The first stage

converted the signals to positive voltages using inverting Op-Amps. The second

stage was a potential divider that was designed to produce voltage as a linear

function of the signal/LED number. Fig. 3.4 shows the measured voltage output

for the pitch indicator. A similar voltage divider was used to monitor the tune

indicator.

There were 3 issues that forced us to abandon this approach. First, this

[Plot: voltage versus signal number; signal numbers 1 to 12 on the horizontal axis, 0 to 4 on the vertical axis.]

Figure 3.4: Output of first pitch sensor after the voltage divider stage indicating the pitch detection. Voltage is a function of the signal number activated, i.e. one of the 12 indicator LEDs which light up. Each LED corresponds to one of 12 notes on a 12-note musical scale. A similar voltage divider was used to monitor the tuning indicator LED array.

set up did not indicate any octave information; a time history of the pitch

detected and direction of pitch arm motion had to be recorded to keep track of

transitions between octaves. Secondly, the tune indicator signal cycled between

on and off states when not in tune. This cycle rate was low enough to be noticeable

by the human eye, and could be as low as approximately 1 Hz. This means that

to get pitch information for non-standard pitches, there was a variable delay

in measuring the tuning information from the sensor. This was undesirable for

feedback control as it could hold up a potential control scheme waiting for sensor

data by several hundred milliseconds. Thirdly, the tradeoff between resolution

and response time was poor in every aspect. Using only the tune indicator

provided limited resolution: only 2 measurement points between 2 standard

pitches. For example, between C5 and C♯5, we could only detect C5-too-sharp,

and C♯5-too-flat. To increase resolution, if we tried to measure the cycling rate of

the tune indicator signal, then that required multiple readings and increased the

measurement time delay even further. To decrease the response time, if we only

detected the in-tune signal from the tune indicator, then we would have been

left with only the ability to detect exact frequencies, essentially eliminating all

resolution information between two standard pitches. Therefore it was decided

that this idea was not feasible for feedback control. If we need to measure time

intervals for high-low transitions to increase accuracy, then we might as well

directly measure the time period of an oscillating audio signal. This led us to

explore alternative pitch detection methods and develop our own sensor.

With reference to implementing Zero Crossing Rate (ZCR), the audio signal

from the theremin has a negative part to it, lacks a clean “edge” and has

a very small amplitude (see Fig. 3.2). In contrast, microprocessors and microcontrollers

typically tolerate voltage inputs from 0 V through 5 V (or 3.3 V

which is becoming more common now). Microcontrollers and integrated circuits

(ICs) performing counting/timing operations typically require step changes in

the input signal in order to positively register quick voltage (logic) level changes.

Typically, these changes must be on the order of 1 V. Therefore, a signal condi-

tioning circuit was designed to convert the audio signal to a square wave varying

from 0 - 5 V and centered around 2.5 V. The circuit schematic is shown in Fig.

3.5. The first stage positively clamps the audio signal so that it oscillates around

2.5 V. The second stage is a voltage comparator with a reference set to 2.5 V, so

that it gives a high (5 V) when the signal is higher than 2.5 V, and a low (0 V)

when the signal is lower. The last stage provides failsafe protection for any microcontroller

or processor input port. In the event of overvoltage or overcurrent,

this stage would break down, protecting components after it.
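The two active stages described above can be imitated numerically. The following is a minimal sketch, not the circuit itself: it assumes an ideal sine swinging between -320 mV and 640 mV, models the clamp as a pure level shift that centres the signal on 2.5 V, and omits the protection stage. The function names, the start phase, and the 200 kHz simulation rate are illustrative assumptions.

```python
import math

V_MIN, V_MAX = -0.32, 0.64   # audible signal limits from Section 3.1.1 (volts)
V_REF = 2.5                  # comparator reference (volts)
V_MID = (V_MAX + V_MIN) / 2.0


def theremin_sample(t, freq_hz):
    """Idealised theremin output at time t (assumed sine, arbitrary start phase)."""
    amplitude = (V_MAX - V_MIN) / 2.0
    return V_MID + amplitude * math.sin(2.0 * math.pi * freq_hz * t - 1.0)


def condition(v_in):
    """Stage 1: shift the signal to swing about V_REF. Stage 2: compare to V_REF."""
    shifted = v_in - V_MID + V_REF
    return 5.0 if shifted > V_REF else 0.0


def count_rising_edges(freq_hz, duration_s, sample_rate_hz=200_000):
    """Count 0 V to 5 V transitions of the conditioned signal."""
    edges = 0
    previous = condition(theremin_sample(0.0, freq_hz))
    for i in range(1, int(duration_s * sample_rate_hz) + 1):
        current = condition(theremin_sample(i / sample_rate_hz, freq_hz))
        if previous == 0.0 and current == 5.0:
            edges += 1
        previous = current
    return edges


print(count_rising_edges(1000.0, 0.01), "rising edges in 10 ms at 1 kHz")
```

Each rising edge of the conditioned output marks the start of one audio period, which is the event a timer or counter input can register.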

Figure 3.5: Filter for pitch sensor. Converts low amplitude audio signal to positive square waves centered around 2.5 V.

There is a choice between frequency domain and time domain-based pitch

detection methods. There is a requirement to perform realtime pitch feedback

control, i.e., the fastest practical pitch detection algorithm is desirable. As

discussed in Chapter 2, the time domain based methods are faster but limited

to working with pure frequencies without noise. The theremin audio signal does

not have any noticeable noise (see Fig. 3.2). Therefore, a time domain-based

method is the best choice. The Zero Crossing Rate (ZCR) detection principle

was the time domain based method discussed in Section 2.2. It was also the

method which was found to be most suitable in that section. Therefore, this is

the method we selected to use for the sensor.

A simple method of error rejection was added to the pitch detection algo-

rithm: abnormally high readings (above 5000 Hz) were rejected and the reading

was repeated.
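The rejection rule can be sketched as a small retry loop. This is an assumption-laden illustration: `read_once` stands in for whatever routine returns a single frequency reading, and the cap on attempts and the `None` return are hypothetical additions, since the text only states that a rejected reading was repeated.

```python
MAX_VALID_HZ = 5000.0   # readings above this are treated as errors (from the text)


def read_pitch(read_once, max_attempts=10):
    """Return the first plausible reading, repeating the measurement if needed.

    read_once: hypothetical callable returning one frequency reading in Hz.
    max_attempts: assumed safety cap, not part of the original description.
    """
    for _ in range(max_attempts):
        reading = read_once()
        if reading <= MAX_VALID_HZ:
            return reading
    return None   # assumed behaviour when every attempt is rejected


# Usage with a canned sequence standing in for the sensor.
samples = iter([7200.0, 880.0])
print(read_pitch(lambda: next(samples)))
```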

3.1.3 Pitch sensor implementation

There are two ways of implementing ZCR. One way is to measure the number

of signal oscillations for a fixed time interval. Another is to measure the period

of a fixed number of signal oscillations. We implemented ZCR using this second

way; measuring the period for a single oscillation is faster than waiting for a

fixed time duration for a number of oscillations.
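The difference between the two approaches can be illustrated with idealised arithmetic. The sketch below is not the sensor firmware: it assumes a perfectly periodic input, a 10 ms counting window chosen only for illustration, and the 2 MHz timer clock mentioned later in this section. The function names are hypothetical.

```python
CLOCK_HZ = 2_000_000   # timer clock used for period measurement


def estimate_by_counting(true_hz, window_s):
    """Method 1: count whole oscillations inside a fixed time window."""
    cycles = int(true_hz * window_s)
    return cycles / window_s


def estimate_by_period(true_hz, clock_hz=CLOCK_HZ):
    """Method 2: count clock ticks spanning a single oscillation."""
    ticks = round(clock_hz / true_hz)
    return clock_hz / ticks


c6 = 1046.5   # approximate frequency of C6 in Hz
print(f"counting over 10 ms : {estimate_by_counting(c6, 0.010):.2f} Hz")
print(f"single-period timing: {estimate_by_period(c6):.2f} Hz")
```

In this idealised setting the counting method can only resolve multiples of 1/window (100 Hz for a 10 ms window), whereas timing one period takes under 1 ms at this pitch and is limited by the clock resolution instead.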

Initially, ZCR was implemented using a 555 timer, binary coded decimal

counters, and shift registers. A microcontroller was used to read the clock from

this circuit and transmit the clock value over 5 V TTL serial. Although it

worked reliably, it took up a lot of space, was difficult to debug and modify, and

it seemed inefficient to use a microcontroller just for serial transmission.

Eventually, ZCR was implemented on an 8-bit microcontroller (ATmega

32U4) running at 16 MHz. Using a 2 MHz clock, the microcontroller counts the

number of clock pulses between two consecutive rising edges of the incoming

signal. The count is stored in a 16-bit integer and transmitted over 5 V TTL

serial at 57600 bps. The transmission packet consists of 3 bytes: a header

byte and the two data bytes. A 1 ms delay follows each packet transmission.

With a 3-byte packet (30 bits at 8N1) per reading, it takes about 0.5 ms to transmit 1

packet. With a 1 ms delay, this can ideally provide pitch feedback at a rate of

about 667 Hz. This was deemed more than sufficient for this project. The frequency

is computed at the receiving end using the count and the clock frequency.
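A minimal sketch of the receiving-end computation follows. The 2 MHz timer clock and the 5000 Hz rejection threshold come from the text; the header byte value and the byte order of the count are not given there and are assumptions, as are the function names.

```python
TIMER_CLOCK_HZ = 2_000_000  # timer clock stated in the text
MAX_VALID_HZ = 5000.0       # readings above this are rejected, as stated earlier
HEADER_BYTE = 0xFF          # ASSUMPTION: the header value is not given in the text


def parse_packet(packet):
    """Return the 16-bit timer count from a 3-byte packet, or None if malformed.

    ASSUMPTION: the high byte is sent first; the real byte order is not stated.
    """
    if len(packet) != 3 or packet[0] != HEADER_BYTE:
        return None
    return (packet[1] << 8) | packet[2]


def count_to_frequency(count):
    """Convert clock pulses per signal period into a frequency in Hz."""
    if count is None or count <= 0:
        return None
    frequency_hz = TIMER_CLOCK_HZ / count
    if frequency_hz > MAX_VALID_HZ:
        return None  # abnormally high reading: the caller should read again
    return frequency_hz
```

With a 16-bit count and a 2 MHz clock, the lowest frequency that can be represented is 2,000,000 / 65,535, or about 30.5 Hz, which is well below the two octaves of interest.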

The sensor system was tested with a range of sine wave frequencies using a

frequency generator. Since only octaves 5 and 6 were expected to be played,

only that frequency range was tested. The test signal minimum and maximum


[Figure 3.6 plot. Horizontal axis: actual frequency (Hz). Vertical axis: mean sensor reading error (%). Legend: mean, 2σ.]

Figure 3.6: Pitch sensor validation test results at various frequencies. Number of samples is 5992, 10206, 10563, 10617, 10618, and 10618, for test frequencies of 400 Hz through 2400 Hz.

voltages were -300±20 mV and 640±20 mV, respectively. This corresponds to

the audible signal mentioned in Section 3.1.1. Results are illustrated in Fig.

3.6. Mean absolute error was always less than 0.2 %. For each frequency, about

6,000 to 10,000 samples were taken. Informal tests were carried out with the

low amplitude form of the audio signal (see footnote 12), and the sensor was

able to provide reliable results.

3.2 Mechanics

One of the objectives mentioned in Section 1.3 is to develop a set of robotic

manipulators with adequate degrees of freedom to play a theremin. This section

Footnote 12: Cycling from -80 mV to 100 mV. See Section 3.1.1.


covers the design of a robot platform with such a set of manipulators.

3.2.1 Mechanical design requirements

As mentioned in Section 3.1.1, the pitch range of interest can be controlled

over a distance of about 11.5 cm (4.5 in). So some sort of manipulator

or arm is required for pitch control. It can be a linear or rotational degree of

freedom (DOF). Since the arm needs to be pointed towards the pitch antenna

despite the placement of the robot, a rotational degree of freedom is required

for this adjustment. Similarly, a volume control arm is required to control the

volume during play. Just like the pitch control arm, it is required to be pointed

correctly relative to the theremin, depending on the location of the antenna.

Instead of using these additional DOFs to point the arms at the antennae, we could have

used a moving base with a differential or omni-wheel drive system. But in

such a system, positioning would be more difficult since the pose of both arms

would have to be considered simultaneously; this complicates the kinematics.

In using additional arm DOFs to control arm pose, either arm can be controlled

independently, which simplifies the kinematics involved.

With reference to pointing the arms correctly, it must be emphasized here

that it is not the intention for this design to be optimized to reach the antennae

from the largest possible number of poses. Rather, the intention is to handle

minor variations in pose.

Experiments with earlier versions of the robot involved using pull-pull cables

to control the rotational DOFs on the arms (Fig. 3.7) because such a system

offered a rapid response time. Those arms proved very difficult to control, and

their movement was very non-linear. There was very noticeable backlash and


Figure 3.7: An early version of the robot using pull-pull cables to control the arms. This system proved very difficult to control and was not pursued further. However, the idea of not loading any actuator with another actuator was retained.

hysteresis. In assuming some poses, the loads on the actuators exceeded their

capacities. These factors required us to design an additional control scheme

for the kinematics, demanding additional computation resources. Also, there

was no guarantee that the actuator loads would not be excessive. This discour-

aged us from pursuing an approach involving a system of cables, and biased

us towards pursuing systems with simpler kinematic models and no apparent

excessive loads on the actuators. However, we did retain the idea of not loading

any actuator with the weight of another actuator.

In terms of space requirements, the robot needed to house the controller

and sensor circuits. Finally, easy access was required for the controller/sensor

circuits for troubleshooting and upgrade purposes.

3.2.2 Mechanical design & implementation

First, we needed to decide whether the different DOFs should be linear or

rotational. For pointing the arms toward the antennae, either could have been


Figure 3.8: Illustration of the volume control arm kinematics.

used. A rotational DOF can rotate an arm to align it with the antenna. A

linear DOF can translate the arm to align it with the antenna. This latter

mechanism would have required more space for the arm to physically move

around, which would tend to increase the size of the robot. In contrast, for the

rotational DOF, only mounting space for an actuator is required. Therefore, we

selected rotational DOFs for pointing the arms towards the antennae.

Again, for the control (pitch and volume) DOFs for either arm, both rota-

tional and linear options were possible. The volume antenna of the theremin is

in approximately a horizontal plane (Fig. 1.1). As mentioned in Section 1.1,

it is controlled by relative motion orthogonal to this horizontal plane, i.e., by

vertical movements. If a linear DOF for volume control is used, then the vol-

ume arm would have to be directly above the antenna (vertically moving). A

rotational DOF would not have this restriction. A linear arm moving vertically

would make the robot bigger, while rotational DOF would only require mount-

ing space for the actuator. Therefore a rotational DOF was selected for volume

control. The volume arm kinematics are illustrated in Fig. 3.8.


Figure 3.9: Illustration of the pitch control arm kinematics.

For the pitch control DOF of the pitch arm, distance from the antenna

needs to be controlled. Initially, we tried using a rotational DOF, since it was

easier to implement. However, the pitch-position relation is already non-linear.

The angle-position relation is also non-linear, and this introduced an additional

non-linear characteristic to the control trajectory of the arm. To simplify the

kinematics and focus just on the theremin control dynamics, a linear DOF for

the pitch control was selected. The pitch arm kinematics are illustrated in Fig.

3.9.

The next step after DOF selection is the selection of actuators. Originally,

we were using the KIPR CBC V2 robot controller (see Table 3.2). So the

selection was limited to actuators supported by the built-in motor controllers

and power supply of the CBC. This included the SG5010 and the CS-60. Both

are available in servo and motor configurations. Their specifications are listed

in Table 3.1.

Testing with earlier versions of the robot indicated that the motion of the

SG5010 actuators was not as smooth as compared to CS-60 actuators, and


Table 3.1: Performance specifications for actuator options. All specifications are listed for 6 V operation.

Actuator                  CS-60             SG5010
maximum speed (deg/s)     375               429
maximum torque (oz-in)    49                152.8
option                    servo, DC motor   servo, DC motor
position encoder          no                no

they occasionally jittered. The SG5010 actuators did offer higher torque and

speed. This was an important advantage since the weight of the volume arm

had to be supported, and its rotational inertia had to be overcome as fast as

possible. Therefore, for the volume DOFs we selected the higher torque SG5010

actuators. The occasional jitter is acceptable for volume control, as long as

the approximate position is maintained. To maintain position, the easier and

cheaper solution was to use the servo option for the SG5010. For the pitch

arm, the CS-60 was selected for its smoother, more predictable operation. Its

lower torque was acceptable since the weight of the pitch arm did not have to be

supported by either actuator. For the rotational DOF for pointing the pitch arm correctly,

a CS-60 servo was used, as again, this was the easier and cheaper solution to

maintain position. For the linear DOF for pitch control, a servo would have

been easy to control, but there is a lag due to the internal position controller.

Also, in this servo, there is no way of knowing whether the commanded position

has been reached or not. Therefore, for direct control, (rather than merely

controlling the set point for an internal servo feedback controller) we selected

a CS-60 motor. Control with a motor is more accurate since we can use pitch

detection to determine whether the commanded position (for a commanded

pitch) has been reached or not. Also, with direct control, the lag in using a


servo actuator is eliminated. All these actuators are typically used at 6 V, but

they can be safely used at 7.2 V. This improves the performance slightly. We

opted to supply them with 7.2 V during use.

The next step is building the arms. For the pitch arm, we selected a 3.5

mm (0.14 in) carbon fiber tube, which was the lightest reasonably rigid option

available, i.e. it did not flex under its own weight. To support the tube, a

Delrin base with an identically sized groove was used (see Fig. 3.10a). Delrin

was used because it has a very low coefficient of friction. Using the lower torque

CS-60 motor, we needed to minimize friction. To drive the tube back and forth,

a lego wheel and tire system, approximately the same thickness, was used (see

Fig. 3.10a). KIPR motors are available with an attachment for a lego shaft

to be connected. The lego drive wheel was therefore attached to the motor

with a lego axle. The linear motion is illustrated in Fig. 3.10b and 3.10c. An

aluminum foil ‘hand’ was mounted at the end of the arm to act as a grounding

plate and hence control the pitch. The foil was connected to ground. With a

wheel diameter of 1.21 inch and a top motor speed of 375 deg/s[1], the pitch arm

can theoretically travel at 3.96 in/s (at 6 V). The actual speed was later found

out to be about 3.46 in/s (at 7.2 V). This was an average speed from a test

trajectory that included the velocity profiles to accelerate from rest and then

decelerate and come to a stop. The pitch arm travelled a total distance of 4.5

inches in 1.3 s. This actual speed of 3.46 in/s is less than the theoretical 3.96

in/s because friction reduces speed, and the acceleration-deceleration profiles

reduce the total distance travelled as compared to a trajectory where the motor

travels at maximum speed at all times.
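The two speed figures quoted above can be reproduced with a short calculation. This is a sketch that assumes the tire drives the tube without slipping; the function names are illustrative.

```python
import math

WHEEL_DIAMETER_IN = 1.21         # drive wheel diameter from the text
MOTOR_SPEED_DEG_PER_S = 375.0    # CS-60 top speed at 6 V (Table 3.1)


def theoretical_speed_in_per_s(diameter_in, speed_deg_per_s):
    """Linear speed of the tube: wheel circumference times revolutions per second."""
    revolutions_per_s = speed_deg_per_s / 360.0
    return math.pi * diameter_in * revolutions_per_s


def average_speed_in_per_s(distance_in, time_s):
    """Average speed over a complete move, including acceleration and deceleration."""
    return distance_in / time_s


theoretical = theoretical_speed_in_per_s(WHEEL_DIAMETER_IN, MOTOR_SPEED_DEG_PER_S)
measured = average_speed_in_per_s(4.5, 1.3)
```

Here `theoretical` evaluates to about 3.96 in/s and `measured` to about 3.46 in/s, matching the values in the text. Note that the theoretical value is for 6 V operation while the measurement was made at 7.2 V.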

Finally, the pitch arm was mounted on a CS-60 servo to allow for correct

orientation of the arm with reference to the theremin (Fig. 3.10d). The pitch arm length


Figure 3.10: Illustration of pitch arm mechanics. (a) Linear motion mechanism. (b) A forward position. (c) A backward position. (d) Mounted with servo for correctly orienting the arm in the direction of the antenna.


was designed based on the required playing length of about 4 inches, plus 200 %

extra to easily allow for improper positioning (see footnote 13), plus 1 inch safe length to be left

behind the drive wheel axle. Some extra length was also required to allow for

the fact that the axle was mounted away from the front of the robot. With the

axle mounted about 5 inches from the front of the robot, the total arm length

came out to be 18 inches.
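The length budget can be written out as a worked sum. This is one reading of the text rather than a calculation given by the author: it assumes the 200 % allowance is taken on the 4 inch playing length, and the symbol names are introduced here only for illustration.

```latex
\begin{align*}
L_{\text{arm}} &= L_{\text{play}} + 2.00\,L_{\text{play}} + L_{\text{safe}} + L_{\text{axle}} \\
               &= 4 + 8 + 1 + 5 \\
               &= 18 \text{ in}
\end{align*}
```

Under that reading the four contributions add up exactly to the stated 18 inches.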

The volume arm design is more straightforward. A cardboard tube was used

as an arm. This was mounted onto the volume control servo actuator, which

was in turn mounted on the direction control servo actuator (see Fig. 3.11a).

A cardboard tube was easier to attach to the servo horn than a thin carbon

fiber tube. Also, unlike the carbon fiber, the cardboard tube did not flex when

rotated from one end in this configuration. The arm is rotated up in the ‘high’

position for a high, audible volume (Fig. 3.11a) and rotated down close to the

antenna in the ‘low’ position for an almost inaudible volume (Fig. 3.11b). The

length of the tube was selected as follows: when the right side of the robot was

aligned with the theremin, and the robot was 3 inches away from the theremin,

the downward projection of the volume arm, when rotated horizontally, passed

approximately through the center of the circular segment of the antenna. A

distance of 3 inches from the theremin is a typical placement which allows the

robot to be close enough to operate the theremin, but far enough so that the

robot does not collide with any of the plugs connected to the front panel of

the theremin. The design point selected is typically where thereminists place

their hand while controlling the volume. This design point is illustrated in Fig.

3.11c. This resulted in an arm length of 8.5 inches from pivot point to the end.

Footnote 13: So that the robot does not have to be placed at exactly the same relative pose each time. Allowance for improper positioning makes setting up the robot easier.


Figure 3.11: Design and implementation of volume arm. The workspace boundary is illustrated. (a) A typical volume up position. (b) A typical volume down position. (c) Design point for the volume arm.


Figure 3.12: Front view of the robot with the theremin. The workspaces of the pitch control arm (left) and volume control arm (right) are indicated.

The volume arm was tested with the servo configuration, and it could easily lift

the cardboard arm; there was no noticeable difference in the servo performance

with and without the weight of the volume arm.

To summarize the design and implementation for the mechanics, there are 2

arms, one each for pitch and volume control. Each arm has 2 DOF. The pitch arm

has 1 translational DOF for playing and 1 rotational DOF for pointing the arm

in the correct direction prior to playing. The volume arm has 1 rotational DOF

to move between two possible volume positions and another DOF to point the

arm correctly before initiating play. The translational DOF is provided by a

motor, while the other 3 DOF are provided by servo actuators. The workspace

for the pitch arm is limited to part of a horizontal plane (Fig. 3.12 and 3.13).

The workspace of the volume arm spans the surface of a hemisphere (Fig. 3.12

and 3.13).

To fulfill the space requirement, the platform base was designed somewhat

like a table top with arms attached on either side. The base is about as long as

the theremin and wide enough to provide adequate prototyping space.


Figure 3.13: Isometric view of the robot with the theremin. The workspaces of the pitch control arm and volume control arm are indicated.

3.3 Control Scheme

One of the objectives mentioned in Section 1.3 is to implement a control scheme

to demonstrate the robot system’s ability to be programmed to employ a given

control scheme to play a theremin. Therefore, this section presents the devel-

opment of such a control scheme.

3.3.1 Control scheme design requirements

Design requirements for a control system typically include settling time, max-

imum overshoot and a maximum steady state error. The primary concern for

our application is to ensure that humans do not perceive any overshoot, steady

state errors, or very slow transition between musical notes.

As mentioned in Section 3.2.2, it was experimentally determined that the

pitch arm moves at a speed of 3.46 in/s (at 7.2 V) moving a distance of 4.5

inches (the pitch domain) in 1.3 s. This is the limitation of the dynamics of the

system. We also noted in Section 3.1 that the desired pitch domain for playing


spans about 4.5 inches. Therefore the system can move across the entire pitch

domain in about 1.3 s (which includes acceleration and deceleration times, as

mentioned in Section 3.2.2). This is a hardware limit: a transition across all

24 notes of the two playable octaves cannot be faster than this. Rather than a

fixed settling time, the requirement is to reach the desired pitch and

settle on it as quickly as possible.

As discussed in Section 3.1.1, based on human perception ability, a pitch

error of 1 % was selected for the robot system. Therefore, we select 1 % as the

design requirement for maximum steady-state error of the control system.

In playing music, ideally, there should be no perceptible overshoot. Drawing

from the steady-state requirement, we also select 1 % as our design constraint

for the maximum overshoot.

To summarize, the design constraints are as follows: a maximum overshoot

of 1 %, and a maximum steady-state error of 1 %. As for settling time, the

requirement is to reach the desired pitch and settle on it as quickly as possible.

3.3.2 Control scheme design & implementation

As discussed in Chapter 2, a controller can be feedforward, feedback, or a hy-

brid of both. Knowing that the acoustic properties of a theremin do not remain

constant with time, feedforward control will not be useful unless it adapts to the

changing environment, as was eventually done by Mizumoto et al. [23] for their

theremin playing robot. For completeness, we tried using a feedforward con-

trol model using a pitch-position lookup table. But for adequate performance,

measurements for the lookup table had to be taken again every time anything

in the environment changed. This is a cumbersome process. Faced with this


scenario we could have proceeded to develop a feedforward model that adapted

to the changing environment. This was the strategy adopted by Mizumoto et

al. [23]. However, with the potential levels of computation that would have

been involved, and the fact that pitch detection would still have to be used, we

decided not to pursue this approach. Instead, we opted to use pitch detection

for feedback control. Without an elaborate feedforward model to modify at

run-time, we feel that a feedback control approach would be computationally

less demanding.

We propose a classical proportional-derivative (PD) controller, with percent-

age frequency error as feedback. Initially, we tried using linear PD control, but

that only worked for a limited range of pitches. This limitation was expected,

given the non-linear variation of pitch as a function of distance from the an-

tenna, as illustrated in Fig. 1.2. To deal with non-linearity, we tried using a

non-linear PD control where the controller gains were functions of frequency.

These functions were derived from curve fitting the pitch position mapping of

the theremin, as illustrated in Fig. 1.2. This scheme did not perform very well.

One reason was that the pitch domain is slightly different under different en-

vironmental conditions. Another reason was that our feedback error was still

frequency, which is non-linear. To mitigate the dependence on an absolute

term such as frequency error, we tried using a relative error term: percentage

frequency error (relative to target frequency). We started off by testing it with a

simple PD control scheme. Preliminary testing showed promising results. So we

opted to use this strategy without any modifications. Hence, the proposed PD

controller uses percentage frequency error as feedback. The control loop is

given in Algorithm 1.

The above controller generates a power command in terms of percentage


Algorithm 1 PD feedback control

while timeElapsed < timeout do
    error = (fCmd − fCurr)/fCmd ∗ 100
    powerCmd = pGain ∗ error + dGain ∗ (error − errorPrev)/(timeCurr − timePrev)
end while

of the maximum voltage supplied to the motor: -100 represents full voltage in reverse,

and 100 represents full voltage forward. The power command generated by the

controller is offset by ±24 % (depending on the direction) to compensate for

the motor deadband. For example, if the power command is -30 %, it is offset

to -54 %. Any power command beyond ±76 % is saturated to ±100 %.
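The control law and the deadband compensation can be sketched together as follows. The ±24 % offset and the ±76 % saturation limit come from the text; the function names are illustrative, and the handling of a command of exactly zero (left at zero) is an assumption, since the text does not cover that case.

```python
import math

DEADBAND_OFFSET = 24.0    # percent, from the text
SATURATION_LIMIT = 76.0   # percent, from the text


def percent_frequency_error(f_cmd, f_curr):
    """Frequency error relative to the commanded frequency, in percent."""
    return (f_cmd - f_curr) / f_cmd * 100.0


def pd_power_command(error, error_prev, dt, p_gain, d_gain):
    """PD law on the percentage error; dt is the time since the previous sample."""
    return p_gain * error + d_gain * (error - error_prev) / dt


def compensate_deadband(power_cmd):
    """Offset the command past the motor deadband and saturate at +/-100 %."""
    if power_cmd == 0.0:
        return 0.0  # ASSUMPTION: a zero command is not offset in either direction
    if abs(power_cmd) > SATURATION_LIMIT:
        return math.copysign(100.0, power_cmd)
    return power_cmd + math.copysign(DEADBAND_OFFSET, power_cmd)
```

For example, `compensate_deadband(-30.0)` returns -54.0, which matches the worked example in the text. In a loop, the caller would store the current error and timestamp after each iteration so that they serve as `error_prev` and the basis for `dt` on the next pass.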

3.4 Robot Controller

One of the objectives mentioned in Section 1.3 is to have a robot system on

which different control schemes can be tested. This can be met by using a

programmable controller on which different control schemes can be programmed.

This section outlines the selection of a suitable robot controller.

3.4.1 Robot controller performance requirements

The primary requirements for the controller relate to input, output, memory

and programmability. In terms of input, a 5 V TTL serial port is required to

receive data from the pitch sensor. This is based on the output of the pitch

sensor (see Section 3.1.3). In terms of output, control of 3 servos and one

motor is required. This is based on the selection of actuators for the robot

manipulators in Section 3.2.2. In terms of data storage, some way of logging

performance data for later retrieval and analysis is required. This is crucial for a

platform that is to implement different control systems so that their performance


can be compared. In terms of programmability, the input and output data

rates should be customizable. This comes from preliminary testing with the

CBC robot controller in which the input and output data rates are, by design,

fixed at 50 Hz. Even though the pitch sensor could provide data at higher

rates, we were limited by this data rate, and could neither sense nor control

at higher rates. Another programmability requirement is that the controller be

relatively easy to develop on, i.e. the supporting libraries should be available

and relatively reliable/stable.

3.4.2 Available robot controller options

Three options were readily available. The CBC robot controller by the KISS

Institute for Practical Robotics (KIPR) is a previous generation educational

robotics controller. It is a complete solution with built-in motor controllers,

servo control, and servo power supply. It has a 3.3 V serial port (5 V tolerant).

The Link is the latest controller by KIPR. It was released at the time of this

selection (January 2013) and has similar features. The mbed LPC1768 is a

prototyping platform based on a 32-bit microcontroller. It lacks the built-in

motor controllers and servo power supplies. It does have PWM output which

can be used to send control signals to external motor controllers and externally-

powered servos. The relevant features of all three systems are summarized in

Table 3.2.

3.4.3 Selection of robot controller

The CBC V2 is limited to accessing the serial port and controlling a motor at

up to 50 Hz, so it is not suitable for this project. The Link and mbed do not have


Table 3.2: Summary of robot controller performance parameters.

Controllers          CBC V2             Link                 mbed
clock speed (MHz)    350                800                  96
motor control        built-in, 50 Hz    built-in, 300 Hz     none
servo control        PWM outputs        PWM outputs          PWM outputs
servo power          built-in           built-in             none
TTL serial           1, @ 50 Hz         1                    3
voltage level        3.3 V              3.3 V                3.3 V
5 V tolerant         yes                yes                  yes
data-logging         USB stick          USB stick            2 MB onboard
firmware/IDE         stable             under development    stable

these limitations. Motor control in the Link is limited to 300 Hz, but that is

such a high limit that it is acceptable for the purposes of this project. Although

the Link is superior in terms of features, the firmware and native libraries for the

Link were still under development at the time of this evaluation. So there was no

guarantee that all of the proposed features would be available and be reasonably

bug-free. Therefore the somewhat limited but developmentally stable mbed

microcontroller was the choice for this project.

To complete the controller requirement, the VEX Robotics Motor Controller

29 was selected. This is a low cost motor controller which can supply a maximum

4 A of current and can be controlled by a single PWM pin. Later a TTL serial

controlled servo driver module was added for controlling the servos when it was

discovered that the mbed did not support multiple PWM frequency generation.

Servos are controlled at 50-60 Hz which is lower than the PWM frequency used

for the pitch motor control. In the current implementation (see Section 3.6)

control of one volume servo is required for adjusting volume during play.


3.5 Human-Robot Interface

The last specific objective in Section 1.3 is to develop a user interface for enabling

a human to command a music playing task to the robot system. This section

covers the design of such a Human-Robot Interface (HRI).

3.5.1 HRI performance requirements

The only performance requirement for the HRI is that the interface be intuitive

for people outside the realm of computer science and music. Keeping this in

perspective, the easiest way for an individual to command the robot system to

play a musical piece or melody would be to simply play it to the system. This

can be done in the form of a simulated musical instrument, or it can simply be

hummed or whistled to the robot. The latter is a typical and quick method of

exchange between humans when they do not have the name or recording of a

melody. Therefore it was decided that the HRI would accept a musical piece in

the form of whistling.

The HRI is designed as follows. As a human whistles to the system, a

recording is made. This recording is used to generate a musical piece by the

Melody Extraction Scheme (MES) which is outlined in the next chapter. The

generated musical piece is then written to the local file system of the mbed

microcontroller. After the file write operation is complete, the robot receives a

command signal to begin playing.

Partially processed MES results are played back to the user to indicate the

frequencies detected. The HRI is implemented on a laptop running the Windows 7

operating system. The built-in speakers and microphone are used for audio input/output.


Figure 3.14: Human Robot Interface

3.6 Music playing algorithm

A simple music playing scheme was used to play melodies. For each musical

note, the volume arm servo is commanded to raise the volume, and immediately

after that, the pitch feedback control loop is activated for the desired pitch, for

a time duration equal to the note duration. To protect the hardware (see footnote 14), at the

end of the note duration time interval, a stop command is issued to the pitch

motor in case it hasn’t reached its target pitch. Following this, the volume arm

is moved to a low volume position and the pitch control loop is engaged for the

same note for a time duration equal to the note rest duration. Again, a stop

command is issued at the end of the pitch control loop.

To avoid overstressing the volume control servo with very small rest dura-

tions, the algorithm is designed to skip the volume down command if the note

rest duration is less than 150 ms. This time duration was somewhat arbitrarily

Footnote 14: This is to protect from short circuiting the pitch arm with the pitch antenna, which causes the microcontroller to reset. The highest pitch location is about 0.25 inch away from the antenna. A limit switch is not suitable because the “limit” changes depending on the placement of the theremin.


selected after observing the behavior of the servo with several musical pieces.
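The per-note sequence described in this section can be sketched as an ordered list of actions. The action names and the function are hypothetical, introduced only for illustration; the 150 ms threshold is from the text. The sketch also assumes that the pitch control loop still runs during a short rest even when the volume down command is skipped, which the text does not state explicitly.

```python
MIN_REST_FOR_VOLUME_DOWN_MS = 150  # threshold stated in the text


def plan_note_actions(duration_ms, rest_ms):
    """Return the ordered (action, milliseconds) steps for playing one note."""
    actions = [
        ("volume_up", 0),
        ("pitch_control", duration_ms),  # feedback loop runs for the note duration
        ("stop_pitch_motor", 0),         # protective stop at the end of the note
    ]
    if rest_ms >= MIN_REST_FOR_VOLUME_DOWN_MS:
        actions.append(("volume_down", 0))
    # ASSUMPTION: the loop holds the same pitch during the rest in both cases.
    actions.append(("pitch_control", rest_ms))
    actions.append(("stop_pitch_motor", 0))
    return actions
```

A melody would then be played by concatenating the steps for each note in order and dispatching each step to the corresponding servo or motor routine.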


Chapter 4

Melody Extraction Scheme (MES) for Human-Robot Interface

As mentioned in Section 3.5, the human-robot interface (HRI) for the robot

takes as input a human sound, extracts the melody, and transmits it to the

robot controller to play. At the heart of the HRI is the MES which performs

this melody extraction. It starts by using FFT to extract a time-varying pitch

trajectory from the human audio input. This pitch trajectory is then shifted to

the pitch domain of the theremin. Finally, a melody comprising musical notes

is assembled for transmission to the robot controller. These stages are covered

in detail in the following sections.

In the traditional sense, a melody is defined as a data set of musical notes

along with their rest time durations. A musical note is defined as a specific

pitch on the music scale accompanied by its playing duration. This is different

from the note data object used in the MES (see Table 4.1). The note object also


includes rest information. Therefore a melody in this context is a set of note

objects. So although the data structure is different, the information contained

is the same for either definition of the term ‘melody’.

Table 4.1: The note data structure used by the MES

field         data type    definition                             unit
index         float        position on standard musical scale     -
frequency     float        pitch                                  Hz
prime         char         prime                                  -
accidental    char         accidental                             -
octave        int          octave                                 -
duration      int          duration                               ms
rest          int          rest                                   ms
position      int          for manipulator                        -
time          float        timestamp                              s
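As an illustration, the note structure in Table 4.1 could be mirrored by a small record type. The thesis implements the MES in Processing, so this Python version is only a sketch: the field names and units follow the table, while the example values and the reading of `prime` as the note letter are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Note:
    index: float       # position on the standard musical scale
    frequency: float   # pitch in Hz
    prime: str         # single character (ASSUMPTION: the note letter)
    accidental: str    # single character, e.g. '#' or ' '
    octave: int
    duration: int      # playing duration in ms
    rest: int          # rest duration after the note in ms
    position: int      # value for the manipulator
    time: float        # timestamp in s


# A melody in the MES sense is simply an ordered collection of note objects.
example_note = Note(
    index=57.0, frequency=440.0, prime="A", accidental=" ",
    octave=4, duration=500, rest=100, position=0, time=0.0,
)
melody = [example_note]
```

Because each object carries its own rest duration, no separate rest entries are needed in the melody list.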

4.1 MES Performance Requirements

The main requirement is to extract a melody from an audio recording. This is

based on the design of the HRI in Section 3.5.1 where the HRI records audio

input, and calls upon the MES to process it (see Fig. 3.14). Another require-

ment is to ensure that the extracted melody is compatible with the capabilities

of the theremin. From Section 3.1.1 on pitch sensor requirements, we know

that the theremin only consistently covers two full octaves, i.e. octaves 5 and 6.

Therefore, we need to ensure that all extracted melodies fit within this range.

4.2 MES Design & Implementation

The melody extraction scheme (MES) was implemented using the Processing[3]

language. Input and processing of audio was done using the Minim[12] library


which has a number of useful audio related functions. The MES extracts a

melody in 5 stages listed below:

• Detect pitch

• Identify the highest octave

• Delete lower frequencies, and shift the remaining frequencies to the theremin’s playing domain by octave

• Extract set of notes

• Assemble a melody

These stages are outlined in the following subsections.

4.2.1 Pitch detection

As concluded in Section 2.2, to detect pitch, one can use the frequency domain

based FFT, CA, or AC methods, or the time domain based ZCR method. We

have already used ZCR for the pitch detection sensor (Section 3.1.3) because of

its low computational demand (faster response time) and because it is suitable

for use with the theremin’s audio output. We discussed in Section 3.1.1 that

a low computation time was desired because feedback control is sensitive to

the computation time of the pitch detection algorithm. The theremin’s audio

output is relatively noise-free and does not have any overtones. It would be

possible to use the same scheme, or even the pitch sensor itself, for pitch detection in

the MES. However, with a live recording, background noise will always be present,

and there is no guarantee that overtones will not be present. ZCR cannot be

reliably used to detect pitch in such a situation. Furthermore, melody extraction


need not be performed in real time and is hence not sensitive to the computation

time of the pitch detection algorithm.

This leads to consideration of the three frequency domain based methods:

FFT, CA, and AC. While it is quite possible to implement these from scratch us-

ing basic principles, it is a time-consuming process. Fortunately, the Minim[12]

audio library for Processing and Java has a built-in FFT implementation[13].

This is a quick solution to implement a frequency domain based pitch detection

method. Therefore, FFT is the method of choice for pitch detection for the

MES.

FFT is performed on the audio recording made by the HRI. The audio

sampling rate is 44.1 kHz. For the transform, the buffer size is 2048, i.e. 2048

equally sized pitch intervals (or bands) are defined. When an FFT is performed,

exact pitch values cannot be identified. Only the intervals can be identified.

However, the pitch value in the center of each interval is assumed to be the

pitch detected for that interval. In this implementation, for a sampling rate of

44.1 kHz the maximum pitch considered is the Nyquist frequency, i.e. 22.1 kHz.

This gives a pitch resolution of about 21.6 Hz (see footnote 15) for the entire range considered,

i.e., 0 - 22.1 kHz. This means that pitch can be detected to the nearest 21.6 Hz.

For every instantaneous transform, a number of pitches with varying am-

plitudes (or loudness) are detected. It is assumed that the pitch produced by

the user, i.e. from whistling, humming, etc., is much louder than the pitches in

the background noise. Based on this assumption, for every time window, the

loudest pitch (one with the greatest amplitude) is taken to be the one produced

by the user.
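The loudest-pitch rule can be sketched as follows. This is an illustrative Python fragment, not the Minim-based implementation; the function name, the toy spectrum, and the toy sampling rate are all hypothetical, and the band center is again assumed to be the band index times the band width.

```python
def loudest_pitch_hz(amplitudes, sample_rate_hz, buffer_size):
    """Return the center frequency of the band with the greatest amplitude."""
    if not amplitudes:
        raise ValueError("spectrum is empty")
    # Index of the band with the largest amplitude in this time window.
    loudest_band = max(range(len(amplitudes)), key=lambda i: amplitudes[i])
    return loudest_band * sample_rate_hz / buffer_size


# Hypothetical spectrum for one time window: 9 bands from a 16-sample buffer
# at a made-up 16 kHz sampling rate, so each band is 1000 Hz wide.
spectrum = [0.1, 0.2, 0.1, 0.9, 0.3, 0.1, 0.0, 0.0, 0.0]
print(loudest_pitch_hz(spectrum, 16000.0, 16))   # 3000.0
```

Applying the function to every window in turn yields the frequency-time trajectory that the later stages operate on.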

Footnote 15: 2/2048 × 22100 Hz = 21.6 Hz; based on implementation notes from the Minim Manual [13].


4.2.2 Highest octave identification

The highest pitch is identified and its octave is computed for use in the next

stage. While computing the octave, pitch data are rounded off to the near-

est standard pitch on the musical scale. The rounding off procedure used is

explained as follows.

A table of standard musical note pitch values is stored in a 1 dimensional

lookup table. This table is based on the reference A4 (440 Hz). Instead of

searching through the table for the closest reference pitch, one can also use a

mathematical relation. For the standard musical scale, we know that:

$f_n = f_o \times 2^{n/12}$, where

• $f_n$ is the frequency of the note n half steps away from the reference note

• $f_o$ is the frequency of a fixed reference note. Here, it is the pitch of A4,

i.e. 440 Hz

• n is the number of half steps above or below. This is positive for half steps

above the reference note, and negative for half steps below. A half step of

+1 is the next higher note, +2 is two notes higher, and so on.

Rearranging the above and accounting for the fact that A4 is at index 57

(counting from C0) of our reference array, we get this relation to determine the

index from the pitch detected:

$\text{index} = 12 \times \dfrac{\log(f_n/f_o)}{\log 2} + 57.0$

For a non-standard pitch value, the above relation gives a fractional index.

The index is rounded off and the corresponding standard pitch is obtained from

a lookup table. To identify the octave, the rounded index is used in a similar


lookup table which lists the octaves. This pitch rounding process is also

employed in the next stage.
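The rounding procedure can be sketched as below. This is a minimal Python illustration, not the thesis code: the function names are hypothetical, and the standard pitch and octave are computed directly from the index instead of being read from the lookup tables described above (the two approaches should agree for a table built from the A4 = 440 Hz reference).

```python
import math

A4_HZ = 440.0     # reference pitch from the text
A4_INDEX = 57     # index of A4 when counting half steps from C0


def fractional_index(freq_hz):
    """Index of a pitch on the half-step scale; fractional for non-standard pitches."""
    return 12.0 * math.log(freq_hz / A4_HZ) / math.log(2.0) + A4_INDEX


def nearest_note_index(freq_hz):
    """Round the fractional index to the nearest standard note."""
    return int(round(fractional_index(freq_hz)))


def standard_pitch_hz(index):
    """Standard pitch for an index (stands in for the pitch lookup table)."""
    return A4_HZ * 2.0 ** ((index - A4_INDEX) / 12.0)


def octave_of(index):
    """Octave number for an index (stands in for the octave lookup table)."""
    return index // 12


index = nearest_note_index(530.0)   # a slightly sharp C5
print(index, octave_of(index))      # 60 5
```

For example, a detected pitch of 530 Hz gives a fractional index of about 60.2, which rounds to index 60, i.e. C5 at roughly 523.25 Hz in octave 5.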

4.2.3 Domain reduction and shifting

Ideally, to be playable on the theremin, frequencies in a musical piece must fall

within octaves 5 and 6. As concluded from Section 3.1.1 these are the only

two full octaves consistently available. However, human whistling may contain

frequencies above octave 6, may lie entirely below octave 6 or even below

octave 5, and may span more than two octaves. This calls for a

domain reduction and frequency shift so that they may be played on octaves 5

and 6.

The first step is the domain reduction step. We can select two consecutive

octaves of frequencies. If we assume that the background noise is at a lower

frequency than the frequency of the user’s whistling, then we can select the

highest two octaves. This assumption should hold true, as long as the user

is not in a high frequency noise environment like an orchestra or a jet engine

testbed. As for the frequencies in the lower octaves, we could either ignore

them, or shift them to these highest two octaves. Shifting the lower octave

frequencies to higher octaves can create interesting and often confusing acoustic

effects. Also, all the noise would make its way into the musical piece, unless

some method of distinguishing between noise and music is defined. If we assume

that the typical whistled musical piece does not span more than two octaves,

then we can just ignore the lower octaves. Given that whistling is only a single

‘musical instrument’ and not all people are capable of whistling complex musical

pieces spanning 3 or more octaves, this assumption should hold true in most

Figure 4.1: Domain reduction by retaining the highest 2 octaves. The highest pitch is in octave 7. The domain is reduced to pitches from octaves 7 and 6 (highlighted in blue).

cases. Ignoring the lower octaves also takes care of the low frequency noise.

Therefore, we choose to retain the top two octaves and ignore the rest of the

frequencies. This is illustrated in Fig. 4.1.

The next step is the shifting step which is quite straightforward. We know

the highest octave recorded and we know that frequencies must be adjusted so

that the highest octave is 6. Instead of multiplying or dividing by a factor of 2,

we simply increment/decrement the frequency index (defined in Section 4.2.2)

by multiples of 12 (see footnote 16). There are 12 notes in each octave. This implies that the

indices of corresponding notes in consecutive octaves differ by 12. For example,

Footnote 16: The formula used is: index = index + 12*(6 - highest note octave)

Figure 4.2: MES pitch shifting. Pitches from octaves 7 and 6 (Fig. 4.1) are shifted to octaves 6 and 5 respectively, i.e., the theremin’s pitch domain.

for the pitches in Fig. 4.1, the shifted pitches would be as shown in Fig. 4.2.

As a result of the domain reduction step, frequencies for all data points below the

second highest octave are set to zero (see Fig. 4.1). An instant with a frequency

of zero represents a rest instant in the musical piece. This domain reduction

ensures that input data span two octaves to match the two octaves available

on the theremin. It also eliminates the low amplitude background noise which

makes its way into the recording during rest periods. The frequencies are then

shifted so that the highest frequency is in octave 6, which is the highest complete

octave available on the theremin (Fig. 4.2). During this step, frequency data are

rounded off to the nearest standard frequency according to the special rounding


off procedure defined in Section 4.2.2.
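The two steps of this section can be sketched together. The Python fragment below is illustrative only: the function name and the sample data are hypothetical, it assumes every time window already carries a rounded note index (Section 4.2.2), and it computes standard pitches directly rather than through a lookup table.

```python
A4_HZ = 440.0
A4_INDEX = 57
TARGET_TOP_OCTAVE = 6   # highest complete octave available on the theremin


def reduce_and_shift(note_indices):
    """Keep the top two octaves and shift them so the highest octave is 6.

    Returns one frequency in Hz per input index; 0.0 marks a rest.
    """
    if not note_indices:
        return []
    highest_octave = max(note_indices) // 12
    shift = 12 * (TARGET_TOP_OCTAVE - highest_octave)
    shifted_hz = []
    for index in note_indices:
        if index // 12 < highest_octave - 1:
            # Domain reduction: below the second highest octave becomes a rest.
            shifted_hz.append(0.0)
        else:
            # Shifting: move the index by whole octaves, then look up the pitch.
            shifted_hz.append(A4_HZ * 2.0 ** ((index + shift - A4_INDEX) / 12.0))
    return shifted_hz


# Hypothetical input: low noise (C3), then C7, A6, C6, and a stray C5.
result = reduce_and_shift([36, 84, 81, 72, 60])
print([round(f, 2) for f in result])   # [0.0, 1046.5, 880.0, 523.25, 0.0]
```

In this example the highest note is in octave 7, so octaves 7 and 6 are retained and shifted down by 12 half steps, while the notes in octaves 3 and 5 become rests.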

4.2.4 Note extraction

Note extraction is based on the idea that for any given note, pitch detection will

detect the same frequency for as long as the note is being played. Therefore,

data points which are consecutive in time belong to the same note if their

frequencies are the same.

Accordingly, discrete data points are grouped into notes. This is achieved by

assigning consecutive data points with the same frequency to the same group.

Therefore, there is one note for one group of consecutive frequency-time data

points. Each note consists of a frequency field and a duration field (see Table

4.1). This greatly reduces the number of data points representing the frequency-

time trajectory. The result is a number of notes with zero and non-zero fre-

quencies.
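Grouping consecutive data points of equal frequency is a run-length encoding. A minimal Python sketch follows; the dictionary layout, the function name, and the window duration of 0.25 s are assumptions chosen for readability, not values from the thesis.

```python
from itertools import groupby


def extract_notes(frequencies_hz, window_s):
    """Group consecutive equal frequencies into notes with a duration."""
    notes = []
    for freq, run in groupby(frequencies_hz):
        count = sum(1 for _ in run)          # number of windows in this run
        notes.append({"frequency": freq, "duration": count * window_s})
    return notes


# Hypothetical trajectory: two windows of C5, one rest, three windows of D5.
trajectory = [523.25, 523.25, 0.0, 587.33, 587.33, 587.33]
for note in extract_notes(trajectory, 0.25):
    print(note)
# {'frequency': 523.25, 'duration': 0.5}
# {'frequency': 0.0, 'duration': 0.25}
# {'frequency': 587.33, 'duration': 0.75}
```

Six data points collapse into three notes here, which illustrates the reduction in data described above.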

4.2.5 Melody extraction

The last stage is to recognize the note rest durations. Every zero frequency note

is in fact the rest period after the previous note. Each note also has a rest field.

Therefore, the duration field value of each zero-frequency note is assigned to

the rest field of the note immediately before it. The zero-frequency notes are

then discarded.
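The rest assignment can be sketched as below. This is a hedged Python illustration with a hypothetical function name and note layout; in particular, dropping a zero-frequency note that has no preceding note (silence at the very start of a recording) is an assumption of this sketch, since the text does not say how that case is handled.

```python
def attach_rests(notes):
    """Fold each zero-frequency note into the rest field of the note before it."""
    melody = []
    for note in notes:
        if note["frequency"] == 0.0:
            if melody:
                # Rest period belongs to the immediately preceding note.
                melody[-1]["rest"] += note["duration"]
            # Assumption: leading silence has no preceding note and is dropped.
        else:
            melody.append({"frequency": note["frequency"],
                           "duration": note["duration"],
                           "rest": 0.0})
    return melody


# Hypothetical notes as produced by the note extraction stage.
notes_in = [
    {"frequency": 0.0, "duration": 0.25},
    {"frequency": 523.25, "duration": 0.5},
    {"frequency": 0.0, "duration": 0.25},
    {"frequency": 587.33, "duration": 0.75},
]
melody = attach_rests(notes_in)
print(len(melody))          # 2
print(melody[0]["rest"])    # 0.25
```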


Chapter 5

Testing and Performance

The control system and MES were put through a series of tests to evaluate

their design effectiveness. First, the control system’s response to various input

commands was checked. This was followed by a repeat of a subset of these tests

but with a disturbing body in the presence of the theremin. Subsequently, the

MES was tested with input from pure tones, human whistling, and piano keys.

Finally, the overall system was tested with a whistled musical piece as input.

This chapter presents all these tests and their results.

5.1 Testing of Feedback Control Scheme

First, the PD feedback control system was tested. Frequency trajectories were

provided in the melody input format defined in Chapter 4. Two tests were

performed.

Figure 5.1: A typical step command test. Step commands were issued across the entire playing domain, starting from the lowest frequency, up to the highest frequency, and then back down to the lowest frequency. This figure shows the trajectory for 4-note steps, i.e. 4 notes skipped per step command. [Plot: frequency (Hz) versus time (s); traces: command, actual.]

5.1.1 Response to step commands

The commanded frequency was changed in steps starting from the lowest playing

frequency of 523.25 Hz (C5) up to the highest playing frequency of 1975.53 Hz

(B6) and then back down again (Fig. 5.1). For each step, the controller was given 1 s

to respond. For reference, the design requirements decided in Section 3.3.1 called

for a maximum steady-state error of 1 %. This was the error used to determine

when the controller had settled. This test was performed for commanded steps

corresponding to 1, 2, 4, 8 and 16 note transitions. A note transition of 1

corresponds to all notes being commanded, a transition of 2 corresponds to

every second note commanded, and so on. The mean response times are shown

in Fig. 5.2. As expected, the settling time increases for larger steps. The

response time of the 16-note step could not be recorded because the controller

Figure 5.2: Average settling times for commanded frequency steps with 1, 2, 4, and 8 note transitions per step. As expected, the settling time increases for larger steps. The controller was unable to complete its response for the 16-note step commands within the given time of 1 s. [Plot: average settling time (s) versus number of note transitions per step.]

could not reach the commanded frequency within the allotted 1 second (per

step) time limit for this test. It was also noted that the maximum overshoot

remained below the 1% threshold decided in Section 3.3.1.
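One way to read a settling time off a recorded step response is sketched below in Python. The thesis states only that the 1 % steady-state error band was used to decide when the controller had settled; the exact procedure, the function name, and the sample data here are assumptions.

```python
def settling_time(times_s, freqs_hz, command_hz, tolerance=0.01):
    """Time from the first sample until the response stays within the band.

    Returns None if the response never settles within the recorded data.
    """
    settled_at = None
    for t, f in zip(times_s, freqs_hz):
        if abs(f - command_hz) <= tolerance * command_hz:
            if settled_at is None:
                settled_at = t        # entered the band
        else:
            settled_at = None         # left the band again, so not yet settled
    if settled_at is None:
        return None
    return settled_at - times_s[0]


# Hypothetical response to a step command of 587.33 Hz (D5).
times = [0.0, 0.1, 0.2, 0.3, 0.4]
freqs = [523.0, 560.0, 585.0, 587.0, 587.5]
print(settling_time(times, freqs, 587.33))   # 0.2
```

A return value of None corresponds to the 16-note case above, where the controller did not reach the commanded frequency within the allotted time.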

The controller seems to speed up for bigger note transitions. Fig. 5.3 shows

the settling time per note, which indicates that the average time spent per note

transition actually decreases for larger step sizes. This is expected, given the

control law. Also, a qualitative look at the data seemed to indicate that the

controller responded faster when stepping from a higher frequency down to a

lower frequency. However, the response time of the controller was different at

different frequencies, so not much could be deduced without further analysis.

Figure 5.3: Settling time per note transition. This indicates that the average time spent per note transition decreases for larger step sizes, i.e., the controller responds faster for larger step commands. [Plot: average settling time per note transition (s/note) versus number of note transitions per step.]

5.1.2 Effect of environmental disturbance

This test was used to observe how the control system was affected by distur-

bances. A 1-note step test similar to the one in Fig. 5.1 was used with a fixed

disturbance (see footnote 17) within range of the antenna.

A disturbance affects the pitch produced as follows. Section 1.1 described

the use of heterodyning frequencies from two LC oscillators to produce pitch.

A theremin’s pitch antenna acts as a capacitor for one of the two oscillators.

A disturbance in the vicinity of the antenna increases the capacitance of the

antenna. This increase in capacitance changes the oscillator frequency such

that the difference in frequencies of the two oscillators is increased. This in

turn increases the pitch produced by the theremin. Additional disturbances,

Footnote 17: This disturbance was in the form of a circular aluminum plate (13 inch diameter, 1/16 inch thick) placed approximately 14 inches away from the antenna, facing the antenna.

e.g. a pitch controlling arm, tend to add to the capacitance and hence increase

the pitch even more. Skeldon et al. [37] provide a mathematical treatment of

the physics of the disturbance and its relation to pitch produced.
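The qualitative effect described above can be illustrated with an idealized model. The Python sketch below is not taken from the thesis or from Skeldon et al.: it assumes two ideal LC oscillators that are matched when nothing is near the antenna, treats the disturbance as extra capacitance added to one of them, and uses hypothetical component values.

```python
import math


def lc_frequency_hz(inductance_h, capacitance_f):
    """Resonant frequency of an ideal LC oscillator."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance_h * capacitance_f))


def beat_frequency_hz(antenna_capacitance_f,
                      inductance_h=1e-3,           # hypothetical value
                      tank_capacitance_f=100e-12):  # hypothetical value
    """Difference between the fixed and the antenna-loaded oscillator."""
    fixed = lc_frequency_hz(inductance_h, tank_capacitance_f)
    variable = lc_frequency_hz(inductance_h,
                               tank_capacitance_f + antenna_capacitance_f)
    return fixed - variable


# More capacitance near the antenna gives a larger difference, i.e. higher pitch.
small = beat_frequency_hz(0.1e-12)
large = beat_frequency_hz(0.2e-12)
print(large > small > 0.0)   # True
```

Under these assumptions a second disturbance, such as the pitch arm, adds further capacitance and raises the pitch again, consistent with the description above.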

As shown previously in Fig. 1.2 a disturbance “stretches” out the pitch

domain in physical space. The physical distance between pitch positions in

space increases. Consequently, the time taken for the robot arm to move between

pitch positions should also increase. As expected, the 1-note step test showed

that the mean settling time increased from 0.1920 s to 0.2414 s.

5.2 Testing of Melody Extraction Scheme

This section covers tests conducted for the MES. Live audio was provided

through the HRI. These tests were intended to demonstrate that the MES was

functional and to investigate possible limitations.

5.2.1 Pure tone

Audio with a set of pure frequencies (no overtones) was provided using a fre-

quency generator application from a cell phone. Although this was an ideal

input without noise, it tested the basic functionality of converting pitch input

to a melody. The input frequencies and the output are shown in Fig. 5.4.

5.2.2 Whistling

A section of Pachelbel’s Canon was input in the form of whistling. This is a

typical input that the MES should expect and this test can demonstrate if the

MES can process such an input. Results are shown in Fig. 5.5.

Figure 5.4: Plot of pure tones as processed by the MES. Audio input to the MES (red) is used to extract a melody (blue). The audio input is represented by a frequency-time trajectory comprising non-zero frequencies. The melody comprises a series of pitch values in time marking the start or end of a note. This melody can then be transmitted to the robot to be played on the theremin. The jump in pitch at 3.8 s is due to a pitch detected at that point. [Plot: frequency (Hz) and music scale (C5 to C7) versus time (s); traces: pitch detected, extracted melody.]

5.2.3 Piano recording

A simulated piano application was used to play the notes C4, D4, E4, F4, G4,

A4, B4, C5, and then B4 through C4. This was intended to test the MES beyond

its design domain for another potentially useful form of input. Notes from a

piano comprise multiple overtones. In contrast, the MES was not designed for

pitch input with overtones. Results are shown in Fig. 5.6. Overtones in the

audio input result in erroneous recognition of pitch. This is a limitation of the

pitch detection algorithm (Section 4.2.1). However, based on the erroneously

Figure 5.5: Plot of melody extracted from human whistling. Audio input to the MES (red) is used to extract a melody (blue). Note that only the highest two octaves (5 and 6) are retained in the extracted melody. The audio input is represented by a frequency-time trajectory comprising non-zero frequencies. The melody comprises a series of pitch values in time marking the start or end of a note. [Plot: frequency (Hz) and music scale (C3 to C6) versus time (s).]

detected pitches, the note extraction takes place as expected.

5.3 Testing of Overall System

Finally, the complete system was tested. Part of the folk song Brahms’ Lullaby

was whistled to the HRI. This was processed by the MES (Fig. 5.7) and the

resulting melody was transmitted to the control system to play (Fig. 5.8).

In Fig. 5.7, the pitch detected (red) does not correspond to exact notes on

Figure 5.6: Plot of notes from a piano as recognized by the MES. Notes played were C4, D4, E4, F4, G4, A4, B4, C5, and then B4 through C4. Overtones in the audio input result in erroneous recognition of pitch (red). This pitch information is then correctly used to extract a melody (blue). The highest frequency is in octave 7. This is adjusted to octave 6. All the other frequencies are adjusted/discarded accordingly. The time scales have been aligned after the domain was reduced to 2 octaves. [Plot: frequency (Hz) and music scale (C5 to C7) versus time (s).]

the 12-note musical scale. The extracted melody (blue) discretizes the pitch

detected to the nearest standard pitch on the 12-note musical scale. The MES

recognizes and stores only the start and end points of each note. This way, fewer

data points can be used to represent the melody as compared to the number of

data points used to represent the pitch detected.

The pitch trajectory for the melody is shown in Fig. 5.8a. For very short

duration notes, the commanded pitch (red) is rarely achieved by the robot’s


pitch arm (blue). For longer duration notes, the commanded pitch is typically

achieved with no noticeable overshoot. This is more apparent in Fig. 5.8b which

shows the percentage error with time for the pitch trajectory. The overshoot for

the longer duration notes typically remains under 1%. Fig. 5.8b also shows that

once the actual pitch settles down at the commanded pitch, the error remains

below 1%.
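The percentage error plotted in Fig. 5.8b can be computed as sketched below. This Python fragment is illustrative: the function names and sample values are hypothetical, and the sign convention (positive when the actual pitch is above the commanded pitch) is an assumption, since the text specifies only a percentage difference.

```python
def percent_error(actual_hz, command_hz):
    """Signed error of the actual pitch relative to the commanded pitch, in %."""
    return 100.0 * (actual_hz - command_hz) / command_hz


def within_requirement(actual_hz, command_hz, limit_percent=1.0):
    """True if the error magnitude meets the 1 % design requirement."""
    return abs(percent_error(actual_hz, command_hz)) <= limit_percent


# Hypothetical samples around a commanded C6 (1046.50 Hz).
print(within_requirement(1050.0, 1046.5))   # True
print(within_requirement(1100.0, 1046.5))   # False
```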

Figure 5.7: Results of full system test with a whistling input of Brahms’ Lullaby. Pitch detected by the MES (red) is used to extract a melody (blue). Note how the extracted melody is a discretized representation of the input pitch detected. The rising note between 7 and 8 s is discretized to 4 different standard pitches. Pitch is discretized to the nearest note on the 12-note musical scale. The MES recognizes and stores only the start and end of each note. Ideally, this, together with discretization, saves memory. This saving on memory is best illustrated by the two notes between 8 and 10 s. [Plot: frequency (Hz) and music scale (C6 to C7) versus time (s).]

(a) Trajectory. [Plot: frequency (Hz) and music scale (C6 to C7) versus time (s); traces: command, actual.]

(b) Error. Percentage difference between commanded and actual pitch. [Plot: frequency error (%) versus time (s).]

Figure 5.8: Control system performance results for full system test with a whistling input of Brahms’ Lullaby.

Chapter 6

Conclusion

The goal of this work was to create a theremin playing robot that could be used

to test various control schemes, was easy for a user to program musically, and

that was inexpensive enough to be used for educational purposes. Section 1.3

provides the objectives in detail. Based on these objectives, we developed a

programmable robot system comprising a pitch sensor, a set of robot manipulators

to play a theremin, and a human robot interaction system to

command the robot (Chapters 3 and 4). A control scheme was programmed to

demonstrate that the robot is indeed programmable. Evaluation of some of the

subsystems was covered in Chapters 3 and 5. This chapter presents a discus-

sion drawing from the initial objectives and subsequent evaluations as to how

the objectives were achieved or not achieved. This is followed by a conclusion

summarizing the outcome of the discussion, and an outline of possible future

work for this project.


6.1 Discussion

The first primary objective in Section 1.3 was to develop a set of robotic ma-

nipulators to play the theremin. To that end, two sets of manipulators were

developed. The first set (Fig. 3.7) was quick to respond, but difficult to control

owing to backlash, hysteresis, and nonlinearity in kinematics. It also had the

tendency to excessively load the actuators. Therefore, a second set of manipula-

tors (Fig. 3.13) was developed with simpler and more linear kinematics. It did

not overload the actuators, and was much easier to control. However, it seemed

that this was not as fast as the earlier version of the manipulators. Nevertheless,

the robot was able to reach and control the antennae of the theremin, and play

commanded musical pieces. Therefore, this objective was successfully achieved.

The second objective in Section 1.3 was to develop a pitch detection system

for feedback control with an accuracy of at least 0.5 % (Section 3.1.1). A

pitch sensor was thus designed. It was successfully able to process the audio

signal of the theremin. Testing results from Section 3.1.3 indicated a mean

absolute error of 0.1 %. This is very accurate pitch information, given that

pitch discrimination ability in 90 % of non musicians is 1%. Therefore, our

second objective was also achieved.

The third objective in Section 1.3 was to develop a robot system which can

be used to test different control schemes. The selection of a programmable robot

controller (Section 3.4) along with development of pitch sensor and theremin

playing manipulators was a step towards achieving this objective.

The fourth objective in Section 1.3 was to implement a control scheme to

demonstrate that the previous three objectives have been met: manipulators

to play the theremin, sensor to detect pitch, and a system on which a control


scheme can be implemented/programmed. To fulfill it, a PD control scheme

was implemented. It was proposed in Section 3.3.1 that it should be able to

perform with maximum settling time of 1.1 s, which was the time taken to move

from the lowest to highest pitch. Design requirements for the control system

included a maximum overshoot of 1 %, and a maximum steady-state error of 1

%.

A step response test in Section 5.1 indicated mean settling times of 0.19

s to 0.75 s for step sizes ranging from a single note to eight notes. Although

the controller was able to meet the objectives of 1 % or less overshoot and steady-state

error, no conclusion can be drawn for the settling time. We observed a faster

response time for larger note steps, and faster response times when stepping

down from a higher frequency to a lower frequency. Also, the response time of

the controller was different at different frequencies. All these factors, coupled

with the fact that different musical pieces require different controller response

times for different note transitions, suggest that much more needs to be done to

analyze and benchmark the responsiveness of the controller. Nevertheless, we

were able to implement a control scheme. While trying to develop a suitable

controller, we also tried some other control schemes (see Section 3.3.2). This

fulfills the fourth objective of using a control scheme to verify that we have a

platform on which we can test control schemes. In addition, we have developed

some simple tools to evaluate controller performance.

The fifth objective from Section 1.3 was to develop a user interface for com-

manding a music playing task to the robot system. The HRI and MES were

developed to achieve this objective. The HRI was designed so that users would

find it easy to command the robot by whistling. We have not established whether

the interface is easy to use, but we do have an interface for commanding the robot.


The MES was designed to work very well with input

involving pure frequencies. This capability was demonstrated in the pure tone

test in Section 5.2. The whistling input tests in Sections 5.2 and 5.3 showed

output melodies fairly close to the input. This was partly true for the piano

input in Section 5.2. The pitch trajectory recognized by the FFT stage resulted

in a very similar melody output. However, the FFT stage itself performed very

poorly due to the strong overtones present in the piano notes.

Finally, the domain reduction stage of the MES was designed to eliminate

the low amplitude noise during note rest periods. The pitch detection graphs

in Sections 5.2 and 5.3 show low frequencies during these periods. From the actual audio recordings,

we know that these are low amplitude background noise. They are clearly

eliminated in each of the MES output graphs in Sections 5.2 and 5.3.

Therefore, the fifth objective of developing a user interface was also achieved.

Whistling was not the best choice of input given a general reluctance of users to whistle, but

it worked as it was expected to. Overall, all 5 of the primary objectives from

Section 1.3 were achieved.

6.2 Conclusion

A robot system was developed to fulfill 5 specific objectives (see Section 1.3).

All of them were achieved. We now have a thereminist robot system which can

serve as a testbed for evaluating different motor control schemes. It can accept

human playing commands, it can record the theremin’s music output, and it

can give a musical performance based on any control scheme which we choose

to program it with.

We have implemented a control scheme for demonstration purposes and it


performs reasonably well. We tried looking at past work on human motor control

for inspiration. We learned that in general, human motor control involves both

feedback and feedforward, often for the same movement. Due to the inherent

difficulties in developing a feedforward model for the theremin, we adopted a

feedback control scheme only.

Figure 6.1: Thereminist robot system.

6.3 Future Work

At this stage, we have a device that can be used to explore control schemes for

robots in a musical environment. This can eventually be used to explore some

theories in human motor control for continuous pitch musical instruments (see footnote 18)

such as theremins and trombones. A starting point is to make humans play the

theremin and observe their frequency-time trajectories. This leads to several

research questions. What can we learn about human motor control algorithms

from observing human musical performance on the theremin? Does being a

Footnote 18: Musical instruments capable of producing continuously varying pitch. Such instruments are not limited to playing notes at discrete levels.

musician play a role in this performance? Can performance on a theremin

be related to performance on a trombone? What does it take to develop a

control scheme whose response is similar to the response of human motor control

in a musical environment? Based on human perception, how does one define

similarity to human motor control? Can a thereminist robot be used to do pitch

training for non-musicians?

Another area to investigate is how professional thereminists coordinate mul-

tiple antenna control while playing the theremin. Do they move to a new note

while the volume is low, or do they do it as the volume is going up again? Is

that feedback, feedforward, or hybrid control? In the process of investigating

the above questions, it may also be useful to measure EMG and EEG (see footnote 19) activity.

These can provide some information on muscle and brain activity. By observing

which regions of the brain are active during different parts of the musical task,

we might be able to learn more about the motor control scheme(s) used.

Other than that, the current system could use some improvements in the

software as well as hardware side. The FFT stage of the MES could be changed

to a smarter algorithm that can deal with overtones. More whistled notes can

be retained by the MES by shifting the highest whistled pitch to the highest

theremin pitch, rather than just using the highest octave. We could also add a

different form of user input for the HRI; something along the lines of a simulated

musical instrument (see footnote 20). At this point the robot has additional DOFs to point

the arms at the antenna, but currently this pointing action is not automated.

We could automate this pointing action by using a camera or proximity sensors.

Footnote 19: Electroencephalography (EEG) is the recording of electrical activity along the scalp to monitor brain activity.

Footnote 20: A graphical user interface or a physical replica that directly generates audio signals suitable for the MES.

Moreover, a laptop is currently required for interaction with the robot. A shift

to the Link robot controller (which is a lot more stable now) can enable us to

do away with the laptop and make the robot system more portable. The Link

has a touch screen for interaction and we just have to enable it to perform audio

processing like the Minim library for Processing/Java.

On the hardware side of things, we noted that the pitch manipulator was

limited in its response time. An improved design with a faster response time

could be something to work on in the future. One direction for achieving this

is to use an arm with multiple degrees of freedom similar to those employed

by humans playing the theremin. Something like a wrist or even fingers could

be added, so that the control movement is divided across different degrees of

freedom. These additional degrees of freedom may or may not be translational

like the pitch arm. This could allow a faster response time using the same motor

as currently being used.


Bibliography

[1] Hobbico CS-60 Servo Specifications and Reviews. http://www.servodatabase.com/servo/hobbico/cs-60.

[2] Nao Key Features - Audio. http://www.aldebaran-robotics.com/en/Discover-NAO/Key-Features/audio.html.

[3] Processing. http://www.processing.org/.

[4] A. Alford, S. Northrup, K. Kawamura, K-W. Chan, and J. Barile. Music Playing Robot. In Proceedings of the International Conference on Field and Service Robotics (FSR ’99), pages 174–178, Pittsburgh, PA, August 1999.

[5] Atmel. AVR205: Frequency Measurement Made Easy with Atmel tinyAVR and Atmel megaAVR. http://www.atmel.com/Images/doc8365.pdf, February 2011. Application Note, rev. A.

[6] Max Baars. The theremin - How it works. In Thereminvox, Milano, Italy, December 2004. Thereminvox.

[7] Alyssa M. Batula and Youngmoo E. Kim. Development of a Mini-Humanoid Pianist. In IEEE-RAS International Conference on Humanoid Robots, pages 192–197, Nashville, TN, USA, December 2010.

[8] Justin Cohen, Masataka Niwa, Robert W. Lindeman, Haruo Noma, Yasuyuki Yanagida, and Kenichi Hosaka. A Closed-Loop Tactor Frequency Control System for Vibrotactile Feedback. In (Interactive Poster) Extended Abstracts, ACM CHI 2005, pages 1296–1299, Portland, Oregon, USA, April 2005.

[9] Kristina Dell. Vinyl Gets Its Groove Back. Time, 171(3):55, Jan 21 2008.

[10] Laurent Demany and Catherine Semal. The Role of Memory in Auditory Perception. In William A. Yost, Arthur N. Popper, and Richard R. Fay, editors, Auditory Perception of Sound Sources, volume 29 of Springer Handbook of Auditory Research, pages 77–113. Springer US, 2007.


[11] Urko Esnaola and Tim Smithers. MiReLa: A Musical Robot. In Proceedings of the IEEE International Symposium on Computational Intelligence in Robotics and Automation, pages 67–72, 2005.

[12] Damien Di Fede. Minim. http://code.compartmental.net/tools/minim/.

[13] Damien Di Fede. Minim Manual: FFT. http://code.compartmental.net/tools/minim/manual-fft/.

[14] David Gerhard. Pitch Extraction and Fundamental Frequency: History and Current Techniques. Technical Report TR-CS 2003-06, University of Regina, Regina, Saskatchewan, Canada, 2003.

[15] M. Ghafouri and A. G. Feldman. The timing of control signals underlying fast point-to-point arm movements. Experimental Brain Research, 137:411–423, 2001.

[16] Tony James and Abi Grogan. Back in the groove. Engineering Technology, 6(11):50–53, December 2011.

[17] Angelica Lim, Takeshi Mizumoto, Louis-Kenzo Cahier, Takuma Otsuka, Toru Takahashi, Kazunori Komatani, Tetsuya Ogata, and Hiroshi G. Okuno. Robot musical accompaniment: Integrating audio and visual cues for real-time synchronization with a human flutist. In Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on, pages 1964–1969, 2010.

[18] Christine L. MacKenzie. Motor skill in music performance: Comments on Sidnell. Psychomusicology: A Journal of Research in Music Cognition, 6(1-2):25–28, 1986.

[19] MadLab. Schematics and Source Code for Digital Frequency Counter. http://www.apogeekits.com/counter_article.htm, 2012.

[20] F. Marmel, B. Tillmann, and W.J. Dowling. Tonal expectations influence pitch perception. Perception & Psychophysics, 70(5):841–852, 2008.

[21] Carmen Bachiller Martín, Jorge Sastre Martínez, Amelia Ricchiuti, Héctor Esteban González, and Carlos Hernández Franco. Study of the Interference Affecting the Performance of the Theremin. International Journal of Antennas and Propagation, 2012, 2012.

[22] Takeshi Mizumoto. Re: Question on thereminist robot. Personal e-mail, March 2012.


[23] Takeshi Mizumoto, Toru Takahashi, Tetsuya Ogata, and Hiroshi G. Okuno. Adaptive pitch control for robot thereminist using unscented Kalman filter. In Wei Ding, He Jiang, Moonis Ali, and Mingchu Li, editors, Modern Advances in Intelligent Systems and Tools, volume 431 of Studies in Computational Intelligence, pages 19–24. Springer Berlin Heidelberg, 2012.

[24] Takeshi Mizumoto, Hiroshi Tsujino, Toru Takahashi, Tetsuya Ogata, and Hiroshi G. Okuno. Thereminist Robot: Development of a Robot Theremin Player with Feedforward and Feedback Arm Control based on a Theremin’s Pitch Model. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2297–2302, October 2009.

[25] Mitsuhiro Nakamura and Hideyuki Sawada. Talking Robot and the Analysis of Autonomous Voice Acquisition. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Beijing, China, October 2006.

[26] C. Minos Niu, Daniel M. Corcos, and Mark B. Shapiro. Temporal Shift From Velocity to Position Proprioceptive Feedback Control During Reaching Movements. Journal of Neurophysiology, 104(5):2512–2522, 2010.

[27] Hirohisa Ohta, Hiroshi Akita, and Motomu Ohtani. The Development of an Automatic Bagpipe Playing Device. In Proceedings of the International Computer Music Conference, pages 430–431, Tokyo, Japan, 1993.

[28] Brian Passey. Vinyl records spin back into vogue–2010 was top year for record sales since 1991. USA Today, 15, 2011.

[29] Klaus Petersen, Jorge Solis, and Atsuo Takanishi. Development of a Aural Real-Time Rhythmical and Harmonic Tracking to Enable the Musical Interaction with the Waseda Flutist Robot. In Intelligent Robots and Systems, 2009. IROS 2009. IEEE/RSJ International Conference on, pages 2303–2308, 2009.

[30] Robert M. Poss. Distortion Is Truth. Leonardo Music Journal, 8:45–48, 1998.

[31] Lawrence R. Rabiner, Michael J. Cheng, Aaron E. Rosenberg, and Carol A. McGonegal. A Comparative Performance Study of Several Pitch Detection Algorithms. IEEE Transactions on Acoustics, Speech and Signal Processing, 24:399–418, October 1976.

[32] Kourosh Rahnamai, Brian Cox, and Kevin Gorman. Fuzzy Automatic Guitar Tuner. In Annual Meeting of the North American Fuzzy Information Processing Society, pages 195–199, San Diego, CA, June 2007.

[33] Curtis Roads. The Tsukuba Musical Robot. Computer Music Journal, 10(2):39–43, 1986.

[34] J.C. Rothwell, M.M. Traub, and C.D. Marsden. Automatic and Voluntary Responses Compensating for Disturbances of Human Thumb Movements. Brain Research, 248(1):33–41, 1982.

[35] Masatsugu Sakajiri, Kenryu Nakamura, Satoshi Fukushima, Shigeki Miyoshi, and Tohru Ifukube. Voice Pitch Control Ability of Hearing Persons With or Without Tactile Feedback Using a Two-Dimensional Tactile Display System. In 2011 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 1069–1073, Anchorage, AK, October 2011.

[36] Koji Shibuya, Shoji Matsuda, and Akira Takahara. Toward Developing a Violin Playing Robot - Bowing by Anthropomorphic Robot Arm and Sound Analysis. In Robot and Human interactive Communication, 2007. RO-MAN 2007. The 16th IEEE International Symposium on, pages 763–768, 2007.

[37] Kenneth D. Skeldon, Lindsay M. Reid, Viviene McInally, Brendan Dougan, and Craig Fulton. Physics of the Theremin. American Journal of Physics, 66(11):945, November 1998.

[38] Thomas D. Snyder and Sally A. Dillow. Digest of Education Statistics 2010, April 2011.

[39] Jorge Solis, Klaus Petersen, Tetsuro Yamamoto, Masaki Takeuchi, Shimpei Ishikawa, Atsuo Takanishi, and Kunimatsu Hashimoto. Implementation of an Overblowing Correction Controller and the Proposal of a Quantitative Assessment of the Sounds Pitch for the Anthropomorphic Saxophonist Robot WAS-2. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1943–1948, Taipei, Taiwan, October 2010.

[40] Jorge Solis, Koichi Taniguchi, Takeshi Ninomiya, Tetsuro Yamamoto, and Atsuo Takanishi. Development of Waseda flutist robot WF-4RIV: Implementation of auditory feedback system. In IEEE International Conference on Robotics and Automation, pages 3654–3659, Pasadena, CA, USA, May 2008.

[41] Kenji Suzuki, Takeshi Ohashi, and Shuji Hashimoto. Interactive Multimodal Mobile Robot for Musical Performance. In Proceedings of the International Computer Music Conference, pages 407–410, Beijing, China, October 1999.

[42] Kenji Suzuki, Keishiro Tabe, and Shuji Hashimoto. A Mobile Robot Platform for Music and Dance Performance. In Proceedings of the International Computer Music Conference, Berlin, Germany, August 2000.

[43] Kenji Suzuki, Yoichiro Taki, Hisaki Konagaya, Pitoyo Hartono, and Shuji Hashimoto. Machine Listening for Autonomous Musical Performance Systems. In Proceedings of the International Computer Music Conference, ICMA, pages 61–64, San Francisco, September 2002.

[44] Shogo Takahashi, Kenji Suzuki, Hideyuki Sawada, and Shuji Hashimoto. Music Creation from Moving Image and Environmental Sound. In Proceedings of the International Computer Music Conference, Beijing, China, October 1999.

[45] Yoichiro Taki, Kenji Suzuki, and Shuji Hashimoto. Real-time Initiative Exchange Algorithm for Interactive Music System. In Proceedings of the International Computer Music Conference, Berlin, Germany, August 2000.

[46] Mari Tervaniemi. Musical Sound Processing in the Human Brain. Annals of the New York Academy of Sciences, 930(1):259–272, 2001.

[47] Mari Tervaniemi, Viola Just, Stefan Koelsch, Andreas Widmann, and Erich Schröger. Pitch discrimination accuracy in musicians vs nonmusicians: an event-related potential and behavioral study. Experimental Brain Research, 161(1):1–10, 2005.

[48] WJ Wadman, JJ Denier Van der Gon, RH Geuze, and CR Mol. Control of fast goal-directed arm movements. Journal of Human Movement Studies, 5(1):3–17, 1979.

[49] Yan Wu, Polake Kuvinichkul, Peter Y.K. Cheung, and Yiannis Demiris. Towards Anthropomorphic Robot Thereminist. In Proceedings of the IEEE International Conference on Robotics and Biomimetics, pages 235–240, 2010.

[50] Emily Chivers Yochim and Megan Biddinger. ‘It kind of gives you that vintage feel’: vinyl records and the trope of death. Media, Culture & Society, 30(2):183–195, 2008.

[51] Kazuyoshi Yoshii, Kazuhiro Nakadai, Toyotaka Torii, Yuji Hasegawa, Hiroshi Tsujino, Kazunori Komatani, Tetsuya Ogata, and Hiroshi G. Okuno. A Biped Robot that Keeps Steps in Time with Musical Beats while Listening to Music with Its Own Ears. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1743–1750, San Diego, CA, 2007.

[52] Robert J. Zatorre, Joyce L. Chen, and Virginia B. Penhune. When the brain plays music: auditory-motor interactions in music perception and production. Nature Reviews Neuroscience, 8:547–558, July 2007.
