
Expert Systems with Applications 41 (2014) 2098–2106

Automatic detection of musicians' ancillary gestures based on video analysis

Rodrigo A. Seger a, Marcelo M. Wanderley b, Alessandro L. Koerich a,c,*

a Federal University of Paraná, Department of Electrical Engineering, Centro Politécnico, CP 19011, Curitiba, PR 81531-970, Brazil
b McGill University, Centre for Interdisciplinary Research in Music Media and Technology, Schulich School of Music, 555 Sherbrooke West H3A 1E3, Montreal, QC, Canada
c Pontifical Catholic University of Paraná, Postgraduate Program in Computer Science, R. Imaculada Conceição, 1155, Curitiba, PR 80215-901, Brazil

* Corresponding author at: Pontifical Catholic University of Paraná, Postgraduate Program in Computer Science, R. Imaculada Conceição, 1155, Curitiba, PR 80215-901, Brazil. Tel.: +55 41 3271 1669; fax: +55 41 3271 2121.
E-mail addresses: [email protected] (R.A. Seger), [email protected] (M.M. Wanderley), [email protected], [email protected] (A.L. Koerich).

0957-4174/$ - see front matter © 2013 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.eswa.2013.09.009

Article info

Keywords: Gestures; Music expression; Video analysis; Image processing

Abstract

A novel approach for the detection of ancillary gestures produced by clarinetists during musical performances is presented in this paper. Ancillary gestures, also known as non-obvious or accompanist gestures, are produced spontaneously by musicians during their performances; they do not have a direct meaning in sound, but they help in the creation of music. The proposed approach consists of detecting, segmenting and tracking points of interest and parts of the musician's body in video scenes, to further analyze whether the movement associated with these points of interest or body parts could be related to ancillary gestures. In particular, we tackle the problem of detecting the three most commonly seen ancillary gestures of this class of musicians: the clarinet bell moving up and down, bending of the knees, and shoulder curvature. In this paper we show that the optical flow algorithm, used to track a point of interest at the bottom of the clarinet bell, and the projection profile algorithm, used to analyze the knee and shoulder regions, are effective in detecting ancillary movements related to the clarinet, knee movement and body curvature, respectively. These techniques were evaluated with respect to precision and recall in detecting ancillary gestures on 12,423 video frames of nine recordings of clarinetists' performances made in a studio. The experimental results have shown that the precision in detecting ancillary gestures varies between 78.4% and 92.8%, while the recall varies between 85.3% and 95.5%. These results also imply that any further analysis of the videos by specialists could focus on fewer than 500 frames, which represents a reduction of more than 99% in the workload.


1. Introduction

Music is a performing art, and part of the quality of the music experience comes from the interaction between the player and the instrument and the relationship with the sound that is produced (Goldstein, 1998; Godoy & Jensenius, 2009). According to Mitra and Acharya (2007), gestures are expressive, meaningful body motions involving physical movements of the fingers, hands, arms, head, face, or body with the intent of conveying meaningful information or interacting with the environment. However, gestures are examined in various scientific fields, such as human–human communication, human–computer interaction and music interaction, and each field attaches a different meaning to the word gesture and uses its own terminology to define it (Sotirios & Georgios, 2009).

The study of gesture in music is a vast research field. Delalande (1988) proposed a three-tiered classification of gesture: effective gestures, accompanying gestures and figurative gestures. Effective gestures are those that play a direct role in the production of sound; accompanist gestures, or ancillary gestures, refer to body movements that are not involved in the production of sound, such as shoulder or head movements; figurative gestures are sonic gestures that affect the produced sound but have no direct correspondence to physical movement. If we consider a specific musical instrument such as the clarinet, previous work has shown that ancillary gestures occur frequently in performances (Wanderley, 1999). For a clarinetist, effective gestures are movements that actually produce sound, such as the fingers pressing keys on the clarinet. The ancillary, spontaneous or non-obvious gestures are the movements that accompany the process of music production, like movements of the shoulders and head. Finally, the figurative gestures are those that are perceived through the sound itself rather than through movements; they are usually generated by changes in note articulation or melodic variation (Wanderley, 1999; Wanderley, 2002).

Gesture recognition is an example of multidisciplinary research where different approaches such as computer vision, pattern recognition, and machine learning have been used (Aggarwal & Cai, 1999; Gavrila, 1999; Mitra & Acharya, 2007; Dahl & Friberg, 2004; Davidson, 1993). Techniques such as feature extraction, object detection, and classification have also been found to be effective. Regardless of the approach, any practical implementation of gesture recognition typically requires the use of different imaging or tracking devices (Mitra & Acharya, 2007).

The development of methods based on video analysis has been growing steadily due to their broad applicability in real life (Aggarwal & Cai, 1999; Davidson, 1993; Gavrila, 1999; Dahl & Friberg, 2004; Fernandez-Caballero, Castillo, & Rodriguez-Sanchez, 2012; Lamberti & Camastra, 2012; Ruiz-Lozano, Medina, Delgado, & Castro, 2012). The detection of ancillary gestures of music performers based on video analysis is a multidisciplinary research subject related to music, computer science and psychology. The terms accompanist gestures and ancillary gestures have been used interchangeably to designate those gestures that are part of a performance but are not produced for the purpose of sound generation (Wanderley, Vines, Middleton, McKay, & Hatch, 2005). Although there is no consensus as to why accompanist gestures are performed, it is clear that they are present in the technique of highly skilled performers (Wanderley, 1999). In the universe of clarinetists, these gestures are basically: the complete up/down clarinet bell movement, the complete circle clarinet bell movement, moving the head up/down, moving the shoulders up/down, the back curling movement, the flapping arms movement, the bending waist movement, the bending knees movement, the stepping feet movement, and weight shifting to the left or right (Wanderley et al., 2005). Among these gestures, some appear more frequently. This is the case of the complete up/down clarinet bell, back curling, bending knees and weight shifting.

Currently, these gestures are evaluated through visual observation of videos recorded in controlled environments (Wanderley et al., 2005; Verfaille, Quek, & Wanderley, 2006). A person needs to watch the video and label the frame intervals where the movements are present. Such videos are also synchronized with the data gathered by motion capture sensors, which are attached to the musician's body. Since the motion capture sensors usually produce a large amount of data, the analysis of such data is constrained to the intervals where the movements were observed in the videos. However, such a process is time-consuming because it requires the manual labeling of the video frames. Besides, the process of labeling videos is very subjective and depends on the personal judgment of the person watching the video. Furthermore, some gestures of the musician can be very subtle and very brief, which makes them difficult for humans watching the videos to detect. Another drawback of such an approach is that it is necessary to attach eight to ten optical markers to the musician's body. For a clarinetist, markers are typically connected to the head, shoulder, elbow, hip and knee of the performer, as well as to the mouthpiece and bell of the clarinet. These markers and their cables usually interfere with the performer's freedom of movement.

The detection of musicians' gestures using computer vision is a somewhat overlooked research subject, regardless of the type of musical instrument, and very few works can be found in the literature. Burns and Wanderley (2006) introduce a method to visually detect and recognize fingering gestures of the left hand of a guitarist. The method was developed following preliminary manual and automated analyses of video recordings of a guitarist. Tsagaris, Manitsaris, Dimitropoulos, and Manitsaris (2011) present a computer vision methodology for the recognition of finger gestures performed on a virtual piano, which is able to recognize the gestures of all five fingers simultaneously. Scale and rotation invariance techniques are integrated into the system to increase the recognition quality and reduce the processing time. Shuai, Maeda, and Takahashi (2012) propose an interactive music generation system based on the music conductor's hand motions detected by a 3D motion capture device. The computer generates the music automatically, and the music is then arranged according to the human conductor's gestures. However, all these works only present a proof of concept or prototype, without any objective evaluation of gesture detection performance.

In this paper we present a novel approach based on computer vision techniques for detecting, segmenting and tracking points of interest and parts of the musician's body in video scenes. The points of interest and body parts are directly related to the three most commonly seen ancillary gestures of clarinetists: the clarinet bell moving up and down, bending of the knees, and shoulder curvature. The optical flow algorithm is employed to track a point of interest placed at the bottom of the clarinet bell, while the knee and shoulder regions of the musician's body are analyzed using a projection profile algorithm. The experimental results on a set of nine videos of clarinetists' performances have shown that precision up to 92.8% and recall up to 95.5% can be achieved in detecting ancillary gestures. These values of precision and recall suggest a significant speed-up of any further analysis by a specialist, since it could be carried out on fewer than 500 out of the 12,423 video frames of the current dataset.

The research discussed here is part of a continuing effort to quantify and analyze the gestures of musicians in order to understand their origins and communicative utility (Wanderley et al., 2005). However, the scope of this paper is constrained to the automatic detection of ancillary gestures in video scenes of several advanced clarinetists playing standard concert solo repertoire, which is a crucial step to allow further, refined analysis only of the video segments where such gestures are present.

This paper is organized as follows. Section 2 presents the proposed approach for the automatic detection of specific ancillary gestures, which is made up of pre-processing, detection of clarinet bell up/down movements, detection of bending knees, and detection of back curling. Section 3 presents the experimental results on the detection of the ancillary gestures on nine video sequences of four clarinetists. Finally, the conclusions and perspectives for future work are stated in the last section.

2. Proposed approach

In this section we outline our approach to detect three types of ancillary gestures that are usually performed by clarinetists during their performances. Fig. 1 presents an overview of the proposed approach. The first step consists in detecting and segmenting the objects of interest, namely the musician and the musical instrument, from the other elements of the scene. As input we have the video frames as well as an image that contains only the background of the scene. This pre-processing step is made up of background subtraction, definition of a region of interest (ROI) in the video frames, thresholding and filtering. The output of this step is a binary video frame, which contains only the musician and the musical instrument (black pixels) on a blank background (white pixels). Furthermore, the centroid of the musician is also computed at this step, since it is required by some of the subsequent processing algorithms. Next, we have three independent steps whose aim is to detect the three ancillary gestures of interest: clarinet bell up/down movement, bending knees, and back curling. The output of each of these three detection steps is a label (yes or no) assigned to each video frame to indicate whether the respective ancillary gesture is present in that frame. The details of all these steps are presented as follows.

Fig. 1. An overview of the proposed approach for detecting ancillary gestures of clarinetists, including the following steps: pre-processing, detection of clarinet bell up/down movement, bending knees and back curling.

2.1. Pre-processing

The main aim of the pre-processing step is to segment the musician and the musical instrument from the rest of the elements present in the scene, as well as to enhance the quality of the segmented image. Separating objects of interest from the background is a common procedure in image processing. In recent years several methods have been proposed to perform background elimination (Stauffer & Grimson, 1999; Oliver, Rosario, & Pentland, 2000; Lo & Velastin, 2001; Elgammal, Duraiswami, Harwood, & Davis, 2002; Kumar, Sengupta, & Lee, 2002; Butler, Sridharan, & Bove, 2003; Cucchiara, Grana, Piccardi, & Prati, 2003; Piccardi & Jan, 2004; Seki, Wada, Fujiwara, & Sumi, 2003).

Stauffer and Grimson (1999) modeled the values of all pixels of a scene as Gaussian distributions; Gaussian models of all background pixels substitute for the background. However, this method requires assigning each newly observed pixel value to the best-matching distribution and estimating the updated model parameters. Oliver et al. (2000) presented an eigen-background approach based on eigenvalue decomposition of the whole image. Lo and Velastin (2001) proposed a temporal filter which computes the median value of the last n frames and uses it as the background model. Cucchiara et al. (2003) also employed a temporal median filter, where subsampled frames and the time of the last computed median value are used to calculate the median values. Elgammal et al. (2002) presented a non-parametric model based on kernel density estimation which employs the last n background values to model the background distribution. Kumar et al. (2002) proposed modeling the pixel intensity with a normal distribution: if a pixel intensity of the current frame is in a region defined by the mean and variance of its background model, the pixel is declared to be a background pixel. Butler et al. (2003) modeled each pixel by a group of clusters, where each cluster consists of a weight and an average pixel value and represents either the foreground or the background. Seki et al. (2003) introduced a method based on the co-occurrence of image variations, where the image is partitioned into a set of blocks so that a component vector is treated instead of pixel values. Piccardi and Jan (2004) employed a sequential kernel density approximation with some computational optimizations to mitigate the computational drawback of mean-shift vector techniques. All these methods that average models or sampled images over time are somewhat robust to variations in illumination and changes in the background, though they are time-consuming.

However, if we assume that there is no variation in the background, a simple background subtraction technique can be employed: the difference between the current video frame and a reference image. This assumption holds in our case, as the background is static and the musician and the musical instrument present slow and slight movements without translation within the scene. The technique chosen to segment the musician and the musical instrument is a weighted background subtraction (Lee, Lizarraga, Gomes, & Koerich, 1997; Karasulu & Korukoglu, 2012). This procedure was chosen based on an analysis of the video dataset that we are dealing with. The videos were recorded in a studio where the illumination is controlled. Furthermore, the scenario is simple and static: only the musician moves, while the camera and the objects present in the scene remain static. The background subtraction requires a reference image to be subtracted from the video frames. Since everything but the musician and the musical instrument must be eliminated, a video frame containing only the background, without the musician and the musical instrument, is the ideal reference image. In practice, at the beginning of each video there are some frames with an empty scenario; therefore the background image is chosen from among these video frames.

Given the background image, denoted as b(i,j), and a current video frame, denoted as f_t(i,j), the resulting image, denoted as f'_t(i,j), is obtained by Eq. (1).

$$f'_t(i,j) = f_t(i,j) - b(i,j) \qquad (1)$$

where $1 \le i \le M$ and $1 \le j \le N$ are the pixel coordinates, t is the frame index, and M and N are the number of rows and columns of the image, respectively.

A usual method to speed up and improve the analysis is to focus on local information by zoning. Zones can be defined as portions of equal size, or non-proportionally, in which case the zones may not cover the entire space and may overlap. Human experts usually define a zoning strategy based on domain knowledge. The musicians spend almost the entire length of the recording in a central region, because their movements are constrained by the sheet music tripod stand: the musician must stand neither so close that the clarinet touches the tripod stand nor so far away that reading the sheet music is impaired. Therefore, a simple heuristic rule which splits the horizontal dimension of the video frames into four slices and eliminates the first and last quarters is enough to delimit the ROI, as shown in Fig. 2. In practice, all further operations are carried out only within the ROI, which is delimited by changing the lower and upper bounds of the vertical image index j to $N/4 \le j \le 3N/4$ in Eq. (1).

Fig. 2. Image splitting into four zones and the delimitation of the ROI within the two central zones.

Even though both the background image and the video frames come from the same video, there may be small changes in the pixel values of the background over time. Therefore, a thresholding operation is applied to the image resulting from the subtraction to regularize pixels with intermediate values. Such an operation reduces the intermediate gray levels resulting from the background subtraction process to either black or white, so that the black pixels represent the musician and his clarinet while the background is entirely white. The thresholding operation assigns either the value 0 or 1 to each pixel, depending on whether its current value is lower or greater than a threshold value Th, as denoted in Eq. (2).

$$f''_t(i,j) = \begin{cases} 0 & \text{if } f'_t(i,j) \le Th \\ 1 & \text{otherwise} \end{cases} \qquad (2)$$

The opening and closing morphological operators are used to eliminate any remaining noise and to smooth the contours of the binary image. The opening operation is the erosion of f''_t(i,j) by a structuring element w(k,l) followed by dilation by the same structuring element, as denoted in Eq. (3). The 3 × 3 structuring element used in the morphological operations is shown in Table I. The structuring element has its origin at the center pixel; it is shifted over the image and, at each pixel, its elements are compared with the set of underlying pixels. If the two sets of elements match the condition defined by the set operator, the pixel underneath the origin of the structuring element is set to a pre-defined value. The opening operation smooths the contour of the image, breaks narrow passages, and eliminates narrow protrusions.

Table I. 3 × 3 structuring element used in the opening and closing morphological operations.

1 1 1
1 1 1
1 1 1

$$f''_t(i,j) \circ w(k,l) = \left[ f''_t(i,j) \ominus w(k,l) \right] \oplus w(k,l) \qquad (3)$$

where $\oplus$ and $\ominus$ denote the dilation and erosion operations, respectively, and $\circ$ is the opening operator.

Let g(i,j) be the image resulting from the opening operation. The closing operation is the dilation of g(i,j) by the structuring element w(k,l) followed by erosion by the same structuring element, as denoted in Eq. (4). The closing operation smooths the contour of the image, fuses narrow breaks and long thin gulfs, eliminates small holes, and fills gaps in the contour.

$$g_t(i,j) \bullet w(k,l) = \left[ g_t(i,j) \oplus w(k,l) \right] \ominus w(k,l) \qquad (4)$$

where $\bullet$ is the closing operator.

Fig. 3 illustrates the pre-processing steps. First, the video frames (Fig. 3(a)) and the background image (Fig. 3(b)) are converted to grayscale with eight-bit depth, since color does not provide any discriminant information. Then, the background reference image (Fig. 3(c)) is subtracted from all other frames of the video, resulting in an image containing only the musician and the clarinet (Fig. 3(d)). The next step is to delimit the ROI, which encompasses the musician and the clarinet (Fig. 3(e)). Finally, after the binarization and the morphological operations, the musician and the clarinet are segmented (Fig. 3(f)).
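To make the pipeline concrete, the following is a minimal sketch of the pre-processing step in Python with OpenCV. The threshold value TH is an illustrative assumption (the paper computes per-video thresholds and uses a weighted subtraction whose weights are not given; an absolute difference is used here as a simple, sign-free variant).

```python
import cv2
import numpy as np

TH = 30                              # illustrative threshold for Eq. (2), assumed
KERNEL = np.ones((3, 3), np.uint8)   # 3 x 3 structuring element of ones (Table I)

def preprocess(frame_bgr, background_gray):
    """Segment the musician/clarinet inside the ROI of a single frame.

    Returns a binary image where the musician is black (0) and the
    background is white (255), as in Fig. 3(f).
    """
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)

    # Eq. (1): subtraction against the background reference image.
    diff = cv2.absdiff(gray, background_gray)

    # ROI of Fig. 2: keep only the two central vertical slices.
    n = diff.shape[1]
    roi = diff[:, n // 4: 3 * n // 4]

    # Eq. (2): pixels that differ strongly from the background are foreground.
    _, mask = cv2.threshold(roi, TH, 255, cv2.THRESH_BINARY)  # musician = 255

    # Eqs. (3)-(4): opening removes speckle noise, closing fills small holes.
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, KERNEL)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, KERNEL)

    # Invert so the musician is black on white, matching the text's convention.
    return cv2.bitwise_not(mask)
```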

The further processing steps require relating the movement of the clarinet, or the movement of some parts of the musician's body, to a reference point attached to the musician's own body. To this end, the centroid of the musician is chosen as the reference point; it is computed by Eqs. (5) and (6) for each video frame. Fig. 4 illustrates the centroid of the musician.

$$c_x = \frac{1}{n_p}\sum_{i=1}^{M}\sum_{j=1}^{N} i \qquad (5)$$

$$c_y = \frac{1}{n_p}\sum_{i=1}^{M}\sum_{j=1}^{N} j \qquad (6)$$

where $c_x$ and $c_y$ are the centroid coordinates and $n_p$ is the total number of black pixels. It is worth noticing that only the i and j coordinates of the black pixels are used to compute the centroid of the musician by Eqs. (5) and (6), while the white pixels are discarded.
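The centroid computation reduces to averaging the coordinates of the black pixels. A minimal NumPy sketch, continuing the previous one and assuming the usual row/column image axes (which may differ from the paper's index convention):

```python
def centroid(binary_roi):
    """Centroid (cx, cy) of the black (musician) pixels, Eqs. (5)-(6).

    White pixels are discarded; cx is the horizontal (column) coordinate
    and cy the vertical (row) coordinate.
    """
    rows, cols = np.nonzero(binary_roi == 0)  # coordinates of black pixels only
    n_p = rows.size                           # n_p: total number of black pixels
    return cols.sum() / n_p, rows.sum() / n_p
```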

2.2. Detection of clarinet bell up/down

The ancillary gesture of lifting up/putting down the clarinet is one of the most common and evident gestures of a clarinet player. The detection of the clarinet movement is interactive and starts with an intervention of the user at the very first frames of the video to place a virtual marker at the bottom edge of the clarinet bell, as shown in Fig. 4. Once the virtual marker is placed, its horizontal and vertical coordinates, denoted as $v_x$ and $v_y$ respectively, are retained and tracked through the subsequent frames of the video using the classical Lucas–Kanade differential method for optical flow estimation (Lucas & Kanade, 1981). The Lucas–Kanade method assumes that the displacement of the virtual marker between two consecutive frames is small and approximately constant within a neighborhood of the pixel related to the virtual marker. Therefore, the algorithm provides an estimate of the $(v_x, v_y)$ coordinates for all subsequent frames of the video.
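OpenCV ships a pyramidal implementation of the Lucas–Kanade tracker, so the marker propagation can be sketched as follows; the window size and pyramid depth are common defaults, not values reported in the paper.

```python
# Parameters for cv2.calcOpticalFlowPyrLK; illustrative defaults.
LK_PARAMS = dict(winSize=(15, 15), maxLevel=2,
                 criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT,
                           10, 0.03))

def track_marker(prev_gray, next_gray, marker_xy):
    """Propagate the virtual marker (vx, vy) from one grayscale frame to the next."""
    p0 = np.array([[marker_xy]], dtype=np.float32)          # shape (1, 1, 2)
    p1, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, next_gray, p0, None, **LK_PARAMS)
    if not status[0][0]:
        return marker_xy          # tracking lost: keep the last known position
    return tuple(p1[0][0])        # estimated (vx, vy) in the new frame
```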

However, only watching whether the coordinates of the virtual marker have changed between consecutive frames is not enough to indicate whether there is a movement of the clarinet, because the displacement of the virtual marker can also be produced by the musician's own motion in the scene. Even though we have assumed that the musician's motion is small and constrained by the presence of the sheet music tripod stand, such motion can be confused with the clarinet movement. Therefore, to minimize the impact of the musician's motion on the clarinet movement, the difference between the coordinates of the virtual marker and the coordinates of the musician centroid, denoted as $(d_x, d_y)$, is computed by Eq. (7) for each video frame.

Fig. 3. Pre-processing steps: (a) current video frame, (b) background image, (c) grayscale video frame, (d) image resulting from the subtraction between the current video frame and the background image, (e) segmentation of the region of interest, (f) final frame after binarization and morphological operations.

Fig. 4. The virtual marker placed at the clarinet bell and the centroid of the musician.

$$(d_x, d_y) = (v_x, v_y) - (c_x, c_y) \qquad (7)$$

Fig. 5 shows the trajectory of the virtual marker, denoted as $(v_x, v_y)$, and the trajectory of the musician centroid, denoted as $(c_x, c_y)$. The horizontal dimension of the graph refers to the video frame index, while the vertical dimension refers to the variation in number of pixels. Fig. 5 also shows the curve related to the difference between the coordinates of the virtual marker and the musician centroid, denoted as $(d_x, d_y)$, which is computed by Eq. (7), as well as the ground truth for the video, which is deliberately overlapped with the $(d_x, d_y)$ curve. The ground truth curve has only two levels: the high level indicates the frame intervals where the ancillary gesture of the clarinet bell is present, while the low level indicates the absence of such a gesture. The comparison between the $(d_x, d_y)$ curve and the ground truth suggests that the largest differences between the positions of the virtual marker and the centroid, given by the positive peaks of the $(d_x, d_y)$ curve, are related to the ancillary gesture of the clarinet bell up/down.

We have adopted a simple peak detector based on a single threshold value to catch the peaks of $(d_x, d_y)$. Such a threshold value, denoted as $Th_c$, is computed by taking the mean value plus the standard deviation of the difference between the coordinates of the marker and the coordinates of the centroid, that is, $Th_c = (\bar{d}_x + \sigma_{d_x}, \bar{d}_y + \sigma_{d_y})$, where $\bar{d}_x, \bar{d}_y$ and $\sigma_{d_x}, \sigma_{d_y}$ denote the mean and standard deviation values, respectively. Therefore, the decision whether the ancillary gesture is present or not is given by Eq. (8). The curve resulting from the thresholding, denoted as peaks of $(d_x, d_y)$, is also presented in Fig. 5.

$$\text{clarinet movement} = \begin{cases} \text{Yes} & \text{if } (d_x, d_y) > Th_c \\ \text{No} & \text{otherwise} \end{cases} \qquad (8)$$

Fig. 5. Clarinet bell up/down ancillary gesture analysis for a sample video.
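Assuming the per-frame marker and centroid positions have been collected into arrays, the peak detector of Eq. (8) can be sketched as below. Since Eq. (8) compares the vector $(d_x, d_y)$ with the vector threshold $Th_c$ without stating how the two coordinates are combined, this sketch flags a frame when either coordinate exceeds its threshold; that choice is an assumption.

```python
def detect_clarinet_movement(markers, centroids):
    """Per-frame yes/no labels for the clarinet bell up/down gesture.

    markers, centroids: arrays of shape (T, 2) with (x, y) per frame.
    """
    d = np.asarray(markers, float) - np.asarray(centroids, float)  # Eq. (7)
    thc = d.mean(axis=0) + d.std(axis=0)    # Thc = mean + std, per coordinate
    # Eq. (8): label 'Yes' where the difference exceeds the threshold
    # (either coordinate; the paper does not specify the combination rule).
    return np.any(d > thc, axis=1)          # boolean array of length T
```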


Fig. 6. ROI for bending knees detection.


2.3. Detection of bending knees

The detection of the bending knees ancillary gesture requires a different strategy, since it is not possible to assign and track virtual markers in this case. Therefore, we chose to compute the horizontal projection profile (Koerich, Sabourin, & Suen, 2005; Koerich & Kalva, 2005). Let g'(i,j) be a binary image with M rows and N columns; the horizontal projection profile is defined as the sum of white pixels parallel to the x axis between the left border of the ROI and the edge of the musician. The horizontal projection profile is represented by a vector $PP_h$, as defined by Eq. (9). However, the horizontal projection profile is computed only for the pixels within the ROI, so the lower and upper limits are N/4 and 3N/4, respectively.

Furthermore, it is possible to limit the area where the projection profile is computed to the bottom half of the ROI. This is possible using the domain knowledge that the musician is approximately at the middle of the frame and his knees are guaranteed to always appear in the bottom half of the video frame (Fig. 6). Therefore, $PP_h[j]$ is computed for $M/2 \le j \le M$. Fig. 7(a) shows a sample of the horizontal projection profile taken at the bottom half of the ROI.

$$PP_h[j]_t = \sum_{i=N/4}^{3N/4} g'_t(i,j) \qquad (9)$$

The horizontal projection profile is computed for each frame, and differences between the values of consecutive frames are evidence that a movement has happened. A reduction in the pixel counts indicates intervals of bending knees (Fig. 7(b)). However, looking only at the differences of the projection profile between frames is not enough to characterize a bending knee, since such differences may also be due to the motion of the musician in the scene. Again, it is necessary to relate such a measure to the centroid of the musician. Therefore, for each video frame we compute the mean value of the horizontal projection profile, denoted as $\overline{PP}_h$. $\overline{PP}_h$ decreases when the knees are bent or when the musician moves towards the sheet music tripod stand. The decision is then taken by also considering the vertical position of the musician centroid, as shown in Eq. (10).

Fig. 7. Vertical projection profile for bending knees ancillary gesture detection: (a) straight knees; (b) bending knees.

$$\text{bending knees} = \begin{cases} \text{Yes} & \text{if } \overline{PP}_h < Th_k \text{ and } c_y < Th_y \\ \text{No} & \text{otherwise} \end{cases} \qquad (10)$$

where $Th_k$ and $Th_y$ are thresholds below which we consider a bending knee movement.
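A sketch of Eqs. (9)-(10) on the binary ROI produced by the pre-processing step. The thresholds thk and thy are the per-video values discussed in Section 3; how they are calibrated from the first frames is not detailed in the paper, so here they are plain parameters.

```python
def bending_knees(binary_roi, cy, thk, thy):
    """Decide whether the current frame shows a bending knees gesture."""
    m = binary_roi.shape[0]
    bottom = binary_roi[m // 2:, :]    # knees always lie in the bottom half

    # Eq. (9): per row, count the white pixels between the left ROI border
    # and the first black (musician) pixel.
    is_black = bottom == 0
    pp_h = np.where(is_black.any(axis=1),
                    is_black.argmax(axis=1),   # column of the first black pixel
                    bottom.shape[1])           # no musician pixel in this row

    # Eq. (10): a low mean profile together with a low centroid indicates
    # bent knees rather than a step towards the music stand.
    return pp_h.mean() < thk and cy < thy
```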

2.4. Detection of back curling

The last ancillary gesture that we are interested in detecting is the back curling movement. For this purpose we have also employed the projection profile (Koerich et al., 2005; Koerich & Kalva, 2005). First, the bounding box is computed, that is, the smallest enclosing box within which all the points of the musician and instrument lie. In the case of the clarinetists, the arms or the clarinet bell, the head and the feet of the musician delimit such a box. Since the region of interest is the back of the musician, which is always above the musician centroid, we can use the centroid to constrain the vertical dimension of the bounding box, as shown in Fig. 8. Next, the top vertical projection profile is computed within this constrained bounding box, as shown in Fig. 9(a). The back curling movement can then be summarized by the variation of the area of the vertical projection profile delimited by the head of the musician and his shoulder (the border of the bounding box), where transitions from small to large areas suggest movement.

Two waypoints in the top vertical projection profile are used to compute the variation of such an area: one at the head, denoted as $p_a$, and one at the last pixel of the rightmost bar of the vertical projection profile, denoted as $p_b$. These two waypoints are shown in Fig. 9(b). We can infer the signature area of the movement by computing the area of the triangle (Eq. (11)) formed by these two points and a third point collinear with them, denoted as $p_c$, as shown in Fig. 9(c).

$$A_t = \frac{(p_{cx} - p_{ax})(p_{by} - p_{cy})}{2} \qquad (11)$$

where $p_{cx}$ and $p_{ax}$ are the x coordinates of the points $p_c$ and $p_a$, respectively, and $p_{by}$ and $p_{cy}$ are the y coordinates of the points $p_b$ and $p_c$, respectively.
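The signature area of Eq. (11) is the area of a right triangle spanned by the two waypoints. The exact placement of the corner $p_c$ is defined by Fig. 9(c), which is not reproduced here; placing it at $p_a$'s height and $p_b$'s horizontal position is an assumption consistent with Eq. (11).

```python
def signature_area(pa, pb):
    """Area At of Eq. (11) for the two waypoints of the top projection profile.

    pa: (x, y) waypoint at the head; pb: (x, y) waypoint at the last pixel of
    the rightmost profile bar. The right-angle corner pc is placed at pa's
    height and pb's horizontal position (an assumption based on Eq. (11)).
    """
    pax, pay = pa
    pbx, pby = pb
    pcx, pcy = pbx, pay                       # assumed third corner pc
    return (pcx - pax) * (pby - pcy) / 2.0    # Eq. (11)
```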


Fig. 8. Bounding box for back curling detection.


Analogously to the movement of the clarinet bell, we have adopted a simple peak detector to decide whether or not the back curling gesture occurs. The threshold value is computed by taking the mean area value plus the standard deviation. Fig. 10 shows the threshold value and the variation of the area of the triangle related to the back curling gesture. Fig. 10 also shows the ground truth curve for such a video.

Fig. 9. Projection profile for back curling ancillary gesture detection.

Fig. 10. Back curling ancillary gesture analysis.

3. Experimental results

The proposed approach was evaluated on nine video sequences of four clarinetists playing the second movement of Stravinsky's Three Pieces for Solo Clarinet, recorded at the Input Devices and Music Interaction Laboratory in the Music Technology Area of McGill University, Canada. The videos were recorded by a single fixed camera in a studio with non-varying illumination. This particular piece was chosen for its non-metrical structure; the absence of a consistent meter was used to filter out some of the rhythmic time-keeping movements that are pervasive in metrical music (Wanderley et al., 2005). This piece is also ideal because it is unaccompanied; a musician's mental representation of the work does not include sonic events other than those present in his or her own sound. Finally, this piece is part of the standard repertoire for advanced clarinetists, which allows for replications and extensions of previous research with musicians of varying schools of training.

Four performers from different countries (European and North American) were asked to perform each piece several times in three different styles: standard, expressive and immobilized. Standard performances were defined as the type of performance one would give at a live concert; expressive performances were defined as performances where the musician tried to exaggerate the expressive content of his or her performance, though not necessarily the movements; and immobilized performances were defined as performances where the clarinetist was to move as little as possible.

These videos were captured in a studio by a digital video camera installed parallel to the musician. The illumination and the background were the standard ones of the environment. The resolution of the videos varies between 720 × 480 and 400 × 300 pixels, and the frame rate varies between 15 and 25 frames per second. Each video sequence has between 1,300 and 1,700 frames. Furthermore, the video sequences were watched and each video frame was labeled according to the presence or absence of the three ancillary gestures of interest. This ground truth is necessary to evaluate the accuracy of the proposed approach.

There are several threshold values that need to be adjusted to decide whether an ancillary gesture is present or not. Furthermore, it is very difficult to find global values that are suitable for all musicians' performances. Therefore, we have adopted local thresholds, in the sense that they are specific to each video. To this end, we use the first frames of each video for setting up the thresholds and the remaining frames to evaluate the performance of the algorithms in detecting the ancillary gestures. Even if this procedure is not very elegant, it is plausible, since the threshold values are computed in an automatic fashion. Furthermore, the dataset is too small to allow us to work towards a generic approach.

The main goal of the experiments is to evaluate the accuracy in detecting the three ancillary gestures. The nine videos of the dataset contain 443 ancillary gestures, of which 201 are movements of the clarinet bell up/down, 68 are movements of bending knees and 174 are movements of back curling. In fact, the number of movements refers to the number of frames where the ancillary gestures are present. The proposed approach produced 486 gesture detections. However, just looking at the overall number is not a clear indication of accuracy. Table II presents the breakdown of the detections in terms of the number of true positives, which corresponds to the number of gestures detected in frames where the gestures are present according to the ground truth; the number of false positives, which corresponds to the number of gestures detected in frames where the gestures are not present according to the ground truth; and the number of false negatives, which corresponds to the number of gestures not detected in frames where the gestures are present according to the ground truth.

In Table II, the total number of detections corresponds to the sum of the number of true positives and the number of false positives. Among the 486 detected gestures, 404 are true positives and the remaining are false positives. Furthermore, 39 other movements occurred but were not detected (false negatives). Therefore, we can compute the precision and the recall using Eqs. (12) and (13); the results are shown in Table III.

$$\text{precision} = \frac{TP}{TP + FP} \qquad (12)$$

$$\text{recall} = \frac{TP}{TP + FN} \qquad (13)$$

where TP, FP and FN denote the number of true positives, false positives and false negatives, respectively.

Table III. Precision and recall for the ancillary gestures detection.

Ancillary gesture        Precision (%)   Recall (%)
Clarinet Bell Up/Down    78.4            95.5
Bending Knees            77.3            85.3
Back Curling             92.8            88.5
Overall                  83.1            91.2

Table II. Breakdown of the ancillary gestures detection.

Ancillary gesture        Total detections   True positive   False positive   False negative
Clarinet Bell Up/Down    245                192             53               9
Bending Knees            75                 58              17               10
Back Curling             166                154             12               20
Total                    486                404             82               39
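For reference, the two measures can be computed directly from the counts of Table II; the Back Curling row below reproduces the 92.8%/88.5% entries of Table III.

```python
def precision_recall(tp, fp, fn):
    """Eqs. (12)-(13)."""
    return tp / (tp + fp), tp / (tp + fn)

# Back Curling row of Table II: TP = 154, FP = 12, FN = 20.
p, r = precision_recall(154, 12, 20)   # p ~ 0.928, r ~ 0.885
```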

4. Conclusion

In this paper, we have presented a novel approach for the detection of ancillary gestures produced by clarinetists during musical performances, which combines several algorithms to detect three ancillary gestures: clarinet bell up/down, back curling and bending knees. The detection rates resulting from the combination of the proposed algorithms are very promising. The approach has a small bias towards producing false positives, since the majority of detection errors (about 68%) were false positive cases. However, this bias is intentional, since the main aim is to highlight the video frames where the ancillary gestures may have occurred for subsequent analysis. Such subsequent analysis, which should be carried out by human specialists, focuses only on the sequences of frames where the gestures were found. Focusing only on short sequences of frames speeds up the analysis. For instance, the nine videos of the dataset that we have used sum up to 12,635 frames; according to the results presented in Table II, it is necessary to analyze only 486 video frames out of 12,635. This represents a reduction of more than 99% in the workload of the specialists. On the other hand, a high number of false negatives would be rather problematic, since false negatives represent loss of data.

The proposed approach for detecting ancillary gestures of clarinetists is original; therefore it is difficult to make a direct comparison with other works. Compared with other results available in the literature that attempt to analyze similar ancillary gestures using body sensors, markers and special clothes, our approach represents an interesting tradeoff between precision and the constraints imposed on the musician's performance, since there is no special requirement regarding the clothes the musician wears or the movements he or she can execute (Wanderley et al., 2005; Verfaille et al., 2006). The musician has the freedom to play the instrument as he or she wishes.

In spite of the good results achieved, there are some shortcomings related to the proposed approach. The main shortcoming is the need to adjust several parameters of the algorithms depending on the scene. Changes in the background, in the illumination or in the position of the musician relative to the camera could interfere with the performance of the proposed approach.

In order to improve these results, it is necessary to employ more sophisticated pre-processing techniques, as well as to find alternative forms of decision-making, replacing the proposed threshold values with decision mechanisms less susceptible to errors. This is one of the subjects of our future work.

References

Aggarwal, J., & Cai, Q. (1999). Human motion analysis: A review. Computer Vision and Image Understanding, 73, 428–440.

Burns, A.-M., & Wanderley, M. (2006). Visual methods for the retrieval of guitarist fingering. In International conference on new interfaces for musical expression (pp. 196–199). Paris, France.

Butler, D., Sridharan, S., & Bove, V. (2003). Real-time adaptive background segmentation. In IEEE international conference on acoustics, speech, and signal processing (pp. 349–352). Hong Kong.

Cucchiara, R., Grana, C., Piccardi, M., & Prati, A. (2003). Detecting moving objects, ghosts, and shadows in video streams. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(10), 1337–1342.

Dahl, S., & Friberg, A. (2004). Expressiveness of musician's body movements in performances on marimba. Gesture-Based Communication in Human–Computer Interaction, 1, 479–486.

Davidson, J. (1993). Visual perception of performance manner in the movements of solo musicians. Psychology of Music, 21, 103–113.

Delalande, F. (1988). La gestique de Glenn Gould. Louise Courteau Editrice, 84–111.

Elgammal, A., Duraiswami, R., Harwood, D., & Davis, L. (2002). Background and foreground modeling using nonparametric kernel density estimation for visual surveillance. Proceedings of the IEEE, 90(7), 1151–1163.

Fernandez-Caballero, A., Castillo, J., & Rodriguez-Sanchez, J. (2012). Human activity monitoring by local and global finite state machines. Expert Systems with Applications, 39, 6982–6993.

Gavrila, D. (1999). The visual analysis of human movement: A survey. Computer Vision and Image Understanding, 73, 82–98.

Godoy, R., & Jensenius, A. (2009). Body movement in music information retrieval. In 10th international society for music information retrieval conference (pp. 45–50). Kobe, Japan.

Goldstein, M. (1998). Gestural coherence and musical interaction design. In IEEE international conference on systems, man, and cybernetics (Vol. 2, pp. 1076–1079). San Diego, USA.

Karasulu, B., & Korukoglu, S. (2012). Moving object detection and tracking by using annealed background subtraction method in videos: Performance optimization. Expert Systems with Applications, 39, 33–43.

Koerich, A., & Kalva, P. (2005). Unconstrained handwritten character recognition using metaclasses of characters. In IEEE international conference on image processing (pp. 542–545). Genoa, Italy.

Koerich, A., Sabourin, R., & Suen, C. (2005). Recognition and verification of unconstrained handwritten words. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(10), 1509–1522.

Kumar, P., Sengupta, K., & Lee, A. (2002). A comparative study of different color spaces for foreground and shadow detection for traffic monitoring system. In IEEE fifth international conference on intelligent transportation systems (pp. 100–105). Singapore.

Lamberti, L., & Camastra, F. (2012). Handy: A real-time three color glove-based gesture recognizer with learning vector quantization. Expert Systems with Applications, 39, 10489–10494.

Lee, L., Lizarraga, M., Gomes, N., & Koerich, A. (1997). A prototype for Brazilian bankcheck recognition. International Journal of Pattern Recognition and Artificial Intelligence, 11(4), 549–569.

Lo, B., & Velastin, S. (2001). Automatic congestion detection system for underground platforms. In International symposium on intelligent multimedia, video and speech processing (pp. 158–161). Hong Kong.

Lucas, B., & Kanade, T. (1981). An iterative image registration technique with an application to stereo vision. In Proceedings of imaging understanding workshop (pp. 121–130).

Mitra, S., & Acharya, T. (2007). Gesture recognition: A survey. IEEE Transactions on Systems, Man and Cybernetics, 37(3), 311–324.

Oliver, N., Rosario, B., & Pentland, A. (2000). A Bayesian computer vision system for modeling human interactions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 831–843.

Piccardi, M., & Jan, T. (2004). Efficient mean-shift background subtraction. In IEEE international conference on systems, man and cybernetics (pp. 3099–3104). Singapore.

Ruiz-Lozano, M., Medina, J., Delgado, M., & Castro, J. (2012). An expert fuzzy system to detect dangerous circumstances due to children in the traffic areas from the video content analysis. Expert Systems with Applications, 39, 9108–9117.

Seki, M., Wada, T., Fujiwara, H., & Sumi, K. (2003). Background subtraction based on co-occurrence of image variations. In Computer vision and pattern recognition conference (pp. 65–72). Madison, USA.

Shuai, C., Maeda, Y., & Takahashi, Y. (2012). Music conductor gesture recognized interactive music generation system. In Joint 6th international conference on soft computing and intelligent systems and 13th international symposium on advanced intelligent systems (pp. 840–845). Kobe, Japan.

Sotirios, M., & Georgios, P. (2009). Computer vision method in music interaction. In First international conference on advances in multimedia (pp. 146–151). Colmar, France.

Stauffer, C., & Grimson, W. (1999). Adaptive background mixture models for real-time tracking. In IEEE conference on computer vision and pattern recognition (pp. 246–252). Ft. Collins, USA.

Tsagaris, A., Manitsaris, S., Dimitropoulos, K., & Manitsaris, A. (2011). Intelligent invariance techniques for music gesture recognition based on skin modelling. In 12th international symposium on computational intelligence and informatics (pp. 219–223). Budapest, Hungary.

Verfaille, V., Quek, O., & Wanderley, M. (2006). Sonification of musicians' ancillary gestures. In Proceedings of the 12th international conference on auditory display (Vol. 1, pp. 20–23).

Wanderley, M. (1999). Non-obvious performer gestures in instrumental music. Gesture-Based Communication in Human–Computer Interaction, 1, 37–48.

Wanderley, M. (2002). Quantitative analysis of non-obvious performer gestures. Gesture and Sign Language in Human–Computer Interaction, 1, 241–253.

Wanderley, M., Vines, B., Middleton, N., McKay, C., & Hatch, W. (2005). The musical significance of clarinetists' ancillary gestures: An exploration of the field. Journal of New Music Research, 34(1), 97–113.