
RECURRENT NEURAL NETWORKS FOR DRUM TRANSCRIPTION

Richard Vogl, Matthias Dorfer, Peter Knees
Department of Computational Perception
Johannes Kepler University Linz, Austria

    [email protected]

    ABSTRACT

Music transcription is a core task in the field of music information retrieval. Transcribing the drum tracks of music pieces is a well-defined sub-task. The symbolic representation of a drum track contains much useful information about the piece, like meter, tempo, as well as various style and genre cues. This work introduces a novel approach to drum transcription using recurrent neural networks. We claim that recurrent neural networks can be trained to identify the onsets of percussive instruments based on general properties of their sound. Different architectures of recurrent neural networks are compared and evaluated using a well-known dataset. The outcomes are compared to results of a state-of-the-art approach on the same dataset. Furthermore, the ability of the networks to generalize is demonstrated using a second, independent dataset. The experiments yield promising results: the networks achieve F-measures higher than state-of-the-art results while generalizing reasonably well.

    1. INTRODUCTION AND RELATED WORK

Automatic music transcription (AMT) methods aim at extracting a symbolic, note-like representation from the audio signal of music tracks. AMT comprises important tasks in the field of music information retrieval (MIR), as many MIR tasks can be addressed more efficiently with knowledge of a symbolic representation. Additionally, a variety of direct applications of AMT systems exists, for example: sheet music extraction for music students, MIDI generation/re-synthesis, score following for performances, as well as visualizations of different forms.

Drum transcription is a sub-task of AMT which addresses creating a symbolic representation of all notes played by percussive instruments (drums, cymbals, bells, etc.). The source material is usually, as in AMT, a monaural audio source, either polyphonic audio containing multiple instruments or a solo drum track. The symbolic representation of notes played by the percussive instruments can be used to derive rhythmical meta-information like tempo, meter, and downbeat. The repetitive rhythmical structure of the drum track, as well as changes therein, can be used as features for high-level MIR tasks. They provide information about the overall structure of the song which can be utilized for song segmentation [18]. The drum rhythm patterns can also be utilized for genre classification [7]. Other applications for rhythmic patterns include query-by-tapping and query-by-beat-boxing [11, 19].

© Richard Vogl, Matthias Dorfer, Peter Knees. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Richard Vogl, Matthias Dorfer, Peter Knees. Recurrent Neural Networks for Drum Transcription, 17th International Society for Music Information Retrieval Conference, 2016.

A common approach to the task of drum transcription is to apply methods used for source separation, like non-negative matrix factorization (NMF), independent component analysis (ICA), or sparse coding. In recent work, Dittmar and Gärtner [5] use an NMF approach to transcribe solo drum tracks into three drum sound classes representing bass drum, snare drum, and hi-hat. They achieve F-measure values of up to 95%. Their approach focuses on real-time transcription of solo drum tracks for which training instances of each individual instrument are present. This is a very specific use case, and in many cases separate training instances for each instrument are not available. A more general and robust approach which is able to transcribe differently sounding instruments is desirable. Smaragdis [28] introduces a convolutional NMF method. It uses two-dimensional matrices (instead of the one-dimensional vectors used in NMF) as temporal-spectral bases, which allows considering the temporal structure of the components. Smaragdis shows that this method can be applied to transcribe solo drum tracks. Lindsay-Smith and McDonald [21] extend this method and use convolutive NMF to build a system for solo drum track transcription. They report good results on a non-public, synthetic dataset.
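The factorization step underlying such NMF-based systems can be sketched with off-the-shelf tools. The following toy example is illustrative only (synthetic data, an arbitrary component count of three), not the cited authors' systems: it factorizes a non-negative "spectrogram" into spectral bases and activation curves.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy magnitude "spectrogram": 64 frequency bins x 400 frames, built
# from three synthetic percussive templates with random activations.
rng = np.random.default_rng(0)
templates = np.abs(rng.normal(size=(64, 3)))     # spectral bases
activations = np.abs(rng.normal(size=(3, 400)))  # activity over time
V = templates @ activations + 1e-6               # non-negative mixture

# Factorize V ~= W H: W holds per-component spectral prototypes,
# H holds the corresponding activation curves over time.
model = NMF(n_components=3, init="random", random_state=0, max_iter=500)
W = model.fit_transform(V)   # shape (64, 3)
H = model.components_        # shape (3, 400)

print(W.shape, H.shape)  # -> (64, 3) (3, 400)
```

In a real system, each column of W would then have to be assigned to a drum instrument, and onsets extracted from the rows of H, as discussed below.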

Fitzgerald et al. [9] introduce prior subspace analysis, an ICA method using knowledge of the signals to be separated, and demonstrate its application to drum transcription. Spich et al. [29] extend this approach by incorporating a statistical music language model. These works focus on transcription of three and two instruments, respectively.

Scholler and Purwins [26] use a sparse coding approach to calculate a similarity measure for drum sound classification. They use eight basis vectors to represent the sounds for bass drum, snare drum, and hi-hat in the time domain. Yoshii et al. [33] present an automatic drum transcription system based on template matching and adaptation, similar to sparse coding approaches. They focus on transcription of snare and bass drum only, from polyphonic audio signals.

Algorithms based on source separation usually use the input signal to produce prototypes (or components) representing individual instruments and so-called activation curves which indicate their activity. A peak-picking algorithm is needed to identify the instrument onsets in the activation curves. Additionally, the identified component prototypes have to be assigned to instruments. This is usually done using machine learning algorithms in combination with standard audio features [6, 16].

Figure 1. Overview of the proposed method. The extracted spectrogram is fed into the trained RNN, which outputs activation functions for each instrument. A peak-picking algorithm selects appropriate peaks as instrument onset candidates.
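The peak-picking step on an activation curve can be illustrated as follows. The threshold and minimum peak distance are illustrative values, not the settings used in this paper:

```python
import numpy as np
from scipy.signal import find_peaks

# Synthetic activation curve at a 100 Hz frame rate: three clear
# activations on top of low-level noise.
rng = np.random.default_rng(1)
activation = 0.05 * rng.random(300)
for frame in (40, 120, 250):
    activation[frame] = 1.0

# Keep peaks above a fixed threshold, at least 50 ms (5 frames) apart.
peaks, _ = find_peaks(activation, height=0.3, distance=5)
onset_times = peaks / 100.0  # frames -> seconds

print(onset_times)  # -> [0.4 1.2 2.5]
```

Real systems typically use adaptive thresholds relative to a local moving average of the activation curve rather than a single global value.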

Another approach found in the literature is to first segment the audio stream using onset detection and classify the resulting fragments. Gillet and Richard [13] use a combination of a source separation technique and a support vector machine (SVM) classifier to transcribe drum sounds from polyphonic music. Miron et al. use a combination of frequency filters, onset detection, and feature extraction in combination with a k-nearest-neighbor [23] and a k-means [22] classifier to detect drum sounds in a solo drum audio signal in real-time. Hidden Markov models (HMMs) can be used to perform segmentation and classification in one step. Paulus and Klapuri [24] use HMMs to model the development of MFCCs over time. Decoding the most likely sequence yields activation curves for bass drum, snare drum, and hi-hat and can be applied to both solo drum tracks as well as polyphonic music.
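Decoding the most likely state sequence of an HMM is typically done with the Viterbi algorithm. The following is a minimal sketch on a toy two-state model ("silence" vs. "hit"); all probabilities are illustrative and unrelated to the cited systems:

```python
import numpy as np

def viterbi(log_init, log_trans, log_obs):
    """Most likely HMM state sequence (all inputs in the log domain).
    log_init: (S,), log_trans: (S, S), log_obs: (T, S)."""
    T, S = log_obs.shape
    delta = log_init + log_obs[0]          # best score ending in each state
    backptr = np.zeros((T, S), dtype=int)  # best predecessor per state
    for t in range(1, T):
        scores = delta[:, None] + log_trans
        backptr[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_obs[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):          # walk the back-pointers
        path.append(int(backptr[t][path[-1]]))
    return path[::-1]

log_init = np.log([0.9, 0.1])
log_trans = np.log([[0.8, 0.2], [0.3, 0.7]])
log_obs = np.log([[0.9, 0.1], [0.2, 0.8], [0.1, 0.9], [0.8, 0.2]])
print(viterbi(log_init, log_trans, log_obs))  # -> [0, 1, 1, 0]
```

In the HMM systems above, the decoded state sequence per instrument plays the role that the thresholded activation curve plays in the source-separation approaches.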

Artificial neural networks consist of nodes (neurons) forming a directed graph in which every connection has a certain weight. Since the discovery of gradient descent training methods which make training of complex architectures computationally feasible [17], artificial neural networks have regained popularity in the machine learning community. They are being successfully applied in many different fields. Recurrent neural networks (RNNs) feature additional connections (recurrent connections) in each layer, providing the outputs of the same layer from the last time step as additional inputs. These connections can serve as memory for neural networks, which is beneficial for tasks with sequential input data. RNNs have been shown to perform well, e.g., for speech recognition [25] and handwriting recognition [15]. Böck and Schedl use RNNs to improve beat tracking results [3] as well as for polyphonic


Figure 2. Spectrogram of a drum track and target functions for bass drum, snare drum, and hi-hat. The target function has a value of 1.0 at the frames at which annotations for the instruments exist and 0.0 otherwise. The frame rate of the target function is 100 Hz, the same as for the spectrogram. The third graph shows the output of the trained RNN for the spectrogram in the first image.

piano transcription [4]. Sigtia et al. [27] use RNNs in the context of automatic music transcription as music language models to improve the results of a frame-level acoustic classifier. Although RNNs have been used in the past for transcription systems [4], we are not aware of any work using RNNs for transcription of drum tracks.
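The forward pass of such a recurrent layer, producing per-instrument activation functions like those in Figure 2, can be sketched in NumPy. All dimensions and weights below are illustrative (random, untrained), and the training procedure is omitted entirely:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 64 spectrogram bins in, 20 hidden units,
# 3 outputs (bass drum, snare drum, hi-hat activation functions).
n_in, n_hid, n_out = 64, 20, 3
W_in  = rng.normal(scale=0.1, size=(n_hid, n_in))
W_rec = rng.normal(scale=0.1, size=(n_hid, n_hid))  # recurrent connections
W_out = rng.normal(scale=0.1, size=(n_out, n_hid))

def rnn_forward(frames):
    """frames: (T, n_in) spectrogram; returns (T, n_out) activations."""
    h = np.zeros(n_hid)
    outputs = []
    for x in frames:
        # The hidden state depends on the current frame AND on the
        # previous hidden state -- the layer's "memory".
        h = np.tanh(W_in @ x + W_rec @ h)
        outputs.append(1.0 / (1.0 + np.exp(-(W_out @ h))))  # sigmoid
    return np.array(outputs)

acts = rnn_forward(rng.normal(size=(100, n_in)))
print(acts.shape)  # -> (100, 3)
```

After training against binary target functions (1.0 at annotated frames, 0.0 elsewhere), the sigmoid outputs can be fed to a peak-picking stage to obtain onset candidates.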

    2. TASK AND MOTIVATION

In this work, we introduce a new method for automatic transcription of solo drum tracks using RNNs. While it is a first step towards drum transcription from polyphonic music, there also exist multiple applications for the transcription of solo drum tracks. In electronic music production, it can be used to transcribe drum loops if a re-synthesis using different sounds is desired. The transcription of recorded solo drum tracks can be used in the context of recording and production of rock songs. Nowadays it is not unusual to use sampled drums, e.g., in low-budget productions or in modern heavy metal genres. On the one hand, this is due to the complexity and costs of recording drums. On the other hand, with sampled drums it is easier to achieve the high precision and even robotic-sounding style desired in some genres. Instead of manually programming the drum track, automatic transcription of a simple low-quality drum recording can be used as a basis for the production of a song. As in other works, we focus on the transcription of bass drum, snare drum, and hi-hat. These instruments usually define the main rhythmic patterns [24], depending on genre and play style. They also cover most (>80% in the case of the ENST-Drums dataset, see Sect