
Composition of Multi-instrument Music with Deep Learning

Yarne Hermann
Student number: 01300711

Supervisor: Prof. dr. ir. Bart Dhoedt
Counsellors: Dr. ir. Tim Verbelen, Dr. ir. Bert Vankeirsbilck

Master's dissertation submitted in order to obtain the academic degree of
Master of Science in Computer Science Engineering

Academic year 2018-2019


Permission of Use

“The author gives permission to make this Master’s dissertation available for consultation and to copy parts of this Master’s dissertation for personal use. In the case of any other use, the copyright terms have to be respected, in particular with regard to the obligation to state explicitly the source when quoting results from this Master’s dissertation.”

Yarne Hermann, May 2019



Acknowledgements

As someone who has been writing his own music for eight years now, and who has a deep interest in deep learning, this thesis subject was perfect for me. Over the course of the past year, I have been able to combine my two biggest interests into one result.

I would like to start by thanking my counsellors, Dr. Ir. Tim Verbelen and Dr. Ir. Bert Vankeirsbilck, who have guided me throughout this project. I want to thank Dr. Ir. Cedric De Boom for providing the initial thesis subject and for helping me through the final months of this project. I would also like to thank my supervisor, Prof. Dr. Ir. Bart Dhoedt, for his support on this project as well as on my other endeavors.

I would like to thank my uncle for his interest in this project and for all the ways in which he has helped me with it. Furthermore, I would like to thank my friends, Kilian and Jurgen, for their interest, their help with the survey and the mental support they provided throughout this year. I would also like to thank everyone who participated in the survey and showed great interest in my results.

Finally, I would like to thank my parents, who have always provided me with the support I needed, especially during the last few days of this thesis. They have gone out of their way to allow me to spend an exchange at the EPFL in Switzerland and to continue my studies in the United States. I would not wish for any other parents than them.

Yarne Hermann, May 2019



Composition of Multi-instrument Music with Deep Learning

by

Yarne Hermann

Master’s dissertation submitted in order to obtain the academic degree of
Master of Science in Computer Science Engineering

Supervisor: Prof. Dr. Ir. Bart Dhoedt
Counsellors: Dr. Ir. Tim Verbelen, Dr. Ir. Bert Vankeirsbilck

Faculty of Engineering and Architecture, Ghent University
Department of Information Technology
Chair: Prof. Dr. Ir. Bart Dhoedt
Academic year 2018-2019

Abstract

Artificially generating interesting music has been a challenge that many researchers have attempted to tackle. With the advent of deep learning models, and particularly LSTM networks, impressive results have been achieved that generate single instrument tracks, as well as multi-instrument songs. Multi-instrument music generation models usually start by generating the rhythm section and use this to generate a melody. This approach makes sense for many genres, but what about the genres where it does not? Metal music, for example, exhibits very different characteristics from pop music and requires a writing approach that starts with the melodic instruments and uses these to generate the rhythm section. This thesis researches metal generation following this writing process, by first generating lead guitar tracks out of thin air, and sequentially adding rhythm guitar, bass and drum tracks. Besides using this new generation order, some simplifications made for pop music generation cannot be translated to metal music. By not making those simplifications, the generation approach should be more flexible than some existing approaches. Different Long Short-Term Memory (LSTM) networks are used for each instrument. Lead guitar generation uses a more basic LSTM model, while the other instruments use Bi-LSTM models. Resulting generations were evaluated through an online survey with 109 participants.

Keywords: music generation, deep learning, long short-term memory networks




Composition of Multi-instrument Music with Deep Learning

Yarne Hermann

Abstract—Artificially generating interesting music has been a challenge that many researchers have attempted to tackle. With the advent of deep learning models, and particularly LSTM networks, impressive results have been achieved that generate single instrument tracks, as well as multi-instrument songs. Multi-instrument music generation models usually start by generating the rhythm section and use this to generate a melody. This approach makes sense for many genres, but what about the genres where it does not? Metal music, for example, exhibits very different characteristics from pop music and requires a writing approach that starts with the melodic instruments and uses these to generate the rhythm section. This thesis researches metal generation following this writing process, by first generating lead guitar tracks out of thin air, and sequentially adding rhythm guitar, bass and drum tracks. Besides using this new generation order, some simplifications made for pop music generation cannot be translated to metal music. By not making those simplifications, the generation approach should be more flexible than some existing approaches. Different Long Short-Term Memory (LSTM) networks are used for each instrument. Lead guitar generation uses a more basic LSTM model, while the other instruments use Bi-LSTM models. Resulting generations were evaluated through an online survey with 109 participants.

I. INTRODUCTION

The capability to create art has long been a difference between human minds and the processors of computers. As computers evolve and as research on artificial intelligence increases, computers are becoming more and more capable of creating art. Xie et al. have created a model to generate poetry [1].

Music generation is also being researched extensively. Models have been created that can generate single instrument tracks [2], as well as music that consists of multiple instruments [3]. Most of this research has, however, focused on the generation of classical and pop music, but not on metal music.

The writing process of multi-instrument metal music exhibits major differences compared to that of pop music. One particular difference is the order in which instrument tracks are written. A sequential multi-instrument generation approach for metal music is proposed that first generates lead guitar, and then adds rhythm guitar, bass and drum tracks one by one.

LSTM models will be created for each of these instruments, where the lead guitar model generates music from nothing, while the other models generate music conditional on already written tracks. The created models will finally be tested through a survey that asks participants to indicate the quality they perceive for generated fragments.

II. STATE OF THE ART

Music generation can lead to two different output representations: an audio or signal representation that consists of audio waveforms, or a symbolic representation that corresponds to sheet music. The latter approach has seen more success and makes it possible to clearly see and transform the notes that are being played. Therefore, this thesis will focus on this format as well.

Different kinds of models have been used for music generation, with the most popular ones being based on Long Short-Term Memory (LSTM) networks [4]. These networks replace the regular hidden units in neural networks by more complex LSTM units that can store information in memory cells. This allows them to learn long-term dependencies between sequence elements, which has led to their increased use for sequence generation.

Monophonic melodies, where each track can only play one note at a time, were created using LSTM networks by Sturm et al. [5] and by Colombo et al. [2]. Generating multi-instrument music usually relies on conditional generation of added instruments. A first example, the conditional generation of monophonic melodies on top of blues chord progressions, was created by Eck et al. [6]. Later models, such as BachBot [7], DeepBach [8] and BachProp [3], improved on this; however, they all made simplifications that cannot be extended to metal music.

Zukowski and Carr have used deep LSTM networks to create audio generation models that generate black metal and death metal [9]. By using SampleRNNs [10], they are able to train a model on a set of actual song recordings and then generate audio signals in the style of the training data. Their models are able to generate some form of structure; however, the resulting generations often seem to copy fragments of the training data and put these fragments in new orders to create new songs. While Zukowski and Carr focused on audio generation, models that generate symbolic multi-instrument metal music have not yet been explored.

III. DATA SET CREATION

Data sets containing metal music in the desired format of a lead guitar, rhythm guitar, bass and drum track are not available. This meant that a data set had to be created. This was done by collecting sheet music from the Ultimate Guitar [11] website, where users submit sheet music for existing songs in different formats. The most popular format is the Guitar Pro [12] format. Guitar Pro is a program for writing sheet music in both standard and tablature notation.

300 Guitar Pro files of metal songs were collected from Ultimate Guitar. These were then manually updated so that they would consist of a single lead guitar, rhythm guitar, bass and drum track. Some files were also manually updated to remove outlying note pitches and durations. The guitar and bass tracks were also transposed, so that they would use the same root in all songs. This was done to make an abstraction of actual pitches and instead consider each pitch as an element within a musical scale.

The resulting Guitar Pro files were then transformed to the MIDI format, which can be read in Python via the python-midi library [13]. The MIDI data set can be found at https://tinyurl.com/y29gwb5u. These MIDI files were in turn transformed to a more useful representation that represents each track as a sequence of notums. Each notum has a pitch attribute, which is the set of all pitches that are being played together (with an empty set corresponding to a rest), and a duration attribute. An extended version of the notum representation, the enotum representation, splits the pitch set into a single pitch value and a modifier value, which is the set of intervals to the other pitches that are being played together. This set of intervals corresponds to a specific version of a certain chord type (such as major, minor or dominant 7th). The enotum representation is more useful for lead guitar generation, while the notum representation is best suited for the generation of the other instruments. To remove long, outlying (e)notum durations, the duration values were limited to a maximum of one whole note duration, or 1920 MIDI ticks. To represent longer (e)notums, a specific hold (e)notum was introduced that indicates that the previous (e)notum's duration should be extended with the hold notum's duration.
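As an illustration, the sketch below shows one possible way to encode the notum and enotum representations in Python; the class and helper names (Notum, Enotum, to_enotum, the use of -1 for a rest) are hypothetical and only mirror the description above.

```python
from typing import FrozenSet, NamedTuple

WHOLE_NOTE_TICKS = 1920  # maximum (e)notum duration in MIDI ticks


class Notum(NamedTuple):
    """A set of simultaneously played pitches (empty set = rest) and a duration."""
    pitches: FrozenSet[int]
    duration: int            # in MIDI ticks, capped at WHOLE_NOTE_TICKS
    is_hold: bool = False    # hold notum: extends the previous notum's duration


class Enotum(NamedTuple):
    """Extended notum: a single pitch plus the intervals to the other pitches."""
    pitch: int                # -1 is used here to denote a rest
    modifier: FrozenSet[int]  # intervals (in semitones) above the pitch
    duration: int
    is_hold: bool = False


def to_enotum(notum):
    """Split a notum's pitch set into a base pitch and a set of intervals."""
    if not notum.pitches:
        return Enotum(-1, frozenset(), notum.duration, notum.is_hold)
    base = min(notum.pitches)
    intervals = frozenset(p - base for p in notum.pitches if p != base)
    return Enotum(base, intervals, notum.duration, notum.is_hold)
```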

Finally, different preprocessing steps were defined. A first stage converted the drum set used in each song to a limited set that only consisted of a kick drum, snare drum, hi-hat, tom and a crash cymbal. This was done in order to deal with imbalance between the occurrences of different drums. Bass tracks were also turned monophonic, since they rarely played multiple pitches simultaneously. A second stage converted the tracks into time steps of 20 MIDI ticks. Each time step would still have an associated notum that now had a duration of 20 ticks. This made it possible to synchronize multiple instrument tracks and was required for the conditional generation of rhythm guitar, bass and drum tracks. This stage is not used for lead guitar generation. A final stage then converted the (e)notums to a one-hot or multi-hot representation that can be used as the input and output of neural networks.
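The conversion to 20-tick time steps could look roughly like the following sketch; the to_time_steps function and its use of None as a hold symbol are illustrative assumptions, not the actual preprocessing code.

```python
TIME_STEP_TICKS = 20


def to_time_steps(notums):
    """Expand a notum sequence into fixed 20-tick time steps.

    The first step of each notum keeps its pitch set; the remaining steps are
    emitted as hold symbols (None here), so durations are only represented
    implicitly by the number of steps.
    """
    steps = []
    for notum in notums:
        n_steps = max(1, notum.duration // TIME_STEP_TICKS)
        steps.append(notum.pitches)           # onset step
        steps.extend([None] * (n_steps - 1))  # hold steps
    return steps
```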

IV. AN APPROACH FOR MULTI-INSTRUMENT MUSIC GENERATION

The approach taken for generating multi-instrument music is a sequential one that starts by generating a lead guitar track from nothing. Lead guitar information is then used to generate a rhythm guitar track. Both guitars are in turn used to generate the bass track and finally, the guitar tracks and bass track are used for generating drums. An advantage of this approach is that it can achieve multi-instrument music, but each instrument can also be focused on separately. Besides this, the approach makes it possible to generate accompanying instruments for already written tracks.

It should be noted that in this particular thesis, the full generation of a song will not be done. This is simply to always use optimal (human-made) input tracks to generate an accompaniment for. Generating accompaniments to existing tracks also allows better verification of the individual instrument track generators.

A. Lead Guitar

For lead guitar generation, enotums will be used instead of notums. A sequence of enotums in one-hot encoding will be taken as input to the network, which will output the next enotum attributes in one-hot notation. This means that the network in fact has three outputs: one for the enotum pitch, one for its duration and one for its modifier. Monophonic generation can easily be achieved by omitting the modifier inputs and outputs.

The network takes an input sequence of 200 enotums and uses this to predict the next enotum. The input sequence is also called the history. A history length of 200 notums on average corresponds to 14.6% of the notums in a song from the data set. Such a long history helps to capture long-term dependencies between the notums.

Since an input sequence of 200 is required for predicting a notum, the model would not be able to predict the first 200 notums of a song. This is remedied by placing 200 designated start (e)notums before the first enotum of each song. These start enotums are only used at the input of the model and not generated at its output. The concept of start (e)notums is also used for the generation of the other instruments.

1) Model Architecture: The proposed model architecture is shown in Figure 1. For each element in the sequence of enotums, its three attributes are first concatenated. The sequence is then put through an LSTM layer that has 128 hidden nodes and outputs a final state after processing the whole sequence. The LSTM layer is followed by a dropout layer, which turns some of its inputs to zero during training in order to avoid overfitting.

After the dropout layer, probabilities for the values of each enotum attribute are generated in a conditional way. First the dropout layer's outputs are passed to a fully-connected (or dense) layer to output probabilities for the generated enotum's pitch. These probabilities are then concatenated with the dropout layer's output and put through another dense layer to output probabilities for the generated enotum's duration. Finally, the pitch and duration probabilities are concatenated with the dropout layer's output and put through a final dense layer to output probabilities for the generated enotum's modifier. These dense layers use a softmax activation function, which is an activation function that is used to output probabilities for multi-class problems.
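A minimal sketch of this conditional-output architecture in the Keras functional API is given below; the vocabulary sizes and dropout rate are placeholder assumptions, and only the chaining of the three output heads follows the description above.

```python
from keras.layers import Concatenate, Dense, Dropout, Input, LSTM
from keras.models import Model

HISTORY = 200
# Placeholder vocabulary sizes; the real sizes depend on the data set.
N_PITCH, N_DUR, N_MOD = 30, 10, 10

# Input: history of enotums with their one-hot attributes already concatenated.
inp = Input(shape=(HISTORY, N_PITCH + N_DUR + N_MOD))
state = LSTM(128)(inp)       # final state after processing the whole sequence
state = Dropout(0.3)(state)  # placeholder dropout rate

# Conditional output heads: pitch first, then duration given pitch,
# then modifier given pitch and duration.
pitch = Dense(N_PITCH, activation='softmax', name='pitch')(state)
duration = Dense(N_DUR, activation='softmax', name='duration')(
    Concatenate()([state, pitch]))
modifier = Dense(N_MOD, activation='softmax', name='modifier')(
    Concatenate()([state, pitch, duration]))

model = Model(inputs=inp, outputs=[pitch, duration, modifier])
```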

A simplified monophonic version of this model was also tested. This version omitted the modifier attribute and did not use conditional prediction of the attributes. Instead, the dropout layer's output was put through a dense layer for pitch prediction and through another dense layer for duration prediction. The simplified model could also be trained to output sequences. In this sequence-outputting approach, a sequence of 200 enotums is still supplied, but the model generates an output after each enotum. This can lead to better training results [14]. Outputting sequences is, however, harder when using dependent outputs.


Fig. 1: Model architecture for lead guitar generation.

2) Generation: Generation of a sequence of enotums is done in a sequential manner. With each step, the model takes a sequence of 200 enotums as its input and generates output probabilities for the attributes of the next enotum. These probabilities are then sampled to obtain actual values for the generated enotum. This generated enotum is then added at the end of the original input sequence, while the first element of this sequence is removed. The resulting new sequence is then used as an input for the model to generate the following enotum, and so on.
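This sliding-window loop can be sketched as follows; model and sample are assumed to be the trained network and a sampling helper, such as the temperature sampling described next.

```python
import numpy as np


def generate(model, seed_history, n_steps, sample):
    """Autoregressively generate n_steps enotums from a seed history.

    seed_history has shape (HISTORY, input_dim); sample() maps the model's
    output probabilities to a one-hot encoded enotum vector.
    """
    history = np.array(seed_history)
    generated = []
    for _ in range(n_steps):
        probs = model.predict(history[np.newaxis, ...])  # attribute probabilities
        enotum = sample(probs)                           # one-hot encoded enotum
        generated.append(enotum)
        history = np.vstack([history[1:], enotum])       # slide the window by one
    return generated
```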

Sampling is done using a temperature τ. This means that the probabilities that were output by the model are converted according to:

\tilde{p}_i = f_\tau(p_i) = \frac{p_i^{1/\tau}}{\sum_j p_j^{1/\tau}} \qquad (1)

and can be used to tune the randomness of the generation. If τ equals 1.0, all probabilities remain the same. As the temperature goes to 0, the high probabilities will become even higher and the low probabilities will become lower, leading to a higher sampling preference for the value that originally had the highest probability. If a temperature higher than 1.0 is chosen, the probabilities will become more uniform, becoming perfectly uniform as the temperature nears infinity. Temperatures higher than 1.0 will therefore introduce more randomized sampling.

Different temperatures can be used per output and the temperatures can also be adapted during generation, by e.g. increasing them if the same value keeps being predicted. Temperature sampling is also used for the other instruments.
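A possible implementation of Equation (1) followed by sampling is sketched below.

```python
import numpy as np


def sample_with_temperature(probs, temperature=1.0):
    """Rescale a probability vector with a temperature and sample an index."""
    probs = np.asarray(probs, dtype=np.float64)
    scaled = probs ** (1.0 / temperature)  # p_i^(1/tau)
    scaled /= scaled.sum()                 # renormalize
    return np.random.choice(len(scaled), p=scaled)
```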

Apart from temperature sampling, the generation can also make use of acceptance sets. The concept behind acceptance sets is that only a limited set of pitches, durations and modifiers can be used (accepted) in the generated enotums for a track. Moreover, each pitch also has a limited set of modifiers that can be generated together with it. This idea was inspired by analyzing the data set. On average, a song in the data set only uses 28 different pitch values (ignoring rest and hold values), 8 different duration values and 7 different modifier values. Before generation, an acceptance set can be created by sampling the probability of occurrence in a song for each possible value. For example, if a certain duration is only used in 30% of the songs in the data set, it has a 30% chance to be included in the acceptance set. The modifier of a generated enotum is sampled after the pitch and this sampling is further restricted. For each pitch, only the modifiers that have occurred together with it at least once in the data set are accepted. This is because certain scale elements (which the pitches correspond to after transposition) will not work well with certain modifiers.
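A sketch of how an acceptance set might be drawn and applied before sampling is shown below; occurrence_prob (mapping each value to its per-song occurrence rate) is an assumed statistic computed from the data set.

```python
import numpy as np


def draw_acceptance_set(occurrence_prob):
    """Include each value with probability equal to its per-song occurrence rate."""
    return {value for value, p in occurrence_prob.items() if np.random.rand() < p}


def restrict(probs, accepted):
    """Zero out probabilities of non-accepted values and renormalize."""
    masked = np.array([p if i in accepted else 0.0 for i, p in enumerate(probs)])
    if masked.sum() == 0.0:
        return np.asarray(probs)  # fall back if no accepted value has probability mass
    return masked / masked.sum()
```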

V. OTHER INSTRUMENTS

The generation of additional tracks has some properties that are shared between the different instruments.

First off, notums are used instead of enotums. This is because when generating additional instruments it is clearer to see the pitches that are being played by the other instruments, instead of only a single pitch and a modifier. Especially for bass generation, this has the benefit of allowing the bass to follow one of the pitches in a chord if it would want to do so. Furthermore, not using modifiers reduced the input dimensions.

Secondly, when dealing with multiple instruments, a conversion to time steps is required for synchronization between the different tracks. A history with a duration of eight whole notes is used for these models, which equals 768 time steps of 20 ticks. On average, a history of this length corresponds to 4.87% of the time steps in a song. This history length already requires much more processing than the history length of 200 notums that was used in Section IV-A. A length of eight whole notes was selected, however, in order to still obtain a decent history. Using time steps with a set duration also means that notum durations are no longer explicitly generated (only implicitly by generating hold symbols). Therefore, the duration attribute does not need to be included at the model input, which reduces the input dimensions.

For each of the models, the generation of the notum at time step t for track n depends on the following surrounding notums (a sketch of how these context windows might be assembled follows the list):

• notums t−768 ... t−1 of track n
• notums t−768 ... t−1 of tracks 0 ... n−1
• notum t of tracks 0 ... n−1
• notums t+1 ... t+768 of tracks 0 ... n−1, in reversed order (i.e. moving towards the current time step)
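The sketch below illustrates how these four context windows could be sliced out of per-track time-step matrices; the padding layout (768 start symbols in front, 768 end rests at the back) follows the text, but the array shapes and the context_windows helper itself are assumptions.

```python
import numpy as np

H = 768  # history length in 20-tick time steps (eight whole notes)


def context_windows(track_n, other_tracks, t):
    """Assemble the inputs for predicting time step t of track n.

    track_n and every array in other_tracks are assumed to be padded with H
    start symbols in front and H end rests at the back, so that index t + H
    corresponds to time step t of the unpadded song.
    """
    i = t + H
    others = np.concatenate(other_tracks, axis=-1)  # feature-wise concatenation
    past_self = track_n[i - H:i]                    # notums t-768 ... t-1 of track n
    past_others = others[i - H:i]                   # notums t-768 ... t-1 of tracks 0 ... n-1
    current_others = others[i]                      # notum t of tracks 0 ... n-1
    future_others = others[i + 1:i + 1 + H][::-1]   # notums t+1 ... t+768, reversed
    return past_self, past_others, current_others, future_others
```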


Fig. 2: Model architecture for generating monophonic rhythm guitar tracks.

By also using future notums of the other tracks, a model can anticipate what is about to be played and generate notums that take this into account. 768 extra end symbols or rests need to be added at the ends of the other tracks to allow the model to receive future notums when predicting the last 768 time steps (similar to how a start notum was required). Rests were simply added instead of designated end notums for this.

A. Basic Model Architecture

The model architecture for the generation of monophonic rhythm guitar tracks is shown in Figure 2. It puts each of the three possible sequence types (previous notums for the instrument under prediction, previous notums for the other instruments and future notums for the other instruments) through an LSTM layer that outputs a state after the full sequence is processed¹. The current notums of the other instruments are concatenated and put through a dense layer. This is then followed by two dense layers, with a dropout layer in between, that result in the predicted probabilities for the generated notum's pitch.

¹ When there are multiple other instruments, their values are concatenated before going to their respective LSTM layer.

Both the bass model and the rhythm guitar model follow this structure (rhythm guitar thus being monophonic). Their generation also happens in a similar way to the lead guitar generation, but without acceptance sets, since the other tracks can guide their pitches. The drum model required slight adaptations and is shown in Figure 3.

First off, the drum track is polyphonic, since it should be possible to hit multiple drums simultaneously. To deal with this polyphonic nature, the model has five different outputs, one for each drum. Each of these outputs is used to predict whether the particular drum should be hit or not. If a particular drum is not hit, this can be regarded as a rest for that drum. If all five drum outputs give a rest, this leads to an actual rest for the entire drum set. Each drum also has its own characteristics and is played differently, which is another reason for using different outputs.

Secondly, extra inputs are added to include an indication of the current time signature, as well as the index of the current time step within the bar. Drums heavily rely on this information to generate structured patterns. These inputs are concatenated together with the other values of the big concatenation in the middle of the model.

Finally, when converting to time steps, drum hits will always correspond to a single time step and all other time steps are rests (hold symbols are not present, since a drum hit cannot be held). This leads to a big imbalance of rest notums (3,610,590) compared to non-rest notums (374,500) over all drum tracks in the drum data set. The key to dealing with this imbalance is to differentiate rest outputs. Normally, each drum output would only be able to represent two values: a hit or a rest. Differentiating rest outputs leads to the use of 1 hit value and 24 different rest values, which could be named differently, e.g. rest1, rest2, ..., rest24.

These values are only used for the target outputs during training. Original rests are converted to follow the rule that the first rest after a hit always is rest1, and subsequent rests follow the ordering of the rest names in a circular way. Using these differentiated rests reduces imbalance and allows the model to train better on the hit probability for each drum.
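The rest differentiation can be sketched as follows; the exact encoding and the differentiate_rests helper are assumptions that mirror the description above.

```python
N_REST_CLASSES = 24


def differentiate_rests(drum_hits):
    """Convert a single drum's per-time-step targets (1 = hit, 0 = rest) so that
    the first rest after a hit becomes 'rest1', the next 'rest2', and so on,
    wrapping around after 'rest24'."""
    targets, rest_counter = [], 0
    for hit in drum_hits:
        if hit:
            targets.append('hit')
            rest_counter = 0
        else:
            rest_counter = rest_counter % N_REST_CLASSES + 1
            targets.append('rest%d' % rest_counter)
    return targets
```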

VI. IMPLEMENTATION

Each of the models was implemented in Python 3.6 using the Keras 2.2.4 library [15] with a TensorFlow 1.13.1 [16] back-end. All preprocessing was done in Python as well, making use of the python-midi library [13] for MIDI-related operations. All preprocessing steps were also accompanied by a reversed version, so that model outputs could be transformed to MIDI files.

Each model made use of an Adam optimizer with a learning rate of 0.001 and a decay of $10^{-6}$. For the outputs, a softmax activation function was used, shown in Equation (2). This activation function turns multi-class outputs into probabilities. The number of classes is denoted by C in the equation, whereas s relates to the output of the model before applying the softmax activation function.

p_i = f(s)_i = \frac{e^{s_i}}{\sum_j^{C} e^{s_j}} \qquad (2)

For the loss function, categorical cross-entropy was used, shown in Equation (3). This takes the softmax probability of the target class (whose pre-softmax output is denoted $s_t$) and then takes its negative logarithm. The model is optimized by reducing this loss function.

CE = -\log\left(\frac{e^{s_t}}{\sum_j^{C} e^{s_j}}\right) \qquad (3)

Fig. 3: Model architecture for generating drum tracks.

Models that had multiple output layers also had a total loss value, which is the average of the losses of their different outputs and is the actual loss that is minimized (as opposed to each individual loss being minimized). The individual losses could be given weights in order to make the total loss more focused on a particular output's loss.
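In Keras this corresponds to compiling a multi-output model with one loss per output and optional loss weights, roughly as sketched below; the weight values are placeholders and model is assumed to be the multi-output network sketched earlier.

```python
from keras.optimizers import Adam

# Keras minimizes the weighted combination of the per-output losses.
model.compile(
    optimizer=Adam(lr=0.001, decay=1e-6),
    loss={'pitch': 'categorical_crossentropy',
          'duration': 'categorical_crossentropy',
          'modifier': 'categorical_crossentropy'},
    loss_weights={'pitch': 1.0, 'duration': 1.0, 'modifier': 1.0})  # placeholder weights
```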

The initial data set was first split into a training and a test set, with the test set including 30 songs, i.e. 10% of the initial data set. During training, the training set was then further split into the actual training (80%) and a validation (20%) set. The actual training set thus included 216 songs, while the validation set included 54 songs. The splits were made in a semi-balanced way, balanced meaning that the distribution of all values is similar among the different sets. The splits were semi-balanced, since when dealing with songs, it is not always possible to achieve perfect balance for the more infrequent values. However, all possible values were at least contained in the training set.

Generation was very quick for lead guitar, with a speed of around 10 notums per second. A lead guitar track with the average number of notums of 1372 would thus only take a little more than 2 minutes. The other models took much longer to generate, due to their more complex networks and the generation of short time steps. Rhythm guitar, bass and drums respectively took around 1.0, 1.5 and 2.0 seconds per time step. This leads to respective generation times for the average length of 15676 time steps of 4.4, 6.6 and 8.8 hours.

VII. EVALUATION

The evaluation consisted of three parts: an objective evaluation using a similarity measure, a subjective evaluation of the generated results and an online survey that asked participants to rate generated fragments.

A. Objective Evaluation

When generating accompanying instruments for existing tracks, the generated instrument track can be compared to the original track by calculating the similarity. This measure simply indicates for what percentage of the time steps both the generated and the original track are playing the same. A low similarity is not necessarily bad, since different notes can possibly sound good. A high similarity, on the other hand, indicates that the generated music is very much like human-written music, which indicates a level of quality.
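A minimal sketch of such a similarity measure, assuming both tracks are given as equally long lists of per-time-step values, is:

```python
def similarity(generated, original):
    """Fraction of time steps at which both tracks play exactly the same thing."""
    assert len(generated) == len(original)
    matches = sum(g == o for g, o in zip(generated, original))
    return matches / len(original)
```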

This measure was only meaningful for bass generation, since rhythm generation was subjectively not as good and drum generation achieved high similarity due to the high number of rests. Bass tracks generated for songs in the test set were able to reach similarities of 60%-85%, indicating that overall bass generation is quite good.


B. Subjective Evaluation

1) Lead Guitar: Using sampling temperatures around 0.8 for pitch generation, 0.5-0.9 for duration generation and 1.0 for modifier generation led to decent results.

The simple baseline model is able to generate alternating types of song sequences within a single generation, going from melodies to verse riffs to more rhythmic riffs in a quite natural manner. The more advanced model with dependent outputs had more trouble generating sequences with different song sections, but did manage to create more interesting riffs.

Sampling temperatures around 0.5 for the duration generation led to mostly quarter, eighth and sixteenth notes being generated, while temperatures around 0.9 led to more variation, without overdoing it. The model did not often generate chords and had some trouble placing them at the right spots. Using acceptance sets did help with this placing, but still led to few chords.

2) Rhythm Guitar: The rhythm guitar model received less attention than the other models, which were considered more important. It was able to generate some decent accompanying rhythms, but also often played the same as the lead guitar in spots where it did not have to (in metal music, there are many parts where both guitars are playing the same, but not all parts). It also sometimes got stuck in very long rests or holds. Overall, the creativity of the rhythm guitar model and its understanding of how to accompany the lead guitar track should be improved.

3) Bass: Bass guitar generation is considered to be quite good. An important achievement of the model is that it is able to provide proper rhythmic bass lines when both guitars are playing melodies. Earlier iterations either stopped playing during these parts or would try to follow the melody, which they were not supposed to.

4) Drums: The drum generation model is generally able to create believable drums. It plays regular drum patterns and uses its bass drum quite well (which in metal often follows certain guitar notes). It also detects parts where it should stop playing while the lead guitar continues by itself.

Its main shortcomings are that it rarely uses the hi-hat and tom and that its crash cymbal and snare hits are too often at quarter note intervals. This works for many sections on their own, but for a full drum track there should be more variation. Metal music often has parts where these drums should be played at eighth note intervals, but this is not captured very well.

C. Online Survey

The quality of the models was further tested through an adapted Turing test. Instead of asking to identify which examples are generated and which are human-written, participants were asked to score the quality of all examples. This decision was made since the quality of the music was considered to be more important than whether or not people would think that the music was artificially generated. At the same time, if the models are rated as well as the original tracks, it can be assumed that they are in fact able to emulate human intelligence.

The survey presented 9 lead guitar fragments, of which 3 were human-written, and asked participants to rate these fragments on a scale from 1 to 5. For rhythm guitar, bass and drums, participants were presented with fragments of existing songs in two versions. One version had the generated version of the instrument track and the other the original version. The order in which both versions were presented was randomized. This was, however, only done for fragments where the generation differed from the original. 5 fragments were presented for bass and drums, and 3 for rhythm guitar². All fragments can be found at https://www.youtube.com/watch?v=qtd0yIBsx5I&list=PLViwM4NqatHunqsl79u-SPthm0DDO1kC3.

² This instrument received less research, but a test of the current results could still be useful.

109 participants took part in the survey. They generally liked the lead guitar generations, with an example generated by the simple model being the best-rated one. Of the three rhythm guitar fragments, two were considered decent by the participants, showing no overall preference for either the generated or the original version. Original bass fragments were generally preferred over the generated ones. Finally, the drum fragments were considered decent as well, with participants showing a general preference for one generated fragment, a general preference for one original fragment and no overall preference for the remaining three.

VIII. CONCLUSION & FUTURE WORK

This thesis focused on the sequential generation of multi-instrument metal music.

Two lead guitar generation models were introduced first. These models generated notes one by one, by generating their pitch, duration and possibly a set of intervals to form a chord. They both used an LSTM layer followed by one or more fully-connected layers. After this, a general Bi-LSTM model for generating the other instruments was illustrated. This model took already written instrument tracks as its input, in order to write suitable accompaniments. The model itself had to be adapted to deal with certain difficulties when generating drum tracks.

An online survey was created to receive opinions on the quality of the generation for the different instruments. This survey only tested short fragments per instrument, and not fully generated songs. For the rhythm guitar, bass and drums, only fragments where these instruments were playing different notes than the original tracks were tested.

The survey was answered by 109 participants. Its results showed a positive appreciation towards the generated lead guitar and drum fragments. The initial rhythm guitar fragments were also considered decent. Original bass fragments were generally preferred over generated ones, however. Completely generated bass and drum tracks were not surveyed, but personally, such bass tracks were considered to be quite good, while full drum tracks were considered decent, but lacking in variation.


Future work should be focused on improving various aspects of the different instrument models. Lead guitar generation should be improved to have better structure in the generated songs and make use of repetitions. The generation of chords for both lead and rhythm guitar also needs to be improved. The rhythm guitar model should further be improved to not get stuck at long holds or rests and to become more creative. Drum generation should finally be improved to generate faster-paced patterns and make more use of the toms and hi-hat. Eventually the drum set should also be extended to include more drums than the small five-piece set.

General directions for future work include the use of Generative Adversarial Networks [17], which have shown improved performance for sequence generation compared to LSTM networks [18], [19]. Generative Adversarial Networks have been avoided during this thesis, however, because of their long training time.

The online survey also sparked interest among musicians wanting to use the models. Currently the rhythm guitar, bass and drum generation models have very long running times, though. Before making the models available to the public, these generation times need to be reduced.

ACKNOWLEDGMENT

I would like to thank Tim Verbelen, Bert Vankeirsbilck, Cedric De Boom and Bart Dhoedt for their continued support during this project.

REFERENCES

[1] S. Xie, R. Rastogi, and M. Chang, "Deep poetry: Word-level and character-level language models for Shakespearean sonnet generation." [Online]. Available: web.stanford.edu/class/archive/cs/cs224n/cs224n.1174/reports/2762063.pdf

[2] F. Colombo, A. Seeholzer, and W. Gerstner, "Deep artificial composer: A creative neural network model for automated melody generation," in International Conference on Evolutionary and Biologically Inspired Music and Art. Springer, 2017, pp. 81–96.

[3] F. Colombo and W. Gerstner, "BachProp: Learning to compose music in multiple styles," 2018.

[4] S. Hochreiter and J. Schmidhuber, "Long short-term memory," pp. 1735–1780, 1997.

[5] B. Sturm et al., "Music transcription modelling and composition using deep learning," in 1st Conference on Computer Simulation of Musical Creativity, 2016.

[6] D. Eck and J. Schmidhuber, "Finding temporal structure in music: Blues improvisation with LSTM recurrent networks," in Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing. IEEE, 2002, pp. 747–756.

[7] F. Liang, "BachBot: Automatic composition in the style of Bach chorales," 2016. [Online]. Available: www.mlmi.eng.cam.ac.uk/foswiki/pub/Main/ClassOf2016/Feynman Liang 8224771assignsubmission file LiangFeynmanThesis.pdf

[8] G. Hadjeres, F. Pachet, and F. Nielsen, "DeepBach: A steerable model for Bach chorales generation," 2017.

[9] Z. Zukowski and C. Carr, "Generating black metal and math rock: Beyond Bach, Beethoven, and Beatles," 2018.

[10] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. C. Courville, and Y. Bengio, "SampleRNN: An unconditional end-to-end neural audio generation model," 2017.

[11] "Ultimate Guitar," https://www.ultimate-guitar.com/, accessed: 2019-05-30.

[12] "Guitar Pro," https://www.guitar-pro.com/en/index.php, accessed: 2019-05-30.

[13] "vishnubob/python-midi," https://github.com/vishnubob/python-midi, accessed: 2019-05-30.

[14] C. De Boom, T. Demeester, and B. Dhoedt, "Character-level recurrent neural networks in practice: Comparing training and sampling schemes," 2018.

[15] "Keras," https://keras.io/, accessed: 2019-05-30.

[16] "TensorFlow," https://www.tensorflow.org/, accessed: 2019-05-30.

[17] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley et al., "Generative adversarial nets," 2014.

[18] H. Dong, W. Hsiao, L. Yang, and Y. Yang, "MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment," 2017.

[19] L.-C. Yang, S.-Y. Chou, and Y.-H. Yang, "MidiNet: A convolutional generative adversarial network for symbolic-domain music generation," in Proceedings of the 18th International Society for Music Information Retrieval Conference, 2017.


Table of Contents

Permission of Use i

Acknowledgements ii

Abstract iii

Extended Abstract iv

Table of Contents xi

List of Figures xv

List of Tables xvii

List of Abbreviations & Symbols xix

1 Introduction 1

2 Music Theory 3
  2.1 A Music Score 3
    2.1.1 Notes, Rests & Chords 4
    2.1.2 Tempo 5
    2.1.3 Dots & Triplets 5
    2.1.4 Time Signature 6
    2.1.5 Tie 7
    2.1.6 Clef 7
    2.1.7 Accidentals & Key Signature 8
    2.1.8 Dynamics 9
  2.2 Tablatures 9
    2.2.1 Key Signatures and Tablatures 11
  2.3 Guitar Techniques 11
    2.3.1 Bends & Whammy Bar 11
    2.3.2 Palm Mutes 12
  2.4 Scales 13
  2.5 Drum Scores 13
  2.6 Metal Music 15
    2.6.1 Song Structure 15
    2.6.2 Guitar Roles 16
    2.6.3 Scales 18

3 State of the Art 20
  3.1 Output Formats 20
    3.1.1 Audio Generation 20
    3.1.2 Symbolic Generation 22
  3.2 Electronic Symbolic Formats 23
    3.2.1 Music Instrument Digital Interface (MIDI) 23
    3.2.2 Piano Roll 24
  3.3 Further Scope Options 25
    3.3.1 Monophonic or Polyphonic 26
    3.3.2 Single-Instrument or Multi-Instrument 26
    3.3.3 Temporal Scopes 26
  3.4 Data 27
    3.4.1 Data Sets 27
    3.4.2 Transposition 28
    3.4.3 Preprocessing: Hold Symbols 28
  3.5 Machine Learning Models for Music Generation 28
    3.5.1 Feed-Forward Neural Networks 29
    3.5.2 Stacked Autoencoders 29
    3.5.3 Recurrent Neural Networks 30
    3.5.4 Generative Adversarial Networks 32
  3.6 Evaluation Methods 32
  3.7 Metal Music 33

4 Data Set Creation 34
  4.1 A New Concept: The Notum 34
    4.1.1 Notum Limitations 36
    4.1.2 Alternative Rest Symbol 37
    4.1.3 Hold Notum 38
  4.2 The Extended Notum: The Enotum 38
  4.3 Data Gathering 39
  4.4 MIDI-to-Notum Conversion 41
  4.5 Manual Improvements 42
    4.5.1 Outlying Notum Durations 42
    4.5.2 Outlying Pitches 45
  4.6 Preprocessing Pipelines 49
    4.6.1 Cleaning Stage 49
    4.6.2 Conversion to Enotums 50
    4.6.3 Conversion to Time Steps 52
    4.6.4 Conversion to Integer & One-Hot 52

5 An Approach For Multi-Instrument Music Generation 54
  5.1 Lead Guitar 55
    5.1.1 Start (E)Notums 56
    5.1.2 Kickstarting 56
    5.1.3 Model Architecture 57
    5.1.4 Generation 59
  5.2 Bass 62
    5.2.1 Model Architecture 64
    5.2.2 Generation 65
  5.3 Rhythm Guitar 65
  5.4 Drums 66

6 Implementation 72
  6.1 Shared Details of the Models 72
  6.2 Lead Guitar 74
    6.2.1 Simple Baseline Model 74
    6.2.2 Advanced Model 80
  6.3 Bass 82
    6.3.1 Similarity 82
    6.3.2 Proposed Bass Network 83
  6.4 Rhythm Guitar 85
  6.5 Drums 86

7 Evaluation 90
  7.1 Participants 91
  7.2 Lead Guitar 92
    7.2.1 Results 93
  7.3 Other Instruments 94
  7.4 Bass 95
    7.4.1 Results 95
  7.5 Rhythm Guitar 97
    7.5.1 Results 97
  7.6 Drums 98
    7.6.1 Results 98

8 Conclusion 100
  8.1 Future Work 101

Bibliography 102



List of Figures

2.1 Music score and tablature for the first seven bars of ‘Master of Puppets’ by Metallica. The instrument being represented is the lead guitar. Highlighted elements are explained in Section 2.1. 3
2.2 Figures taken from [8] 5
2.3 A dotted eighth note and a triplet of eighth notes. 6
2.4 An overview of different dynamic markings and their relationships. Taken from [11]. 9
2.5 12
2.6 A whammy or tremolo bar on a guitar. Taken from [12] 12
2.7 An example drum score. 14
2.8 A key to link drum score elements to drums. Taken from Guitar Pro 7.5 software [13]. 14
2.9 A different way of writing the drum score in Figure 2.7. Both drum scores sound the same when played. 15
2.10 Example of the breakdown in ‘One’ by Metallica 18
3.1 Stereo waveform of the first seven bars of Master of Puppets by Metallica. 21
3.2 An automated piano and a piano roll. Picture taken from [16]. 25
3.3 A piano roll representation for the first seven bars of Master of Puppets by Metallica. 26
3.4 Bi-LSTM architecture used by DeepBach [4]. 31
4.1 Music score and tablature for the first seven bars of ‘Master of Puppets’ by Metallica. The instrument being represented is the lead guitar. 35
4.2 36
4.3 Occurrence numbers for the different notum durations in the original data set. Notum durations longer than a whole note have been removed here. For clarity, occurrence numbers have been capped at 10,000. 43
4.4 Example of a grace note in bar 2. 44
4.5 48
4.6 Drum pitch occurrences after maximally reducing the drum set. 49
4.7 Guitar pitch occurrences after removing rare pitches. 50
4.8 Bass pitch occurrences after removing rare pitches. 50
5.1 Model architecture for lead guitar generation. 59
5.2 Simplified baseline model architecture for lead guitar generation. 60
5.3 The bass generation model architecture. 65
5.4 The rhythm guitar generation model architecture. 66
5.5 The drum generation model architecture. 67
5.6 Occurrence numbers for the differentiated rest values in the drum data set. 70
6.1 Total training and validation loss curves for the simple model. 74
6.2 Pitch and duration training and validation loss curves for the simple model. 75
6.3 Pitch and duration training and validation accuracy curves for the simple model. 75
6.4 Verse riff generated by original simple model with pitch temperature 1.0 and duration temperature 0.5. 78
6.5 Melody going over in a small breakdown generated by original simple model with pitch temperature 1.0 and duration temperature 0.5. 79
6.6 Total training and validation loss curves for the advanced model. 80
6.7 Pitch, duration and modifier training and validation loss curves for the advanced model. 81
6.8 Pitch, duration and modifier training and validation accuracy curves for the advanced model. 81
6.9 84
6.10 Training and validation loss curves for the proposed drum model. 87
6.11 Individual training and validation loss curves for the proposed drum model. 87
6.12 Individual training and validation accuracy curves for the proposed drum model. 88
6.13 Generated drum tab for bars 22-33 (minutes 0:32-0:50) for the song ‘94 Hours’ by As I Lay Dying. 89


List of Tables

3.1 Common data sets and the number of songs they contain. . . . . 27

4.1 Mean, standard deviation, minimum and maximum number ofnotums per instrument for the songs in the data set . . . . . . . 42

4.2 Significant notum durations that are no multiple of 20 ticks andtheir corresponding fractional durations. . . . . . . . . . . . . . 45

4.3 The remaining notum durations in MIDI ticks and in their cor-responding fractional notations. Dotted note durations are in-dicated with a dot next to them and triplet note durations areindicated with a t next to them. Some note durations are theresult of ties, which are written using a plus sign. . . . . . . . . 46

4.4 An overview of the drum set that was kept after removing outliers. 474.5 Overview of the modifier reduction operations and their respec-

tive numbers of different modifiers per guitar. . . . . . . . . . . 514.6 Mean, standard deviation, minimum and maximum number of

time steps for the songs in the data set . . . . . . . . . . . . . . 52

5.1 Time signatures that are kept in the drum data set and thenumber of songs they occur in. . . . . . . . . . . . . . . . . . . . 69

6.1 Summary of total losses for different configurations of the simplebaseline model. The values that are different from the originalmodel are put in bold. . . . . . . . . . . . . . . . . . . . . . . . 76

6.2 Summary of total losses for different configurations of the simplebaseline model with modifiers. The values that are differentfrom the original model are put in bold. . . . . . . . . . . . . . 77

6.3 Similarities between generated and original bass tracks for fivedifferent songs . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

xvii

Page 20: Composition of Multi-instrument Music with Deep Learning€¦ · Composition of Multi-instrument Music with Deep Academic year 2018-2019 Master of Science in Computer Science Engineering

7.1 Number and percentage of participants that were non-musicians,musicians that not write music and musicians that also writemusic. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

7.2 Generated instruments that were played and number of partic-ipants that played them. . . . . . . . . . . . . . . . . . . . . . . 92

7.3 Presented guitar examples. . . . . . . . . . . . . . . . . . . . . 937.4 Lead guitar ratings for each example. Existing fragments have

their number in italics. The highest percentage is indicated inbold for each fragment. . . . . . . . . . . . . . . . . . . . . . . 93

7.5 Presented bass examples and which example is the generated one. 967.6 Bass ratings for each example. 1 indicates preference for the

original and 5 indicates preference for the generated version.The highest percentage is indicated in bold for each fragment.Also indicated is how many percent of people recognized thesong. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

7.7 Bass rating percentages for example 3, split into whether peopledid or did not recognize the song. . . . . . . . . . . . . . . . . . 96

7.8 Presented rhythm guitar examples and which example is thegenerated one. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

7.9 Rhythm guitar ratings for each example. 1 indicates preferencefor the original and 5 indicates preference for the generated ver-sion. The highest percentage is indicated in bold for each frag-ment. Also indicated is how many percent of people recognizedthe song. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

7.10 Presented drum examples and which example is the generatedone. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

7.11 Drum ratings for each example. 1 indicates preference for theoriginal and 5 indicates preference for the generated version.The highest percentage is indicated in bold for each fragment.Also indicated is how many percent of people recognized thesong. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99


List of Abbreviations & Symbols

Bi-LSTM Bidirectional LSTM

GAN Generative Adversarial Network

LSTM Long Short-Term Memory

MIDI Musical Instrument Digital Interface

♭ Flat

♮ Natural

♯ Sharp


Chapter 1

Introduction

The capability to create art has long been a difference between human minds and the processors of computers. As computers evolve and as research on artificial intelligence increases, computers are becoming more and more capable of creating art. Xie et al. have created a model to generate poetry [1], while Gatys et al. were able to convert camera pictures to different painters' styles [2].

The generation of music is also being researched extensively, both in the form of direct audio, as done by WaveNet [3], and in the form of sheet music, for example by DeepBach [4]. The latter approach, called symbolic generation, has seen the most success. Models have been created that can generate single-instrument tracks [5] as well as music that consists of multiple instruments [6]. Most of this research has, however, focused on the generation of classical and pop music, but not on metal music.

Recent works such as [7] have created models that generate metal in direct audio form. These models, however, do not generate sheet music and therefore do not directly reveal which notes are being played. Moreover, they have focused on the rather chaotic death metal subgenre. In this thesis, the focus lies on the thrash metal and metalcore subgenres, which exhibit more melodic and structured parts.

Metal music exhibits some major differences compared to pop and classical music, and therefore does not allow the use of many existing models. One particular difference is the order in which instrument tracks are written. While the writing process for pop music can start with the rhythm section, the writing process for metal starts with the generation of riffs and melodies. This leads


to a different order of generating instruments, starting with the lead guitar track and sequentially adding the rhythm guitar, bass and drum tracks.

This sequential multi-instrument generation approach is researched in this thesis. Furthermore, various simplifications that have been made by existing models are dropped to deal with the specificities of metal music. Dropping these simplifications should lead to models that are more flexible in the range of genres they can be applied to.

Each instrument will need its own model. First, a model based on LSTM networks will be created to generate lead guitar tracks from scratch. After this, a general model with a Bi-LSTM structure will be created to generate supplemental instruments. This model will then be adapted to achieve useful models for each of the remaining instruments.

The goal of this thesis is to explore and deal with the major difficulties that arise when generating instruments in this order, while omitting some commonly made simplifications. Preference will be given to achieving decent models per instrument, instead of optimizing a model for a single instrument. The resulting models will be evaluated via objective measures such as validation loss and similarity, via personal subjective analysis, and through a survey that tests the perceived quality of generated fragments.

Before continuing, a basic introduction to music theory and the properties of metal music is given in Chapter 2. This is followed by an overview of the state of the art of music prediction in Chapter 3. Chapter 4 then explains the various steps that are taken to obtain, improve and preprocess a data set of multi-instrument metal songs. After explaining the creation of the data set, Chapter 5 will elaborate on the approach that is taken and discuss the models that will be used, while Chapter 6 will discuss the implementation and objective and subjective analysis of these models. The survey and its results are then presented in Chapter 7, before finishing with a conclusion in Chapter 8.


Chapter 2

Music Theory

In order to have a meaningful discussion about music generation, an introduction to some basic music theory is provided in this chapter. Readers who already have some musical knowledge can skip most of this introduction, although they are advised to read Section 2.4. Here, a representation of scales is presented that differs from standard music theory. New information can also be encountered in Section 2.5, which is dedicated to drum scores.

2.1 A Music Score

Figure 2.1: Music score and tablature for the first seven bars of 'Master of Puppets' by Metallica. The instrument being represented is the lead guitar. Highlighted elements are explained in Section 2.1.


Figure 2.1 shows a music score and tablature for the song 'Master of Puppets' by Metallica. Each of the highlighted elements in the music score will now be explained.

2.1.1 Notes, Rests & Chords

The most important elements of a music score are notes, rests and chords. They are placed on a staff, which consists of five parallel lines. In Figure 2.1 an example of each of them is highlighted in red.

A single note is shown on the second staff. It holds two kinds of information.

First, its height on the staff indicates the pitch of the note. The note can be placed on a line of the staff or in between two lines. If a note lies higher on the staff, its pitch will be higher as well. If the note becomes too high or too low to fit on the five-line staff, extra lines are drawn. The term pitch is often used interchangeably with the term note when not talking about a specific note on a score.

The second piece of information represented by a note on a score is its duration. The note's duration is expressed through the note's symbol. Most durations are expressed as binary fractions. An overview of duration names and corresponding symbols is given in Figure 2.2a. For note symbols with a flag (such as the eighth, sixteenth and thirty-second note symbols), the flags can be connected between consecutive notes, as can be seen in the second staff of Figure 2.1.

A rest indicates that no note is played and thus there is a pause between notes. Rests obviously do not have a pitch, but they do have durations and corresponding symbols. Figure 2.2b gives an overview of the symbols that are used to indicate different rest durations.

Chords are combinations of notes being played together. They are represented as multiple notes that are at the same horizontal position on the staff. The score of Master of Puppets begins with a chord.

The next four subsections deal further with note durations. After these, two subsections will discuss elements related to pitches and finally, Section 2.1.8 will discuss dynamics.


(a) Different note durations and their corresponding symbols.

(b) Different rest durations and their corresponding symbols.

Figure 2.2: Figures taken from [8]

2.1.2 Tempo

Highlighted in purple in Figure 2.1 is the tempo of the song. This value is used to relate the fractional durations to durations in seconds or milliseconds. The tempo is given as a number of beats per minute (bpm), where one beat corresponds to a quarter note.

In Figure 2.1, the tempo is 212 bpm. This means that a quarter note has a duration of:

60 seconds / 212 bpm ≈ 283.0 milliseconds

The tempo can change throughout a song.
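As a small illustration of this conversion (added for this text, not part of the original; the function name and layout are just for demonstration), the relation between tempo and note durations can be computed as follows:

```python
def note_duration_ms(tempo_bpm, fraction_of_whole_note):
    """Duration in milliseconds of a note at the given tempo.

    One beat corresponds to a quarter note, so one beat lasts
    60000 / tempo_bpm milliseconds; a whole note spans four beats.
    """
    beats = 4 * fraction_of_whole_note   # e.g. a quarter note (1/4) is 1 beat
    return beats * 60000.0 / tempo_bpm

print(note_duration_ms(212, 1 / 4))  # quarter note at 212 bpm: ~283.0 ms
```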

2.1.3 Dots & Triplets

Dots and triplets offer more duration options besides the binary fractions that have already been presented. Figure 2.3 first shows a dotted eighth note and then a triplet of eighth notes.

A dot is used to indicate that a note’s duration is multiplied by 1.5.


Triplets offer a variation to the powers of 2 in the denominators of note durations. They instead have products of 3 and a power of 2 in their denominator. This would thus lead to third notes, sixth notes, twelfth notes, and so on. However, these durations are never referred to by those names. Instead, sixth notes, for example, are called triplet quarter notes.

This is because the idea of triplets is to divide the total duration of two notes with equal durations evenly over three notes. A triplet of quarter notes is therefore called this way, since it divides the duration of two quarter notes over three notes.

The availability of triplet notes offers many more possible note durations.
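To make the resulting durations concrete, the sketch below (an illustration added for this text, not taken from the thesis) lists the fractions produced by dots and triplets:

```python
from fractions import Fraction

# Base binary note durations, as fractions of a whole note.
base = [Fraction(1, d) for d in (1, 2, 4, 8, 16, 32)]

# A dot multiplies a duration by 3/2; a triplet note lasts 2/3 of its base
# note, so that three triplet notes fill the time of two regular ones.
dotted = [f * Fraction(3, 2) for f in base]
triplet = [f * Fraction(2, 3) for f in base]

print(dotted[3])   # 3/16: a dotted eighth note
print(triplet[2])  # 1/6: a triplet quarter note is a 'sixth note'
```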

Figure 2.3: A dotted eighth note and a triplet of eighth notes.

2.1.4 Time Signature

The time signature is highlighted in yellow in Figure 2.1 and indicates the structure of bars or measures. In a music score, bars are separated by vertical lines on the staff.

Bars make a score more readable, and also simplify the writing process. Sections in a song always span a whole number of bars, with a new section starting on a new bar. The first note of a bar is also regularly accentuated.

Bars with the same time signature have the same duration. A time signature is written as a fraction. The denominator of the time signature refers to a note duration with the same denominator and 1 as the numerator. The numerator of the time signature then indicates the number of times this note duration fits in one bar.

As an example, in Figure 2.1 the time signature is 4/4. This means that each bar has the length of 4 quarter notes. This is the most common time signature. Another common time signature is 3/4.
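A small illustrative sketch (added here for clarity, not part of the original text) of how a time signature translates into a bar length:

```python
from fractions import Fraction

def bar_length(numerator, denominator):
    """Length of one bar as a fraction of a whole note.

    The denominator names a base note duration (1/denominator of a whole
    note) and the numerator says how many of those fit in one bar."""
    return numerator * Fraction(1, denominator)

print(bar_length(4, 4))                      # 1: four quarter notes per bar
print(bar_length(3, 4))                      # 3/4: three quarter notes per bar
print(bar_length(6, 8) == bar_length(3, 4))  # True: equivalent fractions
```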

An alternative name for time signature is meter. Actually, the meter is the structure of the bars, whereas a 'time signature' is the specific indication of


the meter. Both terms can be used quite interchangeably, however, since their difference is only subtle.

Bars, and therefore time signatures as well, are in essence only a structural element. Thus, removing bars from a score would still yield the same notes being played with the same durations. However, the use of bars and different time signatures leads to different playing styles, due to different accentuation while playing. Drum beats, especially, are usually heavily based on the time signature. This even leads to differences between equivalent fractions such as 3/4 and 6/8. In the remainder of this thesis, however, such equivalent fractions will be considered equal.

2.1.5 Tie

A tie is used to create even more note duration options. It is indicated by using curved lines between consecutive notes that have the same pitch. The tie indicates that the notes should be played only once, with a duration that is the sum of the durations of the tied notes.

An example is highlighted in cyan in Figure 2.1. (The tablature will be discussed in Section 2.2, but a tie is represented there by putting tablature elements between brackets, as can also be seen in Figure 2.1.)

Ties are also used to keep a note ringing across bar lines. This also happens in Figure 2.1. Rests are not tied, since multiple rests after each other are always the same as one combined rest.

In the remainder of this thesis, ties will often be referred to as holds.

2.1.6 Clef

The clef, highlighted in blue in Figure 2.1, relates to the pitch of the notes. In this example a G-clef is shown, which indicates that the second lowest line of the staff corresponds to the note G. From this reference point, the other pitches on the staff can be inferred. Seven letters are used to denote pitches: A, B, C, D, E, F and G. Moving one space higher on the staff then corresponds to going to the next letter in a circular way (so after G comes A again).

Starting from a letter, e.g. A, and moving up the staff until reaching an A again is called moving up an octave. The second A is said to be one octave


higher than the first A. In physics, each note corresponds to a frequency, and by moving up an octave, the frequency of the first note is simply doubled. This is why pitches with the same letter still sound very similar.

The actual pitch can therefore be written down as a combination of a letter and a number, e.g. A4. The number then increases with each octave. The G-clef more specifically indicates that the second lowest line of the staff corresponds to G4.

2.1.7 Accidentals & Key Signature

Section 2.1.6 mentioned seven letters that are used to denote pitches. However, there are actually twelve different pitches (not considering octaves). These are: A, A♯ or B♭, B, C, C♯ or D♭, D, D♯ or E♭, E, F, F♯ or G♭, G, G♯ or A♭. Here, the ♯ symbol is called sharp and the ♭ symbol is called flat. The notes with the sharp or flat symbol have pitches that fall in between the other notes. In this order, each note is said to be one semitone higher than the previous one. A difference of two semitones is called a whole tone. Semitones and whole tones are examples of intervals, with an interval being a difference between two notes. A semitone can be called an interval of 1 and a whole tone can be called an interval of 2.

The lines of a staff only correspond to pitches without sharps or flats. Sharp or flat notes are then indicated by putting the ♯ or ♭ symbol next to them in the score. This means that the corresponding note should be considered as the sharp or flat version for the remainder of the current bar. A natural symbol ♮ is used to revert this later in the same bar. The ♯, ♭ and ♮ symbols are called accidentals.

To avoid the use of many accidentals in the score, a key signature is written next to the clef. For each pitch letter, the key signature indicates whether it should be considered unchanged, sharp or flat until a new key signature is given. Accidentals then only need to be written when a deviation from the key signature occurs. The key signature itself is written as a combination of ♯ or ♭ accidentals, but it is placed next to the clef instead of next to notes. Each accidental in the key signature affects all octaves of that pitch letter. This is in contrast to accidentals written next to notes, which only affect that specific note and not its other octaves. The key signature that leads to the lowest number of accidentals next to notes in the score should usually be selected.


The default key signature is called C major or A minor and includes no sharps or flats. In Figure 2.1 both the key signature and some accidentals are highlighted in green. The key signature is G major or E minor and indicates that each F should be considered as an F♯. The ♯ accidentals that are highlighted next to a chord indicate that for the remainder of bar 2, the notes C4, G4 and C5 should be considered as C♯4, G♯4 and C♯5. The ♮ accidentals next to the following chord revert this, thus indicating that C♯4, G♯4 and C♯5 should be played as C4, G4 and C5 again.

2.1.8 Dynamics

To conclude this section, dynamics are discussed. Dynamic markings indicate variations in velocity of the notes that are being played. These variations are relative and therefore do not correspond to actual decibel or velocity levels [9].

Figure 2.4 presents an overview of the different dynamic markings and how they relate to each other. The numbers in the Velocity column are values used by recording software like Logic Pro [10]. These values are not used in an absolute manner, but their differences define the relative differences between the dynamic markings.

In Figure 2.1 a dynamic marking is highlighted in orange.

Figure 2.4: An overview of different dynamic markings and their relationships. Taken from [11].

2.2 Tablatures

A tablature is a different representation of a music score aimed at guitar and bass players. It usually consists of six lines for guitar and four lines for bass,


with each line corresponding to a string.¹ The bottom line corresponds to the lowest-sounding string and the upper line to the highest-sounding one. Strings can also be numbered, with the upper line always corresponding to the first string and the bottom line corresponding to the sixth string in this example.

Notes and chords to be played are listed sequentially, just like on a music score. On a tablature, the notes to be played are indicated by numbers that correspond to the frets of a guitar. When a number is placed on a line, it means that the guitarist has to strum the corresponding string, while holding a finger at the fret that corresponds to the number. When a 0 is written, this means that the corresponding string should be strummed without holding down the string at a fret. This is also called "strumming an open string". If no number is present, this means that the string should not be strummed.

The first chord in the tablature of Figure 2.1 indicates that the guitar player should strum the third, fourth and fifth string, while holding fingers at the 9th fret of the third and fourth string and at the 7th fret of the fifth string.

The tablature does not indicate the timing of the notes. This information can, however, easily be read from the music score. In fact, one of the main purposes of tablatures is to spare guitar players from having to read music scores. The biggest difficulty with reading music scores is reading the pitches to be played. Reading the rhythm is much easier, which makes the combined use of a tablature for pitch information and a music score for rhythm information very useful.

There is not a one-to-one mapping between tablatures and music scores. Guitars are usually tuned so that the note at the fifth fret of one string is equal to the note of the next higher open string. This means that many notes can be played at multiple positions on the guitar, each on a different string and at a different fret. Tablatures that are written differently but correspond to the same notes will thus always result in the same corresponding music score. There is thus a direct mapping of tablatures to music scores, but not the other way around.²

¹ There also exist guitars with more than six strings and bass guitars with more than four strings. Tablatures can be adapted to this by adding more lines.

² Actually, twelve different music scores are possible, each using a different key signature. The statement made should thus be more nuanced. There is a direct mapping of tablatures to a set of twelve music scores that each use a different key signature, but not the other way around. A single music score, or a set of twelve music scores that each use a different key signature, can lead to an exponential number of corresponding tablatures.


Tablatures are written to indicate the positions at which the sequence of notes is easiest to play (or at least the positions for which the writer thinks they are easiest to play). This is an added benefit of tablatures compared to music scores, which do not hold any positional information. Since tablatures represent the strings of a guitar, they also depend on the tuning of the guitar. If the strings are tuned differently, the tablature needs to be updated in order to keep the same corresponding notes and music score. The tablature can also be kept as is, which then changes the played notes and music score.
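The mapping from tablature positions to pitches can be sketched as follows. This is an illustration added for this text; it assumes standard tuning (E2 A2 D3 G3 B3 E4), while metal guitars are often tuned lower, which would only change the tuning list.

```python
# The twelve pitch classes in order, starting from C (see Section 2.1.7).
PITCHES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def fret_to_pitch(open_pitch, open_octave, fret):
    """Pitch sounded when a string tuned to (open_pitch, open_octave) is held
    at `fret`: every fret raises the pitch by one semitone, and the octave
    number increases when passing from B to C."""
    index = PITCHES.index(open_pitch) + fret
    return PITCHES[index % 12], open_octave + index // 12

# Standard tuning from the lowest (sixth) to the highest (first) string.
TUNING = [('E', 2), ('A', 2), ('D', 3), ('G', 3), ('B', 3), ('E', 4)]

# The chord described above: 7th fret on the fifth string and 9th fret on
# the fourth and third strings (TUNING indices 1, 2 and 3).
print([fret_to_pitch(p, o, f) for (p, o), f in zip(TUNING[1:4], (7, 9, 9))])
# [('E', 3), ('B', 3), ('E', 4)]
```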

Since this thesis focuses on guitar music, music scores that are presented will always be accompanied by a tablature.

2.2.1 Key Signatures and Tablatures

Section 2.1.7 explained that a key signature is indicated on a staff to reduce the number of accidentals that need to be written in the music score. Tablatures do not have a key signature indication, since they only include finger positions and not pitches. This has an important consequence for the corresponding music score.

Since guitarists usually write tablatures, the default key signature (C major or A minor) is often used for the corresponding music score. This can potentially lead to many accidentals in the score. Thus, for those scores the key signature information is often useless.

2.3 Guitar Techniques

There are too many guitar techniques to list them all here. The following two subsections briefly describe some important techniques that will be mentioned further on in this thesis.

2.3.1 Bends & Whammy Bar

On guitars and bass guitars (and in fact most stringed instruments), the strings can be bent by pushing them up or pulling them down. This results in a gradual change in pitch. Actually, the underlying frequency of the note gradually changes, which results in pitches between the standard twelve also being


reached. An example of a whole-tone bend is shown in Figure 2.5a. 'Whole-tone' means that the resulting pitch at the end of the bend is a whole tone higher than the original pitch.

Some guitars have a whammy bar (also called tremolo bar), shown in Figure 2.6. By pulling or pushing on this bar, the pitch can be gradually increased or decreased, respectively. This is also called an upwards or downwards dive. An example of a whole-tone whammy dive downwards is shown in Figure 2.5b.

(a) A whole-tone bend. (b) A whole-tone whammy dive downwards.

Figure 2.5

Figure 2.6: A whammy or tremolo bar on a guitar. Taken from [12]

2.3.2 Palm Mutes

On guitars, notes can be palm muted by resting the palm of the strumming hand slightly on the end of the strings. This alters the sound characteristics of the notes that are being played. The alternation between palm-muted and


non-palm-muted notes is often used when writing guitar music (especially in metal music).

Palm mutes are indicated by putting 'P.M.' between the music score and tablature for the notes that are palm-muted. When consecutive notes are palm-muted, the P.M. is followed by a dashed line for these notes. An example can be seen in Figure 2.1.

2.4 Scales

Music is usually written using notes from a scale. A scale is an ordering of intervals that sum to 12. An example scale is the minor scale, which has the following list of intervals: 2 - 1 - 2 - 2 - 1 - 2 - 2. A scale can be linked to a root note, which is the note to start the scale from. For example, the A minor scale is constructed by starting from an A and including the notes that one encounters by adding the intervals one by one. This leads to the A minor scale consisting of the following pitches: A, B, C, D, E, F, G and then returning to A again.

A useful way of representing scales is by listing the cumulative intervals, i.e. the intervals from the root note to each other note in the scale. The root note itself is also included, and has intervals 0 and 12 (for the octave). For the minor scale, this leads to: 0 - 2 - 3 - 5 - 7 - 8 - 10 - 12.
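The two representations can be related with a short sketch (added here as an illustration; the helper names are arbitrary):

```python
# The minor scale as successive intervals (in semitones).
MINOR_INTERVALS = [2, 1, 2, 2, 1, 2, 2]

PITCHES = ['A', 'A#', 'B', 'C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#']

def cumulative(intervals):
    """Cumulative intervals from the root, including 0 (root) and 12 (octave)."""
    result = [0]
    for step in intervals:
        result.append(result[-1] + step)
    return result

def scale_pitches(root, intervals):
    """Pitch classes of the scale built on `root` (octaves are ignored)."""
    start = PITCHES.index(root)
    return [PITCHES[(start + c) % 12] for c in cumulative(intervals)[:-1]]

print(cumulative(MINOR_INTERVALS))          # [0, 2, 3, 5, 7, 8, 10, 12]
print(scale_pitches('A', MINOR_INTERVALS))  # ['A', 'B', 'C', 'D', 'E', 'F', 'G']
```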

The power of scales is that their notes work well together. Therefore, many songs will only use notes that fall within a specific scale. Of course, it is still allowed to use notes that do not fall within the scale. These notes are called out-of-scale.

A song is mostly determined by its sequence of intervals. It is only when a root note is selected that these intervals lead to specific pitches. While some root notes may lead to better sounding music, changing the root note does not influence the quality as much as changing intervals. Changing the root note of a song is called transposing.

2.5 Drum Scores

Drums are very different from guitars. They produce percussive sounds that do not have a pitch. This means that a regular music score cannot be used


to describe a drum track. Instead, a drum score is used, of which an example is offered in Figure 2.7. A drum score uses a different clef and uses multiple symbols to indicate the drums to be hit. These symbols can also differ to indicate the way to hit the drum, e.g. a regular snare hit uses a different symbol than a rim shot. A key to the most common drum symbols is given in Figure 2.8.

Figure 2.7: An example drum score.

Figure 2.8: A key to link drum score elements to drums. Taken from the Guitar Pro 7.5 software [13].

Durations in drum scores are ambiguous. Cymbals ring out, while other percussive elements only sound for a short period. It is not possible, for example, to vary the duration of a snare hit. Each drum just sounds and rings out at the moment when it is hit.

Durations therefore do not indicate how long a drum sounds; instead, they indicate the time until the next hit. A quarter note thus means: hit the indicated drums and wait for a quarter note before hitting the next drums. The word 'wait' is closely linked to the concept of rests and in fact, Figure 2.7 can be rewritten using only sixteenth note durations, resulting in Figure 2.9. Both these drum scores sound the same when played.³

³ One might think that Figure 2.9 indicates that the cymbal hits should only ring out for a sixteenth note and then be choked, i.e. muted by putting fingers on the cymbal. This is not true, however. Chokes are indicated using a staccato symbol, which is a dot above or below the score. The use of staccato symbols will be ignored in this thesis, however.


Figure 2.9: A different way of writing the drum score in Figure 2.7. Both drum scores sound the same when played.

2.6 Metal Music

Metal music exhibits some major differences from pop music. These differences lead to challenges that make some pop music generation models unusable for metal music.

2.6.1 Song Structure

Most pop music songs can be divided into different sections. The most common ones are intro, verse, chorus, bridge, outro and sometimes a solo. These sections are usually structured in straightforward ways, with the verse and the chorus sections being repeated multiple times. An example structure is: intro, verse, chorus, verse, chorus, bridge, chorus, outro.

In metal music, these structures can appear as well, but oftentimes songs exhibit less structure. There can be many sections that are not repeated later in the song, but that are also not classified as intro, bridge, outro or solo. Usually, these sections are then considered as lettered verse sections. Verse A is different from verse B, which is also different from verse C. These lettered verse sections can still be repeated, but they do not have to be.

Songs can also include short "filler" parts. These parts are not sections themselves, but they form the link between two sections. They are often only between 1 and 4 bars in length.

Additionally, metal music introduces a new section: the breakdown. This section is typified by guitars playing the same note in a rhythmic pattern. A few additional notes can be included in between. An example of one of the first breakdowns is in the song 'One' by Metallica at 4:36. Breakdowns are heavily used in the metalcore subgenre, which will be part of the focus in this thesis.



An example song structure can then be: intro, verse A, filler part, verse B, chorus, breakdown, verse A, verse C, chorus, bridge, solo, verse D, outro.

2.6.2 Guitar Roles

Guitar Roles in Regular Genres

In jazz, blues, pop and much rock music, guitars are typically separated into a rhythm and a lead guitar. In these genres, the rhythm guitar mainly plays chords, often following a standard chord scheme that is repeated throughout the song. A famous example is the blues scheme [14].

While the rhythm guitar is playing chords, the lead guitar plays a melody on top of them, which can also be called a lead. A direct consequence of this is that writing and generating music starts with the rhythm guitar and is followed by the lead guitar.

Drums and bass can be added after both guitars are written, but can also be added when only the rhythm guitar has been defined.

Guitar Roles in Metal Genres

In metal music, chord schemes are generally not used. Instead of a rhythm guitar that only plays chords and a lead guitar that only plays leads, both guitars mostly play riffs. Riffs can be considered as something in between chord schemes and leads. They include more melodic elements than standardized chord schemes; however, they draw less "forefront attention" than leads.

Generally, the rhythm guitar will mainly play riffs, while the lead guitar plays riffs and melodies. It also happens that both guitars play in harmony. This means that both guitars play essentially the same riff or melody, but one of the guitars replaces some or all notes by notes that lie a bit higher in the scale. A very common way of harmonizing is to replace notes by the ones that lie two steps higher in the scale.

In a harmony, not all notes need to be the same number of steps higher in the scale. In metal music, harmonies are often added during a repetition of a riff. In this thesis it will be assumed that these harmonies are always played by the rhythm guitar.


During solos or melodies, the rhythm guitar often continues playing the previous riff or already starts playing the riff that the lead guitar is going to play after the solo or melody. It can also happen that the rhythm guitar plays riffs that are never played by the lead guitar.

The following list summarizes some of the most common relations between both guitars.

• Both guitars play the same riff.

• The lead guitar plays a riff and the rhythm guitar plays chords over this riff.

• Both guitars play a riff or a melody, where the rhythm guitar is playing a harmony of what the lead guitar is playing.

• The rhythm guitar plays a riff or chords, while the lead guitar plays a melody or solo.

The riff-driven basis of metal songs and the lack of chord schemes have serious consequences for the order in which instruments are written. The writing process generally starts with writing guitar riffs or melodies. Sometimes chords or riffs are added to melodies in a second guitar; at other times, melodies are added to chords or riffs. Of course, in order to add a harmony, the underlying riff or melody first has to be written.

When the rhythm guitar plays chords, they are usually added on top of a riff or melody played by the lead guitar. These chords are then based on the riff or melody that is being played by the lead guitar, not the other way around. (This is not to say that it never happens that chords are written before the melody or riff that goes over them.)

Drums and bass are generally only added when the guitars have already been written. The bass guitar usually follows the rhythm guitar and closely matches it. Of course, there are also parts where the bass plays something different. Also, when both guitars play a melody, the bass still needs to provide an accompanying rhythm section.

Drums have their basic beats, but can also closely follow the guitars. The bass drum hits, especially, are often highly correlated with the guitar notes. During breakdowns, for example, the bass drum is usually hit with every guitar note. An example of the guitar and drums during the breakdown in 'One' by Metallica is shown in Figure 2.10.


(a) Guitar

(b) Drums

Figure 2.10: Example of the breakdown in ‘One’ by Metallica

The aforementioned differences lead to a writing approach that starts with writing a lead guitar, which is then followed by writing the rhythm guitar. This assumes the rhythm guitar plays chords and harmonies. Actually, a better naming for the guitars would be first and second guitar. In the remainder of this thesis, the terms 'lead guitar' and 'first guitar' will be used interchangeably, as well as the terms 'rhythm guitar' and 'second guitar'. Only after the guitars have been written can the bass guitar and drums be added.⁴

2.6.3 Scales

Metal music is usually written in some form of the minor scale. This in itself is not a very challenging difference compared to the regular genres. A challenge in writing or generating metal music, however, is that it also often uses notes that are not part of the used scale. In other words, these notes are 'out-of-scale'.

⁴ It should be noted that in this discussion, and in the remainder of this thesis, a clear distinction is made between the rhythm and lead guitar. However, in many metal bands each guitar player sometimes takes the role of rhythm guitarist and at other times that of lead guitarist. The exchange of roles between both guitar players can happen within the same song. In genres like pop music, the roles are usually not exchanged within the same song.


Different metal subgenres make more or less use of such out-of-scale notes. The metalcore subgenre, for example, uses out-of-scale notes only occasionally.

In some sense, the use of out-of-scale notes offers an extra degree of freedom. However, this freedom should be used correctly. When too many out-of-scale notes are used, or when they are used at the wrong time, the result will sound bad. Using out-of-scale notes therefore also introduces an added difficulty.


Chapter 3

State of the Art

This chapter will present an overview of the state of the art in music generation. Section 3.1 will compare the two different input and output formats that are used in music generation. This is followed by a brief discussion of some popular symbolic formats in Section 3.2. After this, some further decisions regarding the scope of generation are presented in Section 3.3, before moving on to a discussion of data sets in Section 3.4. Finally, an overview of some interesting models is presented in Section 3.5, followed by evaluation methods in Section 3.6 and a brief discussion of an existing model for metal music in Section 3.7.

3.1 Output Formats

When classifying the different music generation approaches, a first division can be made based on the output format, which is also the format of the data used for training. This leads to two types: audio generation and symbolic generation [16].

3.1.1 Audio Generation

Audio generation outputs waveforms of audio signals. An example of a waveform is shown in Figure 3.1, which shows the stereo waveform of the first seven bars of Master of Puppets by Metallica.

Audio signal generation has its advantages and disadvantages. Advantages are:


Figure 3.1: Stereo waveform of the first seven bars of Master of Puppets by Metallica.

• Each recording of a song is in essence a waveform, usually stored in the .wav or .mp3 format. This means that any song could in principle be used for training data.

• The output is also in waveform format and can therefore be immediately played back.

Its disadvantages are:

• The resulting waveform does not show which notes are being played. These notes still need to be worked out by musicians, which is not an easy task. There exists software to do this, such as AnthemScore [17]; however, highly distorted sounds create more difficulties for such software.

• Transposing the resulting music to a different pitch is possible through tools (recording software like Logic Pro [10] can do this); however, large transposition intervals lead to large losses in quality when using such tools.

• Audio recordings have no inherent representation of structure, such as bars in sheet music. These structural elements therefore cannot be used to aid in the generation process.

• Each music recording has different sound characteristics, while the sound characteristics throughout a song usually remain similar. This leads to an added difficulty during generation: the model is trained using input songs that each have different sound characteristics, but an output song should ideally have uniform sound characteristics.

One example of an audio generation model is WaveNet by Google DeepMind [3]. While it has shown potential, its results are far from usable yet, sounding much like random sound clips that follow each other. Audio signal generation


has its advantages, but it is generally much harder than symbolic generation. Some researchers also propose that the essence of music generation lies in composing music, which is done through symbolic representations [16]. Audio generation is considered more reminiscent of generating sounds instead of compositions.

This belief is followed in this thesis, which therefore focuses on the second type of generation: symbolic generation.

3.1.2 Symbolic Generation

Symbolic generation outputs sheet music. Sheet music can come in many different formats, such as music scores and tablatures. Electronic formats for sheet music are discussed in Section 3.2.

Advantages of symbolic generation are:

• The resulting sheet music shows which notes should be played in a way that is clear to any musician who understands the sheet music representation. It is also easy to alter certain notes of the result.

• Since the result is sheet music, it can easily be transposed.

• Sheet music can include various informative elements such as time signatures, bars, dynamic markings, ...

• Sheet music does not have associated sound characteristics. A specific note can be considered to sound exactly the same in each and every music score.

• Since the output is in the form of sheet music, this output can be analyzed from a music theory perspective. This not only helps to think about improvements to the model, but also makes it possible to identify musical concepts that a model has learned.

Disadvantages of symbolic generation include:

• For each song that is used for training data, some form of sheet music needs to be available. For many songs, sheet music is not available. Sheet music that is available can be incomplete, incorrect or include representation errors. A representation error could be the use of half the intended bpm, which also halves all note durations in order to achieve music that sounds the same.


• Since the output is in the form of sheet music, this music still needs to be played in order to be heard. Software exists that can play back sheet music in certain formats; however, the resulting sound quality is mostly mediocre.

3.2 Electronic Symbolic Formats

Common electronic formats are MIDI [18] and piano roll. These will both be explained now. Other formats include ABC notation [?] and MusicXML [20].

3.2.1 Musical Instrument Digital Interface (MIDI)

The MIDI protocol was designed to offer an interface between electronic musical instruments, software and devices. It is event-based to allow the real-time communication between these different components. Many keyboards are equipped with a MIDI output, thus allowing them to send MIDI events to a receiving component. MIDI events can also be written to a file. The events in the file can then be read and processed sequentially.

MIDI events can have different types, each accompanied by type-related data fields. The most important events are the Note on and Note off events:

• Note on: Events of this type have a single pitch field, indicating the pitch that should start sounding, and a velocity field, indicating how loud this pitch should be sounding.

• Note off: These events have a single pitch field, indicating the pitch that should stop sounding, and a velocity field which always has 0 as its value.

In MIDI, pitches are expressed as integers in the range [0, 127]. MIDI pitch 60 corresponds to C4, also called 'middle C'. Increasing/decreasing a MIDI pitch value by one corresponds to increasing/decreasing the corresponding pitch by one semitone. For percussive instruments like drums, the MIDI pitch numbers correspond to different drums. Some numbers correspond to the same drum, hit differently.
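As a small illustration of this pitch numbering (added for this text, not taken from the thesis), a MIDI pitch number can be converted to a note name as follows:

```python
NOTE_NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def midi_to_name(pitch):
    """Convert a MIDI pitch number (0-127) into a note name with octave,
    using the convention that MIDI pitch 60 is C4 ('middle C')."""
    octave = pitch // 12 - 1          # 60 // 12 - 1 == 4
    return f"{NOTE_NAMES[pitch % 12]}{octave}"

print(midi_to_name(60))  # 'C4'
print(midi_to_name(61))  # 'C#4': one semitone higher
```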

The velocity is also expressed as an integer in the range [0, 127], where 0 corresponds to silence and 127 to 'very loud'. Velocity values relate to dynamic


markings. Figure 2.4 shows MIDI velocity values for the different dynamic markings.

MIDI events further have a tick field, which is used to indicate the time until the next event. This time is not expressed in seconds or milliseconds, but instead in MIDI ticks. One MIDI tick corresponds to 1/480 of a beat. The actual time to wait in seconds therefore depends on the current tempo (which can also be updated using MIDI events).

A quarter note could thus be expressed by using a 'Note on' event with tick value 480, followed by a 'Note off' event, with both events having the same pitch value. It is also possible to have other events in between. The requirement for the note to have a quarter note duration is simply that the sum of the tick values of the 'Note on' event and all subsequent events before the 'Note off' event equals 480.

Most events also include a channel attribute, indicating the instrument track they correspond to. Some events are global and therefore do not have a channel attribute. An example of a global event is a 'Time signature' event.

MIDI can capture most aspects of a music score. In fact, all concepts that were described in Chapter 2 can be expressed through MIDI events, with the exception of palm mutes. The MIDI format is also very popular, with a lot of support for playing back MIDI tracks with different instrument sounds.

The drawback of the MIDI format is that it is not very intuitive. Each 'Note on' and 'Note off' event only corresponds to a single pitch, meaning that all notes of a chord are expressed via different events. Rests are not even explicitly indicated, but have to be derived from the ticks between 'Note off' and 'Note on' events. When using MIDI in a deep learning context, it will thus be required to convert MIDI files to a more intuitive representation.
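A minimal sketch of such a conversion is given below. The choice of the mido library is an assumption (the thesis does not state its MIDI tooling), and mido reports each message's time as the delta since the previous message rather than until the next one; the sketch pairs 'Note on' and 'Note off' events into (pitch, start, duration) tuples expressed in ticks.

```python
import mido  # assumption: the mido library is used to parse MIDI files

def midi_to_notes(path, track_index=0):
    """Turn one track of a MIDI file into (pitch, start_tick, duration_tick)
    tuples by pairing 'note_on' events with the matching 'note_off' events
    (a 'note_on' with velocity 0 also counts as a note off)."""
    track = mido.MidiFile(path).tracks[track_index]
    notes, open_notes, tick = [], {}, 0
    for msg in track:
        tick += msg.time  # delta time in ticks since the previous message
        if msg.type == 'note_on' and msg.velocity > 0:
            open_notes[msg.note] = tick
        elif msg.type in ('note_off', 'note_on') and msg.note in open_notes:
            start = open_notes.pop(msg.note)
            notes.append((msg.note, start, tick - start))
    return notes
```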

3.2.2 Piano Roll

The piano roll representation is quite old, as it was designed for automated pianos. An example of a piano roll representation for the first seven bars of Master of Puppets by Metallica is shown in Figure 3.3.

Piano rolls are more easily readable than MIDI files. Each horizontal bar has a vertical position that corresponds to the pitch that should be played. In Figure 3.3, the pitch can be determined from the piano keys on the left. The horizontal length of a bar then indicates how long the pitch should be held for.


Figure 3.2: An automated piano and a piano roll. Picture taken from [16].

One major drawback of piano rolls, however, is that when a pitch is held and then hit again, this is not indicated. To make this more concrete, at the start of bar 4 of the music score and tablature of Master of Puppets (Figure 2.1), the same note (E2 in the score or 0 in the tablature) is hit two consecutive times. In the piano roll in Figure 3.3, however, this is replaced by a single bar that actually has the length of both notes together. In piano rolls, consecutive notes with the same pitch are held instead of restrummed. This leads to music that can sound very different, and therefore songs in the MIDI format will be preferred.
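The sketch below (an illustration added for this text) builds a simple binary piano-roll matrix from (pitch, start, duration) tuples, for example those produced by the MIDI sketch above; it makes the drawback visible, since the matrix only records that a pitch is active, not that it is struck again.

```python
import numpy as np

def to_piano_roll(notes, total_steps, ticks_per_step=120):
    """Binary piano-roll matrix of shape (128, total_steps) built from
    (pitch, start_tick, duration_tick) tuples. Back-to-back notes of the
    same pitch simply merge into one uninterrupted run of ones."""
    roll = np.zeros((128, total_steps), dtype=np.int8)
    for pitch, start, duration in notes:
        begin = start // ticks_per_step
        end = max(begin + 1, (start + duration) // ticks_per_step)
        roll[pitch, begin:end] = 1
    return roll
```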

3.3 Further Scope Options

The following subsections each discuss choices to be made for the scope of a music generation model. These choices lead to different levels of complexity.


Figure 3.3: A piano roll representation for the first seven bars of Master of Puppets by Metallica.

3.3.1 Monophonic or Polyphonic

In a monophonic track, only one pitch is played at a time. This effectively means chords do not exist in a monophonic track. The opposite of monophonic is polyphonic, where a track can have more than one pitch being played at the same time.

Generating monophonic tracks is easier than generating polyphonic tracks.

3.3.2 Single-Instrument or Multi-Instrument

These terms are quite obvious. Single-instrument songs only consist of one track, while multi-instrument songs consist of multiple tracks.

Multi-instrument generation is obviously harder and can be done using two approaches. Either all instruments are generated simultaneously or different instruments are generated one by one.

3.3.3 Temporal Scopes

The temporal scope refers to how much time of the song is generated in one step. There are three different temporal scopes: global, time step and note step [16].

When using a global scope, the neural network generates an entire song in one step. This scope is used by MiniBach [4] and DeepHear [21].


When a time step scope is used, the network generates a small piece of music with a set duration at each step. This duration, which is the length of the time step, usually corresponds to the greatest common divisor of all possible note durations in the training data. A time step can, however, also be larger, for example a whole measure [22].

With the final scope, the note step scope, a network generates a note with each step. For this note, both the pitch and the duration are generated. This strategy can result in far fewer processing steps than the time step strategy, because a note's duration will typically be larger than a single time step. This approach has also been used by Walder in [23].

3.4 Data

An important element for any machine learning model is its training data. This section will discuss available data sets and the use of transposition.

3.4.1 Data Sets

Different data sets are available and common ones are listed in Table 3.1. Each of these data sets can be found in the MIDI format, but many of them are also available in the ABC format.

Data set name                  # songs in data set
Lakh MIDI Dataset v0.1 [24]    176,581
MuseData [25]                  783
JSB Chorales [26]              382
Nottingham database [27]       1,200

Table 3.1: Common data sets and the number of songs they contain.

While these data sets are popular, most of them do not contain metal music, making them useless for this thesis. The Lakh data set included some Metallica songs, but not much more than that.


3.4.2 Transposition

Transposition has been mentioned in Section 2.4 and results in increasing or decreasing all pitches in a song by the same amount. In music generation, it can be used in two ways to improve the data set.

A first option is to transpose all songs to the same key, i.e. so that their notes belong to the same scale with the same root note [28]. Out-of-scale notes are still possible, but should be reduced to the minimum amount by transposing correctly. The idea behind this approach is to make an abstraction of the actual pitches, and instead have each note correspond to a specific element of the scale. This effectively allows treating each song as a set of intervals again, instead of a set of pitches.

A second option is to augment the data set by transposing each song to all different root notes. The idea behind this approach is to provide key invariance of all songs and make them more generic. It also multiplies the data set size by 12, leading to more data. This technique is used in [29].
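In terms of the note tuples used earlier, the second option can be sketched as follows (an illustration; the tuple format and function names are assumptions made for this text):

```python
def transpose(notes, semitones):
    """Shift every (pitch, start_tick, duration_tick) tuple by `semitones`."""
    return [(pitch + semitones, start, duration)
            for pitch, start, duration in notes]

def augment_all_keys(song_notes):
    """One transposed copy of the song for each of the twelve root notes."""
    return [transpose(song_notes, shift) for shift in range(12)]
```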

3.4.3 Preprocessing: Hold Symbols

When generating music for multiple instruments, it is important to be able to synchronize the different instrument tracks. A very simple yet effective preprocessing step for this has been suggested in [4]. The idea is to divide all instrument tracks into equal time steps. Using time steps raises the question of how to represent notes that are being held. For this, the authors introduce a new pitch that corresponds to a hold symbol. This hold symbol does not represent which pitches are being held, but simply indicates that the pitches of the previous time step have to be held during the next time step.
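A minimal sketch of this encoding for a monophonic track is shown below (the specific symbol values and the rest handling are assumptions added for illustration; [4] only prescribes the hold symbol idea):

```python
HOLD = 128  # extra 'pitch' used as the hold symbol (MIDI pitches use 0-127)
REST = 129  # extra symbol for steps where nothing sounds

def to_time_steps(notes, n_steps, ticks_per_step=120):
    """Encode a monophonic track as one symbol per time step: the pitch on
    the step where a note starts, HOLD while it keeps sounding, REST otherwise.
    `notes` contains (pitch, start_tick, duration_tick) tuples."""
    steps = [REST] * n_steps
    for pitch, start, duration in notes:
        begin = start // ticks_per_step
        end = max(begin + 1, (start + duration) // ticks_per_step)
        if begin < n_steps:
            steps[begin] = pitch
        for i in range(begin + 1, min(end, n_steps)):
            steps[i] = HOLD
    return steps
```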

3.5 Machine Learning Models for Music Generation

Different kinds of machine learning models have been used for music generation. Their uses will now be discussed.


3.5.1 Feed-Forward Neural Networks

Feed-forward neural networks are the most basic networks. They simply take an input that is propagated through multiple layers of hidden units and then results in an output. The feed-forward architecture is not used very often for music generation. One example is the MiniBach Chorale Counterpoint Symbolic Music Generation System [?]. This network takes as input a soprano track of four 4/4 bars and then outputs additional alto, tenor and bass tracks for those four bars (soprano, alto, tenor and bass are vocal ranges). Generation happens in a single step and thus a global temporal scope is used.

While this architecture produces decent results, it has some major shortcomings. The output will always have a fixed size, in this case four bars, and therefore does not allow longer or shorter pieces of music. Feed-forward networks are also inherently deterministic in nature, meaning that the same input will always produce the same output. Lastly, since the generation happens in a single step, it does not allow human intervention during generation. This means that a human cannot choose to specify the first two bars of the alto voice and have the model take this into account.

To deal with these challenges, more advanced models are required.

3.5.2 Stacked Autoencoders

An autoencoder is a network that is trained to output its input. It does so by sending the input first through multiple layers with decreasing numbers of hidden units. These layers are then followed by corresponding layers with increasing numbers of hidden units that lead to the output [32]. The first set of layers is referred to as the encoder and ends in a hidden layer with only a small number of hidden units (the encoder output). The second set of layers is referred to as the decoder. It takes the encoder output as its input and transforms this to the output of the autoencoder. The autoencoder described above is actually a stacked autoencoder, which indicates the use of multiple layers for the encoder and decoder. For a regular autoencoder, the encoder and decoder each consist of only a single layer.

Trained autoencoders can be used for generation by only keeping the decoder part. A random input vector can be used as an input to the decoder, which has learned how to transform small vectors to output data. By using random inputs for the decoder, new music is created that should have characteristics


of the training data.

An example of music generation with a stacked autoencoder is the DeepHear system by Sun [21]. This system again outputs four 4/4 bars of music. Compared to the feed-forward strategy, the DeepHear system can generate music ex nihilo, i.e. without having to provide an already written music track as input. It does, however, have some shortcomings, most notably its high degree of plagiarism. 'Randomly generated' pieces of music were often almost copies of music from the training data.

3.5.3 Recurrent Neural Networks

Recurrent neural networks are the best-known models for sequence generation. They have not only been used for music generation, but also for e.g. generating texts in the style of Shakespeare [31]. They are trained to take a sequence of inputs, e.g. note steps or time steps, and predict the next element in the sequence. The predicted element can then be added to the sequence of inputs, after which a next element can be generated again.

Standard recurrent neural networks, however, face the problem of vanishing gradients during backpropagation. This restricts their capability of learning long-term dependencies between sequence elements [32]. This has led to the invention of Long Short-Term Memory (LSTM) networks [33].

Long Short-Term Memory Networks

LSTM networks make use of more complex hidden units that are able to store information in memory cells. These memory cells can be controlled by gates that determine whether the content of the memory cell should be updated or not. The LSTM network not only learns connections between cells, but also learns how to control these gates. Deep LSTM networks are networks that use multiple LSTM layers.

The structure of LSTM cells allows them to learn long-term dependencies between sequence elements, which has led to their increased use for sequence generation. Monophonic melodies were created using LSTM networks by Sturm et al. [34] and by Colombo et al. [5].

A first example of conditional generation of monophonic melodies on top of blues chord progressions was created by Eck et al. [30], but relied on the

30

Page 52: Composition of Multi-instrument Music with Deep Learning€¦ · Composition of Multi-instrument Music with Deep Academic year 2018-2019 Master of Science in Computer Science Engineering

plicity of the chord progressions. BachBot [35] and DeepBach [4] are two mod-els that relied on LSTM networks for adding harmonies in the style of Bach.Their focus on Bach harmonization does, however, lead to some simplificationsbeing made, and each of their generated instrument tracks is monophonic.

The DeepBach architecture, however, makes an interesting use of LSTM networks. The architecture, shown in Figure 3.4, uses two LSTM networks to predict a note. One network takes previous notes as its input, while the other network takes future notes as its input, in reversed order. Both networks are then used together with the current notes of the other tracks to predict the current note of the track that is being predicted. This architecture is called a bidirectional LSTM, or Bi-LSTM, architecture.

Figure 3.4: Bi-LSTM architecture used by DeepBach [4].

Another way of conditional generation was presented by Makris et al. [36]. Their model is not focused on generating a main melody or a harmony, but instead generates a drumbeat conditional on a provided bass line. Previous drum notes are put through a deep LSTM layer, while the bass line is put through a feed-forward layer. Both layers’ outputs are then combined in order to predict the next drums that are hit. Their work also proposed the use of a drum set with only five elements, representing the kick drum, snare drum, toms, hi-hat and cymbals. For each time step, the drums that are hit can then be represented as a binary array, e.g. 01010 indicating that the snare and hi-hat are hit.

BachProp [6], finally, is a deep LSTM network that generates music by taking the MIDI format more explicitly into account. It can generate good results for different data sets; however, these data sets do not include drum tracks, which have different structural properties than more melodic instruments. It is also not able to sequentially add more instruments.

3.5.4 Generative Adversarial Networks

Generative Adversarial Networks (GANs) have been proposed by Goodfellow et al. in [37]. A GAN consists of two models that are trained simultaneously: a generator that learns the generation function, and a discriminator that is trained to detect whether a sample was generated or ‘real’, i.e. came from the training data. The goal of this structure is to train the generator so that the discriminator can no longer detect whether a sample was generated or real.

GANs have been shown to lead to generations that outperform LSTM networks, and have been used in MuseGAN [38] and MidiNet [39]. Their main drawback is that they are very hard to train, requiring long training times and proper selection of hyperparameters [40]. For these reasons, GANs will not be used in this thesis.

3.6 Evaluation Methods

Evaluation of generated music through objective measures is hard. For any form of art, objective measures can only be used as a non-strict indication of quality. This means that these measures can be used as a first indication; however, subjective analysis is still required.

Through subjective analysis, various qualities of a generated piece can be detected, such as the presence of different types of song sections in lead guitar generation.

A popular way of testing whether artificial intelligence gets close to human intelligence is through a Turing test [41]. A Turing test consists of creating a survey where people are presented with human-generated data and data generated by an artificial intelligence. These people are then asked to interact with this data and guess which data has been generated by the artificial intelligence and which data was human-generated. The idea is that if people cannot discern the difference between the two, the artificial intelligence is in fact intelligent and can emulate human characteristics.

3.7 Metal Music

To finish this state-of-the-art summary, a recent example of metal music generation is presented. Zukowski and Carr have used deep LSTM networks to create audio signal generation models that generate black metal and death metal music [7]. These subgenres are more chaotic and contain fewer melodic parts than the thrash metal and metalcore subgenres that will be focused on in this thesis. By using SampleRNNs [42], they are able to train a model on a set of actual song recordings and then generate audio signals in the style of the training data. In March 2019, they released a continuous YouTube live stream that keeps generating death metal music in the style of the band Archspire. This live stream can be found at: https://www.youtube.com/watch?v=CNNmBtNcccE.

As with all audio signal generation approaches, the sound quality is generally far below the quality of actually recorded music. Their models are able to generate some form of structure; however, the resulting generations often seem to copy fragments of the training data and put these fragments in new orders to create new songs.

Models that generated symbolic metal music were not found.


Chapter 4

Data Set Creation

This chapter will start with the introduction of two new concepts: the notum and the enotum. Following this, an overview is given of the different steps that were taken to collect a data set and to manually improve some of its shortcomings. Hereafter, a discussion is presented of the various preprocessing steps that were used to transform the data set to a format that can be used as an input to a neural network.

4.1 A New Concept: The Notum

A notum is a new concept that is introduced to simplify the further discussion in this thesis. It is essentially a hypernym of the terms note, rest and chord. This simplifies further descriptions by not having to write ‘note, rest and/or chord’ every time.

A notum consists of a notum duration and a notum pitch:

• Notum pitch - This is the set of pitches to be played when the notum is hit. Using a set makes it possible to cover rests, single notes and chords. If the set is empty, the notum corresponds to a rest. A single note is represented using a set that includes only a single pitch, while chords use a set with multiple pitches. For drums, the notum pitch is the set of all drums that are being hit together.

• Notum duration - How long the notum will sound for, which is the time until the next notum.


Using this definition, a music score or MIDI track can easily be converted to a sequence of notums, which is also called the notum representation of the music score. The music score for Master of Puppets by Metallica is repeated in Figure 4.1. The corresponding sequence of notums for the first four bars is as follows:

[{E3, B3, E4}, 1/4] [{}, 3/4] [{D3, A3, D4}, 3/16] [{}, 1/16] [{C#3, G#3, C#4}, 3/16] [{}, 1/16] [{C3, G3, C4}, 3/2] [{E2}, 1/8] [{E2}, 1/8] [{E3}, 1/8] [{E2}, 1/8] [{E2}, 1/8] [{D#3}, 1/8] [{E2}, 1/8] [{E2}, 1/8]

Each notum is written inside square brackets and the notum pitch inside curly brackets. Durations are written as fractions, where 1/4 refers to a quarter note and 3/4 refers to a dotted half note. Notice that the tied chord in bars 2 and 3 results in only one notum.

Figure 4.1: Music score and tablature for the first seven bars of ‘Master of Puppets’ by Metallica. The instrument being represented is the lead guitar.

Instead of using pitches in letter-number notation and durations in fractional notation, it is also possible to use MIDI notation for notum pitches and notum durations, as discussed in Section 3.2.1. Pitches then correspond to a number in the range [0, 127], whereas durations are represented using MIDI ticks. The MIDI way of pitch and duration notation will mostly be chosen over the letter-number and fractional notation in the remainder of this thesis. The first four bars of Master of Puppets in notums using the MIDI pitch and duration representation are as follows:

[{52, 59, 64}, 480] [{}, 1440] [{50, 55, 62}, 360] [{}, 120] [{49, 54, 61},360] [{}, 120] [{48, 53, 60}, 2880] [{40}, 240] [{40}, 240] [{52}, 240][{40}, 240] [{40}, 240] [{51}, 240] [{40}, 240] [{40}, 240]
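To make the notation above concrete, the following is a minimal Python sketch of a notum data structure and of the letter-number to MIDI pitch conversion, assuming the C4 = 60 MIDI convention used in the example. The names are illustrative and not taken from the actual thesis code.

from dataclasses import dataclass

NOTE_OFFSETS = {'C': 0, 'D': 2, 'E': 4, 'F': 5, 'G': 7, 'A': 9, 'B': 11}

def pitch_to_midi(name):
    """Convert letter-number notation (e.g. 'E3' or 'C#3') to a MIDI pitch number."""
    letter, rest = name[0], name[1:]
    sharp = rest.startswith('#')
    octave = int(rest[1:]) if sharp else int(rest)
    return 12 * (octave + 1) + NOTE_OFFSETS[letter] + (1 if sharp else 0)

@dataclass
class Notum:
    pitches: frozenset   # empty set = rest, one pitch = note, several = chord
    duration: int        # in MIDI ticks (480 ticks correspond to a quarter note)

# The first notum of Master of Puppets: a chord held for a quarter note.
first = Notum(frozenset(pitch_to_midi(p) for p in ('E3', 'B3', 'E4')), 480)
assert first.pitches == frozenset({52, 59, 64}) and first.duration == 480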


4.1.1 Notum Limitations

Clearly the notum representation is less readable to musicians, but it can easily be processed by a computer. The notum representation does, however, have a few limitations that slightly limit its possibilities. These limitations are in a sense also simplifications.

Loss of Overlapping Ties

First of all, in a regular music score it is possible to have a note still ringing while another note is being strummed, as depicted in Figure 4.2a. A held note can also be released while other notes are still being held.

This is not possible with notums, as the definition of the notum duration explicitly states that it is the time the note will sound for, which is defined as the time until the next notum. In cases such as the one in Figure 4.2a, each notum will only sound until the next notum is strummed, resulting in Figure 4.2b. This simplification thus slightly changes the way a song will sound. However, all original notes of the song are still being strummed.

Figure 4.2: (a) Tied notes before converting to notum representation. (b) Result after converting back from notum representation.

Loss of Bends and Whammy Dives

The notum representation also removes bends and whammy dives (described in Section 2.3.1). In MIDI, these concepts are included, but the gradual increases or decreases in pitch require the use of extra pitch elements, as well as uncommon durations. This added complexity has been ignored in the notum concept. Ignoring bends and whammy dives again results in a simplification that slightly changes the song, but all original notes are still being strummed.


Further Losses of Information

The notum representation is very simplistic, capturing the most essential aspects of the notes to be played. It therefore also loses the following pieces of information:

• Tempo: By losing tempo information, the song will sound different and will use the standard tempo of 120 bpm.

• Time signatures & bar lines: This results in a loss of structure, but does not alter how the represented music will sound.

• Clef, key signatures & accidentals: This information is not required, since the explicit pitches are contained inside notum pitches.

• Dynamic markings: Dynamics are lost since they are not included in the notums. As a result, each notum will sound at the same velocity.

• Palm mutes: These are guitar-specific and change the sound characteristics. They are omitted for simplicity.

While all this information is lost, the most essential part of the music is still kept: the pitches to be played and their durations.

4.1.2 Alternative Rest Symbol

The notum pitch for a rest can be written in an alternative way. Instead of using an empty set of pitches, a designated rest symbol can be used. When using letter-number notation, the rest pitch would be written as an ‘R’. Using MIDI pitch notation, a pitch value of 128 is selected, since normal MIDI pitches are in the range [0, 127].

This means that the following notum pitches are all equal to a rest: {}, {R}, {128}. Notice that a notum pitch set that includes an ‘R’ (or 128 in MIDI pitch notation) cannot include other pitches.

Using a designated rest symbol is useful in some situations, whereas using an empty set is useful in others.


4.1.3 Hold Notum

The Master of Puppets example shows that the use of ties or holds is not necessary with notums. However, using a hold notum can have its uses.

A hold notum has a notum pitch that is represented by a designated hold symbol ‘H’, inspired by the hold symbol from Section 3.4.3. Using MIDI pitch notation, a pitch value of 129 is selected to indicate a hold.

A hold notum can follow any other notum and simply indicates that the previous notum should continue to be held for the duration of the hold notum.1 The hold notum itself does not indicate the notes that are being held and therefore depends on the previous notum. Multiple hold notums can also follow each other.

Hold notums have two main uses. First of all, they can be used to limit the maximum notum duration, for example to a maximum of 1 whole note or 1920 MIDI ticks. This can be useful to avoid long durations that rarely occur. For example, a notum duration of 7 whole notes is rare, but can be transformed to a notum with a duration of 1 whole note, followed by six hold notums with a duration of 1 whole note. Following this duration limit of 1 whole note, the notum [{48, 53, 60}, 2880] in the Master of Puppets example would be converted to the notums [{48, 53, 60}, 1920], [{129}, 960].
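As a sketch of this duration-capping rule, the function below splits a (pitch set, duration) pair into a notum of at most one whole note followed by hold notums. The hold pitch value 129 follows the convention above; the function name is illustrative.

MAX_DURATION = 1920            # one whole note in MIDI ticks
HOLD = frozenset({129})        # hold notum pitch

def cap_duration(pitches, duration):
    """Split a notum into a capped notum followed by hold notums."""
    if duration <= MAX_DURATION:
        return [(pitches, duration)]
    result = [(pitches, MAX_DURATION)]
    duration -= MAX_DURATION
    while duration > MAX_DURATION:
        result.append((HOLD, MAX_DURATION))
        duration -= MAX_DURATION
    result.append((HOLD, duration))
    return result

# [{48, 53, 60}, 2880] becomes [{48, 53, 60}, 1920], [{129}, 960]:
print(cap_duration(frozenset({48, 53, 60}), 2880))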

A second use for hold notums is when using set durations. This means that all notums need to have the same (short) duration and is used when using a time step temporal scope. In that case, hold notums can be used to increase the actual durations of the notes that are being played.

4.2 The Extended Notum: The Enotum

The extended notum, or enotum, offers a variation to the representation of notum pitches by introducing a third attribute, the enotum modifier. Enotum pitches are no longer represented as a set, but as a single symbol when using letter-number pitch notation, or a single integer when using MIDI pitch notation. This means that rests need to be represented with the alternative way described in Section 4.1.2.

1 A rest notum can be followed by a hold notum; however, this yields the same effect as following a rest notum by another rest notum.


For chords, for which the notum pitch set includes more than 1 pitch value, the enotum pitch will be the lowest value of the notum pitch set. The enotum modifier will in turn be a set that holds the differences between the enotum pitch and the other pitches that were part of the notum pitch set. For rests, holds and single notes, the enotum modifier will obviously be an empty set.

As an example, the enotum representation of the first four bars of Master of Puppets is as follows, with the enotum attributes in the order pitch, duration, modifier:

[52, 480, {7, 12}] [128, 1440, {}] [50, 360, {7, 12}] [128, 120, {}] [49,360, {7, 12}] [128, 120, {}] [48, 2880, {7, 12}] [40, 240, {}] [40, 240, {}][52, 240, {}] [40, 240, {}] [40, 240, {}] [51, 240, {}] [40, 240, {}] [40,240, {}]
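The conversion from a notum to an enotum can be sketched as follows; the function name is illustrative, and rests use the designated value 128 from Section 4.1.2.

REST = 128   # designated rest value

def to_enotum(pitch_set, duration):
    """Convert a notum (pitch set, duration) to an enotum (pitch, duration, modifier)."""
    if not pitch_set:                         # empty set: a rest
        return REST, duration, frozenset()
    root = min(pitch_set)                     # the lowest pitch becomes the enotum pitch
    modifier = frozenset(p - root for p in pitch_set if p != root)
    return root, duration, modifier

# The opening chord [{52, 59, 64}, 480] becomes [52, 480, {7, 12}]:
print(to_enotum({52, 59, 64}, 480))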

The enotum notation offers two big advantages. First off, the enotum modifiers actually indicate the types of chords that are being played, since a chord type is defined by its intervals. A power chord, for example, only has intervals of [0]12 and [7]12, where [ ]12 means modulo 12.

Each chord type can be defined by a number of congruence classes modulo 12, and each version of a specific chord type includes at least one interval from each of the congruence classes of the chord type. The [0]12 congruence class is always included, since an interval of 0 is always present. A major chord, for example, can be defined by the congruence classes [0]12, [4]12 and [7]12. One version of a major chord is then given by the intervals {4, 7} (the 0 interval does not need to be written). Another version is given by the intervals {7, 12, 16, 19}. The enotum modifier thus specifies the specific version of a chord type that is being played.

A second advantage of the enotum notation is that it allows going directly from a polyphonic track to a monophonic track, by simply ignoring the enotum modifiers.

The enotum representation is not useful for drums. There are no intervals between different drums, so a modifier attribute would be meaningless.

4.3 Data Gathering

Section 3.4.1 listed different data sets that are available for music generation. Most of these data sets, however, do not contain any multi-instrument metal songs. The one data set that does, the Lakh data set, only has a few metal songs, not enough for training. The songs that are present in this data set are also not divided into only two guitar tracks, with one taking the role of lead guitar and the other the role of rhythm guitar (following the definitions given in Section 2.6.2). This means that a data set had to be created.

The first step in the data gathering process consisted of collecting sheet music from the Ultimate Guitar [43] website. This website allows users to submit sheet music of existing songs, so that other users can learn how to play these songs. The usual process of creating this sheet music consists of a user listening to the song, trying to figure out which notes are being played and writing this down. Most of the submissions on Ultimate Guitar are therefore not official versions, which means that mistakes can still be present in them. This is, however, not a big problem for this thesis, since even with these small mistakes, the music still sounds valid within the genre. The big advantage of Ultimate Guitar is that users also submit a lot of sheet music of songs from lesser-known artists. Whereas the available MIDI data sets consist mainly of classical and/or pop music, Ultimate Guitar contains a lot of songs from metal bands.

The sheet music on Ultimate Guitar can be submitted in various formats, with the arguably best format being the Guitar Pro [13] format. Guitar Pro is a program for writing sheet music in both standard and tablature notation. It allows playback of the sheet music in a more advanced way than MIDI. It can, for example, incorporate the sounds of palm mutes on guitars and even differentiates the sound of the same note played on different strings. Guitar Pro also offers the option to extract a PDF, MP3 and, most importantly, MIDI version of a Guitar Pro file2.

Combining Ultimate Guitar with Guitar Pro made it possible to obtain many MIDI versions of existing metal songs. It should be noted that, contrary to what their names may indicate, both Ultimate Guitar and Guitar Pro are oriented towards more instruments than just guitars. The multi-instrument approach that is taken in this thesis considers songs containing two guitar tracks, one bass track and one drum track. This set of instruments is present in virtually any metal song. Many songs, however, feature more tracks.

In creating the data set, occasional extra instruments like pianos were simply discarded from the collected songs, since these would not be trained on. Various songs also featured more than two guitar tracks. For these songs, the different tracks were carefully combined to end up with two guitar tracks that hold the most important notes.

2 Guitar Pro 6 was initially used, but its extracted MIDI had many errors, particularly with triplets. Upgrading to Guitar Pro 7.5 resulted in practically flawless MIDI extraction.

Even when there were only two guitar tracks in the Guitar Pro file, these guitars would still not always follow the division into lead and rhythm guitar that was presented in Section 2.6.2. Therefore it was also necessary for some parts to switch the original guitars, in order to achieve the intended division of the guitar tracks.

The collected data set ended up containing 300 songs by various artists, some represented many times, others with only a few songs. In total, 37 artists are present. The bands with the most songs in the data set are: August Burns Red (50 songs), Metallica (46 songs) and As I Lay Dying (37 songs). The entire data set can be found at https://tinyurl.com/y29gwb5u.

4.4 MIDI-to-Notum Conversion

As mentioned in Section 3.2.1, the MIDI format is not very intuitive and needs to be converted to a format that is easier to process further. The python-midi library [48] was used to open MIDI files in Python. Each track was then converted to the notum representation from Section 4.1.

Different statistics regarding the number of notums per instrument in the data set are presented in Table 4.1. As can be seen, there is a large variety in the number of notums per song. By not using a global temporal scale, this variety can be handled by models. The number of notums does not indicate how long each song is in terms of time. This is because each notum can have a different duration and because songs can be written in a different tempo. As an example, the songs ‘No Lungs To Breathe’ by As I Lay Dying and ‘Leper Messiah’ by Metallica each have 1545 notums in their lead guitar track. The first song, however, lasts only 4:04 minutes, while the second one lasts 5:40 minutes. This is when both songs are played at their original tempos. When they are both played at 120 bpm, the former lasts 6:44 minutes, while the latter lasts 7:09 minutes.

After converting all songs to their notum representation, the guitars and bass were also transposed to the E minor scale. This was done to make an abstraction of the actual pitches, and instead have the MIDI pitch numbers correspond to elements in the minor scale.


Instrument     | Mean | Standard deviation | Min. | Max.
Lead guitar    | 1372 | 408                | 385  | 3022
Rhythm guitar  | 1245 | 426                | 342  | 3159
Bass           | 1189 | 396                | 339  | 3172
Drums          | 1621 | 507                | 469  | 5541

Table 4.1: Mean, standard deviation, minimum and maximum number of notums per instrument for the songs in the data set.

The minor scale is frequently used in metal music and on a guitar in standard tuning, the lowest note is an E. This is why the E minor scale was chosen to transpose to, although a different root could have also been used.

Most songs in the data set were already using the minor scale or a slight variation of it, and thus only the root note of the original song had to be determined in order to know how many steps the song had to be transposed. Some songs were written in a major scale; however, the E minor scale consists of the exact same notes as the G major scale, so these songs were transposed to the G major scale.
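As a sketch of this transposition step (under the assumption that the root pitch class of the original song is known, and that the smallest shift up or down is chosen, which the thesis does not specify), the transposition can look as follows:

E = 4   # pitch class of E (with C = 0)

def transpose_to_e_minor(midi_pitches, original_root_pc):
    """Shift all MIDI pitches so that the song's root pitch class becomes E."""
    shift = E - original_root_pc
    # Pick the smaller of shifting up or down (illustrative choice).
    if shift > 6:
        shift -= 12
    elif shift < -6:
        shift += 12
    return [p + shift for p in midi_pitches]

# A riff rooted on G (pitch class 7) is shifted down three semitones:
print(transpose_to_e_minor([43, 50, 55], original_root_pc=7))   # [40, 47, 52]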

4.5 Manual Improvements

While converting from MIDI to notums already introduces some simplifications, further manual simplifications had to be made as well. These mostly deal with the occurrence of outlying notum durations and pitches.

4.5.1 Outlying Notum Durations

The number of different notum durations that occur in the original data set is 133; however, 66 of these are longer than a whole note or 1920 ticks. As a first simplification, it was chosen to never consider notums with these longer durations, but to instead introduce hold notums. When a notum has a duration longer than 1920 ticks, it is converted to a notum with the same pitch and a duration of 1920 ticks. This is then followed by the required number of hold notums with durations of 1920 ticks and a final hold notum with the remaining duration that is less than 1920 ticks.

Rest notums that have a duration longer than a whole note are split the same way into multiple rests, instead of using the hold notum. This choice is made because a hold notum depends on the previous notum to determine the notum pitch to be held. Using multiple rest notums removes this dependency.

After this conversion (which can happen programmatically) there remain 67 notum durations, which is still too many. Figure 4.3 presents an overview of the variation in occurrence numbers for these different durations. For clarity, the occurrence numbers have been capped at 10,000 and the durations are presented in MIDI ticks.

Figure 4.3: Occurrence numbers for the different notum durations in the original data set. Notum durations longer than a whole note have been removed here. For clarity, occurrence numbers have been capped at 10,000.

As can be seen in the figure, there is a big imbalance in notum durations, and durations that only rarely occur should be removed, since the generation models will not be able to train on them properly.

Grace Notes

Many of these rare durations are the result of grace notes, very short pre-notes that lead into the actual intended note. When writing sheet music, these grace notes are not counted towards the duration of the bar. Therefore, one can have two 4/4 bars, each consisting of eight 1/8 notes, with the second bar also starting with a grace note. This is shown in Figure 4.4.


Figure 4.4: Example of a grace note in bar 2.

When Guitar Pro converts this to MIDI, it also converts the grace note, which needs to get a duration. The grace note gets a very short duration, usually 20 or 30 ticks, and is included before its following note. Therefore, in this example it will be included at the end of the first bar. This will lead to the duration of the note before the grace note (here the final note of the first bar) being reduced by these 20 or 30 ticks. The final note of the first bar originally had a duration of 240 ticks (corresponding to an eighth note), but would now receive a duration of 220 or 210 ticks, which is very rare.

This duration reduction as a result of grace notes leads to many of the outlying notum durations. To fix this, the Guitar Pro files were manually checked for grace notes, which were then removed. This slightly changes the way the songs sound; however, grace notes are quite rare and can be considered more of an expressive effect than a principal part of the score.

Other Rare Notum Durations

There were also notum durations that occurred only rarely (i.e. fewer than 100 times) and were not the result of grace notes. These notum durations were removed by increasing or decreasing them, and doing the opposite to the duration of a neighboring notum as compensation (so that the sum of all durations remains equal). An exception was made for a duration of 1800 ticks (15 tied sixteenth notes), which occurred 85 times. This duration was kept because it often corresponds to a rest after a single 1/16 note at the beginning of a 4/4 bar, which sounded weird when adapted.


Odd Notum Durations

Finally, there were also a few durations that were not multiples of 20 ticks, which could be considered odd durations. In a later processing stage (Section 4.6.3), it is required that each notum duration is a multiple of 20. The durations that were not a multiple of 20 and that were not the result of grace notes are listed in Table 4.2. The 48 and 96 tick durations correspond to quintuplet notes, which is where the total duration of four notes is divided over five notes (similar to how triplet notes divide the duration of two notes into three notes). A duration of 96 ticks thus corresponds to quintuplet sixteenth or 1/20 notes. In the data set, such quintuplet notes are only used in solos.

Removing these four durations was mostly done by actually removing a note from the song and increasing the duration of surrounding notes. Especially for quintuplet notes, this approach had to be taken, reducing five notes to four. In some cases, the previous technique of increasing or decreasing a duration and compensating in a neighboring note was also possible. This process had to be done manually in order to select the best notes to remove.

Duration in MIDI ticks | Corresponding fractional duration
30                     | 1/64
48                     | Quintuplet 1/32
90                     | Dotted 1/32
96                     | Quintuplet 1/16

Table 4.2: Significant notum durations that are not a multiple of 20 ticks and their corresponding fractional durations.

Resulting Notum Durations

After removing the outlying and odd notum durations, 25 durations remain. These are listed in Table 4.3. 25 different durations is more than what many models consider. Especially durations shorter than a sixteenth note are not frequently dealt with, but also durations like 1/4 + 1/8 + 1/16 are often removed from data sets.

4.5.2 Outlying Pitches

For each instrument there were some outlying notum pitches.


MIDI ticks | Fractional        | MIDI ticks | Fractional
40         | 1/32t             | 640        | 1/2t
60         | 1/32              | 720        | 1/4 .
80         | 1/16t             | 800        | 1/4 + 1/4t
120        | 1/16              | 840        | 1/4 + 1/8 + 1/16
160        | 1/8t              | 960        | 1/2
180        | 1/16 .            | 1080       | 1/2 + 1/16
240        | 1/8               | 1200       | 1/2 + 1/8
300        | 1/8 + 1/32        | 1320       | 1/2 + 1/8 + 1/16
320        | 1/4t              | 1440       | 1/2 .
360        | 1/8 .             | 1680       | 1/2 + 1/4 + 1/8
420        | 1/8 + 1/16 + 1/32 | 1800       | 1/2 + 1/4 + 1/8 + 1/16
480        | 1/4               | 1920       | 1
600        | 1/4 + 1/16        |            |

Table 4.3: The remaining notum durations in MIDI ticks and in their corresponding fractional notations. Dotted note durations are indicated with a dot next to them and triplet note durations are indicated with a t next to them. Some note durations are the result of ties, which are written using a plus sign.

Drums

For drums, outlying pitches that occurred fewer than 400 times could easily be handled programmatically. Originally there were 33 different drum pitches in the whole data set. This number was reduced to 18 by basically reducing the drum set. Exotic instruments like bongos and congas were simply removed from the drum set by deleting their hits from the data set. Other special drum hits were turned into more regular drum hits. For example, side stick snare hits were turned into regular snare hits.

The resulting drum set is illustrated in Table 4.4. The resulting occurrence numbers are illustrated in Figure 4.5a and Figure 4.5b.

An even further reduction is possible by grouping the drums into five main classes: kick drum, snare drum, hi-hat, tom and cymbal3. Grouping the drums according to these classes leads to higher occurrence numbers per class and less imbalance. This grouping operation amounts to turning the drum set into the most basic set with only a kick drum, snare drum, hi-hat, one tom and a crash. This is the same drum set that was used by Makris et al. [36]. The resulting occurrence numbers of hits for this maximally reduced drum set are shown in Figure 4.6.

3 I.e. crash, ride, china and splash.

Drum type  | Hit types & MIDI numbers
Kick drum  | - (36)
Snare drum | - (38)
Tom        | Low floor (41), very low (43), low (45), low-mid (47), hi-mid (48), high (50)
Hi-hat     | Closed (42), pedal (44), open (46)
Crash      | High (49), medium (57)
Ride       | Normal (51), bell (53), edge (59)
China      | - (52)
Splash     | - (55)

Table 4.4: An overview of the drum set that was kept after removing outliers.
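A sketch of this maximal reduction, grouping the MIDI drum numbers of Table 4.4 into the five main classes, could look as follows (the function name is illustrative):

# MIDI numbers from Table 4.4; the cymbal class covers crash, ride, china and splash.
DRUM_CLASS = {
    36: 'kick', 38: 'snare',
    41: 'tom', 43: 'tom', 45: 'tom', 47: 'tom', 48: 'tom', 50: 'tom',
    42: 'hihat', 44: 'hihat', 46: 'hihat',
    49: 'cymbal', 57: 'cymbal',                # crash
    51: 'cymbal', 53: 'cymbal', 59: 'cymbal',  # ride
    52: 'cymbal', 55: 'cymbal',                # china, splash
}

def group_hits(pitch_set):
    """Reduce a drum notum pitch set to its set of main drum classes."""
    return {DRUM_CLASS[p] for p in pitch_set}

# Kick, closed hi-hat and crash hit together:
print(group_hits({36, 42, 49}))   # {'kick', 'hihat', 'cymbal'}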

Guitars & Bass

After transposition, guitar and bass tracks still had some outlying pitches. Removing outlying pitches for these instruments is much harder than for drums. That is because simply removing notums with outlying pitches would lead to the introduction of odd-placed rests. Instead, these pitches had to be manually shifted up or down, which meant that the songs were changed. This manual shift had to be done carefully so that the songs would still sound good. Some pitches could therefore not easily be removed. After the manual shifts, each pitch that was kept occurred at least 50 times, except for MIDI pitch value 93 for the lead guitar (occurring only 18 times) and MIDI pitch value 62 for bass (occurring only 45 times)4 5.

After removing outlying pitches, there were 62 different pitches for the lead guitar, 52 for the rhythm guitar and 40 for the bass guitar. The resulting pitch occurrences for both guitars are shown in Figure 4.7 and for bass in Figure 4.8.

Obviously, there is still a big imbalance in pitches; however, this is unavoidable when dealing with music generation. Some pitches, or more specifically elements in a scale, are more prevalent than others in music, and therefore the data set will reflect this. MIDI pitch values 40 on guitar and 28 on bass occur most often and correspond to the root note of the scale (after making abstraction of pitches by transposing).

4 Manually shifting pitches is a time-consuming and hard task. After having removed some outlying pitches, it was therefore chosen to stop the removal process.

5 Many music generation models do not handle outlying pitches and instead allow the full range of MIDI pitch values from 0 to 127.

Figure 4.5: (a) Drum pitch occurrences after reducing the drum set. (b) Drum pitch occurrences after reducing the drum set, capped at 30,000.


4.6 Preprocessing Pipelines

After transposition and the manual improvements, the data set can be put through further preprocessing stages. This section will list the preprocessing stages that were created. Some of them only apply to the prediction of certain instruments.

4.6.1 Cleaning Stage

The first cleaning stage consists of two different steps. First off, the drums are reduced as described in Section 4.5.2. That section mentioned that reducing drum pitches could be done programmatically, which is what this step does.

Secondly, bass tracks are turned monophonic, i.e. chords are turned into single notes. This is done by only keeping the lowest pitch of chord notums. The reason for turning bass monophonic is that out of all 356,847 notums in the data set’s bass tracks, only 668 (or 0.187%) are chords. Dealing with polyphonic tracks is more difficult than dealing with monophonic tracks. Since chords are only rarely used for bass tracks, the decision to turn them into single notes can be justified6.

6 As mentioned in Section 4.2, the enotum representation allows a track to be made monophonic directly. The enotum representation was, however, not used in this step because for some instruments, models will be trained using notums and not enotums.

Figure 4.6: Drum pitch occurrences after maximally reducing the drum set.

Figure 4.7: Guitar pitch occurrences after removing rare pitches.

Figure 4.8: Bass pitch occurrences after removing rare pitches.

4.6.2 Conversion to Enotums

This subsection’s title says it all: this stage converts notums to enotums. Remember that enotums are not used for drums and have no real advantage for a bass track that is already monophonic. Therefore, actually only the guitars are really converted from notums to enotums.

This is immediately followed by a second cleaning stage that removes outlying enotum modifiers. Originally the lead guitar has 288 different modifiers over the entire data set, while the rhythm guitar even has 336 different modifiers. Remember that each modifier corresponds to a different version of a chord type. The large numbers of modifiers indicate that many different chord types and versions are used in metal music.

Outliers are removed in two rounds. In the first round, each modifier that occurs 10 or fewer times for a specific guitar is changed to a similar modifier that occurs more than 10 times7. This leads to a reduction of 116 modifiers for the lead guitar and 127 for the rhythm guitar. Since this first round changes outlying modifiers to other existing modifiers, these existing modifiers will now have higher occurrence numbers.

The second round will then do the same thing for modifiers that occur fewer than 20 times after the first round has been executed. This leads to a further reduction of 34 modifiers for the lead guitar and 30 for the rhythm guitar.

After the two reduction rounds, there are still many modifiers. An extreme reduction operation can then still be performed by reducing modifiers to chord types, i.e. to sets that consist of congruence classes [1]12 to [11]12. Such a reduction leads to 56 chord types for the lead guitar and 71 for the rhythm guitar.
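This chord-type reduction amounts to taking every interval modulo 12 and dropping the [0]12 class, as in the following sketch:

def to_chord_type(modifier):
    """Reduce an enotum modifier to its chord type (congruence classes modulo 12)."""
    return frozenset(interval % 12 for interval in modifier if interval % 12 != 0)

# Both versions of a major chord from Section 4.2 reduce to the same chord type:
print(to_chord_type({4, 7}))             # frozenset({4, 7})
print(to_chord_type({7, 12, 16, 19}))    # frozenset({4, 7})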

Table 4.5 presents a summary of the numbers of modifiers after each reduction operation.

Reduction steps          | Lead guitar | Rhythm guitar
No reduction             | 288         | 336
After first reduction    | 172         | 209
After both reductions    | 138         | 179
Only keeping chord types | 56          | 71

Table 4.5: Overview of the modifier reduction operations and their respective numbers of different modifiers per guitar.

7 These similar modifiers were manually chosen.


4.6.3 Conversion to Time Steps

To synchronize the different tracks of a song, a conversion to time steps can be made. All time steps need to have a set duration, which should be the greatest common divisor of all possible (e)notum durations. From Table 4.3 it can be seen that the duration of a single time step should thus be 20 MIDI ticks.

The conversion to time steps is then done by turning the duration of each (e)notum to 20 ticks and inserting hold (e)notums to compensate. For rest (e)notums, extra rest (e)notums can be inserted instead of hold (e)notums, as explained in the beginning of Section 4.5.1. After this conversion, all (e)notums have the same duration and therefore their duration attribute can be removed.

Notice that this conversion leads to a big increase in the numbers of hold and rest (e)notums, while the numbers of single note and chord notums remain the same.
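A sketch of this time step conversion is given below; the hold and rest values 129 and 128 follow the conventions from earlier in this chapter, and the function name is illustrative.

STEP = 20            # greatest common divisor of all remaining durations
HOLD, REST = 129, 128

def to_time_steps(notums):
    """Expand (pitch set, duration) pairs into equal 20-tick time steps."""
    steps = []
    for pitches, duration in notums:
        n = duration // STEP
        # Rests are padded with extra rests, everything else with holds.
        filler = {REST} if not pitches or pitches == {REST} else {HOLD}
        steps.append(pitches)
        steps.extend([filler] * (n - 1))
    return steps

# A quarter-note power chord becomes 24 steps: the chord itself plus 23 holds.
print(len(to_time_steps([({40, 47}, 480)])))   # 24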

When time steps are used, all instrument tracks of a song have the same number of time steps (otherwise synchronization would not be possible). This is different from the case when note steps are used. Table 4.6 shows different statistics regarding the number of time steps for the songs in the data set. The number of time steps is a different indication of the length of a song than the number of notums. One notum with a long duration can lead to many more time steps than ten notums with very short durations. The actual time that a song takes in seconds, however, still depends on the tempo of the song. Comparing this table with Table 4.1 should mainly indicate the difference in required processing steps between using time steps and using note steps.

Mean  | Standard deviation | Min. | Max.
15767 | 5338               | 5652 | 37116

Table 4.6: Mean, standard deviation, minimum and maximum number of time steps for the songs in the data set.

4.6.4 Conversion to Integer & One-Hot

The final preprocessing stage is used to convert (e)notums to suitable inputs and outputs for neural networks. Since (e)notum attributes are discrete values, they have to be converted to one-hot notation. One-hot notation represents each possible value as a binary array that has only a single value equal to 1 and all others equal to 0. The index of the value that equals 1 corresponds to the original discrete value. The length of a one-hot array is thus equal to the number of possible discrete values.

Before converting to one-hot, a first conversion should be made to the integer notation. Values in this notation are simply the indices that equal 1 in the one-hot notation.

For enotums, the conversion to integer or one-hot notation is straightforward. Each enotum duration, pitch and modifier is converted to an integer or one-hot array that corresponds to its value. Notice that each modifier is a set, but each different set is mapped to a different value.

For notums, there is a difference. The conversion of notum durations is the same as for enotum durations. Notum pitch sets, however, are not converted to a single value like enotum modifiers. Instead, a pitch set is converted to a multi-hot notation. This is like a one-hot array, but it can have 1s at multiple indices. When converting notum pitch sets, each index in the multi-hot array that has a value of 1 corresponds to a pitch in the pitch set. Converting a notum pitch set to integer notation thus also results in a set of indices, instead of a single index.
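The integer, one-hot and multi-hot conversions can be sketched with NumPy as follows (the vocabulary sizes and indices here are illustrative):

import numpy as np

def one_hot(index, size):
    """One-hot array: a single 1 at the position of the integer value."""
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

def multi_hot(indices, size):
    """Multi-hot array: a 1 at every index in the (pitch) set."""
    vec = np.zeros(size)
    vec[list(indices)] = 1.0
    return vec

print(one_hot(11, 25).argmax())        # back to integer notation: 11
print(multi_hot({3, 10}, 62).sum())    # a two-pitch chord sets two entries to 1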


Chapter 5

An Approach For Multi-Instrument Music Generation

This thesis focuses on multi-instrument music generation, more specifically for metal music. As mentioned in Section 2.6.2, the generation of metal music should start with the lead guitar. The approach taken for generating multi-instrument music will therefore be a sequential one. First a lead guitar will be generated. Lead guitar information is then used to generate a rhythm guitar track. Both guitars are in turn used to generate the bass track and finally, the guitar tracks and bass track are used for generating drums.

An advantage of this approach is that it can achieve multi-instrument music, while each instrument can also be focused on separately. Besides this, the approach allows various manners of generating accompaniments to already written tracks. A musician could have written only a single lead guitar track and simply generate all other instruments. He/she could also write all guitar and bass tracks him/herself and then use the drum generation to only add drums. It is even possible to generate only a bass guitar and then still put drums written by a human behind this.

It should be noted that in this particular thesis, the full generation of an entire multi-track song will not be done. Instead, instruments like rhythm guitar, bass and drums are generated for existing songs. This is simply to always use optimal (human-made) input tracks to generate an accompaniment for. Generating accompaniments to existing tracks also allows better verification of the individual instrument track generators.


The following sections will describe the approaches that are taken per instrument. Each approach will be based on LSTM networks, taking a sequence of (e)notums as input and generating a single (e)notum.

5.1 Lead Guitar

The generation process starts with the lead guitar. Since the lead guitar track is generated by itself, it does not need to be synchronized with other instrument tracks. This allows the use of a note step temporal scale instead of a time step one (meaning that the time step conversion preprocessing stage is bypassed), and thus requires fewer processing steps.

For lead guitar generation, enotums will be used instead of notums. A sequence of enotums in one-hot encoding will be taken as input to the network, which will output the next enotum attributes in one-hot notation. This means that the network in fact has three outputs: one for the enotum pitch, one for its duration and one for its modifier. Monophonic generation can easily be achieved by omitting the modifier inputs and outputs.

Besides requiring fewer processing steps, the note step temporal scale has an additional advantage. The sequences that are used as input always need to have a set length, i.e. they need to consist of a set number of enotums. An example sequence length could be 200, meaning that 200 consecutive enotums are used to predict the next enotum. 200 enotums can actually capture a long history. Table 4.1 showed that the original average number of enotums in a lead guitar track from the data set equals 1372. This means that using a history of 200 enotums, on average, captures 14.6% of the enotums in a song. If a time step temporal scope were used, a history of 200 time steps would, on average, only correspond to 1.27% of the time steps in a song (since Table 4.6 says that a song on average consists of 15767 time steps).

Being able to use a long history for the input helps with capturing long-term dependencies. Using note steps does, however, mean that the total duration of the history is adaptive. If the input sequence consists of 200 sixteenth notes, its total duration will be half that of when the input sequence consists of 200 eighth notes. This is not necessarily a bad thing. It simply means that the history will be more localized when it comes to duration if shorter durations are present in the input sequence.

A final benefit of using note steps over time steps is that the conversion to time steps leads to a large increase in the number of rest and hold notums. This would lead to much more imbalance, which is very difficult to deal with when generating out of thin air.

5.1.1 Start (E)Notums

The network takes an input sequence of enotums in order to predict the next one. This is the case during both training and generation. By requiring an input sequence of length 200, the network cannot be trained to generate the first 200 enotums, however, since it does not have a suitable input sequence for these notes. The same holds for generation. The network requires 200 enotums that have already been written before it can start generating.

This can be solved by using start enotums. These are enotums that have a separate pitch symbol or value, similar to the rest and hold enotums. The chosen symbol is an ‘S’, and its corresponding pitch value in MIDI notation is 130. Each start enotum also has the same duration (e.g. 240 ticks) and a modifier that is equal to an empty set of intervals.

Each song in the data set will be prepended with a number of start symbols equal to the input sequence length. This allows the network to also generate and be trained on the first enotums of a song.

Start enotums can only occur at the input of a network and not at its output, since the generation of a start symbol is meaningless.

An alternative to prepending start enotums would be prepending rest enotums. Start enotums, however, have the advantage that they can explicitly indicate the beginning of a song, since they only occur before the beginning of a song.

Start enotums or start notums are also used by the other instruments.

5.1.2 Kickstarting

During generation, start symbols could also be intentionally avoided in order to kickstart the generation process. Kickstarting refers to providing an already written sequence of enotums as input to the network. The network will then continue writing on this sequence as a way of completing it. This works because the network simply takes a sequence of enotums with a set length as its input. This sequence can be from any part in the song though.


5.1.3 Model Architecture

The model architecture starts quite simple. It has three one-hot inputs: a pitch input that can take 65 different values (62 pitches and 3 different symbols), a duration input that can take 25 different values and a modifier input that can take 138 different values (modifiers have gone through the first two reduction stages). The extreme reduction of modifiers to chord types was not used, so as to keep the possibility of more interesting modifiers.

Each input takes sequences of length 200. The three different inputs are then concatenated to form sequences of multi-hot arrays with length 65 + 25 + 138 = 228. Each of these multi-hot arrays has exactly three indices with a value of 1, each corresponding to a different original input.

The result of the concatenation is then put through an LSTM layer of 128 hidden units. This means the hidden layer consists of 128 LSTM cells that each output a value. The LSTM layer processes the full sequence of 200 enotums and then outputs a single array of 128 cell outputs. It thus no longer outputs a sequence (LSTM layers do have the ability to output sequences though). The outputs of the LSTM cells correspond to their states after processing the full input sequence.

The LSTM layer is followed by a dropout layer [32]. A dropout layer puts a fraction of the outputs of the previous layer to 0 during training. During each training step a different subset of the previous layer’s outputs is put to 0. This is used as a regularization technique. After the training phase, the dropout layer does nothing. The dropout fraction was initially set at 0.2.

After the dropout layer comes the first output layer, which generates 64 probabilities for the pitch of the output enotum. These probabilities sum to 1 and each of them is for a pitch value that corresponds to the same index in one-hot notation. Notice that the number of output probabilities for pitches is 1 fewer than the number of possible input values, since the start symbol cannot be output.

The resulting output probabilities for the enotum pitch are then concatenated with the output of the dropout layer. The concatenated result is used for the second output layer, which outputs probabilities for the duration of the output enotum. Finally, both the pitch and duration probabilities are concatenated again with the output of the dropout layer for the final output layer. This final layer outputs probabilities for the modifier of the generated enotum.
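The following is a minimal sketch of this architecture; the use of Keras is an assumption (the thesis does not state the framework here), and training settings such as the optimizer are illustrative. Layer sizes follow the text: sequence length 200, 128 LSTM units, dropout 0.2 and the three dependent softmax outputs.

from tensorflow.keras import layers, Model

SEQ_LEN = 200
N_PITCH_IN, N_DUR, N_MOD = 65, 25, 138   # input vocabulary sizes from the text
N_PITCH_OUT = 64                         # the start symbol cannot be generated

pitch_in = layers.Input(shape=(SEQ_LEN, N_PITCH_IN), name="pitch_in")
dur_in = layers.Input(shape=(SEQ_LEN, N_DUR), name="duration_in")
mod_in = layers.Input(shape=(SEQ_LEN, N_MOD), name="modifier_in")

# Concatenate the three one-hot inputs into one multi-hot sequence of length 228.
x = layers.Concatenate()([pitch_in, dur_in, mod_in])
x = layers.LSTM(128)(x)                  # returns only the final state vector
x = layers.Dropout(0.2)(x)

# Dependent outputs: pitch first, then duration given pitch, then modifier.
pitch_out = layers.Dense(N_PITCH_OUT, activation="softmax", name="pitch")(x)
dur_out = layers.Dense(N_DUR, activation="softmax", name="duration")(
    layers.Concatenate()([x, pitch_out]))
mod_out = layers.Dense(N_MOD, activation="softmax", name="modifier")(
    layers.Concatenate()([x, pitch_out, dur_out]))

model = Model([pitch_in, dur_in, mod_in], [pitch_out, dur_out, mod_out])
model.compile(optimizer="adam", loss="categorical_crossentropy")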

The structure of the output layer follows the principle of dependent outputs.


The ordering of the dependencies was chosen according to the following train of thought:

• A sequence of previous enotums can lead to some pitches that would fit well for the next note, while other pitches would be less suited. This follows from harmonic concepts where phrases of pitches already make a listener expect certain following pitches. For this reason, pitch probabilities were selected for the first output layer.

• Depending on the pitch that was chosen, several options for the duration make more sense than others. In metal music it often happens, for example, that the same pitch is hit multiple times with the same duration and, when they are followed by another pitch, this new pitch has a longer duration. This is just one example to illustrate how the selection of the next pitch can affect the duration of the next enotum.

• Depending on the pitch and duration, a modifier can finally be chosen. Each modifier works better with some pitches in the scale than with others, which is why this depends on the pitch probabilities. The dependence on duration is made because shorter durations are generally more likely to lead to single notes, whereas longer durations are more likely to lead to chords.

The values for the sequence length, dropout fraction, number of hidden layers and number of hidden layer units can all be changed. Specific values were used here to offer a more tangible explanation. The model’s architecture is shown in Figure 5.1.

Simplified Baseline Model Architecture

A much simpler architecture was used early on as a baseline. This used monophonic enotums, so the modifier attribute was omitted. It also did not make use of the start symbols and it allowed enotum durations longer than 1920 ticks. This resulted in 56 possible durations.

The architecture is represented in Figure 5.2. Its start is virtually built the same way as for the architecture in Figure 5.1. Its output layers, however, are not dependent anymore. Another difference is that the LSTM layer outputs a sequence here, which propagates to the model outputs, which also deliver sequences.

Figure 5.1: Model architecture for lead guitar generation.

In this sequence-outputting approach, a sequence of 200 enotums is still supplied, but the model generates an output after each enotum. This can lead to better training results [44]. Outputting sequences is, however, harder when using dependent outputs.

The same architecture with single outputs instead of sequence outputs has also been used as a second baseline.

Figure 5.2: Simplified baseline model architecture for lead guitar generation.

5.1.4 Generation

Generation of a sequence of enotums is done in a sequential manner. With each step, the model takes a sequence of 200 enotums as its input and generates output probabilities for the attributes of the next enotum. These probabilities are then sampled to obtain actual values for the generated enotum. This generated enotum is then added at the end of the original input sequence, while the first element of this sequence is removed. The resulting new sequence is then used as an input for the model, in order to generate the following enotum, and so on.
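A sketch of this sliding-window generation loop is given below. The `model` is assumed to be the dependent-output network of Figure 5.1 and `sample` a sampling function such as the temperature sampling described further on; both names are illustrative.

import numpy as np

def generate(model, seed, n_steps, sample, sizes=(65, 25, 138)):
    """seed: list of 200 (pitch, duration, modifier) one-hot vector triples."""
    def one_hot(index, size):
        vec = np.zeros(size)
        vec[index] = 1.0
        return vec

    window, generated = list(seed), []
    for _ in range(n_steps):
        # Build the three (1, 200, size) input arrays from the current window.
        inputs = [np.array([[triple[k] for triple in window]]) for k in range(3)]
        pitch_p, dur_p, mod_p = model.predict(inputs, verbose=0)
        # Sample an index for each attribute of the next enotum.
        idx = (sample(pitch_p[0]), sample(dur_p[0]), sample(mod_p[0]))
        generated.append(idx)
        # Slide the window: drop the oldest enotum, append the new one as one-hots.
        # This assumes output index k maps to input index k (start symbol last).
        window = window[1:] + [tuple(one_hot(i, s) for i, s in zip(idx, sizes))]
    return generated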

When using models that output sequences, like the simplified baseline model, generation can happen very similarly. Probabilities are generated for 200 enotums, but the first 199 of these correspond to the last 199 enotums of the input sequence. Therefore only the last element of the output sequence should be considered. Generation then happens the exact same way as in the case where only a single output is given by the model.

Different generations can be achieved by using different randomizer seeds that influence the sampling.

Temperature Sampling

Temperature sampling is a special way of sampling that adapts the probabilities before actually sampling [45]. The idea is to put each probability p_i through the following function:

\tilde{p}_i = f_\tau(p_i) = \frac{p_i^{1/\tau}}{\sum_j p_j^{1/\tau}}    (5.1)

τ is the temperature. If the temperature equals 1.0, all probabilities remain the same. As the temperature goes to 0, the high probabilities will become even higher and the low probabilities will become lower. Using a temperature close to 0 will result in almost always sampling the most likely value. If a temperature higher than 1.0 is chosen, the probabilities will become more uniform, becoming perfectly uniform as the temperature nears infinity. Temperatures higher than 1.0 will therefore introduce more randomized sampling.
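A sketch of temperature sampling following Equation 5.1 (the function name is illustrative):

import numpy as np

def temperature_sample(probs, tau=1.0, rng=np.random.default_rng()):
    """Sample an index after sharpening (tau < 1) or flattening (tau > 1) probs."""
    scaled = np.asarray(probs, dtype=float) ** (1.0 / tau)
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))

# A temperature close to 0 almost always picks the most likely value:
print(temperature_sample([0.7, 0.2, 0.1], tau=0.1))   # almost always 0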

Different temperatures can be chosen for the different attributes. For example, a high temperature for the pitch attribute can lead to much randomization in pitches, and this can be combined with a low temperature for the durations, which will then mostly follow the highest model output probability.

Each temperature can also be made adaptive, e.g. by increasing it when the same value is generated consecutively. This could be done to force variation if the same value keeps being generated for too many consecutive enotums.

Temperature sampling is also used for all other instruments.

Acceptance Sets

Generation can also make use of acceptance sets. The concept behind acceptance sets is that only a limited set of pitches, durations and modifiers can be used (accepted) in the generated enotums for a track. Moreover, each pitch also has a limited set of modifiers that can be generated together with it.

Acceptance sets were inspired by analyzing the data set. On average, a song in the data set only uses 28 different pitch values (ignoring rest and hold values), 8 different duration values and 7 different modifier values. Acceptance sets force the generated track to exhibit similar characteristics and do not allow it to, e.g., use all 62 different pitches in one track.

Constructing acceptance sets is very simple. For each pitch, duration or modifier value v, its acceptance probability is determined as:

p_{accept,v} = \frac{\#\text{ songs that use value } v}{\#\text{ songs in the data set}}    (5.2)


The probability of not accepting, or rejecting, the value is p_reject,v = 1 − p_accept,v. For each value v, p_accept,v and p_reject,v are sampled to determine whether the value will be allowed in the generated track.

During generation, the model still outputs probabilities for all possible values. Probabilities for the values that are not accepted are simply brought to 0, the remaining probabilities are normalized and then temperature sampling is performed.
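The following sketch shows how such an acceptance set could be built and applied; the data layout (one set of used values per song) and all names are assumptions for illustration.

# Sketch of constructing an acceptance set (Equation 5.2) and masking model outputs.
import numpy as np

def build_acceptance_set(songs, all_values, rng=np.random):
    """Sample, per value, whether it is accepted for the track to be generated."""
    accepted = set()
    for v in all_values:
        p_accept = sum(v in song for song in songs) / len(songs)
        if rng.random() < p_accept:      # keep the value with probability p_accept
            accepted.add(v)
    return accepted

def mask_probabilities(probs, all_values, accepted):
    """Zero out probabilities of rejected values and renormalize before sampling."""
    probs = np.array(probs, dtype=np.float64)
    for i, v in enumerate(all_values):
        if v not in accepted:
            probs[i] = 0.0
    return probs / probs.sum()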

Modifiers are accepted conditionally on the generated pitch. This is done by looking at the entire data set and only allowing modifiers for which an enotum exists in the data set that has both the modifier and the generated pitch. Modifiers can thus only be used with pitches that they have already been used with in the data set.

This puts strong restrictions on modifiers:

• First off, they have to be in the set of sampled accepted modifiers.

• Secondly, they are only accepted conditionally on the generated pitch.

5.2 Bass

Although the next instrument to be generated is the rhythm guitar, bass guitar generation will be discussed first. There are two reasons for this. First off, the rhythm generation approach will be largely based on the bass generation approach. Secondly, bass generation is easier, since bass is a monophonic instrument and because it can be generated conditional on two guitar tracks, instead of only one.

The bass generation approach has several differences from the lead guitar generation approach.

To start with, bass generation will make use of notums instead of enotums. One reason for this is that bass tracks cannot have modifiers (since the bass is assumed to be monophonic). Secondly, the bass generation also uses inputs from already generated or written guitar tracks. Having these guitar tracks in notum notation is clearer for the bass model: the inputs to the model then directly reflect all pitches that are being played, and a bass track will often play one of these pitches, possibly a few octaves higher or lower.


Using notums also makes it possible to reduce the input size by not requiring inputs for modifiers. This reduction in input size is very welcome, since bass generation has to use time steps instead of note steps.

As mentioned earlier, time steps make it possible to synchronize the different instrument tracks, but they also require more generation steps. Besides that, when using time steps, the short duration of the notums requires long input sequences to capture a somewhat decent history. A history of eight whole notes is chosen, which corresponds to 768 time steps of 20 ticks. On average, a history of this length corresponds to 4.87% of the time steps in a song.

Thanks to the availability of guitar tracks, a model for bass generation can use a much more extensive input to aid in generating the next bass notum. This results in four different kinds of input that can all be used together:

• the previous 768 notums of the bass track

• the previous 768 notums of the lead and rhythm guitar tracks

• the current notum of the lead and rhythm guitar tracks

• the following 768 notums of the lead and rhythm guitar tracks, in reversed order (i.e. moving towards the current time step)

By using all this information, the bass model knows what notums have just been played. It also knows the notums that the guitars are going to play at the time step for which a bass notum has to be generated. Bass is closely related to the guitars, so this information is very useful. Finally, the model knows what the guitars are going to play after the bass notum to be predicted. This information can be used to generate a bass notum that anticipates the future guitar notums.

The resulting number of notums used as input to the model is 5 × 768 + 2 = 3,842 notums, which is about 19 times the number of notums used for lead guitar generation. These notums, however, hold no modifier information and also no duration information, since time steps are used. The notums thus only hold pitch values.

End Rests

Section 5.1.1 introduced start notums to allow generation of the first notums in a song. Since the bass model also makes use of future guitar notums, it can only generate the final notes of a song if the guitar tracks are extended with 768 notums.

A designated end notum could have been used for this, but instead the simpler option of rests was used here. In fact, designated start notums are mainly useful for lead guitar generation, since it has to start from nothing. Bass generation can rely on the guitars, and when the future guitar notums are rests, a bass model is expected to anticipate that it should also play rests over those notes. This does not always hold, since there are parts where the bass plays solo, but these parts are few.

5.2.1 Model Architecture

The model architecture for the bass generation model is based on the Bi-LSTM architecture used by DeepBach [4]. There are seven different inputs to the model: two for the sequences of previous lead and rhythm guitar notums, two for the sequences of following lead and rhythm guitar notums, one for the sequence of previous bass notums and two for the current lead and rhythm guitar notums.

The lead guitar notums can have 65 different pitch values, the rhythm guitar notums 55 and the bass notums 43. Each of these numbers also includes rest, hold and start notums. Lead and rhythm guitar inputs that correspond to the same timeframes (previous, current or future notums) are first concatenated.

The concatenated sequential guitar inputs and the sequential bass inputs are then each put through their own LSTM layer with 128 hidden units. This reduces them to single arrays of LSTM outputs. The concatenation of the current guitar notums is put through a fully-connected (dense) layer with 128 hidden units.

Each of these four layers is followed by a dropout layer, and their outputs are concatenated again.

The concatenated outputs are then put through two fully-connected layers (with a dropout layer in between) to finally output the bass pitch probabilities.

The model architecture is shown in Figure 5.3.
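A rough Keras sketch of this architecture is shown below. The layer sizes and vocabulary sizes follow the text; the dropout rate, the hidden activations and all variable names are assumptions for illustration rather than the thesis implementation.

# Sketch of the bass generation model (Keras functional API).
from keras.layers import Input, Concatenate, LSTM, Dense, Dropout
from keras.models import Model

HIST = 768                          # history/future length in time steps
LEAD, RHYTHM, BASS = 65, 55, 43     # one-hot sizes incl. rest/hold/start

prev_lead = Input(shape=(HIST, LEAD))
prev_rhythm = Input(shape=(HIST, RHYTHM))
next_lead = Input(shape=(HIST, LEAD))
next_rhythm = Input(shape=(HIST, RHYTHM))
prev_bass = Input(shape=(HIST, BASS))
cur_lead = Input(shape=(LEAD,))
cur_rhythm = Input(shape=(RHYTHM,))

# One LSTM per time frame (previous guitars, future guitars, previous bass) and
# one dense layer for the current guitar notums, each followed by dropout.
prev_g = Dropout(0.2)(LSTM(128)(Concatenate()([prev_lead, prev_rhythm])))
next_g = Dropout(0.2)(LSTM(128)(Concatenate()([next_lead, next_rhythm])))
prev_b = Dropout(0.2)(LSTM(128)(prev_bass))
cur_g = Dropout(0.2)(Dense(128, activation="relu")(Concatenate()([cur_lead, cur_rhythm])))

merged = Concatenate()([prev_g, next_g, prev_b, cur_g])
hidden = Dropout(0.2)(Dense(128, activation="relu")(merged))
bass_out = Dense(BASS, activation="softmax")(hidden)

model = Model([prev_lead, prev_rhythm, next_lead, next_rhythm,
               prev_bass, cur_lead, cur_rhythm], bass_out)
model.compile(optimizer="adam", loss="categorical_crossentropy")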


Figure 5.3: The bass generation model architecture.

5.2.2 Generation

The generation process is similar to the generation process for lead guitar and also makes use of temperature sampling. Acceptance sets are not used, since the bass pitches can be steered well enough by the guitar pitches. Kickstarted generation is still possible, by providing a set of initial bass notums. Finally, the presented model always works in a single-output manner, due to the dependence on future guitar notums.

5.3 Rhythm Guitar

The model for rhythm guitar generation is only created for monophonic generation and is heavily based on the bass generation model. It is actually an adaptation, where the role of the lead and rhythm guitars in the bass model is taken by just the lead guitar, and the role of the bass guitar is taken by the rhythm guitar.

This leads to the model architecture in Figure 5.4.

Figure 5.4: The rhythm guitar generation model architecture.

5.4 Drums

Finally, the drum generation model is again highly related to the bass model and initially built as an adaptation to handle one more instrument. It takes as its input:

• The previous 768 drum notums.

• The previous 768 lead guitar, rhythm guitar and bass notums.

• The current lead guitar, rhythm guitar and bass notum.


• The following 768 notums of the lead guitar, rhythm guitar and bass tracks.

It makes use of the maximally reduced drum set from Section 4.5.2 and its architecture is shown in Figure 5.5. For the drum history, no designated start symbol is used, since this input is multi-hot coded. For drums, the added start notums are simply rest notums (a multi-hot pitch array containing only zeros).

Figure 5.5: The drum generation model architecture.

It exhibits some important differences from the bass model. First off, the drum track is polyphonic, since it should be possible to hit multiple drums simultaneously. To deal with this polyphonic nature, the model has five different outputs, one for each drum. Each of these outputs is used to predict whether the particular drum should be hit or not. If a particular drum is not hit, this can be regarded as a rest for that drum. If all five drum outputs give a rest, this leads to an actual rest for the entire drum set.

The idea of using separate outputs for each drum also comes from the different characteristics in the use of each drum. Generally, cymbals and hi-hats are used in the most regular patterns, e.g. hitting sequences of quarter or eighth notes. Snare drums are also hit in regular patterns, but deviate from these to make a drum part more interesting. Bass drums, on the other hand, often follow the rhythm of the notes that the guitars play. During breakdowns, they mostly follow the guitar rhythm exactly, while the other drums play regular patterns. Finally, toms are mostly used for fills.

When drums are provided at the input, they are not split over five different model inputs, but simply presented as a multi-hot array of length five.

The inputs that were mentioned earlier go through the same initial stages as for the bass model, i.e. inputs that correspond to the same time frame are concatenated and then go through either a fully-connected or an LSTM layer, followed by a dropout layer.

The outputs of the dropout layers are then concatenated, but that is not all: the drum model defines two extra inputs that are also included in this concatenation.

Time Signature & Time Step Position

These extra inputs correspond to the time signature that is being used and to the position of the current time step within the current bar. These inputs are included because drums are much more concerned with the structure of bars and time signatures. For example, a very typical drum pattern for a 4/4 bar puts cymbals or hi-hats on each quarter note or eighth note, and a snare hit in the middle of the bar (i.e. the third quarter note or the fifth eighth note when dividing the bar in durations of even length). The time signature information and the position within the bar cannot easily be extracted from the guitar or bass tracks and are therefore explicitly included at the input.

Over the entire data set, 25 different time signatures were used (when considering equivalent fractions such as 3/4 and 6/8 as equal). Many of these, however, were only used a small number of times. For the drum approach, songs that included time signatures that occurred fewer than 10 times over the entire data set were ignored¹. This resulted in 255 songs and 8 different time signatures being kept. These time signatures and the number of songs they occur in are shown in Table 5.1.

The time step position corresponds to the index of the current time step within the bar. A regular 4/4 bar has a length of 1920 ticks. Since each time step is 20 ticks, this leads to 1920 / 20 = 96 time step positions in a 4/4 bar. The longest bar length is 7/4, which corresponds to 3360 ticks. Thus, the maximum number of time step positions is 3360 / 20 = 168.
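A small sketch of how this position can be derived is given below; the 20-tick time step and the bar lengths follow the text (1920 ticks per 4/4 bar implies 480 ticks per quarter note), while the function names are illustrative.

# Sketch of deriving the time step position within a bar.
TICKS_PER_TIME_STEP = 20
TICKS_PER_QUARTER = 480                       # a 4/4 bar of 1920 ticks -> 480 per quarter

def bar_length_ticks(numerator, denominator):
    """Length of one bar in ticks, e.g. 4/4 -> 1920, 7/4 -> 3360."""
    return numerator * TICKS_PER_QUARTER * 4 // denominator

def time_step_position(tick, numerator, denominator):
    """0-based index of the current time step within its bar (at most 167 for 7/4)."""
    return (tick % bar_length_ticks(numerator, denominator)) // TICKS_PER_TIME_STEP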

¹ Remember that time signature changes can also happen within a song.


Time signature   Number of songs
1/4              13
2/4              83
3/4              65
7/8              19
4/4              253
5/4              35
6/4              31
7/4              15

Table 5.1: Time signatures that are kept in the drum data set and the number of songs they occur in.

The time signature information is extracted from the MIDI files and added to the data set as a separate track. This track represents, for each time step, the time signature that is currently being used. Using the time signature information, the time step position can also be determined for each time step, and this is added to the data set as another separate track. When converting to one-hot, time signatures are represented as one-hot arrays of length 8, while time step positions are represented as one-hot arrays of length 168.

(It should be noted that the lead guitar generation model does not generate time signatures and these should thus currently be added manually if all generation models are used together.)

After concatenating the time signature and time step position inputs to the outputs of the dropout layers, the result goes through one fully-connected layer followed by a dropout layer. The output of this dropout layer is then fed to five different fully-connected output layers that generate the probability of a hit for each particular drum.

Dealing with Rest Imbalance

Compared to the guitar and bass tracks, the drum tracks have a huge number of rest notums. As mentioned earlier, converting to time steps leads to a large increase in rest and hold notums. However, hold notums are meaningless for drums (since drums are simply hit), and are therefore replaced by rest notums. This results in an enormous number of rest notums (3,610,590) compared to non-rest notums (374,500) over all drum tracks in the drum data set.


One of the ways to deal with this was to use the maximally reduced drum set, since this makes each individual drum hit more often. This does not change the ratio of non-rest to rest notums, however.

The key to dealing with this imbalance is to differentiate rest outputs. Normally, each drum output would only be able to represent two values: a hit or a rest. Differentiating rest outputs leads to the use of 1 hit value and 24 different rest values (this number could also be adapted). Each of these rest values could be named differently, e.g. rest1, rest2, ..., rest24.

In the training outputs, the original rest values are replaced by the differentiated rest values in the following way:

A rest immediately after a hit is turned into a rest1. A rest after a rest1 is turned into a rest2, and so on. A rest after a rest24 is turned into a rest1 again. Figure 5.6 presents an overview of the resulting differentiated rest occurrences.
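This relabeling can be sketched as follows; the per-drum track is assumed to be a simple list of "hit"/"rest" labels, which is an illustrative simplification of the actual data representation.

# Sketch of rest differentiation for the drum training targets.
N_REST_CLASSES = 24

def differentiate_rests(track):
    """Replace plain rests with rest1..rest24, cycling back to rest1 after rest24."""
    result = []
    counter = 0                                    # rests since the last hit, modulo 24
    for label in track:
        if label == "hit":
            counter = 0
            result.append("hit")
        else:
            counter = counter % N_REST_CLASSES + 1  # 1, 2, ..., 24, then 1 again
            result.append("rest{}".format(counter))
    return result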

Figure 5.6: Occurrence numbers for the differentiated rest values in the drum data set.

The highest occurrence number for a single differentiated rest value is 388,670, which is similar to the number of non-rest notums. The differentiated rest notums are placed in this structured ordering instead of a randomized one so that the model can also learn which differentiated rest notums it should output. This is important for training, since it better allows the model to minimize its loss. If the differentiated rests were placed in a randomized order, it would be much harder for the model to correctly predict them. This would lead to higher losses, and in trying to decrease these losses, the model could start training in a way that leads to worse prediction of the drum hits.

The differentiated rests are only relevant during training, to let the model train better for the hit outputs. During generation, for each output layer the 24 rest probabilities and the single hit probability are sampled. If the result of this sampling is a hit, a simple hit is generated. If the result is a differentiated rest, a regular rest is generated for that drum. Combining the results of all outputs then leads to a multi-hot array of length 5 that represents the drums that are hit at the current time step, or a rest in case the multi-hot array only contains zeros. This multi-hot array can then be used as an input for the prediction of the next drum notum.


Chapter 6

Implementation

Each of the models was implemented in Python 3.6 using the Keras 2.2.4 library [46] with a Tensorflow 1.13.1 [47] back-end. All preprocessing was done in Python as well, making use of the python-MIDI library [48] for MIDI-related operations. Each preprocessing step was also accompanied by a reversed version, so that model outputs could be transformed to MIDI files. The output MIDI files could be opened and played back with Guitar Pro 7.5 [13] or Logic Pro X [10]. Both these programs were used for this purpose. Guitar Pro 7.5 was particularly useful for visualizing the MIDI files as music scores and tablatures. It also offered decent guitar and drum sounds. Logic Pro X was able to provide better bass sounds.

Since this thesis focuses on the generation of multiple instruments, their specific models have not all been optimized to the point where they result in very good generations. Instead, the major challenges that came with each instrument were dealt with in order to achieve decent generations. The ways to deal with these challenges have already been explained in Chapter 5.

To increase processing speed, two Nvidia GeForce GTX 980 GPUs were used.

6.1 Shared Details of the Models

Each model made use of an Adam optimizer with a learning rate of 0.001 and a decay of 10⁻⁶. For the outputs, a softmax activation function was used, shown in Equation (6.1). This activation function turns multi-class outputs into probabilities. The number of classes is denoted by C in the equation, whereas s relates to the output of the model before applying the softmax activation function.

p_i = f(s)_i = \frac{e^{s_i}}{\sum_{j}^{C} e^{s_j}}    (6.1)

For the loss function, categorical cross-entropy was used, shown in Equation (6.2). This takes the softmax probability of the target output and then takes the negative logarithm of it. The model is optimized by reducing this loss function.

CE = -\log\left(\frac{e^{s_t}}{\sum_{j}^{C} e^{s_j}}\right)    (6.2)

Models that had multiple output layers also had a total loss value, which is the average of the losses of their different outputs, and is the actual loss that is minimized (as opposed to each individual loss being minimized). The individual losses could be given weights in order to make the total loss more focused on a particular output's loss.
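In Keras, this shared configuration could be expressed roughly as sketched below, for a model with named outputs such as the baseline sketch earlier; the helper name and the example loss weights are placeholders. Note that Keras minimizes the weighted sum of the output losses, so equal weights correspond to the averaged total loss up to a constant factor.

# Sketch of the shared optimizer/loss configuration.
from keras.optimizers import Adam

def compile_model(model, loss_weights=None):
    """Compile a (multi-output) Keras model with the shared settings from the text."""
    model.compile(
        optimizer=Adam(lr=0.001, decay=1e-6),
        loss="categorical_crossentropy",     # Equation (6.2) per output
        loss_weights=loss_weights,           # e.g. {"pitch_out": 1.0, "duration_out": 1.0}
    )
    return model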

The initial data set was first split into a training and a test set, with the test set including 30 songs, i.e. 10% of the initial data set. During training, the training set was then further split into the actual training set (80%) and a validation set (20%). The actual training set thus included 216 songs, while the validation set included 54 songs. The splits were made in a semi-balanced way, balanced meaning that the distribution of all values is similar among the different sets. The splits were only semi-balanced, since when dealing with songs it is not always possible to achieve perfect balance for the more infrequent values. However, all possible values were at least contained in the training set.

For drum generation, a reduced data set was used with only 255 songs. The initial test set was reused, ignoring the 4 songs that were not part of the drum data set.

Training and validation were done in batches of 200 samples. A sample refers to a randomly selected input sequence and its corresponding output. Each training epoch consisted of 100 batches, meaning that 20,000 samples were used per training epoch. The validation phase after each epoch used 25 batches, or 5,000 samples, to calculate a validation loss which should be minimized to avoid overfitting.


Models were trained using early stopping, i.e. after each epoch the model is saved and the final version of the model that is used is the one with the lowest validation loss. A maximum of 100 epochs was chosen, but if the validation loss did not improve for 20 consecutive epochs, training was stopped.
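These settings map onto the standard Keras training utilities roughly as sketched below; the file path, the function name and the generator objects are placeholders.

# Sketch of the training loop with checkpointing and early stopping.
from keras.callbacks import EarlyStopping, ModelCheckpoint

def train(model, train_generator, val_generator):
    """Train with the batch/epoch settings and early stopping described in the text."""
    callbacks = [
        ModelCheckpoint("best_model.h5", monitor="val_loss", save_best_only=True),
        EarlyStopping(monitor="val_loss", patience=20),
    ]
    model.fit_generator(
        train_generator,            # yields batches of 200 random samples
        steps_per_epoch=100,        # 100 batches = 20,000 samples per epoch
        epochs=100,
        validation_data=val_generator,
        validation_steps=25,        # 25 batches = 5,000 validation samples
        callbacks=callbacks,
    )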

6.2 Lead Guitar

6.2.1 Simple Baseline Model

The first lead guitar model that was trained was the simple baseline model.

For the initial model, input sequences of length 200 were used. A single LSTM layer was used with 128 hidden units, and the model's output consisted of sequences, instead of only a single output. This resulted in a total validation loss of 1.20 after 36 epochs, whereas the total training loss was 1.01. The validation loss for pitch prediction was 1.84 and its training loss was 1.53. For the duration prediction, the validation loss was 0.56 and the training loss 0.49. The validation accuracy of pitch prediction was 46.6% and the training accuracy was 54.0%. For duration prediction, the validation accuracy was 82.9% and the training accuracy 84.4%.

The evolution of these values over the different epochs is shown in Figure 6.1, Figure 6.2 and Figure 6.3. As can be seen, the model quickly reaches a decent level for these values and then improves slightly over the following epochs.

Figure 6.1: Total training and validation loss curves for the simple model.


Figure 6.2: Pitch and duration training and validation loss curves for the simple model.

Figure 6.3: Pitch and duration training and validation accuracy curves for the simple model.

All these values are hard to interpret, however. Of course, reaching a high accuracy or a low loss is good, but in music generation multiple different pitches and durations can be valid when generating the next notum. What can be concluded, however, is that the model is better trained at generating the same duration that an original song used than it is for pitch generation. This can be expected, since rhythm has a more predictable structure than melody.

Table 6.1 presents an overview of variations that were made, their resulting minimal total losses and the epoch at which this loss was achieved. The 'Single Output' column refers to whether a single output was used instead of sequences.

Layers   Hidden Units   Sample Length   Batch Size   Single Output   Total Loss   Epoch
1        128            200             200          no              1.20         36
2*       128            200             200          no              1.20         22
1        256*           200             200          no              1.22         13
1        64*            200             200          no              1.18         45
1        128            200             400*         no              1.21         25
1        128            100*            400*         no              1.20         35
1        128            200             200          yes*            1.16         80

Table 6.1: Summary of total losses for different configurations of the simple baseline model. Values that differ from the original model configuration are marked with an asterisk.

This table shows that varying the different parameters does not really influence the total loss much. The ratio of the individual losses and accuracies also remained similar. Using fewer hidden units led to a slightly lower total validation loss. The use of a single output also led to a slight decrease in the loss. This can be explained by the model being able to focus more on this single output, which can use the full history. When outputting sequences, the loss is calculated over all 200 elements of the output, of which some can only use a short history.

It is also important to see how the number of epochs changes. Using more complex models, such as models with 2 layers or 256 hidden units, leads to fewer epochs until the minimal validation loss is achieved. This could, however, indicate that a suboptimal solution is found and that the model gets stuck and cannot improve further. The single output model needs many more epochs, indicating a more gradual improvement. This ties in with the model being able to focus more on its single output and thus less easily getting stuck.

The simple baseline architecture was also expanded to include modifiers. The resulting losses and epochs are shown in Table 6.2. This model exhibits similar behavior when parameters are changed. The total loss for these models seems smaller; however, this is because it takes the average over three losses: pitch, duration and modifier loss. A relatively high pitch loss therefore only counts for 33% instead of 50% of the total loss.


Layers   Hidden Units   Sample Length   Batch Size   Single Output   Total Loss   Epoch
1        128            200             200          no              1.00         43
2*       128            200             200          no              1.02         35
1        256*           200             200          no              1.00         18
1        128            200             400*         no              1.00         34
1        128            100*            400*         no              1.00         38
1        128            200             200          yes*            0.97         42

Table 6.2: Summary of total losses for different configurations of the simple baseline model with modifiers. Values that differ from the original model configuration are marked with an asterisk.

Generation

While losses give an objective indication of the quality of the models, music quality is subjective and therefore a subjective analysis is required to rate the quality of each model.

Sampling was done using the following different temperatures (Section 5.1.4) for the different model outputs: 0.1, 0.5, 1.0 and 1.5.

The chosen temperatures had specific effects on the different outputs, which were common to all models:

• Sampling temperatures of 0.1 lead to very boring generations that mainly consist of the same value being repeated over and over again.

• Sampling temperatures of 1.5 are too extreme and lead to too much randomness.

• Sampling temperatures of 0.5 can be useful for durations. The resulting generations often mainly use eighth notes; however, relatively long sequences of eighth notes do often occur in metal music. Using this temperature, the duration value will also occasionally be different, making the generation more interesting. For pitches and modifiers, this temperature is still too low. Long sequences of eighth note durations can be acceptable, but long sequences of the same pitch are very boring. Long sequences of the same modifier (usually the single note modifier) are acceptable, but the use of different modifiers (i.e. chords) is too infrequent.

• Sampling temperatures of 1.0 offer quite a lot of variation without going too extreme. Sequences of different pitches are present and there is more variation in duration. Modifiers are also more present.

The conclusion about these sampling temperatures is that the temperature for duration should be chosen between 0.5 and 1.0, for pitch at or slightly below 1.0 and for modifiers around 1.0.

The original simple baseline model actually led to surprisingly decent results, without using acceptance sets. Variations of the model offered decent results as well, some similar in quality (such as the 64 hidden units model), others worse. Since the simple baseline model did not use start symbols, it always takes some notums before a decent generation starts.

Some subjective conclusions on the best variations will now be given.

The original model is able to generate music that varies between different sections of a metal song. Figure 6.4 shows an example of a typical metalcore verse riff that is generated by the model when using a pitch temperature of 1.0 and a duration temperature of 0.5 (the original model does not use modifiers). What typifies these verse riffs is the alternation between open notes on the lowest string (i.e. the root note in this case) and other notes that are not very high or low pitched, with all these notes having the duration of an eighth note. Figure 6.5 shows an example from later in the same generation, where the guitar first plays a melody consisting of higher-pitched notes. This is then followed by a small part leading into a (here rather simple) breakdown.

Figure 6.4: Verse riff generated by the original simple model with pitch temperature 1.0 and duration temperature 0.5.

The full generation, consisting of 500 notums, can be found at https://tinyurl.com/yxjckw9d. The tempo was set at 180 bpm and the root note at C2.


Figure 6.5: Melody going over into a small breakdown, generated by the original simple model with pitch temperature 1.0 and duration temperature 0.5.

While the generated music does not really repeat specific sections, the fact that it knows to alternate between verse-like riffs, melodies and breakdowns is already quite good. This indicates that a sequence of 200 input notums captures a good history.

Using only 64 hidden units yielded quite decent generations, with melodies being more melodically consistent, i.e. using pitches that were closer to each other. However, this model was not as good at generating different song sections, and instead mainly generated melody and occasionally an almost verse-like riff.

The simple model with a single output, while having a better validation loss, yielded results that were slightly worse than the original results. This model would more frequently generate long patterns of the same note and generated fewer clearly distinct song sections. It was, however, able to generate some decent riffs that were more like Metallica music.

Variations of the simple model that used modifiers did not yield decent results. The modifiers were always placed rather randomly and did not make sense. Of course, the modifier probabilities are not generated conditionally in the simple model.

The original baseline model was also tested in combination with a kickstart. For this, the first 200 notums of an existing song were used, and the model generated further notes from there. The provided input from the original song helped the model to generate decent music. However, this generated music did not really form a logical continuation of the supplied sequence. It used similar pitches, but it did not create riffs like the ones that were provided.

Links to further samples from these models will be provided in Chapter 7.

6.2.2 Advanced Model

The advanced model that was explained in Section 5.1.3 was able to reach a validation loss of 0.95 after training for 78 epochs. Its total loss curve is shown in Figure 6.6, while the individual loss and accuracy curves are shown in Figure 6.7 and Figure 6.8.

Figure 6.6: Total training and validation loss curves for the advanced model.

The parameter values that were presented in Section 5.1.3 were also used here and generation made use of an acceptance set. The model created the best generations (subjectively) when using sample temperatures of 0.8 for the pitch values, 0.9 for durations and 1.0 for modifiers.

Initial generations with this model suffered from long repetitions of the same note. This was remedied by using adaptive sample temperatures: a temperature was multiplied by 1.001 whenever its corresponding output repeated a value. Once a different value was generated, the temperature was reset to its original value.
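This adaptive rule can be sketched as a small helper; the class and attribute names are illustrative.

# Sketch of the adaptive temperature rule.
class AdaptiveTemperature:
    """Multiply the temperature by 1.001 for every repeated value, reset on change."""

    def __init__(self, base_temperature):
        self.base = base_temperature
        self.current = base_temperature
        self.last_value = None

    def update(self, value):
        if value == self.last_value:
            self.current *= 1.001        # nudge towards more random sampling
        else:
            self.current = self.base     # reset once a different value appears
        self.last_value = value
        return self.current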


Figure 6.7: Pitch, duration and modifier training and validation loss curves for the advanced model.

Figure 6.8: Pitch, duration and modifier training and validation accuracy curves for the advanced model.

This model was able to create more interesting melodies and riffs than the simple baseline model. 'More interesting' relates to a better capability to create well-sounding variation in pitches, durations and modifiers. The baseline model was also able to create variations in pitches and durations, but these were usually not as good, especially if the duration also changed. The conditional dependence of duration on pitch helps with creating meaningful variations.


The model was also better at generating modifiers that made sense. This is both thanks to the dependency in the model's architecture, which helped in predicting well-placed modifiers, and thanks to the use of acceptance sets, which avoided using bad modifiers (bad meaning that they would not sound good with the current pitch). However, modifiers were still not used very much. This can be explained by the number of possible modifiers being too large.

The model was not as good as the baseline model at generating different song sections. The generations mostly consisted of long melody parts and some riffs. Breakdowns were not really generated, and good riffs were occasionally followed by parts that were not very good.

For both the simple model and the advanced model, generation can be done very fast, at a speed of around 10 notums per second. A lead guitar track with the average number of 1,372 notums would thus only take a little more than 2 minutes.

6.3 Bass

Before discussing bass generation, an evaluation measure called the similarity will be introduced.

6.3.1 Similarity

The similarity is an objective measure that can be used for bass generation. The model is tested by generating a bass track for a song from the test set. The similarity measure is then used to see how similar the generated bass track is to the original track.

It is calculated as the percentage of generated notums that are equal to their corresponding notum in the original track (where the notums are considered after converting to time steps). Special care is taken with hold notums. Hold notums do not indicate their associated pitch. Thus, if the generated and original track both have a hold notum at a certain time step, it should also be verified that these hold notums effectively correspond to the same pitch being held. This can easily be verified by adding information about the held pitches to the hold notums when the similarity is being calculated.
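A minimal sketch of this measure is given below, assuming both tracks have already been converted to equal-length time step sequences in which hold notums carry their held pitch.

# Sketch of the similarity measure between a generated and an original track.
def similarity(generated, original):
    """Percentage of time steps whose notum (including held pitch) matches the original."""
    assert len(generated) == len(original)
    matches = sum(g == o for g, o in zip(generated, original))
    return 100.0 * matches / len(original)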


While similarity is an objective measure, it is not a fully objective quality measure. A lot of variation in music is possible, and therefore the generated bass tracks could sound very good even when they play entirely different notums than the original track. A low similarity thus holds no qualitative information. A high similarity, however, does indicate good quality. If the similarity equals 100%, the model would have generated exactly the same notums that a human has written. Therefore, a high similarity indicates that the model achieves human quality. (This of course assumes that the human-written tracks were of good quality.)

Even with a high similarity, though, the parts that are not similar should be examined.

6.3.2 Proposed Bass Network

The proposed bass model had only a single loss value, the pitch loss. After 90 epochs, the model was able to reach a validation loss of 0.07. Figure 6.9a shows the evolution of the training and validation loss over the different epochs. It can be seen that there is a steady decrease in loss, meaning that the model keeps improving. Training and validation accuracies are shown in Figure 6.9b. This graph indicates that the model is able to reach a good prediction accuracy already after the first few epochs. Over the course of the epochs, both the training and validation accuracies stay around 98%. It should, however, be noted that since the model makes use of time steps, many rests and holds are present in the data. If the model is able to correctly generate these, it can already achieve a high accuracy without generating the other pitches correctly.

Generating bass tracks has been shown to yield the best results when using low sampling temperatures (around 0.05). This makes the model almost exclusively generate the pitches that have the highest probabilities. These temperatures not only subjectively lead to the best results, but they also yield the highest similarity measures.

Some similarities that were achieved with songs from the test set are listed in Table 6.3.

As can be seen, some songs achieve high similarities above 80%, while others have lower similarities.

(a) Training and validation loss curves for the proposed bass model.

(b) Training and validation accuracy curves for the proposed bass model.

Figure 6.9

Analyzing these lower similarities resulted in the following observations:

• The song 'Echoes' by August Burns Red has a part where the original bass closely follows the melody of the lead guitar, while the generated bass track follows the rhythm guitar during this part. Subjectively, both options are plausible though.

• During the solo of 'Creeping Death' by Metallica, the original bass plays a more rhythmic sequence, whereas the generated bass plays long notes. While both are plausible, the rhythmic sequence from the original bass track does subjectively sound better.


Artist - Song                    Similarity (%)
As I Lay Dying - 94 Hours        82.4
August Burns Red - Echoes        61.5
ERRA - White Noise               86.3
Metallica - Creeping Death       75.9
Parkway Drive - Deliver Me       85.0

Table 6.3: Similarities between generated and original bass tracks for five different songs.

An important achievement of this model is that it is able to provide proper rhythmic bass lines when both guitars are playing melodies. Earlier iterations of the model would either stop playing during such parts, or they would try to follow the melody, which did not make sense for the bass track. These earlier iterations were also sometimes following the rhythm guitar and then the lead guitar, even following the lead guitar during solos, which is not supposed to happen. The presented bass model is, however, able to understand its role as part of the rhythm section: often following the rhythm guitar (as is often done in metal), creating slight but well-suited variations on the rhythm guitar and, finally, offering a rhythm section when both guitars are occupied with playing melodies.

The model is not without its flaws, though. When the lead guitar and rhythm guitar tracks switch their parts, the bass model will follow the rhythm track, which might now be playing a melody that should not be played by the bass.

Training the model takes a long time, around 8 minutes per epoch, resulting in a training time of 13.3 hours when 100 epochs are performed. An even bigger drawback is the generation time, which takes around 1.5 seconds per time step. With the average song length being 15,767 time steps, this leads to an average generation time of 394 minutes, or around 6.5 hours. Because of this long generation time and the already quite good results of the bass model, it was not optimized further for now. This was also because models for the other instruments had to be trained as well.

6.4 Rhythm Guitar

Rhythm guitar generation was explored last and has therefore received less attention. Using the model that was proposed in Section 5.3, decent to mediocre monophonic rhythm guitar parts could be generated.

For some parts, the generated rhythm guitar copied the lead guitar, which is often allowed in metal. For other parts it was able to provide some rhythm over a melody played by the lead guitar.

It did experience various difficulties though:

• In many songs, there are parts where first both guitars play the same riff and then the lead guitar starts playing a melody while the rhythm guitar continues playing the previous riff. The generated rhythm guitars did not grasp this concept and instead started playing quite basic rhythm parts when the lead guitar began playing the melody.

• An important part of metal is the use of harmonies between both guitars; however, the generated rhythm guitar only rarely harmonized the lead guitar, and never for more than a couple of notes.

• The rhythm guitar generation often got stuck in a very long hold or a very long rest. This usually started during lead guitar melodies, but would even continue over the following regular riffs, which would usually be played by both guitars.

Rhythm guitar training took around 5 minutes per epoch and already stopped after 22 epochs, achieving a loss of 0.14 and an accuracy of 96%. The results of the subjective analysis, however, clearly show that this accuracy mostly comes from rest and hold notums. Generation takes around 1 second per time step, resulting in a generation time of around 263 minutes or 4.4 hours.

6.5 Drums

Training the drum model from Section 5.4 resulted in a total loss of 0.73 after 66 epochs. The evolution of this total loss over the different epochs is shown in Figure 6.10. Individual losses and accuracies for each drum are shown in Figure 6.11 and Figure 6.12.

Training takes about 18 minutes per epoch, leading to a training time of 1,188 minutes or almost 20 hours.

As can be seen from the individual curves, the model has some problems with correctly generating hi-hats, and really struggles with toms.


Figure 6.10: Training and validation loss curves for the proposed drum model.

Figure 6.11: Individual training and validation loss curves for the proposed drum model.

A version with 2 hidden layers and 100 hidden units per layer¹ was also trained. This resulted in a total loss of 0.77 and did not offer better subjective results.

Another version that doubled the weights of the hi-hat and tom losses was also trained. This led to a total loss of 0.67 after only 17 epochs. The hi-hat and tom accuracies, however, did not really improve much, while the accuracies of the other drums went down.

¹ Using 128 hidden units resulted in a model size that led to difficulties with the GPUs.


Figure 6.12: Individual training and validation accuracy curves for the proposed drum model.

The use of differentiated rests is very helpful. Earlier approaches resulted in generations that only produced bass drum hits, and put these on every single guitar note. Occasionally these generations would throw in a crash or snare hit, but this was rare.

With the differentiated rests, however, the model is able to generate regular drum patterns, mostly consisting of snare and cymbal hits. This is also helped by the use of the time signature and time step position inputs. The model realizes when it should pause, e.g. at an interlude in the middle of a song, where only the lead guitar should be playing. During these parts, the model will occasionally generate some extra drum hits (that sometimes make sense), but not play a standard drum beat, which is good. The bass drum is also not hit on every guitar note, but rather on the guitar notes where it makes sense.

An example of some generated drums for the song '94 Hours' by As I Lay Dying is shown in Figure 6.13 and can be heard together with the other instruments at https://tinyurl.com/yxlggypq. This part clearly shows the model's capability of generating drums for a verse riff, going into a short interlude and then going into a breakdown.

Figure 6.13: Generated drum tab for bars 22-33 (minutes 0:32-0:50) of the song '94 Hours' by As I Lay Dying.

While the model can generally create believable drums, there are two big aspects it struggles with. First off, as was expected from looking at the individual losses and accuracies, hi-hats are not often generated and toms even less. This is in part due to toms and hi-hats being less represented in the data set. Toms are also particularly difficult drums, since they are mainly used in fills, meaning that the model cannot rely on the guitars for them (like for the kick drum), nor learn a regular pattern linked to the time step positions (like for the other drums).

Secondly, the model's generation of cymbals and snares mostly follows a 4/4 pattern, meaning that these drums are only hit at quarter note intervals. While this usually works for isolated parts of a song, a whole song becomes boring if this same pattern is played most of the time. In metal there are many parts where the drums are hit at intervals of eighth notes, but this is not often generated.

A final shortcoming of the current drum generation model is its long generation time. Each time step takes around 2 seconds. This leads to a generation time of 525 minutes, or 8.75 hours, for a song with the average length of 15,767 time steps.


Chapter 7

Evaluation

The quality of the models is tested through an adapted Turing test. Instead of asking participants to identify which examples are generated and which are human-written, they were asked to score the quality of all examples. There are several reasons for this.

Firstly, the main goal of this thesis is to generate good music. The philosophy is that people want to hear music that they like in the first place, and are only secondarily concerned with whether that music is human-written or artificially generated. Besides that, the models should offer the functional use of allowing musicians to generate extra tracks or notes to accompany what they have already written. In this context, quality is the most important factor.

Secondly, by asking which music is generated, people are likely to assume that either the most predictable or the most random example is the generated one. At the same time, predictable examples are not necessarily bad in quality, nor are they necessarily the generated examples.

Finally, it would be possible to ask both for a score and which example was generated. However, this would lead to correlated results: if people identify an example as generated, they might be biased to think it has a lower quality.

Test subjects were thus only asked to rate the examples, but they were still informed that there were artificially generated examples. While this might still introduce some bias, not explicitly asking which example is generated makes people think less about identifying this. Furthermore, spreading the survey as one about artificial intelligence raised interest in it. The survey itself presented short fragments of the different instruments to the subjects. This of course only validates the quality of those fragments and not the quality of fully generated songs. Surveying opinions on fully generated songs is difficult, however, since it asks much more time from the participants. The fragments that were presented were cherry-picked, i.e. fragments were chosen that were subjectively considered to be better than others.

The following sections will discuss the participants of the survey (Section 7.1), followed by sections on the examples and results per instrument. A playlist with all examples that were used can be found at https://www.youtube.com/watch?v=qtd0yIBsx5I&list=PLViwM4NqatHunqsl79u-SPthm0DDO1kC3.

7.1 Participants

Metal music has been generated in this thesis, but not everyone listens to this genre or appreciates it. In order to get meaningful opinions regarding the results, participants had to be gathered who are familiar with the genre.

This was done through three Facebook groups: 'Djent', 'Prog Snob' and 'Extended Range Guitar Nerds'. The first two groups are focused on metal subgenres, while the last one is focused on people who want to talk about guitars with more than six strings. Such guitars are often used in metal, and therefore this group also had suitable members. Each of these Facebook groups has over 13,000 members.

In the end, 109 people took part in the survey. These people were first asked to tell whether they were a musician or not, and if so, which instruments they play and whether they also write music. Table 7.1 displays the different kinds of participants, while Table 7.2 displays how many people played each of the generated instruments.

                 Non-musicians   Musicians (non-writers)   Musicians (writers)
Number           14              19                        76
Percentage (%)   12.8            17.4                      69.8

Table 7.1: Number and percentage of participants that were non-musicians, musicians that do not write music and musicians that also write music.

Instrument   Number of players
Guitar       72
Bass         48
Drums        35

Table 7.2: Generated instruments that were played and the number of participants that played them.

These tables show that most participants are musicians who write music and that most musicians play the guitar. Having a big group of musician subjects can lead to biased results. The presence of 48 bass players can be helpful, since they can be better suited to rate the bass parts (bass is usually more in the background, so it can be harder for non-bass players to formulate an opinion on the quality of the bass).

7.2 Lead Guitar

To test the lead guitar results, 9 fragments between 10 and 20 seconds in length were selected. Three of these were from existing songs, five were generated from nothing and one was generated using a kickstart from an existing song (its fragment did not include any part of the kickstart sequence).

The existing examples were selected to partially fall in line with the generated sequences. Generated sequences were mostly monophonic and did not include much repetition. The existing sequences therefore included a solo, a melody and a verse-like riff.

The lead guitar generation model does not generate tempos and outputs its generations with the root note at E2. In MIDI, when there is no tempo defined, the default tempo of 120 bpm is selected. Metal music is regularly faster than 120 bpm and often uses a root note that has a lower pitch than E2. For each generated sample, a suitable tempo was selected and the root note was adapted.

Test subjects were presented with all 9 examples and asked to rate each of them on a scale from 1 to 5 (1 meaning 'bad' and 5 meaning 'good').

Table 7.3 presents an overview of the examples and the order they were presented in. While the simple model is present more often than the advanced model, it is present in multiple variations, each of which should be considered as a different approach.


Example   Model used / Song name
1         Advanced model
2         Advanced model
3         Simple model (single output)
4         'The Fractal Effect' by After The Burial (verse riff)
5         Simple model
6         Simple model
7         Simple model (kickstarted with 'Doomsday' by Architects)
8         'Vacancy' by As I Lay Dying (solo)
9         'Balance' by August Burns Red (melody)

Table 7.3: Presented guitar examples.

7.2.1 Results

Table 7.4 presents a general overview of how people rated each example.

Example   1       2       3       4       5
1         19.3%   30.3%   33.9%   13.8%   2.8%
2         2.8%    17.4%   26.6%   37.6%   15.6%
3         4.6%    23.9%   30.3%   34.9%   6.4%
4         2.8%    14.7%   23.9%   37.6%   21.1%
5         2.8%    7.3%    22.0%   41.3%   26.6%
6         5.5%    18.3%   28.4%   30.3%   17.4%
7         6.4%    22.9%   31.2%   26.6%   12.8%
8         1.8%    15.6%   26.6%   31.2%   24.8%
9         9.2%    22.9%   36.7%   21.1%   10.1%

Table 7.4: Lead guitar ratings (scores 1 to 5) for each example. Examples 4, 8 and 9 are the existing (human-written) fragments.

These results show that the test subjects considered most generations to be of similar quality to the human-written examples. Particularly example 5 was highly appreciated, even more than any of the existing songs that were presented. This was generated by the simple model and shows that it can generate pieces of music that are considered good. The second most popular example was number 4, which comes from an existing song. After that, example 8 (from an existing song) and example 2 (generated) were preferred most. Example 2 was generated by the advanced model.

The rest of the results show that example 1 and human-written example 9 were liked the least, followed by the kickstarted example 7.

The conclusion drawn from these results is that both the simple model and the advanced model are able to yield fragments of lead guitar music that can be considered as good as some human-written music. The music generated by the original simple model is preferred over that generated by the advanced model, although each model was only represented by a few examples. At the same time, listeners were only presented with a few pieces of existing music. The low performance of the kickstarted generation is not necessarily the fault of kickstarting, but may also depend on the particular example. These nuances should be taken into account, but it is good to see that generated examples are not disliked overall compared to existing ones.

7.3 Other Instruments

The models for the other instruments generate tracks on top of already written tracks. In order to test these generations, subjects were presented with two versions of the same part of a song: one with the original instrument track and one with the generated track.

For each test, a full band was presented. This means that, e.g., for generated bass, the existing drums were also included in the example. The instrument that was tested was also put at a slightly higher volume than the other instruments, in order to make it easier to hear the difference.

An important aspect when picking fragments is that only fragments where the generated track differed from the existing track could be used; otherwise, there would be no difference between the presented examples. Fragments where the generated track is equal to the original track are automatically considered to be good. (Note that generations were only produced for the test set, which was never used during training, so these good generations are not simply the result of overfitting to the data set.)

Furthermore, no tracks were regenerated with a different seed to potentially obtain a better result, because generating tracks for these instruments takes a long time. It does, however, mean that initial generations are tested for their quality, as opposed to testing how many regenerations are required to achieve good quality.

Fragments for each instrument were selected to be rather different from each other. This difference was achieved by selecting fragments that correspond to different types of song sections or to artists with clearly different styles. In some cases, the difference was simply that the original instrument played rather different sequences.

People were then asked to indicate which version they preferred through a rating of 1 to 5. These ratings mean the following:

• 1: The first version is the best and the second version is bad

• 2: The first version is better, but the second version is good

• 3: Both versions are equally good

• 4: The second version is better, but the first version is good

• 5: The second version is the best and the first version is bad

For each song, the order in which the fragments were presented was random. For bass fragments, the test subjects were also able to hear the bass by itself, since some bass notes can be harder to hear when other instruments are playing as well. For drum fragments, the generated drums used the maximally reduced drum set and the existing drum tracks were also converted to this. Finally, for rhythm fragments, the rhythm guitar of the existing tracks was made monophonic, since the generated rhythm guitars are also monophonic.
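
The conversion to the maximally reduced drum set can be thought of as a many-to-one mapping from General MIDI percussion notes onto the five allowed elements (kick, snare, hi-hat, crash, one tom). The sketch below is a minimal illustration of such a mapping; the exact note numbers and groupings used for the conversion here are assumptions.

    # Minimal sketch of reducing a General MIDI drum track to a five-piece kit.
    # The target notes and groupings are illustrative assumptions.

    KICK, SNARE, HIHAT, CRASH, TOM = 36, 38, 42, 49, 45

    REDUCTION = {
        35: KICK, 36: KICK,                           # bass drums
        37: SNARE, 38: SNARE, 40: SNARE,              # side stick and snares
        42: HIHAT, 44: HIHAT, 46: HIHAT,              # closed/pedal/open hi-hat
        49: CRASH, 51: CRASH, 55: CRASH, 57: CRASH,   # crashes, ride, splash
        41: TOM, 43: TOM, 45: TOM, 47: TOM, 48: TOM, 50: TOM,  # all toms
    }

    def reduce_drums(hits):
        """Map (time, GM drum note) hits onto the reduced kit, dropping notes
        that have no counterpart in the reduced set (e.g. cowbell)."""
        return [(t, REDUCTION[n]) for t, n in hits if n in REDUCTION]

    # Example: kick, closed hi-hat, electric snare and a splash cymbal.
    print(reduce_drums([(0.0, 36), (0.5, 42), (1.0, 40), (1.5, 55)]))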

People were also asked to tell whether they recognized the song.

7.4 Bass

Table 7.5 presents an overview of the examples and the order they were presented in. For each example, the name of the song is given, together with an indication of whether the first or the second version is the generated one.

Example    Song name                           Generated example
1          ‘Echoes’ by August Burns Red        1
2          ‘White Noise’ by ERRA               1
3          ‘Creeping Death’ by Metallica       1
4          ‘Mark The Lines’ by Veil of Maya    2
5          ‘Principles’ by Elitist             2

Table 7.5: Presented bass examples and which version is the generated one.

7.4.1 Results

Table 7.6 presents a general overview of how people rated each example. It also indicates what percentage of participants recognized the song.

Example    1        2        3        4        5        Recognized
1          44.0%    34.9%    10.1%    6.4%     4.6%     7.3%
2          9.2%     28.4%    33.9%    20.2%    8.3%     6.4%
3          9.2%     22.0%    24.8%    24.8%    19.3%    46.8%
4          27.5%    28.4%    22.0%    15.6%    6.4%     17.4%
5          21.1%    28.4%    19.3%    22.0%    9.2%     7.3%

Table 7.6: Bass ratings for each example. A rating of 1 indicates preference for the original version and 5 indicates preference for the generated version. The last column gives the percentage of participants who recognized the song.

These results show that, generally, the original bass track is preferred. Especially for example 1, there was a strong preference for the original bass track. For example 2, both the original and the generated bass tracks were considered to be of similar quality.

Example 3 was the only one with a slight preference towards the generated bass track. This was also the track that was recognized by most people, so there might be a slight bias here. Table 7.7 shows the rating percentages for this track for people who recognized the song and for people who did not. It shows that the preference mostly came from people who already knew the song, while people who did not recognize it showed a smaller preference.

                  1        2        3        4        5
Recognized        11.8%    23.5%    13.7%    29.4%    21.6%
Not recognized    7.0%     20.7%    34.5%    20.7%    17.2%

Table 7.7: Bass rating percentages for example 3, split by whether people did or did not recognize the song.

The low recognition percentages for the other songs indicate that their ratings are unbiased.

The conclusion that can be drawn is that, for parts where the generated bass differs from the original bass, the model is usually considered to generate lower-quality sequences.

7.5 Rhythm Guitar

Table 7.8 presents an overview of the examples and the order they were presented in. For each example, the name of the song is given and an indication of whether the first or second version provided is the generated one. Only three examples were used for rhythm guitar, since this model received less focus and it was harder to cherry-pick decent parts where the generated rhythm guitar was different.

Example    Song name                                  Generated example
1          ‘Doomsday’ by Architects                   1
2          ‘Specter’ by Elitist                       2
3          ‘To Carry You Away’ by After The Burial    2

Table 7.8: Presented rhythm guitar examples and which example is the generated one.

7.5.1 Results

Table 7.9 presents a general overview of how people rated each example. It also indicates what percentage of participants recognized the song.

Example    1        2        3        4        5        Recognized
1          29.4%    34.9%    25.7%    5.5%     4.6%     33.0%
2          12.8%    14.7%    46.8%    19.3%    6.4%     18.3%
3          14.7%    21.1%    29.4%    25.7%    9.2%     11.0%

Table 7.9: Rhythm guitar ratings for each example. A rating of 1 indicates preference for the original version and 5 indicates preference for the generated version. The last column gives the percentage of participants who recognized the song.


These results show that for the first example, there is a strong preference for the original track. This track was recognized by 33.0% of the participants. Since the track has low percentages for ratings 4 and 5, this result can be considered unbiased. In the original version of this fragment, both guitars play the same notes, which subjectively also makes the most sense here. For the other two fragments, the generated and original versions were considered to be equally good.

A conclusion to be drawn is that the rhythm guitar model is able to generate some decent fragments of music, but it does not always offer decent accompaniment to the lead guitar. Only three samples were tested, however, since rhythm guitar generation received less focus. It is nevertheless encouraging that its current results can be appreciated.

7.6 Drums

Table 7.10 presents an overview of the examples and the order they were presented in. For each example, the name of the song is given and an indication of whether the first or second version provided is the generated one.

Example    Song name                                             Generated example
1          ‘Black Sheep’ by August Burns Red                     1
2          ‘Sad But True’ by Metallica                           2
3          ‘Even If You Win You’re Still a Rat’ by Architects    1
4          ‘94 Hours’ by As I Lay Dying                          2
5          ‘Divisions’ by August Burns Red                       2

Table 7.10: Presented drum examples and which example is the generated one.

7.6.1 Results

Table 7.11 presents a general overview of how people rated each example. It also indicates what percentage of participants recognized the song.

Example    1        2        3        4        5        Recognized
1          11.9%    30.3%    11.9%    32.1%    13.8%    9.2%
2          20.2%    18.3%    22.0%    19.3%    20.2%    58.7%
3          26.6%    33.0%    18.3%    11.9%    10.1%    12.8%
4          14.7%    24.8%    25.7%    24.8%    10.1%    15.6%
5          11.0%    16.5%    24.8%    25.7%    22.0%    7.3%

Table 7.11: Drum ratings for each example. A rating of 1 indicates preference for the original version and 5 indicates preference for the generated version. The last column gives the percentage of participants who recognized the song.

These results indicate that the generated and original drum tracks have comparable quality. While the most frequently given rating for example 1 is 4, the distribution of ratings for this example indicates that both versions are preferred by different people. For examples 2 and 4, both versions are considered to have equal quality. Example 3 led to a bigger preference for the original drums: for this example, the original drums played a more regular beat with occasional short fills, whereas the generated version started with more of a long fill and offered less of a regular beat near the end. For example 5, on the other hand, the generated drums were preferred. This came as a bit of a surprise, since the original version included many fills and it was expected that this would be seen as higher quality. The generated drums for this example offer a steadier beat, which apparently was preferred.

The conclusion from these results is that generated drum fragments can offer similar quality to human-written drum fragments. It is important to note the word ‘fragments’, since no full drum tracks were presented. Drums are also very contextual: while some fragments may be less appreciated by themselves, they might be appreciated more when the drums of the surrounding fragments are heard. Nonetheless, it is good that the drum model is able to generate decent fragments of drums.


Chapter 8

Conclusion

The focus of this thesis was on the generation of multi-instrument music, by generating instruments in a specific order. The instruments that were considered were lead guitar, rhythm guitar, bass and drums: instruments that are part of almost every metal and rock song. They were to be generated in the order in which they were just mentioned. Furthermore, the generation of metal music was considered, since this genre does not allow some simplifications that are made in state-of-the-art music generation models.

Two lead guitar generation models were introduced first. These models generated notes one by one, by generating their pitch, duration and, possibly, a set of intervals to form a chord. They both used an LSTM layer followed by one or more fully-connected layers. The second model introduced improvements, such as generating note durations conditionally on the generated pitches and only allowing a limited set of notes, durations and chords to be used in each generated song. These models were able to generate lead guitar tracks that exhibited characteristics of different sections of a metal song, such as verse riffs, melodies and breakdowns. While no generated riffs are repeated as in real songs, different types of song sections do alternate in generated songs.
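
As a rough illustration of this kind of architecture, the following Keras sketch shows one possible layout with an LSTM layer and separate softmax heads for pitch and duration, roughly in the spirit of the simple model. The layer sizes, vocabulary sizes and sequence length are assumptions for illustration, not the exact configuration used in this thesis.

    # Illustrative Keras sketch of an LSTM-based note-by-note generator with
    # separate softmax heads for pitch and duration. All sizes are assumptions.
    from tensorflow.keras.layers import LSTM, Dense, Input
    from tensorflow.keras.models import Model

    SEQ_LEN = 50      # number of previous notes fed to the model (assumed)
    N_PITCHES = 49    # size of the pitch vocabulary (assumed)
    N_DURATIONS = 16  # size of the duration vocabulary (assumed)

    # Each input step is a concatenated one-hot encoding of pitch and duration.
    inputs = Input(shape=(SEQ_LEN, N_PITCHES + N_DURATIONS))
    hidden = LSTM(256)(inputs)

    pitch_out = Dense(N_PITCHES, activation="softmax", name="pitch")(hidden)
    duration_out = Dense(N_DURATIONS, activation="softmax", name="duration")(hidden)

    model = Model(inputs, [pitch_out, duration_out])
    model.compile(optimizer="adam",
                  loss={"pitch": "categorical_crossentropy",
                        "duration": "categorical_crossentropy"})
    model.summary()

In a variant that conditions durations on pitches, as the advanced model does, the duration head would additionally receive the chosen pitch as an input.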

Secondly, a bass generation model was presented. This model followed a Bi-LSTM approach and used previous bass notes, together with the surrounding rhythm guitar and lead guitar notes, to predict the next bass note. To synchronize with the other instrument tracks, each track had to be divided into very short time steps, which led to very long generation times. The model was able to generate bass tracks for existing songs that played 60-85% of the original bass notes, indicating decent generation. Parts where the generated bass differed from the original were of varying quality.
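
A minimal sketch of such a setup is shown below: a bidirectional LSTM over a window of short time steps, where each step concatenates the bass context with the surrounding lead and rhythm guitar content, followed by a softmax over possible bass notes. The window length, feature layout and layer sizes are assumptions made for illustration.

    # Illustrative Keras sketch of a Bi-LSTM predicting the next bass note from
    # previous bass notes plus surrounding lead and rhythm guitar content.
    # All dimensions are assumptions made for this example.
    from tensorflow.keras.layers import Bidirectional, Dense, Input, LSTM
    from tensorflow.keras.models import Model

    WINDOW = 64        # number of short time steps in the context window (assumed)
    FEATURES = 3 * 49  # bass + lead + rhythm guitar features per step (assumed)
    N_BASS = 49        # size of the bass note vocabulary, incl. a rest (assumed)

    context = Input(shape=(WINDOW, FEATURES))
    encoded = Bidirectional(LSTM(128))(context)
    next_bass = Dense(N_BASS, activation="softmax")(encoded)

    model = Model(context, next_bass)
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    model.summary()

Because a prediction like this has to be made for every short time step of every track, long songs quickly lead to the long generation times mentioned above.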


The bass generation model was then adapted to work for rhythm guitars and for drums. Rhythm guitars were briefly explored and a model was created that was able to generate some decent, monophonic rhythm guitar parts. The model did, however, struggle with some parts of a song, where it would stop generating. The adapted model for drums generated drum tracks that only used a kick drum, snare drum, hi-hat, crash cymbal and a single tom. Using this limited drum set, the adapted model was able to generate decent drum tracks that mainly used the kick drum, snare drum and crash cymbal.

An online survey was created to collect opinions on the quality of the generation for the different instruments. This survey only tested short fragments per instrument, and not fully generated songs. For the rhythm guitar, bass and drums, only fragments where these instruments were playing different notes from the original tracks were tested.

The survey’s results showed a positive appreciation of the generated lead guitar and drum fragments. The initial rhythm guitar fragments were also considered decent. Original bass fragments were generally preferred over generated ones, however. Completely generated bass and drum tracks were not surveyed, but personally, such bass tracks were considered to be quite good, while full drum tracks were considered decent but lacking in variation.

8.1 Future Work

The models that were created during this thesis dealt with certain difficulties in the generation of each instrument. Other difficulties remain and should be tackled in future research. A first difficulty to be handled is the generation of chords, which affects both the lead and rhythm guitar generation models.

Secondly, lead guitar generation should result in better structure in the songs and make use of repetitions.

A third task is improving the rhythm guitar generation model. It should become able to generate accompanying notes for the lead guitar during the entire song. Furthermore, it should become more creative in order to generate better accompaniments.

Drum generation should be improved to better include hi-hats and toms, as well as to generate a greater variety of drum beats. After these improvements, the model could be extended to generate more drum set elements.


General directions for future work include the use of Generative Adversarial Networks, which have shown improved performance for sequence generation compared to LSTM networks. Generative Adversarial Networks were avoided in this thesis, however, because of their long training times.

Finally, the online survey sparked interest among musicians wanting to use the models. Currently, however, the rhythm guitar, bass and drum generation models have very long running times. Before making the models available to the public, these generation times need to be reduced.
