



Stylistic Melody Generation with Conditional Variational Auto-Encoder

Junyan Jiang (junyanj) 1 Zhiqi Wang (zhiqiw) 1

Abstract

Stylistic music generation requires human controls over high-level music features, e.g., rhythm, pitch, structure, and genre. While structure is a key notion for both music generation and representation learning, most existing generative models for music either adopt hard-coded structures or merely implicitly inherit the structure from the training pieces. Consequently, it is still difficult to control music structure in a flexible way in order to create new pieces. In this paper, we propose several novel ways to represent music structure using the framework of the Conditional Variational Auto-Encoder (CVAE). Furthermore, we explore the problem of disentangling the representations of music structure and content for long music pieces. We show in the results that we can extract the bar-level structure of a long music piece while creating novel pieces by swapping the representations of different pieces.

1. Introduction

Stylistic music generation is an interesting and important topic in computer music, music therapy and computational creativity. With the development of deep generative models, researchers have developed several new models for music generation, like MusicVAE (Roberts et al., 2018) and MidiNet (Yang et al., 2017), the aim of which is to train a model that creates new pieces without human guidance. However, music generation in practice does not always expect computers to freestyle; we want to stay in charge of some stylistic properties of the generated piece. For example, we may ask questions like how to generate a melody given a specific chord progression, how to generate a pitch contour given a rhythm, or how to generate a similar piece given a sample. This requires us to develop generative models with stylistic controls.

1 Carnegie Mellon University, Pittsburgh, PA 15213, USA. 10708 Class Project, Spring, 2019. Copyright 2019 by the author(s).

In this project, we investigate the problem of generating lead melody, an important element of modern popular music. Under this setting, we can classify stylistic controls into two categories:

(1) Internal properties of a melody, e.g., rhythm, pitch contour, structure;

(2) External context, e.g., chord sequence, tonality.

A successful generative model with a specific stylistic control should be able to analyze and disentangle such a feature automatically if it is an internal property, as well as to generate a new one consistent with the human's controls.

Notice that these two processes are highly related to the VAE, as the analysis part can be performed by the encoder and the generation part can be performed by the decoder. Therefore, we believe that the VAE is a good starting point for further implementation.

Among these stylistic features, structure is the most difficult one to handle in both analysis and generation tasks (Paulus et al., 2010; Dai et al., 2018). Music structure itself is a complex phenomenon. First, music structure is a multi-level concept (Hamanaka et al., 2018). On the bar level, structure may refer to how nearby music notes are grouped; on the phrase level, it may refer to local repetition and similarity; on the song level, it may refer to the organization of different segments (e.g., verses and choruses). Secondly, structure may refer to a variety of composition techniques, including repetition, duality, melodic sequence and motivic development. The complexity of music structure makes it difficult to model. Currently, mainstream models including MusicVAE and MidiNet still struggle to generate melodies with nice structural properties.

The aim of this project is to design a new deep generative model based on MusicVAE (Roberts et al., 2018) for stylistic melody generation. For internal property controls, we focus on a new generative model with the ability to automatically analyze and separate the concept of music structure given a music piece, and the ability to generate new songs consistent with a given structure representation. The model combines the VAE with self-attention, a popular architecture for modeling sequence structure. For external context, we propose a conditional version of the system that uses the chord sequence and the tonality as external controls. We provide experimental


results on the model, including the effectiveness of such controls and the musicality of the generated pieces.

2. Related Work

2.1. Stylistic Music Generation

Traditionally, rule-based expert systems, automata and Hidden Markov Models (HMMs) were widely used in automatic music generation tasks (Conklin & Witten, 1995; Lo & Lucas, 2006; Thornton, 2009). Commonly, they all require human-defined style parameters for music generation. It is therefore hard for these systems to adapt to new styles in general.

Recently, the development of deep generative models has provided researchers with new insights for sequence generation, and in the past two years such models have become widely adopted in music generation. For example, DeepBach (Hadjeres et al., 2017) proposed a method to generate Bach-style music. DeepBach adopts a Long Short-Term Memory (LSTM) network for melody modeling and uses a pseudo-Gibbs sampling technique for music generation. While the results are exciting, it is worth noticing that the methods in DeepBach are hard to reproduce on other music styles, as strong assumptions about Bach's style are incorporated into the model itself. Another model, DeepJ (Mao et al., 2018), combined a biaxial LSTM architecture with additional genre conditioning at every layer. To get a parametric representation of the music genre, the model uses an embedding layer to learn a distributed representation, followed by another fully-connected hidden layer. The samples generated by DeepJ do demonstrate stylistic generation. However, the generated pieces still lack long-term structure, and a large amount of data is required to train a new genre.

Other attempts include the use of popular deep generative models. MidiNet (Yang et al., 2017) adopts Generative Adversarial Networks (GANs) for fixed-length music generation, and MusicVAE (Roberts et al., 2018) uses the Variational Auto-Encoder (VAE) model for melody generation. We discuss MusicVAE in detail in Section 2.3.

2.2. Variational Autoencoder

The variational autoencoder (Kingma & Welling, 2013) is a powerful latent variable model consisting of an encoder and a decoder. The encoder compresses training data into a latent space, learning a Gaussian probability density qθ(z|x) from which we can sample noisy values of the representation z. The decoder uses the representation z to reconstruct the original data. Ideally, the latent variable z captures the probabilistic characteristics of the data points in the dataset. Recently, the VAE and its variants have been applied in many fields and perform efficient inference with deep neural networks. The Conditional VAE extends this framework by feeding labels as additional inputs, and achieved impressive performance in Attribute2Image (Yan et al., 2016). Since we noticed that harmonic context is an important factor in both the perception and the composition process, it can be regarded as a condition to the model and should be provided to both the encoder and the decoder. The Conditional VAE is therefore applied in our implementation.

However, one drawback of using a single latent unit is that it may be sensitive to changes in one factor but not in others. Therefore, disentangled representations are usually factorized so that different independent latent units encode different independent factors of the data. In our setting, a melody contains two important musical factors: the pitch contour and the rhythm pattern. They can be separated into two latent vectors, each with its own loss function. Traditional attempts learn disentangled representations under supervision, which is unrealistic in our domain. One well-known unsupervised approach, β-VAE (Higgins et al., 2017), proposes to add an extra hyper-parameter to the original VAE objective, which constrains the encoding capacity of the latent vectors and encourages factorization.
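For concreteness, here is a minimal PyTorch-style sketch of the β-VAE objective; the function name and tensor layout are our own illustration, not code from the cited work:

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, log_var, beta=4.0):
    """beta-VAE objective: reconstruction loss plus a KL term
    weighted by beta > 1 to encourage factorized latents."""
    recon = F.mse_loss(x_recon, x, reduction="sum")
    # closed-form KL between N(mu, sigma^2) and the standard normal prior
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + beta * kl
```

Setting beta = 1 recovers the standard VAE objective; larger values trade reconstruction quality for a more factorized latent space.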

2.3. MusicVAE

MusicVAE (Roberts et al., 2018) adopts the idea of the VAE for melody generation. In this model, the piano roll of a monophonic melody line m is fed into the encoder to get a latent variable zm, and the decoder tries to reconstruct the melody from zm. The encoder and the decoder use LSTM cells to process variable-length data.

The major contribution of that work is to show that the VAE is a viable generative model for music melody, as the model successfully generated musically meaningful 2-bar samples by interpolating between the latent codes of given melodies. However, the model fails to generate, or even reconstruct, a long piece (e.g., 16-bar samples) well, even though it explicitly adopts a hierarchical architecture for the decoder. Also, it is not a stylistic melody generator, as no disentanglement is performed on the latent space, so we cannot directly control the stylistic parameters of the generated melody. These two issues of MusicVAE inspired us to work on this project.

2.4. Attention-Based Music Generation

The attention mechanism is adopted in sequence-to-sequence models and has achieved huge success in many generative tasks like machine translation (Vaswani et al., 2017). Such a mechanism can help the model bring certain inputs to the fore while diminishing the importance of the others. Huang et al. (2019) combined self-attention with a seq2seq model to generate music with long-term structure, and achieved better results than an LSTM without self-attention. Their experiments demonstrated that training with relative attention helps in learning long-term structural information. However, this cannot be directly used in a VAE: compared with an LSTM encoder, much information is lost when a VAE encodes the input into latent variables.

There are also some working attempts to combine variational attention with sequence-to-sequence models (Bahuleyan et al., 2017a). Such a mechanism works for Variational Encoder-Decoder models (e.g., for machine translation), but due to the different encoding targets, these models cannot be directly applied to VAEs.

3. Proposed Methods

We here propose three methods that we experimented with during the semester. We discuss the advantages and disadvantages of each method in the discussion section.

3.1. Bar-Level Similarity Representation

Symbolic music can be regarded as a highly structured sequence, with repetition and similarity on different scales. It is generally hard to model music structure in both transcription and generation problems, and no unified representation of music structure has yet been agreed upon (Paulus et al., 2010).

Assume that a music melody is denoted as a sequence m1..n. Typically, a token mi refers to the note sequence in one bar or one beat; in our project, we use one bar for convenience. A traditional metric for describing music structure is the similarity matrix T, in which each cell Tij denotes the correlation of certain music elements (e.g., melody or rhythm) between mi and mj.

Figure 1. An example of music structure analysis of the first four bars of the song Two Tigers. The song contains exact repetitions as well as inexact similarity in pitch contour.

Notice that although the similarity matrix T is very convenient for music analysis, it is still not obvious how to generate a new piece according to T itself. Therefore, we here design a new form of similarity representation with the capability of both analysis and generation. We call it the Masked Self-Attention Similarity Matrix, or MSASM for short.

Figure 2. An example masked self-attention similarity matrix of the song Two Tigers (best viewed in color). The red cells are masked out to zero. Darker blue denotes a larger attention value.

Given a melody piece m1..n, the h-th MSASM A^h is an n × n matrix given by

$$A^h_{i,j} = \begin{cases} \mathrm{Softmax}\!\left(\dfrac{Q^h_i K^h_j + S^h_{i-j}}{T}\right) & i \ge j \\ 0 & i < j \end{cases} \qquad (1)$$

where T is the temperature hyper-parameter, S^h is the relative position encoding proposed by Huang et al. (2019), and Q^h, K^h are the attended variables given by

$$Q^h_i = W^h_Q\,\mathrm{Encoder}(m_i) \qquad (2)$$

$$K^h_i = W^h_K\,\mathrm{Encoder}(m_i) \qquad (3)$$

and h = 1..H is the head index. Notice that each head may attend to a different property between two bars m_i and m_j. For example, some heads focus more on rhythmic similarity while others focus more on pitch contour similarity.

In the decoding phase, we use the similarity matrix to generate a new sample that shares a similar music structure with the provided one. Given a hidden representation z_{1..n} of each bar, we reweight the representations according to the attention values in A to get a structure-aware representation:

$$z^h_i = W^h_V z_i \qquad (4)$$

$$z'^h_i = \begin{cases} z^h_i & i = 1 \\ z^h_i A^h_{i,i} + \sum_{j=1}^{i-1} z'^h_j A^h_{i,j} & i > 1 \end{cases} \qquad (5)$$

$$z'_i = \mathrm{Concat}(z'^1_i, \ldots, z'^H_i) \qquad (6)$$

It is worth noticing that this formula is not exactly the same as in the original Transformer model (Vaswani et al., 2017), where $z'^h_j A^h_{i,j}$ would be replaced by $z^h_j A^h_{i,j}$. This is because we allow chained similarity: for example, if bar 3 is similar to bar 2 and bar 2 is similar to bar 1, this does not necessarily imply that bar 3 is similar enough to bar 1, but bar 3 still requires information from bar 1 for reconstruction, as they have some correlation. We provide a probabilistic explanation of this approach in Section 3.1.1.

3.1.1. A PROBABILISTIC VIEW

A trivial way to extend the MusicVAE model to arbitrary music lengths is to apply a 1-bar MusicVAE to each bar of the music:

$$s_i = \mathrm{Encoder}(m_i) \qquad (7)$$
$$z_i \sim P_{s_i} \qquad (8)$$
$$m'_i = \mathrm{Decoder}(z_i) \qquad (9)$$

where z_i is the variational variable and s_i is the encoded parameter that controls the variational distribution of z_i.

Figure 3. The graphical model of the trivial extension of MusicVAE.

Figure 4. An example graphical model of the proposed model. This graph corresponds to the song Two Tigers as shown in Figure 1.

An obvious problem is that the model is not aware of any structural information of the music. It assumes independence between the representations of any two bars, as shown in Figure 3. Thus, if we resample the variables z_{1..n}, any inter-bar structural property is lost.

However, if we instead sample and decode z' given by Equations (4)-(6) instead of z, we can retain such dependency.

One implementation problem of the model is the fact that the MSASM can easily degenerate to a diagonal matrix. In this case, all the dependencies between bars disappear, and the model reduces to the trivial case. To prevent this from happening, we pose an additional penalty L_d on the diagonal entries of the MSASM to encourage the model to explore the correlation between one bar and its previous ones:

$$L_d(A) = \frac{1}{nH} \sum_{h=1}^{H} \sum_{i=1}^{n} A^h_{i,i} \qquad (10)$$

The total loss function is given by:

$$\mathcal{L} = \mathcal{L}_{\mathrm{recons}} + \lambda_1 D_{\mathrm{KL}}(p(z'|m)\,\|\,p(z')) + \lambda_2 L_d(A) \qquad (11)$$

where $\mathcal{L}_{\mathrm{recons}}$ is the reconstruction loss and λ1, λ2 are hyper-parameters.

3.2. Structured Condition

Human composers always refer to some contextual information when producing a music melody. The contextual information may vary depending on the situation, but the most common is the harmonic context, i.e., the chord sequence and the tonality, where the chord sequence controls the local, short-term context of a melody and the tonality controls the long-term context. As the structure of the harmonic context may reflect the structure of the melody to some extent, such a condition might be useful as an explicit representation of music structure.

Figure 5. The model structure of the hierarchical CVAE.

We here adopt the Conditional VAE (CVAE) architecture for our model. The modification from the VAE to its conditional version is simple: for both the encoder and the decoder, we add a new input providing the whole context information. However, when the VAE is hierarchical, we can feed the long-term and the short-term context into different parts of the encoder and decoder. In our implementation, we feed the short-term context into the local encoders (decoders) and the long-term context into the global encoder (decoder), as shown in Figure 5.

The major issue here is how to represent the chord sequence and the tonality as numerical vectors. For chord symbols, we adopt a representation method provided by Raffel et al. (2014), denoting a chord as a multi-hot 36-D vector representing its root, bass and pitch classes (i.e., the chroma vector).

Figure 6. An example representation of the chord E:maj7/b3.

Figure 7. An example representation of the tonality F:major, showing the tonal note vector (with F marked) and the mode vector (major / minor / others).

To represent the tonality, we perform some simplifications. We only retain the 24 tonality classes (12 major and 12 minor) that are most common in modern popular songs, and discard other rare modes like Dorian and Mixolydian. Also, we do not distinguish between the natural, melodic and harmonic minor. Under this setting, we use a one-hot representation of the 24 tonality classes.
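A plain-Python sketch of both condition encodings; the exact ordering of the root, bass and chroma segments within the 36-D vector is our assumption, not a detail confirmed by the cited representation:

```python
import numpy as np

# pitch-class indices, C = 0
PITCH_CLASSES = {"C": 0, "C#": 1, "Db": 1, "D": 2, "D#": 3, "Eb": 3, "E": 4,
                 "F": 5, "F#": 6, "Gb": 6, "G": 7, "G#": 8, "Ab": 8, "A": 9,
                 "A#": 10, "Bb": 10, "B": 11}

def encode_chord(root, bass, chord_tones):
    """36-D multi-hot chord vector: 12-D one-hot root, 12-D one-hot bass,
    and a 12-D chroma of the chord tones."""
    v = np.zeros(36)
    v[PITCH_CLASSES[root]] = 1
    v[12 + PITCH_CLASSES[bass]] = 1
    for tone in chord_tones:
        v[24 + tone % 12] = 1
    return v

def encode_tonality(tonic, mode):
    """24-D one-hot tonality vector: 12 major keys then 12 minor keys."""
    v = np.zeros(24)
    offset = 0 if mode == "major" else 12
    v[offset + PITCH_CLASSES[tonic]] = 1
    return v

# e.g. E:maj7/b3 -> root E, bass G, chord tones {E, G#, B, D#}
chord_vec = encode_chord("E", "G", [4, 8, 11, 3])
key_vec = encode_tonality("F", "major")
```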

3.3. Content-Structure Disentanglement

One critical problem of the bar-level similarity representation is the fact that the content representation is still not compact: the latent codes of different bars may share similar information, which is redundant. In other words, the model does not disentangle structure from content. To solve this issue, we propose a new model that tries to learn a more compact representation of both structure and content.

We introduce a hierarchical β-VAE (Higgins et al., 2017) whose architecture is shown in Figure 8. The main idea of the model is to introduce a variational attention matrix A as a representation of the structure of a melody. In variational attention (Bahuleyan et al., 2017b), the attention matrix is converted into a variational variable, which is sampled from a continuous representation space.

Figure 8. An overview of the hierarchical VAE, consisting of local encoders and decoders with shared parameters, a global encoder and decoder, latent components v combined by multiply-and-add, and a chord condition.

The workflow of the model is as follows. To begin with, we have an input melody m = [m1, ..., mT] with T bars. Each mt is first fed into a local encoder to get its bar-level representation st. The global encoder takes in all st and outputs the distribution of the attention matrix A (the structure representation) and the global latent variable z (the content representation of the music theme). For the decoding process, a global decoder is first applied to the sampled global latent variable z and outputs C fixed-length latent components [v1, ..., vC], where C is a hyper-parameter. We want each component vc to represent a distinct semantic factor of the original melody (e.g., the rhythm or pitch of the theme). Then, the sampled attention matrix A is used to calculate a weighted sum of the latent components vc to recover the bar-level representation s't:

$$s'^{(h)}_t = \sum_{c=1}^{C} A^{(h)}_{t,c}\, v_c \qquad (12)$$

$$s'_t = W_d^\top \,\mathrm{Concat}(s'^{(1)}_t, \ldots, s'^{(H)}_t) \qquad (13)$$

Here, H denotes the number of attention heads and W_d is a linear transformation matrix. This process can be understood as rearranging the different semantic factors v_c of the music content using the structure representation A. Finally, the local decoders are applied to recover the melody m'_t from s'_t for each bar. The loss function F(φ, θ; m, c) of the model is given by

$$\mathcal{F} = \mathbb{E}_{A \sim q^{(A)}_\phi(\cdot|m,c),\; z \sim q^{(z)}_\phi(\cdot|m,c)}\left[\log p_\theta(m|z, A, c)\right] - \beta_z D_{\mathrm{KL}}\!\left[q^{(z)}_\phi(z|m,c)\,\|\,p_z(z)\right] - \beta_a D_{\mathrm{KL}}\!\left[q^{(A)}_\phi(A|m,c)\,\|\,p_A(A)\right] \qquad (14)$$

where c is the condition for the model, and φ and θ are the parameters of the encoder and the decoder, respectively.

One thing worth noting is why we introduce the variational attention loss term. We find in practice that if A is not variational, the model becomes problematic: A forms a shortcut for information to flow from the encoder to the decoder, and A alone is enough to recover the whole melody line without any information stored in z. Variational attention (Bahuleyan et al., 2017b) solves this issue by adding some noise to the attention matrix before decoding. Notice that although the distribution of A could be modeled with a Dirichlet distribution, as each attention vector sums to 1, this is not feasible because the reparameterization trick is hard to perform on Dirichlet distributions. Instead, we use a Normal distribution as the prior for A and then apply exp(·) and normalization to recover the real attention matrix from the sampled latent variable.
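A minimal sketch of this sampling step (reparameterized Gaussian, then exp and row normalization); the function name is our own:

```python
import torch

def sample_attention(mu, log_var):
    """Variational attention with a Normal latent: draw a reparameterized
    Gaussian sample, then apply exp(.) and row normalization to recover a
    valid attention matrix. (A Dirichlet would directly fit rows summing
    to 1, but its reparameterization is hard, hence this workaround.)"""
    eps = torch.randn_like(mu)
    a = mu + eps * torch.exp(0.5 * log_var)   # reparameterization trick
    a = torch.exp(a)                          # make all entries positive
    return a / a.sum(dim=-1, keepdim=True)    # each row sums to 1
```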

Figure 9. Visualization of the reparameterization part (a) without variational attention and (b) with variational attention.

In our implementation, the global decoder is simply a linear transformation, and the global encoder is a stacked bi-directional Long Short-Term Memory (Bi-LSTM) network. The parameters of the variational distributions of A and z are calculated by linear transformations of the outputs and the hidden states of the Bi-LSTM, respectively. We adopt the conditional version of MusicVAE (Roberts et al., 2018) proposed by Yang et al. (2019) for the local encoders and decoders, using chord progressions as the condition for both. All local encoders share the same parameters, as do all local decoders.

4. Experiment

4.1. Dataset

A lack of datasets always poses severe issues for computer music. In this project, we mainly use MIDI datasets of pop songs with vocal melodies, including the Nottingham dataset (Boulanger-Lewandowski et al., 2012). To make up for the insufficiency of training data, we also created a new dataset from more than 1,000 Chinese Karaoke songs, with symbolic annotations produced by state-of-the-art Music Information Retrieval algorithms (Jiang et al., 2010; Bock et al., 2016; Mauch et al., 2015). The annotations are not 100% accurate, but our experiments still achieve good results with them. The whole dataset is augmented by pitch shifting in the range of −6 to +5 semitones.
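A sketch of this augmentation step, assuming a simple tuple-based note representation (the function and field names are ours):

```python
def augment_by_transposition(notes, low=-6, high=5):
    """Pitch-shift augmentation: return one transposed copy of the note
    sequence per shift in [low, high] semitones (shift 0 keeps the original).
    `notes` is assumed to be a list of (pitch, onset, duration) tuples."""
    copies = []
    for shift in range(low, high + 1):
        copies.append([(pitch + shift, onset, dur)
                       for (pitch, onset, dur) in notes])
    return copies
```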

Meanwhile, we collected another dataset from the website www.hooktheory.com. The website contains more than 10,000 public music tabs annotated by the crowd. Most annotations contain phrase-level predominant melody lines along with their chord and tonality context. Since this dataset is annotated at the phrase level, it is more suitable for the local structure experiment than the Nottingham and Karaoke datasets, as the latter require random splitting to get phrase-level music pieces, and we cannot guarantee that the cutting points are precisely the start/end of a phrase. Our final dataset contains 16,000 16-bar pop music samples.

4.2. Results

4.2.1. EXPERIMENT ON STRUCTURED CONDITION

We first experiment with the structured-condition model. We want to know (1) whether the structured condition is effective in the context of the VAE, and (2) whether the music structure in the harmonic context can be used as an explicit representation of melody structure.

To show the effectiveness of the condition, we train a model on 2-bar melodies using a hierarchical VAE, where each local encoder encodes one beat of the melody and the global encoder outputs a global latent code to summarize the whole piece. By providing the harmonic context, we want the global latent code to focus more on the inner properties of the melody, e.g., pitch contour and rhythmic pattern, instead of properties that depend heavily on the harmonic context.

Figure 10 shows that when we alter the condition of the HVAE decoder and keep the latent code unchanged, the model generates a melody with a contour similar to the original, while some pitches are slightly adjusted to be consistent with the harmonic context. This means the conditions are effective in the sense that the latent code does not represent the melody in a hard way by memorizing the exact pitches. Instead, it memorizes some abstract properties of the melody, like how the pitch contour goes.

Chord Condition   Tonality Condition   KL Loss   Recon. Loss
No                No                   0.2254    0.0433
Yes               No                   0.2154    0.0504
No                Yes                  0.2191    0.0783
Yes               Yes                  0.2143    0.0253

Table 1. KL loss and reconstruction loss when training on 2-bar melodies, with or without the chord and tonality conditions.

Figure 10. An example of manipulation of the structured condition. The top score contains the original music piece and the harmonic conditions. The bottom score contains the altered conditions and the melody generated with the altered chords and keys as the condition.

However, when it comes to structure representation, we find that the harmonic conditions are not a suitable constraint for generating structured melodies. The reason is that the conditions are not powerful enough: by altering them, each note pitch changes by at most 1 to 2 semitones, which is barely enough to change the global structure of the melody. Also, we find that the models trained with and without the conditions achieve similar performance and a similar KL loss, which means that the structure of the conditions does not help much in getting a more compact content representation of the melody. This observation suggests that the structure information is actually stored inside the latent variable, rather than in the external conditions.

4.2.2. EXPERIMENT ON BAR-LEVEL SIMILARITY REPRESENTATION

We train the model proposed in Section 3.1 on 16-bar melodies with the following hyper-parameters: (a) number of heads H = 16, (b) bar-level latent code dimension z_dim = 512, (c) (λ1, λ2) = (1.0, 0.3). After training, the model acquires a high reconstruction accuracy (97.59%) on the validation set.

One advantage of the model is the fact that each attention matrix has a clear meaning corresponding to the self-similarity between two bars of the song. Therefore, we can visualize the attention matrices, as shown in Figure 12.

However, the model has limitations in two aspects. First, we find that the structure extracted by the model is incomplete: when we resample the latent variables, some of the global structure information is kept, but some is lost; see Figure 11 for an example. Second, the model does not have a compact representation of the music content, as each bar still has an individual latent variable, which might contain redundant information. Also, we cannot get a disentangled representation of the music content from the latent representation.

Figure 11. An example of generating a long music piece (the lower part) that has a similar structure to a given piece (the upper part). In generation, the latent code of the given piece is resampled but the MSASM is kept the same. The generated piece has a similar ABAC structure to the original one. However, some local structures (e.g., the similarity between the first 2 bars) are lost.

Figure 12. Parts (3 heads) of the MSASM values corresponding to the song in Figure 11. Lighter colors in each matrix denote larger values, and the vertical yellow bars are separators between matrices.

4.2.3. EXPERIMENT ON STRUCTURE-CONTENT DISENTANGLEMENT

Sample length   8 bars   16 bars
HVAE            89.63    80.85
Proposed        93.82    86.27

Table 2. Reconstruction accuracy (%) for the baseline model and the proposed model in Section 3.3.

We train the model proposed in Section 3.3 with latent dimension z_dim = 512, number of heads H = 8, number of components C = 16 and KL loss weights (βz, βa) = (0.1, 2). We find the model hard to train, as it tries to acquire a more compact music content representation z.


Approach                            Structure Representation      Content Representation   Effectiveness on Structure Regularization
Bar-Level Similarity                Implicit (Self-Attention)     Too Loose                Moderate
Structured Condition                Explicit (Harmonic Context)   Loose                    Poor
Content-Structure Disentanglement   Implicit (Attention)          Compact                  Good

Table 3. Comparison of the proposed methods.

Figure 13. A demonstration of the latent-code exchange process on two Chinese pop songs, Chong'er Fei and Peng You. The annotations mark exact and inexact repetitions, downstream and upstream contours, and the A/A' phrase structure; the four scores are the two original songs and the two combinations with swapped latent codes. The audio demo can be accessed via https://trinket.io/library/trinkets/9ab0e87a21.

Still, we find that variational attention helps, by comparing the reconstruction accuracy to that of the pure hierarchical decoder of the original MusicVAE (Roberts et al., 2018); the results are shown in Table 2. To validate the effectiveness of the content-structure disentanglement, we use a latent-code swapping technique similar to the ones in (Natsume et al., 2018; Jetchev & Bergmann, 2017). Specifically, we are given two 8-bar melody lines m_a, m_b from real music and calculate their latent representations:

$$(A_a, z_a) \leftarrow \mathrm{Encode}(m_a, c_a) \qquad (15)$$
$$(A_b, z_b) \leftarrow \mathrm{Encode}(m_b, c_b) \qquad (16)$$

where c_a (c_b) denotes the chord condition corresponding to the melody m_a (m_b). Then, we generate two new melodies by swapping the latent representations A_a, A_b:

$$m'_a \leftarrow \mathrm{Decode}(A_b, z_a, c_a) \qquad (17)$$
$$m'_b \leftarrow \mathrm{Decode}(A_a, z_b, c_b) \qquad (18)$$
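The swapping procedure of Eqs. (15)-(18) can be sketched as follows, assuming a model object exposing encode and decode methods that mirror the notation above (this interface is our assumption):

```python
import torch

@torch.no_grad()
def swap_structure(model, m_a, c_a, m_b, c_b):
    """Latent-code swapping: exchange the structure codes A of two pieces
    while keeping each piece's content code z and chord condition."""
    A_a, z_a = model.encode(m_a, c_a)      # Eq. (15)
    A_b, z_b = model.encode(m_b, c_b)      # Eq. (16)
    m_a_new = model.decode(A_b, z_a, c_a)  # Eq. (17): b's structure, a's content
    m_b_new = model.decode(A_a, z_b, c_b)  # Eq. (18): a's structure, b's content
    return m_a_new, m_b_new
```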

Comparing the generated melody pieces to the original ones allows us to understand which features are controlled by A and z, respectively. An example result is shown in Figure 13. We experimented with 10 pop songs and observed some interesting results:

1. The generated melody m'_a shares a similar music structure with m_b. The observed structure similarity is multi-level, including short-term and long-term repetitions.

2. The generated melody m'_a shows a rhythmic pattern and local pitch transitions similar to those of m_a.

We conclude from these findings that in the new model, the original latent code z, along with the conditions, controls the local content of the music, while the variational attention A controls the structure, i.e., how these local contents are arranged into a long music piece. This agrees with the human perception of music structure.

5. Discussion and Conclusion

In this paper, we design and experiment with several novel methods, trying to find a suitable way to represent music structure and, furthermore, to disentangle it from the representation of music content. A comparison of the different methods is shown in Table 3.

While we find that the third method works best, it still has several remaining issues. First, the model still experiences a significant accuracy drop when generalizing to longer music. Second, as the proposed method does not place extra regularization on the two latent variables, the disentanglement of structure and content is actually unsupervised; therefore, its effectiveness is generally hard to guarantee theoretically. Third, the network structure is only a prototype and needs to be further optimized.

References

Bahuleyan, H., Mou, L., Vechtomova, O., and Poupart, P. Variational attention for sequence-to-sequence models. arXiv preprint arXiv:1712.08207, 2017a.

Bahuleyan, H., Mou, L., Vechtomova, O., and Poupart, P. Variational attention for sequence-to-sequence models. arXiv preprint arXiv:1712.08207, 2017b.

Bock, S., Krebs, F., and Widmer, G. Joint beat and downbeat tracking with recurrent neural networks. In ISMIR, pp. 255–261, 2016.

Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. arXiv preprint arXiv:1206.6392, 2012.

Conklin, D. and Witten, I. H. Multiple viewpoint systems for music prediction. Journal of New Music Research, 24(1):51–73, 1995.

Dai, S., Zhang, Z., and Xia, G. Music style transfer issues: A position paper. arXiv preprint arXiv:1803.06841, 2018.

Hadjeres, G., Pachet, F., and Nielsen, F. DeepBach: a steerable model for Bach chorales generation. In Proceedings of the 34th International Conference on Machine Learning, pp. 1362–1371. JMLR.org, 2017.

Hamanaka, M., Hirata, K., and Tojo, S. deepGTTM-III: Multi-task learning with grouping and metrical structures. In 13th International Symposium, CMMR 2017, Matosinhos, Portugal, September 25–28, 2017, Revised Selected Papers, pp. 238–251, 2018. ISBN 978-3-030-01691-3. doi: 10.1007/978-3-030-01692-0_17.

Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.

Huang, C.-Z. A., Vaswani, A., Uszkoreit, J., Shazeer, N., Simon, I., Hawthorne, C., Dai, A., Hoffman, M., Dinculescu, M., and Eck, D. Music Transformer: Generating music with long-term structure. 2019.

Jetchev, N. and Bergmann, U. The conditional analogy GAN: Swapping fashion articles on people images. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2287–2292, 2017.

Jiang, J., Chen, K., Li, W., and Xia, G. MIREX 2018 submission: A structural chord representation for automatic large-vocabulary chord transcription. 2010.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Lo, M. and Lucas, S. M. Evolving musical sequences with n-gram based trainable fitness functions. In 2006 IEEE International Conference on Evolutionary Computation, pp. 601–608. IEEE, 2006.

Mao, H. H., Shin, T., and Cottrell, G. DeepJ: Style-specific music generation. In 2018 IEEE 12th International Conference on Semantic Computing (ICSC), pp. 377–382. IEEE, 2018.

Mauch, M., Cannam, C., Bittner, R., Fazekas, G., Salamon, J., Dai, J., Bello, J., and Dixon, S. Computer-aided melody note transcription using the Tony software: Accuracy and efficiency. 2015.

Natsume, R., Yatagawa, T., and Morishima, S. RSGAN: face swapping and editing using face and hair representation in latent spaces. arXiv preprint arXiv:1804.03447, 2018.

Paulus, J., Muller, M., and Klapuri, A. State of the art report: Audio-based music structure analysis. In ISMIR, pp. 625–636. Utrecht, 2010.

Raffel, C., McFee, B., Humphrey, E. J., Salamon, J., Nieto, O., Liang, D., and Ellis, D. P. mir_eval: A transparent implementation of common MIR metrics. In Proceedings of the 15th International Society for Music Information Retrieval Conference, ISMIR, 2014.

Roberts, A., Engel, J., Raffel, C., Hawthorne, C., and Eck, D. A hierarchical latent vector model for learning long-term structure in music. arXiv preprint arXiv:1803.05428, 2018.

Thornton, C. Hierarchical Markov modeling for generative music. In ICMC, 2009.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

Yan, X., Yang, J., Sohn, K., and Lee, H. Attribute2Image: Conditional image generation from visual attributes. In European Conference on Computer Vision, pp. 776–791. Springer, 2016.

Yang, L.-C., Chou, S.-Y., and Yang, Y.-H. MidiNet: A convolutional generative adversarial network for symbolic-domain music generation. arXiv preprint arXiv:1703.10847, 2017.

Yang, R., Wang, D., Wang, Z., Chen, T., Jiang, J., and Xia, G. Deep music analogy via latent representation disentanglement. In Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR, 2019.