1. Introduction

One of the central affordances of music production and editing tools is the ability to place musical elements at precise positions along a timeline; many genres of music have emerged out of communities of artists working within the constraints of perfectly consistent rhythms and tempos. How and when to diverge from that grid is an important factor for creators to consider; many of the instrumental gestures that our ears are attuned to, like drumrolls, guitar strums, and trills, are composed of groups of rhythmic onsets that live in the spaces between the grid lines. Some music producers, like UK garage artist Burial, prefer to ignore the grid altogether (), while others rely on setting global parameters like “swing” (). Both of these approaches have their drawbacks: working completely without a grid is too time consuming for most to consider, and adjusting timing with global parameters offers only broad strokes rather than precise control.

Given these limitations, one of the more intriguing directions offered by AI for expanding the rhythmic possibilities in music production is its ability to assist users in intelligently and subtly keeping their music “off-the-grid” by modeling the rhythmic nuances of existing music. In the context of drums and percussion (our focus in this paper), several systems based on modeling instrumental performances have already been designed and made available within mainstream music production tools like Ableton Live (; ; ).

Despite a long history of research in expressive performance analysis and generation (see Cancino-Chacón et al. () for a detailed review), generating expressive musical parts on the scale of even one or two measures in length remains a challenging problem.

Most research on expressive performance generation has been situated in the context of Western art music () and often relies on note-level alignments with scores or other structural elements of notated music like dynamics markings (). Some recent approaches based on deep learning have instead attempted to jointly model both composition and performance using MIDI data sourced from instruments outfitted with sensors (like a Disklavier or electronic drum kit), or from audio recordings automatically transcribed to MIDI (). Still, designing and engineering models that work well enough to generate compelling outputs requires either large instrument-specific training datasets (), compromises on temporal precision through varying levels of quantization (), or both. To enable creators to explore the potential uses of expressive performance models in practice, we would like to be able to train music generation models with as little data as possible (), while at the same time preserving the nuances of expressive music that can only be captured with precise temporal resolution.

In this paper, by taking a close look at the representations used to encode drum performance data, we take steps to address some of the challenges that arise when modeling off-the-grid data with neural networks. We analyze the tradeoffs imposed by different representations, propose an alternative approach called Flexible Grids, and conduct experiments to investigate the relative advantages of each data representation.

Although the primary focus of this paper deals with methodology — how to represent musical data when working with machine learning models — our work is motivated by the range of real-world applications that depend on these underlying mathematical representations of music. For this reason, in choosing our technical direction, we prioritize applicability toward directions that would otherwise be difficult for creators to explore (off-the-grid music), real-world constraints on data size and computational efficiency that are necessary for making AI broadly accessible (; ), and considerations of interpretability and controllability that matter to music creators when working with AI (). Concretely, our contributions include the following:

  • We analyze and compare existing data representations that have been used recently for music generation in the MIDI domain, highlighting opportunities for improvement.
  • We present Flexible Grids (visualized in Figure 3 and described in Section 3) as an alternative data representation, along with motivations and implementation details.
  • Using the Groove MIDI Dataset (), a collection of drumset recordings containing expressive timing and dynamics, we experiment with training Variational AutoEncoder models using Flexible Grids as well as several other data representations, comparing results through quantitative metrics and a listening survey carried out with drummers. We also compare the same set of representations in classifying the anonymized identities of drummers in the dataset.

Code and trained models are available at: https://github.com/jrgillick/groove.

2. Data Representations for Musical Events

Recent work on music generation in the MIDI domain typically takes one of two broadly defined approaches to representing musical data. These two categories, which we will refer to throughout this paper as Fixed-Grid and Event-Based representations, differ primarily in terms of how they handle musical time and tempo. While not all existing approaches fit neatly into one bucket or the other, this distinction is convenient for summarizing the main factors to consider when choosing a musical representation; Huang and Yang () draw a similar distinction while connecting Fixed-Grid and Event-Based representations respectively with similar structures developed in Computer Vision and Natural Language Processing.

2.1 Fixed-Grid representations

Fixed-Grid representations break down music into equal chunks of time, typically associating each timestep with a musical duration such as an 8th note or a 16th note. As a consequence, musical constructs like tempo and beat subdivision can be built into Fixed-Grids, with time usually defined relative to a local or global tempo. This structure accords with theories of how humans perceive rhythm in that when multiple rhythmic onsets take place within a short time frame (a “beat bin”), humans tend to group them together, hearing them as forming a single beat (). Tempo-relative representations are also advantageous for machine learning because they implicitly keep track of time, while at the same time outlining a shared structure for jointly modeling music recorded at different tempos.

Besides capturing some useful temporal musical structure, the other defining characteristic of Fixed-Grid representations is that they have a consistent size; this means that any sequence lasting one measure will always have the same number of timesteps (e.g. 16 or 32) and the same number of features per timestep, regardless of the density of musical events actually present in that sequence. Fixing the size of sequences is desirable for two main reasons: first, it enforces fewer constraints on the machine learning models and architectures we can choose from (feed forward and convolutional neural networks are a workable choice here), and second, maintaining an alignment with musical time allows us to design more predictable interactions: for example, we can select, visualize, or manipulate musical parts that last for a specific number of beats or measures.

Formally, a Fixed-Grid representation consists of a grid of size (T × E × M), where a musical sequence consists of T timesteps, a given timestep t includes a maximum of E possible events, and each event contains M modification parameters for capturing details like expressive timing and dynamics.
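As a minimal illustration (our own sketch, not code from any of the systems cited here), a Fixed-Grid for one measure at 16th-note resolution with 12 instrument channels can be stored as a single tensor; we assume here that the M values per slot are a hit indicator, a velocity, and a timing offset.

```python
import numpy as np

# Minimal sketch (our own illustration): one measure at 16th-note resolution (T = 16)
# with 12 instrument channels (E = 12) and M = 3 values per slot, assumed here to be
# [hit indicator, velocity, timing offset].
T, E, M = 16, 12, 3
grid = np.zeros((T, E, M), dtype=np.float32)

# Place a hit on channel 0 at the first timestep: full velocity, slightly ahead of the beat.
grid[0, 0] = [1.0, 1.0, -0.02]
```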

Figure 1 shows an example of a drum machine interface with 16 timesteps and 12 instruments, which can be represented with a Fixed-Grid using T = 16 and E = 12; the controls for “Tempo” and “Shuffle” can each be implemented with a single parameter that applies to the entire sequence. Figure 2 demonstrates a different Fixed-Grid design for a drum machine, which includes an option to let users switch between resolutions of T along with three settings for velocity defined by the “dynamic” parameter.

Figure 1 

Fixed-Grid representation of a 1-measure pattern for 12 drums in a web interface designed by Yuri Suzuki (808303.studio) and inspired by Roland’s TR-808 Rhythm Composer.

Figure 2 

Fixed-Grid representation of a 1-measure pattern for a single drum in the interface for Propellerhead’s ReDrum drum machine.

Figure 3 

(a) One measure of drums from the Groove MIDI Dataset visualized in pianoroll format. In a grid at 16th-note resolution, 9 of the 15 snare drum hits in this measure would be mapped to duplicate slots in a matrix; of these, only 3 notes (colored in yellow) could be kept, and the other 6 (colored in red) would need to be discarded or quantized. (b) Mapping drum onset events to slots in our proposed Flexible Grid data representation. Red notes are considered secondary. Each instrument channel (kick, snare, hi-hat, etc.) receives one primary event per 16th note timestep, and space for secondary events is distributed with the minimum number of slots needed to fit the densest passages in the training set. Every event here has two continuous modification parameters for velocity and timing offsets.

In addition to the role they play in drum machines, Digital Audio Workstations (DAWs), and other musical devices, Fixed-Grids are common choices for music modeling and generation. Recent examples include MidiNet (), which generates melodies and chords, and MusicVAE (), which creates melodies and drum patterns. MIDI-VAE () models instrument dynamics in addition to sequences of notes, and GrooVAE () includes both instrument dynamics and expressive microtiming. R-VAE () and InpaintNet () explore finer resolutions as well as ternary divisions on a grid, with R-VAE modeling timesteps as small as 32nd-note triplets.

The main downside of Fixed-Grid representations in the context of machine learning is that it can be difficult to choose an appropriate resolution for T. Too fine a resolution (such as a 1/128th note) results in long and sparse sequences that are difficult to model, while too coarse a resolution (like an 8th note or 16th note) can result in a lossy representation where some notes need to be quantized or discarded (; ). This tradeoff often arises when music is sparse in some places and dense in others, which happens commonly when fast runs or alternate articulations are played on the same instrument.

2.2 Event-Based representations

Event-Based representations also have a long history in music generation; they have been used in models based on Markov Chains (; ), Recurrent Neural Networks (; ; ), and Transformers (; ). In contrast to Fixed-Grid representations, which keep track of an event's temporal position by encoding it relative to a specific point on a timeline, Event-Based representations track the passage of time through a discrete vocabulary of time-shift events, each of which moves a playhead forward by a specific increment. These increments can be measured in musical durations like 8th or 16th notes, for example to generate jazz improvisations () or folk tunes (), but of particular interest for this work is a recent series of models of expressive performance that use more fine-grained timespans, with vocabularies allowing time shifts as short as 8 milliseconds. These extended vocabularies of time shifts make room for models to learn directly from data in formats like MIDI without explicitly modeling tempo and beat.

PerformanceRNN () and Music Transformer () both take this approach, using Event-Based representations that handle time in milliseconds to generate piano performances. REMI () replaces milliseconds with beat-based timesteps, along with a modifier to handle local tempo variations, in an Event-Based representation for pop piano music.

The main downside of Event-Based representations that measure time at high enough resolution for expressive music generation is that in exchange for flexibility, they sacrifice metrical and grouping structures that are connected to the way humans perceive music (). Empirical results show that generative models trained with these representations tend to sound less realistic than similarly parameterized models trained with a Fixed-Grid and can have trouble maintaining steady rhythms, particularly over long sequences (; ).

3. Flexible Grid Representations

To address some of the challenges posed by Fixed-Grid or Event-Based data representations, we introduce a new data representation called a Flexible Grid (visualized in Figure 3). Our design for this representation stems from the following question: How can we best encode every musical event in a dataset of expressive performances into fixed-length sequences without needing to quantize or discard any notes?

3.1 Avoiding quantization with continuous offsets

As a starting point, we begin with the data representation proposed by Gillick et al. () for the GrooVAE model (which we treat as a baseline for experiments in Section 4). This representation, used for modeling expressive drumming with a kit containing 9 drums, encodes drum hits onto a 16th-note grid along with two continuous modification parameters that define, respectively, a velocity v between 0 and 1, and a timing offset o between -0.5 and 0.5, which indicates where between two adjacent metrical positions a note onset occurs. Using the notation from Section 2.1, one measure of drums can be represented by GrooVAE in a Fixed-Grid of size (T = 16) × (E = 9) × (M = 3). Because of the continuous offset parameters, the drum hits captured here do not need to be quantized, so microtiming is preserved at the resolution at which it was originally captured. Evidence from several studies, which indicate that timing fluctuations at the level of individual notes are better explained as deviations from a local tempo than as short-term changes in tempo (; ), supports this choice of timing offsets over tempo changes. Building on this representation, we use the same modification parameters v and o to accompany each event in a Flexible Grid.
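To make this encoding concrete, the following minimal sketch places a single drum onset into a (T = 16) × (E = 9) × (M = 3) grid, assuming the three values per slot are a hit indicator, a velocity, and a timing offset; the function and parameter names are our own illustration rather than the released GrooVAE code.

```python
import numpy as np

def encode_hit(grid, onset_sec, channel, velocity, qpm=120.0, steps_per_quarter=4):
    # Sketch only: place one onset into a (T, E, 3) grid of [hit, velocity, offset].
    step_sec = 60.0 / qpm / steps_per_quarter          # duration of one 16th note
    nearest_step = int(round(onset_sec / step_sec))    # closest metrical position
    offset = (onset_sec - nearest_step * step_sec) / step_sec  # fraction of a step, in [-0.5, 0.5]
    if nearest_step < grid.shape[0]:
        grid[nearest_step, channel] = [1.0, velocity, offset]
    return grid

grid = np.zeros((16, 9, 3), dtype=np.float32)                    # one measure, 9 drum channels
grid = encode_hit(grid, onset_sec=0.26, channel=1, velocity=0.8)  # a slightly late snare hit
```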

3.2 Avoiding skipped notes with secondary events

The Fixed-Grid representation used by GrooVAE breaks down, however, when more than one onset occurs at the same timestep on the same instrument channel. This is a common occurrence whenever a fast musical gesture spans multiple onsets (e.g. a flam, roll, or double stroke on a snare drum).

Figure 3(a) shows one example of a measure from the Groove MIDI Dataset that leads to this problem: At three different points in this measure, the snare drum channel contains two or more events mapping to one point in time and so cannot be fully captured by the Fixed-Grid at 16th-note resolution. Of 9 snare drum onsets, only the 3 shown in yellow are preserved, while the 6 shown in red are ignored. Whenever we run out of slots in the matrix like this, we need to make a choice about which to keep; Gillick et al. () choose to keep the loudest event when faced with this decision. While the reasons underlying this kind of quantization are not easy to make transparent to users of tools built on these representations, low-level decisions like this can have a far-reaching impact on the ways that tools actually can be used.

One way to avoid skipping notes is to increase the resolution of T from 16th notes to 32nd notes, 64th notes, and so on (; ). This approach, however, does not easily resolve the problem; in the example shown in Figure 3, a 32nd-note resolution still misses 4 of the 9 notes in question, and a 64th-note resolution misses 2. Moreover, increasing the resolution makes sequences longer and correlations between related positions in the grid less regular. Previous results () show that music generation models using Fixed-Grids with too high a resolution are more difficult to train and produce more audible artifacts. A second option for avoiding skipped notes, then, is to switch to a tempo-free Event-Based representation in order to bypass the problem through the use of variable-length sequences. This choice, however, comes at the cost of potentially less data-efficient training and generated outputs that may accumulate timing errors over the course of a sequence.

Rather than taking one of the above approaches, we instead observe that the snare drum events in Figure 3 can be accommodated into a grid if we allocate three extra slots for snares, increasing the E dimension of our matrix from 9 to 12 so that at each timestep, we have a maximum of 4 snare drum events along with one event for each of the other 8 instruments in the drum kit. This simple change lets us encode this entire measure into a grid of dimension (T = 16) × (E = 12) × (M = 3) without any dropped events. Viewed another way, we concatenate our primary grid P, of size (T = 16) × (E = 9) × (M = 3), with a secondary grid S of size (T = 16) × (E = 3) × (M = 3). P encodes the blue and yellow notes in Figure 3, while S encodes the red notes.

Encoding in this way can provide two advantages over increasing the temporal resolution: first, a smaller and denser matrix gets us to the point where we do not lose any data, and second, the musical events featurized by the secondary matrix S share a common structure that differs from the events in the primary matrix P: all of these events represent musical gestures moving faster than the subdivision of the grid, and they all occur in close proximity to other events on the same channel, which presumably correspond to other onsets produced by the same gesture (e.g. a drumroll). This method of constructing S does not have the undesirable side effect of degrading the rich correlation structure in P (P is left unchanged), which happens when we increase the resolution of T from 16 to 32. Another way to think about why this representation should be beneficial is that, similar to the way in which Fixed-Grid representations make machine learning problems easier by injecting information about metrical position, separating primary and secondary events injects contextual information about musical gestures into the data representation. Taken together, P and S make up a Flexible Grid representation of the drums shown in Figure 3.
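For modeling, combining the two grids can be as simple as a concatenation along the event dimension; the sketch below (our own illustration, not code from the paper's repository) shows the shapes for the measure in Figure 3.

```python
import numpy as np

P = np.zeros((16, 9, 3), dtype=np.float32)   # primary grid: one slot per drum per 16th note
S = np.zeros((16, 3, 3), dtype=np.float32)   # secondary grid: three extra snare slots for this measure
flexible_grid = np.concatenate([P, S], axis=1)  # shape (16, 12, 3), i.e. (T, E, M)
```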

3.3 Applying a Flexible Grid to a dataset

Although we can encode the measure shown in Figure 3 into the P and S matrices above, other measures in our dataset will not be captured by that encoding, which includes secondary slots only for snare drums and not for the other 8 drum channels.

If we generalize our method of expanding S, however, we can construct a Flexible Grid that fits every sequence in the data; this can be thought of as making space in S for events that happen as fast as the fastest gestures in our data, but no faster than that. To do this with the Groove MIDI Dataset, we map every drum onset to its closest 16th note timestep, count the number of onset events mapped to each instrument channel (snare drum, kick drum, closed hi-hat, etc.) at every timestep, and then take the maximum value of this quantity for each of the 9 drum instruments that occurs anywhere in the entire dataset. These resulting 9 values Ec (representing the maximum number of possible events for each channel) correspond to the maximum number of times that each instrument in the kit was played within the span surrounding a single timestep.
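The counting step can be implemented with a simple pass over the data; the sketch below is our own illustration, assuming each sequence is given as a list of (onset time in seconds, channel index) pairs recorded at a known tempo.

```python
from collections import defaultdict

def max_onsets_per_step(sequences, qpm=120.0, steps_per_quarter=4):
    # For each drum channel, find the largest number of onsets mapped to any
    # single 16th-note timestep anywhere in the dataset (the values Ec in Table 1).
    step_sec = 60.0 / qpm / steps_per_quarter
    E_c = defaultdict(int)
    for seq in sequences:
        counts = defaultdict(int)
        for onset_sec, channel in seq:
            step = int(round(onset_sec / step_sec))
            counts[(step, channel)] += 1
        for (step, channel), n in counts.items():
            E_c[channel] = max(E_c[channel], n)
    return dict(E_c)
```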

Table 1 shows the result of this computation applied to the Groove MIDI Dataset at 16th-note resolution: we design S to fit one extra ride cymbal, two more open hi-hats, six additional snares, and so on; by adding a total of 21 extra grid slots per 16th note, we can capture every event in the dataset.

Table 1

Statistics of the Groove MIDI Dataset used to build a Flexible Grid Representation at 16th note resolution.


Drum             Max # of Onsets within 1/16 Note

Kick             3
Snare            7
Closed Hi-hat    4
Open Hi-hat      3
Low Tom          3
Mid Tom          3
Hi Tom           3
Crash Cymbal     2
Ride Cymbal      2

Total            30

This approach to constructing the primary and secondary grids and encoding a musical sequence into them can be summarized with the following sequence of steps (a code sketch follows the algorithm):

Algorithm 1: Encoding a musical sequence into a Flexible Grid.

  1. Associate every event in a musical sequence with the closest point in time on the grid.
  2. For each input channel c (in our case one of 9 drums in a drum set), count the maximum number of events Ec that have been associated with any single timestep.
  3. Set the dimension of E in the primary event matrix P to be equal to the total number of instrument channels, and set the dimension of E in the secondary matrix to ∑c(Ec – 1).
  4. When encoding a sequence, map the first event at time t and channel c into the corresponding position in the matrix P, along with its modification parameters for velocity and timing offset.
  5. If there are any more events at time t on channel c, map each of those in temporal order to the corresponding positions in the matrix S, such that subsequent slots will never be filled without first filling all previous slots. If there is no event in P at time t and channel c, then S cannot contain any events at that position.
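The following is a minimal sketch of Algorithm 1 under stated assumptions: events arrive as (onset time in seconds, channel, velocity) tuples for one measure at a known tempo, velocities are already normalized to [0, 1], the per-channel maxima Ec come from a counting pass such as the one in Section 3.3, and the secondary slot layout (one contiguous block per channel) is our own choice rather than a detail taken from the released implementation.

```python
import numpy as np

def encode_flexible_grid(events, E_c, num_channels=9, T=16, qpm=120.0, steps_per_quarter=4):
    # Sketch of Algorithm 1: build the primary grid P and the secondary grid S
    # for one measure of drum events.
    step_sec = 60.0 / qpm / steps_per_quarter
    P = np.zeros((T, num_channels, 3), dtype=np.float32)

    # Step 3: lay out (E_c[c] - 1) secondary slots per channel as one contiguous block each.
    sec_start, total_secondary = {}, 0
    for c in range(num_channels):
        sec_start[c] = total_secondary
        total_secondary += E_c.get(c, 1) - 1
    S = np.zeros((T, total_secondary, 3), dtype=np.float32)

    # Step 1: associate each event with its closest timestep, processing events in temporal order.
    filled = {}  # (timestep, channel) -> number of events placed so far
    for onset_sec, channel, velocity in sorted(events):
        t = int(round(onset_sec / step_sec))
        if t >= T:
            continue
        offset = (onset_sec - t * step_sec) / step_sec
        k = filled.get((t, channel), 0)
        if k == 0:
            # Step 4: the first event at (t, c) goes into the primary grid.
            P[t, channel] = [1.0, velocity, offset]
        elif k <= E_c.get(channel, 1) - 1:
            # Step 5: later events at (t, c) fill that channel's secondary slots in order.
            S[t, sec_start[channel] + k - 1] = [1.0, velocity, offset]
        filled[(t, channel)] = k + 1
    return P, S
```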

3.4 Considerations for designing Flexible Grids

Of course, the choices of what events to consider as primary depend on the content of the music in the dataset and especially on the types of repetition that take place most often. For example, if our dataset contains many 8th-notes and 8th-note triplets, as pointed out by Vigliensoni et al. (), we may benefit from constructing a primary grid that includes both of those resolutions. Or, to take another example, if our dataset contains many possible pitches (e.g. 88 piano keys), we might want to fit the more common pitches, like those in the current key center, into a primary grid, while leaving slots for out-of-key notes to a secondary grid (e.g. with modification parameters for sharps, flats, octaves, and so on).

While the structure outlined here is perhaps the most straightforward arrangement of a Flexible Grid and could be plugged into drum machine interfaces or off-the-shelf machine learning models, given an appropriate model or musical context, the secondary matrix S could be structured differently, for example as a variable-length sequence in a hybrid setting alongside the fixed matrix P.

4. Experiments

4.1 Data

To explore Flexible Grids and compare with other representations in the context of machine learning models, we conduct experiments using data from the Groove MIDI Dataset (). This data consists of about 14 hours of professional drum performances (recorded by a total of 10 drummers) captured in MIDI format on an electronic drum kit. It was recorded by drummers playing along to a metronome, so we are able to assume a known tempo and downbeat (this is one of the main structural assumptions we need to make; in situations where this information is not captured with the dataset, we would need to automatically infer these quantities using beat-tracking). The drumming in this dataset is representative of typical rhythmic patterns from several styles including jazz, Latin, and rock music (full details of the dataset are available online). For our experimental setup, we divide the dataset into 2-measure segments with a 1-measure sliding window, following the same procedure as Gillick et al. (). This results in a training set of about 17,000 2-measure drum sequences and development and test sets containing about 2,200 sequences each.

4.2 Data representations for comparison

For experiments, we consider four baseline data representations, keeping machine learning model architectures, hyperparameters, and training procedures the same, while changing the data representation.

Fixed-Grid(16) This baseline corresponds to the data representation used by Gillick et al. () for generating drums with the GrooVAE model. Here, events for each of the 9 drum categories are encoded using a fixed grid at 16th-note resolution, with continuous modification parameters for each event’s velocity and timing offset relative to the nearest 16th note. A 2-measure drum sequence is represented using a grid with dimensions (T = 32) × (E = 9) × (M = 3).

Fixed-Grid(32) Here, to add resolution in the time domain, we increase the number of timesteps T from 16 to 32 per measure, so a 2-measure sequence has dimension (T = 64) × (E = 9) × (M = 3).

Fixed-Grid(64) This representation further increases the number of timesteps per measure to 64, using a grid of dimension (T = 128) × (E = 9) × (M = 3) to represent two measures of drums.

Event-Based For this baseline, we use the Event-Based representation from Oore et al. (), where MIDI notes are converted into variable-length sequences using a vocabulary V of 9 Note-on events, 127 Time-shift events from 8–1000 ms, and 32 Set-velocity events (Note-off events are not needed for our percussion dataset). With this data structure, 2-measure sequences are represented by a variable-length matrix of size (T = t) × (V = 168), with the sequence length t taking values up to 300 (the largest number of tokens in this vocabulary needed to represent any 2-measure sequence in the training set). We convert all data to a tempo of 120 BPM before any other processing.

Flexible Grid(16) We use a Flexible Grid constructed at 16th-note resolution as described in Section 3. The P component of this representation is equivalent to the first baseline, Fixed-Grid(16). The S component is a secondary grid of size (T = 32) × (E = 21) × (M = 3). For modeling, we concatenate P and S along the E dimension into a (T = 32) × (E = 30) × (M = 3) grid.
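To make the contrast between the Event-Based and grid-based encodings concrete, the sketch below shows one way a list of drum onsets might be tokenized under an Event-Based scheme in the spirit of the baseline above; the token names, the chunking of long gaps, and the velocity binning are our own assumptions and may differ from the exact vocabulary of Oore et al.

```python
def tokenize_events(notes, shift_ms=8, max_shift_ms=1000, num_velocity_bins=32):
    # `notes` is a list of (onset_sec, drum_index, midi_velocity_0_127) tuples.
    tokens, clock = [], 0.0
    for onset_sec, drum, velocity in sorted(notes):
        # Quantize the gap since the last event to 8 ms steps, chunked at 1000 ms.
        gap_ms = round((onset_sec - clock) * 1000.0 / shift_ms) * shift_ms
        while gap_ms > 0:
            shift = min(gap_ms, max_shift_ms)
            tokens.append(("TIME_SHIFT", shift))
            gap_ms -= shift
            clock += shift / 1000.0
        tokens.append(("SET_VELOCITY", velocity * num_velocity_bins // 128))
        tokens.append(("NOTE_ON", drum))
    return tokens
```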

4.3 Analysis of skipped notes

Because the drumming in the dataset is quite varied, the prevalence of different kinds of gestures also varies depending on the drummer and the musical material. We first extend the same analysis applied to the measure in Figure 3 to the entire dataset in order to understand how many notes are quantized or dropped by each data representation. This measurement aims to give a sense of the scope of the impact a data representation can have when used in models. If a representation drops many events, this effect will always be passed on to any models that use it. If a representation does not drop any notes, we can say that it has the potential to accurately model all the details of the data; of course, the question of evaluating how those models actually perform is left for subsequent modeling experiments.

4.4 Music generation with VAE

Next, we explore training a Variational AutoEncoder model to unconditionally generate 2-measure musical parts. In practice, this model can be used for generating new drum loops, interpolating between existing loops, or other applications that motivate research into VAEs for music (). While this experiment aims to capture the most general setting for generation in order to best isolate the effects of the data representation, VAEs also include encoders (unlike autoregressive models or Generative Adversarial Networks), which are important for any creative applications that involve conditional modeling based on user input control signals like MIDI scores or rhythmic performance gestures ().

For our model, we adopt the Recurrent-VAE neural network architecture used by Gillick et al. (). While examining a variety of different models in conjunction with choices of data representation merits further exploration, we restrict ourselves to one model here in order to focus on differences between representations. This architecture is convenient because it lends itself well to both fixed and variable-length sequences; we are able to use the same network for all 5 conditions, including the Event-Based representation. We follow the same choice of hyperparameters as Gillick et al. (), except for reducing the value of the VAE regularization parameter β from 0.2 to 0.002 (increasing the weight given to the reconstruction loss component of the objective function), a change we found to improve the baseline model and therefore adopted for all of these experiments.
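For reference, we assume here that β plays its standard role as the weight on the KL term of the evidence lower bound (a β-VAE-style formulation; the exact loss in the released code may differ), so that reducing β places relatively more weight on reconstruction:

```latex
\mathcal{L}(\theta, \phi; x) \;=\; \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] \;-\; \beta \, D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)
```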

We train 5 VAE models, one using each of the 4 baseline data representations, as well as one using the proposed Flexible Grid representation. We are interested here both in the perceptual qualities of model outputs (how good do they sound?) and in the types of gestures that are present in generated music (do they capture the diversity in gestures, generating drumrolls, flams, and so on?).

As one way of exploring differences with regard to perceptual quality, we conduct an online listening survey with 11 expert drummers, asking each participant to provide pairwise rankings for 15 pairs of generated samples (a total of 165 trials), with pairs drawn randomly from a pool of 128 samples from each model. In choosing their subjective preference for each pair, participants are informed that all samples have been generated by machine learning models, but they are not told anything about the differences between groups or about the specific focus of the study. Before running the survey with our participants, in preliminary comparisons by our research team, we found that two of the baselines (Fixed-Grid(32) and Fixed-Grid(64)) had a noticeably higher proportion of audible artifacts, so we chose to focus our survey resources on the remaining two baselines (Fixed-Grid(16) and Event-Based) to obtain a larger sample for the most important comparisons. While the survey aims to capture overall subjective differences between outputs from each model, we do not ask participants about more specific differences in order to avoid introducing biases by directing them to listen for particular details (like the presence or absence of drumrolls).

4.5 Reconstruction with VAE

In this experiment, we measure the onset-level reconstruction performance of VAE models trained on each representation, reporting F1-scores. Because a VAE may add or drop notes in reconstruction (it is responsible here for joint generation of both the drum pattern and its expressive timing and dynamics), the alignment between original and reconstructed notes is not known. Given a note nsi from a sequence s in the test set and a reconstructed sequence r generated by a VAE, we define nsi as having been correctly reconstructed if any note nrj of the same category (e.g. snare drum) is present in r and appears within 20ms of the original note nsi. We choose this tolerance of 20ms based on an approximate upper bound of the temporal resolution of human listeners’ ability to discriminate sounds, which has been shown to vary from as little as 2ms in some cases to about 20ms in others (; ). Each reconstructed note nrj is only allowed to match one note in the original sequence, to avoid rewarding models that average or quantize note timings. To estimate the best alignment between s and r, we use dynamic time warping for each drum instrument, to match snare drums in s with snare drums in r, and so on.
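As a concrete sketch of this metric (with one simplification of our own: a greedy one-to-one matching within the 20 ms tolerance stands in for the per-instrument dynamic time warping alignment described above), the onset-level F1-score could be computed as follows.

```python
def onset_f1(original, reconstructed, tol=0.020):
    # `original` and `reconstructed` map each drum category to a list of onset times (seconds).
    tp = fp = fn = 0
    for drum in set(original) | set(reconstructed):
        orig = original.get(drum, [])
        recon = reconstructed.get(drum, [])
        used = [False] * len(recon)
        for t in orig:
            # Match each original note to the closest unused reconstructed note within tolerance.
            best, best_dist = None, tol
            for j, r in enumerate(recon):
                if not used[j] and abs(r - t) <= best_dist:
                    best, best_dist = j, abs(r - t)
            if best is None:
                fn += 1
            else:
                used[best] = True
                tp += 1
        fp += used.count(False)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```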

For this evaluation (in which we are not constrained by a limited number of human listeners to make judgements), we include a second model architecture to broaden the scope of our comparisons: in addition to the Recurrent VAE used for the generation experiment in Section 4.4, we also train a Convolutional VAE using each data representation (except for Event-Based, which requires a network capable of processing variable-length inputs). Here, we replace the recurrent networks with convolutional encoders and decoders based on the DCGAN architecture (), adjusting the numbers of convolutional filters so that each model has approximately the same number of parameters.

4.6 Classification

In our final modeling experiment, we compare the different data representations for two classification tasks given a 2-measure sequence: the 10-way classification task of predicting the identity of the drummer, and the 18-way task of predicting a genre as labeled in the Groove MIDI Dataset. We use an MLP neural network model with a single hidden layer for this experiment, again fixing the model architecture and varying the data representation. The hypothesis here is that the features defined by the different ways of representing the same data may be more or less discriminative for categorizing music by performer or genre; for example, drummers may play the same pattern but express it through different stylistic gestures in their playing.
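As an illustration of this setup (the hidden layer size, iteration count, and placeholder data below are our own assumptions, not the paper's configuration), each encoded 2-measure grid can be flattened into a feature vector and passed to a single-hidden-layer MLP:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Placeholder data for illustration: 100 sequences encoded as Flexible Grid(16),
# i.e. shape (T=32, E=30, M=3), with integer drummer IDs as labels.
X = np.random.rand(100, 32, 30, 3)
y = np.random.randint(0, 10, size=100)

clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=200)
clf.fit(X.reshape(len(X), -1), y)          # flatten each grid into one feature vector
predictions = clf.predict(X.reshape(len(X), -1))
```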

5. Results and Discussion

5.1 Analysis of skipped notes

Table 2 displays the total numbers and percentages of notes that are dropped when we convert from MIDI to each data representation and back, without doing anything else. Fixed-Grid(16) drops 6.94% of the events in the dataset, which gives a sense of how much detail is lost in existing models using the representation from Gillick et al. (). Increasing the grid resolution in Fixed-Grid(32) and Fixed-Grid(64) cuts down this number significantly to 2.85% and 0.92% respectively. The Event-Based representation does much better according to this measurement, only causing distortion in time for 0.1% of the drum hits.

Table 2

Statistics of the counts and percentages of events in the Groove MIDI Dataset training data that would be quantized or dropped by different data representations, before any modeling takes place. Variable length sequences in the Event-Based representation are between 4 and 300 tokens long.


Representation        # Skipped    % Skipped    Size

Fixed-Grid (16)       24038        6.94%        32 × 9 × 3
Fixed-Grid (32)       9875         2.85%        64 × 9 × 3
Fixed-Grid (64)       3210         0.92%        128 × 9 × 3
Event-Based           348          0.10%        X × 168
Flexible Grid (16)    0            0%           32 × 30 × 3

We also find that 58% of the 2-measure sequences used in our modeling experiments have at least one drum hit that is dropped when using the Fixed-Grid(16) representation. This means that for 42% of our datapoints, Fixed-Grid(16) is sufficient for encoding all the data.

5.2 Music generation with VAE

Figure 4 shows the results of the listening survey conducted with drummers. Three data representations (Flexible Grid(16), Fixed-Grid(16), and Event-Based) were each compared against each other; we show the head-to-head results aggregated across all participants for each comparison.

Figure 4 

Results of a blind head-to-head listening survey. Eleven drummers each participated in 15 trials for this survey, each of them choosing between pairs of two-measure drum loops generated by VAEs trained on each of three data representations.

Results show that both grid-based representations were preferred when compared against the Event-Based one (about 70% of the time). This accords with previous results demonstrating the benefits of beat and tempo-relative representations in music generation (). The most likely explanation here, which we experienced when piloting the survey ourselves, is that it can be jarring to listen to short drum loops that do not keep at least a relatively consistent beat; many of the samples generated by the Event-Based model exhibit this tendency, whereas the clips from the other two models usually do not. One potential confounding factor that could work against the Event-Based model in this comparison is that it is responsible for learning about tempo; we control for this factor, however, by converting all sequences to the same tempo (120 BPM) before applying any other pre-processing.

In the third comparison, between Flexible Grid(16) and the Fixed-Grid(16) baseline, we do not find a significant difference between the two groups (p=0.34). Here, the differences between the two models are subtler; the main difference is that Flexible Grid(16) is capable of generating a wider variety of gestures like drumrolls and flams (Event-Based also offers this capability, but has other drawbacks). In the context of this survey, where drummers were asked to listen to 2-measure loops without any musical context, these gestures, which appear in some samples but not others, did not strongly influence listeners in their choices. Taken together, however, the fact that the Flexible Grid(16) model can generate a more diverse set of musical gestures while remaining comparable to Fixed-Grid(16) and preferable to Event-Based in this survey offers evidence that Flexible Grids combine advantages from each of the two baselines.

5.3 Reconstruction with VAE

Figure 5 shows the F1-scores for reconstructing sequences in the test set using VAE models trained on each data representation. Each model is trained once and then evaluated across different groupings of the test data. Because not all drumming in the dataset contains the same distribution of gestures, to tease apart differences between representations, we stratify our evaluations here by the number of events that do not fit into the baseline Fixed-Grid(16) matrix (and so would be dropped in the VAE's input). If we include the 42% of 2-measure sequences that are fully captured by this baseline, we can see that in the LSTM setting, Fixed-Grid(16) performs best according to this metric, with an F1-score of 0.638, compared to 0.620 from Flexible Grid(16). As we increase the proportion of sequences with fast gestures in the evaluation, however, Flexible Grid(16) closes this gap and overtakes the baseline when considering only sequences with 9 or more secondary events. By contrast, in the CNN setting, Flexible Grid(16) performs best across the whole distribution.

Figure 5 

VAE Reconstruction (F1 scores per onset), plotted against sequences with increasingly more drumrolls and fast gestures. Data are aggregated such that the leftmost point on the line includes all drum sequences, the next point includes all drum sequences that have at least one event captured in the secondary matrix S, and so on.

These results demonstrate how the impact of the events captured or dropped by each data representation is passed on to models trained using it. Even though Fixed-Grid(32), Fixed-Grid(64), and Event-Based all encode more notes than Fixed-Grid(16) (as shown in Table 2), the corresponding models are not able to learn as well, so their reconstruction metrics are lower.

5.4 Classification

Table 3 summarizes the results of models trained to classify drummer identities and musical genres using each data representation. We find that while performance for Genre ID is fairly consistent across representations, Flexible Grid(16) performs better for classifying drummer identities, reaching an accuracy of 0.683, more than 3 absolute points better than the next best model at 0.650. This result suggests that encoding expressive music using a Flexible Grid captures some information about the gestures that each drummer uses which can help to discriminate between the different players.

Table 3

Accuracy scores for classifying drummer identity and genre with an MLP neural network, with 95% bootstrap confidence intervals. The Event-Based representation is excluded here because its variable-length sequences cannot be modeled with a feed-forward classification model.


Representation        Drummer ID       Genre ID

Fixed-Grid (16)       0.634 ± 0.027    0.547 ± 0.026
Fixed-Grid (32)       0.650 ± 0.026    0.544 ± 0.026
Fixed-Grid (64)       0.615 ± 0.026    0.519 ± 0.026
Event-Based           N/A              N/A
Flexible Grid (16)    0.683 ± 0.024    0.540 ± 0.027

6. Conclusions

Whenever researchers or technology designers work with musical data, we need to pay close attention to the representations we use when converting real world data into formats suitable for computational modeling. Every model that treats music as data must choose some representation, and there is a long history of systems and models using different data representations, which we categorize and summarize in Section 2. These choices of data representation are made at an early stage in the series of decisions that shape how music technology is built, designed, deployed, and ultimately put into the hands of creators, and small decisions here can have a large impact down the road.

Previous research suggests that the ways in which creators actually find uses for machine learning-based tools often diverge from the intentions of technology designers (; ), and questions around how these underlying data representations will ultimately matter to music creators () may not be thoroughly answered in the near future. Still, as applications based on machine learning become more integrated into the real-world creative processes of music producers, composers, and performers of different backgrounds and experience levels, we can expect that low-level choices of representation certainly will matter.

This paper takes a close look at the relative strengths and weaknesses of different approaches to representing expressive percussion data. We find that Fixed-Grid approaches used in the past have not been able to capture all the rich details of multi-scale musical gestures, while Event-Based representations are often more difficult to train and interact with; in response, we propose Flexible Grid data representations as a balance between these two endpoints. We find that when used for music generation, models trained on Flexible Grids are able to generate music of similar perceptual quality to Fixed-Grids, while at the same time incorporating details of the expressive drumming gestures captured by Event-Based representations. As more datasets and applications are developed around expressive music data (automatic transcription from audio to MIDI offers one path forward), we hope that the underlying motivations and design choices of the data representations explored here will be beneficial in a range of other musical settings.

Additional File

The additional file for this article can be found as follows:

Supplementary Files

Supporting Audio Files. DOI: https://doi.org/10.5334/tismir.98.s1