ICASSP 2023
Multitrack Music Transformer
Hao-Wen Dong
Ke Chen
Shlomo Dubnov
Julian McAuley
Taylor Berg-Kirkpatrick
University of California San Diego
| Model | Instrument control | Compound tokens | Average sample length (seconds) | Inference speed (notes per second) |
|---|---|---|---|---|
| MMT (ours) | ✓ | ✓ | 100.42 | 11.79 |
| MMM | ✕ | ✕ | 38.69 | 5.66 |
| REMI+ | ✕ | ✕ | 28.69 | 3.58 |
Note: All samples are generated in a single pass through the model using a sequence length of 1024. Thus, the generated music is usually shorter for a more complex ensemble than for a simpler one.
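As a rough illustration of this single-pass setting, the sketch below samples events autoregressively until an ‘end-of-song’ event or the 1024-event budget is reached, and reports the resulting notes-per-second rate. The `model.sample_next` call and the `(type, …)` event tuples are hypothetical stand-ins, not the actual MMT API.

```python
import time

MAX_SEQ_LEN = 1024  # every sample is generated within this event budget

def generate(model, prompt_events):
    """Autoregressively extend `prompt_events` until an 'end-of-song'
    event is sampled or the 1024-event budget is exhausted."""
    events = list(prompt_events)
    start = time.time()
    while len(events) < MAX_SEQ_LEN:
        event = model.sample_next(events)  # hypothetical sampling call
        events.append(event)
        if event[0] == "end-of-song":
            break
    elapsed = time.time() - start
    num_notes = sum(1 for e in events if e[0] == "note")
    print(f"inference speed: {num_notes / elapsed:.2f} notes per second")
    return events
```

Under this fixed budget, a complex ensemble spends more events per beat on its many instruments, which is why its samples cover fewer seconds of music.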
Settings (unconditioned generation): Only a ‘start-of-song’ event is provided to the model. The model generates the instrument list and then the note sequence.
Settings (instrument-informed generation): The model is given a ‘start-of-song’ event, followed by a sequence of instrument codes and a ‘start-of-notes’ event. The model then generates the note sequence.
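For concreteness, here is a hedged sketch of how these two prompt types could be assembled, using a simplified `(type, value)` event pair in place of MMT's full compound tokens; the event-type names follow the paper, but the helper functions are illustrative only.

```python
# Illustrative prompt construction; simplified (type, value) events,
# not MMT's actual compound-token encoding.

def unconditioned_prompt():
    # Unconditioned generation: only a 'start-of-song' event.
    return [("start-of-song", None)]

def instrument_informed_prompt(instrument_codes):
    # Instrument-informed generation: 'start-of-song', the instrument
    # list, then a 'start-of-notes' event.
    prompt = [("start-of-song", None)]
    prompt += [("instrument", code) for code in instrument_codes]
    prompt.append(("start-of-notes", None))
    return prompt

# Example: an ensemble of trumpet and trombone.
prompt = instrument_informed_prompt(["trumpet", "trombone"])
```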
| Ensemble |
|---|
| piano, church-organ, voices |
| contrabass, harp, english-horn, flute |
| trumpet, trombone |
| church-organ, viola, contrabass, strings, voices, horn, oboe |
Settings (4-beat continuation): All instrument and note events in the first 4 beats are provided to the model. The model then generates note events that continue the input music.
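A sketch of how such a continuation prompt could be extracted from a song: the tuple fields (type, beat, position, pitch, duration, instrument) mirror the compound tokens described in the paper, but the helper itself is a hypothetical illustration.

```python
def continuation_prompt(song_events, num_beats=4):
    """Keep the song header (instrument list) plus all note events
    whose onset beat falls within the first `num_beats` beats.
    Note events are assumed to be tuples of the form
    ('note', beat, position, pitch, duration, instrument)."""
    header = [e for e in song_events
              if e[0] in ("start-of-song", "instrument", "start-of-notes")]
    early_notes = [e for e in song_events
                   if e[0] == "note" and e[1] < num_beats]
    return header + early_notes
```

The same helper with `num_beats=16` corresponds to the 16-beat continuation setting shown further below.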
Settings (unconditioned generation): Only a ‘start-of-song’ event is provided to the model. The model generates the instrument list and then the note sequence.
| | Sample 1 | Sample 2 | Sample 3 |
|---|---|---|---|
| MMT (ours) | | | |
| MMM | | | |
| REMI+ | | | |
Settings (instrument-informed generation): The model is given a ‘start-of-song’ event, followed by a sequence of instrument codes and a ‘start-of-notes’ event. The model then generates the note sequence.
| | Sample 1 | Sample 2 | Sample 3 |
|---|---|---|---|
| MMT (ours) | | | |
Settings (4-beat continuation): All instrument and note events in the first 4 beats are provided to the model. The model then generates note events that continue the input music.
| | Sample 1 | Sample 2 | Sample 3 |
|---|---|---|---|
| MMT (ours) | | | |
| Ground truth | | | |
Settings (16-beat continuation): All instrument and note events in the first 16 beats are provided to the model. The model then generates note events that continue the input music.
| | Sample 1 | Sample 2 | Sample 3 |
|---|---|---|---|
| MMT (ours) | | | |
| Ground truth | | | |
Hao-Wen Dong, Ke Chen, Shlomo Dubnov, Julian McAuley, and Taylor Berg-Kirkpatrick, “Multitrack Music Transformer,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.
@inproceedings{dong2023mmt,
author = {Hao-Wen Dong and Ke Chen and Shlomo Dubnov and Julian McAuley and Taylor Berg-Kirkpatrick},
title = {Multitrack Music Transformer},
booktitle = {Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
year = 2023,
}
Jeff Ens and Philippe Pasquier, “MMM: Exploring conditional multi-track music generation with the transformer,” arXiv preprint arXiv:2008.06048, 2020.
Dimitri von Rütte, Luca Biggio, Yannic Kilcher, and Thomas Hofmann, “FIGARO: Generating symbolic music with fine-grained artistic control,” arXiv preprint arXiv:2201.10936, 2022.