ICASSP 2023
Multitrack Music Transformer
Hao-Wen Dong
Ke Chen
Shlomo Dubnov
Julian McAuley
Taylor Berg-Kirkpatrick
University of California San Diego
| Model | Instrument control | Compound tokens | Average sample length (seconds) | Inference speed (notes per second) |
|---|---|---|---|---|
| MMT (ours) | ✓ | ✓ | 100.42 | 11.79 |
| MMM | ✕ | ✕ | 38.69 | 5.66 |
| REMI+ | ✕ | ✕ | 28.69 | 3.58 |
Note: All samples are generated in a single pass through the model using a sequence length of 1024. Thus, the generated music is usually shorter for a more complex ensemble than for a simpler one.
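As a rough illustration of this single-pass setting, the sketch below samples events autoregressively until an ‘end-of-song’ event or the 1024-event budget is reached, and reports the resulting notes-per-second rate. The `model.sample_next` call and the `(type, …)` event tuples are hypothetical stand-ins, not the actual MMT API.

```python
import time

MAX_SEQ_LEN = 1024  # every sample is generated within this event budget

def generate(model, prompt_events):
    """Autoregressively extend `prompt_events` until an 'end-of-song'
    event is sampled or the 1024-event budget is exhausted."""
    events = list(prompt_events)
    start = time.time()
    while len(events) < MAX_SEQ_LEN:
        event = model.sample_next(events)  # hypothetical sampling call
        events.append(event)
        if event[0] == "end-of-song":
            break
    elapsed = time.time() - start
    num_notes = sum(1 for e in events if e[0] == "note")
    print(f"inference speed: {num_notes / elapsed:.2f} notes per second")
    return events
```

Under this fixed budget, a complex ensemble spends more events per beat on its many instruments, which is why its samples cover fewer seconds of music.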
Settings (unconditioned generation): Only a ‘start-of-song’ event is provided to the model. The model generates the instrument list and then the note sequence.
Settings (instrument-informed generation): The model is given a ‘start-of-song’ event, followed by a sequence of instrument codes and a ‘start-of-notes’ event. The model then generates the note sequence.
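For concreteness, here is a hedged sketch of how these two prompt types could be assembled, using a simplified `(type, value)` event pair in place of MMT's full compound tokens; the event-type names follow the paper, but the helper functions are illustrative only.

```python
# Illustrative prompt construction; simplified (type, value) events,
# not MMT's actual compound-token encoding.

def unconditioned_prompt():
    # Unconditioned generation: only a 'start-of-song' event.
    return [("start-of-song", None)]

def instrument_informed_prompt(instrument_codes):
    # Instrument-informed generation: 'start-of-song', the instrument
    # list, then a 'start-of-notes' event.
    prompt = [("start-of-song", None)]
    prompt += [("instrument", code) for code in instrument_codes]
    prompt.append(("start-of-notes", None))
    return prompt

# Example: an ensemble of trumpet and trombone.
prompt = instrument_informed_prompt(["trumpet", "trombone"])
```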
| Ensemble |
|---|
| piano, church-organ, voices |
| contrabass, harp, english-horn, flute |
| trumpet, trombone |
| church-organ, viola, contrabass, strings, voices, horn, oboe |
Settings (4-beat continuation): All instrument and note events in the first 4 beats are provided to the model. The model then generates note events that continue the input music.
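A sketch of how such a continuation prompt could be extracted from a song: the tuple fields (type, beat, position, pitch, duration, instrument) mirror the compound tokens described in the paper, but the helper itself is a hypothetical illustration.

```python
def continuation_prompt(song_events, num_beats=4):
    """Keep the song header (instrument list) plus all note events
    whose onset beat falls within the first `num_beats` beats.
    Note events are assumed to be tuples of the form
    ('note', beat, position, pitch, duration, instrument)."""
    header = [e for e in song_events
              if e[0] in ("start-of-song", "instrument", "start-of-notes")]
    early_notes = [e for e in song_events
                   if e[0] == "note" and e[1] < num_beats]
    return header + early_notes
```

The same helper with `num_beats=16` corresponds to the 16-beat continuation setting shown further below.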
Settings (unconditioned generation): Only a ‘start-of-song’ event is provided to the model. The model generates the instrument list and then the note sequence.
| | Sample 1 | Sample 2 | Sample 3 |
|---|---|---|---|
| MMT (ours) | | | |
| MMM | | | |
| REMI+ | | | |
Settings (instrument-informed generation): The model is given a ‘start-of-song’ event, followed by a sequence of instrument codes and a ‘start-of-notes’ event. The model then generates the note sequence.
| | Sample 1 | Sample 2 | Sample 3 |
|---|---|---|---|
| MMT (ours) | | | |
Settings (4-beat continuation): All instrument and note events in the first 4 beats are provided to the model. The model then generates note events that continue the input music.
| | Sample 1 | Sample 2 | Sample 3 |
|---|---|---|---|
| MMT (ours) | | | |
| Ground truth | | | |
Settings (16-beat continuation): All instrument and note events in the first 16 beats are provided to the model. The model then generates note events that continue the input music.
| | Sample 1 | Sample 2 | Sample 3 |
|---|---|---|---|
| MMT (ours) | | | |
| Ground truth | | | |
Hao-Wen Dong, Ke Chen, Shlomo Dubnov, Julian McAuley, and Taylor Berg-Kirkpatrick, “Multitrack Music Transformer,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.
@inproceedings{dong2023mmt,
author = {Hao-Wen Dong and Ke Chen and Shlomo Dubnov and Julian McAuley and Taylor Berg-Kirkpatrick},
title = {Multitrack Music Transformer},
booktitle = {Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
year = 2023,
}
Jeff Ens and Philippe Pasquier, “MMM: Exploring conditional multi-track music generation with the transformer,” arXiv preprint arXiv:2008.06048, 2020.
Dimitri von Rütte, Luca Biggio, Yannic Kilcher, and Thomas Hofmann, “FIGARO: Generating symbolic music with fine-grained artistic control,” arXiv preprint arXiv:2201.10936, 2022.