Generating NES chiptunes with Transformers

Back in February of 2022 I had the idea to train a GPT transformer to generate NES music. I chose NES music because it comes with a few very useful constraints. The NES sound chip gives you just four channels for most music: two pulse (square wave) channels, a triangle channel, and a noise channel (there's also a fifth DPCM sample channel, but it's comparatively rarely used). I also happen to really enjoy the way it sounds.

1943 - The Battle of Midway

The amount of character that can be conveyed through these four channels is really impressive. In the Midway OST, the noise channel's timbre alternates to sound like a military snare pattern.

For anyone who isn't familiar with NES music, Tim Follin is widely regarded as one of the greatest composers for the NES sound chip. The Silver Surfer OST is one of my favorites of any game.

Silver Surfer NES OST

Anyways, on with the actual code.

The first step is always a bit of research. I stumbled across LakhNES, a Transformer-XL trained on the Lakh MIDI dataset, which consists of over 170,000 MIDI files. I clicked on one at random to see if the MIDI was formatted properly, and it happened to be Margaritaville by Jimmy Buffett. What are the odds? For the fine-tuning, the LakhNES researchers used the NES-MDB dataset and converted it to their own text format using multiple embeddings.

Unfortunately, the code they used was written in Python 2. I tried building a Docker container to run it, but getting everything versioned properly was incredibly difficult. In the end I settled on a file format called ABC notation, which is less information-dense than multiple embeddings, but compact enough to fit most songs in under 2048 tokens, the context length of a reasonably-sized GPT model.
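For anyone who hasn't seen ABC notation before, here is a minimal hand-written example (not from the dataset): a few header fields for index, meter, default note length, and key, followed by the notes themselves as plain text.

```
X:1
T:Example
M:4/4
L:1/8
K:C
CDEF GABc | c2G2 E2C2 |]
```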

First I used Mido to preprocess the data. To account for stereo tracks and duplicate tracks, I removed every track with the same length as one already seen. Next, the tracks needed to be standardized. In the MIDI spec a note_on should always be followed by a matching note_off, but a note_on with a velocity of 0 is equivalent to a note_off. In the Lakh dataset some songs use note_off and some use note_on with zero velocity, so I standardized all songs to note_on with zero velocity.
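A minimal sketch of that pass with Mido (the function name and the length-based duplicate check are my reconstruction, not the exact project code):

```python
import mido

def standardize(path, out_path):
    mid = mido.MidiFile(path)
    kept, seen_lengths = [], set()
    for track in mid.tracks:
        if len(track) in seen_lengths:  # crude stereo/duplicate heuristic
            continue
        seen_lengths.add(len(track))
        new_track = mido.MidiTrack()
        for msg in track:
            # Rewrite every note_off as a note_on with velocity 0
            if msg.type == 'note_off':
                msg = mido.Message('note_on', channel=msg.channel,
                                   note=msg.note, velocity=0, time=msg.time)
            new_track.append(msg)
        kept.append(new_track)
    mid.tracks = kept
    mid.save(out_path)
```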

The next step was to convert the MIDI to a text representation. I was using the music21 library to estimate the BPM and key signature of songs, but music21 only supports MIDI -> MusicXML, which is far too verbose a representation. Luckily there is a MusicXML -> ABC conversion tool written in Python. However, it runs very, very slowly (and some files just fail to convert). I let the tool run in the background for a week while I was out of town to convert the entire dataset.
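The analysis half of that step looked roughly like this, using music21's standard API (file names are placeholders; the MusicXML -> ABC conversion itself happened in the separate tool):

```python
from music21 import converter

score = converter.parse('song.mid')
print(score.analyze('key'))  # e.g. <music21.key.Key of C major>

# Tempo marks, region by region, for a rough BPM estimate
for start, end, mark in score.metronomeMarkBoundaries():
    print(start, end, mark.number)

# Write MusicXML, then hand it off to the xml -> abc converter
score.write('musicxml', 'song.musicxml')
```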

The next step for the pre-training dataset was to remove all tracks longer than 2048 tokens. I concatenated the ABC files, trained an aitextgen tokenizer on them, then tokenized each file and threw out the ones with more than 2048 tokens. I used a vocab size of 8192, which just means there are 8192 possible tokens available to re-create all of the input files. Unfortunately, after all of the pruning I was left with only about 25 MB of data to train my network.
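Sketched out, the tokenizer training and pruning looked something like this (file names are illustrative, and I'm assuming a recent aitextgen that serializes the tokenizer to aitextgen.tokenizer.json by default):

```python
from pathlib import Path
from aitextgen.tokenizers import train_tokenizer
from tokenizers import Tokenizer

# Train a BPE tokenizer with an 8192-token vocab on the concatenated corpus
train_tokenizer('all_songs.abc', vocab_size=8192)

# Throw out any song that doesn't fit in the model's 2048-token context
tok = Tokenizer.from_file('aitextgen.tokenizer.json')
keep = [p for p in Path('abc').glob('*.abc')
        if len(tok.encode(p.read_text()).ids) <= 2048]
```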

I found the aitextgen library a lot more straightforward and easier to use than HuggingFace transformers directly. Most of the important settings were already configured for me. I had the most success with the following parameters:

from aitextgen.utils import build_gpt2_config
config = build_gpt2_config(vocab_size=8192, max_length=2048, dropout=0.0, n_embd=768, n_layer=12, n_head=8) # ~92M params

A GPT model with this config ends up having 92M parameters, which was the limit of what my GPU's VRAM could handle.

Training in Google Colab on a P100, I was able to run about 150,000 steps per day. Performance started to fall off after about 900k steps, so I ended up fine-tuning from the 750k-step model snapshot.
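Putting the pieces together, the pre-training loop was roughly the following, using the config defined above (a sketch against aitextgen's standard API; the batch size and checkpoint cadence here are illustrative, not my exact settings):

```python
from aitextgen import aitextgen
from aitextgen.TokenDataset import TokenDataset

# Encode the ABC corpus into fixed-size blocks and train from scratch
data = TokenDataset('all_songs.abc',
                    tokenizer_file='aitextgen.tokenizer.json',
                    block_size=2048)
ai = aitextgen(config=config, tokenizer_file='aitextgen.tokenizer.json')
ai.train(data, batch_size=1, num_steps=150_000,
         generate_every=5_000, save_every=5_000)
```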

Music break! Here is a song generated by the network after 1M timesteps.

Ok, well, it's not very good. But I still have to fine-tune the model on the NES music.

The NES-MDB dataset turned out to be in a strange format that was incompatible with my workflow, so I opted to simply scrape every MIDI file from VGMusic written for the NES or for a system with a similar sound chip, so C64 and Game Boy OSTs were fair game. After processing, this dataset ended up really small: only about 2,000 files, or 5 MB. The network trained on it incredibly quickly, finishing after about 25k steps.
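The fine-tuning and sampling calls were along these lines (again a sketch; 'trained_model' is aitextgen's default checkpoint folder, and the prompt is just the start of an ABC header):

```python
from aitextgen import aitextgen

# Resume from the pre-trained checkpoint and fine-tune on the NES scrape
ai = aitextgen(model_folder='trained_model',
               tokenizer_file='aitextgen.tokenizer.json')
ai.train('nes_songs.abc', batch_size=1, num_steps=25_000)

# Sample a new tune, seeded with an ABC header
ai.generate(n=1, prompt='X:1', max_length=2048)
```

And the final results: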

https://www.youtube.com/watch?v=7pEvu_nC2LQ&ab_channel=Seabass

The results weren't very satisfying. I was hoping for something catchier, but I think I ran into three main issues:

  1. Not enough data, and uneven data quality (I was unaware of this dataset at the time)
  2. Not enough information density (can be fixed with multiple embeddings)
  3. Not enough processing power (92M is pretty small for a GPT network, although Stable Diffusion only has 892M params)

Hopefully I can revisit this project at some point in the future, but it’s low-hanging fruit so I assume someone else will do it better than I ever could by the time it becomes feasible on an average GPU.

In the meantime, I still have a fully-working project that you can run on your own PC. You can generate AI songs fully from scratch, or feed it an existing MIDI file and remix the tracks using the AI. It also includes the Colab notebooks you can use to train your own models from scratch.

https://github.com/pickles976/chiptune-ai
https://github.com/pickles976/chiptune-react

Extras:

I also went on a tangent converting MIDI -> png and then training StyleGAN2-ADA on the resulting images. It didn't turn out well. The GAN was able to represent the larger patterns of notes very easily, but it blended the pixels together too much and produced discordant 'black MIDI'. That repo is available here if for some reason you want to convert MIDI to png.
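If you're curious what the image representation looks like, here is a rough piano-roll rasterizer in the same spirit (my own simplified reconstruction, not the repo's code; the real version handles timing and image dimensions more carefully):

```python
import mido
from PIL import Image

def midi_to_png(path, out_path, ticks_per_px=24, width=1024):
    """Plot note onsets on a 128-pixel-tall piano roll, one row per MIDI pitch."""
    mid = mido.MidiFile(path)
    img = Image.new('L', (width, 128), 0)
    for track in mid.tracks:
        t = 0
        for msg in track:
            t += msg.time  # delta time in ticks
            if msg.type == 'note_on' and msg.velocity > 0:
                x = min(t // ticks_per_px, width - 1)
                img.putpixel((x, 127 - msg.note), 255)
    img.save(out_path)
```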

Written on September 2, 2022