
Museformer

Museformer: Transformer with Fine- and Coarse-Grained Attention for Music Generation, by Botao Yu, Peiling Lu, Rui Wang, Wei Hu, Xu Tan, Wei Ye, Shikun Zhang, Tao Qin, Tie-Yan Liu, NeurIPS 2022, is a Transformer with a novel fine- and coarse-grained attention (FC-Attention) for music generation. Specifically, with the fine-grained attention, a token of a specific bar directly attends to all the tokens of the bars that are most relevant to music structures (e.g., the previous 1st, 2nd, 4th and 8th bars, selected via similarity statistics); with the coarse-grained attention, a token attends only to a summarization of the other bars rather than to each of their tokens, which reduces the computational cost. The advantages are two-fold. First, it captures music structure-related correlations via the fine-grained attention and other contextual information via the coarse-grained attention. Second, it is efficient and can model music sequences over 3x longer than its full-attention counterpart can. Both objective and subjective experimental results demonstrate its ability to generate long music sequences with high quality and better structures.

demo: link


Information flow of FC-Attention.

The following sections describe the steps to run Museformer. All commands are run from the root directory of Museformer (referred to as root_dir) unless otherwise specified.

1. Dataset

We use the Lakh MIDI dataset (LMD-full). Specifically, we first preprocess it as described in the appendix of our paper. The final dataset (see the file lists here) contains 29,940 MIDI files. Their time signatures are all 4/4, and the instruments are normalized to 6 basic ones: square synthesizer (80), piano (0), guitar (25), string (48), bass (43), and drum, where the numbers in parentheses are MIDI program IDs where applicable. Put all the MIDI files in data/midi.

Note: If you want to train Museformer on an arbitrary dataset with various time signatures and instruments instead of only the ones mentioned above, please follow the [General Use] notes throughout this document.

Install MidiProcessor. Then, encode the MIDI files into tokens:

midi_dir=data/midi
token_dir=data/token
mp-batch-encoding $midi_dir $token_dir --encoding-method REMIGEN2 --normalize-pitch-value --remove-empty-bars --ignore-ts --sort-insts 6tracks_cst1

where --encoding-method REMIGEN2 selects the token representation, --normalize-pitch-value normalizes the pitch values, --remove-empty-bars removes empty bars, --ignore-ts skips time-signature tokens (all the files here are 4/4), and --sort-insts 6tracks_cst1 sorts the tracks in a fixed order for the 6 normalized instruments.

[General Use] To make the representation support various time signatures and instruments, please set --encoding-method REMIGEN and --sort-insts id instead of the values in the above command, and also remove the --ignore-ts flag.
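
For reference, assembling the changes above, the general-use encoding command would look like this (same directory variables as before):

midi_dir=data/midi
token_dir=data/token
mp-batch-encoding $midi_dir $token_dir --encoding-method REMIGEN --normalize-pitch-value --remove-empty-bars --sort-insts id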

After encoding, you should see the token representation of each MIDI file in token_dir (data/token).

Then, run the following command to gather the tokens for each split.

token_dir=data/token
split_dir=data/split
for split in train valid test; do
	python tools/generate_token_data_by_file_list.py data/meta/${split}.txt $token_dir $split_dir
done

[General Use] To use an arbitrary dataset, please create the MIDI file lists for your dataset on your own as data/meta/{train,valid,test}.txt before running the above command.
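
If you need a quick way to create such lists, here is a minimal sketch; it assumes each line is simply a MIDI file name from data/midi and uses an arbitrary 90/5/5 random split (both assumptions for illustration, not the exact recipe from the paper):

# build file lists with a random 90/5/5 split (hypothetical split ratio)
ls data/midi | shuf > data/meta/all.txt
n=$(wc -l < data/meta/all.txt)
n_train=$((n * 90 / 100))
n_valid=$((n * 5 / 100))
head -n $n_train data/meta/all.txt > data/meta/train.txt
tail -n +$((n_train + 1)) data/meta/all.txt | head -n $n_valid > data/meta/valid.txt
tail -n +$((n_train + n_valid + 1)) data/meta/all.txt > data/meta/test.txt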

Next, use fairseq-preprocess to make binary data:

split_dir=data/split
data_bin_dir=data-bin/lmd6remi

mkdir -p data-bin

fairseq-preprocess \
  --only-source \
  --trainpref $split_dir/train.data \
  --validpref $split_dir/valid.data \
  --testpref $split_dir/test.data \
  --destdir $data_bin_dir \
  --srcdict data/meta/dict.txt

Now, you should see the binary data in data-bin/lmd6remi.

[General Use] Set --srcdict data/meta/general_use_dict.txt, which is a vocabulary list that contains various time signatures and instruments.
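
That is, the fairseq-preprocess command stays the same except for the dictionary:

fairseq-preprocess \
  --only-source \
  --trainpref $split_dir/train.data \
  --validpref $split_dir/valid.data \
  --testpref $split_dir/test.data \
  --destdir $data_bin_dir \
  --srcdict data/meta/general_use_dict.txt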

2. Environment

The implementation of Museformer relies on a specific hardware and software environment.

3. Train

Run the following command to train Museformer:

bash ttrain/mf-lmd6remi-1.sh

In our experiments, we run it on 4 GPUs with the per-GPU batch size set to 1, so the real (effective) batch size is 4. The current implementation only supports a per-GPU batch size of 1; to change the real batch size, modify UPDATE_FREQ instead.
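
For instance, assuming UPDATE_FREQ in the training script maps to fairseq's --update-freq (gradient accumulation), the real batch size is the number of GPUs × the per-GPU batch size (1) × UPDATE_FREQ:

# hypothetical edit inside ttrain/mf-lmd6remi-1.sh
UPDATE_FREQ=4   # real batch size = 4 GPUs x 1 x 4 = 16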

By modifying con2con and con2sum, you can control which bars are covered by the fine-grained attention and the coarse-grained attention, respectively.

[General Use] Please add --beat-mask-ts True to the fairseq-train command.

The first run may take some time to build auxiliary information and compile CUDA kernels, so feel free to grab a cup of coffee while you wait.

You can download a checkpoint here, and put it in checkpoints/mf-lmd6remi-1 for evaluation and inference.

4. Evaluation

You can obtain perplexity on the test set with the following command:

bash tval/val__mf-lmd6remi-x.sh 1 checkpoint_best.pt 10240

The number 10240 specifies the maximum sequence length used in the computation.
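
For example, to evaluate the same checkpoint with a smaller maximum sequence length (that value is the only change here):

bash tval/val__mf-lmd6remi-x.sh 1 checkpoint_best.pt 5120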

5. Inference

Use the following command to generate 5 music pieces, with the random seed set to 1:

mkdir -p output_log
seed=1
printf '\n\n\n\n\n' | bash tgen/generation__mf-lmd6remi-x.sh 1 checkpoint_best.pt ${seed} | tee output_log/generation.log

The number of \n characters piped in controls the number of generated music pieces. Generation may take a while. Once done, the generation log is saved at output_log/generation.log.
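
For example, to generate 10 pieces with a different seed, pipe in 10 newlines (printf '\n%.0s' {1..10} is just a compact way to print 10 of them):

seed=2
printf '\n%.0s' {1..10} | bash tgen/generation__mf-lmd6remi-x.sh 1 checkpoint_best.pt ${seed} | tee output_log/generation.log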

Then, use the following command to extract the generated token sequences from the generation log:

python tools/batch_extract_log.py output_log/generation.log output/generation --start_idx 1

You should see the token representation of each generated music piece in output/generation.

Finally, run the following command to convert the token sequences into MIDI files:

python tools/batch_generate_midis.py --encoding-method REMIGEN2 --input-dir output/generation --output-dir output/generation

You should see the MIDI files in output/generation.