Training an audio keyword spotter with PyTorch

by Chris Lovett

This tutorial will show you how to train a keyword spotter using PyTorch. A keyword spotter listens to an audio stream from a microphone and recognizes certain spoken keywords. Since it is always listening there is good reason to find a keyword spotter that can run on a very small low-power co-processor so the main computer can sleep until a word is recognized. The ELL compiler makes that possible. In this tutorial you will train a keyword spotter using the the speech commands dataset which contains 65,000 recordings of 30 different keywords (bed, bird, cat, dog, down, eight, five, four, go, happy, house, left, marvin, nine, no, off, on, one, right, seven, sheila, six, stop, three, tree, two, up, wow, yes, zero) each about one second long.

This is the dataset used to train the models in the speech commands model gallery. The Getting started with keyword spotting on the Raspberry Pi tutorial uses these models. Once you learn how to train a model you can then train your own custom model that responds to different keywords, or even random sounds, or perhaps you want just a subset of the 30 keywords for your application. This tutorial shows you how to do that.

Before you begin

Complete the following steps before starting the tutorial.

What you will need

  • Laptop or desktop computer with at least 16 GB of RAM.
  • Optional NVidia Graphics Card that supports CUDA. You will get great results with a GTX 1080 which is commonly used for training neural networks.
  • The NVidia CUDA 10.0 SDK from


The picture below illustrates the process you will follow in this tutorial. First you will convert the wav files into a big training dataset using a featurizer. This dataset is the input to the training process which outputs a trained keyword spotter. The keyword spotter can then be verified by testing.


Training a Neural Network is a computationally intensive task that takes millions or even billions of floating point operations. That is why you probably want to use CUDA accelerated training. Fortunately, audio models train pretty quickly. On an NVidia 1080 graphic card the 30 keyword speech_commands dataset trains in about 3 minutes using PyTorch. Without CUDA the same training takes over 2.5 hours on an Intel Core i7 CPU.

ELL Root

After you have installed and built the ELL compiler, you also need to set an environment variable named ELL_ROOT that points to the location of your ELL git repo, for example:

[Linux] export ELL_ROOT="~/git/ell"

[Windows] set ELL_ROOT=d:\git\ell

Subsequent scripts depend on this path being set correctly. Now make a new working folder:

mkdir tutorial
cd tutorial

Installing PyTorch

Installing PyTorch with CUDA is easy to do using your Conda environment. If you don’t have a Conda environment, see the ELL setup instructions (Windows, Ubuntu Linux, macOS).

You may want to create a new Conda environment for PyTorch training, or you can add PyTorch to your existing one. You can create a new environment easily with this command:

conda create -n torch python=3.6

Activate it

[Linux] source activate torch
[Windows] activate torch

Then follow the PyTorch setup instructions for Anaconda and your version of CUDA as described on this page:

Helper Python Code

This tutorial uses python scripts located in your ELL git repo under tools/utilities/pythonlibs/audio and tools/utilities/pythonlibs/audio/training. When you see a python script referenced below like, just prefix that with the full path to that script your ELL git repo.

Downloading the Training Data

Next, you will need to download the training data set:

Google crowd sourced the creation of these recordings so you get a nice variety of voices. Google released it under the Creative Commons BY 4.0 license.

Go ahead and download that file and move it into a folder named audio then unpack it using this Linux command:

tar xvf speech_commands_v0.01.tar.gz

On Windows you can use the Windows Subsystem for Linux to do the same. Alternatively, you can install 7-zip. 7-zip will install a new menu item so you can right click the speech_commands_v0.01.tar.gz and select “Extract here”. The total disk space required for the uncompressed files is about 2 GB. When complete your audio folder should contain 30 folders plus one named background_noise. You should also see the following additional files:

  • validation_list.txt - the list of files that make up the validation set
  • testing_list.txt - the list of files in the testing set

Lastly, you will need to create the training_list.txt file containing all the wav files (minus the validation and test sets) which you can do with this command:

python --wav_files audio --max_files_per_directory 1600
copy audio\categories.txt .

Where audio is the path to your unpacked speech command wav files. This will also create a categories.txt file in the same folder. This file lists the names of the keywords (directories) found in the audio folder. Copy that file to your working tutorial folder.

Note that the command line above includes the option --max_files_per_directory 1600. This options limits the training list to a maximum of 1600 files per subdirectory and will result in a training dataset of around 1.2 GB (using the make_dataset command line options shown below). Feel free to try other numbers here or remove the limit entirely to use every available file for training. You will notice there are not exactly the same number of training files in each subdirectory, but that is ok. Without any limits the full speech_commands training dataset file will be about 1.6 GB and the script may use up to 6 gb RAM to get the job done.

As you can see you can try different sized training datasets. When you are ready to experiment, figure out which gives the best results, all the training files, or a subset.

Create a Featurizer Model

As shown in the earlier tutorial the featurizer model is a mel-frequency cepstrum (mfcc) audio transformer which preprocesses audio input, preparing it for use by the training process.

This featurizer is created as an ELL model using the make_featurizer command:

python --sample_rate 16000 --window_size 512 --input_buffer_size 512 --filterbank_type mel --filterbank_size 80 --filterbank_nfft 512 --nfft 512 --log

You should see a message saying “Saving featurizer.ell” and if you print this using the following command line:

[Linux] $ELL_ROOT/build/bin/print -imap featurizer.ell
[Windows] %ELL_ROOT%\build\bin\release\print -imap featurizer.ell -fmt dgml -of graph.dgml

then you will see the following nodes:


You can now compile this model using the ELL model compiler to run on your PC using the familiar wrap command:

[Windows] python %ELL_ROOT%\tools\wrap\ --model_file featurizer.ell --outdir compiled_featurizer --module_name mfcc

[Linux] python $ELL_ROOT/tools/wrap/ --model_file featurizer.ell --outdir compiled_featurizer --module_name mfcc

Then you can build the compiled_featurizer folder using cmake as done in prior tutorials. You will do this a lot so it might be handy to create a little batch or shell script called makeit that contains the following:


mkdir build
cd build
cmake -G "Visual Studio 15 2017 Win64" -Thost=x64 ..
cmake --build . --config Release
cd ..


mkdir build
cd build
cmake ..
cd ..

So compiling a wrapped ELL model is now this simple:

pushd compiled_featurizer

Create the Dataset using the Featurizer

Now you have a compiled featurizer, so you can preprocess all the audio files using this featurizer and create a compressed numpy dataset with the result. This large dataset will contain one row per audio file, where each row contains all the featurizer output for that file. The featurizer output is smaller than the raw audio, but it will still end up being a pretty big file, (about 1.2 GB). Of course it depends how many files you include in the set. Remember for best training results the more files the better, so you will use the training_list.txt you created earlier which selected 1600 files per keyword. You need three datasets created from each of the list files in your audio folder using make_dataset as follows:

python --list_file audio/training_list.txt --featurizer compiled_featurizer/mfcc --sample_rate 16000 --window_size 40 --shift 40 --auto_scale
python --list_file audio/validation_list.txt --featurizer compiled_featurizer/mfcc --sample_rate 16000 --window_size 40 --shift 40 --auto_scale
python --list_file audio/testing_list.txt --featurizer compiled_featurizer/mfcc --sample_rate 16000 --window_size 40 --shift 40 --auto_scale

Where the audio folder contains your unpacked .wav files. If your audio files are in a different location then simply provide the full path to it in the above commands.

The reason for the --sample_rate 16000 argument is that small low powered target devices might not be able to record and process audio at very high rates. So while your host PC can probably do 96kHz audio and higher just fine, this tutorial shows you how to down sample the audio to something that will run on a tiny target device. The main point being that you will get the best results if you train the model on audio that is sampled at the same rate that your target device will be recording.

The --auto_scale option converts raw integer audio values to floating point numbers in the range [-1, 1].

Creating the datasets will take a while, about 10 minutes or more, so now is a great time to grab a cup of tea. It will produce three files in your working folder named training.npz, validation.npz and testing.npz which you will use below.

You will notice in the output that any folder starting with “_” special category named <null>. This means you can add negative audio samples to these folders so the model learns to ignore words also rather than falsely recognizing them as one of the 30 words you want to recognize.

Train the Keyword Spotter

You can now finally train the keyword spotter using the train_classifier script:

python --architecture GRU --audio .

This script will use PyTorch to train a GRU based model using the datasets you created earlier then it will export an onnx model from that. The file will be named KeywordSpotter.onnx and if all goes well you should see console output like this:

Loading d:\datasets\Audio\Kaggle\train\audio\training_list.npz...
Loading d:\datasets\Audio\Kaggle\train\audio\validation_list.npz...
Training keyword spotter using 46953 rows of featurized training input...
Epoch 0, Loss 1.5005683898925781, Validation Accuracy 65.404
Epoch 1, Loss 1.191438913345337, Validation Accuracy 76.931
Epoch 2, Loss 1.031152606010437, Validation Accuracy 81.928
Epoch 3, Loss 0.8680925369262695, Validation Accuracy 84.891
Epoch 4, Loss 0.8902071118354797, Validation Accuracy 85.996
Epoch 5, Loss 0.7643579244613647, Validation Accuracy 87.294
Epoch 6, Loss 0.7202472686767578, Validation Accuracy 87.058
Epoch 7, Loss 0.7902640104293823, Validation Accuracy 87.618
Epoch 8, Loss 0.7076990008354187, Validation Accuracy 88.163
Epoch 9, Loss 0.7693088054656982, Validation Accuracy 87.765
Epoch 10, Loss 0.6854181289672852, Validation Accuracy 88.561
Epoch 11, Loss 0.805185854434967, Validation Accuracy 88.414
Epoch 12, Loss 0.5628467202186584, Validation Accuracy 89.121
Epoch 13, Loss 0.7616006135940552, Validation Accuracy 89.239
Epoch 14, Loss 0.6700156331062317, Validation Accuracy 89.107
Epoch 15, Loss 0.5844067931175232, Validation Accuracy 89.180
Epoch 16, Loss 0.583706259727478, Validation Accuracy 89.166
Epoch 17, Loss 0.6179401278495789, Validation Accuracy 89.225
Epoch 18, Loss 0.6164156794548035, Validation Accuracy 88.856
Epoch 19, Loss 0.5983876585960388, Validation Accuracy 89.225
Epoch 20, Loss 0.6414792537689209, Validation Accuracy 89.328
Epoch 21, Loss 0.600459635257721, Validation Accuracy 89.269
Epoch 22, Loss 0.5980986952781677, Validation Accuracy 89.460
Epoch 23, Loss 0.5426868200302124, Validation Accuracy 89.372
Epoch 24, Loss 0.552982747554779, Validation Accuracy 89.682
Epoch 25, Loss 0.5772491097450256, Validation Accuracy 89.446
Epoch 26, Loss 0.5909024477005005, Validation Accuracy 89.298
Epoch 27, Loss 0.5863549709320068, Validation Accuracy 89.814
Epoch 28, Loss 0.605217456817627, Validation Accuracy 89.505
Epoch 29, Loss 0.5369739532470703, Validation Accuracy 89.785
Trained in 380.31 seconds, saved model ''
Training accuracy = 98.849 %
Loading d:\datasets\Audio\Kaggle\train\audio\testing_list.npz...
Testing accuracy = 89.696 %
saving onnx file: KeywordSpotter.onnx

So here you see the model has trained pretty well and is getting an evaluation score of 89.696% using the testing_list.npz dataset. The testing_list contains files that the training_list never saw before so it is expected that the test score (89.7%) will always be lower than the final training accuracy (98.8%). The real trick is increasing that test score. This problem has many data scientists employed around the world!

Importing the ONNX Model

In order to try your new model using ELL, you first need to import it from ONNX into the ELL format as follows:

[Linux] python $ELL_ROOT/tools/importers/onnx/ GRUKeywordSpotter.onnx
[Windows] python %ELL_ROOT%\tools\importers\onnx\ GRUKeywordSpotter.onnx

This will generate an ELL model named GRUKeywordSpotter.ell which you can now compile using similar technique you used on the featurizer:

[Linux] python $ELL_ROOT%/tools/wrap/ --model_file GRUKeywordSpotter.ell --outdir KeywordSpotter --module_name model
[Windows] python %ELL_ROOT%\tools\wrap\ --model_file GRUKeywordSpotter.ell --outdir KeywordSpotter --module_name model

then compile the resulting KeywordSpotter project using your new makeit command:

pushd KeywordSpotter

Testing the Model

So you can now take the new compiled keyword spotter for a spin and see how it works. You can measure the accuracy of the ELL model using the testing list. The script will do that:

python --classifier KeywordSpotter/model --featurizer compiled_featurizer/mfcc --sample_rate 16000 --list_file audio/testing_list.txt --categories categories.txt --reset --auto_scale

This is going back to the raw .wav file input and refeaturizing each .wav file using the compiled featurizer. This is similar to what you will do on your target device while processing microphone input. As a result this test pass will take a little longer (about 2 minutes). You will see every file scroll by telling you which one passed or failed with a running pass rate. The last page of output should look something like this:

PASSED 88.37%: zero
FAILED: go, expecting zero, path=zero/d0faf7e4_nohash_4.wav
FAILED: go, expecting zero, path=zero/d0faf7e4_nohash_5.wav
PASSED 88.35%: zero
PASSED 88.35%: zero
PASSED 88.38%: zero
PASSED 88.38%: zero
PASSED 88.39%: zero
PASSED 88.39%: zero
PASSED 88.40%: zero
PASSED 88.40%: zero
PASSED 88.41%: zero
PASSED 88.41%: zero
Test completed in 135.90 seconds
6043 passed, 792 failed, pass rate of 88.41 %
Best prediction time was 0.0 seconds

The final pass rate printed here is 88.49% which is close to the pytorch test accuracy of 89.69%.

But how will this model perform on a continuous stream of audio from a microphone? You can try this out using the following tool:

[Linux] python $ELL_ROOT/tools/utilities/pythonlibs/audio/ --classifier KeywordSpotter\model --featurizer compiled_featurizer/mfcc --categories categories.txt --sample_rate 16000 --threshold 0.8 --auto_scale

[Windows] python %ELL_ROOT%\tools\utilities\pythonlibs\audio\ --classifier KeywordSpotter\model --featurizer compiled_featurizer/mfcc --categories categories.txt --sample_rate 16000 --threshold 0.8 --auto_scale

Speak some words from categories.txt slowly and clearly. You will find that the accuracy is not as good, it recoginizes the first word you speak, but nothing else. If you run the script without “–reset” argument then the test is run as one continuous steam of audio with no model reset between each .wav file. In this case you will see the test score drop to about 70%. So why is this? Well, remember the trainer has one row per wav recording and this helps the trainer know when to reset the GRU node hidden state. See the init_hidden method. But in live audio how do you know when one word stops and another starts? Sometimes spoken words blur together. How does ELL then know when to reset the GRU nodes hidden state? By default ELL does not reset the hidden state so the GRU state blurs together over time and gets confused especially if there is no clear silence between consecutive words.

So how can you fix this? Well, this is where Voice Activity Detection (VAD) can come in handy. ELL actually has a node called VoiceActivityDetectorNode that you can add to the model. The input is the same featurizer input that the classifier uses, and the output is an integer value 0 if there is no activity detected and 1 if there is activity. This output signal can then be piped into the GRU nodes as a “reset_trigger”. The GRU nodes will then reset themselves when they see that trigger change from 1 to 0 (the end of a word). To enable this you will need to edit the ELL GRUKeywordSpotter.ell file using the script:

python GRUKeywordSpotter.ell --sample_rate 16000 --window_size 512 --tau_up 1.5 --tau_down 0.09 --large_input 4 --gain_att 0.01 --threshold_up 3.5 --threshold_down 0.9 --level_threshold 0.02

This will edit the ELL model, remove the dummy reset triggers on the two GRU nodes and replace them with a VoiceActivityDetectorNode. Your new GRUKeywordSpotter.ell should now look like this:


Note: you can use the following tool to generate these graphs:

[Linux] $ELL_ROOT/build/bin/release/print -imap GRUKeywordSpotter.ell -fmt dot -of
[Windows] %ELL_ROOT%\build\bin\release\print -imap GRUKeywordSpotter.ell -fmt dgml -of graph.dgml

And you can view graph.dgml using Visual Studio. On Linux you can use the dot format which can be viewed using GraphViz.

You can now compile this new GRUKeywordSpotter.ell model using as before and try it out. You should see the test_ell_model accuracy increase back up from 70% to about 85%. The VoiceActivityDetector is not perfect on all the audio test samples, especially those with high background noise. The VoiceActivityDetector has many parameters that you can see in These parameters can be tuned for your particular device to get the best result. You can use the <ELL_ROOT>/tools/utilities/pythonlibs/audio/ tool to help with that.

You can also use the script again and see how it behaves when you speak the 30 different keywords listed in categories.txt. You should notice that it works better now because of the VoiceActivityDetection whereas previously you had to click “stop” and “record” to reset the model. Now it resets automatically and is able to recognize the next keyword after a small silence. You still cannot speak the keywords too quickly, so this solution is not perfect. Understanding full conversation speech is a different kind of problem that requires bigger models and audio datasets that include whole phrases.


The script has a number of other options you can play with including number of epochs, batch_size, learning_rate, and the number of hidden_units to use in the GRU layers. Note that training also has an element of randomness to it, so you will never see the exact same numbers even when you retrain with the exact same parameters. This is due to the Stochastic Gradient Descent algorithm that is used by the trainer.

The neural network you just trained is described by the KeywordSpotter class in the script. You can see the __init__ method of the GRUKeywordSpotter class creating two GRU nodes and a Linear layer which are used by the forward method as follows:

    def forward(self, input):
        # input has shape: [seq,batch,feature]
        gru_out, self.hidden1 = self.gru1(input, self.hidden1)
        gru_out, self.hidden2 = self.gru2(gru_out, self.hidden2)
        keyword_space = self.hidden2keyword(gru_out)
        result = F.log_softmax(keyword_space, dim=2)
        # return the mean across the sequence length to produce the
        # best prediction of which word it found in that sequence.
        # we can do that because we know each window_size sequence in
        # the training dataset contains at most one word and a single
        # word is not spread across multiple training rows
        result = result.mean(dim=0)
        return result

You can now experiment with different model architectures. For example, what happens if you add a third GRU layer? To do that you can make the following changes to the KeywordSpotter class:

  1. Add the construction of the 3rd GRU layer in the __init__ method:

     self.gru3 = nn.GRU(hidden_dim, num_keywords)
  2. Change the Linear layer to take a different number of inputs since the output of gru3 will now be size num_keywords instead of hidden_dim:

     self.hidden2keyword = nn.Linear(num_keywords, num_keywords)
  3. Add a third hidden state member in the init_hidden method:

     self.hidden3 = None
  4. Use this new GRU layer in the forward method right after the gru2.

     gru_out, self.hidden3 = self.gru3(gru_out, self.hidden3)
  5. That’s it! Pretty simple. Now re-run train_classifier as before. Your new model should get an evaluation accuracy that is similar, so the 3rd GRU layer didn’t really help much. But accuracy is not the only measure, you might also want to compare the performance of these two models to see which one is faster. It is often a speed versus accuracy trade off with neural networks which is why you see a speed versus accuracy graph on the ELL Model Gallery.

There are many other things you could try. So long as the trainer still shows a decreasing loss across epochs then it should train ok. You know you have a broken model if the loss never decreases.

As you can see, PyTorch makes it pretty easy to experiment. To learn more about PyTorch see their excellent tutorials.

Hyper-parameter Tuning

So far your training has used all the defaults provided by train_classifier, but an important step in training neural networks is tuning all the training parameters. These include learning_rate, batch_size, weight_decay, lr_schedulers and their associated lr_min and lr_peaks, and of course the number of epochs. It is good practice to do a set of training runs that test a range of different values independently testing all these hyper-parameters. You can probably get a full 1% increase in training accuracy by finding the optimal parameters.

Cleaning the Data

The speech commands dataset contains many bad training files, either total silence, or bad noise, clipped words and some completely mislabelled words. Obviously this will impact your training accuracy. So included in this tutorial is a bad_list.txt file. If you copy that to your speech commands folder (next to your testing_list.txt file) then re-run with the additional argument --bad_list bad_list.txt then it will create a cleaner test, validation and training list.

Re-run the featurization steps with and retrain your model and you should now see the test accuracy jump to about 94%. So this is a good illustration of the fact that higher quality labelled data can make a big difference in model performance.

Next steps

That’s it, you just trained your first keyword spotter using PyTorch and you compiled and tested it using ELL, congratulations! You can now use your new featurizer and classifier in your embedded application, perhaps on a Raspberry Pi, or even on an Azure IOT Dev Kit, as shown in the DevKitKeywordSpotter example.

There are also lots of things you can experiment with as listed above. You can also build your own training sets, or customize the speech commands set by simply modifying the the training_list.txt file, then recreate the training_list.npz dataset using, retrain the model and see how it does.

Another dimension you can play with is mixing noise with your clean .wav files. This can give you a much bigger dataset and a model that performs better in noisy locations. Of course, this depends on what kind of noise you care about. Crowds of people? Industrial machinery? Just the hum of computers and HVAC systems in an Office environment? It depends on your application. You can experiment using the noise files included in the speech_commands dataset. There are some more advanced options on the script to help with mixing noise in with the audio recordings.

The github speech commands gallery contains some other types of models, some based on LSTM nodes, for example. The GRU architecture does well on smaller sized models, but LSTM hits the highest score when it maximizes the hidden state size.


For any PyTorch issues, see

Exception: Please set your ELL_ROOT environment, as you will be using python scripts from there If you see this error then you need to following ELL setup instructions above and clone the ELL repo, build it, and set an environment variable that points to the location of that repo. This tutorial uses scripts in that repo, and those scripts use the binaries that are built there.