Training an audio keyword spotter with PyTorch
by Chris Lovett
This tutorial will show you how to train a keyword spotter using PyTorch. A keyword spotter listens to an audio stream from a microphone and recognizes certain spoken keywords. Since it is always listening there is good reason to find a keyword spotter that can run on a very small low-power co-processor so the main computer can sleep until a word is recognized. The ELL compiler makes that possible. In this tutorial you will train a keyword spotter using the the speech commands dataset which contains 65,000 recordings of 30 different keywords (bed, bird, cat, dog, down, eight, five, four, go, happy, house, left, marvin, nine, no, off, on, one, right, seven, sheila, six, stop, three, tree, two, up, wow, yes, zero) each about one second long.
This is the dataset used to train the models in the speech commands model gallery. The Getting started with keyword spotting on the Raspberry Pi tutorial uses these models. Once you learn how to train a model you can then train your own custom model that responds to different keywords, or even random sounds, or perhaps you want just a subset of the 30 keywords for your application. This tutorial shows you how to do that.
Before you begin
Complete the following steps before starting the tutorial.
What you will need
- Laptop or desktop computer with at least 16 GB of RAM.
- Optional NVidia Graphics Card that supports CUDA.
The picture below illustrates the process you will follow in this tutorial. First you will convert the wav files into a big training dataset using a featurizer. This dataset is the input to the training process which outputs a trained keyword spotter. The keyword spotter can then be verified by testing.
Training a Neural Network is a computationally intensive task that takes millions or even billions of floating point operations. That is why you probably want to use CUDA accelerated training. Fortunately, audio models train pretty quickly. On an NVidia 1080 graphic card the 30 keyword speech_commands dataset trains in about 3 minutes using PyTorch. Without CUDA the same training takes over 2.5 hours on an Intel Core i7 CPU.
After you have installed and built the ELL compiler, you also need to set an environment variable named ELL_ROOT that points to the location of your ELL git repo, for example:
[Linux] export ELL_ROOT="~/git/ell" [Windows] set ELL_ROOT=d:\git\ell
Subsequent scripts depend on this path being set correctly. Now make a new working folder and copy the scripts from this location:
mkdir tutorial cd tutorial [Linux] cp $ELL_ROOT/tools/utilities/pythonlibs/audio/training [Windows] copy %ELL_ROOT%\tools\utilities\pythonlibs\audio\training
You may want to create a new Conda environment for PyTorch training, or you can add PyTorch to your existing one. You can create a new environment easily with this command:
conda create -n torch python=3.6
[Linux] source activate torch [Windows] activate torch
Then do the following:
[Linux, macOS] conda install pytorch cuda90 torchvision -c pytorch [Linux, macOS] pip install pyaudio [Windows] conda install pytorch cuda90 -c pytorch [Windows] pip install torchvision [Windows] conda install -c defaults intel-openmp -f [Windows] pip install pyaudio
Downloading the Training Data
Next, you will need to download the training data set:
- Speech Commands Dataset (1.4 gigabytes)
Google crowd sourced the creation of these recordings so you get a nice variety of voices. Google released it under the Creative Commons BY 4.0 license.
Go ahead and download that file and move it into a folder named
audio then unpack it using this Linux command:
tar xvf speech_commands_v0.01.tar.gz
On Windows you can use the Windows Subsystem for Linux to do the same. Alternatively, you can install 7-zip. 7-zip will install a new menu item so you can right click the
speech_commands_v0.01.tar.gz and select “Extract here”.
The total disk space required for the uncompressed files is about 2 GB. When complete your
audio folder should contain 30 folders plus one named background_noise. You should also see the following additional files:
- validation_list.txt - the list of files that make up the validation set
- testing_list.txt - the list of files in the testing set
Lastly, you will need to create the training_list.txt file containing all the wav files (minus the validation and test sets) which you can do with this command:
python make_training_list.py --wav_files audio --max_files_per_directory 1600 copy audio\categories.txt .
audio is the path to your unpacked speech command wav files. This will also create a categories.txt file in the same folder. This file lists the
names of the keywords (directories) found in the audio folder. Copy that file to your working tutorial folder.
Note that the command line above includes the option
--max_files_per_directory 1600. This options limits the training list to a maximum of 1600 files per subdirectory and will result in a training dataset of around 1.2 GB (using the make_dataset command line options shown below). Feel free to try other numbers here or remove the limit entirely to use every available file for training. You will notice there are not exactly the same number of training files in each subdirectory, but that is ok. Without any limits the full speech_commands training dataset file will be about 1.6 GB and the make_training_list.py script may use up to 6 gb RAM to get the job done.
As you can see you can try different sized training datasets. When you are ready to experiment, figure out which gives the best results, all the training files, or a subset.
Create a Featurizer Model
This featurizer is created as an ELL model using the
python make_featurizer.py --sample_rate 8000 --window_size 256 --input_buffer_size 256 --filterbank_size 40 --log
You should see a message saying “Saving featurizer.ell” and if you print this using the following command line:
[Linux] %ELL_Root%/build/bin/print -imap featurizer.ell [Windows] %ELL_Root%\build\bin\release\print -imap featurizer.ell -fmt dgml -of graph.dgml
then you will see the following nodes:
You can now compile this model using the ELL model compiler to run on your PC using the familiar
[Windows] python %ELL_ROOT%\tools\wrap\wrap.py --model_file featurizer.ell --outdir compiled_featurizer --module_name mfcc [Linux] python $ELL_ROOT/tools/wrap/wrap.py --model_file featurizer.ell --outdir compiled_featurizer --module_name mfcc
Then you can build the compiled_featurizer folder using cmake as done in prior tutorials.
You will do this a lot so it might be handy to create a little batch or shell script called
makeit that contains the following:
mkdir build cd build cmake -G "Visual Studio 15 2017 Win64" .. cmake --build . --config Release cd ..
#!/bin/bash mkdir build cd build cmake .. make cd ..
So compiling a wrapped ELL model is now this simple:
pushd compiled_featurizer makeit popd
Create the Dataset using the Featurizer
Now you have a compiled featurizer, so you can preprocess all the audio files using this featurizer and create a compressed numpy dataset with the result. This large dataset will contain one row per audio file, where each row contains all the featurizer output for that file. The featurizer output is smaller than the raw audio, but it will still end up being a pretty big file, (about 1.2 GB). Of course it depends how many files you include in the set. Remember for best training results the more files the better, so you will use the training_list.txt you created earlier which selected 1600 files per keyword. You need three datasets created from each of the list files in your audio folder using
make_dataset as follows:
python make_dataset.py --list_file audio/training_list.txt --featurizer compiled_featurizer/mfcc --sample_rate 8000 python make_dataset.py --list_file audio/validation_list.txt --featurizer compiled_featurizer/mfcc --sample_rate 8000 python make_dataset.py --list_file audio/testing_list.txt --featurizer compiled_featurizer/mfcc --sample_rate 8000
Where the audio folder contains your unpacked .wav files. If your audio files are in a different location then simply provide the full path to it in the above commands.
The reason for the
--sample_rate 8000 argument is that small low powered target devices might not be able to record and process audio at very high rates.
So while your host PC can probably do 96kHz audio and higher just fine, this tutorial shows you how to down sample the audio to something that will run on a tiny target device. The main point being that you will get the best results if you train the model on audio that is sampled at the same rate that your target device will be recording.
Creating the datasets will take a while, about 10 minutes or more, so now is a great time to grab a cup of tea. It will produce three files in your working folder named training.npz, validation.npz and testing.npz which you will use below.
You will notice in the output that the background_noise folder has been mapped to the special category named
<null>. Any folder starting with an underscore is mapped this way and it is a handy way to provide “negative samples” that you want your model to learn to ignore. Negative examples are important if you want a model that doesn’t produce too many false positives.
Train the Keyword Spotter
You can now finally train the keyword spotter using the
python train_classifier.py --architecture GRU --audio .
This script will use PyTorch to train a GRU based model using the datasets you created earlier then it will export an onnx model from that. The file will be named KeywordSpotter.onnx and if all goes well you should see console output like this:
Loading d:\datasets\Audio\Kaggle\train\audio\training_list.npz... Loading d:\datasets\Audio\Kaggle\train\audio\validation_list.npz... Training keyword spotter using 46953 rows of featurized training input... Epoch 0, Loss 1.5005683898925781, Validation Accuracy 65.404 Epoch 1, Loss 1.191438913345337, Validation Accuracy 76.931 Epoch 2, Loss 1.031152606010437, Validation Accuracy 81.928 Epoch 3, Loss 0.8680925369262695, Validation Accuracy 84.891 Epoch 4, Loss 0.8902071118354797, Validation Accuracy 85.996 Epoch 5, Loss 0.7643579244613647, Validation Accuracy 87.294 Epoch 6, Loss 0.7202472686767578, Validation Accuracy 87.058 Epoch 7, Loss 0.7902640104293823, Validation Accuracy 87.618 Epoch 8, Loss 0.7076990008354187, Validation Accuracy 88.163 Epoch 9, Loss 0.7693088054656982, Validation Accuracy 87.765 Epoch 10, Loss 0.6854181289672852, Validation Accuracy 88.561 Epoch 11, Loss 0.805185854434967, Validation Accuracy 88.414 Epoch 12, Loss 0.5628467202186584, Validation Accuracy 89.121 Epoch 13, Loss 0.7616006135940552, Validation Accuracy 89.239 Epoch 14, Loss 0.6700156331062317, Validation Accuracy 89.107 Epoch 15, Loss 0.5844067931175232, Validation Accuracy 89.180 Epoch 16, Loss 0.583706259727478, Validation Accuracy 89.166 Epoch 17, Loss 0.6179401278495789, Validation Accuracy 89.225 Epoch 18, Loss 0.6164156794548035, Validation Accuracy 88.856 Epoch 19, Loss 0.5983876585960388, Validation Accuracy 89.225 Epoch 20, Loss 0.6414792537689209, Validation Accuracy 89.328 Epoch 21, Loss 0.600459635257721, Validation Accuracy 89.269 Epoch 22, Loss 0.5980986952781677, Validation Accuracy 89.460 Epoch 23, Loss 0.5426868200302124, Validation Accuracy 89.372 Epoch 24, Loss 0.552982747554779, Validation Accuracy 89.682 Epoch 25, Loss 0.5772491097450256, Validation Accuracy 89.446 Epoch 26, Loss 0.5909024477005005, Validation Accuracy 89.298 Epoch 27, Loss 0.5863549709320068, Validation Accuracy 89.814 Epoch 28, Loss 0.605217456817627, Validation Accuracy 89.505 Epoch 29, Loss 0.5369739532470703, Validation Accuracy 89.785 Trained in 380.31 seconds, saved model 'KeywordSpotter.pt' Training accuracy = 98.849 % Loading d:\datasets\Audio\Kaggle\train\audio\testing_list.npz... Testing accuracy = 89.696 % saving onnx file: KeywordSpotter.onnx
So here you see the model has trained pretty well and is getting an evaluation score of 89.696% using the testing_list.npz dataset. The testing_list contains files that the training_list never saw before so it is expected that the test score (89.7%) will always be lower than the final training accuracy (98.8%). The real trick is increasing that test score. This problem has many data scientists employed around the world!
Importing the ONNX Model
In order to try your new model using ELL, you first need to import it from ONNX into the ELL format as follows:
[Linux] python $ELL_ROOT/tools/importers/onnx/onnx_import.py GRUKeywordSpotter.onnx [Windows] python %ELL_ROOT%\tools\importers\onnx\onnx_import.py GRUKeywordSpotter.onnx
This will generate an ELL model named GRUKeywordSpotter.ell which you can now compile using similar technique you used on the featurizer:
[Linux] python $ELL_ROOT%/tools/wrap/wrap.py --model_file GRUKeywordSpotter.ell --outdir KeywordSpotter --module_name model [Windows] python %ELL_ROOT%\tools\wrap\wrap.py --model_file GRUKeywordSpotter.ell --outdir KeywordSpotter --module_name model
then compile the resulting KeywordSpotter project using your new
pushd KeywordSpotter makeit popd
Testing the Model
So you can now take the new compiled keyword spotter for a spin and see how it works. You can measure the accuracy of the ELL model using the testing list. The
eval_test.py script will do that:
python eval_test.py --classifier KeywordSpotter/model --featurizer compiled_featurizer/mfcc --sample_rate 8000 --list_file audio/testing_list.txt --categories categories.txt --reset
This is going back to the raw .wav file input and refeaturizing each .wav file using the compiled featurizer. This is similar to what you will do on your target device while processing microphone input. As a result this test pass will take a little longer (about 2 minutes). You will see every file scroll by telling you which one passed or failed with a running pass rate. The last page of output should look something like this:
... PASSED 88.37%: zero FAILED: go, expecting zero, path=zero/d0faf7e4_nohash_4.wav FAILED: go, expecting zero, path=zero/d0faf7e4_nohash_5.wav PASSED 88.35%: zero PASSED 88.35%: zero PASSED 88.38%: zero PASSED 88.38%: zero PASSED 88.39%: zero PASSED 88.39%: zero PASSED 88.40%: zero PASSED 88.40%: zero PASSED 88.41%: zero PASSED 88.41%: zero Test completed in 135.90 seconds 6043 passed, 792 failed, pass rate of 88.41 % Best prediction time was 0.0 seconds
The final pass rate printed here is 88.49% which is close to the pytorch test accuracy of 89.69%.
But how will this model perform on a continuous stream of audio from a microphone? You can try this out using the following tool:
[Linux] python $ELL_ROOT/tools/utilities/pythonlibs/audio/view_audio.py --classifier KeywordSpotter\model --featurizer compiled_featurizer/mfcc --categories categories.txt --sample_rate 8000 --threshold 0.8 [Windows] python %ELL_ROOT%\tools\utilities\pythonlibs\audio\view_audio.py --classifier KeywordSpotter\model --featurizer compiled_featurizer/mfcc --categories categories.txt --sample_rate 8000 --threshold 0.8
Speak some words from
categories.txt slowly and clearly. You will find that the accuracy is not as good, it recoginizes the first word you speak, but nothing else.
If you run the
eval_test.py script without “–reset” argument then the test is run as one continuous steam of audio with no model reset between each .wav file.
In this case you will see the test score drop to about 70%. So why is this?
Well, remember the trainer has one row per wav recording and this helps the trainer know when to reset the GRU node hidden state.
See the init_hidden method.
But in live audio how do you know when one word stops and another starts? Sometimes spoken words blur together.
How does ELL then know when to reset the GRU nodes hidden state?
By default ELL does not reset the hidden state so the GRU state blurs together over time and gets confused especially if there is no clear silence between consecutive words.
So how can you fix this? Well, this is where Voice Activity Detection (VAD) can come in handy. ELL actually has a node called VoiceActivityDetectorNode that you can add to the model. The input is the same featurizer input that the classifier uses, and the output is an integer value 0 if there is no activity detected and 1 if there is activity. This output signal can then be piped into the GRU nodes as a “reset_trigger”. The GRU nodes will then reset themselves when they see that trigger change from 1 to 0 (the end of a word). To enable this you will need to edit the ELL GRUKeywordSpotter.ell file using the
python add_vad.py GRUKeywordSpotter.ell --sample_rate 8000 --window_size 256 --tau_up 1.5 --tau_down 0.09 --large_input 4 --gain_att 0.01 --threshold_up 3.5 --threshold_down 0.9 --level_threshold 0.02
This will edit the ELL model, remove the dummy reset triggers on the two GRU nodes and replace them with a VoiceActivityDetectorNode. Your new GRUKeywordSpotter.ell should now look like this:
Note: you can use the following tool to generate these graphs:
[Linux] %ELL_ROOT%\build\bin\release\print -imap GRUKeywordSpotter.ell -fmt dot -of graph.dot [Windows] %ELL_ROOT%\build\bin\release\print -imap GRUKeywordSpotter.ell -fmt dgml -of graph.dgml
And you can view graph.dgml using Visual Studio. On Linux you can use the
dot format which can be viewed using GraphViz.
You can now compile this new GRUKeywordSpotter.ell model using
wrap.py as before and try it out. You should see the
eval_test accuracy increase back up from 70% to about 85%. The VoiceActivityDetector is not perfect on all the audio test samples, especially those with high background noise. The VoiceActivityDetector has many parameters that you can see in
add_vad.py. These parameters can be tuned for your particular device to get the best result. You can use the
<ELL_ROOT>/tools/utilities/pythonlibs/audio/vad_test.py tool to help with that.
You can also use the
view_audio.py script again and see how it behaves when you speak the 30 different keywords listed in categories.txt. You should notice that it works better now because of the VoiceActivityDetection whereas previously you had to click “stop” and “record” to reset the model. Now it resets automatically and is able to recognize the next keyword after a small silence. You still cannot speak the keywords too quickly, so this solution is not perfect. Understanding full conversation speech is a different kind of problem that requires bigger models and audio datasets that include whole phrases.
train_classifier.py script has a number of other options you can play with including number of epochs, batch_size, learning_rate, and the number of hidden_units to use in the GRU layers. Note that training also has an element of randomness to it, so you will never see the exact same numbers even when you retrain with the exact same parameters. This is due to the Stochastic Gradient Descent algorithm that is used by the trainer.
The neural network you just trained is described by the KeywordSpotter class in the
train_classifier.py script. You can see the __init__ method of the GRUKeywordSpotter class creating two GRU nodes and a Linear layer which are used by the forward method as follows:
def forward(self, input): # input has shape: [seq,batch,feature] gru_out, self.hidden1 = self.gru1(input, self.hidden1) gru_out, self.hidden2 = self.gru2(gru_out, self.hidden2) keyword_space = self.hidden2keyword(gru_out) result = F.log_softmax(keyword_space, dim=2) # return the mean across the sequence length to produce the # best prediction of which word it found in that sequence. # we can do that because we know each window_size sequence in # the training dataset contains at most one word and a single # word is not spread across multiple training rows result = result.mean(dim=0) return result
You can now experiment with different model architectures. For example, what happens if you add a third GRU layer? To do that you can make the following changes to the KeywordSpotter class:
Add the construction of the 3rd GRU layer in the __init__ method:
self.gru3 = nn.GRU(hidden_dim, num_keywords)
Change the Linear layer to take a different number of inputs since the output of gru3 will now be size num_keywords instead of hidden_dim:
self.hidden2keyword = nn.Linear(num_keywords, num_keywords)
Add a third hidden state member in the init_hidden method:
self.hidden3 = None
Use this new GRU layer in the forward method right after the gru2.
gru_out, self.hidden3 = self.gru3(gru_out, self.hidden3)
That’s it! Pretty simple. Now re-run
train_classifieras before. Your new model should get an evaluation accuracy that is similar, so the 3rd GRU layer didn’t really help much. But accuracy is not the only measure, you might also want to compare the performance of these two models to see which one is faster. It is often a speed versus accuracy trade off with neural networks which is why you see a speed versus accuracy graph on the ELL Model Gallery.
There are many other things you could try. So long as the trainer still shows a decreasing loss across epochs then it should train ok. You know you have a broken model if the loss never decreases.
As you can see, PyTorch makes it pretty easy to experiment. To learn more about PyTorch see their excelent tutorials.
That’s it, you just trained your first keyword spotter using PyTorch and you compiled and tested it using ELL, congratulations! You can now use your new featurizer and classifier in your embedded application, perhaps on a Raspberry Pi, or even on an Azure IOT Dev Kit.
There are also lots of things you can experiment with as listed above. You can also build your own training sets, or customize the speech commands set by simply modifying the the training_list.txt file, then recreate the training_list.npz dataset using
make_dataset.py, retrain the model and see how it does.
Another dimension you can play with is mixing noise with your clean .wav files. This can give you a much bigger dataset and a model that performs better in noisy locations. Of course, this depends on what kind of noise you care about. Crowds of people? Industrial machinery? Just the hum of computers and HVAC systems in an Office environment? It depends on your application. You can experiment using the noise files included in the speech_commands dataset.
The github speech commands gallery contains some other types of models, some based on LSTM nodes, for example. The GRU architecture does well on smaller sized models, but LSTM hits the highest score when it maximizes the hidden state size. You can also see on the gallery that the 8kHz audio is faster, but it loses accuracy.
For any PyTorch issues, see https://pytorch.org/.
Exception: Please set your ELL_ROOT environment, as you will be using python scripts from there
If you see this error then you need to following ELL setup instructions above and clone the ELL repo, build it, and set an environment variable that points to the location of that repo. This tutorial uses scripts in that repo, and those scripts use the binaries that are built there.