I love using k-fold cross-validation for my machine learning projects. But when I am dealing with neural network models that take hours or even days to train, a full k-fold analysis becomes an uncomfortably heavy computational burden. For models with such long training times, I usually abandon training an ensemble of models and just train one model with a single train/validation split.

I really wanted a way to get at least some of the diagnostic benefits of an ensemble of semi-independently trained models, the way you do in k-fold, but without needing to wait days or weeks for my neural nets to train. So I started experimenting with weakly coupled mixtures of models. Instead of feeding most of the data to each of K otherwise independent models, as in k-fold, why not feed just a fraction 1/K of the data to each model and let the models communicate about their parameters with each other in a controlled way? I thought that by carefully controlling what information is passed between which models, how often messages are passed, and how that information may be used, I could effectively isolate the information in some data folds from the parameter values of some of the models. In this way I could hopefully save some computation time over a k-fold cross-validation without sacrificing all of its benefits.
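To make the idea of weak coupling concrete, here is a minimal sketch under loud assumptions: the models are plain linear regressors rather than neural nets, the coupling topology is a ring, and the names `train_coupled_models`, `sync_every`, and `coupling` are hypothetical. It only illustrates "each model sees 1/K of the data and occasionally exchanges parameters"; it is not the scheme used in the post.

```python
import numpy as np

def train_coupled_models(X, y, k=4, steps=200, lr=0.1, sync_every=10, coupling=0.2, seed=0):
    """Train k linear models, each on its own 1/k of the data, with weak ring coupling."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)  # disjoint index sets, one per model
    weights = np.zeros((k, X.shape[1]))

    for step in range(1, steps + 1):
        # Local update: each model only sees gradients from its own fold.
        for i, idx in enumerate(folds):
            residual = X[idx] @ weights[i] - y[idx]
            weights[i] -= lr * X[idx].T @ residual / len(idx)

        # Occasional weak coupling: nudge each model toward the mean of its two ring neighbors.
        if step % sync_every == 0:
            neighbor_mean = 0.5 * (np.roll(weights, 1, axis=0) + np.roll(weights, -1, axis=0))
            weights += coupling * (neighbor_mean - weights)

    return weights, folds

# Synthetic data, purely for illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=400)
weights, folds = train_coupled_models(X, y)
print(weights)
```

Note that with a fully connected ring, information from every fold eventually diffuses into every model. Isolating some folds from some models would need a more restrictive choice of who talks to whom and how often, which this sketch does not attempt.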

Extracting Phonemes From Speech Samples

My best single model for the recent speech recognition Kaggle competition was based on the idea of extracting a probabilistic map of the phonemes present in a particular speech sample and then using that phoneme map as a feature set to predict the word.

The dataset provided consists of examples of 30 different words, with one word appearing in each 1-second sample. Since no phonetic information is provided other than which word is which, the first step was to turn each word into a phonetic spelling.
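As a rough illustration of that first step, the sketch below maps a handful of short words to ARPAbet-style phoneme lists and turns each word into a multi-hot "which phonemes are present" target. The word list, the spellings, and the names `PHONETIC_SPELLING` and `phoneme_target` are my own assumptions for illustration, not the mapping used for the competition.

```python
import numpy as np

# Hand-written ARPAbet-style spellings for a few example words.
# These are illustrative assumptions, not the post's actual mapping.
PHONETIC_SPELLING = {
    "yes": ["Y", "EH", "S"],
    "no": ["N", "OW"],
    "up": ["AH", "P"],
    "go": ["G", "OW"],
    "stop": ["S", "T", "AA", "P"],
}

# Fixed, sorted phoneme inventory so every word maps into the same vector space.
PHONEMES = sorted({p for spelling in PHONETIC_SPELLING.values() for p in spelling})
PHONEME_INDEX = {p: i for i, p in enumerate(PHONEMES)}

def phoneme_target(word):
    """Return a multi-hot vector marking which phonemes occur in `word`."""
    target = np.zeros(len(PHONEMES), dtype=np.float32)
    for p in PHONETIC_SPELLING[word]:
        target[PHONEME_INDEX[p]] = 1.0
    return target

print(PHONEMES)
print(phoneme_target("stop"))
```

A target like this only says which phonemes occur somewhere in the sample. A map over time, as described above, would additionally need some alignment between phonemes and positions in the audio.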

Wavelet Features For Speech Recognition.

I've been participating in the TensorFlow speech recognition challenge.

https://www.kaggle.com/c/tensorflow-speech-recognition-challenge

It seems like the most common approach is to begin by turning the audio into a spectrogram and then feed that into a 2D CNN. One trouble with spectrograms is that you have to trade off resolution in frequency against resolution in time. In principle you can get higher time resolution at higher frequencies than at lower ones, but once you pick a window length for your short-time Fourier transform, you cannot resolve anything much shorter than that window at any frequency.
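The trade-off is easy to see with a little arithmetic. The sketch below assumes a 16 kHz sample rate (an assumption for illustration) and uses a hypothetical helper, `stft_resolution`, to report the frequency bin spacing and the window duration for a few window lengths.

```python
import numpy as np

SAMPLE_RATE = 16_000  # Hz; assumed for illustration

def stft_resolution(window_length, sample_rate=SAMPLE_RATE):
    """Return (frequency bin spacing in Hz, window duration in ms) for one STFT window."""
    freqs = np.fft.rfftfreq(window_length, d=1.0 / sample_rate)
    bin_spacing_hz = freqs[1] - freqs[0]
    window_ms = 1000.0 * window_length / sample_rate
    return bin_spacing_hz, window_ms

for n in (128, 512, 2048):
    df, dt = stft_resolution(n)
    print(f"window={n:5d} samples  bin spacing={df:7.2f} Hz  window duration={dt:6.1f} ms")
```

Making the window longer narrows the frequency bins and stretches the time window by exactly the same factor, and that single choice applies to every frequency in the spectrogram.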

Wavelets are one possible way around this limitation.
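As a hedged sketch of why wavelets help, the code below builds Morlet-style wavelets whose duration shrinks as the center frequency grows, so high frequencies get fine time resolution and low frequencies get fine frequency resolution. The choice of a Morlet-style wavelet, the `n_cycles` parameter, and both function names are assumptions for illustration, not necessarily the features used in the post.

```python
import numpy as np

def morlet_wavelet(freq_hz, sample_rate, n_cycles=6.0):
    """Complex Morlet-style wavelet: a sinusoid at freq_hz under a Gaussian envelope."""
    sigma_t = n_cycles / (2.0 * np.pi * freq_hz)          # envelope width in seconds
    half_len = int(np.ceil(4.0 * sigma_t * sample_rate))  # truncate at +/- 4 sigma
    t = np.arange(-half_len, half_len + 1) / sample_rate
    wavelet = np.exp(2j * np.pi * freq_hz * t) * np.exp(-t**2 / (2.0 * sigma_t**2))
    return wavelet / np.linalg.norm(wavelet)

def wavelet_features(signal, freqs_hz, sample_rate):
    """Response magnitude per wavelet, shape (len(freqs_hz), len(signal)).

    Assumes the signal is longer than the longest wavelet.
    """
    return np.stack([
        np.abs(np.convolve(signal, morlet_wavelet(f, sample_rate), mode="same"))
        for f in freqs_hz
    ])

sample_rate = 16_000  # Hz; assumed for illustration
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2.0 * np.pi * 1000.0 * t)  # 1 second test tone at 1 kHz
features = wavelet_features(tone, [250.0, 1000.0, 4000.0], sample_rate)
print(features.mean(axis=1))
print(len(morlet_wavelet(250.0, sample_rate)), len(morlet_wavelet(4000.0, sample_rate)))
```

Because each wavelet spans a fixed number of cycles rather than a fixed number of samples, the analysis window adapts to the frequency instead of being set once for the whole spectrogram.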

Asymptotic Labs (Posts about neural networks)

Deep Learning 101

SLC Python Meetup: June 2, 2021

Visualizing Convolution Kernels

Parameter Diffusion

Extracting Phonemes

3D CNN for audio data

3D Time/Frequency/Phase Representation of Audio for Speech Recognition.

Wavelet Spectrograms for Speech Recognition
