Asymptotic Labs (Posts about audio)

Audio Yo

Tim Anderton — Wed, 02 Oct 2019 02:10:57 GMT

This is the notebook for the presentation that I will be giving at the SLC Python meetup tomorrow. This slide, and others like it containing the rough planned text of the talk, are "skip" slides and should (hopefully) get removed when I compile the notebook into a slide format. (Bet you didn't know nbconvert could do that!)

To make a notebook into a slide show you will also need to annotate the cells of the notebook to say whether each one is a new slide, a continuation of the previous slide (slide type indicated by a "-"), a sub-slide (optional material for a slide, which can be navigated with up/down or skipped by going left/right), a fragment (which is on the previous slide but appears when you next hit right), or is to be skipped altogether. This information can be added by changing the cell toolbar type to Slideshow (View->Cell Toolbar->Slideshow).

You can then launch the slideshow using an nbconvert command like the following:

jupyter nbconvert slideshow.ipynb --to slides --post serve
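The slide type annotations described above end up in each cell's metadata under a `slideshow` key, so they can also be set without the toolbar. Here is a minimal sketch that builds a tiny notebook structure as plain JSON; the helper name `make_cell` and the cell contents are made up for illustration, and a real notebook would carry more metadata than this.

```python
import json

def make_cell(source, slide_type):
    """Build a markdown cell tagged with a slideshow slide type.

    slide_type is one of "slide", "subslide", "fragment", "skip",
    "notes", or "-" (continue the previous slide).
    """
    return {
        "cell_type": "markdown",
        "metadata": {"slideshow": {"slide_type": slide_type}},
        "source": source,
    }

# Hypothetical notebook content, just to show the structure.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        make_cell("Rough planned text of the talk", "skip"),
        make_cell("# Title slide", "slide"),
        make_cell("More text on the same slide", "-"),
        make_cell("Appears on the next key press", "fragment"),
        make_cell("Optional detail", "subslide"),
    ],
}

# Serialize to the JSON text that would live in an .ipynb file.
notebook_json = json.dumps(notebook, indent=1)
slide_types = [c["metadata"]["slideshow"]["slide_type"] for c in notebook["cells"]]
```

Writing `notebook_json` to a file with an `.ipynb` extension should give something the nbconvert command above can turn into slides, with the "skip" cell left out.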

Anyway the style of this post will be slightly different than normal and you will just have to imagine me waving my hands around frantically and waiting for dramatic ... pauses.

Eigen-Techno

Tim Anderton — Mon, 26 Mar 2018 14:24:13 GMT

I recently found an analysis of techno music using principal component analysis (PCA).

https://www.math.uci.edu/~isik/posts/Eigentechno.html

The author, Umut Isik, extracted more than 90,000 clips of one bar's worth of music from 10,000 different techno tracks and stretched them so that they all have the same tempo of 128 bpm. Isik then fed those clips through PCA and analyzed the resulting principal vectors and the quality of the approximation to the music as the number of components grows. Since the principal vectors are modeling sound, we can listen to them, which is rather fun. I won't cover all the same material as Isik did in that post, and it is worth a read, so nip over and give it a look if you haven't already. Isik has very kindly bundled up the source data and made it available for others to use (there's a link at the end of the post linked above). Playing around with such a fun data set is too tempting to resist, so I'm going to do my own "eigentechno" analysis here.
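To make the idea of approximating clips with a growing number of components concrete, here is a minimal sketch of PCA via the singular value decomposition. It runs on small synthetic data, not on Isik's techno clips; the array sizes, the noise level, and the names `clips` and `reconstruct` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the audio clips: each row is one "clip" built
# from 3 hidden basis patterns plus a little noise.
n_clips, n_samples, n_true = 200, 64, 3
basis = rng.normal(size=(n_true, n_samples))
weights = rng.normal(size=(n_clips, n_true))
clips = weights @ basis + 0.01 * rng.normal(size=(n_clips, n_samples))

# PCA: center the data, then take the SVD. The rows of `components`
# are the principal vectors, ordered by decreasing singular value.
mean_clip = clips.mean(axis=0)
centered = clips - mean_clip
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

def reconstruct(k):
    """Approximate every clip using only the first k principal vectors."""
    coeffs = centered @ components[:k].T
    return mean_clip + coeffs @ components[:k]

# Mean squared reconstruction error for 1 to 5 components.
errors = [float(np.mean((clips - reconstruct(k)) ** 2)) for k in range(1, 6)]
```

Because the synthetic clips were built from three patterns, the error should drop sharply by the third component and then flatten out at roughly the noise level. With real audio the decay would be far more gradual.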

Extracting Phonemes

Tim Anderton — Wed, 24 Jan 2018 17:37:19 GMT

Extracting Phonemes From Speech Samples

My best single model for the recent speech recognition Kaggle competition was based on the idea of extracting a probabilistic map of the phonemes present in a particular speech sample and then using that phoneme map as a feature set to predict the word.

The dataset provided consists of examples of 30 different words, with one word appearing in each 1-second sample. Since there is no phonetic information provided other than which word is which, the first step was to turn each word into a phonetic spelling.
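As a rough illustration of that first step, the sketch below maps a few words to phonetic spellings and turns each spelling into a phoneme-presence vector. The word list, the ARPAbet-style symbols, and the names `PHONETIC_SPELLINGS` and `phoneme_targets` are assumptions for the example, not the spellings or phoneme set actually used for the competition model, and a simple present/absent vector is a much cruder target than a probabilistic phoneme map.

```python
# Hypothetical word -> phoneme lookup (ARPAbet-style symbols, stress dropped).
PHONETIC_SPELLINGS = {
    "yes": ["Y", "EH", "S"],
    "no": ["N", "OW"],
    "marvin": ["M", "AA", "R", "V", "IH", "N"],
}

# Fixed, sorted inventory of every phoneme that appears in the lookup.
PHONEMES = sorted({p for spelling in PHONETIC_SPELLINGS.values() for p in spelling})

def phoneme_targets(word):
    """Return a 0/1 vector marking which phonemes occur in the word."""
    spelling = PHONETIC_SPELLINGS[word]
    return [1.0 if phoneme in spelling else 0.0 for phoneme in PHONEMES]

targets_no = phoneme_targets("no")
targets_marvin = phoneme_targets("marvin")
```

A model trained against targets like these predicts phonemes first, and a second stage can then map the predicted phonemes back to one of the words.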

3D CNN for audio data

Tim Anderton — Mon, 22 Jan 2018 07:00:00 GMT

3D Time/Frequency/Phase Representation of Audio for Speech Recognition.

I recently participated in a speech recognition Kaggle competition. Although I didn't come close to the top of the leaderboard (238th place with 87% accuracy vs 91% accuracy for the winners), I learned quite a bit about handling audio data and had a lot of fun. One of the more novel things I tried during the competition was to spatially encode the phase information in the audio and pass the results into a 3D CNN.

A common pre-processing step in speech recognition is to turn the 1D audio into a 2D spectrogram. The spectrogram shows the volume of the audio as a function of both time and frequency. Spectrograms are a great way of summarizing the important information in an audio clip in a way that makes it accessible visually. Here is a spectrogram of an utterance of the word "marvin".
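For anyone who wants to compute a spectrogram themselves, here is a minimal sketch using a windowed FFT on a synthetic tone, not on the "marvin" recording. The 16 kHz sample rate, the frame length, the hop size, and the function name `spectrogram` are assumptions picked for the example.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Log-magnitude spectrogram, shape (n_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the 1D signal into overlapping, windowed frames.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # One FFT per frame: rows are time steps, columns are frequency bins.
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(magnitude + 1e-10)

sample_rate = 16000  # Hz, assumed
t = np.arange(sample_rate) / sample_rate  # one second of samples
tone = np.sin(2 * np.pi * 1000 * t)  # a pure 1 kHz test tone

spec = spectrogram(tone)
# Find the frequency bin with the most energy on average.
peak_bin = int(spec.mean(axis=0).argmax())
peak_hz = peak_bin * sample_rate / 256
```

The result is a 2D array that can be shown as an image with time on one axis and frequency on the other. For the test tone, all of the energy sits in a single horizontal band at 1 kHz.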

Wavelet Spectrograms for Speech Recognition

Tim Anderton — Mon, 08 Jan 2018 07:00:00 GMT

Wavelet Features For Speech Recognition.

I've been participating in the TensorFlow speech recognition challenge.

https://www.kaggle.com/c/tensorflow-speech-recognition-challenge

It seems like the most common approach is to begin by turning the audio into a spectrogram and then feeding that into a 2D CNN. One trouble with spectrograms is that you have to trade off resolution in frequency for resolution in time and vice versa. In principle you can get higher resolution in time for higher frequencies than you can for lower frequencies, but when you pick an input length for your short-time Fourier transform you lose temporal resolution on scales much shorter than the window length.
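The trade-off is easy to put numbers on. The sketch below works out the window duration and the spacing between frequency bins for a few window lengths; the 16 kHz sample rate and the name `stft_resolution` are assumptions for the example.

```python
def stft_resolution(window_len, sample_rate):
    """Return (window duration in ms, frequency bin spacing in Hz)."""
    window_ms = 1000.0 * window_len / sample_rate
    bin_hz = sample_rate / window_len
    return window_ms, bin_hz

sample_rate = 16000  # Hz, assumed

# Longer windows give finer frequency bins but blur events in time.
table = {n: stft_resolution(n, sample_rate) for n in (128, 256, 512, 1024)}
for window_len, (window_ms, bin_hz) in table.items():
    print(f"{window_len:5d} samples: {window_ms:5.1f} ms window, {bin_hz:6.2f} Hz bins")
```

Doubling the window length halves the bin spacing and doubles the window duration, so their product stays fixed. A single window length applies the same compromise to every frequency.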

Wavelets are one possible way around this limitation.