<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="../assets/xml/rss.xsl" media="all"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Asymptotic Labs (Posts about speech recognition)</title><link>http://asymptoticlabs.com/</link><description></description><atom:link href="http://asymptoticlabs.com/categories/speech-recognition.xml" rel="self" type="application/rss+xml"></atom:link><language>en</language><copyright>Contents © 2022 &lt;a href="mailto:quidditymaster@gmail.com"&gt;Tim Anderton&lt;/a&gt; </copyright><lastBuildDate>Wed, 31 Aug 2022 21:28:18 GMT</lastBuildDate><generator>Nikola (getnikola.com)</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>3D CNN for audio data</title><link>http://asymptoticlabs.com/posts/audio-3DCNN.html</link><dc:creator>Tim Anderton</dc:creator><description>&lt;div class="cell border-box-sizing text_cell rendered"&gt;&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;&lt;div class="inner_cell"&gt;
&lt;div class="text_cell_render border-box-sizing rendered_html"&gt;
&lt;h2 id="3D-Time/Frequency/Phase-Representation-of-Audio-for-Speech-Recognition."&gt;3D Time/Frequency/Phase Representation of Audio for Speech Recognition.&lt;a class="anchor-link" href="http://asymptoticlabs.com/posts/audio-3DCNN.html#3D-Time/Frequency/Phase-Representation-of-Audio-for-Speech-Recognition."&gt;¶&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;I recently participated in a &lt;a href="https://www.kaggle.com/c/tensorflow-speech-recognition-challenge"&gt;speech recognition&lt;/a&gt; Kaggle competition. Although I didn't come close to the top of the leaderboard (238th place with 87% accuracy vs 91% accuracy for the winners), I learned quite a bit about handling audio data and had a lot of fun. One of the more novel things I tried during the competition was to spatially encode the phase information in the audio and pass the results into a 3D CNN.&lt;/p&gt;
&lt;p&gt;A common pre-processing step in speech recognition is to turn the 1D audio into a 2D &lt;a href="https://en.wikipedia.org/wiki/Spectrogram"&gt;spectrogram&lt;/a&gt;. The spectrogram shows the volume of the audio as a function of both time and frequency. Spectrograms are a great way of summarizing the important information in an audio clip in a way that makes it accessible visually. Here is a spectrogram of an utterance of the word "marvin".&lt;/p&gt;
&lt;p&gt;&lt;img src="http://asymptoticlabs.com/images/marvin_specgram.jpg" alt="marvin_specgram"&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://asymptoticlabs.com/posts/audio-3DCNN.html"&gt;Read more…&lt;/a&gt; (22 min remaining to read)&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;
</description><category>audio</category><category>neural networks</category><category>speech recognition</category><guid>http://asymptoticlabs.com/posts/audio-3DCNN.html</guid><pubDate>Mon, 22 Jan 2018 07:00:00 GMT</pubDate></item><item><title>Wavelet Spectrograms for Speech Recognition</title><link>http://asymptoticlabs.com/posts/waveletSpectrogramsTFSR.html</link><dc:creator>Tim Anderton</dc:creator><description>&lt;div tabindex="-1" id="notebook" class="border-box-sizing"&gt;
    &lt;div class="container" id="notebook-container"&gt;

&lt;div class="cell border-box-sizing text_cell rendered"&gt;&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="text_cell_render border-box-sizing rendered_html"&gt;
&lt;h2 id="Wavelet-Features-For-Speech-Recognition."&gt;Wavelet Features For Speech Recognition.&lt;a class="anchor-link" href="http://asymptoticlabs.com/posts/waveletSpectrogramsTFSR.html#Wavelet-Features-For-Speech-Recognition."&gt;¶&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;I've been participating in the TensorFlow speech recognition challenge.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.kaggle.com/c/tensorflow-speech-recognition-challenge"&gt;https://www.kaggle.com/c/tensorflow-speech-recognition-challenge&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;It seems like the most common approach is to begin by turning the audio into a spectrogram and then feeding that into a 2D CNN. One trouble with spectrograms is that you have to trade off resolution in frequency for resolution in time and vice versa. In principle you can get higher resolution in time for higher frequencies than you can for lower frequencies, but once you pick a window length for your short-time Fourier transform you lose temporal resolution on scales much shorter than that window.&lt;/p&gt;
&lt;p&gt;Wavelets are one possible way around this limitation.&lt;/p&gt;&lt;p&gt;&lt;a href="http://asymptoticlabs.com/posts/waveletSpectrogramsTFSR.html"&gt;Read more…&lt;/a&gt; (17 min remaining to read)&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;
</description><category>audio</category><category>kaggle</category><category>machine learning</category><category>neural networks</category><category>speech recognition</category><category>wavelets</category><guid>http://asymptoticlabs.com/posts/waveletSpectrogramsTFSR.html</guid><pubDate>Mon, 08 Jan 2018 07:00:00 GMT</pubDate></item></channel></rss>