nsh - Speech Recognition With CMU Sphinx

Blog about speech technologies - recognition, synthesis, identification. Mostly it's about the scientific part of it, the core design of the engines, the new methods, machine learning, and about the technical part like the architecture of the recognizer and the design decisions behind it.

Do you trust speech transcription in the cloud

When information is already lost

In speech recognition we frequently deal with noisy or simply corrupted recordings. For example, in call center recordings you still get error rates like 50% or 60% even with the best algorithms. Someone calls while driving, others call from a noisy street. Some use really bad phones. And the question arises of how to improve the accuracy for such recordings. People try different complex algorithms and train models on GPUs for many weeks, while the answer is simple - the information in such recordings is simply lost and you cannot decode them accurately.

Data transfer and data storage are expensive, and VoIP providers often save every cent since they all operate on very low margins. That means they often use buggy codecs or bad transmission lines, and as a result you simply get unintelligible speech. On top of that everyone uses cell phones, so you have multiple encoding-decoding rounds: the signal from the microphone is encoded with AMR, then re-encoded into G.729, and finally converted into MP3 for storage. You have many codecs and often frame drops. As a result the sound quality is sometimes very bad, and recognition accuracy is zero.

The quality of speech is hard to measure but there are ways to do that. The easiest way requires controlled experiments where you send the data from one endpoint and measure distortion at the other endpoint (this is how PESQ works). There are also single-ended tools such as ITU-T P.563 that simply take an audio file and give you a sound quality score, which takes many parameters into account to estimate the quality of the speech. They are rough, but can still give you some idea of how noisy your audio is. And if it is really noisy, the way to improve it is not to apply better algorithms and put more research into speech recognition, but simply to go to the VoIP provider and demand better sound quality.

Given we have such a tool we might want to introduce the normalized word error rate which takes into account how good the recording is. So you really want to decode high quality recordings accurately and you probably do not care about bad quality recordings.
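One possible reading of such a normalized error rate, as a rough sketch: weight every recording's errors and reference length by a quality score between 0 and 1, so that bad recordings barely count. The function names and the weighting formula here are my own assumptions for illustration, not an established metric, and the quality score is assumed to come from some external single-ended tool.

```python
def word_errors(reference, hypothesis):
    """Word-level edit distance: substitutions + insertions + deletions."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def quality_weighted_wer(items):
    """items: iterable of (reference, hypothesis, quality), quality in [0, 1].

    Hypothetical definition: sum(q * errors) / sum(q * reference_words).
    """
    weighted_errors = 0.0
    weighted_words = 0.0
    for reference, hypothesis, quality in items:
        weighted_errors += quality * word_errors(reference, hypothesis)
        weighted_words += quality * len(reference.split())
    return weighted_errors / weighted_words if weighted_words else 0.0


recordings = [
    ("the cat sat", "the cat sat", 0.9),    # clean recording, decoded correctly
    ("call me back now", "tall be", 0.1),   # very noisy recording, mostly wrong
]
print(quality_weighted_wer(recordings))
```

With equal weights the two recordings above give a plain WER of 4/7; with the quality weights the noisy one contributes much less to the total.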

When accuracy matters, sound quality is really important. If possible, use your own VoIP stack that sends audio directly from the mobile microphone to the server. But when phone calls come into play, it is usually hopeless.

Learning with huge memory

Recently a set of papers was published about "memorization" in neural networks. For example:

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer


Understanding deep learning requires rethinking generalization

It seems that a large-memory system has a point. You don't need millions of computing cores in a CPU, which is too power-expensive; you could just go ahead with a very large memory and a reasonable number of cores that access the memory with hashing (think of Shazam or randlm, or G2P by analogy). You probably do not need heavy tying either.

Advantages are: you can quickly incorporate new knowledge, just put new values in memory, you can model corner cases since they are all still accessible, and, again, you are much more energy-efficient.
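To make the memory-with-hashing idea a bit more concrete, here is a minimal sketch of a fixed-size table addressed by a hash of the key. The class name, the table size and the choice of BLAKE2 as the hash are assumptions for illustration; real systems like randlm use more careful schemes, and in this sketch two keys that collide simply share a slot.

```python
import hashlib


class HashedMemory:
    """Fixed-size value table addressed by a hash of the key (no keys stored)."""

    def __init__(self, num_slots=1 << 20):
        self.num_slots = num_slots
        self.slots = [0.0] * num_slots

    def _index(self, key):
        # Stable 64-bit hash of the key, reduced to a slot number.
        digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.num_slots

    def store(self, key, value):
        # Incorporating new knowledge is a single memory write.
        self.slots[self._index(key)] = value

    def lookup(self, key):
        # A lookup is a single memory read; colliding keys return each other's value.
        return self.slots[self._index(key)]


memory = HashedMemory()
memory.store("speech recognition", -2.5)   # e.g. an n-gram log-probability
print(memory.lookup("speech recognition"))
```

The point of the design is that both update and query cost one hash and one memory access, no matter how much is stored.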

Maybe someday we will see mobile phones with 1 TB of memory.

Future of texts

It seems that people will soon lose the ability to read, comprehend and remember long texts. The question now is - is it possible to deliver very complex messages without texts?
The critical issue is to design a flow of information into the human brain which allows both scanning through extremely large amounts of data and deducing new meanings. Text/speech is indeed quite a slow channel for that; vision might be a reasonable one.

Visualization seems relevant if we want to keep human intelligence instead of replacing it with pure computer intelligence. Works like LargeVis

Visualizing Large-scale and High-dimensional Data
by Jian Tang, Jingzhou Liu, Ming Zhang, Qiaozhu Mei

are much more important then. See also the LargeVis project on GitHub.

The case against probabilistic models in metric spaces

A recent discussion on the Kaldi group about OOV words reminded me of this old problem.

One of the things that makes modern recognizers so unnatural is the probabilistic models behind them. It's a core design decision to build the recognizer in terms of class probabilities and to use models which are all probabilistic. Probabilistic models are easy to estimate, but they often do not fit reality.

In the most common situation, if you have two classes A and B and a garbage class G, a point from the garbage is classified as either A or B, and it is very hard to properly classify it as G. While the probability of the signal is easy to estimate from a database of examples, the probability of the garbage is very hard to estimate. You need a huge database of garbage examples or you will not be able to estimate the garbage properly. As a result, current systems cannot drop non-speech sounds and often create very misleading hypotheses. Bad things also happen in training: incorrectly labelled examples significantly disturb the probability estimates and the model has no means to detect them.

And in the long term the chase for probabilistic models is getting worse: everything is reduced to a probabilistic framework. People talk about graphical models, Gaussian processes, stick-breaking models and Monte Carlo sampling when they simply need to optimize the number of Gaussians in the mixture with a simple cost function. And they never tell you that you can simply train a mixture of 500 Gaussians and it will work equally well.

You might see the same issue in search engines: you cannot use "not" in a search. For example, you cannot search for a "restaurant not on the river bank". Though some companies try to implement such search, this effort is not widespread yet.

The situation changes slightly if we consider some real space of variants, for example a metric space. A much more reasonable decision might be made with geometrical models. You just look at the distance between the observation and the expectation and make a decision based on a certain threshold. Of course you need to train the threshold and the distance function, but this decision relies only on the observation and the distance, not on the probability of everything else. Yes, I'm talking about plain old SVMs.
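A toy sketch of the difference, under loud assumptions: two hypothetical class centers in a 2-D space, unit-variance Gaussians with equal priors for the probabilistic side, and a hand-picked threshold for the geometric side. This is not an SVM, just the simplest distance-threshold rule, but it shows how a far-away garbage point gets a confident posterior and is still rejected by the distance test.

```python
import math

# Hypothetical class centers in a 2-D feature space (illustrative values only).
CENTERS = {"A": (0.0, 0.0), "B": (4.0, 0.0)}


def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


def posterior_decision(point):
    """Bayes rule over A and B only: the posteriors always sum to 1."""
    log_scores = {label: -0.5 * distance(point, center) ** 2
                  for label, center in CENTERS.items()}
    top = max(log_scores.values())           # subtract the max to avoid underflow
    weights = {label: math.exp(score - top) for label, score in log_scores.items()}
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())


def distance_decision(point, threshold=3.0):
    """Nearest center, or garbage 'G' when even the nearest one is too far."""
    best = min(CENTERS, key=lambda label: distance(point, CENTERS[label]))
    return best if distance(point, CENTERS[best]) <= threshold else "G"


garbage_point = (1.0, 30.0)                  # far away from both classes
print(posterior_decision(garbage_point))     # confident "A" despite being garbage
print(distance_decision(garbage_point))      # rejected as "G"
```

The rejection needs no model of the garbage at all, only the threshold, which in practice would have to be trained just like the distance function.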

The metric is really the key here. In a generic space you indeed cannot invent anything more advanced than the simple Bayes rule. However, in the presence of a metric you might hope to get much more interesting results from using it, or at least from combining the metric decision with the probabilistic decision.

Unfortunately there is not much information about it on the net; almost all AI books start with probabilistic reasoning as the natural approach to intelligence. I found some research like this paper, but it is far from being complete. Any links to more complete research on the topic would be really appreciated.

IWSLT 2015

The IWSLT 2015 proceedings recently appeared. This is an important competition in ASR focused on translation of TED talks (and, more interesting for us, transcription).

The best system, from MITLL-AFRL, had a nice WER of 6.6%.

It is interesting that most of the winning systems (the same was true for the Cambridge system in the MGB challenge) used combinations of customized HTK + Torch and Kaldi. Kaldi alone does not get the best performance (11.4%), plain custom HTK is usually better with a WER of 10.0% (see Table 8). And the combination usually gives a ground-breaking result.

There is something interesting here.

Harmonic Noise Model in Speech Recognition

Recently I came across a nice demo about generation of natural sounds from physical models. This is a really exciting topic because while Hollywood can now draw almost anything, like in Star Wars, sound generation is a pretty limited and unexplored area. For example, really high quality speech still cannot be created by computers, no matter how powerful they are. This leads to the question of speech signal representation.

Accurate speech signal representations have made a big difference in various areas of speech processing like TTS, voice conversion and voice coding. The core idea is very simple and straightforward but also powerful: acoustic signals are produced either by harmonic oscillation, in which case they have structure, or by turbulence, in which case we see something like white noise. In speech these classes are represented by vowels and sibilant consonants; everything else is a mixture of the two with some degree of turbulence and some degree of structure. However, this is not really speech-specific: all other real-world signals except artificial ones can be analyzed from this point of view.

Such a representation made it possible to greatly improve voice compression in the class of MELP codecs (mixed-excitation linear prediction). Basically we represent the speech as noise and harmonics and compress them separately. That made it possible to compress the speech signal down to an unbelievable 600 bit/s. Mixed excitation was also very important in text-to-speech synthesis, where it really made a big difference, as was shown quite some time ago in Mixed excitation for HMM-based speech synthesis by Takayoshi Yoshimura et al., 2001.
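As a very rough illustration of "noise plus harmonics", the sketch below builds an excitation frame as a weighted mix of a harmonic part and white noise, controlled by a single voicing parameter. This is a toy under my own assumptions (one global voicing value, equal-amplitude harmonics, uniform noise, hypothetical function and parameter names); a real MELP coder uses band-wise voicing strengths and a linear prediction filter on top.

```python
import math
import random


def mixed_excitation(f0_hz, voicing, num_samples,
                     sample_rate=8000, num_harmonics=10, seed=0):
    """Return samples mixing harmonics of f0 with white noise.

    voicing = 1.0 gives a purely harmonic (vowel-like) signal,
    voicing = 0.0 gives pure noise (sibilant-like).
    """
    rng = random.Random(seed)
    samples = []
    for n in range(num_samples):
        t = n / sample_rate
        # Structured part: equal-amplitude harmonics of the fundamental.
        harmonic = sum(math.sin(2.0 * math.pi * k * f0_hz * t)
                       for k in range(1, num_harmonics + 1)) / num_harmonics
        # Unstructured part: white noise in [-1, 1].
        noise = rng.uniform(-1.0, 1.0)
        samples.append(voicing * harmonic + (1.0 - voicing) * noise)
    return samples


vowel_like = mixed_excitation(100.0, voicing=1.0, num_samples=160)
sibilant_like = mixed_excitation(100.0, voicing=0.0, num_samples=160)
mixture = mixed_excitation(100.0, voicing=0.6, num_samples=160)
```

A codec built on this idea only has to transmit a handful of numbers per frame (pitch, voicing, spectral envelope) instead of the waveform itself, which is where the very low bitrates come from.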

Unfortunately there is very little published research on mixed excitation models for speech recognition. I only found a paper A harmonic-model-based front end for robust speech recognition by Michael L. Seltzer, which does consider a harmonic and noise model but focuses on robust speech recognition and not on the advantages of the model itself. However, I believe such a model can be quite important for speech analysis because it allows classifying speech events with a very high degree of certainty. For example, if you consider the task of creating a TTS system from voice recordings, you might notice that even the best algorithms still confuse sounds a lot, assign incorrect boundaries and select wrong annotations. A more accurate signal representation could help here.

It would be great if readers share more links on this, thank you!

On SANE 2015 Videos on Signal Separation

Recently a great collection of videos from the Speech and Audio in the Northeast (SANE) 2015 workshop has been shared. The main topic of the workshop was sound signal separation, which I consider a very important direction of research for the near future, something that will be critical to solve to get human-like performance from speech recognition systems.

We did some experiments with NMF and other methods to robustly recognize overlapped speech before, but my conclusion is that unless training and test conditions are carefully matched the whole system does not really work: anything unknown in the background really destroys the recognition result. For that reason I was very interested to check recent progress in the field. The research is at a pretty early stage, but there are very interesting results for sure.

The talk by Dr. Paris Smaragdis is quite useful for understanding the connection between non-negative matrix factorization and the more recent approach with neural networks. It also demonstrates how a neural network works by selecting principal components from the data.

One interesting bit from the talk above is the announcement of bitwise neural networks, which are a very fast and effective way to classify inputs. I believe it could be another big advancement in the performance of speech recognition algorithms. The details can be found in the following publication: Bitwise Neural Networks by Minje Kim and Paris Smaragdis. Overall, the idea of bit-compressed computation to reduce memory bandwidth seems very important (the LOUDS language model in the Google mobile recognizer is also from this area). I think NVIDIA should be really concerned about it since a GPU is certainly not the device this type of algorithm needs. No more need for expensive Teslas.
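To give a feeling for why bitwise computation is cheap, here is a small sketch of the basic trick as I understand it: when inputs and weights are restricted to +1/-1 and packed into machine words, a dot product becomes an XNOR followed by a bit count. The function names and the packing convention are my own assumptions for illustration and are not taken from the paper.

```python
def pack_bits(values):
    """Pack a sequence of +1/-1 values into an int; bit i is set when values[i] == +1."""
    word = 0
    for i, value in enumerate(values):
        if value > 0:
            word |= 1 << i
    return word


def bitwise_dot(a_bits, b_bits, width):
    """Dot product of two packed +1/-1 vectors using XNOR and a popcount."""
    mask = (1 << width) - 1
    agreements = bin(~(a_bits ^ b_bits) & mask).count("1")
    # Each agreeing position contributes +1, each disagreeing one -1.
    return 2 * agreements - width


def bitwise_neuron(input_bits, weight_bits, width, bias=0):
    """Binary neuron: sign of the bitwise dot product plus a bias."""
    return 1 if bitwise_dot(input_bits, weight_bits, width) + bias >= 0 else -1


x = [1, -1, 1, 1, -1, -1, 1, -1]
w = [1, 1, -1, 1, -1, 1, 1, -1]
plain = sum(a * b for a, b in zip(x, w))
packed = bitwise_dot(pack_bits(x), pack_bits(w), width=len(x))
print(plain, packed)   # both computations give the same value
```

A 64-element dot product then costs one XOR, one NOT and one population count on a 64-bit word, with 8 bytes of memory traffic per vector instead of 256 bytes of floats.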

Another interesting talk was by Dr. Tuomas Virtanen, in which a very interesting database and an approach to using neural networks for separation of different event types are presented. The results are pretty entertaining.

This video also had quite important bits. One of them is the announcement of the Detection and Classification of Acoustic Scenes and Events Challenge 2016 (DCASE 2016), in which acoustic scene classification will be evaluated. The goal of acoustic scene classification is to classify a test recording into one of the predefined classes that characterize the environment in which it was recorded, for example "park", "street", "office". The discussion of the challenge, which starts soon, is already going on in the challenge group, and it would be very interesting to participate.