The Grand Janitor's Blog: HTK

Tuesday, January 08, 2013

Two Views of Time-Signal : Global vs Local

As I have been working on Sphinx at work and have started to chat with Nickolay more, one thing I realize is that several frequently used components of Sphinx need rethinking. Here is one example related to my recent work.

A speech signal, or more generally any time signal, can be processed in two ways: you either process it as a whole, or you process it in blocks. You can call the former a global view and the latter a local view. Of course, there are many other names (block/utterance, block/whole), but essentially the terminology means the same thing.

Most of the time, global and local processing give the same result, so you can simply say that the two types of processing are equivalent.

Of course, that stops being true once you start to do an operation which uses all the information available in the utterance. For a very simple example, look at cepstral mean normalization (CMN). Implementing CMN in block mode is certainly an interesting problem. For example, how do you estimate the mean if you only have a running window? When you think about it a little bit, you will realize it is not a trivial problem. That's probably why there are still papers on cepstral mean normalization.
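
To make the contrast concrete, here is a minimal Python sketch. It is an illustration only, not code taken from sphinxbase or HTK. `cmn_global` subtracts the mean of the whole utterance. `cmn_running` is one hypothetical way to work in block mode: it starts from an assumed prior mean and updates the estimate as each frame arrives. The function names, the prior and its weight are all assumptions made for this example.

```python
def cmn_global(frames):
    """Global view: subtract the mean of the whole utterance from every frame."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]


def cmn_running(frames, prior_mean, prior_weight=100.0):
    """Local view: only the frames seen so far are available.

    The mean estimate mixes an assumed prior (counted as `prior_weight`
    pseudo-frames) with the frames consumed so far. This is a hypothetical
    update rule chosen for illustration.
    """
    dim = len(prior_mean)
    total = [m * prior_weight for m in prior_mean]
    count = prior_weight
    out = []
    for f in frames:
        # Update the running sum first, then normalize the current frame.
        for d in range(dim):
            total[d] += f[d]
        count += 1
        out.append([f[d] - total[d] / count for d in range(dim)])
    return out


# Three toy 2-dimensional "cepstral" frames.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
global_out = cmn_global(frames)
running_out = cmn_running(frames, prior_mean=[0.0, 0.0], prior_weight=1.0)
```

On this toy input the two functions disagree on every frame, because the running version never sees the later frames when it normalizes the earlier ones. How to choose the prior and how quickly to forget old frames are exactly the kind of questions that make block-mode CMN non-trivial.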

Translated to Sphinx: if you look at sphinxbase's sphinx_fe, you will realize that the implementation is based on the local mode, i.e. every once in a while, samples are consumed, processed and written to disk. There is no easy way to implement CMN in sphinx_fe because it is assumed that the consumers (such as decode and bw) will do this on their own.

It's all good, though there is an interesting consequence: what the SF guys call a "feature" is really all the processing that can be done in the local sense, rather than the "feature" you see in either the decoders or bw.

This special point of view is ingrained within sphinxbase/sphinxX/sphinxtrain (Sphinx4? Not sure yet). This is quite different from what you will find in HTK, which sees the feature vector as the vector used in Viterbi decoding.

That brings me to another point. If you look deeper, HTK tools such as HVite/HCopy are highly abstract. Each tool was designed to take care of its own problem well. HCopy is really meant to provide just the features, whereas HVite just runs the Viterbi algorithm on a bunch of features. It's nothing complicated. Sphinx, on the other hand, is more speech-oriented. In that world, life is more intertwined. That's perhaps why you seldom hear of people using Sphinx to do research other than speech recognition. You can, on the other hand, do other machine learning tasks in HTK.

Which view is better? If you ask me, I wish both HTK and Sphinx were released under a Berkeley license. Tons of real-life work could be saved because each covers some useful functionality.

Given that only one of them (Sphinx) is released under a liberal license, maybe what we need is to absorb some design paradigms from HTK. For example, HTK has a sense of organizing data as pipes. That is something SphinxTrain can use. This will make work easier for Unix users, who usually contribute the most in the community.

I also hope that eventually there will be good clones of the HTK tools made available under a Berkeley/GNU license. Not that I dislike the status quo: I am happy to be able to read the code of HTK (unlike the time before 2.2......). But once you have worked in the industry for a while, you see that many people actually use both Sphinx and HTK to solve their speech research-related problems. Of course, many of these guys (if they are honest) need to find extra development time to port some HTK functions into their own production systems. It's not tough, but you will wonder whether the time could be better spent ......

Arthur

Thursday, December 27, 2012

Speech Recognition vs SETI

If you track news of CMUSphinx, you may notice that the Sourceforge guys have started to distribute data through BitTorrent (link).

That's a great move. One of the issues in ASR is the lack of machine power in training. To give a blunt example, it's possible to squeeze out extra performance by searching for the best training parameters, not to mention that a lot of modern training techniques take some time to run.

I do recommend that all of you help the effort. Again, I am not involved at all; I just feel that it is a great cause.

Of course, things in ASR are never easy so I want to give two subtle points about the whole distributed approach of training.

Improvement over the years?

The first question you may ask: does that mean ASR can be like a project such as SETI, which would automatically improve over the years? Not yet. ASR still has its unique challenges.

The major issue I see is how we can incrementally grow the amount of phonetically balanced transcribed audio. Note that it is not just audio, but transcribed audio. Meaning: someone needs to listen to the audio, spending 5-10 times real time to write down what the audio really says, word by word. All these transcriptions then need to be cleaned up and put in a certain format.

This is what Voxforge tries to achieve and it's not a small undertaking. Of course, compared to the speed of development in the industry, the progress is still too slow. The last time I heard, Google was training their acoustic model with 38000 hours of data. The WSJ corpus is a toy task compared to it.

Now, thinking this way, let's say we want to build the best recognizer through open source: what is the bottleneck? I bet the answer doesn't lie in machine power. Whether we have enough transcribed data would be the key. So that's something to ponder.

(Added Dec 27, 2012: on the initial amount of data, Nickolay corrected me, saying that the amount of data available to Sphinx is already on the order of 10000 hours. That includes "librivox recordings, transcribed podcasts, subtitled videos, real-life call recordings, voicemail messages".

So it does sound like Sphinx has the amount of data which rivals commercial companies. I am very interested to see how we can train an acoustic model with that amount of data.)

We build it, they will come?

ASR is always shrouded in misunderstanding. Many believe it is a solved problem; many others believe it is an unsolvable problem. 99.99% of the world's population is uninformed about the problem.

I bet a lot of people would be more fascinated by SETI, which .... Whoa .... allows you to communicate with unknown intelligent beings in the universe, than by ASR, which ..... Em ..... many basically regard as a source of satire and parody these days.

So here comes another problem: the public doesn't understand ASR enough to see it as an important problem. When you think about this more, this is a dangerous situation. Right now, a couple of big companies control the resources for training cutting-edge speech recognizers. So let's say that in the future everyone needs to talk with a machine on a daily basis. These big companies would be so powerful that they could control our daily life. To be honest with you, this thought haunts me from time to time.

I believe we should continue to spread information on how to properly use an ASR system. At the same time, we should continue to build applications that showcase ASR and let the public understand its inner workings. Unlike subatomic particle physics, HMM-based ASR is not that difficult to understand. On this front, I appreciate all the effort made by the developers of CMUSphinx, HTK, Julius and all other open source speech recognition projects.

Conclusion

I love the recent move of Sphinx spreading acoustic data using BitTorrent. It is another step towards a self-improving speech recognition system. There are still things we need to ponder in the open source speech community. I mentioned a couple; feel free to bring up more in the comment section.

Arthur

Wednesday, December 26, 2012

Me and CMU Sphinx

As I update this blog more frequently, I have noticed that more and more people are being directed here. Naturally, there are many questions about some of my past work. For example, "Are you still answering questions in CMUSphinx forum?", and general requests for certain tutorials. So I guess it is time to clarify my current position and what I plan to do in the future.

Yes, I am planning to work on Sphinx again, but no, I don't expect to be a maintainer-at-large any more. Nick has proven himself to be the most awesome maintainer in our history. Under his stewardship, Sphinx has prospered in the last couple of years. That's what I hope and that's what we all hope.

So for that reason, you probably won't see me much in the forum answering questions. Rather, I will spend most of my time implementing, experimenting and getting some work done.

There are many things that ought to be done in Sphinx. Here is my top 5 list:

Sphinx 4 maintenance and refactoring
PocketSphinx's maintenance
An HTKbook-like documentation : i.e. Hieroglyphs.
Regression tests on all tools in SphinxTrain.
In general, modernization of Sphinx software, such as using WFST-based approach.

This is not a small undertaking, so I am planning to spend a lot of time relearning the software. Yes, you heard it right. Learning the software. In general, I find myself very ignorant of a lot of the software details of Sphinx in 2012. There have been many changes. The parts I have really caught up on are probably sphinxbase, sphinx3 and SphinxTrain. On PocketSphinx and Sphinx4, I need to learn a lot.

That is why in this blog, you will see a lot of posts about my status of learning a certain speech recognition software. Some could be minute details. I share them because people can figure out a lot by going through my status. From time to time, I will also pull these posts together and form a tutorial post.

Before I leave, let me digress and talk about this blog a little bit: other than posts on speech recognition, I will also post a lot about programming, languages and other technology-related stuff. Part of it is that I am interested in many things. The other part is that I feel working on speech recognition actually requires one to understand a lot about programming and languages. This might also attract a wider audience in the future.

In any case, I hope I can keep on. And hope you enjoy my articles!

Arthur

Sunday, December 16, 2012

Landscape of Open Source Speech Recognition software at the end of 2012 (I)

Now that I am back, I have started to visit all my old friends - all the open source speech recognition toolkits. The usual suspects are still around. There are also many new kids in town, so this is a good time to take a look.

It was a good exercise for me; 5 years of not thinking about open source speech recognition is a bit long. It feels like I am getting in touch with my body again.

I will skip CMU Sphinx in this blog post as you probably know something about it if you are reading this blog. Sphinx is also quite a complicated project, so it is rather hard to describe entirely in one post. This post serves only as an overview. Most of the toolkits listed here have rich documentation. You will find much useful information there.

HTK

I checked out the Cambridge HTK web page. Disappointingly, the latest version is still 3.4.1, so we are still talking about MPE and MMIE, which are still great but not as exciting as what other new kids in town, such as Kaldi, offer.

HTK has always been one of my top 3 speech recognition systems since most of my graduate work was done using HTK. There are also many tricks you can do with the tools.

I also find its software engineering practice as a toolkit admirable. For example, the command-line tools are built on a set of common libraries underneath. (Earlier versions such as 1.5 or 2.1 would restrict access to the memory allocation library HMem.) When reading the source code, you sense a lot of regularity, and there doesn't seem to be much duplicated code.

The license disallows commercial use, but that's okay. With ATK, which is released under a freer license, you can also include the decoder code in a commercial application.

Kaldi

The new kid in town. It is headed by Dr. Dan Povey, who has researched many advanced acoustic modeling techniques. His recognizer attracts much interest as it implements features such as subspace GMMs and an FST-based speech recognizer. All in all, these features feel more "modern".

I have had only a little exposure to the toolkit (but am determined to learn more). Unlike Sphinx and HTK, it is written in C++ instead of C. As of this writing, Kaldi's compilation takes a long time and the binaries are *huge*. In my setup, it took around 5 GB of disk space to compile. It probably means I haven't set it up correctly ...... or more likely, the executables are not stripped. That means working actively on Kaldi's source code takes some care in terms of hard disk space.

Another interesting part of Kaldi is that it uses the weighted finite state transducer (WFST) as the unifying knowledge source representation. By contrast, you may say most of the current open source speech recognizers use ad-hoc knowledge sources.

Are there any differences in terms of performance, you ask? In my opinion, probably not much if you are doing an apples-to-apples comparison. The strength of using WFSTs is that when you need to introduce new knowledge, in theory you don't have to hack the recognizer. You just need to write your knowledge as an FST and compose it with your knowledge network, then you are all set.
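
As a rough illustration of what "compose it with your knowledge network" means, here is a toy Python sketch of composition for two epsilon-free weighted transducers, with weights treated as costs that add along a path (the tropical semiring). This is not the API of Kaldi or OpenFst; the dictionary layout, the `compose` function and the tiny example machines are all assumptions made for this sketch, and real toolkits also have to handle epsilon labels, which are ignored here.

```python
from collections import deque


def compose(a, b):
    """Compose two epsilon-free weighted transducers (costs add along paths).

    Each transducer is a dict with:
      "start":  start state
      "finals": {state: final cost}
      "arcs":   {state: [(input label, output label, cost, next state), ...]}
    An arc of `a` is paired with an arc of `b` when a's output label
    equals b's input label.
    """
    start = (a["start"], b["start"])
    arcs, finals = {}, {}
    queue, seen = deque([start]), {start}
    while queue:
        state = queue.popleft()
        qa, qb = state
        arcs[state] = []
        if qa in a["finals"] and qb in b["finals"]:
            finals[state] = a["finals"][qa] + b["finals"][qb]
        for ilabel, mid_a, cost_a, next_a in a["arcs"].get(qa, []):
            for mid_b, olabel, cost_b, next_b in b["arcs"].get(qb, []):
                if mid_a != mid_b:
                    continue
                nxt = (next_a, next_b)
                arcs[state].append((ilabel, olabel, cost_a + cost_b, nxt))
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return {"start": start, "finals": finals, "arcs": arcs}


# Toy "pronunciation variant" transducer: two spoken forms map to one word.
variants = {
    "start": 0,
    "finals": {1: 0.0},
    "arcs": {0: [("tomayto", "tomato", 1.0, 1), ("tomahto", "tomato", 2.0, 1)]},
}
# Toy word-level "grammar": accepts the word "tomato" with cost 0.5.
grammar = {
    "start": 0,
    "finals": {1: 0.0},
    "arcs": {0: [("tomato", "tomato", 0.5, 1)]},
}
network = compose(variants, grammar)
```

The point of the design is that `variants` and `grammar` are written independently, and the recognizer only ever sees the composed `network`. Adding a new knowledge source means writing one more transducer and composing again, rather than changing the search code.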

In reality, WFST-based technology still seems to have practical problems. As the vocabulary grows and the knowledge sources get more complicated, the composed decoding WFST naturally outgrows the system memory. As a result, many sites have proposed different techniques to make the decoding algorithm work.

Those are downsides, but the appeal of the technique should not be overlooked. That's why Kaldi has recently become one of my favorite toolkits.

Julius

Julius is still around! And I am absolutely jubilant about it. Julius is a high-speed speech recognizer which can decode with a 60k vocabulary. One of the speed-up techniques of Sphinx 3.X was context-independent phone Gaussian mixture model selection (CIGMMS), and I borrowed this idea from Julius when I first wrote it.

Julius is only a decoder, and the beauty of it is that it never claims to be more than that. Accompanying the software, there is a new Juliusbook, which is the guide on how to use the software. I think the documentation goes into greater depth than other similar documentation.

Julius comes with a set of Japanese models, not English ones. This might be one of the reasons why it is not as popular (or rather, not as talked about) as HTK/Sphinx/Kaldi.

(Note at 20130320: I later learned that Julius also comes with an English model now. In fact, some anecdotes suggest the system is more accurate than Sphinx 4 on broadcast news. I am not surprised: HTK was used as the acoustic model trainer.)

So far......

I went through three of my favorite recognition toolkits. In the next post, I will cover several other toolkits available.

Arthur

Friday, May 18, 2012

What should be our focus in Speech Recognition?

If you have worked in a business long enough, you start to understand better what types of work are important. As with many things in life, sometimes the answer is not trivial. For example, in speech recognition, what are the important ingredients to work on?

Many people will instinctively say the decoder. For many, the decoder, the speech recognizer, or the "computer thing" which does all the magic of recognizing speech, is the core of the work.

Indeed, working on a decoder is loads of fun. If you are a fresh new programmer, it is also one of those experiences which will teach you a lot of things. Unlike thousands of small, "cool" algorithms, writing a speech recognizer requires you to work out a lot of file format issues and system issues. You will also touch a fairly advanced dynamic programming problem: writing a Viterbi search. For many, it means several years of studying source code bases from the greats such as HTK, Sphinx and perhaps in-house recognizers.
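
For readers who have never seen one, the core dynamic programming recursion fits in a few lines. The following Python sketch is a textbook Viterbi search over a tiny discrete HMM, working in the log domain. It is only an illustration: it is not how HTK or Sphinx implement their decoders (a real decoder adds pruning, lexicon and language model handling, and much more), and the toy model and all names here are assumptions made for the example.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return (best log probability, best state path) for `obs`."""
    # best[t][s]: log score of the best path that ends in state s at time t.
    best = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    # back[t][s]: predecessor of s on that best path.
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log_emit[s][obs[t]]
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return best[-1][last], path


def logs(table):
    """Convert a (possibly nested) dict of probabilities to log probabilities."""
    return {
        k: logs(v) if isinstance(v, dict) else math.log(v)
        for k, v in table.items()
    }


# A toy two-state model with made-up probabilities.
states = ["Healthy", "Fever"]
log_start = logs({"Healthy": 0.6, "Fever": 0.4})
log_trans = logs({
    "Healthy": {"Healthy": 0.7, "Fever": 0.3},
    "Fever": {"Healthy": 0.4, "Fever": 0.6},
})
log_emit = logs({
    "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
    "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
})
score, path = viterbi(["normal", "cold", "dizzy"], states,
                      log_start, log_trans, log_emit)
```

Working by hand through the three observations gives the path Healthy, Healthy, Fever with probability 0.3 x 0.7 x 0.4 x 0.3 x 0.6 = 0.01512, which is what the sketch returns (as a log probability in `score`).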

Writing a speech recognizer is also very important when you need to deal with speed issues. You might want to fit a recognizer into your mobile phone or even just a chip. For example, at Voci, an FPGA-based speech recognizer was built to cater to ultra-high-speed speech recognition (faster than 100xRT). All these system-related issues require an understanding of the decoder itself.

This makes speech recognition an exciting field similar to chess programming. Indeed the two fields are very similar in terms of code development. Both require a deep understanding of search as a process. Both have had eccentric figures pop up and pop out. There are more stories untold than told in both fields. Both are fascinating fields.

There is one way in which speech recognition and chess programming are very different. This is also a subtle point which even many savvy and resourceful programmers don't understand: how each of these machines derives its knowledge sources. In speech, you need a good model to do a decent job on your task. In chess, though, most programmers can proceed to write a chess player with the standard piece values. As a result, there is a process to go through before anyone can use a speech recognizer: first training an acoustic model and a language model.

The same decoder, given different acoustic models and language models, can leave users with impressions ranging from a total trainwreck to a modern wonder bordering on magic. Those models are the true ingredients of our magic. Unlike magicians, though, we are never shy to talk about these secret ingredients. They are just too subtle to discuss. For example, you won't go to a party and tell your friends that "Using an ML estimate is not as good as using an MPFE estimate in speech recognition. It usually results in a 10% absolute performance gap." Those are not party talks. Those are talks for when you want to have no friends. :)

Both types of training require learning that is different from a programmer's training. 10 years ago, those skills were generally held by "Mathematician, Statistician or People who specialized in Machine Learning". Now there is a new name: "Big Data Analyst".

Before I stop, let me mention another type of work which is important in real life: transcription and dictionary work. If you ask some high-minded researchers in the field, most will say that this is not interesting work. Yet in real life, you can almost always learn something new and improve your systems based on it. Maybe I will talk about this more next time.

The Grand Janitor