Friday, May 18, 2012

What should be our focus in Speech Recognition?

If you work in a business long enough, you start to understand better which types of work are important. As with many things in life, the answer is sometimes not trivial. For example, in speech recognition, what are the important ingredients to work on?

Many people will instinctively say the decoder. For many, the decoder, the speech recognizer, or the "computer thing" which does all the magic of recognizing speech, is the core of the work.

Indeed, working on a decoder is loads of fun. If you are a fresh new programmer, it is also one of those experiences which will teach you a lot of things. Unlike thousands of small, "cool" algorithms, writing a speech recognizer requires you to work out a lot of file format issues and system issues. You will also touch a fairly advanced dynamic programming problem: writing a Viterbi search. For many, it means several years of studying source code bases from the greats such as HTK, Sphinx and perhaps in-house recognizers.
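To give a flavor of the dynamic programming involved, here is a minimal sketch of a Viterbi search over a tiny discrete HMM. It is only a toy: the two states ("sil", "speech"), the three observation symbols and all the probabilities are made up for illustration, and a real decoder such as those in HTK or Sphinx adds beam pruning, lexicon and language model handling and much more on top of this recursion.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (best_log_score, best_state_path) for a discrete HMM."""
    # best[t][s]: best log score of any state path ending in s at time t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    # back[t][s]: predecessor state on that best path
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log_emit[s][observations[t]]
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return best[-1][last], path


def to_log(table):
    """Convert a nested dict of probabilities to log probabilities."""
    return {k: (to_log(v) if isinstance(v, dict) else math.log(v))
            for k, v in table.items()}


# Hypothetical toy model: the numbers are invented, not taken from any real system.
states = ["sil", "speech"]
log_start = to_log({"sil": 0.6, "speech": 0.4})
log_trans = to_log({"sil": {"sil": 0.7, "speech": 0.3},
                    "speech": {"sil": 0.4, "speech": 0.6}})
log_emit = to_log({"sil": {"low": 0.5, "mid": 0.4, "high": 0.1},
                   "speech": {"low": 0.1, "mid": 0.3, "high": 0.6}})

score, path = viterbi(["low", "mid", "high"], states,
                      log_start, log_trans, log_emit)
print(path)  # ['sil', 'sil', 'speech']
```

Working in log probabilities is a deliberate choice: multiplying many small probabilities underflows quickly, while adding logs does not.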

Writing a speech recognizer is also very important when you need to deal with speed issues. You might want to fit a recognizer into your mobile phone or even just a chip. For example, at Voci, an FPGA-based speech recognizer was built to cater to ultra-high-speed speech recognition (faster than 100xRT). All these system-related issues require an understanding of the decoder itself.

This makes speech recognition an exciting field similar to chess programming. Indeed, the two fields are very similar in terms of code development. Both require a deep understanding of search as a process. Both have had eccentric figures pop up and pop out. There are more stories untold than told in both fields. Both are fascinating.

There is one thing in which speech recognition and chess programming are very different. This is also a subtle point which even many savvy and resourceful programmers don't understand: how each of these machines derives its knowledge sources. In speech, you need to have a good model to do a decent job on your task. In chess, though, most programmers can proceed to write a chess player with the standard piece values. As a result, there is a process to go through before anyone can use a speech recognizer: first train an acoustic model and a language model.

The same decoder, given different acoustic models and language models, can give users perceptions ranging from a total trainwreck to a modern wonder, bordering on magic. Those are the true ingredients of our magic. Unlike magicians though, we are never shy to talk about these secret ingredients. They are just too subtle to discuss. For example, you won't go to a party and tell your friends that "Using an ML estimate is not as good as using an MPFE estimate in speech recognition. It usually results in a 10% absolute performance gap." Those are not party talks. Those are talks for when you want to have no friends. :)

Both types of training require learning that is different from programming training. 10 years ago, those skills were generally carried by "Mathematicians, Statisticians or People who specialized in Machine Learning". Now there is a new name: "Big Data Analyst".

Before I stop, let me mention another type of work which is important in real life: transcription and dictionary work. If you ask some high-minded researchers in the field, they will mostly think this is not interesting work. Yet in real life, you can almost always learn something new and improve your systems based on it. Maybe I will talk about this more next time.

The Grand Janitor





Sunday, May 13, 2012

Restart

Again, I feel rejuvenated. The last few months of experience are starting to make me more unified, both as a person and as a technical person. When you start to work on something which draws on all you know in your life, you know that you are walking on the right path.

Things are starting to look more and more interesting.

The Grand Janitor

Friday, May 11, 2012

Development of Sphinx 3.X (X = 6 to 8) and its Ramifications.

One of the things I did back in my Sphinx days was the so-called "Great Refactoring" of Sphinx 3, SphinxTrain and sphinxbase. It was started by me but mostly taken up by Dave (in a disgruntled manner :) ). I am writing this article to reflect on the whole process and ask whether I did the right thing.

The background is like this: as you know, the CMU Sphinx project has many recognizers: Sphinx2, 3, 4, PocketSphinx and MultiSphinx. It's easy to understand why that happened in the first place. CMU is a university and understandably would have many different types of projects. In essence, when someone thinks of a good new idea, they will simply implement a recognizer. The by-product would be a PhD thesis or some kind of project report.

There is nothing wrong with that. Think of the pain of understanding and changing a recognizer which has 10-30 thousand lines of code, and you will know that it is not for the faint of heart. Many of the original programmers of the recognizers also had practical reasons to ignore code reusability - many of them had deadlines to meet. So I always feel empathy towards them.

Of course, on the other side of the coin, having many recognizers gives users a mild amount of pain. Just look at 3.0 and 3.3: the command-line interface had changed (e.g. -meanfn became -mean). So when people need to interface with the code, it takes some understanding. The bigger problem is this: can you expect a certain feature which appears in one of the decoders to appear in another? This kind of inconsistency is very hard to explain to normal users.

So here comes the first change: at 3.5, around 6-7 years ago, I decided to merge 3.0's series of tools and recognizer with 3.3, the fast decoder. I've got to say, the decision was mainly driven by youthful naivete and year-long insomnia ( :) ). There was also frustration from users which drove me to make those changes. In 3.5, the main thing I did was just to "port" tools from the old 3.0, such as allphone, astar and align, to 3.x. There were some command-line interface changes. So far, all cool.

Then came 3.6. At this point, I started to realize that a lot of underlying functions and libraries were duplicated. For example, we had multiple GMM computation routines, but you couldn't use them in all the tools which call GMM computation. For instance, allphone in 3.5 used GMM computation, but you couldn't expect any fast GMM computation from 3.4 to be usable in allphone, simply because the library wasn't shared.

So what did young and naive me think? Let's try to write a single architecture to incorporate all these different things! (!!!!) Now... this is where I think things went wrong.

Let me explain a little bit more. There is a legitimate reason why the original programmer (Ravi) decided to split the tools into multiple parts and let the code be duplicated: the issues in align are not necessarily the issues in decode. If the programmer of align needs to consider the issues of decode, then it will take a long time to really get any programming done.

This happened to be the case for Sphinx 3.X. For the development of Sphinx 3.X, there was another undesirable factor: I decided to leave. I simply couldn't overcome the economic force at the time - a startup company was willing to hire me.

To complicate matters, we *also* decided to factor out the common parts of SphinxTrain and sphinx3 to avoid code duplication between the two. Again, it was driven by a legitimate concern: the fact that there were two feature extraction routines in two packages constantly made users ask themselves whether the front-ends were matched.

All of these, except my leaving, were good things, but they entailed coding time. The end effect was that the effort became too big and too time-consuming. 3.6 took me around 1 year to write and release. I made an official release around the middle of 2006, but there were still too many issues in the program. For the later 3.8, Dave took over and really fixed many bugs. So I always think it was Dave who brought Sphinx 3.X into its current stable form.

To the credit of the guys in the team, they really bashed me: Evandro, being circumspect and consistent, always asked if it was a good idea in the first place. Ravi, always the wise man, brought up the issues of merging the code. And of course, there is Dave, who deserves most of the credit for fixing a lot of nasty bugs.

So, in fact, it is really I who should be blamed in the process. I guess I am finally mature enough to apologize to everyone.

So you may wonder why I am saying all of this. Oh well, first of all, it's because I am going to put work into the recognizers again. Not just Sphinx 3, but all the other recognizers. So my first hope is that I don't repeat my past mistakes.

Now, given that the code has been iterated on over the last 6 years, the benefit of merging the code in Sphinx 3 is really starting to show. People can do a lot more things than in the past. Is it good enough? I don't think so. Sphinx 3 has a lot of potential but it's very misunderstood. In a nutshell, I need to put more work into it in the future.

The Grand Janitor

Friday, May 04, 2012

Being a programmer in your 20s and 30s

It's funny how a person changes. I always thought my 20s were my best time. They sort of were. Generally, that is the time when you are energetic, can burn as much as you can, and naively think that life and relationships can last forever. Also, you unconditionally trust other people.

Things turn when you are in your 30s: you start to realize that your skill and your capacity to grow have limits. In exchange, you grow wiser. In my case, I found my reads on people are much better. I start to go behind other people's words and try to understand their intentions. I start to treasure genuine friendship, and to protest contrived politeness and faked honesty.

I also start to know when is the best time to be quiet and when is the best time to give a comeback. The former is important because if you are the only one who shines, your team will perpetually have the capability of one person, which is you.

The latter is also important because if you are always quiet, there are people who will step on your toes harder and harder. They will think you are weak and can be bullied. In real life, as in high school, the bullies love to bully the weak. Making sure they have a hard time doing so is a very important life skill.

I will never go back to the time when I could hack a program for 20 hours, sleep, and then hack it again for another 20 hours. Do I regret it? Probably not. In exchange, I learned that sometimes you can solve a 20-hour problem in 2 hours, and you can still sleep and make a living. For all that matters, it seems to be a better deal. :)

The Grand Janitor


Wednesday, May 02, 2012

Start to look at the repository tree

Programming as a profession is a strange one. If you are a doctor, you can usually carry your knowledge and skills from one place to another, provided that you have exactly the same tools. If you are a programmer, your speed and skill are partially determined by the tools you built in-house at a particular place. So, for example, I am not supposed to use any tool I built when I worked at the small video-advertising start-up. Even if I could do something in 1 second back then, once I change jobs, I need to start over and rebuild the tool. We are probably talking about days to rebuild the tool and weeks to refine it again.

There is one exception: if you work in open source, much of your code is stored in a public place. Even when you have left your job for a long time, it is legit for you to use it again. You don't have to solve the same problem again and again. This is the beauty of open source and I have benefited greatly from it personally.

As I start to regain my muscles in Sphinx, I notice that there have been many changes in the last 6 years. Just look at the top level of the Subversion repository:

File                      Rev.   Age        Author         Last log entry
CLP/                      10079  23 months  dhdfu          Finally add an -F argument to use the full path in the control file as the label…
PocketSphinxAndroidDemo/  11117  9 months   nshmyrev       Wrapper for nbest
SimpleLM/                 22     12 years   rickyhoughton  Initial revision
Speech-Recognizer-SPX/    8933   3 years    nshmyrev       Update module to recent pocketsphinx API
SphinxTrain/              11350  9 days     nshmyrev       Extract warped features during 000 stage if VTLN is enabled. See for detailsht
archive_s3/               7289   4 years    egouvea        Fixed error message in decoder script reporting failure in bw, and made result d…
cmuclmtk/                 11035  10 months  nshmyrev       Fixes bug in wngram2idngram and adds a test for it
cmudict/                  11348  3 weeks    air            cleaned up documentation and code (a bit) recompiled the dict
gst-sphinx/               7848   4 years    dhdfu          Support changing language models at runtime (maybe)
htk2s3conv/               11336  6 weeks    nshmyrev       Adds warning about different number of mixtures
jsgfparser/               7230   4 years    dhdfu          Fix the main program to output the only public rule if no rule is specified, and…
logios/                   11339  4 weeks    tkharris       remove duplicated code
misc_scripts/             10147  22 months  dhdfu          handle zero references
multisphinx/              10945  12 months  dhdfu          clean up better and introduce vocabulary maps
pocketsphinx/             11351  8 days     nshmyrev       Updated lat2dot script. I need to move it to the other location though
pocketsphinx-extra/       9972   2 years    dhdfu          add sc models with mixture_weights and mdef.txt files
scons/                    5868   5 years    egouvea        updated the scons support to reflect that plugin.jar is now part of the package
share/                    5532   6 years    egouvea        Setting dsp and dsw files to have have windows EOL regardless where it's downloa…
sphinx2/                  8767   3 years    egouvea        Updated the sphinx-2 MS files to MS .NET, consistent with the other packages, an…
sphinx3/                  11329  2 months   nshmyrev       Patch to solve memory issues in python module. See for detailshttps://bugzilla
sphinx4/                  11344  3 weeks    nshmyrev       Properly sets logger for AudioFileDataSource. Thanks to Bandele Ola.
sphinx_fsttools/          10791  14 months  nshmyrev       Some bit in AM to FST conversion
sphinxbase/               11346  3 weeks    nshmyrev       Properly select buffer size when using audioresample. Thanks to balkce See fo…
tools/                    9009   3 years    nshmyrev       Updated to the latest release of sphinx4
web/                      10249  21 months  nshmyrev       There is no sphinx3 development anymore
How exciting is that? There were only 6 to 7 top-level directories 7 years ago!

From now on, I will start to put more notes on different tools in the repository. 

The Grand Janitor