Monday, December 31, 2012

Favorite words of Dec 31, 2012

Spanish: riesgo
English: promulgate, sagacious

Esperanto : Ekstermensigu
Toki Pona: All 121 words of it.  Liking them in a day should be okay. 

I always wonder how we could build a recognizer for made-up languages such as Esperanto, Toki Pona or Klingon.


Saturday, December 29, 2012

SRI LM Toolkit 1.7 released

Not exactly news, it happened 6 days ago.

See this thread:

Looks pretty good.   I thought that the toolkit had stagnated but now I am glad that it is maintained.   There seem to be many bug fixes.

Goodies: Web Ngram from MS:

You got to have thick skin......

Linus Chews Up Kernel Maintainer for Introducing Userspace Bug.

That's just the way it is.  Remember, there are always better, stronger, more famous, more powerful programmers working with you.  So criticisms will come one day.

My take: as long as you don't work with them in person, just treat them as an icon on the screen.   :D


Developer's meeting note at 2010

This caught my eye when I browsed through CMUSphinx's blog.  That generally decides how the project will go.

Looks like resources are still an issue......

Friday, December 28, 2012

Where to start when tracing source code of a speech recognition toolkit?

Modern speech recognition software is a complicated piece of software.  To understand it, you need some basic understanding of the principles of speech recognition, as well as some idea of the programming language being used.

By now, you may have heard a lot of people say they know about speech recognizers.   And by now, you probably realize that most of these people have absolutely no idea what's going on inside a recognizer.   So if you are reading this blog message, you are probably telling yourself, "I might want to trace the codebase of some recognizer," be it Sphinx, HTK, Julius, Kaldi or whatever codebase you are looking at.

Of the above toolkits, I will say I only know Sphinx in detail, and probably a little bit about HTK's HVite.  I can't say the same for the others.  In fact, even in Sphinx, I only know the Sphinx 3/SphinxTrain/sphinxbase triplet intimately.   So just like you, I hope to learn more.

So this raises the question: how would you trace a speech recognition toolkit codebase? If you think it is easy, it is probably because you have worked in speech recognition for a while, and you probably don't need to read this post.

Let's just use Sphinx as an example.  There are hundreds of files in each component of Sphinx.   So where should you start?    A blunt approach would be to read each of the files one by one.   That's not a smart way.   So here is a suggestion for you: focus on the following four things,

  1. Viterbi algorithm
  2. Workflow of training
  3. Baum-Welch algorithm. 
  4. Estimation algorithms of language models. 
When you know where the Viterbi algorithm is, you will soon figure out how the feature vector is generated.  In the same vein, if you know where the Baum-Welch algorithm is, you will probably know how the statistics are generated.   If you know the workflow of the training, then you will understand how the model "evolves".   If you know how the language model is estimated, then you will have an understanding of one of the most important heuristics of the search. 
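
To make the first item concrete, here is a minimal sketch of the Viterbi recursion in Python, so you know what shape of loop to look for in a real codebase.  It is not how Sphinx or any other toolkit implements its search: the function name, the list-of-lists layout and the log-probability inputs are all assumptions for illustration, and real decoders add beam pruning, a lexicon and language model scores on top.

```python
def viterbi(log_init, log_trans, log_emit):
    """Return (best_log_score, best_state_path) for one utterance.

    log_init[s]     : log probability of starting in state s
    log_trans[p][s] : log probability of moving from state p to state s
    log_emit[t][s]  : log likelihood of frame t under state s
    (All names and layouts here are hypothetical, for illustration only.)
    """
    n_states = len(log_init)
    # Scores of the best path ending in each state at the first frame.
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    backptr = []
    for t in range(1, len(log_emit)):
        new_score, ptr = [], []
        for s in range(n_states):
            # Pick the best predecessor state for state s at frame t.
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + log_trans[p][s])
            ptr.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + log_emit[t][s])
        score = new_score
        backptr.append(ptr)
    # Trace back from the best final state to recover the state sequence.
    state = max(range(n_states), key=lambda s: score[s])
    best = score[state]
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return best, path
```

Notice that the only thing the recursion needs from the front-end is log_emit, one score per frame and state.  That is why finding this loop quickly leads you to where the feature vectors come from.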

Some of you may protest: how about the front-end? Isn't that important too?  True, but not when you are trying to understand a codebase.  For all practical purposes, a feature vector is just an N-dimensional vector, and an utterance of T frames is just an NxT matrix.   You can certainly do a lot of fancy things to this NxT matrix.   But when you think of Viterbi and Baum-Welch, they probably just read the frames and then calculate Gaussian distributions.  That's pretty much all you need to know about a front-end. 
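
As a rough illustration of "read the frames and then calculate Gaussian distributions", the sketch below scores each frame of an utterance against a single diagonal-covariance Gaussian.  The function names are made up for this post, and a real acoustic model uses mixtures of many Gaussians per state, so treat this as the simplest possible instance of the idea.

```python
import math

def diag_gaussian_loglik(frame, mean, var):
    """Log density of one N-dimensional frame under a diagonal Gaussian."""
    total = 0.0
    for x, m, v in zip(frame, mean, var):
        # Each dimension contributes an independent 1-D Gaussian term.
        total += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return total

def score_utterance(frames, mean, var):
    """Score every frame of a T-frame utterance against one Gaussian."""
    return [diag_gaussian_loglik(f, mean, var) for f in frames]
```

Whatever the front-end did to produce the frames, this is the level at which Viterbi and Baum-Welch consume them.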

How about adaptation algorithms?  Those I think are important.   But they should probably come after understanding the major things in the code, because no matter whether you are doing adaptation online or doing it in speaker adaptive training, it is something on top of the Baum-Welch algorithm.   Some implementations put adaptation within the Baum-Welch executable.  There is certainly nothing wrong with that.   But it is still a kind of add-on. 

How about the decoding API?  Those are useful things to know but they matter more when you just need to write an application.  For example, in Sphinx4, you just need to know how to call the Recognizer class.  In sphinx3, live_decode is what you need to know.   But understanding only those won't give you much insight into how the decoder really works. 

How about the data structures?  Those are sort of important and should be understood when you try to understand a certain algorithm.   In the case of languages such as Java and C++, you should probably take note of any custom-made data structures, or of whether the designer calls a specific data structure library, like Boost in C++. 

I guess this pretty much sums it all up.  Now let me get back to one non-trivial item on the list, which is the workflow of training.   Many of you might think that recognition systems differ from each other because they have different decoders.  Dead wrong!  As I stress from time to time, they differ because they have different acoustic models and language models.  So that's why in many research labs, much effort is put into preserving the parameters and procedures of how models are trained.  Much effort is also put into fine-tuning this procedure.  

On this part,  I have got to say open source speech recognition still has a long, long way to go.  For starters, there is not much sharing of recipes among speech hobbyists.   What many try to do is to search for a good model.   If you don't know how to train a model, you probably don't even know how to improve it for your own project.   


Sphinx 4 from a C Background : Setting up Eclipse as the IDE

This is another baby step on how one can learn about Sphinx 4.   As I mentioned in the previous post,  it is nicer to use an IDE when you work with Java code.  Since I have some exposure to Eclipse, I chose it as an example of how to set up a Sphinx 4 build.

Before I go on: there are many posts, written by others, that discuss the procedure.  You may take a look at them as well.

You will also need to know how to install JSAPI (link).  It is crucial to get the compilation right. 

Eclipse as a Development Environment

If you have never used Eclipse before, it is a little bit like a more versatile version of Emacs.   Its major use is for Java but lately more and more people use it as an IDE for C/C++ as well.  Not to mention there are more and more development packages for different programming languages. 

If you come from a background such as emacs/vi development, one thing you need to know is that the shortcuts are quite different from your current platform.  That takes some time to adapt to but generally I think the advantage is worth the cost.

Another thing you might want to be mentally prepared for: Eclipse's Java compilation doesn't generate a build log.  Instead it generates a list of compilation errors.   They are basically equivalent.  Though, if you are used to a Visual C++ type of IDE with an error log, you won't get what you want.  

To me, those are minor nuisances.  Using Eclipse to browse code has the extra advantage of ready-made documentation as well as a flattened structure.  Those features will save you many keystrokes compared to using vanilla emacs. 

In my description, I am using Eclipse Juno.  Hopefully it won't change too much by the time you are compiling the code.  Of course, if there is popular demand, I might write another post which describes a later version of Eclipse as well.

The compilation in High Level

Building Sphinx 4 essentially means the following four tasks:
  1. Downloading Sphinx4 source code
  2. Install JSAPI.
  3. Incorporate the proper libraries. 
  4. Do the build. 
In my case, I slightly stumbled on 2.  Naturally, just like you, I was thinking "well, why is JSAPI something separate from the codebase?"  Of course, if you have worked in Java before, you know there are many projects that require you to build with an external codebase.  So I don't think it is too bad. 

So let me go through the procedure of the build. 

Downloading Sphinx4 source code from Subclipse
  • A plain simple svn command is fine, and downloading the tarball will give you a more stable version.  I will suggest a more attractive option: use the SVN module of Eclipse, Subclipse.   To do that, you may want to follow "Downloading Subclipse" from Setting up Development Environment.   (Notice that there was a typo in the post: it should be "tigris" instead of "trigris" in the location field.) 
  • Once you have finished installing Subclipse, start a new Project 
    • New -> Project -> SVN -> Checkout Projects from SVN
  • Choose "Create a New Repository Location"
  • Remember to only download trunk/sphinx4 (Note: there are many branches and locations; for starters, you will be interested in what the trunk looks like.)
Once you check out the code, your Package Explorer (Alt-Shift-Q -> P) will look like this. 

Package Explorer View after the code is checked out from SVN

Now you might notice that there is a red question mark beside the sphinx4 project (I named it "sphinx4_grandjanitor" but you can name it whatever you want.) You might also notice that in your Problems view, there are 2 errors :

Now this is really because lib/jsapi.jar wasn't installed correctly.  So the next step is to install jsapi.jar

Install JSAPI

I tried the install on both Windows Vista and Linux.  In Windows, go to \sphinx4\lib and type

> jsapi.exe

Then accept the license.

In Linux, in the same directory, do

> sh

One common problem for Linux here: you need to have uudecode installed if you want to install jsapi.  If it is missing, try installing sharutils.  On Ubuntu, it works for me when I do

> apt-get install sharutils

At this point you should see that your directory has a file named jsapi.jar

Incorporate the proper libraries

This is another part which took me a while.  Before you go on to configure your build path, you need to do one more step to make the libraries visible.   In Eclipse, right click your Sphinx4/lib directory and choose Refresh first.  This will make jsapi.jar appear in your Package Explorer.  It should look like this:

When JSAPI.jar is properly installed

Then, you can change the build path: go to your project again, right click and choose Build Path -> Configure Build Path, go to the Libraries tab, choose Add JARs, then add the libraries you need.

Now.... wait, what are the jar files we need again?

Yeah, so this is another place which can cause confusion.  In fact, because Sphinx has expanded its code from time to time, the answer to which jar files to add depends on the version.   As of Dec 28, 2012, you should add

  • junit
  • jsapi
  • js
  • fst
This list is likely to grow in future.  I am also pretty sure you might need to do different things if you want to compile in a different setting or write your own code.

Do the build

In modern Eclipse, building should be automatic.  What you should see is 0 errors but many warnings.    I generally don't approve of warnings but as a developer, I know it's pretty tough to eliminate them all.


There you have it, a little guide on Sphinx 4 compilation with Eclipse.  Notice that this guide may or may not fit your purpose because I focus on downloading the code with Subclipse.   Doing a Link Source should do the trick if you want to incorporate the code yourself.   I might do another post later but the web has many articles describing this already, so you should be able to find a set of good instructions. 



A note on GIT author and committer

I consider GIT a great improvement over CVS and SVN.   SVN is okay if the codebase is not too large, because the SVN server sometimes gets into lockup mode.

One thing I like about GIT is the differentiation between author and committer.    The author is the original writer of the code.   The committer is the one who commits the code.   This makes ownership and responsibility clearer.   

(Some houses discourage looking at who committed a change and demand that programmers take care of the problems themselves.  My one comment: mental constipation.)

So if you want to change the author of a commit, do

> git commit -m "Your message" --author "Firstname Lastname <email@example.com>"

(Git expects the author in "Name <email>" form; the address above is only a placeholder.)

In the GIT log, you will see "Firstname Lastname" become the author.  To look at both the author and the original committer, use
> git log --pretty=full ./

And obviously, this information can be pushed to a remote repository.   A simple

> git push

should work. 


Pondering Unix Philosophy

These are two great articles by James Hague on text processing vs visual programming.

The Unix Philosophy and a Fear of Pixels
Living inside your own Black Box

His main point is that visual programming is often dismissed because it is way more difficult than text processing.    It is a little bit like a lot of "stupid" things in the world such as Windows programming.   They are actually quite tough to do well.

On speech processing, I guess it is appropriate to think sound programming is tougher than text processing as well.   You may even note that in speech processing, no one has come up with a generic "Sound User Interface" IDE yet.


Thursday, December 27, 2012

Speech Recognition vs SETI

If you track the news of CMUSphinx, you may notice that the Sourceforge guys have started to distribute data through BitTorrent (link).

That's a great move.   One of the issues in ASR is the lack of machine power in training.  To give a blunt example, it's possible to squeeze out extra performance by searching for the best training parameters.    Not to mention that a lot of modern training techniques take some time to run.

I do recommend that all of you help the effort.  Again, I am not involved at all, I just feel that it is a great cause.

Of course, things in ASR are never easy so I want to give two subtle points about the whole distributed approach of training.

Improvement over the years?

The first question you may ask: does that mean ASR can be like projects such as SETI, which would automatically improve over the years?  Not yet, ASR still has its unique challenges.

The major part I see is how we can incrementally increase the amount of phonetically-balanced transcribed audio.   Note that it is not just audio, but transcribed audio.  Meaning: someone needs to listen to the audio, spending 5-10 times real time to write down what the audio really says word-by-word.   All these transcriptions need to be cleaned up and put in a certain format.  

This is what Voxforge tries to achieve and it's not a small undertaking.   Of course, compared to the speed of industry development, the progress is still too slow.  The last time I heard, Google was training their acoustic model with 38000 hours of data.   A WSJ corpus is a toy task compared to it.
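
To get a feeling for the scale, here is a back-of-envelope sketch that applies the 5-10 times real time figure to the 38000 hours quoted above.  The function name is made up, and the assumption that every hour would be transcribed by hand is mine, purely for illustration; it says nothing about how any company actually obtained its transcripts.

```python
def transcription_hours(audio_hours, low_factor=5, high_factor=10):
    """Return the (low, high) human hours needed to transcribe the audio.

    Assumes every hour of audio is transcribed by hand at 5-10 times
    real time, which is an illustrative assumption only.
    """
    return audio_hours * low_factor, audio_hours * high_factor

low, high = transcription_hours(38000)
print(low, high)  # 190000 380000
```

That is somewhere between 190000 and 380000 hours of human listening and typing for a corpus of that size.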

Now, thinking in this way, let's say we want to build the best recognizer through open source: what is the bottleneck?  I bet the answer doesn't lie in machine power;  whether we have enough transcribed data would be the key.   So that's something to ponder.

(Added Dec 27, 2012: on the initial amount of data, Nickolay corrected me, saying that the amount of data Sphinx has is already on the order of 10000 hours.   That includes "librivox recordings, transcribed podcasts, subtitled videos, real-life call recordings, voicemail messages".

So it does sound like Sphinx has the amount of data which rivals commercial companies.  I am very interested to see how we can train an acoustic model with that amount of data.)

We build it, they will come?

ASR is always shrouded in misunderstanding.   Many believe it is a solved problem, many believe it is an unsolvable problem.   99.99% of the world population is uninformed about the problem.   

I bet a lot of people would be fascinated by SETI, which .... Woa .... allows you to communicate with unknown intelligent beings in the universe,   rather than by ASR, which ..... Em ..... many basically regard as a source of satire/parody these days.  

So here comes another problem:  the public doesn't understand ASR enough to see it as an important problem.   When you think about this more,  this is a dangerous situation.   Right now, a couple of big companies control the resources for training cutting-edge speech recognizers.    So let's say in the future everyone needs to talk with a machine on a daily basis.   These big companies would be so powerful that they could control our daily life.   To be honest with you, this thought haunts me from time to time.   

I believe we should continue to spread information on how to properly use an ASR system.  At the same time, we should continue to build applications to showcase ASR and let the public understand its inner workings.   Unlike subatomic particle physics,  HMM-based ASR is not that difficult to understand.   On this part, I appreciate all the effort made by the developers of CMUSphinx, HTK, Julius and all other open source speech recognition projects.


I love the recent move of Sphinx spreading acoustic data using BitTorrent,  it is another step to work towards a self-improving speech recognition system.   There are still things we need to ponder in the open source speech community.   I mentioned a couple, feel free to bring up more in the comment section. 


Words of the Day (Dec 27, 2012)

decathect, stridulant

You can probably say,  "He decathected from Mary by making a stridulant voice all the time."


Readings at Dec 27, 2012

I have been thinking of playing with C code analysis for a while.  Then I stumbled on Eli Bendersky's pycparser.  I guess I will have some fun playing with it. 

I also strongly recommend everyone read his stuff, which I found highly informative. 


Wednesday, December 26, 2012

Me and CMU Sphinx

As I update this blog more frequently, I have noticed more and more people are being directed here.   Naturally,  there are many questions about some of my past work.   For example, "Are you still answering questions in the CMUSphinx forum?" and, in general, requests for certain tutorials.  So I guess it is time to clarify my current position and what I plan to do in future.

Yes, I am planning to work on Sphinx again but no, I probably don't want to be a maintainer-at-large any more.   Nick has proved himself to be the most awesome maintainer in our history.   Under his stewardship, Sphinx has prospered in the last couple of years.  That's what I hoped for and that's what we all hoped for.    

So for that reason, you probably won't see me much in the forum, answering questions.  Rather I will spend most of my time implementing, experimenting and getting some work done. 

There are many things that ought to be done in Sphinx.  Here is my top 5 list:
  1. Sphinx 4 maintenance and refactoring
  2. PocketSphinx's maintenance
  3. An HTKbook-like documentation : i.e. Hieroglyphs. 
  4. Regression tests on all tools in SphinxTrain.
  5. In general, modernization of Sphinx software, such as using WFST-based approach.
This is not a small undertaking so I am planning to spend a lot of time relearning the software.  Yes, you heard it right.  Learning the software.  In general, I found myself very ignorant of a lot of the software details of Sphinx in 2012.   There are many changes.  The parts I have really caught up on are probably sphinxbase, sphinx3 and SphinxTrain.   On PocketSphinx and Sphinx4, I need to learn a lot. 

That is why in this blog, you will see a lot of posts about my status in learning a certain piece of speech recognition software.   Some could be minute details.   I share them because people can figure out a lot by going through my status.   From time to time, I will also pull these posts together and form a tutorial post. 

Before I leave, let me digress and talk about this blog a little bit: other than posts on speech recognition, I will also post a lot of things about programming, languages and other technology-related stuff.  Part of it is that I am interested in many things.  The other part is that I feel working on speech recognition actually requires one to understand a lot about programming and languages.   This might also attract a wider audience in future. 

In any case,  I hope I can keep on.  And hope you enjoy my articles!


Sphinx4 from a C background : Installation of Eclipse

That's another baby step but I guess Eclipse installation is much less painful these days.

When I used Eclipse back in 2008, it was rather difficult to download and install.   Part of the reason is that the software house I worked with didn't have a strong culture of documentation.

Downloading Eclipse Juno for Java Developer was pretty easy.  My next step is to incorporate Sphinx 4 directory and do a compilation.


Sphinx4 from a C background : first few steps

As I set out earlier,  one of my goals is to grok all of the components.  I challenged myself to work with Java, in which I feel less proficient than in C/C++/Python/Perl.

What should you think when you go from one language to another?  One and only one answer : don't make a judgement too early.  

For example, compilation of Sphinx4 takes 3 steps:
  1. Download and install the JDK. 
  2. Download and install ant. 
  3. Run ant.
If you haven't used the JDK or ant, or have never looked at a build.xml, you would feel a bit overwhelmed.    But be patient, there are a lot of goodies in Java.  Most of them are very well thought out in terms of software engineering. 

I followed the process.  Woa,  Sphinx 4 is now at beta 6 and it has grown to 366 files.   Sounds like grokking it will take some time then. 

So what would be your strategy if you want to go forward to understand a Java project such as Sphinx4?   My suggestion: download a good IDE such as Eclipse or NetBeans.

If you are like me, coming from an emacs background, learning Eclipse will take you some time as well.   But again: don't make a judgement too early.  Eclipse is nice in its own way.  (At least it's not Visual X.....)    

Practically, using Eclipse to understand the code also has its advantages.  Unlike the organization of C packages, Java software usually has a deep directory hierarchy.  Using emacs would definitely cost you more keystrokes.  The only exception I know of is JDEE.  That again will take you some setup time.

In any case, I got it started.  So, my next goal is to go through all the materials of Sphinx 4 again.  This time I demand myself to grok.   I will start from the Sphinx 4 documentation page.  Then I will expand to a source code-level understanding. 


Favorite words of the day (Dec 25, 2012)

English: avidity, glissade
Spanish: llevarse

From time to time, I will post my favorite words of the day.  Part of it is my personal record, part of it is my view on programming.  Most capable programmers I know actually know multiple languages and can discern the differences between them.

More importantly, you will find that the same word can mean different things in two languages.  Think of false cognates such as "actualmente" (currently) and "actually" (really).

So if you have trouble differentiating the usage of keywords in different programming languages (think "static"), then learning a different real language will be a way to help you.


Prof. Might's "12 resolutions for programmers"

I quoted the headers but you should all go to read the whole thing.  It makes you a better programmer and a better person.

  1. Go analog.
  2. Stay healthy.
  3. Embrace the uncomfortable.
  4. Learn a new programming language.
  5. Automate.
  6. Learn more mathematics.
  7. Focus on security.
  8. Back up your data.
  9. Learn more theory.
  10. Engage the arts and humanities.
  11. Learn new software.
  12. Complete a personal project.


Tuesday, December 25, 2012

Readings (Dec 25, 2012)

  • Electric Meat (link)
  • Web-based OS (link)
  • Advice to Aimless, Excited Programmers (link)
  • If You're Not Gonna Use It, Why Are You Building It? (link)
  • It's Like That Because It Has Always Been Like (link)
  • Dangling by a Trivial Feature (link)


Monday, December 24, 2012

Installation of Python and Pygames

I was teaching my little brother how to make a game.  Pygame naturally came to my mind as it is pretty easy to understand and program.

I have tried to use pygame on Ubuntu and Windows.  Both are fine.  On Windows though, I found that using the installers for both python and pygame is the simplest.  I was using python 2.7.  If you had installed pygame 1.7 or earlier, make sure you remove the pygame directory under the existing installation before you install.


Some Reflections on Programming Languages

This is actually a self-criticizing piece.  Oh well, but calling it a reflection doesn't hurt.

When I first started out in speech recognition, I had a notion that C++ was the best language in the world.  For daily work? "Unix commands such as cut, split work well."  To take care of most of my processing issues, I used some badly written bash scripts.  Around the middle of grad school, I started to learn that perl is lovely for string processing.   Then I thought perl was the best language in the world, except that it is a bit slow.

After C++ and perl, I then learned C, Java and Python, a little bit of Objective-C, and sampled many other languages.   For now, I will settle on saying that C and Perl are probably the two languages I am most proficient in.  I also tend to like them the most.   There is one difference between me and the twenty-something me though - instead of arguing about which language is the best, I will simply go and learn more about any programming language in the world.

Take C as an example: many would praise it as the procedural language which is closest to the machine.  I love to use C and write a lot of my algorithms in C.  But when you need to maintain and extend a C codebase, it is a source of pain because there is no built-in inheritance mechanism to work with, so programmers need to implement their own class mechanism, with many function pointers.  There is also no memory checking, so an extra step of memory checking is necessary.  Debugging is also a special skill.

Take perl.  It is very useful in text processing and has a very flexible syntax.   But this flexibility also makes perl scripts hard to read sometimes.    For example, for a loop, do you want to implement it as a foreach-loop or with a map?   Those confuse lesser programmers.  Also, when you try to maintain a large scale project in perl, many programmers have remarked to me that OOP in perl seems to "just organize the code better".

How about C++?  We love the templates, we love the structure.   In practice though, the standard changes all the time.  Most houses fix the compiler version to make sure their C++ source code compiles.

How about Java?  There is memory boundary checking.  After a year or two at a dot-com, I also learned that Tomcat servlets are a thing in web development.   It is also easy to learn and is one of the mainstream programming languages taught in school these days.  Those I dig.  What's the problem? You may say speed is an issue.  Wrong.  A lot of Java code can be optimized such that it is as fast as its C or C++ counterpart.   The issue in practice is that the process of bytecode conversion is non-trivial to many.  That is why it raises doubts in a software team on whether the language is the cause of speed issues.  

For me, I also care about the fate of Java as an open language after Oracle bought Sun Microsystems.

How about Python?  I guess this is the language I know least about.  So far, it seems to take care of a lot of the problems in perl. I found the regular expressions take some time to learn.  Other than that though, the language is quite easy to learn and quite nice to maintain.  I guess the only thing I would say is that the slight differences between different Python 2.X versions are starting to annoy me.

I guess a more important point here:  every language has its strengths and weaknesses.  In real life, you probably need to be prepared to write the same algorithm in all the languages you know.   So there is no room for you to say "Hey! Programming language A is better than programming language B. Wahahaha.  Since I am using A rather than B, I rock, you suck!"  No, rather you have got to accept that writing in an unfamiliar language is an essential part of a tech person's life.

I learned this through my spiritual predecessor, Eric Thayer, who organized the source code of SphinxTrain.  He once said to me, (I rephrase here,) "Arguing about programming languages is one of the most stupid things in the world."

Those words enlightened me.

Perhaps that is why I have been reading "C Programming: A Modern Approach", "The C++ Programming Language",  "Java in a Nutshell", "Programming Perl" and "Programming Python" from time to time, because I never feel satisfied with my skills in any of them.  I hope to learn D and Go soon and make sure I am proficient in Objective-C soon.  It will take me a lifetime to learn them, but on something deep like programming, learning, rather than arguing, seems to be a better strategy.



For a period of time, getting up was a daunting thing to me.   You see...... computers used to be a tool to let me realize myself.  I liked to work and play with one.  It was not a job.

Since when did it change for me?  It was when I started to think of a computer solely as a tool for making money.   That's how many people in the field think.  Programming is no longer a pursuit of skill.   It is a way to get a higher salary, win programming competitions and have bragging rights at the lunch table. Knowledge in speech recognition?  It is not to solve one of the biggest problems in human history.  It is for winning contracts from defense,  beating other sites and again bragging to your esteemed colleagues.   These sicken me.

In my view, it is fine to think about money.  In fact, everyone should take care of their own personal finance and have a basic understanding of economics...... BUT......  It doesn't mean everything has to be driven solely by money.   

Rather, everyone should have passion, which allows them to wake up every day, not daunted by the workload of the day, but thinking "Woa,  there are 10 cool things I want to do.  What should I work on today?" and feeling excited about life. 


Tuesday, December 18, 2012

Readings at Dec 18, 2012

From time to time, I will put interesting technology reading in my blog.   Enjoy.

  1. The value of typing code : By John Cook.  After all these years, I have got to concur that code I didn't type is not code that I grok. 
  2. The Founder's dilemma : Recommended by Joel Spolsky.  It sounds like an interesting book to check out as I am sick of overly qualitative statements in the startup world. 
  3. Tutorial on Python NLTK:  by Sujit Pal.  Python NLTK is something I have wanted to check out for a long time.  
  4. Pure Virtual Destructor in C++ : by Eli Bendersky.
  5. Dumping A C++ Object Memory Layout With Clang : by Eli Bendersky

Monday, December 17, 2012

How to Ask Questions in the Sphinx Forum?

Many go to different open source toolkits to look for a ready-to-use speech recognizer, and seldom get what they want.   Many feel disappointed and curse that developers of open source speech recognizer just couldn't catch up with commercial product.   Few know why and few decide to write about the reason.

People in the field blame Hollywood for lion share of the problem.  Indeed, many people believe ASR should work similarly to scenes of Space Odyssey 2001 or Star Trek.   We are far far away from there.   You may say SIRI is getting close.  True.   But when you look closer, SIRI doesn't always get what you say right, her strength lies on the very intelligent response system.

Unlike compilers such as GCC, speech recognition toolkit such as the CMU Sphinx project HTK are toolkits.   The mathematical models these toolkits provided were trained and fit to certain group of samples. Whereas, applications such as Google Voice or SIRI gather 100 or even 1000 times more data when they train a model.   This is the fundamental reason why you don't get the premium recognition rate you think you entitled to.

Many people (me included) see that as a problem. Unfortunately, collecting clean transcribed data has always been difficult. Voxforge is the only attempt I am aware of to resolve the issue. They are still growing, but it will be a while before they can collect enough data to rival commercial applications.

* * *
Now what does that tell you when you ask questions in the CMU Sphinx forum or other speech recognition forums? To users who expect out-of-the-box super performance, I would say "Sorry, we are not there yet." In fact, speech recognition in general is probably not yet at the level of performance shown in the original Star Trek (that would require accent adaptation and very good noise cancellation, since the characters seem to be able to use the recognizer any time they like).

How about the many users who have a little (or a lot of) programming background? I would say one important thing. As a programmer, you are probably used to looking at the code, understanding what it does, doing something cute and feeling awesome from time to time. You can't do that if you seriously want to develop a speech recognition system.

Rather, you should think like a data analyst. For example, when you feel the recognition rate is bad, what is your evidence? What is your data set? What is the size of your data set? If you have a set, can you share it? If you don't have a numerical measure, have you at least used pencil and paper to mark down some results and some mistakes? Report them when you ask questions, and you will get useful answers back.
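One common numerical measure is the word error rate (WER): the number of substituted, deleted and inserted words divided by the number of words in the reference transcripts. Below is a minimal sketch in Python, assuming you have your reference transcripts and the recognizer's hypotheses as plain strings. The function names and the two example utterances are made up for illustration and are not part of any toolkit.

```python
def word_errors(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1], len(ref)


def word_error_rate(pairs):
    """WER over (reference, hypothesis) pairs, pooled over all reference words."""
    errors = 0
    words = 0
    for ref, hyp in pairs:
        e, n = word_errors(ref, hyp)
        errors += e
        words += n
    if words == 0:
        raise ValueError("empty reference set")
    return errors / words


# Made-up example data: one substitution and one insertion over 6 reference words.
results = [
    ("one two three four", "one two tree four"),
    ("five six", "five six seven"),
]
print(f"WER: {word_error_rate(results):.1%}")  # prints "WER: 33.3%"
```

A number like this, together with the size of the test set and a few of the actual mistakes, is the kind of evidence that makes a forum question answerable.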

If you look at programming forums, many people ask questions with the source attached so that others can reproduce the problem easily. Some even go further and pinpoint the location of the problem. This is probably what you want to do if you get stuck.

* * *

Before I end this post, let's also bring up the issue of how ASR problems are usually solved. Like...... if you see performance is bad, what should you do?

Some speech recognition problems can be solved readily. For example, if you try to recognize digit strings but only get one digit at a time, chances are your grammar was written incorrectly. If you see completely crappy speech recognition performance, then I would first check whether the front-end of the decoder matches exactly the front-end used to train the models.
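The front-end check can be done mechanically. Here is a rough sketch in Python, assuming the feature extraction settings for training and decoding are each available as text with one `-name value` pair per line. The parameter names and values below are hypothetical, made up for illustration; substitute whatever settings your own training and decoding setups actually use.

```python
def parse_params(text):
    """Parse lines of the form '-name value' into a dict, skipping blanks and comments."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        params[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return params


def diff_params(train, decode):
    """Return {name: (train_value, decode_value)} for every setting that differs."""
    mismatches = {}
    for name in sorted(set(train) | set(decode)):
        t, d = train.get(name), decode.get(name)
        if t != d:
            mismatches[name] = (t, d)
    return mismatches


# Hypothetical settings, made up for illustration.
TRAIN = """
-samprate 16000
-nfilt 40
-lowerf 130
-upperf 6800
"""
DECODE = """
-samprate 8000
-nfilt 40
-lowerf 130
"""

for name, (t, d) in diff_params(parse_params(TRAIN), parse_params(DECODE)).items():
    print(f"{name}: training={t} decoding={d}")
```

A setting that is missing on one side shows up as `None`, which is worth checking too, since the decoder's default may not be what the trainer used.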

For the rest, the strength of the model is really the issue. So most of your time should be spent on learning and understanding techniques for model improvement. For example, do you want to collect data and boost your acoustic model? Or, if you know more about the domain, can you crawl some text from the web to help your language model? Those are the first ideas you should think about.

There is also an esoteric group of people in the world who ask a different question: "Can we use a different estimation algorithm to make the model better?" That is the basis of MMIE, MPE and MFE. If you find yourself mathematically proficient (perhaps you need to be very proficient......), then learning those techniques and implementing some of them would help boost performance as well. Techniques such as MMIE are just the basics; each site has its own specialized techniques, which you might want to know about.

Of course, you normally don't have to think so deeply. Adding more data is usually the first step of ASR improvement. If you start working on something more advanced, and if you can, please try to put your implementation somewhere public so that everyone in the world can try it out. These are small things to do, but I believe that if we keep doing small things right, there will be a day when we can make open source speech recognizers as good as the commercial ones.


Sunday, December 16, 2012

Landscape of Open Source Speech Recognition software at the end of 2012 (I)

Now that I am back, I have started to visit all my old friends: the open source speech recognition toolkits. The usual suspects are still around. There are also many new kids in town, so this is a good time to take a look.

It was a good exercise for me; 5 years of not thinking about open source speech recognition is a bit long. It feels like I am getting in touch with my body again.

I will skip CMU Sphinx in this blog post, as you probably know something about it if you are reading this blog. Sphinx is also quite a complicated project, so it is rather hard to describe entirely in one post. This post serves only as an overview. Most of the toolkits listed here have rich documentation. You will find much useful information there.


I checked out the Cambridge HTK web page. Disappointingly, the latest version is still 3.4.1, so we are still talking about MPE and MMIE, which are still great but not as exciting as what other new kids in town such as Kaldi offer.

HTK has always been one of my top 3 speech recognition systems, since most of my graduate work was done using HTK. There are also many tricks you can do with the tools.

I also find the toolkit's software engineering practice admirable. For example, the command-line tools are built on common libraries underneath. (Earlier versions such as 1.5 or 2.1 would restrict access to the memory allocation library HMem.) When reading the source code, you sense a lot of regularity, and there doesn't seem to be much duplicated code.

The license disallows commercial use, but that's okay. With ATK, which is released under a freer license, you can also include the decoder code in a commercial application.


Kaldi is the new kid in town. It is headed by Dr. Dan Povey, who has researched many advanced acoustic modeling techniques. His recognizer attracts much interest as it implements features such as subspace GMMs and FST-based speech recognition. Of all the toolkits, its features feel the most "modern".

I have had only a little exposure to the toolkit (but I am determined to learn more). Unlike Sphinx and HTK, it is written in C++ instead of C. As of this writing, Kaldi's compilation takes a long time and the binaries are *huge*. In my setup, it took around 5G of disk space to compile. It probably means I haven't set it up correctly ...... or more likely, the executables are not stripped. That means working actively on Kaldi's source code takes some discretion in terms of hard disk space.

Another interesting part of Kaldi is that it uses weighted finite state transducers (WFSTs) as the unifying knowledge source representation. In contrast, you could say most current open source speech recognizers use ad-hoc knowledge source representations.

Are there any differences in terms of performance, you ask? In my opinion, probably not much if you are doing an apples-to-apples comparison. The strength of using WFSTs is that when you need to introduce new knowledge, in theory you don't have to hack the recognizer. You just need to write your knowledge as an FST and compose it with your knowledge network, and then you are all set.
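To make the "write an FST and compose it" idea concrete, here is a toy sketch in plain Python. It is not how Kaldi or any real WFST library implements composition: it assumes there are no epsilon arcs, it simply adds weights along a path, and the dict-based transducer format, the function names and the tiny "lexicon" and "grammar" are all made up for illustration. The point is that `best_output` knows nothing about either knowledge source; new knowledge goes in through `compose`.

```python
from collections import deque


def compose(a, b):
    """Compose two epsilon-free weighted transducers (weights add along a path).

    Each transducer is a dict with keys:
      "start":  start state
      "finals": set of final states
      "arcs":   {state: [(input, output, weight, next_state), ...]}
    """
    start = (a["start"], b["start"])
    result = {"start": start, "finals": set(), "arcs": {}}
    queue = deque([start])
    seen = {start}
    while queue:
        qa, qb = state = queue.popleft()
        if qa in a["finals"] and qb in b["finals"]:
            result["finals"].add(state)
        arcs = result["arcs"].setdefault(state, [])
        for ia, oa, wa, na in a["arcs"].get(qa, []):
            for ib, ob, wb, nb in b["arcs"].get(qb, []):
                if oa != ib:
                    continue  # output of a must match input of b
                nxt = (na, nb)
                arcs.append((ia, ob, wa + wb, nxt))
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return result


def best_output(fst, inputs):
    """Cheapest (cost, outputs) for an input sequence, or None if it is rejected."""
    # hyps maps state -> (cost, outputs) for the best partial path reaching it.
    hyps = {fst["start"]: (0.0, [])}
    for symbol in inputs:
        next_hyps = {}
        for state, (cost, outs) in hyps.items():
            for i, o, w, nxt in fst["arcs"].get(state, []):
                if i != symbol:
                    continue
                cand = (cost + w, outs + [o])
                if nxt not in next_hyps or cand[0] < next_hyps[nxt][0]:
                    next_hyps[nxt] = cand
        hyps = next_hyps
    finals = [h for s, h in hyps.items() if s in fst["finals"]]
    return min(finals, key=lambda h: h[0]) if finals else None


# Toy "lexicon": maps spoken variants to canonical words.
LEXICON = {"start": 0, "finals": {0}, "arcs": {0: [
    ("oh", "zero", 0.0, 0), ("zero", "zero", 0.0, 0), ("one", "one", 0.0, 0)]}}

# Toy "grammar": accepts exactly two digits, with a cost per word.
GRAMMAR = {"start": 0, "finals": {2}, "arcs": {
    0: [("zero", "zero", 1.0, 1), ("one", "one", 0.5, 1)],
    1: [("zero", "zero", 1.0, 2), ("one", "one", 0.5, 2)]}}

composed = compose(LEXICON, GRAMMAR)
print(best_output(composed, ["oh", "one"]))  # (1.5, ['zero', 'one'])
print(best_output(composed, ["oh"]))         # None: the grammar wants two digits
```

Note that the composed machine's states are pairs of states from the two inputs, so in the worst case its size is the product of their sizes.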

In reality, WFST-based technology still seems to have practical problems. As the vocabulary grows large and the knowledge sources get more complicated, the composed decoding WFST can easily outgrow the system memory. As a result, many sites have proposed different techniques to make the decoding algorithm work.

Those are downsides, but the appeal of the technique should not be overlooked. That's why Kaldi has recently become one of my favorite toolkits.


Julius is still around! And I am absolutely jubilant about it. Julius is a high-speed speech recognizer which can decode a 60k vocabulary. One of the speed-up techniques in Sphinx 3.X was context-independent phone Gaussian mixture model selection (CIGMMS), and I borrowed this idea from Julius when I first wrote it.

Julius is only a decoder, and the beauty of it is that it never claims to be more than that. Accompanying the software is a new Juliusbook, which is the guide on how to use it. I think the documentation goes into greater depth than that of other similar toolkits.

Julius comes with a set of Japanese models, not English. This might be one of the reasons why it is not as popular (or rather, not as talked about) as HTK/Sphinx/Kaldi.

(Note at 20130320: I later learned that Julius also comes with an English model now. In fact, some anecdotes suggest the system is more accurate than Sphinx 4 on broadcast news. I am not surprised. HTK was used as the acoustic model trainer.)

So far......

I went through three of my favorite recognition toolkits.  In the next post, I will cover several other toolkits available. 


The Grand Janitor's Blog

For the last year or so, I have been intermittently playing with several components of CMU Sphinx. It is an intermittent effort because I am wearing several hats at Voci.

I find myself going back to Sphinx more and more often. Being more experienced, I have started to approach the project again carefully: tracing code, taking notes and understanding what has been going on. It has been a humbling experience: speech recognition has changed, and Sphinx has improved more than I could have imagined.

Maintaining sphinx3 (and occasionally dipping into SphinxTrain) was one of the greatest experiences of my life. Unfortunately, not many of my friends know. So Sphinx and I were pretty much disconnected for several years.

So, what I plan to do is to reconnect. One thing I have done throughout the last 5 years is blogging, so my first goal is to revamp this page.

Let's start small: I just restarted the RSS feeds. You may also see some cross links to my other two blogs: Cumulomaniac, a site with my take on life, Hong Kong affairs and other semi-brainy topics, and 333 weeks, a chronicle of my thoughts on technical management and startup business.

Both sites are in Chinese. I have been actively working on them and try to update them weekly.

So why do I keep this blog then? Obviously the reason is speech recognition. Though, I have started to realize that doing speech recognition involves much more than just writing a speech recognizer. So from now on, I will also post on other topics such as natural language processing, video processing and plenty of low-level programming.

This will mean it is a very niche blog. Could I keep up at all? I don't know. As with my other blogs, I will try to write around 50 messages first and see if there is any momentum.


Friday, December 14, 2012

Self Criticism : Hieroglyph

When I was working on CMU Sphinx, I was more of an aggressive young guy who loved to start many projects (I still am). So I started many projects, and not many of them were completed. I wasn't completely insane: what we lacked at that point of development was passion and momentum. So working on many things gave a sense that we were moving forward.

One of those projects, for which I feel I should be responsible, is the Hieroglyph. It was meant to be a complete set of documentation on how several Sphinx components work together. But when I finished the 3rd draft, my startup work kicked in. That's why what you can see is only an incomplete form of the document.

Fast-forward 6 years, and it is unfortunate that the document is still the most comprehensive source if you want to understand the underlying structure and methods of CMU Sphinx's C-based executables. The current CMU Sphinx encompasses way more than I decided to cover. For example, the Java-based Sphinx4 has gained a large following. And pocketsphinx is pretty much the de facto recognizer for embedded speech recognition.

If you have been following me (unlikely but possible), you know I have personally changed substantially. For example, my job experience taught me that Java is a very important language and that having a recognizer in Java would significantly boost the project. I also feel embedded speech recognition is probably the real future of our lives.

Back to Hieroglyph: suffice it to say it is not yet a sufficient document. I hope that I can go back to it and ask what I can do to make it better.


New Triplet is Released

I just learned this from CMUSphinx's main site. It sounds like a new triplet, including sphinxbase and SphinxTrain, has been released.

I took a look at the changes. Most of them work towards better reuse between SphinxTrain and sphinxbase. I think this is very encouraging.

There have been around 600-700 SVN updates since the last major release of the triplet. I think Nick and the SF guys are doing a great job on the toolkit.

As for training, one encouraging part is that there are efforts to improve the training procedure. I have always maintained that model training is the heart of speech recognition. A good model is the key to getting good speech recognition performance. And great performance is the key to a great user experience.

When will CMU Sphinx walk on the right path?   I am still waiting but I am increasingly optimistic.


(PS. I have nothing to do with this release.  Though, I guess it's time to go back to actual open-source coding.)