Thursday, December 27, 2012

Speech Recognition vs SETI

If you track news of CMUSphinx, you may have noticed that the Sourceforge guys have started to distribute data through BitTorrent (link).

That's a great move.   One of the issues in ASR is the lack of machine power for training.  To give a blunt example, it is possible to squeeze out extra performance just by searching for the best training parameters.    Not to mention that many modern training techniques take considerable time to run.

I do recommend that all of you help the effort.  Again, I am not involved at all; I just feel that it is a great cause.

Of course, things in ASR are never easy, so I want to raise two subtle points about the whole distributed approach to training.


Improvement over the years?

The first question you may ask: does that mean ASR can be like a project such as SETI, automatically improving over the years?  Not yet; ASR still has its own unique challenges.

The major challenge I see is how we can incrementally grow the pool of phonetically balanced, transcribed audio.   Note that it is not just audio, but transcribed audio.  Meaning: someone needs to listen to the audio, spending 5-10 times real time to write down, word by word, what the audio really says.   All these transcriptions then need to be cleaned up and put into a consistent format.
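To make the cost concrete, here is a back-of-the-envelope sketch of that 5-10x real-time factor (the 100-hour corpus size below is a made-up example):

```python
# Rough human cost of hand-transcribing a corpus, using the
# 5-10x real-time factor mentioned above.
def transcription_hours(audio_hours, factor_low=5, factor_high=10):
    """Return (low, high) estimates of transcriber hours needed."""
    return audio_hours * factor_low, audio_hours * factor_high

low, high = transcription_hours(100)  # a hypothetical 100-hour corpus
print(f"100 hours of audio needs roughly {low}-{high} hours of labor")
```

That is months of full-time human work for a corpus that is still tiny by industrial standards.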

This is what Voxforge tries to achieve, and it's not a small undertaking.   Of course, compared to the speed of industry development, the progress is still too slow.  The last time I heard, Google was training their acoustic model with 38000 hours of data.   The WSJ corpus is a toy task compared to that.

Now, thinking this way, let's say we want to build the best recognizer through open source: what is the bottleneck?  I bet the answer doesn't lie in machine power; whether we have enough transcribed data would be the key.   So that's something to ponder.

(Added Dec 27, 2012: on the question of the initial amount of data, Nickolay corrected me, saying that the amount of data Sphinx has is already on the order of 10000 hours.   That includes "librivox recordings, transcribed podcasts, subtitled videos, real-life call recordings, voicemail messages".

So it does sound like Sphinx has an amount of data that rivals the commercial companies'.  I am very interested to see how we can train an acoustic model with that much data.)

We build it, they will come?

ASR has always been shrouded in misunderstanding.   Many believe it is a solved problem; many believe it is an unsolvable problem.   99.99% of the world's population is uninformed about the problem.

I bet a lot of people would be fascinated by SETI, which .... woa .... lets you communicate with unknown intelligent beings in the universe.   Rather than by ASR, which ..... em ..... many these days regard mainly as a source of satire and parody.

So here comes another problem: the public doesn't understand ASR well enough to see it as an important problem.   When you think about it more, this is a dangerous situation.   Right now, a couple of big companies control the resources for training cutting-edge speech recognizers.    So let's say that in the future everyone needs to talk to a machine on a daily basis.   These big companies would be so powerful that they could control our daily lives.   To be honest with you, this thought haunts me from time to time.

I believe we should continue to spread information on how to properly use an ASR system, and at the same time continue to build applications that showcase ASR and let the public understand its inner workings.   Unlike subatomic particle physics, HMM-based ASR is not that difficult to understand.   On this front, I appreciate all the effort put in by the developers of CMUSphinx, HTK, Julius and all the other open source speech recognition projects.


Conclusion

I love the recent move of Sphinx spreading acoustic data using BitTorrent; it is another step towards a self-improving speech recognition system.   There are still things we need to ponder in the open source speech community.   I mentioned a couple; feel free to bring up more in the comment section.

Arthur




Words of the Day (Dec 27, 2012)

decathect, stridulant

You can probably say, "He decathected from Mary by making stridulant noises all the time."

Arthur

Readings at Dec 27, 2012


I have been thinking of playing with C code analysis for a while.  Then I stumbled on Eli Bendersky's pycparser; I guess I will have some fun playing with it.

I also strongly recommend that everyone read his stuff, which I find highly informative.

Arthur

Wednesday, December 26, 2012

Me and CMU Sphinx

As I update this blog more frequently, I have noticed more and more people being directed here.   Naturally, there are many questions about my past work.   For example, "Are you still answering questions in the CMUSphinx forum?", and general requests for certain tutorials.  So I guess it is time to clarify my current position and what I plan to do in the future.

Yes, I am planning to work on Sphinx again, but no, I don't expect to be a maintainer-at-large any more.   Nick has proven himself to be the most awesome maintainer in our history.   Through his stewardship, Sphinx has prospered in the last couple of years.  That's what I hope for and that's what we all hope for.

So for that reason, you probably won't see me much in the forum answering questions.  Rather, I will spend most of my time implementing, experimenting and getting some work done.

There are many things that ought to be done in Sphinx.  Here is my top 5 list:
  1. Sphinx 4 maintenance and refactoring
  2. PocketSphinx's maintenance
  3. An HTKbook-like documentation : i.e. Hieroglyphs. 
  4. Regression tests on all tools in SphinxTrain.
  5. In general, modernization of Sphinx software, such as using WFST-based approach.
This is not a small undertaking, so I am planning to spend a lot of time relearning the software.  Yes, you heard it right: learning the software.  In general, I found myself very ignorant of a lot of the software details of Sphinx as of 2012.   There have been many changes.  The parts I have really caught up on are probably sphinxbase, sphinx3 and SphinxTrain.   On PocketSphinx and Sphinx4, I need to learn a lot.

That is why in this blog you will see a lot of posts about my progress in learning a certain piece of speech recognition software.   Some could be minute details.   I share them because people can figure out a lot by going through my status updates.   From time to time, I will also pull these posts together into a tutorial post.

Before I leave, let me digress and talk about this blog a little bit: other than posts on speech recognition, I will also post a lot about programming, languages and other technology-related topics.  Partly because I am interested in many things; partly because I feel working on speech recognition actually requires one to understand a lot about programming and languages.   This might also attract a wider audience in the future.

In any case, I hope I can keep it up.  And I hope you enjoy my articles!

Arthur

Sphinx4 from a C background : Installation of Eclipse

That's another baby step, but I guess Eclipse installation is much less painful these days.

When I used Eclipse back in 2008, it was rather difficult to download and install.   Part of the reason is that the software house I worked for didn't have a strong culture of documentation.

Downloading Eclipse Juno for Java Developers was pretty easy.  My next step is to bring the Sphinx 4 directory into Eclipse and do a compilation.

Arthur


Sphinx4 from a C background : first few steps

As I set out earlier, one of my goals is to grok all of the components.  I challenged myself to work with Java, in which I feel less proficient than in C/C++/Python/Perl.

What should you think when you go from one language to another?  One and only one answer: don't make a judgment too early.

For example, compilation of Sphinx4 takes three steps:
  1. Download and install the JDK. 
  2. Download and install ant. 
  3. Run ant.
If you haven't used the JDK or ant, or have never looked at a build.xml, you might feel a bit overwhelmed.    But be patient; there are a lot of goodies in the Java world, most of them very well thought out in terms of software engineering.

I followed the process.  Woa, Sphinx 4 is now at beta 6 and has grown to 366 files.   Sounds like grokking it will take some time.

So what should your strategy be if you want to understand a Java project such as Sphinx4?   My suggestion: download a good IDE such as Eclipse or NetBeans.

If you are like me, coming from an emacs background, learning Eclipse will take you some time as well.   But again: don't make a judgment too early.  Eclipse is nice in its own way.  (At least it's not Visual X.....)

Practically, using Eclipse to understand the code also has its advantages.  Unlike typical C package organization, Java software usually has a deep directory hierarchy.  Using emacs would definitely cost you more keystrokes.  The only exception I know of is JDEE, and that again will take you some setup time.

In any case, I have got started.  My next goal is to go through all the materials on Sphinx 4 again.  This time I demand of myself to grok.   I will start from the Sphinx 4 documentation page, then expand to source-code-level understanding.

Arthur

Favorite words of the day (Dec 25, 2012)

English: avidity, glissade
Spanish: llevarse

From time to time, I will post my favorite words of the day.  Partly as a personal record, partly because of my view on programming: the most capable programmers I know actually know multiple languages and can discern the differences between them.

More importantly, you will find that the same word can mean different things in two languages.  Think of false cognates such as "actualmente" (currently) and "actually" (really).

So if you have trouble differentiating the usage of keywords in different programming languages (think "static"), then learning another natural language is one way to help yourself.

Arthur

Prof. Might's "12 resolutions for programmers"

I quote the headers here, but you should all go read the whole thing.  It will make you a better programmer and a better person.

"
  1. Go analog.
  2. Stay healthy.
  3. Embrace the uncomfortable.
  4. Learn a new programming language.
  5. Automate.
  6. Learn more mathematics.
  7. Focus on security.
  8. Back up your data.
  9. Learn more theory.
  10. Engage the arts and humanities.
  11. Learn new software.
  12. Complete a personal project.
"

Arthur

Tuesday, December 25, 2012

Readings (Dec 25, 2012)

  • Electric Meat (link)
  • Web-based OS (link)
  • Advice to Aimless, Excited Programmers (link)
  • If You're Not Gonna Use It, Why Are You Building It? (link)
  • It's Like That Because It Has Always Been Like That (link)
  • Dangling by a Trivial Feature (link)

Arthur

Monday, December 24, 2012

Installation of Python and Pygames

I was teaching my little brother how to make a game.  Pygame naturally came to mind, as it is pretty easy to understand and program with.

I have tried pygame on both Ubuntu and Windows.  Both are fine.  On Windows though, I found that using the installers for both python and pygame is simplest.  I was using python 2.7.  If you had installed pygame 1.7 or earlier, make sure you remove the pygame directory under the existing installation before you install.

Arthur




Some Reflections on Programming Languages

This is actually a self-criticizing piece.  Oh well, but calling it a reflection doesn't hurt.

When I first started out in speech recognition, I had a notion that C++ was the best language in the world.  For daily work? "Unix commands such as cut and split work well."  To take care of most of my processing issues, I used some badly written bash shell scripts.  Around the middle of grad school, I started to learn that perl is lovely for string processing.   Then I thought perl was the best language in the world, except that it is a bit slow.

After C++ and perl, I learned C, Java and Python, picked up a little bit of Objective-C, and sampled many other languages.   For now, I will settle on C and Perl as probably the two languages I am most proficient in.  I also tend to like them the most.   There is one difference between me and the twenty-something me though: instead of arguing about which language is the best, I simply go and learn more about any programming language in the world.

Take C as an example: many would praise it as the procedural language closest to the machine.  I love to use C and write a lot of my algorithms in it.  But when you need to maintain and extend a C codebase, it is a source of pain: there is no inherent inheritance mechanism, so programmers need to implement their own class machinery, usually with many function pointers.  There is also no memory checking, so an extra step of memory checking is necessary.  Debugging is a special skill too.

Take perl.  It is very useful for text processing and has a very flexible syntax.   But this flexibility also makes perl scripts hard to read sometimes.    For example, for a loop, do you implement it as a foreach loop or with a map?   Such choices confuse lesser programmers.  Also, when you try to maintain a large-scale project in perl, many programmers remark to me that OOP in perl seems to "just organize the code better".

How about C++?  We love the templates; we love the structure.   In practice though, the standard changes all the time.  Most houses pin the compiler version to make sure their C++ source code keeps compiling.

How about Java?  There is memory bounds checking.  After a year or two at a dot-com, I also learned that Tomcat servlets are a thing in web development.   Java is also easy to learn and is one of the mainstream programming languages taught in school these days.  Those I dig.  What's the problem?  You may say speed is an issue.  Wrong.  A lot of Java code can be optimized to be as fast as its C or C++ counterpart.   The issue in practice is that the process of bytecode conversion is non-trivial to many.  That is why it raises doubts within a software team about whether the language is the cause of speed issues.

I also care about the fate of Java as an open language after Oracle bought Sun Microsystems.

How about Python?  I guess this is the language I know least about.  So far, it seems to take care of a lot of the problems in perl.  I found that the regular expression module takes some time to learn, though other than that, the language is quite easy to learn and quite nice to maintain.  I guess the only thing I would say is that the slight differences between Python 2.X versions are starting to annoy me.

I guess the more important point here is: every language has its strengths and weaknesses.  In real life, you probably need to be prepared to write the same algorithm in any language you know.   So there is no room to say, "Hey! Programming language A is better than programming language B. Wahahaha.  Since I am using A rather than B, I rock, you suck!"  No; rather, you have to accept that writing in an unfamiliar language is an essential part of a tech person's life.

I learned this through my spiritual predecessor, Eric Thayer, who organized the source code of SphinxTrain.  He once said to me (I rephrase here), "Arguing about programming languages is one of the stupidest things in the world."

Those words enlightened me.

Perhaps that is why I have been reading "C Programming: A Modern Approach", "The C++ Programming Language", "Java in a Nutshell", "Programming Perl" and "Programming Python" from time to time: I never feel satisfied with my skills in any of them.  I hope to learn D and Go soon, and to make sure I am proficient in Objective-C.  It will take me a lifetime to learn them all, but with something as deep as programming, learning, rather than arguing, seems to be the better strategy.

Arthur


Passion

For a period of time, getting up was a daunting thing for me.   You see...... computers used to be a tool that let me realize myself.  I liked to work and play with one.  It was not a job.

When did that change for me?  It was when I started to think of a computer solely as a tool for making money.   That's how many people in the field think.  Programming is no longer a pursuit of skill.   It is a way to get a higher salary, win programming competitions and have bragging rights at the lunch table.  Knowledge of speech recognition?  It is not for solving one of the biggest problems in human history.  It is for winning contracts from defense, beating other sites and, again, bragging to your esteemed colleagues.   These things sicken me.

In my view, it is fine to think about money.  In fact, everyone should take care of their own personal finances and have a basic understanding of economics...... BUT......  it doesn't mean everything has to be driven solely by money.

Rather, everyone should have a passion which lets them wake up every day, not daunted by the workload of the day, but thinking, "Woa, there are 10 cool things I want to do.  Which should I work on today?" and feeling excited about life.

Arthur

Tuesday, December 18, 2012

Readings at Dec 18, 2012

From time to time, I will put interesting technology readings on my blog.   Enjoy.

  1. The value of typing code : by John Cook.  After all these years, I have to concur that code I didn't type is not code that I grok. 
  2. The Founder's Dilemmas : recommended by Joel Spolsky.  It sounds like an interesting book to check out, as I am sick of overly qualitative statements in the startup world. 
  3. Tutorial on Python NLTK : by Sujit Pal.  Python NLTK is something I have wanted to check out for a long time.  
  4. Pure Virtual Destructor in C++ : by Eli Bendersky.
  5. Dumping A C++ Object Memory Layout With Clang : by Eli Bendersky
Arthur

Monday, December 17, 2012

How to Ask Questions in the Sphinx Forum?

Many people go to different open source toolkits looking for a ready-to-use speech recognizer, and seldom get what they want.   Many feel disappointed and complain that the developers of open source speech recognizers just can't catch up with commercial products.   Few know why, and fewer decide to write about the reason.

People in the field blame Hollywood for the lion's share of the problem.  Indeed, many people believe ASR should work like the scenes in 2001: A Space Odyssey or Star Trek.   We are far, far away from that.   You may say SIRI is getting close.  True.   But when you look closer, SIRI doesn't always get what you say right; her strength lies in her very intelligent response system.

Unlike compilers such as GCC, speech recognition packages such as CMU Sphinx and HTK are toolkits.   The mathematical models these toolkits provide were trained to fit a certain group of samples, whereas applications such as Google Voice or SIRI gather 100 or even 1000 times more data when they train a model.   This is the fundamental reason why you don't get the premium recognition rate you think you are entitled to.

Many people (me included) see that as a problem.  Unfortunately, collecting clean, transcribed data has always been a problem.   Voxforge is the only attempt I am aware of to resolve the issue.    They are still growing, but it will be a while before they can collect enough data to rival the commercial applications.

* * *
Now what does that tell you when you ask questions in the CMU Sphinx or other speech recognition forums?   For users who expect out-of-the-box super performance, I would say, "Sorry, we are not there yet."  In fact, speech recognition in general is probably not yet at the performance level shown in the original Star Trek (that would require accent adaptation and very good noise cancellation, since the characters seem to be able to use the recognizer any time they like).

How about the many users who have a little (or a lot of) programming background?  I would say one important thing.  As a programmer, you are probably used to looking at code, understanding what it does, doing something cute and feeling awesome from time to time.  You can't work that way if you seriously want to develop a speech recognition system.

Rather, you should think like a data analyst.  For example, when you feel the recognition rate is bad, what is your evidence?  What is your data set?  What is the size of your data set?  If you have a set, can you share it?   If you don't have a numerical measure, have you at least used pencil and paper to mark down some results and some mistakes?  Report these when you ask questions, and you will get useful answers back.
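As a concrete example of such a numerical measure, word error rate (WER) is the standard one: the edit distance between the reference transcript and the recognizer's hypothesis, divided by the number of reference words.  A minimal sketch:

```python
# Word error rate: edit distance (substitutions, insertions, deletions)
# between reference and hypothesis word sequences, over reference length.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i ref words and first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(round(wer, 3))  # → 0.167 (one substitution out of six words)
```

"WER of 16.7% on my 500-utterance test set" is the kind of evidence that gets useful answers back.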

If you look at programming forums, many people ask questions with source code, so that others can reproduce the problem easily.    Some go further and pinpoint the location of the problem.    This is probably what you want to do if you get stuck.

* * *

Before I end this post, let's also bring up the question of how ASR problems are usually solved.  Like...... if you see that performance is bad, what should you do?

Some speech recognition problems can be solved readily.  For example, if you try to recognize digit strings but only get one digit at a time, chances are your grammar was written incorrectly.  If you see completely crappy speech recognition performance, then I would first check whether the front-end of the decoder matches exactly the front-end used to train the models.
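That front-end check can be made mechanical.  Below is a sketch that diffs two key-value parameter files in the spirit of Sphinx's feat.params; the parameter names and values here are invented for illustration:

```python
# Compare the front-end settings the decoder uses against the ones the
# acoustic model was trained with.  A mismatch in sample rate, filter
# count, feature type, etc. is a classic cause of terrible accuracy.
def parse_params(text):
    """Parse "-key value" lines into a dict."""
    params = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            params[parts[0]] = parts[1]
    return params

def mismatches(train_text, decode_text):
    train, decode = parse_params(train_text), parse_params(decode_text)
    return {k: (train.get(k), decode.get(k))
            for k in set(train) | set(decode)
            if train.get(k) != decode.get(k)}

train_cfg = "-samprate 16000\n-nfilt 40\n-feat 1s_c_d_dd"   # made-up
decode_cfg = "-samprate 8000\n-nfilt 40\n-feat 1s_c_d_dd"   # made-up
print(mismatches(train_cfg, decode_cfg))  # → {'-samprate': ('16000', '8000')}
```

A 16 kHz model decoding 8 kHz audio, as flagged here, is exactly the kind of silent mismatch that produces "completely crappy" output.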

For the rest, the strength of the model is really the issue.   So most of your time should be spent learning and understanding techniques of model improvement.    For example, do you want to collect data to boost your acoustic model?  Or, if you know more about the domain, can you crawl some text from the web to help your language model?   Those are the first ideas you should think about.

There is also an esoteric group of people in the world who ask a different question: "Can we use a different estimation algorithm to make the model better?"  That is the basis of MMIE, MPE and MFE.   If you find yourself mathematically proficient (you perhaps need to be very proficient......), then learning those techniques and implementing some of them can help boost performance as well.   The ones I mentioned, such as MMIE, are just the basics; each site has its own specialized techniques that you might want to know.

Of course, you normally don't have to think that deep.   Adding more data is usually the first step of ASR improvement.    If you do start on something advanced, and if you can, please put your implementation somewhere public so that everyone in the world can try it out.   These are small things, but I believe that if we keep doing the small things right, there will be a day when open source speech recognizers are as good as the commercial ones.

Arthur

Sunday, December 16, 2012

Landscape of Open Source Speech Recognition software at the end of 2012 (I)

Now that I am back, I have started to visit all my old friends: the open source speech recognition toolkits.  The usual suspects are still around, and there are also many new kids in town, so this is a good place to take a look.

It was a good exercise for me; 5 years of not thinking about open source speech recognition is a bit long.   It feels like I am getting in touch with my body again.

I will skip CMU Sphinx in this blog post, as you probably know something about it if you are reading this blog.   Sphinx is also quite a complicated project, so it is rather hard to describe entirely in one post.   This post serves only as an overview.  Most of the toolkits listed here have rich documentation; you will find much useful information there.

HTK

I checked out the Cambridge HTK web page.  Disappointingly, the latest version is still 3.4.1, so we are still talking about MPE and MMIE, which are still great but not as exciting as the new kids in town such as Kaldi.

HTK has always been one of my top 3 speech recognition systems, since most of my graduate work was done using it.   There are also many tricks you can do with the tools.

As a toolkit, I also find its software engineering practice admirable.   For example, the commands are built on top of common libraries underneath.  (Earlier versions such as 1.5 or 2.1 would restrict access to the memory allocation library HMem.)   When reading the source code, you feel much regularity, and there doesn't seem to be much duplicated code.

The license disallows commercial use, but that's okay.  With ATK, which is released under a freer license, you can include the decoder code in a commercial application.

Kaldi

The new kid in town.   It is headed by Dr. Dan Povey, who has researched many advanced acoustic modeling techniques.   His recognizer attracts much interest, as it has implemented features such as subspace GMMs and FST-based decoding.   Above all, these features feel more "modern".

I have only had a little exposure to the toolkit (but I am determined to learn more).   Unlike Sphinx and HTK, it is written in C++ instead of C.   As of this writing, Kaldi's compilation takes a long time and the binaries are *huge*.   In my setup, it took around 5G of disc space to compile.   That probably means I haven't set it up correctly ...... or, more likely, that the executables are not stripped.   Either way, working on Kaldi's source code actively takes some discretion in terms of disc space.

Another interesting part of Kaldi is that it uses the weighted finite state transducer (WFST) as the unifying representation of knowledge sources.   By contrast, you may say that most current open source speech recognizers use ad-hoc knowledge sources.

Are there any differences in terms of performance, you ask?  In my opinion, probably not much if you are doing an apples-to-apples comparison.   The strength of the WFST approach is that when you need to introduce new knowledge, in theory you don't have to hack the recognizer.   You just write your knowledge as an FST and compose it with your knowledge network, and you are all set.
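To make "compose it with your knowledge network" concrete, here is a toy, epsilon-free sketch of transducer composition.  Real toolkits such as OpenFst also handle weights, epsilon labels and optimization; none of that appears here, and the two tiny transducers are invented examples:

```python
# A toy FST: state -> list of (input_label, output_label, next_state).
# Composition matches the first machine's output labels against the
# second machine's input labels (unweighted, epsilon-free sketch).
def compose(fst_a, fst_b, start=(0, 0)):
    arcs, todo, seen = {}, [start], {start}
    while todo:
        sa, sb = todo.pop()
        out = arcs.setdefault((sa, sb), [])
        for ilab, mid, na in fst_a.get(sa, []):
            for mlab, olab, nb in fst_b.get(sb, []):
                if mid == mlab:  # labels must agree to compose
                    out.append((ilab, olab, (na, nb)))
                    if (na, nb) not in seen:
                        seen.add((na, nb))
                        todo.append((na, nb))
    return arcs

# In a recognizer, fst_a might be a lexicon (phones -> words) and fst_b
# a grammar over words; here they are just two-arc toy machines.
fst_a = {0: [("a", "b", 1)], 1: [("c", "d", 2)]}
fst_b = {0: [("b", "x", 1)], 1: [("d", "y", 2)]}
print(compose(fst_a, fst_b)[(0, 0)])  # → [('a', 'x', (1, 1))]
```

The composed machine reads fst_a's input symbols and emits fst_b's output symbols, which is exactly how a new knowledge source is folded in without touching the decoder.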

In reality, the WFST-based technology still seems to have practical problems.  As the vocabulary grows large and the knowledge sources get more complicated, the composed decoding WFST naturally outgrows system memory.   As a result, many sites propose different techniques to make the decoding algorithm work.

Those are the downsides, but the appeal of the technique should not be overlooked.   That's why Kaldi has become one of my favorite toolkits recently.

Julius

Julius is still around!  And I am absolutely jubilant about it.  Julius is a high-speed speech recognizer which can decode with a 60k-word vocabulary.  One of the speed-up techniques of Sphinx 3.X was context-independent phone Gaussian mixture model selection (CIGMMS), and I borrowed that idea from Julius when I first wrote it.

Julius is only a decoder, and the beauty of it is that it never claims to be more than that.  Accompanying the software is the new Juliusbook, the guide on how to use it.  I think this documentation goes into greater depth than comparable documentation elsewhere.

Julius comes with a set of Japanese models, not English ones.   This might be one of the reasons why it is not as popular (or as talked about) as HTK/Sphinx/Kaldi.

(Note at 20130320: I later learned that Julius now comes with an English model too.  In fact, some anecdotes suggest the system is more accurate than Sphinx 4 on broadcast news.  I am not surprised; HTK was used as the acoustic model trainer.)

So far......

I have gone through three of my favorite recognition toolkits.  In the next post, I will cover several other available toolkits.

Arthur



The Grand Janitor's Blog

For the last year or so, I have been intermittently playing with several components of CMU Sphinx.  It is an intermittent effort because I wear several hats at Voci.

I find myself going back to Sphinx more and more often.   Being more experienced now, I have started to approach the project carefully again: tracing code, taking notes and understanding what has been going on.  It has been a humbling experience: speech recognition has changed, and Sphinx has improved more than I could have imagined.

The life of maintaining sphinx3 (and occasionally dipping into SphinxTrain) was one of the greatest experiences of my life.   Unfortunately, not many of my friends knew about it, so Sphinx and I were pretty much disconnected for several years.

So, what I plan to do is reconnect.    One thing I have done throughout the last 5 years is blogging, so my first goal is to revamp this page.

Let's start small: I have just restarted the RSS feeds.   You may also see some cross links to my other two blogs: Cumulomaniac, a site with my take on life, Hong Kong affairs and other semi-brainy topics, and 333 weeks, a chronicle of my thoughts on technical management and startup business.

Both sites are in Chinese; I have been actively working on them and try to update them weekly.

So why do I keep this blog?  Obviously, the reason is speech recognition.   Though I have started to realize that doing speech recognition involves much more than just writing a speech recognizer.   So from now on, I will also post on other topics such as natural language processing, video processing and low-level programming.

This makes it a very niche blog.   Can I keep it up at all?  I don't know.   As with my other blogs, I will try to write around 50 posts first and see if there is any momentum.

Arthur

Friday, December 14, 2012

Self Criticism : Hieroglyph

When I was working on CMU Sphinx, I was an aggressive young guy who loved to start many projects (I still am).   So I started many projects, and not many of them were completed.   I wasn't completely insane: what was lacking at that point of development was passion and momentum, and working on many things gave us a sense that we were moving forward.

One of the projects for which I feel responsible is Hieroglyph.   It was meant to be a complete set of documentation on how the several Sphinx components work together.   But when I finished the 3rd draft, my startup work kicked in.    That's why what you can see is only an incomplete form of the document.

Fast-forward 6 years, and it is unfortunate that the document is still the most comprehensive source on Sphinx if you want to understand the underlying structure and methods of the CMU Sphinx C-based executables.     The current CMU Sphinx encompasses way more than I decided to cover.   For example, the Java-based Sphinx4 has gained a large following, and PocketSphinx is pretty much the de-facto speech recognizer for embedded speech recognition.

If you have been following me (unlikely but possible), I have personally changed substantially.   For example, my job experience taught me that Java is a very important language, and that having a recognizer in Java would significantly boost the project.    I also feel embedded speech recognition is probably the real future of our field.

Back to Hieroglyph: suffice to say it is not yet a sufficient document.   I hope I can go back to it and ask what I can do to make it better.

Arthur

New Triplet is Released

I just learned from the CMUSphinx main site that a new triplet of sphinxbase, PocketSphinx and SphinxTrain has been released.

http://cmusphinx.sourceforge.net/2012/12/new-release-sphinxbase-0-8-pocketsphinx-0-8-and-sphinxtrain-0-8/

I took a look at the changes.   Most of them work towards better reuse between SphinxTrain and sphinxbase.   I think this is very encouraging.

There have been around 600-700 SVN updates since the last major triplet release.   I think Nick and the SF guys are doing a great job on the toolkit.

As for training, one encouraging part is that there are efforts to improve the training procedure.   I have always maintained that model training is the heart of speech recognition.   A good model is the key to good recognition performance, and great performance is the key to a great user experience.

When will CMU Sphinx walk on the right path?   I am still waiting, but I am increasingly optimistic.

Arthur

(P.S. I have nothing to do with this release.  Though I guess it's time to go back to actual open-source coding.)

Monday, June 04, 2012

CMU Sphinx Documentation

I was browsing the documentation section of cmusphinx.org and was very impressed.   Compared to my ad-hoc documents back at www.cs.cmu.edu/~archan, or the old robust group documents, it is a huge improvement.

What is challenging about developing documentation for speech recognition? I believe the toughest part is that some people still see speech recognition as a programming task.  In real life though, a speech recognition application should be viewed as a data analysis task.

Here is why:  suppose you work on a normal programming task; once you figure out the algorithm, your job is pretty much done.

In a speech app though, that is just a tiny step towards a good system.  For example, you might notice that your dictionary is not refined enough, such that some words are not recognized correctly.   Or you might find that something is wrong with your language model, such that certain trigrams never appear.

Those tasks, in terms of skill sets, require a person to sit in front of the Linux console until the Eureka moment comes: "Oh, that's what's wrong!"    So the job of "Speech Scientist" usually requires knowledge of statistics, machine learning, and more generally good analytic skills.

Your basic Linux skills are also extremely important: a senior researcher once showed me how he did many things solely with perl one-liners.   As it turns out, when you can wield a perl one-liner correctly, you can solve many text processing problems with one command!  This saves you a lot of time writing throw-away scripts and lets you focus on analyzing why things are going wrong.
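To make this concrete: the kind of job a one-liner handles, say counting word frequencies in a transcript file, takes only a few lines in any scripting language.  A sketch in Python (the file name is just a placeholder):

```python
# Quick word-frequency count over a transcript file: the classic
# first diagnostic to run on any new corpus or decoder output.
from collections import Counter

def word_frequencies(path):
    counts = Counter()
    with open(path) as f:
        for line in f:
            # lowercase and split on whitespace; real pipelines
            # would also strip punctuation and utterance IDs
            counts.update(line.lower().split())
    return counts

# word_frequencies("transcripts.txt").most_common(10) then shows
# the ten most frequent words.
```

In perl the same thing really is one line; the point either way is that the analysis, not the script, is where your time should go.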

Back to good speech application documentation:  one of the challenging parts is to convey this real-life workflow of a Speech Scientist to the open source community.   Many of us learn (and strive to learn more...)  this kind of skill the hard way: writing reports, papers and presentations, and being ready to get feedback from other people.  You will also find yourself amazed by some brilliant insights and analyses. (There are stupid analyses too, but that's part of life......)

The Sphinx project collectively has come a long way on this front.  If you have time, check out
http://cmusphinx.org/wiki; I found much of the material very useful.

The Grand Janitor.

Friday, May 18, 2012

What should be our focus in Speech Recognition?

If you have worked in a business long enough, you start to understand better what types of work are important.   As with many things in life, the answer is sometimes not trivial.   For example, in speech recognition, what are the important ingredients to work on?

Many people will instinctively say the decoder.  For many, the decoder, the speech recognizer, or the "computer thing" which does all the magic of recognizing speech, is the core of the work.

Indeed, working on a decoder is loads of fun.  If you are a fresh new programmer, it is also one of those experiences which will teach you a lot.   Unlike thousands of small, "cool" algorithms, writing a speech recognizer requires you to work out a lot of file format and system issues.   You will also touch a fairly advanced dynamic programming problem: writing a Viterbi search.   For many, it means several years of studying source code bases from the greats such as HTK, Sphinx and perhaps in-house recognizers.
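Since the post mentions it, the core of that dynamic programming problem can be sketched in a few lines.  This is a toy Viterbi over a discrete HMM, nothing like a production decoder with its beams, lexical trees and language model scores, but the recurrence is the same:

```python
def viterbi(obs, states, log_init, log_trans, log_emit):
    """Toy Viterbi search over a discrete HMM.

    Works in log-probabilities to avoid underflow; returns the most
    likely state sequence and its log-score.
    """
    # best[s]: log-score of the best partial path ending in state s
    best = {s: log_init[s] + log_emit[s][obs[0]] for s in states}
    backptr = []
    for o in obs[1:]:
        new_best, ptrs = {}, {}
        for s in states:
            # dynamic programming step: best predecessor for state s
            pred = max(states, key=lambda p: best[p] + log_trans[p][s])
            new_best[s] = best[pred] + log_trans[pred][s] + log_emit[s][o]
            ptrs[s] = pred
        backptr.append(ptrs)
        best = new_best
    # trace back from the best final state
    last = max(states, key=lambda s: best[s])
    path = [last]
    for ptrs in reversed(backptr):
        path.append(ptrs[path[-1]])
    path.reverse()
    return path, best[last]
```

A real recognizer runs essentially this recurrence over the HMM states of context-dependent phones, with GMM scores as the emission terms and aggressive pruning to keep it tractable.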

Writing a speech recognizer is also very important when you need to deal with speed issues.  You might want to fit a recognizer into your mobile phone or even just a chip.   For example, at Voci, an FPGA-based speech recognizer was built to cater to ultra-high-speed speech recognition (faster than 100xRT).   All these system-related issues require understanding of the decoder itself.

This makes speech recognition an exciting field, similar to chess programming.  Indeed the two fields are very similar in terms of code development.   Both require deep understanding of search as a process. Both have eccentric figures pop up and drop out.   There are more stories untold than told in both fields.  Both are fascinating fields.

There is one thing on which speech recognition and chess programming are very different.   This is also a subtle point which even many savvy and resourceful programmers don't understand: how each of these machines derives its knowledge sources.   In speech, you need a good model to do a decent job on your task.   In chess, most programmers can proceed to write a chess player with the standard piece values.   As a result, there is a process before anyone can use a speech recognizer: first training an acoustic model and a language model.

The same decoder, given different acoustic models and language models, can give users perceptions ranging from a total trainwreck to a modern wonder, borderline magic.   Those are the true ingredients of our magic.   Unlike magicians though, we are never shy to talk about these secret ingredients.   They are just too subtle to discuss.   For example, you won't go to a party and tell your friends that "Using an ML estimate is not as good as using an MPFE estimate in speech recognition.  It usually results in an absolute 10% performance gap."  Those are not party talks.  Those are talks for when you want to have no friends. :)

Both types of task require learning different from programming training.   10 years ago, those skills were generally carried by "Mathematicians, Statisticians, or people who specialized in Machine Learning".   Now there is a new name: "Big Data Analyst".

Before I stop, let me mention another type of work which is important in real life: transcription and dictionary work.   If you ask some high-minded researchers in the field, they will almost certainly think it is not interesting work.   Yet, in real life, you can almost always learn something new and improve your systems through it.  Maybe I will talk about this more next time.

The Grand Janitor





Sunday, May 13, 2012

Restart

Again, I feel rejuvenated.   The last few months of experience have started to make me more unified, both as a person and as a technical person.   When you start to work on something which draws on all you know in your life, you know that you are walking on the right path.

Things are starting to look more and more interesting.

The Grand Janitor

Friday, May 11, 2012

Development of Sphinx 3.X (X = 6 to 8) and its Ramifications

One of the things I did back in Sphinx was the so-called "Great Refactoring" of Sphinx 3, SphinxTrain and sphinxbase.   It was started by me but mostly taken up by Dave (in a disgruntled manner :) ).    I write this article to reflect on the whole process and ask whether I did the right thing.

The background is like this: as you know, the CMU Sphinx project has many recognizers: Sphinx2, 3, 4, PocketSphinx and MultiSphinx.   It's easy to understand why that happened in the first place.  CMU is a university and understandably has many different types of projects.  In essence, when someone thinks of a good new idea, they simply implement a recognizer.  The by-product is a PhD thesis or some kind of project report.

There is nothing wrong with that.  Think of the pain of understanding and changing a recognizer which has 10-30 thousand lines of code, and you will know that it is not for the faint of heart.  Many of the original programmers of the recognizers also had practical reasons to ignore code re-usability - many of them had deadlines to meet.  So I always feel empathy towards them.

Of course, on the other side of the coin, having many recognizers gives users a mild amount of pain.   Just look at 3.0 versus 3.3: the command-line interface changed (e.g. -meanfn became -mean).   So when people need to interface with the code, it takes some understanding.   The bigger problem is: can you expect a feature that appears in one of the decoders to appear in another?   This kind of inconsistency is very hard to explain to normal users.

So here came the first change, at 3.5, around 6-7 years ago: I decided to merge the 3.0 series of tools and recognizer with 3.3, the fast decoder.  I have got to say, the decision was mainly driven by young naivete and year-long insomnia.  ( :) )   There was also frustration from users which drove me to make those changes.  In 3.5, the main thing I did was just to "port" the tools from the old 3.0, such as allphone, astar and align, to 3.x.   There were some command-line interface changes.   So far, all cool.

Then came 3.6.  At this point, I started to realize that a lot of the underlying functions and libraries were duplicated.   For example, we had multiple GMM computation routines, but you couldn't use them in all the tools which call GMM computation.  allphone in 3.5 used GMM computation, but you couldn't expect any fast GMM computation from 3.4 to be usable in allphone, simply because the library wasn't shared.
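For readers who have not met the term: a "GMM computation" routine scores each frame's acoustic feature vector against a Gaussian mixture model.  A minimal sketch, assuming diagonal covariances and plain Python (the real code bases use heavily optimized and approximated versions of this):

```python
import math

def log_gaussian(x, mean, var):
    # log-density of a diagonal-covariance Gaussian at feature vector x
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_log_likelihood(x, weights, means, variances):
    # log of the weighted sum of component densities,
    # computed with log-sum-exp for numerical stability
    logs = [math.log(w) + log_gaussian(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    mx = max(logs)
    return mx + math.log(sum(math.exp(l - mx) for l in logs))
```

A decoder calls something like this for every frame and every active HMM state, which is why sharing (and speeding up) this one routine across tools matters so much.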

So what did young and naive me think?  Let's try to write a single architecture to incorporate all these different things! (!!!!)  Now... this is where I think things went wrong.

Let me explain a little bit more.  There is a legitimate reason why the original programmer (Ravi) decided to split the tools into multiple parts and let code duplicate.   Simply put, an issue in align is not necessarily an issue in decode.   If the programmer of align needs to consider the issues of decode, it will take a long time to get any programming done at all.

This happened to be the case for Sphinx 3.X.  And in the development of Sphinx 3.X, there was another undesirable factor: I decided to leave - I simply couldn't overcome the economic force at the time - a startup company was willing to hire me.

To complicate the matter,  we *also* decided to factor out the common parts of SphinxTrain and sphinx3 to avoid code duplication between the two.   Again, it was driven by a legitimate concern: the fact that there were two feature extraction routines in the two packages constantly made users ask themselves whether the front-ends matched.

All of these, except my leaving, were good things, but they entail coding time.  The end effect was that the effort became too big and too time-consuming.  3.6 took me around 1 year to write and release. I made an official release around mid-2006, but there were still too many issues in the program.  The later 3.8 Dave took up, and he really fixed many bugs.  So I always say it's Dave who got Sphinx 3.X into its current stable form.

To the credit of the guys in the team, they really pushed back on me: Evandro, being circumspect and consistent, always asked if it was a good idea in the first place.   Ravi, always the wise man, brought up the issues of merging the code.  And of course, there is Dave; he deserves most of the credit for fixing a lot of nasty bugs.

So, in fact, it is really I who should be blamed in the process.  I guess I am finally mature enough to apologize to everyone.

So you may wonder why I said all of this.  Oh well, first of all, it's because I am going to put work into the recognizers again.   Not just Sphinx 3, but all the other recognizers too.  So my first hope is that I don't repeat my past mistakes.

Now, given that the code has been iterated on over the last 6 years, the benefit of merging the code in Sphinx 3 is starting to really show.  People can do a lot more things than in the past.   Is it good enough?  I don't think so.  Sphinx 3 has a lot of potential, but it's very misunderstood.  In a nutshell, I need to put more work into it in the future.

The Grand Janitor

Friday, May 04, 2012

Being a programmer in your 20s and 30s

It's funny how a person changes.  I always thought my 20s were my best time.  They sort of were.   Generally, that is the time you are energetic, can burn as much as you can, and naively think that life and relationships can last forever. Also, you unconditionally trust other people.

Things turn when you are in your 30s: you start to realize your skill, your capacity for growth, has a limit.  In exchange, you grow wiser.  In my case, I found my reads on people are much better; I started to go behind other people's words and try to understand their intentions.  I started to treasure genuine friendship and to reject contrived politeness and fake honesty.

I also started to know when is the best time to be quiet and when is the best time to give a comeback.  The former is important because if you are the only one who shines, your team will perpetually have the capability of one person: you.

The latter is also important because if you are always quiet, there are people who will step on your toes harder and harder.   They will think you are weak and can be bullied.   In real life, as in high school, the bullies love to bully the weak.   Making sure they have a hard time doing so is a very important life skill.

I will never go back to the time when I could hack a program for 20 hours, sleep, and then hack it again for another 20 hours.  Do I regret it? Probably not; in exchange, I learned that sometimes you can solve a 20-hour problem in 2 hours, and you can still sleep and make a living.   For all that matters, it seems to be a better deal. :)

The Grand Janitor


Wednesday, May 02, 2012

Start to look at the repository tree

Programming as a profession is a strange one.   If you are a doctor, you can usually carry your knowledge and skills from one place to another, provided that you have exactly the same tools.    If you are a programmer, your speed and skill are partially determined by the tools you build in-house at a particular place.   So, for example, I am not supposed to use any tool I built when I worked at the small video-advertising start-up.   Even if I could do something in 1 second at that time, if I change my job, I will need to rebuild the tool from scratch.   We are probably talking about days to rebuild the tool and weeks to refine it again.

There is one exception: if you work in open source, much of your code is stored in a public place.   Even when you have left your job for a long time, it is legit for you to use it again.  You don't have to solve the same problem again and again.   This is the beauty of open source, and I have personally benefited greatly from it.

As I start to regain my muscles in Sphinx, I notice that there have been many changes in the last 6 years.  Just look at the top level of the Subversion tree:

CLP/ (r10079, 23 months ago, dhdfu): Finally add an -F argument to use the full path in the control file as the label…
PocketSphinxAndroidDemo/ (r11117, 9 months ago, nshmyrev): Wrapper for nbest
SimpleLM/ (r22, 12 years ago, rickyhoughton): Initial revision
Speech-Recognizer-SPX/ (r8933, 3 years ago, nshmyrev): Update module to recent pocketsphinx API
SphinxTrain/ (r11350, 9 days ago, nshmyrev): Extract warped features during 000 stage if VTLN is enabled…
archive_s3/ (r7289, 4 years ago, egouvea): Fixed error message in decoder script reporting failure in bw, and made result d…
cmuclmtk/ (r11035, 10 months ago, nshmyrev): Fixes bug in wngram2idngram and adds a test for it
cmudict/ (r11348, 3 weeks ago, air): cleaned up documentation and code (a bit) recompiled the dict
gst-sphinx/ (r7848, 4 years ago, dhdfu): Support changing language models at runtime (maybe)
htk2s3conv/ (r11336, 6 weeks ago, nshmyrev): Adds warning about different number of mixtures
jsgfparser/ (r7230, 4 years ago, dhdfu): Fix the main program to output the only public rule if no rule is specified, and…
logios/ (r11339, 4 weeks ago, tkharris): remove duplicated code
misc_scripts/ (r10147, 22 months ago, dhdfu): handle zero references
multisphinx/ (r10945, 12 months ago, dhdfu): clean up better and introduce vocabulary maps
pocketsphinx/ (r11351, 8 days ago, nshmyrev): Updated lat2dot script. I need to move it to the other location though
pocketsphinx-extra/ (r9972, 2 years ago, dhdfu): add sc models with mixture_weights and mdef.txt files
scons/ (r5868, 5 years ago, egouvea): updated the scons support to reflect that plugin.jar is now part of the package
share/ (r5532, 6 years ago, egouvea): Setting dsp and dsw files to have windows EOL regardless where it's downloa…
sphinx2/ (r8767, 3 years ago, egouvea): Updated the sphinx-2 MS files to MS .NET, consistent with the other packages, an…
sphinx3/ (r11329, 2 months ago, nshmyrev): Patch to solve memory issues in python module…
sphinx4/ (r11344, 3 weeks ago, nshmyrev): Properly sets logger for AudioFileDataSource. Thanks to Bandele Ola.
sphinx_fsttools/ (r10791, 14 months ago, nshmyrev): Some bit in AM to FST conversion
sphinxbase/ (r11346, 3 weeks ago, nshmyrev): Properly select buffer size when using audioresample. Thanks to balkce…
tools/ (r9009, 3 years ago, nshmyrev): Updated to the latest release of sphinx4
web/ (r10249, 21 months ago, nshmyrev): There is no sphinx3 development anymore
How exciting is that?  There were only 6 to 7 top-level directories 7 years ago!

From now on, I will start to put more notes on different tools in the repository. 

The Grand Janitor

Thursday, April 12, 2012

Getting back to the project.....

After several years of not touching Sphinx (or, for that matter, any serious coding), I started to have a conversation with myself - namely, the me who maintained Sphinx 3.X 6 years ago.

When I was working on the project, I was tasked to work on Sphinx 3, and I have been an advocate of Sphinx 3 ever since.  To tell the truth, I might have overdone it - there are many great recognizers in the world.  Just look within the family: Sphinx 4, PocketSphinx and recently MultiSphinx by Dave are all great recognizers.  (Dave has also fixed a lot of my bugs.  So if you look into the source code, you will see places where he screamed, or I paraphrase, "Arthur, what are you talking about?")

Experience with many outside companies changed me.   I literally turned from a naive twenty-something guy into a thirty-something guy.   Still naive, but my world view has certainly changed.   In fact, for many purposes, I found that learning all the components of Sphinx is very beneficial.

Let's think of it this way:  each of the projects from CMU Sphinx was meant to solve a practical problem in real life.  For example, with Sphinx 4, not only do you get great out-of-the-box performance, you also get native code which can be incorporated into Java-based servers.  This is a huge plus when you are thinking of writing a web application.    And web applications will be around for a long time.

Same with PocketSphinx: it is meant to be a version of Sphinx which can be integrated into different embedded systems.   I have yet to learn about MultiSphinx, but I always have faith in Dave and his ideas.

This makes me want to learn again.  It's weird: once you open your mind, you see doors everywhere.   For me, my next targets are learning Sphinx 4 and PocketSphinx.   Both of them are of great importance.   Will I still work on Sphinx 3?  Probably.  X can always be bigger than 8.  It's the programming reality which makes me change.   As I see it now, it's a good change, a very good change.

The Grand Janitor

Monday, November 21, 2011

The Grand Janitor After CMU Sphinx

I have been away from the development of CMU Sphinx for around 6 years.  Geez.  Talk about changes.  During that time, I went to work for one startup and one defense contractor, and started numerous non-speech-related blogs.

I certainly had fun but felt adrift at the same time - both companies I worked with are extraordinary, but their causes are not mine.    As you know, life without a cause is a tough life.

And now I am inspecting Sphinx and open source speech recognition again.   Wow, there are tons of changes.   The awareness of the need for open source speech recognition has never been so acute.   The performance of open source speech recognition still requires a lot of work, but it is no longer unthinkable to deploy an open source speech recognizer in a real application.

There are more resources for learning how to use a speech recognizer, thanks to dedicated Sphinx developers such as David Huggins-Daines and Nickolay Shmyrev.   Many more people are learning how to properly use Sphinx, and there is more documentation around.

There are also more resources for building a speech recognizer.  One notable effort is VoxForge, led by Ken MacLean, which is dedicated to accumulating clean, transcribed data over time.   Though I don't know how large it is now, I admire Ken's dedication.    Someone should have started such a project a long time ago.   Now that it is started, there is a chance that open source data will become an important source of speech data in the future.

For the last 6 years, I could only act as a bystander of Sphinx development.   I changed jobs recently and will work for a company which is close to Sphinx.   I don't know how much *real* work I will do.   But I am glad that Sphinx and I have crossed paths again.   At the very least, I hope to contribute ideas to the community and help this great project grow.

The Grand Janitor


Sunday, December 12, 2010

I am back

Hi Guys,
I stopped using this blog for 3 years and now I have decided to reclaim it.  My life as the "Grand Janitor" of the Sphinx software is very memorable to me.   It was unfortunate that I stopped the blog and only wrote online in other venues.

I will start to blog more about speech recognition and natural language processing.  This is probably the time for me to read up again.  My other blog, Random Thought of Arthur Chan, will solely hold my thoughts on other random things in the world.

     In any case, it's good to meet all of you again.  We'll have fun.

The Grand Janitor

Wednesday, July 22, 2009

Random Thought: Cloud


When I was in college in Hong Kong, I loved to stare at the blue sky and just watch pieces of cloud floating from my left to right. There was much open space at the University. My favorite thing to do was to skip classes and watch clouds.

To many of my friends, that was a ridiculous habit, though most of them saw it as part of my little eccentricities in my little unsung college career.

In other words, I have done worse. :) So they were not truly surprised, and I am not that disappointed by their misunderstanding of clouds.

My true disappointment came when I tried to share this interesting hobby with a mathematically-oriented friend. This guy is genuinely smart. In terms of math, I think he is about 5 years ahead of me. So I thought he would understand.

So I told him my true intention in watching clouds - I would like to predict the weather based on observing them. That, to me, is a totally reasonable application of mathematics. This was his response:

"You read "Wind and Cloud" too much.".

("Wind and Cloud" is a popular martial arts comic book in Hong Kong. It is about two martial arts experts, "Wind" and "Cloud", and their adventures in China.)

Many people have asked me why I chose to live in the US instead of Hong Kong, or even greater China. This story is probably an example of why.

In Hong Kong (or probably greater China), it is difficult for students to imagine that advanced mathematics could have anything to do with a complex subject such as meteorology. Also, there is a big gap between the expert knowledge of a certain field and the general public. So even if you have a technical background and you are smart enough to learn, you could still be ignorant of other fields.

Of course, an even deeper problem is that imagination and creativity are not an emphasis in technical subjects such as science and mathematics. In the secondary school curriculum, they were usually not taught in a way that inspires students to discover mathematics themselves. This explains the behavior of my smart friend.

There are social consequences of this: students who grow up like this will probably be unable to appreciate interesting thoughts from the young. That is to say, scientific and technical workers are not truly appreciated. This compounds with the general money-loving attitude in Hong Kong. You will not be surprised that science and technology are tough to develop there.

We cannot say the States' education is perfect; there are tons of holes and problems in it as well. But perhaps Americans are simply more adventurous in nature. They always see possibilities. That's why if you asked a smart student in the U.S. the same question, you would probably get an account of the General Circulation Model, how its basic equations are written, and how the Navier-Stokes equations can be used in this problem. (And if we digressed, we would chat about how the Navier-Stokes equations are the subject of one of the seven Millennium Prize Problems.)

I don't resent my friend's comment. What I see is that a smart person like him was wasted by the system. How many more of these situations have happened in the past? I have no idea. What I do know is that this is the true impediment to producing good scientific and technical workers.

Wednesday, November 07, 2007

Statistically Insignificant Me

Slightly related to my last post: it concerns an interesting issue of whether we should share the bookshelf in the first place.

Why is it an issue? Well, privacy. Suppose someone is malicious and tries to figure you out. The best way is to gather all the information about you and use it against you.

Another concern of mine is rather interesting and absolutely speculative: what if the information I read affects my thoughts, and what if people could reconstruct my thoughts just from the information I read? That would open up a lot of interesting applications, e.g. we might be able to better predict what a person will do.

Just as in other time series problems such as speech recognition and quantitative analysis, a human life could simply be defined by a series of timed events. Some (I forget the quote) believe that one human life could be stored on a hard disk, and some have started to collect human lives and see whether they could be modeled.

Information about what you read tells a lot about who you are. Do you read Arthur C. Clarke? Do you read Jane Austen? Do you read Stephen King? Do you read Lora Roberts? From that information, one could build a machine learner to reverse-map to who you are and how you make decisions. We might just call this a kind of personality modeling.

It seems to me these are entirely possible from the standpoint of what we know. Yet, I still decided to share my bookshelf. Why?

Well, this was a crystal-clear moment for me (and perhaps for you as well) which helped me make the decision. Very simple: *I* am statistically insignificant.

If you happen to come to this web page, the only reason is because you are connected to me. How likely is that?

I know about 150 people in my life. The world has about 6 billion. So the chance of me being discovered is around 2.5 x 10^-8. That is already pretty low.

Now, when other people who know me recommend me to someone else, this probability gets boosted, because 1) my PageRank will increase, and 2) people who follow my links deep enough will eventually discover my bookshelves.

Yet, if I try to stay low-profile (say, not doing any SEO, not recommending my page to any friends), then it is reasonable to expect that boost factor to be smaller than 1.

Further, 2.5 x 10^-8 is an upper bound as an estimate, because
1, Not all my friends are interested in me (discounting factor: 0.6, a conservative one; the actual number is probably higher but I just don't want to face it. ;) )
2, My friends who are interested in me might not follow my links (discounting factor: 0.01)

So we are talking about an event with probability as low as 10^-9 or 10^-10 here. That seems to me close to a cheap cryptographic algorithm.
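Redoing the arithmetic (note that 150 out of 6 billion is closer to 2.5 x 10^-8; the discount factors are, of course, pure guesses):

```python
friends = 150
world = 6e9

base = friends / world   # chance a random person is connected to me
interested = 0.6         # guessed fraction of friends interested in me
follow_links = 0.01      # guessed fraction who follow links deep enough

p = base * interested * follow_links
# base comes out to 2.5e-8, and p to 1.5e-10, squarely in the
# 10^-10 range claimed above
```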

But notice that my security comes not from hiding or cryptography. My security comes merely from my statistical insignificance. In plain English: I am very open, but no one cares. And I am still a happy treebear. ;)

That's why you can see my bookshelf. A long story for a simple decision. If you happen to read this, I hope you enjoyed it.

-a

Visual Bookshelves

I love to read and like to write a review of every book I read. None of them will change the world, but I still love to do it. That's why, by definition, I'm a bookworm. I don't even feel shy about it. ;)

I went quite far: I tried to record every book I read on a blog called "ContentGeek". Luckily, I hadn't gone very far, because once I discovered Visual Bookshelves, there was no need for me to do it at all.

Visual Bookshelves allows users to look up a book from Amazon, add comments, and store it in a database. It also shows the covers of the books. What more could I want?

So anyway, this is the link of my visual bookshelves:

http://www.cs.cmu.edu/~archan/personal/bookshelf.html

Enjoy.

-a

Monday, October 01, 2007

Prof. Randy Pausch's Last Lecture

http://video.google.com/videoplay?docid=362421849901825950&hl=en

If you work or study in Computer Science or Electrical Engineering, I would highly recommend this video to you. It makes me remember why I decide to work in this industry in the first place.

-a

Wednesday, August 01, 2007

David's plan on Sphinx 3.7

http://lima.lti.cs.cmu.edu/mediawiki/index.php/Sphinx3

A great read; it touches the heart of the implementation issues of all sphinxen. And its criticism of my implementation is straight to the point.

I felt very relieved when the current maintainer attacked what I did in the past. (Some features I did were rather stupid.) This shows that Sphinx is still alive and will stay alive.

-a

Friday, July 20, 2007

Life in Scanscout

Hi Guys,
Scanscout (www.scanscout.com) is a rather interesting company. If you look at this blog, you probably know that I have been there for a while.

My direct supervisor doesn't like to give away too much. I think he has a point (as he is a *very* smart guy). This contradicts my philosophy of information sharing. So alright, as a compromise, here are a couple of things I can share. (Of course, my estimate of the probability of anyone looking at this blog is about 1/10^9, so I guess it doesn't matter that much......)

1, We have a massage chair and it is awesome.
2, We have a foosball table and a tournament every Friday. Beware: there are several good players. (I always get the lowest score.)
3, It is on the fore-front of video advertising. I am glad that I've joined. :-)

Arthur Chan

Sunday, April 15, 2007

mosesdecoder

Mosesdecoder: http://www.statmt.org/moses/

Ah, this is not exactly news. It has been around since the 2006 Johns Hopkins workshop.

mosesdecoder is probably the first open source statistical machine translation decoder in the world. For quite a while, only the IBM model training portion of the pipeline was available as open source, in GIZA++. So people who were interested in SMT would probably turn to Pharaoh, a closed-source implementation available on the web.

I could have some fun. ;-)

-a

Monday, March 05, 2007

Third Draft of Hieroglyphs

Hi all,

It has been a while since I worked on the Hieroglyphs (the fancy name I made up for the Sphinx documentation). This is perhaps the only thing I haven't wrapped up at CMU. Therefore I decided to release a draft. You can find it

at

http://www-2.cs.cmu.edu/~archan/documentation/sphinxDocDraft3.pdf

It still looks pretty messy, but it is starting to look like a book now.

Several chapters and sections were trimmed in this draft. You will still see a lot of "?"s. Those are signs of insufficient proof-reading. Forgive me; when I have more time, I will try to fix some of them in the near future.

Grand Janitor