Saturday, November 16, 2013

"The Grand Janitor Blog" is Moving

After 8 years of using Blogger, it finally makes sense for me to get a dot com.  Blogger just has too many idiosyncrasies that make it hard to use and extend.

You can find my new blog, "The Grand Janitor Blog V2", at   I have already written one post there.  Hope you enjoy it. 


Tuesday, September 17, 2013

Future Plan for "The Grand Janitor Blog"

I have been crazily busy, so blogging has been rather slow for me.   Still, I have a growing feeling that my understanding is getting closer to the state of the art of speech recognition.   And for now, to talk about the state of the art of speech recognition, we have to talk about the whole deep neural network trend.

There is nothing conceptually new in the use of the hybrid HMM-DBN-DNN.   It was proposed under the name HMM-ANN in the past.   What is new is that there are new algorithms which allow fast training of multi-layered neural networks.   This is mainly due to Hinton's breakthrough in 2006: it suggests that training a DBN-DNN can first be initialized by pretrained RBMs.

I am naturally very interested in this new trend.   IBM's, Microsoft's and Google's results show that the DBN-DNN is not the toy model we saw in the last two decades.

Well, that's all for my excitement about DBN; I still have tons of things to learn.    Back to the "Grand Janitor Blog":  as I tried to improve the blog layout 4 months ago,  I have to say I felt very frustrated by Blogger and finally decided to move to WordPress.

I hope to move within the next month or so.  I will write a more proper announcement later on.


Tuesday, June 25, 2013

Apology, Updates and Misc.

There are some questions on LinkedIn about the whereabouts of this blog.   As you may have noticed, I haven't posted any updates for a while.   I was crazy busy with work at Voci (Good!) and many life challenges, just like everyone.    I am having a lot of fun with programming, as I am working with two of my favorite languages - C and Python.  Life is not bad at all.

My apologies to all readers, though; it can be tough to blog sometimes.  Hopefully, this situation will change later this year.....

A couple of worthwhile news items in ASR:  Goldman Sachs won the trial in the Dragon lawsuit.  There is also VB's piece on MS doubling the speed of their recognizer.

I don't know what to make of the lawsuit but only feel a bit sad.  Dragon has been the home of many elite speech programmers/developers/researchers.  Many old-timers of speech were there.   Most of them sigh about the whole L&H fiasco.   If I were them, I would feel the same.   In fact, once you know a bit of ASR history, you would notice that the fall of L&H gave rise to one you-know-its-name player nowadays.  So in a way, the fates of two generations of ASR guys were altered.

As for the MS piece, we are following another trend these days, which is the emergence of DBN.  Is it surprising?  Probably not; it's rather easy to speed up neural network calculation.  (Training is harder, but that's where DBN is strong compared to previous NN approaches.)

On Sphinx, I will point out one recent bug report contributed by Ricky Chan, which exposed a problem in bw's MMIE training.   I have yet to try it, but I believe Nick has already incorporated the fix into the open-source code base.

Another item which Nick has been stressing lately is using Python, instead of Perl, as the scripting language of SphinxTrain.   I think that's a good trend.  I like Perl and use one-liners and map/grep type programs a lot.  Generally, though, it's hard to find a concrete coding standard for Perl,   whereas Python seems cleaner and naturally leads to OOP.  This is an important issue - Perl programmers and Perl programming styles seem to be spawned from many different types of languages.   The original (bad) C programmer would fondly use globals and write functions with 10 arguments.  The original C++ programmer might expect language support for OOP but find that "it is just a hash".   These style differences can make Perl training scripts hard to maintain.

That's why I like Python more.  Even a very bad script seems to convert itself into a more maintainable script.   There is also a good pathway for Python/C interfacing.  (Cython is probably the best.)
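The Python/C pathway can be tasted even without Cython's compile step: the standard library's ctypes module calls into C directly.  The snippet below is only a toy illustration (it loads the C runtime already linked into the interpreter on POSIX), not part of any Sphinx tooling.

```python
import ctypes

# Load the C runtime already linked into the interpreter (POSIX dlopen(NULL)).
libc = ctypes.CDLL(None)

# Declare the C signature of abs() so ctypes converts arguments correctly.
libc.abs.argtypes = [ctypes.c_int]
libc.abs.restype = ctypes.c_int

print(libc.abs(-42))  # -> 42
```

For anything bigger than a one-off call, Cython (or a hand-written extension module) gives you real C speed, but ctypes is handy for quick glue.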

In any case, that's what I have this time.  I owe all of you many articles.  Let's see if I can write some in the near future.


Monday, May 06, 2013

Translation of "Looking forward (only 263 weeks left)"

As requested by Pranav, a good friend of Sphinx, I translated one article, "Looking forward (only 263 weeks left)", from my Chinese blog "333 weeks" (original). So here it is, enjoy!

"April was a long long month.

I spent most of my time solving technical problems.  With the great help of colleagues, we finally got all the issues resolved.  I have also started to put some time into new tasks.  The Boston Marathon bombing was tough for everyone, but we kind of have closure now.  As for investment, mine is in pace with the S&P.  The weather is also getting better.  Do we finally feel spring again?

I think the interesting part of April is that I spent more time writing, be it blog posts or articles.  I wrote quite a bit even when I was busy.  I mentioned the Selection of Cumulomaniac.  At this stage, I am copyediting and proofreading the drafts.  It's a good thing to write and blog, as I love to connect with like minds."


Saturday, May 04, 2013

My Chinese Blogs : Cumulomaniac and 333 Weeks


I hadn't updated this blog for a while.   April was a long month, and the whole Boston Marathon bombing was difficult for me.   I ended up spending quite a bit of time working on my other blogs, Cumulomaniac and 333 Weeks.   If you click through, you'll see they are all in Chinese.    In the past that was okay, but recently more and more friends of mine have been asking me what the whole thing is about.    So it might deserve some explanation. 


Cumulomaniac is more of my personal photography and writing blog.  From time to time, I take pictures of clouds around Boston and share them with my friends in Hong Kong.  You know Hong Kong?  It's probably hard for my American friends to even start to imagine if they have only watched Jackie Chan's movies:   I used to live in a 500-square-foot room with a pitifully sized bathroom and kitchen.  It's called 500 square feet, but it feels like 300 because over the years my family piled tons of stuff there. 
The place I lived in Hong Kong, called Sham Shui Po, is located close to a flea market and a computer shopping mall.  That's perhaps why I fell in love with gadgets in the first place.  

For context, the most important thing you should know is that there is barely any open sky to be seen in Hong Kong.   So it was a big change when I first came to the States.  I guess there is a reason to share "my sky" with my friends.  

Startup Employee 333 Weeks 

As you might know, I am working at yet another startup, Voci, with some great minds who graduated from Carnegie Mellon.   When I took the job, I decided to stay with the company for a while.   I set the time at 333 weeks.   So the blog Startup Employee 333 Weeks chronicles my story at the company.  

I chose to write it in Chinese because it is yet another blog topic which has been discussed to death by American bloggers.   In Hong Kong/China, though, there are still many people living in a bureaucratic system, living their lives as big companies' employees; they might not be very familiar with how "startupers" work and live.  There is also much misunderstanding of startups among people who work in normal traditional jobs.  

My focus in 333 Weeks is usually project management, communication and the issues you face when you work at a startup.   Those are what we programmers call "soft stuff", so I seldom bring them up in The Grand Janitor's Blog. 

Why Didn't you Write Them In English?

I gave partial answers in the paragraphs above.  In general, my rule of blog writing is to make sure my message is targeted at a well-defined niche group.    The Grand Janitor is really for speech professionals, while 333 Weeks is written for aspiring startup guys.    So that pretty much sums up why you didn't see my other writing in the past.

Another (rather obvious) reason is my English.   My English writing has never quite caught up with my Chinese writing.   Don't get me wrong: I write English way faster than Chinese.   I also write a lot.   The issue is that I never feel I can embellish my articles with English phrases as I do with Chinese phrases.   

That has changed quite a bit recently, as I feel my English writing has improved.   (Maybe because I have hung out with a bunch of comedians lately. :) )  But I still feel some topics are better written in a certain language.   

Hopefully this can change in the near future.  In fact,  Pranav Jawale, a good friend of Sphinx, has recently become interested in one article I wrote in 333 Weeks.  And I am going to translate it soon.  

If you are interested in any articles I wrote in Chinese, feel free to tell me.  I can always translate them and put them on GJB. 


Saturday, April 20, 2013

The Boston Marathon Explosion : Afterthought

It has been a crazy week.  Life was crazy for Bostonians ...... and perhaps all Americans.   From the explosion to the capture of the suspect was only 5 days.   I still feel disoriented from the whole event.  

I felt the warmth of friends and family: there were more than 20 messages via Facebook, LinkedIn and Twitter from all over the world asking about my situation in Boston.  They are all friends who have never been to Boston, so they don't know that Copley Square is a well-known shopping area and only the few who are affluent enough live there.   That said, I was lucky enough to decide not to return books to the BPL central branch that day.   But I was shocked by the whole thing.  Some describe it as the most devastating terrorist attack since 9/11.  I have to concur.   Even though we can't establish a direct link between the two suspects and any terrorist organization yet,  the event was at least inspired by online instructions on how to make an improvised pressure cooker bomb. 

Even now, no one can clearly explain the motives of the suspects.  Family members are giving confusing answers on the suspects' psychological profiles.   It's hard to judge at this point, and maybe we should hear more from the authorities. 

My condolences to the families of all the victims, to the transit police officer who fell on the front line, and to all who were injured.   I sincerely hope the Boston authorities can soon help us understand why this tragedy happened. 



Wednesday, March 27, 2013

GJB Wednesday Speech-related Links/Commentaries (DragonTV, Siri vs Xiao i Robot, Coding with Voice)

ZhiZhen company (智臻網絡科技) from Shanghai is suing Apple for infringing their patents.  (The original Shanghai Daily article.)  According to the news, back in 2006 ZhiZhen had already developed the engine for Xiao i Robot (小i機械人).  A video from 8 months ago is below. 

Technically, it is quite possible that a Siri-like system could have been built in 2006.  (Take a look at Olympus/RavenClaw.)  Of course, the Siri-like interface you see here was certainly built after the advent of smartphones (which, by my definition, is after the iPhone was released).   So overall, it's a bit hard to say who is right.  

Of course, when interpreting news from China, it's tempting to use slightly different logic. In the TC article, the OP (Etherington) suggested that the whole lawsuit could be state-orchestrated, possibly related to Beijing's recent attacks on Apple. 

I don't really buy the OP's argument; Apple is constantly sued in China (and all over the world).  It is hard to link the two events together.  

This is definitely not the Siri for TV.

Oh well, Siri is not just speech recognition; there is also the smart interpretation at the sentence level: scheduling, making appointments, doing the right search.   Those are challenges in themselves.    In fact, I believe Nuance only provides the ASR engine for Apple. (Can't find the link; I read it from Matthew Siegler.)

In the scenario of TV,  what annoys users most is probably switching channels and searching for programs.  If I built a TV, I would also eliminate any set-top boxes. (So cable companies would hate me a lot.) 

Given the technology portfolios of all the big companies, Apple seems to own all the technologies needed.  It also takes quite a lot of design (with taste) to realize such a device. 

Using Python to code by Voice

Here is an interesting look at how ASR can be used in coding.   Some notes/highlights:
  • The speaker, Travis Rudd, got RSI 2 years ago.  After a climbing accident, he decided to code using voice instead.  Now that his RSI has recovered, he claims he still uses voice for 40-60% of his coding. 
  • 2000 voice commands, which are not necessarily English words.   The author used Dragonfly to control Emacs on Windows.
  • How do variables work?  It turns out most variables are actually English phrases. There are specific commands to get these phrases delimited by different characters. 
  • The speaker said "it's not very hard" for others to repeat.  I believe there will be some amount of customization.  It took him around 3 months.  That's pretty much how much time a solutions engineer needs to tune an ASR system. 
  • The best language to program in by voice: Lisp. 
One more thing.   Rudd also believes it would be very tough to do the same thing with CMUSphinx.  

Ah...... models, models, models. 

Earlier on Grand Janitor's Blog

Some quick notes on what a "Good training system" should look like: (link).
GJB reaches the 100th post! (link)


Tuesday, March 26, 2013

Tuesday's Links (Meetings and more)


Is Depression Really Biochemical (AssertTrue)

Meetings are Mutexes (Vivek Haldar)

So true.  It doesn't even count all the time you spend preparing for a meeting.

Exhaustive Testing is Not a Proof of Correctness

True, but hey, writing regression tests is never a bad thing. If you rely only on your brain for testing, it is bound to fail one way or the other.

Apple :

Apple's iPhone 5 debuts on T-Mobile April 12 with $99 upfront payment plan
iWatchHumor (DogHouseDiaries)


Yahoo The Marissa Mayer Turnaround

Out of all the commentaries on Marissa Mayer's reign,  I think Jean-Louis Gassée's goes straight to the point, and I agree with it most.   You cannot use a one-size-fits-all policy.  So WFH is not always appropriate either.


The Management-free Organization

Monday, March 25, 2013

Good ASR Training System

The term "speech recognition" is a misnomer.

Why do I say that? I explained this point in an old article, "Do We Have True Open Source Dictation?", which I wrote back in 2005.  To recap,  a speech recognition system consists of a Viterbi decoder, an acoustic model and a language model.  You could have a great decoder but bad accuracy if the models are bad.
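To make that decoder/model decomposition concrete, here is a toy Viterbi decoder in Python.  The function is generic; the model handed to it (the textbook rainy/sunny example below, not a real acoustic or language model) is completely swappable, which is exactly why good models matter as much as a good decoder.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state path for an observation sequence."""
    # V[t][s] = (best probability of any path ending in s at time t, predecessor)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p][0] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            V[t][s] = (prob, prev)
    # Backtrack from the best final state.
    best = max(states, key=lambda s: V[-1][s][0])
    path = [best]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return path[::-1]

states = ("Rainy", "Sunny")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}
path = viterbi(("walk", "shop", "clean"), states, start_p, trans_p, emit_p)
print(path)  # ['Sunny', 'Rainy', 'Rainy']
```

Swap in different probability tables and the same decoder gives different answers - which is the whole point of the "great recognizer, bad models" problem.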

So how does that relate to you, a developer/researcher of ASR?    The answer is that the ASR training tools and process usually become a core asset of your inventory.    In fact, I can tell you that when I need to work on acoustic model training, I need to work on it full time, and it's one of the most absorbing things I have done.  

Why is that?  When you look at the development cycles of all the tasks in making an ASR system,   training is the longest.  With the wrong tool, it is also the most error-prone.    As an example, just take a look at the Sphinx forum: you will find that the majority of non-Sphinx4 questions are related to training.    Like "I can't find the path of a certain file" or "the whole thing just got stuck in the middle".

Many first-time users complain with frustration (and occasionally disgust) about why it is so difficult to train a model.   The frustration probably stems from the perception that "shouldn't it be well-defined?"   The answer is again no. In fact, how a model should be built (or even which model should be built) is always subject to change.   It's also one of the two subfields of ASR, at least IMO, which is still creative and exciting in research.  (The other: noisy speech recognition.)  What an open source software suite like Sphinx provides is a standard recipe for everyone.

That said, is there something we can do better in an ASR training system?   There is a lot, I would say. Here are some suggestions:
  1. A training experiment should be created, moved and copied with ease,
  2. A training experiment should be exactly repeatable given the input is exactly the same,
  3. The experimenter should be able to verify the correctness of an experiment before an experiment starts. 
Ease of Creation of an Experiment

You can think of a training experiment as a recipe ...... but not exactly.   When we read a recipe and implement it, we humans make mistakes.

But hey! We are working with computers.   Why do we need to fix small things in the recipe at all? So in a computer experiment, what we are shooting for is an experiment which can be easily created and moved around.

What does that mean?  It basically means there should be no executables which are hardwired to one particular environment.   There should also be no hardware/architecture assumptions in the training implementation.   If there are, they should be hidden.

Repeatability of an Experiment

Similar to the previous point, should we allow differences when running a training experiment?  The answer should be no.   So one trick you hear from experienced experimenters is that you should keep the seeds of your random generators.   This avoids minute differences creeping into different runs of an experiment.
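A minimal sketch of that trick, with a made-up experiment initializer: store the seed alongside the experiment's other settings and reseed before every run, so two runs with the same inputs produce bit-identical results.

```python
import random

SEED = 1234  # store this with the experiment's other settings

def init_experiment(seed):
    # Hypothetical initializer: reseed, then draw the "random" pieces
    # of the experiment (e.g. initial weights, data shuffling order).
    random.seed(seed)
    return [random.random() for _ in range(3)]

run1 = init_experiment(SEED)
run2 = init_experiment(SEED)
assert run1 == run2  # same seed, same run: the experiment is repeatable
```

With numpy or other libraries in the picture, you would seed their generators in the same place.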

Here someone may ask: shouldn't we allow a small difference between experiments?  We are essentially running a physical experiment.

I think that's a valid approach.  But to be conscientious, you would want to run a certain experiment many times and calculate an average.    And that, in a way, is my problem with this thinking: it is slower to repeat an experiment.    E.g., what if you see a 1% absolute drop in your experiment?  Do you let it go? Or do you just chalk it up as noise?   Once you allow yourself not to repeat an experiment exactly, there are tons of questions you have to ask.

Verifiability of an Experiment

Running an experiment sometimes takes days; how do you make sure running it is correct? I would say you should first make sure trivial issues such as missing paths, missing models or incorrect settings are screened out and corrected before anything starts.

One of my bosses used to make a strong point of asking me to verify input paths every single time.  This is a good habit and it pays dividends.   Can we do similar things in our training systems?
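That habit is easy to automate.  A small sketch (the file names below are hypothetical, not SphinxTrain's actual layout): screen the experiment's required inputs before a day-long job is allowed to start.

```python
import os

def verify_experiment(required_paths):
    """Return the list of missing paths; an empty list means we may start."""
    return [p for p in required_paths if not os.path.exists(p)]

# Hypothetical inputs a training run might depend on.
missing = verify_experiment(["etc/feat.params", "model/mdef", "wav/"])
if missing:
    print("Refusing to start; missing:", missing)
```

The same idea extends to checking model formats, argument names and settings files - anything that would otherwise fail hours into the run.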

Apply it on Open Source

What I mentioned above is highly influenced by my experience in the field.   I personally found that the sites which have great infrastructure for transferring experiments between developers are the strongest and fastest growing.   

To put all these ideas into open source would mean a very different development paradigm.   For example, do we want a centralized experiment database which everyone shares?   Do we want to put common resources, such as existing parametrized inputs (such as MFCCs), somewhere common for everyone?  Should we integrate the retrieval of these inputs into our experiment recipes? 

Those are important questions.   In a way, I think they are the most important type of questions we should ask in open source. Because, regardless of much volunteer effort,  the performance of open source models still lags behind commercial models.  I believe it is an issue of methodology.  


Monday's Links (Brain-Computer Interface, Apple and more)


How to Write Six Important Papers a Year without Breaking a Sweat: The Deep Immersion Approach to Deep Work
It’s Like They’re Reading My Mind (Slate)


Apple Buys Indoor Mapping Company WifiSLAM (LA times)
How Apple Invites Facile Analysis (Business Insiders)
So long, break-even (Horace Dediu)

After big channels picked up Richards' story:

Startups have a sexism problem

R2-D2 Day ...... for real!

Saturday, March 23, 2013

The 100th Post: Why The Grand Janitor's Blog?

Since I decided to revamp The Grand Janitor's Blog last December, there have been 100 posts. (I cheated a bit, so it's not exactly "since then".)

It's funny to describe time by the number of articles you write.   In blogging, though, that makes complete sense.

I have started several blogs in the past.  Only 2 of them survive (Cumulomaniac and "Start-Up Employees 333 Weeks", both in Chinese).  When you cannot maintain your blog for more than 50 posts, your blog just dies, or simply disappears into oblivion.

Yet I made it.  So here's an important question to ask: what keeps me going?

I believe the answer is very simple.  There are no bloggers so far who work in the niche of speech recognition: none on automatic speech recognition (ASR) systems, even though there has been much progress.  None on engines, even though much work has been done in open source.   None on applications, even though great projects such as Simon are out there.

Nor was there discussion of how open source speech recognition can be applied to the commercial world, even when there are dozens of companies now based on Sphinx (e.g. my employer Voci,  EnglishCentral and Nexiwave), and they are filling the startup space.

How about how the latest technologies, such as deep neural networks (DNN) and weighted finite state transducers (WFST), will affect us?  I can see them in academic conferences, journals or sometimes trade shows ...... but not in a blog.

But blogging, as we all know, is probably the most prominent way people get news these days.

And news about speech recognition, once you understand them, is fascinating. 

The only blog which comes close is Nickolay's blog: nsh.   When I was trying to recover as a speech recognition programmer, nsh was a great help.  So thank you, Nick, thank you.

But there is only one nsh.  There is still a lot of fascinating stuff to talk about ...... right?

So that is probably the reason why I keep working on it:  I want to build something I want: a kind of information hub on speech recognition technology - commercial/open source, applications/engines, theory/implementations, the ideals/the realities.

I want to bring my unique perspective: I was in academia, then in industrial research, and now in the startup world, so I know quite well the mindsets of people in each group.

I also want to connect with all of you.  We are working on one of the most exciting technologies in the world.   Not everyone understands that.  It will take time for all of us to explain to our friends and families what speech recognition can really do and why it matters.

In any case, I hope you enjoy this blog.  Feel free to connect with me on Plus, LinkedIn and Twitter.


Friday, March 22, 2013

C++ vs C

I have been mainly a C programmer.  Because of work, though, I have been working with many codebases written in C++.

Many programmers will tell you C++ is a necessary evil.  I agree.  Using C to emulate object-oriented features such as polymorphism, inheritance or even the idea of objects is not easy.   It also easily confuses novice programmers.

So why does C++ frustrate many programmers then?   I guess my major complaint is that its standard has been evolving and many compilers cannot catch up with the latest.

For example, it's very hard for gcc 4.7 to compile code which could be compiled by gcc 4.2. Chances are some of the language features are outdated and will generate compiler errors.

On the other hand, C exhibits much greater stability across compilers.   If you look at the C portion of the triplet (PocketSphinx, SphinxTrain, sphinxbase), i.e. 99% of the code,  most of it just compiles across different generations of gcc.  This makes things easier to maintain.


Friday's Readings


GCC 4.8.0 released
Browser War Revisited
DARPA wants unique automated tools to rapidly make computers smarter

Just As CEO Heins Predicted, BlackBerry World Now Plays Home To Over 100,000 Apps
Apple updates Podcasts app with custom stations, on-the-go playlists and less ‘skeuomorphic’ design

The whole PyCon2013's Fork the Dongle business:

The story:
'Sexist joke' web developer whistle-blower fired (BBC) and then......

Breaking: Adria Richards fired by SendGrid for calling out developers on Twitter

Different views:

From someone who worked with Richards before: Adria Richards, PyCon, and How We All Lost
The apology from PlayHaven's developer: Apology from the developer
Rachel-Sklar from BI: Rachel-Sklar takes
Someone thinks this is a good time to sell their T-shirt: Fork My Dongle T-Shirt
Is PyCon2013 so bad? (Short answer: no) What really happened at PyCon 2013

Your view:

Frankly, if you want to support women in our industry, donate to this awesome 9 year old.
9 Year Old Building an RPG to Prove Her Brothers Wrong!


Friday Speech-related Links

Future Windows Phone speech recognition revealed in leaked video

Whether or not you like Softie, they have been innovative in speech recognition these past few years.  I am looking forward to their integration of DBN into many of their products.

German Language Learning Startup Babbel Buys Disrupt Finalist PlaySay To Target The U.S. Market

Not exactly ASR, but language learning has been a mainstay.  Look at EnglishCentral; they have been around and kicking well.

HMM with scikit-learn

When I first learned about HMMs, I always hoped to use a scripting language to train the simplest HMM.  scikit-learn is one such piece of software.
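In that scripting spirit, the simplest HMM computation - a forward pass for a discrete HMM - fits in a few lines of plain Python.  The two-state model below is made up purely for illustration.

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Return the total probability of an observation sequence under the model."""
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[p] * trans_p[p][s] for p in states) * emit_p[s][o]
                 for s in states}
    return sum(alpha.values())

states = ("A", "B")
start_p = {"A": 0.5, "B": 0.5}
trans_p = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.1, "B": 0.9}}
emit_p = {"A": {"x": 0.8, "y": 0.2}, "B": {"x": 0.3, "y": 0.7}}
prob = forward(("x", "y"), states, start_p, trans_p, emit_p)
print(prob)  # ~0.1975
```

Training (Baum-Welch) builds on this same forward pass, which is what a library like scikit-learn wraps up for you.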

Google Keep

Voice memo is a huge market.  But mobile continuous speech recognition is a very challenging task.  Yet, with Google's technology, I think it should be better than its competitor, Evernote.


Thursday, March 21, 2013

Thursday Links (FizzBuzz programming, Samsung, Amazon and more)


Placebo Surgery : Still think acupuncture is a thing?

Expertise, the Death of Fun, and What to Do About It by James Hague

Indeed, it gets harder to learn.  My two cents: always keep notes on your work.  See every mistake as an opportunity to learn.   And always learn new things; never stop.

FizzBuzz programming (2007)

It's sad that it is true.

Technology in general:

Samsung smartwatch product

I am still looking forward to Apple's product more.   I guess I was there when the iPhone came out; it's rather hard not to say Samsung plagiarized.......

The Economics of Amazon Prime (link)

When I go to Amazon, using Prime has indeed become an option,  especially for the thousands of ebooks which cost less than $2.99.   Buying ten of them comes very close to the monthly subscription fee of Amazon Prime.

Starbucks and Square don't seem to "mix" well (link)

Other newsworthy:

As Crop Prices Surge, Investment Firms and Farmers Vie for Land

Crop prices have reversed course.  If you are interested in the restaurant business (like me), this has a huge impact on the whole food chain.

The many failures of the personal finance industry

Many geeky friends of mine don't make good sense of personal finance.  This is a good link for understanding the industry.


Thursday Speech-related Readings

Speech Recognition Stumbles at Leeds Hospital

I wonder who the vendor is.

Google Peanut Gallery (Slate)

Interesting showcase again.  Google always has pretty impressive speech technology.

Where Siri Has Trouble Hearing, a Crowd of Humans Could Help

Combining fragments of recognition is a rather interesting idea, though it's probably not new.  I am glad it is taking off though.

Google Buys Neural Net Startup, Boosting Its Speech Recognition, Computer Vision Chops

This is huge.  Once again, it says something about the power of the DNN approach. It is probably the real focus of the next 5 years.

Duolingo Adds Offline Mode And Speech Recognition To Its Mobile App

I always wonder how the algorithm works.  Confidence-based verification algorithms have always been tough to get working.  But then again, the whole deal of reCAPTCHA is really to differentiate between humans and machines.  So it's probably not as complicated as I thought.

Some notes on DNS 12: link

The whole-sentence mode is the more interesting part.  Does it make users more frustrated, though? I am curious.


Tuesday, March 19, 2013

Landscape of Open Source Speech Recognition Software (II : Simon)

Around December last year, I wrote an article on open source speech recognizers.  I covered HTK, Kaldi and Julius.   One thing you should know: just like CMUSphinx,  all of these packages contain their own implementations of the Viterbi algorithm.   So when you ask someone in the field of speech recognition, they will usually say the open source speech recognizers are Sphinx, HTK, Kaldi and Julius.

That's how I usually view speech recognition too.    After years working in the industry, though, I have started to realize that this definition of speech recognizer = Viterbi algorithm can be constraining.   In fact,  from the user's point of view,  a good speech application system should be a combination of

a recognizer + good models + good GUI.

I like to call the former type of "speech recognizer" a "speech recognition engine" and the latter type a "speech recognition application".   Both types of "speech recognizers" are worthwhile.   From the users' point of view, it might just be a technicality to differentiate them.

As I am recovering as a speech recognition programmer (another name to throw around :) ),  one thing I notice is that there is much effort on writing "speech recognition applications".   It is a good trend, because most people from academia really didn't spend much time writing good speech applications.   And in open source, we badly need good applications such as dictation machines, IVR and C&C.

One effort which has really impressed me is Simon.   It is weird, because most of the time I only care about engine-level software.   But in the case of Simon, you can see that a couple of its features really solve problems in real life and integrate into the bigger theme of open source speech recognition.

  • In 0.4.0, Simon started to integrate with Sphinx.   So if someone wants to develop it commercially, they can.
  • The Simon team also intentionally built context switching into the application; that's good work as well.   In general, if you always use a huge dictionary, you are just over-recognizing words in a given context. 
  • Last but not least, I like the fact that it integrates with Voxforge.  Voxforge is the open source answer to the large speech databases of commercial speech companies.  So integration with Voxforge will ensure an increasing amount of data for your application.
So kudos to the Simon team!  I believe this is the right kind of thinking to start a good speech application. 


sphinxbase 0.8 and SphinxTrain 1.08

I have done some analysis on sphinxbase 0.8 and SphinxTrain 1.0.8 to try to understand whether they are very different from sphinxbase 0.7 and SphinxTrain 1.0.7.  I don't see big differences, but it is still a good idea to upgrade.

  • (sphinxbase) The bug in cmd_ln.c is a must-fix.  Basically, the freeing was wrong for all ARG_STRING_LIST arguments.  So chances are you will get a crash when someone specifies a wrong argument name and cmd_ln.c forces an exit, which eventually leads to a cmd_ln_val_free. 
  • (sphinxbase) There were also a couple of changes in the fsg tools.  Mostly I feel those are rewrites.  
  • (SphinxTrain) SphinxTrain, on the other hand, has new tools such as the g2p framework.  Those are mostly openfst-based tools.  And it's worthwhile to put them into SphinxTrain. 
One final note here: there is a tendency in CMUSphinx, in general, to turn to C++.   C++ is something I love and hate. It can sometimes be nasty, especially when dealing with compilation.  At the same time, using C to emulate OOP features is quite painful.   So my hope is that we use a subset of C++ which is robust across different compiler versions. 


Monday, March 18, 2013

Python multiprocessing

As my readers may have noticed, I haven't updated this blog much, as I have a pretty heavy workload. It didn't help that I was sick in the middle of March as well. Excuses aside, though, I am happy to come back. Even if I can't write much about Sphinx and programming, I think it's still worth it to keep posting links.

I have also received requests to write more details on individual parts of Sphinx.   I love these requests, so feel free to send me more.   Of course, it usually takes me some time to fully grok a certain part of Sphinx before I can describe it in an approachable way.   So until then, I can only ask for your patience.

Recently I have run into parallel processing a lot and was intrigued by how it works in practice. In Python, a natural choice is the multiprocessing library. So here is a simple example of how you can run multiple processes in Python. It is very useful on modern multi-core CPUs.

Here is an example program of how that could be done:

1:  import multiprocessing  
2:    
3:  def process(i):  
4:      print('Running task %d' % i)  
5:    
6:  if __name__ == '__main__':  
7:      N = 4  # number of tasks, for illustration  
8:      jobs = []  
9:      for i in range(N):  
10:          p = multiprocessing.Process(target=process,  
11:                                      name='TASK' + str(i),  
12:                                      args=(i,))  
13:          jobs.append(p)  
14:          p.start()  
15:      for j in jobs:  
16:          if j.is_alive():  
17:              print('Waiting for job %s' % j.name)  
18:              j.join()  

The program is fairly trivial. Interestingly enough, it is also quite similar to the multithreading version in Python. Lines 9 to 14 are where the tasks are launched, and lines 15 to 18 wait for the tasks to finish. (The process function here is just a trivial stand-in for whatever work each task should do.)

It feels a little less elegant than using Pool, which provides a waiting mechanism for the entire pool of tasks.  Right now, I am essentially waiting on the jobs one by one, so I may be blocked on a job that is still running long after other jobs have finished.
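For comparison, here is a minimal sketch of what a Pool-based version could look like (Python 3; the squaring worker is only a placeholder task, not anything from Sphinx):

```python
import multiprocessing

def process(i):
    # Placeholder task: just square the input.
    return i * i

if __name__ == '__main__':
    # The with-block tears the pool down for us, and map() blocks
    # until every task has finished, so no manual join loop is needed.
    with multiprocessing.Pool() as pool:
        results = pool.map(process, range(4))
    print(results)  # [0, 1, 4, 9]
```

Pool.map also returns the results in input order, which the manual Process loop does not give you for free.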

Is it worthwhile to go down another path, thread-based programming?  One thing I learned in this exercise is that in older versions of Python, a multi-threaded program can paradoxically be slower than the single-threaded one, because of the global interpreter lock. (See this link from Eli Bendersky.) It could be an issue that has eased in recent Python, though.


Thursday, February 28, 2013

Readings at Feb 28, 2013

Taeuber's Paradox and the Life Expectancy Brick Wall by Kas Thomas

Simplicity is Wonderful, But Not a Requirement by James Hague

Yeah.  I knew a professor who always wanted to rewrite speech recognition systems so that they were easier to use for research.   Ahh...... modern speech recognition systems are complex anyway.   Not making mistakes is already very hard, not to say building a good research system which is easy for everyone to use. (Remember, everyone has a different research goal.)


Monday, February 25, 2013

On sh*tty job.

I read "Why Hating Your Shitty Job Only Makes It Worse".  There is something positive about the article, but I can't completely agree with the author.

Part of the dilemma of working in a traditional office space is that inevitably some kind of a*holes and bad systems will appear in your life.   The question is whether you want to ignore them or not.   You should be keenly aware of your work conditions and make a rational decision about staying or leaving.


Monday, February 18, 2013

A look on Sphinx3's initialization

I worked on Sphinx 3 a lot.  These days, it is generally regarded as an "old-style" recognizer compared to Sphinx 4 and PocketSphinx.   It is also not officially supported by the SourceForge guys.

Coders of speech recognition think a little bit differently.  They usually stick to a certain codebase which they feel comfortable with.   For me, it is not just a personal preference; it also reflects how much I know about a certain recognizer.  For example, I know quite a bit about how Sphinx 3 performs.   These days, I am trying to learn how Sphinx 4 fares as well.   So far, if you ask me to choose an accurate recognizer, I will still probably choose Sphinx 3, not because the search technology is better (Sphinx 4 is way superior), but because it can easily be made to support several advanced modeling types.  This seems to be how the 2010 developer meeting concluded as well.

But that is just me. In fact, I am bullish on all Sphinx recognizers.  One thing I want to note is the power of Sphinx 4 in development.  Many projects are based on Sphinx 4.  These days, if you want to get a job working on speech recognizers, knowing Sphinx 4 is probably a good ticket.  That's why I am quite keen on learning it more, so hopefully I can write about both recognizers.

In any case, this is a Sphinx 3 article.  I will probably write more on each component.   Feel free to comment.

How Sphinx3 is initialized:

Here is a listing of the functions called when Sphinx 3 is initialized, taken from Sphinx 3.0.8.  Essentially, there are 3 layers of initialization: kb_init, kbcore_init and s3_am_init.  Separating kb_init and kbcore_init probably started very early in Sphinx 3, whereas separating s3_am_init from kbcore_init was probably from me. (So all blame is on me.)  That was to support -hmmdir.

 kb_init  
      -> kbcore_init (*)  
      -> beam_init  
      -> pl_init  
      -> fe_init  
      -> feat_array  
      -> stat_init  
      -> adapt_am_init  
      -> set operation mode  
      -> srch_init  

 kbcore_init (*)  
      -> Look for feat.params very early on.   
      -> logmath_init  
      -> feat_init  
      -> s3_am_init (*)  
      -> cmn_init  
      -> dict_init  
      -> misc. models init  
         mgau_init such as  
         -> subvq_init  
         -> gs_read  
      -> lmset_init  
      -> fillpen_init  
      -> dict2pid_build <- Should put into search  
      -> read_lda  

 s3_am_init (*)  
      -> read in mdef.   
      -> depends on -senmgau type  
         .cont. mgau_init  
         .s2semi. s2_semi_mgau_init  
             if (-kdtree)  
         .semi or .s3cont.   
      -> tmat_init  
  • -hmmdir overrides all other sub-parameters. 


Toning down on Blogging......

Guess I was too confident again about building up this site.  Once again, work took much of my time and I couldn't work on this blog as much as I hoped.

It also depends on whether my work is something camera-ready.   Currently I have a couple of articles which are ready to hash out but need some refinement.

Let's see if I can come back a month later.  For now, you might see a lot of filler material on this blog.


Wednesday, February 06, 2013

Readings at Feb 6, 2013

Employees Leave Managers, Not Companies.  Well said.

Caffeine Jitters.  As a programmer, I found drinking any caffeinated drinks anti-productive.   In the short term, coke, coffee or energy drinks give you a kick.  In the long term, they are debilitating and make you feel tired.  

I think Dave is the guy who taught me to drink tea at CMU; I haven't changed this habit since.


Tuesday, February 05, 2013

Should we go to College?

Reading James Altucher's "I Was Blind But Now I See", he makes a controversial point: don't send kids to college.   Before you throw stuff, his point is sophisticated.   You might think you could refute him by saying, "What about professions such as lawyer and doctor?"  But then Altucher counters that to be a professional, you just need to read the right books and ask the right questions of the right people.   It is difficult to refute: say you want to learn programming; taking a university course and getting credits doesn't really help that much.   Working on an open source project or an internship does.    In speech recognition, classes may be useful, but at the end of the day, reading papers or generally talking with experts in the field is the real help. 

So what is the meaning of university then?   Though I have many friends who have graduate degrees and I have a master's myself, I do appreciate Altucher's point, because what he said highlights some of my doubts about the college education system.   E.g., is a person really smarter after 5 years of college education?   Do they learn better?   Is it worth the $50,000 debt?    When I look at many of my friends, most of the time the answer is no.    The truth is that many of those who want to learn will seek out college education after they have some experience.    They actively seek the knowledge they lack.  On the other hand, when I look at many of my PhD friends, they either have no motivation to learn or their duties give them no time to learn.   It is a pity.   

I believe learning is a life-long pursuit and it should be independent of any institution.   


Thursday, January 31, 2013

January 2013 Write-up

Miraculously, I still have some momentum for this blog and I have kept up the daily posting schedule.

Here is a write-up for this month.  Feel free to look at this post on how I plan to write this blog:

Some Vision of the Grand Janitor's Blog

Sphinx' Tutorials and Commentaries

SphinxTrain1.07's bw:

Commentary on SphinxTrain1.07's bw (Part I)
Commentary on SphinxTrain1.07's bw (Part II)

Part I describes the high-level layout; Part II describes how half of the state network is built.

Acoustic Score and Its Sign
Subword Units and their Occasionally Non-Trivial Meanings

Sphinx 4 from a C background : Material for Learning


Goldman Sachs not Liable
Aaron Swartz......

Other writings:

On Kurzweil : a perspective of an ASR practitioner



Goldman Sachs not Liable

Here is Bloomberg's piece.   Sounds like the case is really closed.

Of course, we are all still feeling the consequences.


Wednesday, January 30, 2013

Speech-related Readings at Jan 30, 2013

Amazon acquired Ivona:

I am aware of Amazon's involvement in ASR, though which domain they will target is still a question.

Goldman-Dragon Trial:

I simply hope Dr. Baker gets closure on the whole thing.   In fact, when you think about it, the whole L&H fallout is the reason why the ASR industry has a virtual monopoly now.  So if you are interested in ASR, you should be concerned.


Tuesday, January 29, 2013

Subword Units and their Occasionally Non-Trivial Meanings

While I was writing the next article on bw, I found myself forgetting the meanings of the different types of subword units (i.e. phones, biphones, diphones, triphones and such).  So I decided to write a little note.

On this kind of topic, someone is likely to come up and say "X always means Y, bla bla bla, etc. and so on."  My view (and hope) is that the wording of a term should reflect what it means.  So when I hear a term and can come up with multiple definitions in my head, I would say the naming convention is a little bit broken.

Phones vs Phonemes
Linguists distinguish between phonemes and phones.  The former usually means an abstract categorization of a sound, whereas the latter usually means the actual realization of a sound.

In a decoder though, what you see most is the term phone.

Biphones vs Diphones

(Ref here) " ..... one important difference between the two units. Biphones are understood to be just left or right context dependent (mono)phones. On the other hand, diphones represent the transition regions that stretch between the two "centres" of the subsequent phones. "

So that's why there can be left-biphones and right-biphones.  Diphones are intuitively better for synthesis.

The possible combinations of left-biphones/right-biphones/diphones are all N^2, with N equal to the number of phones. 

Btw, the link I gave also has a term called "bi-diphone", which I don't think is a standard term. 


Triphones

For most of the time, it means considering both the left and the right context.  Possible combinations: N^3. 


Quinphones

For most of the time, it means considering both the two left and two right contexts.  Possible combinations: N^5. 


Heptaphones

For most of the time, it means considering both the three left and three right contexts.  Possible combinations: N^7. 
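To get a feel for how quickly these context inventories grow, here is a quick back-of-the-envelope calculation (N = 40 is a hypothetical phone-set size for illustration; real inventories vary by language):

```python
# Count the possible context-dependent units for a phone set of size N.
N = 40

units = {
    "biphone/diphone": N ** 2,  # one neighbouring context
    "triphone":        N ** 3,  # 1 left + 1 right context
    "quinphone":       N ** 5,  # 2 left + 2 right contexts
    "heptaphone":      N ** 7,  # 3 left + 3 right contexts
}

for name, count in sorted(units.items(), key=lambda kv: kv[1]):
    print("%-16s %15d" % (name, count))
```

Only a small fraction of these units ever appear in training data, which is one reason state tying matters so much in practice.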

"Quadphones" and Other possible confusions in terminology. 

I guess what I don't feel comfortable with are terms such as "quadphones".   Even quinphones and heptaphones can potentially mean different things from time to time.  

For example, if you look at the LID literature, occasionally you will see the term quadphone.  But it seems the term "phone 4-gram" (or more correctly quadgram...... if you think about it too much) might be a nicer choice.  

Then there is the question of what the context looks like: 2 left and 1 right? 1 left and 2 right?   Come to think of it, this terminology is confusing even for triphones, because a triphone could also mean a phone depending on 2 left or 2 right phones.  ASR people don't feel that way, probably because of a well-established convention.  Of course, the same can be said for quinphones and heptaphones. 


Monday, January 28, 2013

Readings at Jan 28, 2013

Tools of the Trade : Mainly an iOS article, but it covers many tools for maintaining contacts, task lists and requests.
C11 : I had no idea C99 tried to introduce variable-length arrays.  They certainly haven't been very successful in the past 10 years.....   Another great book to check out is Ning's C book.
How to make iPhone App that actually sells : Again, probably not just for iOS but generally for writing free/shareware.
Bayesian vs Non-Bayesian:  Nice post.  I don't fully grok the Bayesian/Non-Bayesian debate, but as far as I know, they are essentially two schools of thought. (ASR? The whole training process starts from a flat start, you figure.)

Saturday, January 26, 2013

On Kurzweil : a perspective of an ASR practitioner

Many people who don't work in the fields of AI, NLP and ASR have heard of Kurzweil.   To my surprise, many seem to give little thought to what he said and just follow his theories wholeheartedly.

In this post, I just want to comment on one single little thing: whether real-time speech-to-speech translation can be achieved in the 2010s.  This is a very specific prediction from Kurzweil's book "The Singularity is Near".

My discussion will mainly focus on ASR first.  So even though my examples below are not exactly pure ASR systems, I will skip the long-winded wording of saying "the ASR of System X".  And to be frank, MT and response systems probably go through similarly torturous development processes anyway.   So, please, don't tell me that "System X is actually ASR + Y"; that is sort of beside the point.

Oh well, you probably ask: why bother?  Don't we have a demo of real-time speech-to-speech translation from Microsoft already?

True, but if you observe the demo carefully, it is based on read speech.  I don't want to speculate much, but I doubt it is a generic language model that wasn't tuned to the lecture.   In a nutshell, I don't believe it is something you can use in real life.

Let's think of a more real-life example: Siri.  Are we really getting 100% correct responses now?  (Not to boil it down to ASR WER ...... yet.)  I don't think so.  Even with adaptation, I don't think Siri understands what I say every single time.    Most of the time, I follow the unofficial command list of Siri and let it improve with adaptation..... but still, it is not perfect.

Why? It is the hard, cold reality: ASR is still not perfect, despite all the advancements in HMM-based speech recognition.  All the key technologies we have known in the last 20 years: CMLLR, MMIE, MPE, fMPE, DMT, consensus networks, multipass decoding, SAT, SAT with MMIE, all the nicest front-ends, all the engineering.   Nope, we do not yet have a feel-good accuracy.  Indeed, human speech recognition is not at 0% WER either, but for some reason, the current state-of-the-art ASR performance is not reaching there.

And Siri, we all know, is the state of the art.

Just to digress a little bit: most critics, when they write to this point, will lament that "oh, there is just some invisible barrier out there and humans just couldn't build a Tower of Babel, blabla....".  I believe most of these "critics" have absolutely no idea what they are talking about.   To identify these air-head critics, just check whether they put "cognitive science" into the article; then you know they have never worked on a real-life ASR system.

I, on the other hand, do believe one day we can get there.  Why?  Because people who work on these speech recognition evaluation tasks would tell you: given a certain test set and with enough time and gumption, you would be able to create a system without any errors.  So to me, it is more an issue of whether someone keeps grinding on the problem, not a feasibility issue.

So where are we now in ASR?  Recently, at ICASSP 2012, a Google paper trained on 87 thousand hours of data.  That is probably the largest scale of training I know of.  And where are we now? 10% WER, down from 12%.  The last big experiment I know of before that was probably the 3000-hour experiment back in 2006-7.  The Google authors are probably using a tougher test set, so the initial recognition rate was yet again lower.

Okay, speculation time.  Let's assume that humans can collect 10 times more labelled data every 6-7 years AND that we can do AM training on it.  When will we get to, say, 2% WER on the current Google test set?   If we just do a very simple linear extrapolation, it will take 4 × 6 years = 24 years to collect 10000 times more data (or around 870 million hours of data).    So we are way, way past the 2010s deadline from Kurzweil.
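The back-of-the-envelope arithmetic above can be spelled out explicitly. This is a deliberately crude model, with assumptions taken from the rough figures in this post: about 2 absolute WER points gained per tenfold increase in data, and about 6 years to collect each tenfold increase:

```python
# Crude linear extrapolation from the 12% -> 10% WER jump per 10x more data.
wer_now = 10.0        # percent WER after 87k hours of training data
wer_target = 2.0      # percent WER we would like to reach
points_per_10x = 2.0  # absolute WER points gained per 10x data (assumed)
years_per_10x = 6     # years needed to collect 10x more data (assumed)

tenfold_jumps = (wer_now - wer_target) / points_per_10x  # 4 tenfold jumps
years_needed = tenfold_jumps * years_per_10x             # 24 years
data_factor = 10 ** tenfold_jumps                        # 10000x more data

print(tenfold_jumps, years_needed, data_factor)  # 4.0 24.0 10000.0
```

The point is not the exact numbers but the shape of the argument: even under generous assumptions, the timeline lands far outside the 2010s.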

And that's a wild speculation.   Computational resources will probably work themselves out by then.  What I doubt most is whether the progress will be linear.  

Of course, it might be non-linearly better too.  But here is another point: it's not just about the training set, it's about the test set.  If we truly want a recognizer to work for *everyone* on the planet, then the right thing to do is to test your recognizer on the whole population.  If we can't, then we want to sample enough human speech to represent the Earth's population; the current test sets might not be representative enough.   So it is possible that when we increase our test set, we find that the initial recognition rate goes down again.   And it seems to me our test sets are still far from mimicking the human population.

My discussion so far has mostly been on the acoustic model.  On the language model side, the problem will mainly be domain specificity.   Also bear in mind that human language evolves.  So, say we want to build a system which builds a customized language model for each human being on the planet.  At any particular moment in time, you might not be able to get enough data to build such a language model.

For me, the point of the whole discussion is that ASR is an engineering system, not some idealistic discussion topic.  There will always be tradeoffs.   You may say: "What if a certain technology Y emerges in the next 50 years?"  I hear that a lot; Y could be quantum computing or brain simulation or brain-computer interfaces or a machine implementation of the brain.    Guys..... those, I have to admit, are very smart ideas of our time, and given another 30-40 years, we might see something useful.   For now, ASR really has nothing to do with them.  I have never heard of a machine implementation of the auditory cortex, or even an accurate reconstruction of the auditory pathway.   Nor has there been easy progress in dissecting the mammalian inner ear to understand what's going on in the human ear.   From what I know, we seem to know some things, but there are lots of other things we don't know.

That's why I think it's better to buckle down and just try to work out our stuff.  Meaning, try to come up with more interesting mathematical models, try to come up with more computationally efficient methods.   Those .... I think, are meaningful discussions.   As for Kurzweil, no doubt he is a very smart guy, but at least on ASR, I don't think he knows what he is talking about.

Of course, I am certainly not the only person who complains about Kurzweil.  Look at Douglas Hofstadter's criticism:

"It’s as if you took a lot of very good food and some dog excrement and blended it all up so that you can't possibly figure out what's good or bad. It's an intimate mixture of rubbish and good ideas, and it's very hard to disentangle the two, because these are smart people; they're not stupid."

Sounds very reasonable to me.


Friday, January 25, 2013

Acoustic Score and Its Sign

Over the years, I have been asked why the acoustic score can sometimes be a positive number. That occasionally leads to big confusion for beginner users. So I write this article as a kind of road sign for people.

The acoustic score per frame is essentially the log value of a continuous probability density function (pdf). In Sphinx's case, the pdf is a multi-dimensional Gaussian distribution. The acoustic score per phone is then the log likelihood of the phone HMM. You can extend this definition to word HMMs.

Now, for the sign. If you think of a discrete probability distribution, then this acoustic score thingy should always be negative. (Because the log of a number between 0 and 1 is negative.) In the case of a Gaussian density though, when the standard deviation is small, it is possible for the density value to be larger than 1. (Also see this link.) Those are the times you will see a positive value.
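A one-dimensional sketch makes this concrete: a Gaussian density with a small standard deviation exceeds 1 near its mean, so its log becomes positive. (This is just the textbook univariate Gaussian density, not Sphinx code.)

```python
import math

def gaussian_pdf(x, mean, stdev):
    """Density of a univariate Gaussian N(mean, stdev^2)."""
    z = (x - mean) / stdev
    return math.exp(-0.5 * z * z) / (stdev * math.sqrt(2.0 * math.pi))

# With stdev = 1, the peak density is about 0.399: log score negative.
print(math.log(gaussian_pdf(0.0, 0.0, 1.0)))  # about -0.919

# With stdev = 0.1, the peak density is about 3.989: log score positive.
print(math.log(gaussian_pdf(0.0, 0.0, 0.1)))  # about +1.384
```

In the multi-dimensional case the effect compounds dimension by dimension, which is partly why the magnitudes grow so large.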

One thing you might find disharmonious is the magnitude of the likelihoods you see. Bear in mind that Sphinx 2 and Sphinx 3 use a very small logbase, and we are also talking about a multi-dimensional Gaussian distribution. Both make the numerical values bigger.


Also see:
My answer on the Sphinx Forum

Thursday, January 24, 2013

Commentary on SphinxTrain1.07's bw (Part II : next_state_utt.c's First Half)

I will go on with the analysis of bw.  In my last post, we looked at the high-level structure of the bw program.  So we now turn to the details of how the state lattice is built, or how next_state_utt.c works.

As always, here is the standard disclaimer: this is another long and super technical post which I only expect a small group of programmers with an esoteric interest in the Baum-Welch algorithm to read.   It's not for the faint of heart, but if you understand the whole thing, you will have a basic but genuine understanding of the Baum-Welch algorithm in real life.

The name of the module, next_state_utt.c, is quite unassuming, but it is an important key to understanding the Baum-Welch algorithm.  The way the state lattice is structured affects how parameter estimation works.   The same statement holds not only for Baum-Welch estimation but also for other estimation algorithms in speech.

But what is so difficult about coding the lattice?  Here are two points I think are worthwhile to point out:

  1. I guess an average programmer can probably work out a correct concatenation of all phone HMMs, if all phones are context-independent, in 3 days to 1 week.  But many advanced systems use context-dependent phones.  So you have to make sure that at the right time, the right triphone state is used.
  2. In Sphinx, it gets a bit more complicated because there is a distinction between positions of triphones.  This is quite specific to Sphinx and you won't find it in systems such as HTK.  So it further complicates the coding.  You will see in the following discussion that Sphinx backs off from position-dependent triphone estimation from time to time.   In my view, it might not be too different from a position-independent triphone system.  (It's certainly fun to read. :) ) 
My goal here is to analyze next_state_utt.c, how it works, and to point out some parts one could improve.

High-Level Layout of next_utt_state.c:

 -> mk_wordlist (mk_wordlist.c)  
 -> mk_phone_list (mk_phonelist.c)  
 -> cvt2triphone (cvt2triphone.c)  
 -> state_seq_make (state_seq_make.c)  

Nothing fancy here. We first make a word list (mk_wordlist), then make a phone list (mk_phone_list), then convert the phone list to triphones (cvt2triphone), then create the state sequence (state_seq_make).

Now before we go on, just at this level, you may already discover one issue with Sphinx's bw:  it is using a flat list to represent phone models.   So if you want to model a certain word with multiple pronunciations, you probably can't do it without changing the code.

Another important thing to note: just like many non-WFST systems, it is not that easy to build a simple phoneme system with Sphinx.  (HTK is an exception, but you can always turn on a flag to expand context. Just look up the manual.)  Say you want to express your phoneme system as one-phoneme words; then you would want your dictionary to look like:

     AA AA  
     B B  
     (and so on, one entry per phone)
But then, if a word is a phone, should you actually build a network of cross-word triphones?  You probably want to if you are shooting for performance: all of the most accurate phoneme-based systems have some sort of context-dependency there.  (The Brno recognizer probably has some, but I don't really grok why it is so good.)

But if you want to do your own interesting experiments, this fixed behavior may not suit your appetite.   Maybe you just want a context-independent phone system for some toy experiments.  But then, you are probably always building a triphone system.  So it might or might not be what you like.

So if you really want to trigger the CI-model behavior, what can you do?  Take a look at my next post: in cvt2triphone.c, if the model definition file only specifies CI states, then no triphone conversion will occur.   In a way, the system assumes that if you just train the CI model, you will get the CI model, but there is no explicit way to turn triphone expansion off.


mk_wordlist is rather trivial:

 char **mk_wordlist(char *str,  
             uint32 *n_word)  
 {  
    uint32 n_w;  
    uint32 i;  
    char **wl;  
    n_w = n_words(str);  
    wl = ckd_calloc(n_w, sizeof(char *));  
    wl[0] = strtok(str, " \t");  
    for (i = 1; i < n_w; i++) {  
       wl[i] = strtok(NULL, " \t");  
    }  
    assert(strtok(NULL, " \t") == NULL);  
    *n_word = n_w;  
    return wl;  
 }  

Given one line of transcript, mk_wordlist transforms it into an array of C strings.  Only the array of pointers is allocated; the words themselves point into the original string, which strtok tokenizes in place.
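For readers who think in Python, the equivalent of this routine is essentially a one-liner (a hypothetical re-implementation for illustration, not part of SphinxTrain):

```python
def mk_wordlist(transcript):
    # Split one line of transcript on whitespace (spaces and tabs),
    # mirroring the repeated strtok(str, " \t") calls in the C version.
    return transcript.split()

print(mk_wordlist("MY NAME IS CLINTON"))  # ['MY', 'NAME', 'IS', 'CLINTON']
```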


mk_phone_list is still trivial, but there is a bit more detail:

1:  acmod_id_t *mk_phone_list(char **btw_mark,  
2:                  uint32 *n_phone,  
3:                  char **word,  
4:                  uint32 n_word,  
5:                  lexicon_t *lex)  
6:  {  
7:    uint32 n_p;  
8:    lex_entry_t *e;  
9:    char *btw;  
10:    unsigned int i, j, k;  
11:    acmod_id_t *p;  
12:    /*  
13:     * Determine the # of phones in the sequence.  
14:     */  
15:    for (i = 0, n_p = 0; i < n_word; i++) {  
16:       e = lexicon_lookup(lex, word[i]);  
17:       if (e == NULL) {  
18:         E_WARN("Unable to lookup word '%s' in the lexicon\n", word[i]);  
19:         return NULL;  
20:       }  
21:       n_p += e->phone_cnt;  
22:    }  
23:    /*  
24:     * Allocate the phone sequence  
25:     */  
26:    p = ckd_calloc(n_p, sizeof(acmod_id_t));  
27:    /*  
28:     * Allocate the between word markers  
29:     */  
30:    btw = ckd_calloc(n_p, sizeof(char));  
31:    for (i = 0, k = 0; i < n_word; i++) {     /* for each word */  
32:       e = lexicon_lookup(lex, word[i]);  
33:       if (e->phone_cnt == 0) /* Ignore words with no pronunciation */  
34:        continue;  
35:       for (j = 0; j < e->phone_cnt-1; j++, k++) {     /* for all but the last phone in the word */  
36:         p[k] = e->ci_acmod_id[j];  
37:       }  
38:       p[k] = e->ci_acmod_id[j];     /* move over the last phone */  
39:       btw[k] = TRUE;               /* mark word boundary following  
40:                             kth phone */  
41:       ++k;  
42:    }  
43:    *btw_mark = btw;  
44:    *n_phone = n_p;  
45:    assert(k == n_p);  
46:    return p;  
47:  }  

In lines 15-22, we first look up the pronunciations of the words. (Remember, right now we can only look up one pronunciation per word.) Line 26 then allocates the array of phone IDs (of type acmod_id_t).

Now here is the special part of the code: other than the array of phones, it also allocates an array called the "between word markers".  So what's the mechanism?  Let me give an example.

Suppose you have a transcript with word sequence "MY NAME IS CLINTON"

       mk_word_list would create

     word[0] -> MY  
     word[1] -> NAME  
     word[2] -> IS  
     word[3] -> CLINTON  

       mk_phone_list (with my best guess of pronunciations) would create
     ph[0] -> M      btw[0] -> 0
     ph[1] -> AY      btw[1] -> 1
     ph[2] -> N      btw[2] -> 0
     ph[3] -> EY      btw[3] -> 0
     ph[4] -> M      btw[4] -> 1
     ph[5] -> IY      btw[5] -> 0
     ph[6] -> S      btw[6] -> 1
     ph[7] -> K      btw[7] -> 0
     ph[8] -> L      btw[8] -> 0
     ph[9] -> IY      btw[9] -> 0
     ph[10] -> N      btw[10] -> 0
     ph[11] -> T      btw[11] -> 0
     ph[12] -> AX     btw[12] -> 0
     pH[13] -> N      btw[13] -> 1

So essentially it indicates that a word ends at a certain phone.

I believe such a representation is for convenience: it facilitates determining whether a phone is at the beginning, the middle or the end of a word.
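A small Python sketch of the same bookkeeping may help (the toy lexicon and the function are hypothetical, merely mirroring the logic of the C code above):

```python
def mk_phone_list(words, lexicon):
    """Flatten a word sequence into a phone list plus between-word markers.

    btw[k] is True exactly when phone k is the last phone of a word,
    just like the btw array built in next_state_utt.c.
    """
    phones, btw = [], []
    for word in words:
        pron = lexicon[word]          # one pronunciation per word
        phones.extend(pron)
        btw.extend([False] * (len(pron) - 1) + [True])
    return phones, btw

lexicon = {"MY": ["M", "AY"], "IS": ["IY", "S"]}
phones, btw = mk_phone_list(["MY", "IS"], lexicon)
print(phones)  # ['M', 'AY', 'IY', 'S']
print(btw)     # [False, True, False, True]
```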

An alternative here would be to insert an optional silence between words.  This, according to the HTK handbook, usually reduces the WER slightly.  It also seems reasonable to figure out the location of a phone using silences as markers.

A digression: acmod_id_t

1:  typedef unsigned char ci_acmod_id_t;  
2:  #define NO_CI_ACMOD     (0xff)  
3:  #define MAX_CI_ACMOD     (0xfe)  
4:  typedef uint32 acmod_id_t;  
5:  #define NO_ACMOD     (0xffffffff)     /* The ID representing no acoustic  
6:                            * model. */  
7:  #define MAX_ACMOD     (0xfffffffe)     /* The max ID possible */  

Now, we used acmod_id_t in mk_phone_list, but what is it really?  So let's take a detour to acmod_id_ds.t ("ds" stands for data structure).

acmod_id_t is essentially a uint32, so it can hold values up to 2^32 - 1.  But notice that NO_ACMOD is reserved as 0xffffffff, so the largest usable ID, MAX_ACMOD, is 0xfffffffe.  The same reservation happens in the 8-bit case: MAX_CI_ACMOD is defined as 0xfe because 0xff is taken by NO_CI_ACMOD.

The more interesting part here: we saw that ci_acmod_id_t is only a character type.  This is obviously a problem: in some languages, one may want to express more than 255 phones. (Why 255?  An unsigned char holds 256 values, and 0xff is reserved for NO_CI_ACMOD.)
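The arithmetic behind that 255 limit is worth spelling out (the constants are copied from the header above):

```python
NO_CI_ACMOD = 0xff   # sentinel meaning "no CI acoustic model"
MAX_CI_ACMOD = 0xfe  # largest usable CI phone ID

# An unsigned char holds 256 distinct values; one of them is reserved
# for the sentinel, leaving 255 usable IDs (0 through 0xfe).
usable_ids = MAX_CI_ACMOD + 1
print(usable_ids)  # 255
```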

We'll meet acmod_set a little more later.  But let us move on first; sometimes code tracing is better motivated when you see the code before the data structure.   Many suggest otherwise: indeed, once you know the data, the code makes more sense.   But in practice, you will most likely read the code first and need to connect things together.   Thus, IMO, both approaches have their merits in code tracing.

So far ..... and next post

To avoid cluttering a single post, I will stop here and put the rest of next_utt_states.c (cvt2triphone and state_seq_make) in another post.   But I want to summarize several things I have observed so far:

  1. The current bw doesn't create a word network, so it has issues handling multiple pronunciations. 
  2. The current bw automatically expands triphone contexts; there is no explicit way to turn this off.
  3. bw does not insert optional silences into the network. 
  4. bw doesn't work for more than 255 CI phones. 
Btw, SphinxTrain 1.08 has several changes which replace mk_wordlist with data structures from sphinxbase.  Those are encouraging changes.  If I have time, I will cover them. 


Tuesday, January 22, 2013

Learning vs Writing

I haven't done any serious writing for a week.  Mostly I have posted interesting readings just to keep up the momentum.   Work is busy, so I slowed down.  Another concern is what to write.   Some of the topics I have been writing about, such as Sphinx 4 and SphinxTrain, take a bit of research to get right.

So far I think I am on the right track.  There are not many bloggers on speech recognition.  (Nick is an exception.)   To really increase awareness of how ASR is done in practice, blogging is a good way to go.

I also describe myself as "recovering" because for a couple of years I hadn't seriously thought about open source Sphinx.  In fact, though I was working on speech-related stuff, I didn't spend much time on mainstream ASR either, because my topic was too esoteric.

Not to mention, many new technologies have emerged in the last few years.   The major one, I would say, is the use of neural networks in speech recognition.  It probably won't replace HMM soon, but it is a mainstay at many sites already.   WFST, with more tutorial-type literature available, has become more and more popular.    In programming, Python is now a mainstay and a job-proof kind of language.  The many useful toolkits such as scipy and nltk by themselves deserve book-length treatment.  Java starts to be like C++, a kind of necessary evil you need to learn.  C++ has a new standard.   Ruby is huge in the Web world and by itself is fun to learn.

All of these new technologies took me back into a kind of learning mode.   So some of my articles have become longer and more detailed.   For now, they probably cater to only a small group of people in the world.   But that's okay: when you blog, you just want to build landmarks on the blogosphere.   Let people come to search for them and benefit from them.   That's my philosophy for going on with this site.