Wednesday, March 27, 2013

GJB Wednesday Speech-related Links/Commentaries (DragonTV, Siri vs Xiao i Robot, Coding with Voice)

The Shanghai company ZhiZhen (智臻網絡科技) is suing Apple for infringing its patents (the original Shanghai Daily article).  According to the news, ZhiZhen had already developed the engine for Xiao i Robot (小i機械人) back in 2006.  A video from 8 months ago is below. 

Technically, it is quite possible that a Siri-like system could have been built in 2006.  (Take a look at Olympus/Ravenclaw.)  Of course, the Siri-like interface you see here was certainly built after the advent of smartphones (which, by my definition, means after the iPhone was released).   So overall, it's a bit hard to say who is right.  

Of course, when interpreting news from China, it's tempting to use slightly different logic. In the TC article, the OP (Etherington) suggested that the whole lawsuit could be state-orchestrated, possibly related to Beijing's recent attacks on Apple. 

I don't really buy the OP's argument: Apple is constantly sued, in China and around the world.  It is hard to link the two events together.  

This is definitely not the Siri for TV.

Oh well, Siri is not just speech recognition; there is also the smart interpretation at the sentence level: scheduling, making appointments, doing the right search.   Those are challenges by themselves.    In fact, I believe Nuance only provides the ASR engine for Apple. (Can't find the link; I read it from Matthew Siegler.)

In the scenario of TV,  what annoys users most is probably switching channels and searching for programs.  If I built a TV, I would also eliminate any set-top boxes. (So cable companies would hate me a lot.) 

Given the technology profiles of all the big companies, Apple seems to own all the technologies needed.  It also takes quite a lot of design (with taste) to realize such a device. 

Using Python to code by Voice

Here is an interesting look at how ASR can be used for coding.   Some notes/highlights:
  • The speaker, Travis Rudd, developed RSI 2 years ago.  After a climbing accident, he decided to code by voice instead.  His RSI has since recovered, but he says he still codes by voice 40-60% of the time. 
  • 2,000 voice commands, which are not necessarily English words.   He used Dragonfly to control Emacs on Windows.
  • How do variables work?  It turns out most variable names are actually English phrases, and there are specific commands to get those phrases delimited by different characters. 
  • The speaker said "it's not very hard" for others to repeat.  I believe some amount of customization is needed; it took him around 3 months.  That's about how long a solutions engineer needs to tune an ASR system. 
  • The best language to program by voice: Lisp. 
One more thing: Rudd also believes it would be very tough to do the same thing with CMUSphinx.  
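The commands themselves can be imagined as a big phrase-to-text mapping. Here is a toy sketch in Python of how such a mapping, and the phrase-delimiting trick for variable names, might look. The command words and snippets below are entirely made up, and the real setup uses Dragonfly grammars rather than a plain dictionary:

```python
# Hypothetical command table: spoken phrase -> emitted text.
# (Rudd's real setup had ~2000 Dragonfly commands; the three
# phrases below are invented purely for illustration.)
COMMANDS = {
    'deaf': 'def ',   # a short non-word is faster to say than "define"
    'lap': '(',
    'rap': ')',
}

def emit(phrases):
    # Translate a recognized sequence of command words into code;
    # unknown words pass through as identifiers.
    return ''.join(COMMANDS.get(p, p + ' ') for p in phrases)

def variable(phrase, sep='_'):
    # Turn an English phrase into an identifier, delimited by a
    # chosen character (underscore, dash, camel case, etc.).
    return sep.join(phrase.split())

print(emit(['deaf', 'main', 'lap', 'rap']))  # def main ()
print(variable('word count'))                # word_count
```

The point is that commands do not have to be dictionary words at all; short invented sounds are both faster to say and easier for the recognizer to separate.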

Ah...... models, models, models. 

Earlier on Grand Janitor's Blog

Some quick notes on what a "Good training system" should look like: (link).
GJB reaches the 100th post! (link)


Tuesday, March 26, 2013

Tuesday's Links (Meetings and more)


Is Depression Really Biochemical (AssertTrue)

Meetings are Mutexes (Vivek Haldar)

So true.  It doesn't count all the time you spend preparing for a meeting.

Exhaustive Testing is Not a Proof of Correctness

True, but hey, writing regression tests is never a bad thing. If you rely only on your brain for testing, it is bound to fail one way or another.

Apple :

Apple's iPhone 5 debuts on T-Mobile April 12 with $99 upfront payment plan
iWatchHumor (DogHouseDiaries)


Yahoo The Marissa Mayer Turnaround

Out of all the commentaries on Marissa Mayer's reign, I think Jean-Louis Gassée goes straight to the point, and I agree with him most.   You cannot use a one-size-fits-all policy, so WFH is not always appropriate either.


The Management-free Organization

Monday, March 25, 2013

Good ASR Training System

The term "speech recognition" is a misnomer.

Why do I say that? I explained this point in an old article, "Do We Have True Open Source Dictation?", which I wrote back in 2005. To recap, a speech recognition system consists of a Viterbi decoder, an acoustic model and a language model.  You could have a great recognizer but bad accuracy if the models are bad.

So how does that relate to you, a developer/researcher in ASR?    The answer: ASR training tools and processes usually become a core asset of your inventory.    In fact, I can tell you that when I need to work on acoustic model training, I need to work on it full time, and it's one of the most absorbing things I have done.  

Why is that?  If you look at the development cycles of all the tasks in making an ASR system, training is the longest.  With the wrong tool, it is also the most error-prone.    As an example, just take a look at the Sphinx forum: you will find that the majority of non-Sphinx4 questions are related to training.    Like "I can't find the path of a certain file", or "the whole thing just got stuck in the middle".

Many first-time users complain with frustration (and occasionally disgust) about why it is so difficult to train a model.   The frustration probably stems from the perception that "shouldn't it be well-defined?"   The answer is again no. In fact, how a model should be built (or even which model should be built) is always subject to change.   It's also one of the two subfields of ASR, at least IMO, which are still creative and exciting in research.  (The other: noisy speech recognition.)  What an open source software suite like Sphinx provides is a standard recipe for everyone.

That said, is there something we can do better in an ASR training system?   There is a lot, I would say. Here are some suggestions:
  1. A training experiment should be created, moved and copied with ease,
  2. A training experiment should be exactly repeatable given the input is exactly the same,
  3. The experimenter should be able to verify the correctness of an experiment before an experiment starts. 
Ease of Creation of an Experiment

You can think of a training experiment as a recipe ...... but not exactly.   When we humans read a recipe and implement it, we make mistakes.

But hey! We are working with computers.   Why should we need to fix small things in the recipe at all? So in a computer experiment, what we are shooting for is an experiment that can be easily created and moved around.

What does that mean?  It basically means there should be no executables hardwired to one particular environment.   There should also be no hardware/architecture assumptions in the training implementation.   If there are, they should be hidden.
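One way to picture this is to keep machine-specific details out of the recipe entirely. Here is a small sketch of the idea; all names and paths below are made up. The recipe names only logical resources, and a per-machine config maps them to real paths, so the experiment itself can be copied between environments untouched:

```python
# A recipe refers to logical resources only (names are invented).
RECIPE = {'features': 'mfcc_13d', 'corpus': 'train_set'}

# A site config differs per machine, never per experiment.
SITE_CONFIG = {
    'mfcc_13d': '/data/feat/mfcc_13d',
    'train_set': '/corpora/train_set',
}

def resolve(recipe, site_config):
    # Bind logical names to concrete paths at run time.
    return {key: site_config[name] for key, name in recipe.items()}

print(resolve(RECIPE, SITE_CONFIG))
```

Moving the experiment to a new cluster then means editing one site config, not every script in the recipe.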

Repeatability of an Experiment

Similar to the previous point, should we allow differences when rerunning a training experiment?  The answer should be no.   So one trick you hear from experienced experimenters is that you should keep the seed of your random generators.   This avoids minute differences between different runs of an experiment.
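A minimal sketch of the trick in Python (the "parameters" here are made up; a real trainer would seed whatever randomized initialization it actually uses):

```python
import random

def init_params(n, seed):
    # A fixed seed makes every run draw exactly the same
    # "random" initial parameters.
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

# Two runs with the same seed are bit-for-bit identical.
print(init_params(3, 1234) == init_params(3, 1234))  # True
```

Record the seed alongside the experiment, and a rerun months later can reproduce the same model.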

Here someone might ask: shouldn't we allow a small difference between experiments?  We are essentially running a physical experiment.

I think that's a valid approach.  But to be conscientious, you would then want to run a certain experiment many times to calculate an average.    And in a way, that is my problem with this thinking: it is slower to repeat an experiment.    E.g. what if you see a 1% absolute drop in your experiment?  Do you let it go? Or do you just chalk it up as noise?   Once you allow yourself not to repeat an experiment exactly, there are tons of questions you have to ask.

Verifiability of an Experiment

Running an experiment sometimes takes days; how do you make sure a run is correct? I would say you should first make sure trivial issues such as missing paths, missing models, or incorrect settings are screened out and corrected.

One of my bosses used to make a strong point of asking me to verify input paths every single time.  This is a good habit and it pays dividends.   Can we do similar things in our training systems?

Applying it to Open Source

What I mentioned above is highly influenced by my experience in the field.   I have personally found that the sites with great infrastructure for transferring experiments between developers are the strongest and fastest growing.   

Putting all these ideas into open source would mean a very different development paradigm.   For example, do we want to have a centralized experiment database which everyone shares?   Do we want to put common resources, such as existing parametrized inputs (such as MFCC), somewhere in common for everyone?  Should we integrate the retrieval of these inputs into our experiment recipes? 

Those are important questions.   In a way, I think they are the most important type of question we should ask in open source. Because regardless of volunteers' efforts, the performance of open source models still lags behind commercial models.  I believe it is an issue of methodology.  


Monday's Links (Brain-Computer Interface, Apple and more)


How to Write Six Important Papers a Year without Breaking a Sweat: The Deep Immersion Approach to Deep Work
It’s Like They’re Reading My Mind (Slate)


Apple Buys Indoor Mapping Company WifiSLAM (LA times)
How Apple Invites Facile Analysis (Business Insiders)
So long, break-even (Horace Dediu)

After big channels picked up Richards' story:

Startups have a sexism problem

R2-D2 Day ...... for real!

Saturday, March 23, 2013

The 100th Post: Why The Grand Janitor's Blog?

Since I decided to revamp The Grand Janitor's Blog last December, there have been 100 posts. (I cheated a bit, so not all of them were written since then.)

It's funny to describe time by the number of articles you write.   In blogging, though, that makes complete sense.

I have started several blogs in the past.  Only 2 of them survive (Cumulomanic and "Start-Up Employees 333 weeks", both in Chinese).  When you cannot maintain your blog for more than 50 posts, your blog just dies, or simply disappears into oblivion.

Yet I made it.  So here's an important question to ask: what keeps me going?

I believe the answer is very simple.  So far there are no bloggers working in the niche of speech recognition: none on automatic speech recognition (ASR) systems, even though there has been much progress.  None on engines, even though much work has been done in open source.   None on applications, even though great projects such as Simon are out there.

Nor was there discussion of how open source speech recognition can be applied in the commercial world, even when dozens of companies are now based on Sphinx (e.g. my employer Voci, EnglishCentral and Nexiwave), and they are filling the startup space.

How about how the latest technologies, such as deep neural networks (DNN) and weighted finite state transducers (WFST), will affect us?  I can see them in academic conferences, journals or sometimes tradeshows...... but not in a blog.

But blogging, as we all know, is probably the most prominent way people get news these days.

And news about speech recognition, once you understand it, is fascinating. 

The only blog which comes close is Nicholay's blog: nsh.   When I was trying to recover as a speech recognition programmer, nsh was a great help.  So thank you, Nick, thank you.

But there is only one nsh.  There is still a lot of fascinating stuff to talk about...... right?

So this is probably the reason why I keep on working: I want to build the thing I myself want, a kind of information hub on speech recognition technology, covering commercial/open source, applications/engines, theory/implementations, the ideals/the realities.

I want to bring my unique perspective: I have been in academia, in industrial research and now in the startup world, so I know quite well the mindsets of people in each group.

I also want to connect with all of you.  We are working on one of the most exciting technologies in the world.   Not everyone understands that.  It will take time for all of us to explain to our friends and families what speech recognition can really do and why it matters.

In any case, I hope you enjoy this blog.  Feel free to connect with me on Plus, LinkedIn and Twitter.


Friday, March 22, 2013

C++ vs C

I have mainly been a C programmer.  Because of work though, I have been working with many codebases written in C++.

Many programmers will tell you C++ is a necessary evil.  I agree.  Using C to emulate object-oriented features such as polymorphism, inheritance or even the idea of objects is not easy.   It also easily confuses novice programmers.

So why does C++ frustrate many programmers then?   I guess my major complaint is that its standard has been evolving, and many compilers cannot catch up with the latest.

For example, it's very hard for gcc 4.7 to compile code which could be compiled by gcc 4.2. Chances are some of the language features are outdated and will generate compiler errors.

C, on the other hand, exhibits much greater stability across compilers.   If you look at the C portion of the triplet (PocketSphinx, SphinxTrain, Sphinxbase), i.e. 99% of the code, most of it just compiles across different generations of gcc.  This makes things easier to maintain.


Friday's Readings


GCC 4.8.0 released
Browser War Revisited
DARPA wants unique automated tools to rapidly make computers smarter

Just As CEO Heins Predicted, BlackBerry World Now Plays Home To Over 100,000 Apps
Apple updates Podcasts app with custom stations, on-the-go playlists and less ‘skeuomorphic’ design

The whole PyCon2013's Fork the Dongle business:

The story:
'Sexist joke' web developer whistle-blower fired (BBC) and then......

Breaking: Adria Richards fired by SendGrid for calling out developers on Twitter

Different views:

From someone who worked with Richards before: Adria Richards, PyCon, and How We All Lost
The apology from PlayHaven's developer: Apology from the developer
Rachel-Sklar from BI: Rachel-Sklar takes
Someone thinks this is a good time to sell their T-shirt: Fork My Dongle T-Shirt
Is PyCon2013 so bad? (Short answer: no) What really happened at PyCon 2013

Your view:

Frankly, if you want to support women in our industry, donate to this awesome 9-year-old.
9 Year Old Building an RPG to Prove Her Brothers Wrong!


Friday Speech-related Links

Future Windows Phone speech recognition revealed in leaked video

Whether or not you like Softie, they have been innovative in speech recognition these past few years.  I am looking forward to their integrating DBNs into many of their products.

German Language Learning Startup Babbel Buys Disrupt Finalist PlaySay To Target The U.S. Market

Not exactly ASR, but language learning has been a mainstay.  Look at EnglishCentral: they have been around and kicking well.

HMM with scikit-learn

When I first learned HMMs, I always hoped to use a scripting language to train the simplest HMM.  scikit-learn is one such piece of software.

Google Keep

Voice memos are a huge market.  But mobile continuous speech recognition is a very challenging task.  Yet, with Google's technology, I think it should be better than its competitor, Evernote.


Thursday, March 21, 2013

Thursday Links (FizzBuzz programming, Samsung, Amazon and more)


Placebo Surgery : Still think acupuncture is a thing?

Expertise, the Death of Fun, and What to Do About It by James Hague

Indeed, it gets harder to learn.  My two cents: always keep notes on your work.  See every mistake as an opportunity to learn.   And always learn new things; never stop.

FizzBuzz programming (2007)

It's sad that it is true.

Technology in general:

Samsung smartwatch product

I am still looking forward to Apple's product more.   I guess because I was there when the iPhone came out, it's rather hard not to say Samsung plagiarizes.......

The Economics of Amazon Prime (link)

When I go to Amazon, using Prime has indeed become an option,  especially for the thousands of ebooks which cost less than $2.99.   Buying ten of them comes very close to the monthly subscription fee of Amazon Prime.

Starbucks and Square don't seem to "mix" well (link)

Other newsworthy:

As Crop Prices Surge, Investment Firms and Farmers Vie for Land

Crop prices have reversed course.  If you are interested in the restaurant business (like me), this has a huge impact on the whole food chain.

The many failures of the personal finance industry

Many geeky friends of mine don't make good decisions in personal finance.  This is a good link for understanding the industry.


Thursday Speech-related Readings

Speech Recognition Stumbles at Leeds Hospital

I wonder who the vendor is.

Google Peanut Gallery (Slate)

An interesting showcase again.  Google always has pretty impressive speech technology.

Where Siri Has Trouble Hearing, a Crowd of Humans Could Help

Combining fragments of recognition is a rather interesting idea, though it's probably not new.  I am glad it is taking off though.

Google Buys Neural Net Startup, Boosting Its Speech Recognition, Computer Vision Chops

This is huge.  Once again, it says something about the power of the DNN approach. It is probably the real focus of the next 5 years.

Duolingo Adds Offline Mode And Speech Recognition To Its Mobile App

I always wonder how the algorithm works.  Confidence-based verification algorithms have always been tough to get working.  But then again, the whole point of reCAPTCHA is really to differentiate between humans and machines.  So it's probably not as complicated as I thought.

Some notes on DNS 12: link

The whole-sentence mode is the more interesting part.  Does it make users more frustrated though? I am curious.


Tuesday, March 19, 2013

Landscape of Open Source Speech Recognition Software (II : Simon)

Around December last year, I wrote an article on open source speech recognizers, covering HTK, Kaldi and Julius.   One thing you should know: just like CMUSphinx, all of these packages contain their own implementations of the Viterbi algorithm.   So when you ask someone in the field of speech recognition, they will usually say the open source speech recognizers are Sphinx, HTK, Kaldi and Julius.

That's how I usually view speech recognition too.    After years working in the industry though, I have started to realize that this definition, seeing speech recognizer = Viterbi algorithm, could be constraining.   In fact, from the user's point of view, a good speech application system should be a combination of

a recognizer + good models + good GUI.

I like to call the former type of "speech recognizer" a "speech recognition engine" and the latter type a "speech recognition application".   Both types of "speech recognizers" are worthwhile.   From the users' point of view, it might just be a technicality to differentiate them.

As I recover as a speech recognition programmer (another name throwing :) ),  one thing I notice is that there is much effort going into writing "speech recognition applications".   It is a good trend, because most people from academia really didn't spend much time writing good speech applications.   And in open source, we badly need good applications such as dictation machines, IVR and C&C.

One effort which really impressed me is Simon.   It is weird, because most of the time I only care about engine-level software.   But in the case of Simon, you can see that a couple of its features really solve problems in real life and integrate into the bigger theme of open source speech recognition.

  • In 0.4.0, Simon started to integrate with Sphinx.   So if someone wants to develop it commercially, they can.
  • The Simon team also intentionally supports context switching in the application; that's good work as well.   In general, if you always use a huge dictionary, you are just over-recognizing words in a certain context. 
  • Last but not least, I like the fact that it integrates with Voxforge.  Voxforge is the open source answer to the large speech databases of commercial speech companies.  So integration with Voxforge will ensure an increasing amount of data for your application.
So kudos to the Simon team!  I believe this is the right kind of thinking to start a good speech application. 


sphinxbase 0.8 and SphinxTrain 1.08

I have done some analysis of sphinxbase 0.8 and SphinxTrain 1.0.8 to understand whether they are very different from sphinxbase 0.7 and SphinxTrain 1.0.7.  I don't see a big difference, but it is still a good idea to upgrade.

  • (sphinxbase) The bug in cmd_ln.c is a must-fix.  Basically, the freeing was wrong for all ARG_STRING_LIST arguments.  So chances are you will get a crash when someone specifies a wrong argument name and cmd_ln.c forces an exit, which eventually leads to a cmd_ln_val_free. 
  • (sphinxbase) There were also a couple of changes in the fsg tools.  Mostly I feel those are rewrites.  
  • (SphinxTrain) SphinxTrain, on the other hand, has new tools such as a g2p framework.  Those are mostly openfst-based tools, and it's worthwhile to put them into SphinxTrain. 
One final note here: there is a tendency in CMUSphinx, in general, to turn to C++.   C++ is something I love and hate. It can sometimes be nasty, especially when dealing with compilation.  At the same time, using C to emulate OOP features is quite painful.   So my hope is that we use a subset of C++ which is robust across different compiler versions. 


Monday, March 18, 2013

Python multiprocessing

As my readers may have noticed, I haven't updated this blog much, as I have had a pretty heavy workload. It doesn't help that I was sick in the middle of March as well. Excuses aside though, I am happy to come back. Even if I can't write much about Sphinx and programming, I think it's still worth it to keep posting links.

I have also received requests to write more details on individual parts of Sphinx.   I love these requests, so feel free to send me more.   Of course, it usually takes me some time to fully grok a certain part of Sphinx before I can describe it in an approachable way.   So until then, I can only ask for your patience.

Recently I have run into parallel processing a lot and was intrigued by how it works in practice. In Python, a natural choice is to use the multiprocessing library. So here is a simple example of how you can run multiple processes in Python. It is very useful on modern CPUs, which have multiple cores.

Here is an example program on how that could be done:

1:  import multiprocessing  
2:  def process(i):  # placeholder worker; a real task would go here  
3:      print('Running task %d' % i)  
4:  N = 4  # number of parallel tasks  
5:  jobs = []  
6:  for i in range(N):  
7:      p = multiprocessing.Process(target=process,  
8:                                  name='TASK' + str(i),  
9:                                  args=(i,))  
10:     jobs.append(p)  
11:     p.start()  
12: for j in jobs:  
13:     if j.is_alive():  
14:         print('Waiting for job %s' % j.name)  
15:     j.join()  

The program is fairly trivial. Interestingly enough, it is also quite similar to the multithreading version in Python. Lines 5 to 11 set up and start the tasks, and lines 12 to 15 wait for them to finish.

It feels a little less elegant than using Pool, which provides a waiting mechanism for the entire pool of tasks.  Right now, I join the jobs in creation order, so I may block on a job that is still running even though job 1 has already finished.
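As a rough sketch of the Pool alternative (the worker function is a made-up stand-in for a real task):

```python
import multiprocessing

def process(i):
    # Stand-in worker: a real task would do actual work here.
    return i * i

def run_all(n):
    # Pool.map() blocks until every worker has finished, so
    # there is no need to join() each process by hand.
    pool = multiprocessing.Pool()
    results = pool.map(process, range(n))
    pool.close()
    pool.join()
    return results

if __name__ == '__main__':
    print(run_all(4))  # [0, 1, 4, 9]
```

Pool also hands the results back in order, which the hand-rolled Process loop does not do for free.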

Is it worthwhile to go down another path, thread-based programming?  One thing I learned in this exercise is that in older versions of Python, a multi-threaded program can be paradoxically slower than the single-threaded one. (See this link from Eli Bendersky.) It could be an easier problem in recent Python though.