Thursday, January 31, 2013

January 2013 Write-up

Miraculously, I still have some momentum for this blog and I have kept to the daily posting schedule.

Here is a write-up for this month. For how I plan to run this blog, feel free to look at this post:

Some Vision of the Grand Janitor's Blog

Sphinx Tutorials and Commentaries

SphinxTrain1.07's bw:

Commentary on SphinxTrain1.07's bw (Part I)
Commentary on SphinxTrain1.07's bw (Part II)

Part I describes the high-level layout; Part II describes how half of the state network is built.

Acoustic Score and Its Sign
Subword Units and their Occasionally Non-Trivial Meanings

Sphinx 4 from a C background : Material for Learning


Goldman Sachs not Liable
Aaron Swartz......

Other writings:

On Kurzweil : a perspective of an ASR practitioner



Goldman Sachs not Liable

Here is Bloomberg's piece.   Sounds like the case is really closed.

Of course, we are all still feeling the consequences.


Wednesday, January 30, 2013

Speech-related Readings at Jan 30, 2013

Amazon acquired Ivona:

I am aware of Amazon's involvement in ASR, though which domain they will target remains a question.

Goldman-Dragon Trial:

I simply hope Dr. Baker gets closure on the whole thing.   In fact, when you think about it, the whole L&H fallout is the reason why the ASR industry is a virtual monopoly now.  So if you are interested in ASR, you should be concerned.


Tuesday, January 29, 2013

Subword Units and their Occasionally Non-Trivial Meanings

While I was writing the next article on bw, I found myself forgetting the meanings of the different types of subword units (i.e., phones, biphones, diphones, triphones and such).  So I decided to write a little note.

On this kind of topic, someone will likely come up and say "X always means Y, etc. and so on."  My view (and hope) is that the wording of a term should reflect what it means.  So when I hear a term and can come up with multiple definitions in my head, I would say the naming convention is a little bit broken.

Phones vs Phonemes
Linguists distinguish between a phoneme and a phone.  The former usually means an abstract categorization of a sound, whereas the latter usually means the actual realization of a sound.

In a decoder though, what you see most is the term phone.

Biphones vs Diphones

(Ref here) "..... one important difference between the two units. Biphones are understood to be just left or right context dependent (mono)phones. On the other hand, diphones represent the transition regions that stretch between the two "centres" of the subsequent phones."

So that's why there can be left-biphones and right-biphones.  Diphones are intuitively the better fit for synthesis.

The possible combinations of left-biphones, right-biphones or diphones are all N^2, with N equal to the number of phones.

Btw, the link I gave also has a term called "bi-diphone", which I don't think is a standard term.


Triphones

Most of the time, it means considering both the left and right contexts.  Possible combinations: N^3.


Quinphones

Most of the time, it means considering the two left and two right contexts.  Possible combinations: N^5.


Heptaphones

Most of the time, it means considering the three left and three right contexts.  Possible combinations: N^7.

"Quadphones" and Other possible confusions in terminology. 

I guess what I don't feel comfortable with are terms such as "quadphones".   Even "quinphones" and "heptaphones" can potentially mean different things from time to time.

For example, if you look at the LID literature, you will occasionally see the term "quadphone".  But it seems the term "phone 4-gram" (or, more correctly, "quadgram"...... if you think about it too much) might be a nicer choice.

Then there is what the context looks like: 2 left and 1 right?  1 left and 2 right?   Come to think of it, this terminology is confusing even for triphones, because we could also mean a phone that depends on 2 left or 2 right phones.  ASR people don't feel that way, probably because of a well-established convention.  Of course, the same can be said for quinphones and heptaphones.


Monday, January 28, 2013

Readings at Jan 28, 2013

Tools of the Trade : Mainly an iOS article, but it lists many tools for maintaining contacts, task lists and requests.
C11 : I had no idea C99 tried to introduce variable-length arrays.  They certainly haven't been very successful in the past 10 years.....   Another great book to check out is Ning's C book.
How to Make an iPhone App that Actually Sells : Again, probably not just for iOS but for writing free/shareware in general.
Bayesian vs Non-Bayesian:  Nice post.  I don't fully grok the Bayesian/non-Bayesian debate, but as far as I know, they are essentially two schools of thought. (ASR? The whole training process starts from a flat start, you figure.)

Saturday, January 26, 2013

On Kurzweil : a perspective of an ASR practitioner

Many people who don't work in the fields of AI, NLP and ASR have heard of Kurzweil.   To my surprise, many seem to give little thought to what he said and just follow his theory wholeheartedly.

In this post, I just want to comment on one single little thing: whether real-time speech-to-speech translation can be achieved in the 2010s.  This is a very specific prediction from Kurzweil's book "The Singularity is Near".

My discussion will mainly focus on ASR first.  So even though my examples below are not exactly pure ASR systems, I will skip the long-winded wording of saying "the ASR of System X".  And to be frank, the MT and response systems probably go through similarly torturous development processes anyway.   So please don't tell me that "System X is actually ASR + Y"; that is sort of beside the point.

Oh well, you will probably ask why I bother: don't we have a demo of real-time speech-to-speech translation from Microsoft already?

True, but if you observe the demo carefully, it is based on read speech.  I don't want to speculate much, but I doubt it is a generic language model that wasn't tuned to the lecture.   In a nutshell, I don't believe it is something you can use in real life.

Let's think of a more real-life example, Siri: are we really getting 100% correct responses now?  (Not to boil it down to ASR WER ...... yet.)  I don't think so.  Even with adaptation, I don't think Siri understands what I say every single time.    Most of the time, I follow the unofficial command list of Siri and let it improve with adaptation..... but still, it is not perfect.

Why? It is the hard, cold reality: ASR is still not perfect, despite all the advancement in HMM-based speech recognition.  All the key technologies we have learned in the last 20 years: CMLLR, MMIE, MPE, fMPE, DMT, consensus networks, multipass decoding, SAT, SAT with MMIE, all the nicest front-ends, all the engineering.   Nope, we still do not have a feel-good accuracy.  Indeed, human speech recognition is not at 0% WER either, but for some reason, the current state-of-the-art ASR performance is not reaching even that.

And Siri, we all know, is the state of the art.

A little digression: most critics, when they write to this point, will lament that "oh, there is just some invisible barrier out there and humans just can't build a Tower of Babel, blabla....".  I believe most of these "critics" have absolutely no idea what they are talking about.   To identify these air-head critics, just see if they put "cognitive science" into their articles; then you know they have never worked on a real-life ASR system.

I, on the other hand, do believe one day we can get there.  Why?  Because people who work on these speech recognition evaluation tasks will tell you: given a certain test set and enough time and gumption, you can create a system without any errors on it.  So to me, it is more an issue of whether someone keeps grinding on the problem than a feasibility issue.

So where are we now in ASR?  Recently, in ICASSP 2012, a Google paper reported training on 87 thousand hours of data.  That is probably the largest-scale training I know of.  And where are we now? 10% WER, down from 12%.  The last big experiment I knew of before that was probably the 3000-hour experiment back in 2006-7.  The Google authors are probably using a tougher test set, so the initial recognition rate was yet again lower.

Okay, speculation time.  Let's assume that humans can collect 10 times more labelled data every 6-7 years AND that we can run AM training on it.  When will we get, say, 2% WER on the current Google test set?   If we just do a very simple linear extrapolation, it will take 4 * 6 years = 24 years to collect 10000 times more data (nearly a billion hours).    So we are way, way past Kurzweil's 2010s deadline.

And that's a wild speculation.   Computational resources will probably work themselves out by that time.  What I doubt most is whether the progress will be linear.

Of course, it might turn out non-linearly better too.  But here is another point: it's not just about the training set, it's about the test set.  If we truly want a recognizer that works for *everyone* on the planet, then the right thing to do is to test the recognizer on the whole population.  Since we can't, we want to sample enough human speech to represent the Earth's population, and the current test sets might not be representative enough.   So it is possible that when we enlarge our test sets, we will find the initial recognition rate going down again.   And it seems to me our test sets are still far from mimicking the human population.

My discussion so far has mostly been about the acoustic model.  On the language model side, the problem will mainly be domain specificity.   Also bear in mind that human language evolves.  So, say we want to build a system with a customized language model for each human being on the planet: at any particular moment in time, you might not be able to get enough data to build such a language model.

For me, the point of the whole discussion is that ASR is an engineering system, not some idealistic discussion topic.  There will always be tradeoffs.   You may say: "What if a certain technology Y emerges in the next 50 years?" I hear that a lot; Y could be quantum computing, brain simulation, a brain-computer interface or a machine implementation of the brain.    Guys..... those, I have to admit, are very smart ideas of our time, and given another 30-40 years, we might see something useful.   For now, ASR really has nothing to do with them.  I have never heard of a machine implementation of the auditory cortex, or even an accurate reconstruction of the auditory pathway.  Nor has there been easy progress in dissecting the mammalian inner ear and understanding what is going on in the human ear.   From what I know, we seem to know some things, but there is a lot we don't.

That's why I think it's better to buckle down and just try to work out our stuff.  Meaning: try to come up with more interesting mathematical models, try to come up with more computationally efficient methods.   Those, I think, are meaningful discussions.   As for Kurzweil, no doubt he is a very smart guy, but at least on ASR, I don't think he knows what he is talking about.

Of course, I am certainly not the only person who complains about Kurzweil.  Look at Douglas Hofstadter's criticism:

"It’s as if you took a lot of very good food and some dog excrement and blended it all up so that you can't possibly figure out what's good or bad. It's an intimate mixture of rubbish and good ideas, and it's very hard to disentangle the two, because these are smart people; they're not stupid."

Sounds very reasonable to me.


Friday, January 25, 2013

Acoustic Score and Its Sign

Over the years, I have been asked why the acoustic score can be a positive number.  That occasionally leads to big confusion among beginner users.  So I am writing this article as a kind of road sign for people.

The acoustic score per frame is essentially the log value of a probability density function; in Sphinx's case, the density is a multi-dimensional Gaussian distribution.  The acoustic score per phone is then the log likelihood of the phone HMM.  You can extend this definition to word HMMs.

Now, for the sign.  If you think of a discrete probability distribution, this acoustic score thingy should always be negative (because the log of a probability, which is at most 1, is negative).  In the case of a Gaussian density, though, when the standard deviation is small, it is possible for the density value to be larger than 1. (Also see this link.)  Those are the times you will see a positive value.

One thing that might feel disharmonious is the magnitude of the likelihoods you see.  Bear in mind that Sphinx2 and Sphinx3 use a very small logbase, and that we are talking about a multi-dimensional Gaussian distribution; both make the numerical values bigger.


Also see:
My answer on the Sphinx Forum

Thursday, January 24, 2013

Commentary on SphinxTrain1.07's bw (Part II : next_state_utt.c's First Half)

I will go on with the analysis of bw.  In my last post, we looked at the high-level structure of the bw program.  We now turn to the details of how the state lattice is built; in other words, how next_state_utt.c works.

As always, here is the standard disclaimer: this is another long and super technical post that I expect only a small group of programmers with an esoteric interest in the Baum-Welch algorithm will read.   It's not for the faint of heart, but if you understand the whole thing, you will have some basic but genuine understanding of the Baum-Welch algorithm in real life.

The name of the module, next_state_utt.c, is quite unassuming, but it is key to understanding the Baum-Welch implementation.  The way the state lattice is structured affects how parameter estimation works.   The same statement holds not only for Baum-Welch estimation but also for other estimation algorithms in speech.

But what is so difficult about coding the lattice?  Here are two points I think are worth pointing out:

  1. I guess an average programmer could work out a correct concatenation of all phone HMMs, if all phones are context-independent, in 3 days to a week.  But most advanced systems use context-dependent phones.  So you have to make sure that at the right time, the right triphone state is used.
  2. In Sphinx, it gets a bit more complicated, because there is a distinction between positions of triphones.  This is quite specific to Sphinx, and you won't find it in systems such as HTK, so it further complicates the coding.  You will see in the following discussion that Sphinx backs off from position-dependent triphone estimation from time to time.   In my view, it might not end up too different from a position-independent triphone system.  (It's certainly fun to read. :) ) 
My goal here is to analyze next_state_utt.c, explain how it works, and point out some parts one could improve.

High-Level Layout of next_utt_state.c:

 -> mk_wordlist (mk_wordlist.c)  
 -> mk_phone_list (mk_phonelist.c)  
 -> cvt2triphone (cvt2triphone.c)  
 -> state_seq_make (state_seq_make.c)  

Nothing fancy here. We first make a word list (mk_wordlist), then make a phone list (mk_phone_list), then convert the phone list to triphones (cvt2triphone), then create the state sequence (state_seq_make).

Now, before we go on, even at this level you may already discover one issue with Sphinx's bw: it uses a flat list to represent phone models.   So if you want to model a word with multiple pronunciations, you probably can't do it without changing the code.

Another important thing to note: just like many non-WFST systems, it is not that easy to build a simple phoneme system with Sphinx.  (HTK is an exception, but there you can always turn on a flag to expand context. Just look up the manual.)  Say you want to express your phoneme system as one-phoneme words; then you would want your dictionary to look like:


But then, if a word is a phone, do you actually want to build a network of cross-word triphones?  You probably do if you are shooting for performance: all of the most accurate phoneme-based systems have some sort of context dependency there.  (Brno's recognizer probably has some, but I don't really grok why it is so good.)

But if you want to do your own interesting experiments, this fixed behavior may not suit your appetite.   Maybe you just want a context-independent phone system for some toy experiments.  As it stands, though, you are probably always building a triphone system.  So it might or might not be what you like.

So if you really want to trigger the CI-model behavior, what can you do?  Take a look at my next post: in cvt2triphone.c, if the model definition file only specifies CI states, then no triphone conversion will occur.   In a way, that is to say the system assumes that if you just train a CI model, you will get a CI model, but there is no explicit way to turn triphone expansion off.


mk_wordlist is rather trivial:

 char **mk_wordlist(char *str,  
             uint32 *n_word)  
 {  
   uint32 n_w;  
   uint32 i;  
   char **wl;  
   n_w = n_words(str);  
   wl = ckd_calloc(n_w, sizeof(char *));  
   wl[0] = strtok(str, " \t");  
   for (i = 1; i < n_w; i++) {  
      wl[i] = strtok(NULL, " \t");  
   }  
   assert(strtok(NULL, " \t") == NULL);  
   *n_word = n_w;  
   return wl;  
 }  

Given one line of transcript, mk_wordlist transforms it into an array of C strings.  Memory for the array is allocated inside the function.


mk_phone_list is still fairly trivial, but there is a bit more detail:

1:  acmod_id_t *mk_phone_list(char **btw_mark,  
2:                  uint32 *n_phone,  
3:                  char **word,  
4:                  uint32 n_word,  
5:                  lexicon_t *lex)  
6:  {  
7:    uint32 n_p;  
8:    lex_entry_t *e;  
9:    char *btw;  
10:    unsigned int i, j, k;  
11:    acmod_id_t *p;  
12:    /*  
13:     * Determine the # of phones in the sequence.  
14:     */  
15:    for (i = 0, n_p = 0; i < n_word; i++) {  
16:       e = lexicon_lookup(lex, word[i]);  
17:       if (e == NULL) {  
18:         E_WARN("Unable to lookup word '%s' in the lexicon\n", word[i]);  
19:         return NULL;  
20:       }  
21:       n_p += e->phone_cnt;  
22:    }  
23:    /*  
24:     * Allocate the phone sequence  
25:     */  
26:    p = ckd_calloc(n_p, sizeof(acmod_id_t));  
27:    /*  
28:     * Allocate the between word markers  
29:     */  
30:    btw = ckd_calloc(n_p, sizeof(char));  
31:    for (i = 0, k = 0; i < n_word; i++) {     /* for each word */  
32:       e = lexicon_lookup(lex, word[i]);  
33:       if (e->phone_cnt == 0) /* Ignore words with no pronunciation */  
34:        continue;  
35:       for (j = 0; j < e->phone_cnt-1; j++, k++) {     /* for all but the last phone in the word */  
36:         p[k] = e->ci_acmod_id[j];  
37:       }  
38:       p[k] = e->ci_acmod_id[j];     /* move over the last phone */  
39:       btw[k] = TRUE;               /* mark word boundary following  
40:                             kth phone */  
41:       ++k;  
42:    }  
43:    *btw_mark = btw;  
44:    *n_phone = n_p;  
45:    assert(k == n_p);  
46:    return p;  
47:  }  

In lines 15-22, we first look up the pronunciations of the words. (Remember, right now we can only look up one pronunciation per word.) The function then allocates an array of phone IDs (of type acmod_id_t).

Now here is the special part of the code: besides the array of phones, it also allocates an array of "between-word markers".  So what is the mechanism?  Let me give an example.

Suppose you have a transcript with word sequence "MY NAME IS CLINTON"

       mk_word_list would create

     word[0] -> MY  
     word[1] -> NAME  
     word[2] -> IS  
     word[3] -> CLINTON  

       mk_phone_list (with my best guess of pronunciations) would create
     ph[0] -> M      btw[0] -> 0
     ph[1] -> AY      btw[1] -> 1
     ph[2] -> N      btw[2] -> 0
     ph[3] -> EY      btw[3] -> 0
     ph[4] -> M      btw[4] -> 1
     ph[5] -> IY      btw[5] -> 0
     ph[6] -> S      btw[6] -> 1
     ph[7] -> K      btw[7] -> 0
     ph[8] -> L      btw[8] -> 0
     ph[9] -> IY      btw[9] -> 0
     ph[10] -> N      btw[10] -> 0
     ph[11] -> T      btw[11] -> 0
     ph[12] -> AX     btw[12] -> 0
     pH[13] -> N      btw[13] -> 1

So essentially it indicates that a word ends at a certain phone.

I believe such a representation is for convenience: it facilitates determining whether a phone is at the beginning, the middle or the end of a word.

An alternative here is to insert an optional silence between words.  According to the HTK handbook, this usually reduces the WER slightly.  It also seems reasonable to figure out where a phone is located by using silences as markers.

A digression: acmod_id_t

1:  typedef unsigned char ci_acmod_id_t;  
2:  #define NO_CI_ACMOD     (0xff)  
3:  #define MAX_CI_ACMOD     (0xfe)  
4:  typedef uint32 acmod_id_t;  
5:  #define NO_ACMOD     (0xffffffff)     /* The ID representing no acoustic  
6:                            * model. */  
7:  #define MAX_ACMOD     (0xfffffffe)     /* The max ID possible */  

Now, we used acmod_id_t in mk_phone_list, but what is it really?  So let's take a detour into acmod_id_ds.t ("ds" stands for data structure).

acmod_id_t is essentially a uint32, so its range is that of an unsigned 32-bit integer, minus one value: the all-ones pattern 0xffffffff is reserved as NO_ACMOD, so the maximum usable ID is MAX_ACMOD = 0xfffffffe.  The same trick shows up at the CI level, where MAX_CI_ACMOD is defined as 0xfe.

The more interesting part here: we saw that ci_acmod_id_t is only a character type.  This is obviously a problem: for some languages, one may want to express more than 255 phones. (Why 255? Because 0xff is the NO_CI_ACMOD sentinel, leaving only IDs 0 through 0xfe.)

We'll meet acmod_set a bit more later.  But let us move on first; sometimes code tracing is more motivated when you see the code before the data structure.   Many suggest otherwise: indeed, once you know the data, the code makes more sense.   But in practice, you will most likely read the code first and then need to connect things together.   Thus, IMO, both approaches have their merits in code tracing.

So far ..... and next post

To avoid cluttering a single post, I will stop here and put the rest of next_utt_states.c (cvt2triphone and state_seq_make) in another post.   But I want to summarize several things I have observed so far:

  1. The current bw doesn't create a word network, so it cannot handle multiple pronunciations. 
  2. The current bw automatically expands triphone contexts; there is no explicit way to turn this off.
  3. bw does not put optional silences in the network. 
  4. bw doesn't work for more than 255 CI phones. 
Btw, SphinxTrain1.08 has several changes which replace mk_wordlist with data structures from sphinxbase.  Those are encouraging changes.  If I have time, I will cover them. 


Tuesday, January 22, 2013

Learning vs Writing

I haven't done any serious writing for a week.  Mostly I have been posting interesting readings just to keep up the momentum.   Work is busy, so I slowed down.  Another concern is what to write.   Some of the topics I have been writing about, such as Sphinx4 and SphinxTrain, take a bit of research to get right.

So far I think I am on the right track.  There are not many bloggers on  speech recognition.  (Nick is an exception.)   To really increase awareness of how ASR is done in practice, blogging is a good way to go.

I also describe myself as "recovering" because there were a couple of years when I hadn't seriously thought about open source Sphinx.  In fact, though I was working on speech-related stuff, I didn't spend much time on mainstream ASR either, because my topic was too esoteric.

Not to mention, many new technologies have emerged in the last few years.   The major one, I would say, is the use of neural networks in speech recognition.  It probably won't replace the HMM soon, but it is a mainstay at many sites already.   WFST, with more tutorial-type literature available, has become more and more popular.    In programming, Python is now a mainstay and a job-proof kind of language, and useful toolkits such as scipy and nltk by themselves deserve book-length treatment.  Java is starting to be like C++, a kind of necessary evil you need to learn.  C++ has a new standard.   Ruby is huge in the Web world and is fun to learn in its own right.

All of these new technologies took me back into a kind of learning mode.   So some of my articles have become longer and more detailed.   For now, they probably cater to only a small group of people in the world.   But that's okay: when you blog, you just want to build landmarks on the blogosphere.   Let people come searching for them and benefit from them.   That's my philosophy for going on with this site.


Monday, January 21, 2013

How Norvig refutes Chomsky

In a well-written and elaborate post, Peter Norvig refuted Noam Chomsky's arguments against statistical models. (Links: I, II)

"Then, to get language from this abstract, eternal, mathematical realm into the heads of people, he [Chomsky] must fabricate a mystical facility that is exactly tuned to the eternal realm. This may be very interesting from a mathematical point of view, but it misses the point about what language is, and how it works."

On speech recognition, I cannot help but think Norvig is right.  There have been many times when the introduction of linguistic knowledge proved not as helpful as we hoped.   If you are interested in the topic, you should also read Jelinek's piece, "Some of My Best Friends Are Linguists".


Thursday, January 17, 2013

Aaron Swartz......

I have been busy lately, so I haven't posted much.

The only story I think is worthwhile lately is the death of Aaron Swartz.  I am late to the whole event, but the Wikipedia page is an amazing source for what really went on.

You should also watch Lessig's comments at here.

Btw, last time I checked, "Maximum Likelihood from Incomplete Data via the EM Algorithm" costs $29. I am glad many have written great tutorials on EM, but not so happy that I couldn't read the original paper.


Tuesday, January 15, 2013

Readings from Jan 15, 2013

The language of languages by Matt Might: this cleared up several syntactic aspects of the various types of grammars for me. 

KLEE : Hmm.  This is very interesting.  Given a function, this tool can automatically generate test cases with proper coverage.  I think this could improve the current test suite of C-based Sphinx.


John Regehr on Hiding Bugs

Hiding Bugs from Branch Coverage : Great article on how one can hide bugs from branch-coverage tools.  All code in C.  It also mentions the use of advanced tools such as KLEE and Frama-C.


Monday, January 14, 2013

Readings for Jan 14, 2013

Approximation relating lg, ln, and log10 by John Cook
Digits in Power of 2 by John Cook
Rise and Fall of Third Normal Form by John Cook
The Crown Game Affair : Fascinating account on how cheating in chess was detected and caught.
Introduction to Conditional Random Fields : Eerily similar to HMMs; that's perhaps why so many "cute" ideas were published on them in the past.
Stuff Harvard People Like by Ed Chen: Entertaining post.  I do know some people from MIT, and the study fits the stereotype I recognize.   For Harvard, not as much; some are really mellow fellows.   Also, not everyone at a school is savvy with computers, so the samples Ed Chen collected may or may not be representative. (To his credit, he named all his assumptions in his post.)


Thursday, January 10, 2013

Sphinx 4 from a C background : Material for Learning Sphinx 4

I have been quite focused on SphinxTrain lately, so I haven't touched Sphinx 4 for a while.   As I had one afternoon to use at leisure (not really), I decided to take a look at some of the basic material again.

Sphinx-4, as a recognizer, is an interesting piece of software to me, a recovering recognizer programmer.  It seems remote but oddly familiar.   It is sort of a dreamland for experimenting with different decoding strategies.   During Sphinx 3.5 to 3.7, I tried to make Sphinx 3.X more generalized in terms of search.  That effort was tough, mainly because the programs were in C.  As you might guess, the modifications required reinventing a lot of good software engineering mechanisms (such as classes).

Sphinx-4 is now widely studied.  There are many projects using Sphinx-4, and its architecture is analyzed on many sites.   That's why I have an abundance of material for learning the recognizer.  (Yay! :) )

Here are the top pages on my radar now, and I am going to study them in detail:

  1. Introduction :  What Sphinx-4 is and how to use it. 
  2. Sphinx-4 Application Programmer's Guide : What excites me is the model-switching capability.  I also love the way the current recognizer can be linked to multiple languages. 
  3. Configuration Manager :  That's an interesting part as well.   This is a recognizer in which every component is configurable.   Is that a good thing?  There are pros and cons to a hierarchical configuration system.  But most of the time, I think it's a better way than a flat command-line structure. 
  4. Instrumentation : How to test the decoder, with examples on TIDIGITS and many more databases. 
  5. FAQ: Here is a list of questions which make me curious. 
  6. The White Paper : Extremely illuminating.  I also appreciate the scholarship when they compare different versions of the Sphinxes. 
  7. The 2003 paper: I haven't gone through this one yet, but it's certainly something I want to check out. 

Previous related articles:
Sphinx4 from a C background : first few steps
Sphinx4 from a C background : Installation of Eclipse
Sphinx4 from a C Background : Setting up Eclipse 

Stylometric Analysis

Just read a story about how machine learning can be used to identify anonymous hackers or crackers.  I am actually not too surprised by that capability.   According to the article, the current accuracy is around 66% to 80%, which I take to be the detection rate.   There is a 5000-word minimum, which attempts to make the word-distribution estimation robust.   (A backoff strategy comes to mind immediately......)

The final limitation of this method is that the text has to be in English.   That..... I don't think is such a big deal.  No one can communicate effectively if they mix up multiple languages in text.  If they do, it probably means the mixture of languages is a kind of language itself.

Thinking deeper, this will probably prompt intelligent hackers to speak less in public forums.   They will likely use secure channels instead to obtain information.

Will hackers be deterred by this new method?  I guess not; many in the hacker community are well aware that the latest machine learning methods can detect their presence.  For example, use of IRC is already one particular signature that can be detected on the network.

In any case, the topic of how NLP can be applied to infosec always fascinates me, hope I can work on it someday.


Wednesday, January 09, 2013

No readings for today ...... but.....

It's a digression but check this out:  This is a heart-breaking story of comedian Anthony Griffin.


Commentary on SphinxTrain1.07's bw (Part I)

I was once asked by a fellow who didn't work in ASR how the estimation algorithms in speech recognition work.   That's a tough question to answer.  From a high level, you can explain how the properties of the Q-function allow an increase in likelihood after each re-estimation.  You can also explain how the Baum-Welch algorithm is derived from the Q-function, how the estimation algorithm is eventually expressed in Greek letters, and naturally link it to the alpha and beta passes.   Finally, you can also just write down the re-estimation formulae and let people puzzle over them.

All are options, but this is not what I wanted, nor what the fellow wanted.   We hoped that somehow there was one single point of entry to understanding the Baum-Welch algorithm: once we got there, we would grok it.   Unfortunately, that's impossible for Baum-Welch.  It is really a rather deep algorithm, which takes several types of understanding.

In this post, I narrow the discussion down to just Baum-Welch in SphinxTrain1.07, and I will focus on the coding aspect of the program.   Two stresses here:

  1. How does Baum-Welch in practical speech recognition differ from the theory?
  2. How do different parts of the theory map to the actual code?

In fact, in Part I, I will just describe the high-level organization of the Baum-Welch algorithm in bw.   I assume the readers know what the Baum-Welch algorithm is.   In Part II, I will focus on the low-level functions such as next_utt_state, forward, backward_update and accum_global.

(At a certain point, I might write another post just to describe Baum-Welch.  That will help my math as well......)

Unlike the post on setting up Sphinx 4, this is not a post for the faint of heart.  So skip it if you feel dizzy.

Some Fun Reading Before You Move On

Before you move on, here are three references which I found highly useful for understanding Baum-Welch in speech recognition:

  1. L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Chapter 6, "Theory and Implementation of Hidden Markov Model," p.343 and p.369.    Comments: In general, the whole Chapter 6 is essential for understanding HMM-based speech recognition.  There is also a full derivation of the re-estimation formulae.  Unfortunately, it only gives the formula without proof for the most important case, in which the observation probability is expressed as a Gaussian Mixture Model (GMM).
  2. X. D. Huang, A. Acero and H. W. Hon, Spoken Language Processing.  Chapter 8, "Hidden Markov Models."  Comments: written by one of the authors of Sphinx 2, Xuedong Huang, the book is a very good review of spoken language systems.  Chapter 8 in particular has detailed proofs of all the re-estimation algorithms.  If you want to buy just one book on speech recognition, this is the one.  The only thing I would say is that the typeface of the Greek letters is kind of ugly.
  3. X. D. Huang, Y. Ariki, M. A. Jack, Hidden Markov Models for Speech Recognition. Chapters 5, 6, 7. Comments: again by Xuedong Huang, I think this has the most detailed derivations I have ever seen on continuous HMMs in a book.  (There might be good papers I don't know of.)  Related to Sphinx, it has a chapter on semi-continuous HMMs (SCHMM) as well.

bw also features rather nice code commentaries. My understanding is that they were mostly written by Eric Thayer, who put great effort into pulling multiple fragmented codebases together to form the embryo of today's SphinxTrain.

Baum-Welch algorithm in Theory

Now that you have read the references: at a very high level, what does a Baum-Welch estimation program do? To summarize, we can think of it this way:

* For each training utterance
  1. Build an HMM-network to represent it. 
  2. Run Forward Algorithm
  3. Run Backward Algorithm
  4. From the forward/backward passes, calculate the statistics (or counts, or posterior scores, depending on what you call them).
* After we run through all utterances, estimate the parameters (means, variances, transition probabilities, etc.) from the statistics.

Sounds simple?  I actually skipped a lot of details, but this is the big picture.
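As a concrete illustration, here is a minimal sketch of that loop for a toy HMM.  This is my own illustrative Python, not the bw code, and it deliberately ignores scaling and beams, so it only works on short toy utterances:

```python
import numpy as np

def forward(A, B, pi):
    """Unscaled forward pass.  B[t, j] = likelihood of frame t in state j."""
    T, N = B.shape
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]
    return alpha

def backward(A, B):
    """Unscaled backward pass."""
    T, N = B.shape
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[t + 1] * beta[t + 1])
    return beta

def baum_welch_transitions(A, pi, utterances):
    """One full pass: accumulate arc posteriors over all utterances,
    then re-estimate the transition matrix from the counts."""
    num, den = np.zeros_like(A), np.zeros(A.shape[0])
    for B in utterances:                      # per-utterance E-step
        alpha, beta = forward(A, B, pi), backward(A, B)
        p_obs = alpha[-1].sum()               # P(O | current model)
        for t in range(B.shape[0] - 1):
            # xi[i, j]: posterior count of taking arc i -> j at time t
            xi = alpha[t][:, None] * A * (B[t + 1] * beta[t + 1])[None, :] / p_obs
            num += xi
            den += xi.sum(axis=1)
    return num / den[:, None]                 # M-step after all utterances
```

Here only the transition matrix is re-estimated, but the shape of the computation (forward, backward, posterior counts, then a global update) is the same for the means, variances and mixture weights.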

Baum-Welch algorithm in Practice

There are several practical concerns when implementing Baum-Welch.  These are particularly important for speech recognition.
  1. Scaling of alpha/beta scores : this is explained in detail in Rabiner's book (p.365-p.368).  The gist is that the alpha and beta scores can easily exceed the precision range of any machine.  It turns out there is a beautiful way to avoid this problem.
  2. Multiple observation sequences, or streams: this is a little bit archaic, but there is still some research on having multiple streams of features for speech recognition (e.g. combining the lip signal and the speech signal).
  3. Speed: most implementations you see are not based on a full run of the forward or backward algorithm.  To improve speed, most implementations use a beam to constrain the search.
  4. Different types of states: HMM states can be emitting or non-emitting.  How you handle them complicates the implementation.
You will see that bw has taken care of a lot of these practical issues.   In my opinion, that is the reason why the whole program is a little bit bloated (5000 lines total).
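To make point 1 concrete, here is a sketch of the standard scaling trick from Rabiner (again my own toy code): normalize alpha at every frame so it never underflows, and recover the log-likelihood from the scale factors.

```python
import numpy as np

def forward_scaled(A, B, pi):
    """Scaled forward pass: c_t = sum_j alpha_t(j) is divided out at
    each frame, and log P(O | model) = sum_t log(c_t)."""
    T, N = B.shape
    alpha = np.zeros((T, N))
    scale = np.zeros(T)
    alpha[0] = pi * B[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]
    return alpha, scale, np.log(scale).sum()
```

The same scale factors are reused in the backward pass, so the alpha-beta products that form the posterior counts stay in a sane numerical range no matter how long the utterance is.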

Tracing of bw: High Level

Now we get into the code level.  I will follow the version of bw from SphinxTrain1.07.  I don't see many changes in 1.08 yet, so this tracing is likely to remain applicable for a while.

I will organize the tracing this way: first I will go through the high-level flow, then I will describe some interesting places in the code by line number.

main() - src/programs/bw/main.c

This is the high-level flow of main.c (Lines 1903 to 1914):

 main ->   
    if it is not mmie training  

We will first go forward with main_initialize()

 -> initialize the model inventory, essentially 4 things: means (mean), variances (var), transition matrices (tmat), mixture weights (mixw).  
 -> a lexicon (or .... a dictionary)  
 -> model definition  
 -> feature vector type  
 -> lda (lda matrix)  
 -> cmn and agc  
 -> svspec  
 -> codebook definition (ts2cb)  
 -> mllr for SAT type of training.   

Interesting code:
  • Line 359: extract the diagonal matrix if we specified a full one. 
  • Line 380: precompute the Gaussian distributions.  That usually means precomputing the constant term, and it almost always makes the code faster. 
  • Line 390: specify the type of re-estimation. 
  • Line 481: checkpointing.  I never use this one, but it seems to allow the training to restart if the network fails. 
  • Line 546 to 577: do MLLR transformation of the models, for SAT-type training. 
(Note to myself: got to understand why svspec was included in the code.)
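The Gaussian precomputation at Line 380 is the standard trick of caching the part of the log density that does not depend on the observation.  A sketch of the idea (my own illustration, not the bw code), for a diagonal-covariance Gaussian:

```python
import numpy as np

def precompute_gauss_const(var):
    """Cache the observation-independent part of the diagonal-covariance
    log Gaussian: -0.5 * (D * log(2*pi) + sum(log(var)))."""
    return -0.5 * (var.shape[-1] * np.log(2 * np.pi) + np.log(var).sum(axis=-1))

def log_gauss(x, mean, var, const):
    # per-frame work is now just the Mahalanobis-like term
    return const - 0.5 * (((x - mean) ** 2) / var).sum(axis=-1)
```

Since the constant is the same for every frame, hoisting it out of the per-frame loop saves a log and a sum per density evaluation, which adds up quickly when you score thousands of Gaussians per frame.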


Now let's go to main_reestimate.  In a nutshell, this is where the looping occurs.

      -> for every utterance   
        -> corpus_get_generic_featurevec (get feature vector (mfc))  
        -> feat_s2mfc2feat_live (get the feature vector)  
        -> corpus_get_sent (get the transcription)  
        -> corpus_get_phseg (get the phoneme segmentation.)  
        -> pdumpfn (open a dump file; this is more related to Dave's constrained Baum-Welch research)  
        -> next_utt_states() /* create the state sequence network. One key function in bw. I will trace it in more detail. */ 
        -> if it is not in Viterbi mode  
         -> baum_welch_update()  /* i.e. Baum-Welch update */
        -> else  
         -> viterbi()  /* i.e. Viterbi update */

Interesting code:

  • Line 702:  several parameters for the algorithm are initialized, including abeam, bbeam, spthres, maxuttlen.
    • abeam and bbeam are essentially the beam widths which control the forward and backward algorithms. 
    • maxuttlen: this controls how long an utterance can be when read in.  These days, I seldom see this parameter set to anything other than 0 (i.e. no limit).
    • spthres: "State posterior probability floor for reestimation.  States below this are not counted".  Another parameter I seldom use......


 -> for each utterance
       forward() (forward.c) (<- This is where the forward algorithm is. Very complicated: 700 lines)  
       if -outphsegdir is specified, dump a phoneme segmentation.  
       backward_update() (backward.c: do the backward algorithm and also update the accumulators)  
           (<- This is even more complicated: 1400 lines)  
 -> accum_global() (global accumulation)   
         (<- Sort of long, but more trivial than forward and backward.)  

Now this is the last function for today.  If you look back at the section "Baum-Welch algorithm in Theory", you will notice how the procedure maps onto Sphinx. Several thoughts:

  1. One thing to notice is that forward, backward_update and accum_global need to work together.   But you have to realize all of these are long, complicated functions.   So, like next_utt_state, I will separate the discussion into another post.
  2. Another comment here: backward_update not only carries out the backward pass, it also updates the statistics.

Conclusion of this post

In this post, I went through a high-level description of the Baum-Welch algorithm as well as how the theory maps onto the C codebase.  In my next post (will there be one?), I will focus on the low-level functions such as next_utt_state, forward, backward_update and accum_global.

Feel free to comment. 


Tuesday, January 08, 2013

Two Views of Time-Signal : Global vs Local

As I have been working with Sphinx at work and started to chat with Nicholay more, one thing I realized is that several frequently used components of Sphinx need a rethink.  Here is one example related to my recent work.

Speech signals, or time signals in general, can be processed in two ways: you either process the signal as a whole, or you process it in blocks.  The former you can call a global view, the latter a local view.  Of course, there are many other names: block/utterance, block/whole, but essentially the terminology means the same thing.

Most of the time, global and local processing give the same result, so you can simply say the two types of processing are equivalent.

Of course, that breaks down when you start an operation which uses information from the whole signal.   For a very simple example, look at cepstral mean normalization (CMN).  Implementing CMN in block mode is certainly an interesting problem: for example, how do you estimate the mean with a running window?   When you think about it a little, you will realize it is not a trivial problem. That's probably why there are still papers on cepstral mean normalization.
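The contrast is easy to see in code.  Below is a toy sketch of my own (not the sphinxbase implementation): global CMN subtracts the exact utterance mean, while a block/live variant has to make do with a running estimate, here an exponentially decaying mean, which is one common approximation.

```python
import numpy as np

def cmn_global(cep):
    """Whole-utterance CMN: subtract the mean over all frames."""
    return cep - cep.mean(axis=0)

def cmn_live(cep, alpha=0.995, init_mean=None):
    """Running-mean CMN for block/live mode: the mean estimate is
    updated frame by frame, so early frames see a poor estimate."""
    mean = np.zeros(cep.shape[1]) if init_mean is None else init_mean.copy()
    out = np.empty_like(cep)
    for t, frame in enumerate(cep):
        mean = alpha * mean + (1.0 - alpha) * frame
        out[t] = frame - mean
    return out
```

The two outputs differ, and the gap is worst at the start of an utterance, which is exactly why the running-window estimation problem is non-trivial.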

Translated to Sphinx: if you look at sphinxbase's sphinx_fe, you will realize that the implementation is based on the local mode, i.e. every once in a while, samples are consumed, processed and written to disk.    There is no easy way to implement CMN in sphinx_fe because it is assumed that the consumers (such as decode and bw) will do these things on their own.

It's all good, though there is an interesting consequence: what the SF guys call a "feature" is really all the processing that can be done in the local sense, rather than the "feature" you see in either the decoders or bw.

This point of view is ingrained within sphinxbase/sphinxX/sphinxtrain (Sphinx 4? not sure yet).  This is quite different from what you will find in HTK, which sees a feature vector as the vector used in Viterbi decoding.

That brings me to another point.  If you look deeper, HTK tools such as HVite/HCopy are highly abstract: each tool was designed to take care of its own problem well. HCopy really just provides the features, whereas HVite just runs the Viterbi algorithm on a bunch of features.   It's nothing complicated.  On the other hand, Sphinx is more speech-oriented.  In that world, life is more intertwined.   That's perhaps why you seldom hear of people using Sphinx for research other than speech recognition.  You can, on the other hand, do other machine learning tasks in HTK.

Which view is better?  If you ask me, I wish both HTK and Sphinx were released under a Berkeley license.  Tons of real-life work could be saved because each covers some useful functionality.

Given that only one of them is released under a liberal license (Sphinx), maybe what we need is to absorb some design paradigms from HTK.  For example, HTK has a sense of organizing data as pipes.   That is something SphinxTrain could use.   It would help Unix users, who usually contribute the most to the community.

I also hope that eventually there will be good clones of the HTK tools made available under a Berkeley/GNU license.  Not that I don't like the status quo: I am happy to read the code of HTK (unlike the time before 2.2......).   But once you work in the industry for a while, you see that many people are actually using both Sphinx and HTK to solve their speech research-related problems.   Of course, many of these guys (if they are honest) need to put in extra development time to port some HTK functions into their own production systems.  Not tough, but you will wonder whether the time could be better spent......


Readings at Jan 8, 2013

Testing Redux  by Vivek Halder

Comment: I second Vivek.  In a large-scale project, having no testing is a project-killing act.  One factor to consider: in real life, the mythical 100X-productive programmer is rarely seen.   Even then, these programmers can make a mistake or two.   Therefore, not having any automatic testing for a group is a very bad thing.

On the other hand, should an individual programmer always follow automatic testing?   Yes and no.  Yes in the sense that you should always write tests for your program.  No in the sense that you shouldn't believe testing will automatically make your program correct.

Comment: very well said.  I hope this "more hours = more work done" crap can end soon.

Todon't by Jeff Atwood

Comment: I like todo lists a lot, but sometimes it takes me a while to organize them.  They also grow very quickly.  A good suggestion here: not only should you add to your list, you should also keep changing priorities.  One of the purposes of a todo list is to record your life, but it has nothing to do with how you move forward with your life.


Saturday, January 05, 2013

Readings at Jan 5, 2013

Understanding your own code by Eli Bendersky

I will add: if you need to maintain a certain codebase, go ahead and read all of it.


Friday, January 04, 2013

OpenEars and RapidEars

There are a couple of messages about open-source speech recognition on iOS.  That certainly caught my eye.  For now, I believe one of the most well-known names is OpenEars.  It is built on PocketSphinx and provides speech synthesis functionality as well.

It looks pretty promising.  So I will probably start to play with it as well.

I just added a link to OpenEars here on the Grand Janitor's Blog.   Btw, I should think of a better way to syndicate the different ASR-related sites.


Readings at Jan 4, 2013

Forth language from James Hague (Link)


Tuesday, January 01, 2013

After the 50th Post, some vision on the Grand Janitor's Blog

I tried to gather some topic ideas for this blog.   From LinkedIn, I heard quite a bit of feedback in discussions about decoding algorithms.    I have thought about this a bit.   Here are some sample topics which I will discuss:

  • How to set up Recognizer X on Platform Y? 
  • What are the interesting properties of Program A in speech recognition package B? 
  • What are the best practices when you try to use a certain model training package? 
  • Tuning results in a domain. 
  • News: a certain package has a new version coming out! What is special about it? 
  • Thoughts on speech recognition problems.
  • Thoughts on programming problems.
  • Algorithms/data structures for decoding/parameter estimation : like you suggested in this post. 
  • What is an interesting development direction for a speech recognition package?
  • Others: software/algorithms in related fields such as NLP, MT. 
Of course, I will keep on chatting about some ongoing components of Sphinx which I find interesting:

  1. Sphinx 4 maintenance and refactoring
  2. PocketSphinx's maintenance
  3. An HTKbook-like documentation : i.e. Hieroglyphs. 
  4. Regression tests on all tools in SphinxTrain.
  5. In general, modernization of Sphinx software, such as using WFST-based approach.
Unlike an academic journal, I hope this is more an informal hub of ASR information.   I don't know if I can keep it up, but at least that's what I am going to try. 

In any case, this blog will occupy much of my spare time next year.   Feel free to comment and give me suggestions.  And of course ....... Happy New Year!