The Grand Janitor's Blog: 2007

Wednesday, November 07, 2007

Statistically Insignificant Me

This is slightly related to my last post. It raises an interesting question: should we share our bookshelves in the first place?

Why is it an issue? Well, privacy. Suppose someone is malicious and tries to figure you out. The best way to do that is to gather all the information about you and use it against you.

Another concern of mine is rather interesting and absolutely speculative: what if the information I read affects my thoughts, and what if people could reconstruct those thoughts just from what I read? That would open up a lot of interesting applications, e.g. we might be able to better predict what a person will do.

Just as in other time-series problems such as speech recognition and quantitative analysis, a human life could simply be defined as a series of events in time. Some (I forget the quote) believe that a human life could be stored on a hard disk, and some have started to collect such data to see whether a life could be modeled.

Information about what you read could tell a lot about who you are. Do you read Arthur C. Clarke? Do you read Jane Austen? Do you read Stephen King? Do you read Lora Roberts? From that information, one could build a machine learner that maps back to who you are and how you make decisions. We might just call this a kind of personality modeling.

It seems to me all of this is entirely possible given what we know. Yet I still decided to share my bookshelf. Why?

Well, there was a crystal-clear moment for me (and perhaps for you as well) that helped me make the decision. Very simple: *I* am statistically insignificant.

If you happen to come to this web page, the only reason is that you are connected to me. How likely is that to happen?

I know about 150 people in my life. The world has about 6 billion. So the chance of a random person being connected to me, and hence discovering me, is around 150 / (6 x 10^9) = 2.5 x 10^-8. That is already pretty low.

Now, when other people know me and recommend me to someone else, this probability gets boosted because 1) my PageRank will increase, and 2) people who follow my links deep enough will eventually discover my bookshelves.

Yet if I try to stay low-profile (say, by not doing SEO and not recommending that friends visit my page), then it is reasonable to expect the factor mentioned above to be smaller than 1.

Further, 2.5 x 10^-8 is an upper bound on the estimate because
1, Not all my friends are interested in me (discounting factor: 0.6, a conservative one; the actual number is probably higher but I just don't want to face it. ;) )
2, My friends who are interested in me might not follow my links (discounting factor: 0.01)

So we are talking about an event with a probability as low as 10^-9 or 10^-10 here. That seems to me close to the strength of a cheap cryptographic algorithm.
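
For anyone who wants to check the back-of-envelope arithmetic, here is a minimal sketch in Python. The function and constant names are made up for illustration, and every number is just the rough guess quoted above (150 acquaintances, 6 billion people, discounting factors 0.6 and 0.01), not a measurement.

```python
# Back-of-envelope estimate of how likely a random person finds the bookshelf.
# All figures are the rough guesses from the post, not measured values.

ACQUAINTANCES = 150               # people I know (guess)
WORLD_POPULATION = 6_000_000_000  # about 6 billion (2007 figure)
INTERESTED_FACTOR = 0.6           # fraction of friends interested in me (guess)
FOLLOWS_LINK_FACTOR = 0.01        # fraction of those who follow my links (guess)


def discovery_probability(acquaintances, population, *discounts):
    """Base rate acquaintances/population, scaled by each discounting factor."""
    probability = acquaintances / population
    for discount in discounts:
        probability *= discount
    return probability


# Upper bound: anyone connected to me counts as having discovered me.
upper_bound = discovery_probability(ACQUAINTANCES, WORLD_POPULATION)

# Discounted estimate: apply both discounting factors.
estimate = discovery_probability(
    ACQUAINTANCES, WORLD_POPULATION, INTERESTED_FACTOR, FOLLOWS_LINK_FACTOR
)

print(f"upper bound: {upper_bound:.1e}")
print(f"discounted:  {estimate:.1e}")
```

With both discounting factors applied, the result lands around 10^-10, which matches the order of magnitude quoted above.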

But notice that my security does not come from hiding or cryptography. It comes merely from my statistical insignificance. In plain English, I am very open but no one cares. And I am still a happy treebear. ;)

That's why you see my bookshelf. Long story for a simple decision. If you happen to read this, I hope you enjoy it.

-a

Visual Bookshelves

I love to read and like to write a review of every book I read. None of them will change the world, but I still love to do it. That's why, by definition, I'm a bookworm. I don't even feel shy about it. ;)

I went quite far: I tried to record every book I read and started to put them on a blog called "ContentGeek". Luckily, I hadn't gone very far, because once I discovered Visual Bookshelves, there was no need for me to do it all myself.

Visual Bookshelves allows users to look up a book on Amazon, add comments, and store it in a database. It also shows the book covers. What more could I want?

So anyway, this is the link to my visual bookshelves:

http://www.cs.cmu.edu/~archan/personal/bookshelf.html

Enjoy.

-a

Monday, October 01, 2007

Prof. Randy Pausch's Last Lecture

http://video.google.com/videoplay?docid=362421849901825950&hl=en

If you work or study in Computer Science or Electrical Engineering, I would highly recommend this video to you. It reminds me why I decided to work in this industry in the first place.

-a

Wednesday, August 01, 2007

David's plan on Sphinx 3.7

http://lima.lti.cs.cmu.edu/mediawiki/index.php/Sphinx3

A great read. It touches the heart of the implementation issues of all the sphinxen. And its criticism of my implementation is right to the point.

I felt very relieved when the current maintainer attacked what I did in the past. (Some features I did were rather stupid.) This shows that Sphinx is still alive and will stay alive.

-a

Friday, July 20, 2007

Life in Scanscout

Hi Guys,
Scanscout (www.scanscout.com) is a rather interesting company. If you look at this blog, you probably know that I have been there for a while.

My direct supervisor doesn't like to give away too much. I think he has a point (as he is a *v* smart guy). This contradicts my philosophy of information sharing. So alright, as a compromise, here are a couple of things I could share. (Of course, my estimate of the probability of anyone looking at this blog is about 1/10^9, so I guess it doesn't matter that much...)

1, We have a massage chair and it is awesome.
2, We have a foosball table and have a tournament every Friday. Beware, there are several good players. (I always get the lowest score.)
3, The company is at the forefront of video advertising. I am glad that I've joined. :-)

Arthur Chan

Sunday, April 15, 2007

mosesdecoder

mosesdecoder: http://www.statmt.org/moses/

Ah. This is not exactly news. It has been around since the 2006 Johns Hopkins workshop.

mosesdecoder is probably the first open-source statistical machine translation implementation in the world. For quite a while, only the IBM-model training portion of the code could be found, in GIZA++. So people interested in SMT would probably turn to Pharaoh, a closed-source implementation available on the web.

I could have some fun. ;-)

-a

Monday, March 05, 2007

Third Draft of Hieroglyphs

Hi all,

It has been a while since I worked on the Hieroglyphs (the fancy name I made up for the Sphinx documentation). This is perhaps the only thing I haven't wrapped up at CMU. Therefore I decided to release a draft. You can find it at

http://www-2.cs.cmu.edu/~archan/documentation/sphinxDocDraft3.pdf

It still looks pretty messy but it starts to look like a book now.

Several chapters and sections were trimmed in this draft. You will still see a lot of ?. Those are signs of insufficient proofreading. Forgive me; when I have more time, I will try to fix some of them in the near future.

Grand Janitor