Cool post, Andy. NLTK is a lot of fun, but it's not necessarily a production-ready solution -- for instance, scaling it out to other languages can run into problems with UTF-8 handling. NLTK's real purpose is pedagogy, and your blog post is a nice addition for teaching people Python and computational linguistics at the same time.
You might be interested in checking out pattern ( http://www.clips.ua.ac.be/pages/pattern ). It has a heuristic approach to sentiment analysis built right in that might be worth comparing your features against.
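For readers curious what "heuristic" means here, the sketch below shows the general flavor of a lexicon-based polarity scorer in plain Python. The tiny word list and the negation rule are invented for illustration -- pattern's actual implementation uses a much larger annotated lexicon and a richer scoring scheme:

```python
# A minimal lexicon-based sentiment sketch, in the spirit of a heuristic
# approach like pattern's. The lexicon and scoring below are invented for
# illustration, not taken from any real library.
POLARITY = {
    "good": 0.7, "great": 0.8, "fun": 0.6,
    "bad": -0.7, "slow": -0.3, "terrible": -0.9,
}
NEGATIONS = {"not", "never", "no"}

def polarity(text):
    """Average the polarity of known words, flipping the sign after a negation."""
    tokens = text.lower().split()
    scores = []
    for i, tok in enumerate(tokens):
        if tok in POLARITY:
            score = POLARITY[tok]
            # Naive negation handling: "not good" scores as -0.7.
            if i > 0 and tokens[i - 1] in NEGATIONS:
                score = -score
            scores.append(score)
    return sum(scores) / len(scores) if scores else 0.0
```

Something this crude makes a useful baseline to compare trained features against: if a classifier can't beat a twenty-line lexicon lookup, the features need work.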
Finally, as far as classification goes, Python is pretty all right, but it can be a tad slow working through large amounts of data. I've found that text classification at scale is best left to an external library, with Python doing feature extraction and managing the data pipeline. In the past I've built out feature sets with Python and then passed them to TADM ( http://tadm.sourceforge.net/ ). The advantage of TADM is that, being written in C++, it's meticulously optimized; the trade-off is that you have fewer modeling options available to you. That's just one example -- there are plenty of similar tools written in Java as well.
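The division of labor I mean looks roughly like this: Python extracts features and serializes them to disk, and the external trainer consumes the file. The one-line-per-instance format below is a made-up illustration, not TADM's actual event-file format:

```python
# Sketch of the pipeline split described above: Python handles feature
# extraction and writes instances out for an external trainer. The
# "label feat:count feat:count" format here is invented for illustration;
# TADM and similar tools each define their own input formats.
from collections import Counter

def extract_features(text):
    """Lowercased bag-of-words counts -- a deliberately simple feature set."""
    return Counter(text.lower().split())

def write_instances(labeled_texts, path):
    """Write one 'label feat:count ...' line per (label, text) pair."""
    with open(path, "w") as f:
        for label, text in labeled_texts:
            feats = " ".join(
                f"{word}:{count}"
                for word, count in sorted(extract_features(text).items())
            )
            f.write(f"{label} {feats}\n")
```

The nice part of this split is that the feature-engineering loop stays in Python, where it's quick to iterate, while the heavy numerical optimization runs in whatever compiled tool you've picked.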
Thanks for a good read!