That sounds possible, but gathering all the data for topic-specific training mig...

stavros · on June 14, 2011

That's interesting, do you have any code or examples?

My thought about topic detection is to have it learn which words go together, and then augment the Markov chain model by weighting each Markov chain probability with a topic probability when selecting the next word, so the output would at least generally stick to topic-relevant words. Perhaps you could even select one topic (a sample sentence, really) at the beginning and have it generate sentences based on that for the entire document.
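
A minimal sketch of that weighting idea, under loud assumptions: the "topic probability" here is just a hypothetical boost factor for words taken from a sample sentence, and it is multiplied into first-order transition counts before sampling. The names `build_chain`, `topic_weights`, `next_word`, `generate`, and the `boost` parameter are made up for illustration; a real topic model would replace `topic_weights`.

```python
import random
from collections import Counter, defaultdict


def build_chain(words):
    """First-order Markov chain: word -> Counter of the words that follow it."""
    chain = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        chain[current][following] += 1
    return chain


def topic_weights(topic_words, vocabulary, boost=5.0):
    """Hypothetical stand-in for a topic model: on-topic words get `boost`,
    every other word gets 1.0."""
    topic = set(topic_words)
    return {w: (boost if w in topic else 1.0) for w in vocabulary}


def next_word(chain, weights, current, rng):
    """Sample a follower of `current`, scoring each candidate as
    transition count * topic weight."""
    followers = chain.get(current)
    if not followers:
        return None  # dead end: this word was never followed by anything
    candidates = list(followers)
    scores = [followers[w] * weights.get(w, 1.0) for w in candidates]
    return rng.choices(candidates, weights=scores, k=1)[0]


def generate(chain, weights, start, length, rng=None):
    """Generate up to `length` words, stopping early at a dead end."""
    rng = rng or random.Random()
    out = [start]
    while len(out) < length:
        word = next_word(chain, weights, out[-1], rng)
        if word is None:
            break
        out.append(word)
    return out


corpus = "the cat sat on the mat the dog sat on the log".split()
chain = build_chain(corpus)
# The "topic" is a sample sentence, as suggested above.
weights = topic_weights("the cat sat".split(), set(corpus))
print(" ".join(generate(chain, weights, "the", 8)))
```

Multiplying the two scores is only one way to combine them; interpolating the two probabilities would be another, and which works better is an open question here.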

mostlycarbon · on June 14, 2011

I wish I could publish it, but my company isn't very much into open source. It's a standard context-free grammar framework modified to generate output in a stochastic manner. So it's basically a stochastic context-free grammar (SCFG). I can go into more depth in private if you like.
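
For anyone unfamiliar with the term, here is a minimal sketch of generating text from an SCFG. It is not the unpublished framework described above; the toy grammar, its probabilities, and the `expand` function are all hypothetical and only illustrate the general idea of picking each production at random according to its probability.

```python
import random

# Hypothetical toy grammar: nonterminal -> list of (probability, expansion).
# Any symbol that is not a key in the dict is treated as a terminal word.
GRAMMAR = {
    "S": [(1.0, ["NP", "VP"])],
    "NP": [(0.6, ["the", "N"]), (0.4, ["the", "ADJ", "N"])],
    "VP": [(0.7, ["V", "NP"]), (0.3, ["V"])],
    "N": [(0.5, ["parser"]), (0.5, ["grammar"])],
    "ADJ": [(1.0, ["stochastic"])],
    "V": [(0.5, ["generates"]), (0.5, ["rewrites"])],
}


def expand(symbol, grammar, rng):
    """Recursively expand `symbol` into a list of terminal words,
    choosing one production per nonterminal according to its probability."""
    if symbol not in grammar:
        return [symbol]  # terminal: emit the word itself
    rules = grammar[symbol]
    probabilities = [p for p, _ in rules]
    _, expansion = rng.choices(rules, weights=probabilities, k=1)[0]
    words = []
    for child in expansion:
        words.extend(expand(child, grammar, rng))
    return words


print(" ".join(expand("S", GRAMMAR, random.Random())))
```

This toy grammar has no recursive rules, so expansion always terminates; a grammar with recursion would need its probabilities chosen so that derivations end with probability 1, or a depth limit.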

The phrase for finding word pairs in text corpora is "cohort analysis". I was on a research team that did studies of that; mostly finding them, not generating anything with them.
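
As a rough illustration of finding word pairs in a corpus (not the research team's method, which isn't described here), one common approach is to count adjacent pairs and score them with pointwise mutual information (PMI). The function name `pair_scores` and the `min_count` threshold are assumptions made for this sketch.

```python
import math
from collections import Counter


def pair_scores(words, min_count=2):
    """Score adjacent word pairs by PMI: log2(P(a, b) / (P(a) * P(b))).
    Pairs seen fewer than `min_count` times are skipped as too noisy."""
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    n_unigrams = len(words)
    n_bigrams = max(len(words) - 1, 1)
    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue
        p_ab = count / n_bigrams
        p_a = unigrams[a] / n_unigrams
        p_b = unigrams[b] / n_unigrams
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


words = "new york is big and new york is busy".split()
for pair, score in sorted(pair_scores(words).items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

A positive score means the pair shows up together more often than its individual word frequencies would predict.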

It's an interesting subject area.

stavros · on June 14, 2011

That gives me a good idea for further research, thank you.