
That sounds possible, but gathering all the data for topic-specific training might be somewhat maddening. The problem you get with larger groupings of words is lack of cohesion. If you trained it at a sentence level, you might be able to produce coherent sentences. But as you generated more sentences to produce a paragraph, it would likely meander.

I created a kind of Mad Lib generator using CFGs. A paragraph consisted of: [Intro] [Supporting sentence 1] [Supporting sentence 2] [Supporting sentence 3] ... [Conclusion]. All the sentences had placeholders for various nouns and adjectives that could later be filled in programmatically, and I extended the grammar spec to both permute sets of sentences and generate null productions with certain probabilities.
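To make the structure above concrete, here is a minimal sketch of a stochastic template grammar in Python. It is not the original code (which was never published); the rule names, weights, and seed sentences are all made up for illustration. It shows the three ideas mentioned: weighted choice among seed sentences, permuting a set of sentences, and null productions that fire with some probability.

```python
import random

# Hypothetical toy grammar. Each rule maps to (weight, expansion) pairs.
# An empty expansion [] is a null production: the rule yields nothing.
GRAMMAR = {
    "PARAGRAPH": [(1.0, ["INTRO", "SUPPORT", "ASIDE", "CONCLUSION"])],
    "INTRO": [
        (0.5, ["Choosing a {adj} {noun} takes some thought."]),
        (0.5, ["Most people underestimate the {noun}."]),
    ],
    "S1": [(1.0, ["A {adj} {noun} saves time."])],
    "S2": [(1.0, ["Prices for a {noun} vary widely."])],
    "S3": [(1.0, ["Reviews of each {noun} are easy to find."])],
    "ASIDE": [(0.7, ["It pays to compare."]), (0.3, [])],  # dropped 30% of the time
    "CONCLUSION": [(1.0, ["In short, pick the {noun} that fits."])],
}

# Rules whose children are emitted in a random order (the "permute" extension).
PERMUTED = {"SUPPORT": ["S1", "S2", "S3"]}


def expand(symbol, rng):
    """Recursively expand a symbol into a flat list of sentence templates."""
    if symbol in PERMUTED:
        children = PERMUTED[symbol][:]
        rng.shuffle(children)
        return [s for child in children for s in expand(child, rng)]
    if symbol not in GRAMMAR:
        return [symbol]  # terminal: a sentence template
    weights, options = zip(*GRAMMAR[symbol])
    chosen = rng.choices(options, weights=weights, k=1)[0]
    return [s for child in chosen for s in expand(child, rng)]


def generate(fillers, seed=None):
    """Build one paragraph and fill the noun/adjective placeholders."""
    rng = random.Random(seed)
    sentences = expand("PARAGRAPH", rng)
    return " ".join(s.format(**fillers) for s in sentences)


if __name__ == "__main__":
    print(generate({"noun": "laptop", "adj": "reliable"}))
```

With about 15 human-written seeds per rule instead of one or two, the same skeleton would produce far more varied paragraphs; the placeholders are filled after the template is chosen, so one grammar can be reused across many nouns and adjectives.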

The base sentences were created by humans, about 15 per grammar rule. A single person could create a topic-based paragraph/grammar in less than two work days. The chance of it creating the same template twice was about one in a billion, though that probability varied with how many seed sentences were present.
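As a rough sanity check on that figure: if each sentence slot picks uniformly and independently among its seeds, the number of distinct templates is the product of the choices per slot, and the chance that two generated paragraphs share a template is one over that count. The slot counts below are assumptions for illustration; the post only gives "about 15 per grammar rule".

```python
from math import factorial


def template_count(seeds_per_rule, slots, permuted_slots=0):
    """Distinct templates: one seed choice per slot, times the
    orderings of any slots that are emitted in permuted order."""
    return seeds_per_rule ** slots * factorial(permuted_slots)


# Hypothetical layouts, not taken from the original system.
print(template_count(15, 8))     # 8 slots, fixed order
print(template_count(15, 7, 3))  # 7 slots, 3 of them permuted
```

Both layouts land in the low billions (15^8 = 2,562,890,625 and 15^7 x 3! = 1,025,156,250), so a paragraph of seven or eight sentences with 15 seeds each is consistent with "about one in a billion". Null productions and unequal weights would shift the exact number.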

If the person writing the seed sentences is literate and passed the 6th grade, then everything the program generates is indistinguishable from human text.

It works marvelously.



That's interesting. Do you have any code or examples?

My thought about topic detection is to have it learn which words go together, and then augment the Markov chain model so that the Markov chain probability of each candidate next word is weighted by its topic probability. That way it would at least generally stick to topic-relevant words. Perhaps you could even select one topic (a sample sentence, really) at the beginning and have it generate sentences based on that for the entire document.
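One way this could look, as a sketch only: a bigram Markov chain whose transition probabilities are multiplied by a per-word topic score before sampling. The function names, the `floor` score for off-topic words, and the `alpha` exponent are assumptions made for the example; how the topic scores themselves are learned is left open.

```python
import random
from collections import Counter, defaultdict


def train_bigrams(tokens):
    """Count how often each word follows each other word."""
    model = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        model[prev][nxt] += 1
    return model


def next_word(model, prev, topic_prob, rng, floor=0.01, alpha=1.0):
    """Sample the next word with weight P(word | prev) * topic_prob(word)^alpha.

    Words missing from topic_prob get the `floor` score. Keep floor > 0
    unless at least one candidate is guaranteed to be on topic, because
    sampling fails when every weight is zero.
    """
    candidates = model.get(prev)
    if not candidates:
        return None  # no known successor for this word
    words = list(candidates)
    total = sum(candidates.values())
    weights = [
        (candidates[w] / total) * (topic_prob.get(w, floor) ** alpha)
        for w in words
    ]
    return rng.choices(words, weights=weights, k=1)[0]


def generate(model, start, topic_prob, length=10, seed=None):
    """Generate up to `length` words, starting from `start`."""
    rng = random.Random(seed)
    out = [start]
    while len(out) < length:
        word = next_word(model, out[-1], topic_prob, rng)
        if word is None:
            break
        out.append(word)
    return out


if __name__ == "__main__":
    corpus = "the cat eats fish the dog eats bones the cat chases the dog".split()
    model = train_bigrams(corpus)
    print(" ".join(generate(model, "the", {"cat": 1.0, "fish": 1.0})))
```

Raising `alpha` pushes the output harder toward the topic at the cost of fluency; setting it to 0 recovers the plain Markov chain. Seeding `topic_prob` from a single sample sentence would correspond to the "select one topic at the beginning" idea.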


I wish I could publish it, but my company isn't very much into open source. It's a standard context-free grammar framework modified to generate output in a stochastic manner. So it's basically a stochastic context-free grammar (SCFG). I can go into more depth in private if you like.

The phrase for finding word pairs in text corpora is "cohort analysis". I was on a research team that did studies of that; mostly finding them, not generating anything with them.

It's an interesting subject area.


That gives me a good idea for further research, thank you.



