1) What are the 'top words' which appear when you search for a show? I just get a bunch of profanities (for basically any show, even those which are PG-13). Is this meant to be the top words found in the show's subtitles (it's not) or the most searched-for words (in which case why am I being shown that)? Further searches suggest it does work for some shows (e.g. Homeland, or Game of Thrones).
2) Expanding the 'top words' gives (apparently) a top 100, except many words appear more than once - in my 'top words' for 'The Simpsons', 'MOM' appears 7 times.
3) What are the 'Top topics'? Again, examining The Simpsons, the top topics are 'Case/investigation', 'noisey', and 'spooky'.
4) Browser 'back' doesn't work from top topics or top words
Edit: Having read the 'about' I'm feeling far less critical, given this is part of a Big Data course project. Initially, I wondered if the prevalence of profanities in speech (generally) is causing a weird biasing effect (i.e. a single word being said repeatedly), but given there shouldn't be any 'fuck's in The Simpsons/Modern Family/Friends, my guess is something may be off on the back-end?
First of all, this is all still very early beta work! I'm not sure how it got to HN, but here it is, so all your feedback is great. I'll try to answer your questions as best I can:
1) The top words are those that best characterize the show. This is not a perfect science: they are an output of the LDA algorithm, but they already give a good indication. Some words indeed shouldn't be there. Some possible explanations: a subtitle mistake or a bug...
2) The words that appear more than once are again a glitch, and should be fixed. Again, work in progress...
3) The top topics are found using a topic modelling algorithm. It splits a corpus of documents into a number of topics, and every document contains a certain proportion of each topic (20% Police, 80% Terrorism for example). The topics are bags of words, so we manually give each one the name we think fits best.
4) Again beta...
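To make the topic-modelling description in 3) a bit more concrete, here is a minimal sketch of LDA using a tiny collapsed Gibbs sampler in plain Python. This is not the site's code: the toy "subtitle" documents, the function name `lda_gibbs`, and the hyperparameters (`alpha`, `beta`, the number of iterations) are all assumptions made up for illustration. It only shows the two outputs mentioned above: a proportion of each topic per document, and an unnamed bag of words per topic that a human would then label.

```python
import random
from collections import Counter

def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over already-tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]          # tokens of doc d assigned to topic k
    topic_word = [Counter() for _ in range(n_topics)]   # times word w is assigned to topic k
    topic_total = [0] * n_topics                        # tokens assigned to topic k overall
    assign = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    # Repeatedly resample each token's topic given all the other assignments.
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed per-document topic proportions (each row sums to 1).
    proportions = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    # Each topic is just a bag of words; keep its most frequent ones.
    top_words = [
        [w for w, c in tw.most_common(3) if c > 0] for tw in topic_word
    ]
    return proportions, top_words

# Invented toy "subtitles", one token list per show.
docs = [
    "police arrest suspect police evidence case".split(),
    "bomb terrorist attack agent bomb threat".split(),
    "police case bomb suspect attack evidence".split(),
]
proportions, top_words = lda_gibbs(docs, n_topics=2)
for show, mix in zip(["show A", "show B", "show C"], proportions):
    print(show, [round(p, 2) for p in mix])
print(top_words)
```

The topics come out as numbered word lists with no names attached, which is why the naming step is manual. With real subtitles you would also want stop-word removal and deduplication of word forms before sampling, which may be related to the glitches mentioned in 1) and 2), though that is only a guess.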
I hope the 'about' is clear enough. If you have any questions, feel free to ask!
3.) I think the 'Top Topics' are categorizations based upon the words found in the subtitles. Each word in the English language is mapped to a category, and based upon the word content of the show, it is assigned a category. While interesting in theory, it definitely misses the mark on certain shows. I assume The Simpsons is due to their 27 Treehouse of Horror events while the rest of the show does not necessarily have a central focus.