HNSummaries.com - algorithmically summarized HN articles to your inbox (hnsummaries.com)
82 points by dy on April 12, 2012 | 45 comments


Would love to get people's feedback - I built this for myself over the weekend as a way to accelerate and limit my reading of HN (and was inspired by the NLP course from Stanford).

The NLP is pretty basic and produces summaries that are a fixed ratio of the original article's length, so you do get some longer listings.

Big thanks to Wayne Larsen of hckrnews.com for providing me with some insight on tracking top stories and letting me use his ranking data. Also, I recommend http://www.hackernewsletter.com/ for a human-curated version.


Is the code online? If not, any chance you would consider it? I'm into NLP (I wrote http://bookshrink.com) and would love to see how you did this!


Hi peter_l_downs - the summarizer is based on Open Text Summarizer (http://libots.sourceforge.net/) which works very similarly to your page at bookshrink (TF-IDF sentence scoring). I made some minor edits that accommodate article structure.

Bookshrink has some pretty amusing summaries... it reminded me of a meme from a while back where people would paste books into Microsoft Word and use AutoSummarize to condense them down to six words :)


I like your workflow in the department of signups (no password, no 'account', opt-out in every email). However, I entered my email yesterday, got a screen saying 'email is being delivered as we speak', and did not receive the email. Side effect: I cannot re-register, reset my password, or anything else ('Email address has already been taken').

Effectively, an unsuccessful sign-up locks the user out. You could provide some options (resend? remove?) where the 'Email address has already been taken' message is now.

Edit: 'unsuccessful sign-up' - seems like success is fractional here.


It would be cool if I could see a sample before signing up.


Just subscribed - some very nice summaries in there. Do you have any links you could share on how you worked out what's relevant? I have an extended project I want to do on basic NLP, and this seems very relevant. Thanks!


Here is a simple recipe that does something similar and works decently well as a starting point:

--------------------------------------------------------

Count how many times each word appears in the document, storing the counts in a dictionary or map structure.

Also make sure you track the total word count.

document |> splitBySpace |> for each word: dictionary[word] = (if dictionary has word then dictionary[word] + 1 else 1); totalWords++

Then split the document into sentences.

Okay, now for each sentence

==========================================

score = 0.

split sentence by space and

for each word score+= -(dictionary[word]/sum) * log(dictionary[word]/sum)

dictionaryScore.Add(sentence, score)

==========================================

So now each sentence has a score. You can sort by best score (losing the original order). Or, if you want to apply a cut-off limit (between 0 and 1) based on score:

find the best score and keep each sentence whose score / bestScore exceeds the limit.

As I said, this is only a starting point and is susceptible to lists of random words (guess why); there are many ways to make it better. Here is a portion of code I dug up from a while ago:

  // sum the values of a Map (curryfst drops the key so (+) folds over the values)
  let inline sumMap m = m |> Map.fold (curryfst (+)) 0.

  // word counts for a document (stop words filtered out) plus their total
  let inline internal countsAndSum n doc =
    let counts = splitstr [|" "|] doc |> filterStop n |> Array.fold mapAdd Map.empty
    counts, sumMap counts

  // entropy contribution of word k: -p * log2 p, where p is its relative frequency
  let ent m sum k =
    let p = (mapGet m k 0.)/sum
    if p = 0. then 0. else -p * log2 (p)

  // score each sentence by summing the entropy contributions of its words
  let eScore doc =
    let counts, sum = countsAndSum 0 doc
    splitSentenceRegEx doc |> Array.map (fun str -> str, splitstr [|" "|] str |> Array.fold (flip ((+) << (ent counts sum))) 0.)
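
And here is a minimal, self-contained F# sketch of the same recipe for anyone who doesn't have the helpers above - my own code, so splitWords, wordCounts, entropyTerm and summarize are illustrative names rather than the snippet's actual functions, and stop-word filtering/stemming are omitted:

  open System

  let splitWords (text: string) =
    text.Split([|' '; '\t'; '\n'|], StringSplitOptions.RemoveEmptyEntries)

  let splitSentences (text: string) =
    text.Split([|'.'; '!'; '?'|], StringSplitOptions.RemoveEmptyEntries)

  // word -> count, plus the total number of words
  let wordCounts doc =
    let counts =
      splitWords doc
      |> Array.fold (fun m w -> Map.add w (1.0 + defaultArg (Map.tryFind w m) 0.0) m) Map.empty
    counts, Map.fold (fun acc _ v -> acc + v) 0.0 counts

  // entropy contribution of one word: -p * log2 p
  let entropyTerm counts total word =
    match Map.tryFind word counts with
    | Some c -> let p = c / total in -p * Math.Log(p, 2.0)
    | None -> 0.0

  // score every sentence, then keep the ones close enough to the best score
  let summarize limit doc =
    let counts, total = wordCounts doc
    let scored =
      splitSentences doc
      |> Array.map (fun s -> s, splitWords s |> Array.sumBy (entropyTerm counts total))
    let best = scored |> Array.map snd |> Array.max
    scored |> Array.filter (fun (_, score) -> score / best > limit) |> Array.map fst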


This is the approach used in the Python NLTK. That algorithm was adapted from a Java library called Classifier4J that I wrote in the early 2000s [1].

I'd never seen that technique before, but (like a lot of algorithms) it is quite obvious once you've seen it.

Edit: actually, it's slightly different from my technique, because I just used linear scoring (i.e., counting popular words). I'm not sure which technique would work best.

[1] https://groups.google.com/d/topic/nltk-dev/qV9e5TsCBHg/discu...


Hey I hadn't seen the technique above either. But I've certainly heard of your work. Unfortunately I am not able to share in the bounty that is NLTK.

Anyway, it is really hard to judge these things (statistical recommenders) since the metric is inherently subjective and there really is no right or wrong answer. But the way I like to defend it is: if you are going to just skim, you should at least use a statistically based approach. Better than just jumping about randomly.

These days I'm more interested in abstractive summarization without cheating (no templates).


Unfortunately I am not able to share in the bounty that is NLTK

Why is that?


I am much stronger in F# than Python and have already invested time in building a decent codebase in it. Also I personally think better in statically typed functional languages.

I did not write the SO post, which is also based on just word frequencies. I have found, at least in terms of picking the most relevant words with respect to the topic, that the method I wrote, which was inspired by ideas from entropy, gives what I deem to be better results. It's robust against stop words and commonly repeated words that are not part of the topic. The summaries, though, I cannot say are better or worse.


Any chance you would consider explaining it in something other than (Haskell?)


Sure:

My algorithm is something I made up, and from memory it works like this:

1) Remove HTML, stem, remove stopwords, etc.

2) Sort unique words by popularity in the text

3) Split the original text on sentence boundaries.

4) Include each sentence that first mentions the next most popular word, until the summary is the maximum length requested.
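
A rough F# sketch of those steps, just to make them concrete - my own illustrative code, not the Classifier4J or NLTK implementation, with HTML stripping, stemming, and stop-word removal left out:

  open System

  let words (s: string) =
    s.ToLower().Split([|' '; ','; ';'; ':'|], StringSplitOptions.RemoveEmptyEntries)

  let sentences (s: string) =
    s.Split([|'.'; '!'; '?'|], StringSplitOptions.RemoveEmptyEntries) |> Array.map (fun x -> x.Trim())

  let summarize maxSentences (text: string) =
    // 2) unique words, most popular first
    let popular = words text |> Array.countBy id |> Array.sortByDescending snd |> Array.map fst
    // 3) the original sentences, in document order
    let sents = sentences text
    // 4) walk the popularity list; for each word, pull in the first unpicked
    //    sentence that mentions it, until the summary hits the requested cap
    let picked =
      popular
      |> Array.fold (fun chosen word ->
           if List.length chosen >= maxSentences then chosen
           else
             let hit =
               sents |> Array.tryFind (fun s ->
                 Array.contains word (words s) && not (List.contains s chosen))
             match hit with
             | Some s -> chosen @ [s]
             | None -> chosen) []
    // present the picked sentences in their original order
    sents |> Array.filter (fun s -> List.contains s picked)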

http://news.ycombinator.com/item?id=1803020

Googling turns up http://sujitpal.blogspot.com.au/2009/02/summarization-with-l... which compares a few approaches.

Edit: Also http://stackoverflow.com/questions/2829303/given-a-document-..., which I think is by Dn_Ab who wrote the OP.


Does this approach have a name? It looks familiar, but I can't place it.


Reminds me of term frequency-inverse document frequency (http://en.wikipedia.org/wiki/Tf*idf) which can be used with cosine similarity to compute something that approximates the "semantic" relatedness of documents. Example and more algo details here: http://www.exsupera.com/sandbox/DCM/html/document.py?id=3
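
For anyone unfamiliar, here is a small sketch of the tf-idf plus cosine-similarity idea - my own illustration, not the code at the linked example:

  open System

  let terms (doc: string) =
    doc.ToLower().Split([|' '|], StringSplitOptions.RemoveEmptyEntries)

  // tf-idf weight of every term in one document, given the whole corpus
  let tfidf (corpus: string[]) (doc: string) =
    let n = float corpus.Length
    terms doc
    |> Array.countBy id
    |> Array.map (fun (t, tf) ->
         let df = corpus |> Array.filter (fun d -> terms d |> Array.contains t) |> Array.length
         t, float tf * log (n / float (max 1 df)))
    |> Map.ofArray

  // cosine similarity between two sparse term-weight vectors
  let cosine (a: Map<string, float>) (b: Map<string, float>) =
    let dot = a |> Map.fold (fun acc t w -> acc + w * defaultArg (Map.tryFind t b) 0.0) 0.0
    let norm (m: Map<string, float>) = m |> Map.fold (fun acc _ w -> acc + w * w) 0.0 |> sqrt
    dot / (norm a * norm b)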


Although it sometimes gives similar results, it's not tf-idf. It is simpler, and therefore faster, for this task. If it has a name I don't know it; it's just an idea I had: sentences with the most uncertainty associated with them, with respect to the document, should be the most interesting/relevant.


I personally don't want algorithmically summarized content; I want content manually summarized by knowledgeable HN users. It's half the reason I click into the comments 99% of the time before clicking into the linked article. I want interesting insight along with a good summary of the main points being communicated. There's just no way automatically generated summaries can compete with that.


I'd love to see the best of both worlds. I too love it when someone takes the time to summarize an article -- great community service. I'd love to establish a convention for doing so (I vote for prepending "Summary:" -- I find "tl;dr" irksome).

Then HNsummaries.com could fetch those when available instead of or in addition to the auto-summaries.


Funny, I created exactly that a few years ago, got some traction and a story on readwriteweb.com, and then let the app stagnate and die. Maybe I should pursue it again :)


How about including the top comment for each article in the email?


We are developing a service for that - stay tuned.


Just got my first newsletter. Looking good for an initial release. Some feedback:

Would love to get an index of headlines on top of the email with anchors to actual stories below.

Would love to see shorter summaries and maybe some of the top comments for each story (summarized, if possible).


Thanks for the feedback. I'll add the list of headlines, and I'm thinking about the summaries of the comments... they get harder to understand because they can be very context-specific.

I'll take a crack at it and maybe add it as an option.


Bear in mind that comments here are self-selecting for people who like HN's comments section ;-) But I know plenty of people, and speak to people on Twitter, who deliberately avoid these comments pages due to a perception (fair or not) of "drama" and whatnot. For those folks, an email like this could be just the ticket. For me though, I'm staying here ;-)


Thanks for sharing this, I'm curious to see how well it works out over time. It'd be nice to be able to choose the compression level.

Quality feels at least as good as an open source summarizer I played around with a while back; good work!


Thanks Mark for the comments and feedback! I appreciate it.


One thing I commend you for is asking when I want to receive the email. It's surprising that barely any mailing list or newsletter lets you pick that…


This.

I spend most of my time in Asia-Pacific timezones, so most of my automated emails arrive at awkward times. I'm glad that this one won't be staring at me from my inbox first thing in the morning -- helping me to produce first, consume second.


Great execution, but I'm uncertain of the idea. My personal perspective: I read Wikipedia for information; I read HN for critical insight. It's not always present, but there's a higher signal/noise ratio than on other websites. I don't want a summary of information - I want critical thought.


Why only 20 stories? I usually scan the first three pages once a day - a snapshot of the top 90 articles. Only about 10% are relevant, so I'd rather have more summaries to sift through to find the ~10 relevant articles for the day.


I actually thought algorithmic summaries would be worse than useless, but they seem surprisingly good. Here's the one from Caine's Arcade:

"9 year old Caine sets up an arcade in his father’s used car parts store in East L.A., using only cardboard boxes his dad had lying around and a ton of ingenuity. Watch his dreams come true when this filmmaker sets up a flash mob to come and play. Just watching this may make you a better person. $82,000 has already been raised for Caine’s scholarship fund! little behind on the bandwagon, but...film just had me in tears."


This is extractive summarization, so it's selecting key sentences and phrases from the text (rather than generating any new phrasing); occasionally it will seem brilliant, and other times...


In particular, this approach works best with journalism-style writing. Journalists typically write in a style with fairly short sentences that stand alone, and paragraphs of only 1-3 sentences. They even pay deliberate attention to quotability, for either pull quotes or the chance of being quoted elsewhere, so everything is well suited to pulling sentences out. It tends not to work as well when applied to other styles of writing.

For more general text, the first problem that comes up is that out-of-context sentences with pronouns that point nowhere end up being unintelligible. The second sentence above only worked because the "he" was completely unambiguous in this summary.


I've done some work in this area (specifically in developer related news) and you're right. The tricky ones are where you end up with links to GitHub repos or project pages that assume visitors know what they're looking for. Automated summaries then become less than useless :-( My dream solution? Developers learn to write nice summaries on their pages ;-)


Might be worthwhile to automatically fetch and parse text from some well-known URLs (GitHub, for example) to grab content from there to use as an adjunct.


I plan on adding this to my http://newspaper23.com site. It's just way on the back burner.

Ideally I think you would do it client-side, so readers could adjust the shrinkage to the amount of time they have to peruse. I was also thinking about a scenario where you could browse at, say, 100 words and then dive deep if you found anything that interests you - a more interactive approach. You might want to consider this.

But I really like the idea. Would love to hear how the project goes!


I got my first email, here's some feedback.

You should make sure that the summaries don't scale linearly with the size of the content - just because an article is 10x as long doesn't mean I want a summary that's 10x longer. Maybe scale logarithmically?
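
A hypothetical logarithmic scaling rule - purely illustrative, not what the site actually does - might look like this, with the summary growing much more slowly than the article:

  // e.g. 10 sentences -> 5 summary sentences, 100 -> 10, 1000 -> 14
  let targetSentences (articleSentences: int) =
    max 3 (int (ceil (2.0 * log (float articleSentences + 1.0))))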

I didn't find any of the summaries to be high quality or any better than I could get from briefly skimming HN myself.

I've unsubscribed.


I am taking an alternative approach to make sense of HN stories for Chinese readers. As a regular HN reader, I manually summarize the topic of top stories and translate them into Chinese. The motivation is to lower the startup/tech news sharing barriers. Link - http://geektell.com/


Really like it!

One small suggestion...could you make the "76 comments" under the title clickable through to the HN comments?

One other option (maybe a user preference): include some noteworthy excerpts from the HN comments in the email as well?


Feature Request:

It would be great to get a weekly or monthly summary.

Nice work.


Why email? I'd like to see the summaries on a web page too.


How about giving writers the respect they deserve and not algorithmically rewriting their work? Has our attention span really gotten so short that we cannot read articles of substance any longer?


Well, it's like having an abstract of a paper. Which is a good point -- ideally the authors themselves would provide the summary. Still, you certainly need summaries!

I'd say the only time summaries could be a bad thing is for fiction, where you don't want to give things away.

For non-fiction, giving things away is the whole point. :)


It's not algorithmic, but at Slashdot the summaries are one of its biggest wins.


It's less about attention span and more about signal/noise ratio.



