HNSummaries.com - algorithmically summarized HN articles to your inbox (hnsummaries.com)
82 points by dy on April 12, 2012 | 45 comments


Would love to get people's feedback - I built this for myself over the weekend as a way to accelerate and limit my reading of HN (and was inspired by the NLP course from Stanford).

The NLP is pretty basic and produces summaries that are a fixed ratio of the original article's length, so you do get some longer listings.

Big thanks to Wayne Larsen of hckrnews.com for providing me with some insight on tracking top stories and letting me use his ranking data. Also, I recommend http://www.hackernewsletter.com/ for a human-curated version.


Is the code online? If not, any chance you would consider it? I'm into NLP (I wrote http://bookshrink.com) and would love to see how you did this!


Hi peter_l_downs - the summarizer is based on Open Text Summarizer (http://libots.sourceforge.net/) which works very similarly to your page at bookshrink (TF-IDF sentence scoring). I made some minor edits that accommodate article structure.

Bookshrink has some pretty amusing summaries... it reminded me of a meme from a while back where people would paste books into Microsoft Word and use AutoSummarize to condense them down to six words :)


I like your workflow in the department of signups (no password, no 'account', opt-out in every email). However, I entered my email yesterday, got a screen saying 'email is being delivered as we speak', and did not receive the email. Side effect: I cannot re-register, reset my password, or anything else ('Email address has already been taken').

Effectively, an unsuccessful sign-up locks the user out. You could provide some options (resend? remove?) where the 'Email address has already been taken' message is now.

Edit: 'unsuccessful sign-up' - seems like success is fractional here.


It would be cool if I could see a sample before signing up.


Just subscribed - some very nice summaries in there. Do you have any links you could share on how you worked out what's relevant? I have an extended project I want to do on basic NLP, and this seems very relevant. Thanks!


Here is a simple recipe that does something similar and works decently well as a starting point:

--------------------------------------------------------

Count how many times each word appears in the document, storing the counts in a dictionary or map structure.

Also make sure you track the total word count.

document |> splitBySpace |> for each word: dictionary[word] = (if dictionary has word then dictionary[word] + 1 else 1); totalWords++

Then split the document into sentences.

Okay, now for each sentence

==========================================

score = 0.

split sentence by space and

for each word score+= -(dictionary[word]/sum) * log(dictionary[word]/sum)

dictionaryScore.Add(sentence, score)

==========================================

So now each sentence has a score. You can sort by best score (losing the original order). Or, if you want to apply a cut-off limit (between 0 and 1) based on score:

find the best score and keep each sentence whose score / bestScore exceeds the limit.

As I said, this is only a starting point and is susceptible to lists of random words (guess why); there are many ways to make it better. Here is a portion of code I dug up from a while ago:

  // sum the values of a Map (curryfst drops the key so (+) folds over the values)
  let inline sumMap m = m |> Map.fold (curryfst (+)) 0.

  // word counts for a document (stop words filtered out) plus their total
  let inline internal countsAndSum n doc =
    let counts = splitstr [|" "|] doc |> filterStop n |> Array.fold mapAdd Map.empty
    counts, sumMap counts

  // entropy contribution of word k: -p * log2 p, where p is its relative frequency
  let ent m sum k =
    let p = (mapGet m k 0.)/sum
    if p = 0. then 0. else -p * log2 (p)

  // score each sentence by summing the entropy contributions of its words
  let eScore doc =
    let counts, sum = countsAndSum 0 doc
    splitSentenceRegEx doc |> Array.map (fun str -> str, splitstr [|" "|] str |> Array.fold (flip ((+) << (ent counts sum))) 0.)
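
And here is a minimal, self-contained F# sketch of the same recipe for anyone who doesn't have the helpers above - my own code, so splitWords, wordCounts, entropyTerm and summarize are illustrative names rather than the snippet's actual functions, and stop-word filtering/stemming are omitted:

  open System

  let splitWords (text: string) =
    text.Split([|' '; '\t'; '\n'|], StringSplitOptions.RemoveEmptyEntries)

  let splitSentences (text: string) =
    text.Split([|'.'; '!'; '?'|], StringSplitOptions.RemoveEmptyEntries)

  // word -> count, plus the total number of words
  let wordCounts doc =
    let counts =
      splitWords doc
      |> Array.fold (fun m w -> Map.add w (1.0 + defaultArg (Map.tryFind w m) 0.0) m) Map.empty
    counts, Map.fold (fun acc _ v -> acc + v) 0.0 counts

  // entropy contribution of one word: -p * log2 p
  let entropyTerm counts total word =
    match Map.tryFind word counts with
    | Some c -> let p = c / total in -p * Math.Log(p, 2.0)
    | None -> 0.0

  // score every sentence, then keep the ones close enough to the best score
  let summarize limit doc =
    let counts, total = wordCounts doc
    let scored =
      splitSentences doc
      |> Array.map (fun s -> s, splitWords s |> Array.sumBy (entropyTerm counts total))
    let best = scored |> Array.map snd |> Array.max
    scored |> Array.filter (fun (_, score) -> score / best > limit) |> Array.map fst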


This is the approach used in the Python NLTK. That algorithm was adapted from a Java library called Classifier4J that I wrote in the early 2000s [1].

I'd never seen that technique before, but (like a lot of algorithms) it is quite obvious once you've seen it.

Edit: actually, it's slightly different from my technique, because I just used linear scoring (i.e., counting popular words). I'm not sure which technique would work best.

[1] https://groups.google.com/d/topic/nltk-dev/qV9e5TsCBHg/discu...


Hey I hadn't seen the technique above either. But I've certainly heard of your work. Unfortunately I am not able to share in the bounty that is NLTK.

Anyway, it is really hard to judge these things (statistical recommenders) since the metric is inherently subjective and there really is no right or wrong answer. But the way I like to defend it is: if you are going to just skim, you should at least use a statistically based approach. Better than just jumping about randomly.

These days I'm more interested in abstractive summarization without cheating (no templates).


Unfortunately I am not able to share in the bounty that is NLTK

Why is that?


I am much stronger in F# than Python and have already invested time in building a decent codebase in it. Also I personally think better in statically typed functional languages.

I did not write the SO post, which is also based on just word frequencies. I have found, at least in terms of picking the most relevant words with respect to the topic, that the method I wrote, which was inspired by ideas from entropy, gives what I deem to be better results. It's robust against stop words and commonly repeated words that are not part of the topic. The summaries, though, I cannot say are better or worse.


Any chance you would consider explaining it in something other than (Haskell?)


Sure:

My algorithm is something I made up, and from memory it works like this:

1) Remove HTML, stem, remove stopwords, etc.

2) Sort unique words by popularity in the text

3) Split the original text on sentence boundaries.

4) Include each sentence that first mentions the next most popular word, until the summary is the maximum length requested.
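
A rough F# sketch of those steps, just to make them concrete - my own illustrative code, not the Classifier4J or NLTK implementation, with HTML stripping, stemming, and stop-word removal left out:

  open System

  let words (s: string) =
    s.ToLower().Split([|' '; ','; ';'; ':'|], StringSplitOptions.RemoveEmptyEntries)

  let sentences (s: string) =
    s.Split([|'.'; '!'; '?'|], StringSplitOptions.RemoveEmptyEntries) |> Array.map (fun x -> x.Trim())

  let summarize maxSentences (text: string) =
    // 2) unique words, most popular first
    let popular = words text |> Array.countBy id |> Array.sortByDescending snd |> Array.map fst
    // 3) the original sentences, in document order
    let sents = sentences text
    // 4) walk the popularity list; for each word, pull in the first unpicked
    //    sentence that mentions it, until the summary hits the requested cap
    let picked =
      popular
      |> Array.fold (fun chosen word ->
           if List.length chosen >= maxSentences then chosen
           else
             let hit =
               sents |> Array.tryFind (fun s ->
                 Array.contains word (words s) && not (List.contains s chosen))
             match hit with
             | Some s -> chosen @ [s]
             | None -> chosen) []
    // present the picked sentences in their original order
    sents |> Array.filter (fun s -> List.contains s picked)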

http://news.ycombinator.com/item?id=1803020

Googling turns up http://sujitpal.blogspot.com.au/2009/02/summarization-with-l... which compares a few approaches.

Edit: Also http://stackoverflow.com/questions/2829303/given-a-document-..., which I think is by Dn_Ab who wrote the OP.


Does this approach have a name? It looks familiar, but I can't place it.


Reminds me of term frequency-inverse document frequency (http://en.wikipedia.org/wiki/Tf*idf) which can be used with cosine similarity to compute something that approximates the "semantic" relatedness of documents. Example and more algo details here: http://www.exsupera.com/sandbox/DCM/html/document.py?id=3
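
For anyone unfamiliar, here is a small sketch of the tf-idf plus cosine-similarity idea - my own illustration, not the code at the linked example:

  open System

  let terms (doc: string) =
    doc.ToLower().Split([|' '|], StringSplitOptions.RemoveEmptyEntries)

  // tf-idf weight of every term in one document, given the whole corpus
  let tfidf (corpus: string[]) (doc: string) =
    let n = float corpus.Length
    terms doc
    |> Array.countBy id
    |> Array.map (fun (t, tf) ->
         let df = corpus |> Array.filter (fun d -> terms d |> Array.contains t) |> Array.length
         t, float tf * log (n / float (max 1 df)))
    |> Map.ofArray

  // cosine similarity between two sparse term-weight vectors
  let cosine (a: Map<string, float>) (b: Map<string, float>) =
    let dot = a |> Map.fold (fun acc t w -> acc + w * defaultArg (Map.tryFind t b) 0.0) 0.0
    let norm (m: Map<string, float>) = m |> Map.fold (fun acc _ w -> acc + w * w) 0.0 |> sqrt
    dot / (norm a * norm b)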


Although it sometimes gives similar results, it's not tf-idf. It is simpler, and therefore faster, for this task. If it has a name I don't know it; it's just an idea I had: sentences with the most uncertainty associated with them, with respect to the document, should be the most interesting/relevant.


I personally don't want algorithmically summarized content; I want content manually summarized by knowledgeable HN users. It's half the reason I click into the comments 99% of the time before clicking into the linked article. I want interesting insight along with a good summary of the main points being communicated. There's just no way automatically generated summaries can compete with that.


I'd love to see the best of both worlds. I too love it when someone takes the time to summarize an article -- great community service. I'd love to establish a convention for doing so (I vote for prepending "Summary:" -- I find "tl;dr" irksome).

Then HNsummaries.com could fetch those when available instead of or in addition to the auto-summaries.


Funny, I created exactly that a few years ago, got some traction and a story on readwriteweb.com, and then let the app stagnate and die. Maybe I should pursue it again :)


How about including the top comment for each article in the email?


We are developing a service for that - stay tuned.


Just got my first newsletter. Looking good for an initial release. Some feedback:

Would love to get an index of headlines on top of the email with anchors to actual stories below.

Would love to see shorter summaries and maybe some of the top comments for each story (summarized, if possible).


Thanks for the feedback. I'll add the list of headlines, and I'm thinking about the summaries of the comments... they get harder to understand because they can be very context-specific.

I'll take a crack at it and maybe add it as an option.


Bear in mind that comments here are self-selecting for people who like HN's comments section ;-) But I know plenty of people, and speak to people on Twitter, who deliberately avoid these comments pages due to a perception (fair or not) of "drama" and whatnot. For those folks, an email like this could be just the ticket. For me though, I'm staying here ;-)


Thanks for sharing this, I'm curious to see how well it works out over time. It'd be nice to be able to choose the compression level.

Quality feels at least as good as an open source summarizer I played around with a while back; good work!


Thanks Mark for the comments and feedback! I appreciate it.


One thing I commend you for is asking when I want to receive the email. It's surprising that barely any mailing list or newsletter lets you pick that…


This.

I spend most of my time in Asia-Pacific timezones, so most of my automated emails arrive at awkward times. I'm glad that this one won't be staring at me from my inbox first thing in the morning -- helping me to produce first, consume second.


Great execution, but I'm uncertain of the idea. My personal perspective: I read Wikipedia for information; I read HN for critical insight. It's not always present, but there's a higher signal/noise ratio than on other websites. I don't want a summary of information - I want critical thought.


Why only 20 stories? I usually scan the first three pages once a day - a snapshot of the top 90 articles. Only about 10% are relevant, so I'd rather have more summaries to sift through to find the ~10 relevant articles for the day.


I actually thought algorithmic summaries would be worse than useless, but they seem surprisingly good. Here's the one from Caine's Arcade:

"9 year old Caine sets up an arcade in his father’s used car parts store in East L.A., using only cardboard boxes his dad had lying around and a ton of ingenuity. Watch his dreams come true when this filmmaker sets up a flash mob to come and play. Just watching this may make you a better person. $82,000 has already been raised for Caine’s scholarship fund! little behind on the bandwagon, but...film just had me in tears."


This is extractive summarization, so it's selecting key sentences and phrases from the text (rather than generating any new phrasing); occasionally it will seem brilliant, and other times...


In particular, this approach works best with journalism-style writing. Journalists typically write in a style with fairly short sentences that stand alone, and paragraphs of only 1-3 sentences. They even pay deliberate attention to quotability, for either pull quotes or the chance of being quoted elsewhere, so everything is well suited to pulling sentences out. It tends not to work as well when applied to other styles of writing.

For more general text, the first problem that comes up is that out-of-context sentences with pronouns that point nowhere end up being unintelligible. The second sentence above only worked because the "he" was completely unambiguous in this summary.


I've done some work in this area (specifically in developer related news) and you're right. The tricky ones are where you end up with links to GitHub repos or project pages that assume visitors know what they're looking for. Automated summaries then become less than useless :-( My dream solution? Developers learn to write nice summaries on their pages ;-)


Might be worthwhile to automatically fetch and parse text from some well-known URLs (GitHub, for example) to grab content from there to use as an adjunct.


I plan on adding this to my http://newspaper23.com site. It's just way on the back burner.

Ideally I think you would do it client-side, so readers could adjust the shrinkage to the amount of time they have to peruse. I was also thinking about a scenario where you could browse at, say, 100 words and then dive deep if you found anything that interests you - a more interactive approach. You might want to consider this.

But I really like the idea. Would love to hear how the project goes!


I got my first email, here's some feedback.

You should make sure that the summaries don't scale linearly with the size of the content - just because an article is 10x as long doesn't mean I want a summary that's 10x longer. Maybe scale logarithmically?
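
A hypothetical logarithmic scaling rule - purely illustrative, not what the site actually does - might look like this, with the summary growing much more slowly than the article:

  // e.g. 10 sentences -> 5 summary sentences, 100 -> 10, 1000 -> 14
  let targetSentences (articleSentences: int) =
    max 3 (int (ceil (2.0 * log (float articleSentences + 1.0))))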

I didn't find any of the summaries to be high quality or any better than I could get from briefly skimming HN myself.

I've unsubscribed.


I am taking an alternative approach to make sense of HN stories for Chinese readers. As a regular HN reader, I manually summarize the topic of top stories and translate them into Chinese. The motivation is to lower the startup/tech news sharing barriers. Link - http://geektell.com/


Really like it!

One small suggestion...could you make the "76 comments" under the title clickable through to the HN comments?

One other option (maybe a user preference): include some noteworthy excerpts from the HN comments in the email as well?


Feature Request:

It would be great to get a weekly or monthly summary.

Nice work.


Why email? I'd like to see the summaries on a web page too.


How about giving writers the respect they deserve and not algorithmically rewriting their work? Has our attention span really gotten so short that we cannot read articles of substance any longer?


Well, it's like having an abstract of a paper. Which is a good point -- ideally the authors themselves would provide the summary. Still, you certainly need summaries!

I'd say the only time summaries could be a bad thing is for fiction, where you don't want to give things away.

For non-fiction, giving things away is the whole point. :)


It's not algorithmic, but at Slashdot the summaries are one of its biggest wins.


It's less about attention span and more about signal/noise ratio.



