Bill Gates Flabbergasted By Gmail (fool.com)
118 points by jkuria on July 10, 2011 | hide | past | favorite | 57 comments


Bullshit. The conclusion the author draws, that Gates was "anchored in the old paradigm of storage being a commodity that must be conserved", sounds like typical hand-wavy, details-don't-matter business-person thinking. Microsoft is notorious for generating tons of internal e-mail. People go away on holidays and come back to 10K unread emails. More likely Bill knows how much e-mail he receives and roughly how much it grows per week/month. This guy tells him that he's burned through 1GB of e-mail in a few months, and that just doesn't add up for Bill. Either this reporter receives an order of magnitude more e-mail than Bill does, or people have 1MB pictures in their signature blocks, or something. So Bill starts drilling down and asking questions. Remember, Bill is a fairly 'technical' guy (http://www.joelonsoftware.com/items/2006/06/16.html)

'He began firing questions. "How many messages are there?" he demanded. "Seriously, I'm trying to understand whether it's the number of messages or the size of messages." '

I don't interpret this as Gates questioning the necessity for more than 1GB of e-mail, just trying to get to the bottom of how this guy managed to use that much in a few months.


One good reason that it tickled Gates' bullshit detector is that up until Outlook 2002, the Outlook .PST file format was limited to 2GB; Microsoft had a recovery tool (http://support.microsoft.com/kb/296088) to fix the problem. I knew someone who banged up against that limit as early as 2000, but only after saving every email they received (including many with huge attachments) for more than four years.


But that's one thing that was distinctive about Gmail: it encourages you to keep (archive) everything, and it searches it very quickly. Compare this to Outlook: searching is painfully slow, and the instinctive way to use it is to delete emails rather than keep them.


Compare this to Outlook: searching is painfully slow

Outlook 2003, maybe, but that was eight years ago. Microsoft bought Lookout, and Outlook's indexed search is now pretty much instant.


That's BS; I use Outlook 2010, and IMO searching is still painfully slow. I suspect I have less than 50MB of mail excluding attachments, so it should be nearly instant even with a non-indexed full-text search.


There's definitely something wrong with your setup. Windows XP will require an extra download, for example.


I just started using Gmail's IMAP support to access my email accounts with Thunderbird. It's quite impressive what the Thunderbird team has done with it since I got disheartened with it a few years ago. One thing I was pleasantly surprised by was how good the search feature is.


I run up against this each year. Had an initial corruption and then started new PSTs each year and separated out some of my larger clients. It's quite heart-attack inducing to corrupt a massive PST when you base a lot of work and billing on the unread status of emails!


Agreed. I definitely have more than 1GB of email, but asking why one would need/use that amount of storage is a perfectly legit question.


Because, according to the old paradigm, part of using email systems was the chore of deleting old messages. Much like it's perfectly natural to a lot of people to find a company's website, download a setup program and run it instead of running a package manager.


Maybe it's just me, but I still have to delete messages. Otherwise a bunch of crap I no longer care about gets returned in GMail's search, completely defeating the purpose of having a search-driven interface.


I find I'm unable to determine whether a message is crap I'll never care about again or not.

Often I can't think of a reason I'd possibly ever use a message again, and lo and behold there's some strange reason I want to refer back to it 2 years later (e.g., figuring out my start date for an old job on a new application ... weird little cases like that).


I guess just about everything Google classifies as "bulk" I'll never need again. The Groupon deal that expired yesterday isn't something I'm going to need to access 2 years from now. Maybe those monit alerts I will, but in all likelihood not. Things like that. But they usually end up having just enough terms to throw off GMail's already terrible search, making it hard to find content I really care about.


> Otherwise a bunch of crap I no longer care about gets returned in GMail's search

It may be a language issue or a query-building issue, but I feel no difficulty searching through my gmail mailboxes. I often adopt an iterative approach where I build up my query by watching what gets returned.


I found using the is:read modifier in my searches is a great way of weeding out the crap, since I usually don't even bother to open that stuff. YMMV!


I bet Gates was also thinking about how much it was costing Google to store all that. Multiply that by the number of Hotmail users (2 GB * a few hundred million).

That was the clever thing about Gmail's original invite system. Google could offer an amount of storage that Microsoft could never match, and the invite system ensured they could control the initial volume of users.


There's a certain irony in this thread sharing the front page with “We ran out of disk space” (http://news.ycombinator.com/item?id=2747152)


> "anchored in the old paradigm of storage being a commodity that must be conserved"

The author seems to think that Google's innovation was to simply ignore the storage problem and then it went away.

What they did was work immensely hard at the storage problem, until their innovations in data compression and retrieval made it possible to store larger amounts of data. http://highscalability.com/google-architecture


"anchored in the old paradigm of storage being a commodity that must be conserved"

Yeah, such an old paradigm... Just try managing backups for an ever-increasing mail store...

Yes, today storage is cheap (even middle-tier storage arrays, where the cost per GB is about 20x that of consumer-level disks, can be considered cheap), but keeping large volumes of data safe from disaster is not "cheap". Just ask Google how much time it took them to restore those Gmail mailboxes from tape after they got lost in production a few months back. And that's Google we're talking about...


Bill Gates' question made some sense. If mails are 3k each and you have 1 GB of space, how many mails do you need to receive per day to run out of space in 6 months?

http://www.wolframalpha.com/input/?i=%281+gigabyte+%2F+3kilo...

The answer is 1826 mails per day. I assume what happened here is that they didn't have good stats on mail sizes in the conversation; Bill tried to estimate something like this and, as a result, thought the claim was ridiculous.
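The arithmetic behind that WolframAlpha link is easy to redo by hand; a quick sketch, assuming decimal units and six months of 365/2 days:

```python
# How many 3 kB e-mails per day does it take to fill 1 GB in 6 months?
storage_bytes = 1_000_000_000   # 1 GB, decimal units
msg_bytes = 3_000               # 3 kB per message
days = 365 / 2                  # six months

total_messages = storage_bytes / msg_bytes
per_day = total_messages / days
print(round(per_day))           # 1826 messages per day
```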


Maybe they email a lot with the kind of people who embed their fancy company logo as a 50K jpg in their mails; then you're down to 100 mails per day.


3k is a bit small...

Consider: those are newspaper people. They send HTML emails, PDFs, scans, photos, all that stuff. Wouldn't surprise me if they collected a few megs a day. Busy people.


Version control over e-mail would hit that 7MB/day limit easily, when dealing with image formats. Makes sense.


> 3k is a bit small...

And Gates explicitly asked how big each email was, on average.


Just some figures that might be of interest.

Current storage offered in a gmail account: 7,602MB

Current maximum size of an email for gmail: 35MB

So you could technically fill a gmail account with 217 emails.

However, looking at my mail backups, I've received approximately 17,000 emails so far this year. The average size of an email was 6.6KB. Altogether, they're using up just over 100MB of disk space.

My live mail has only 15 of those 17,000 emails remaining in it, using up 300KB of disk.
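For anyone curious how figures like these fall out of a backup, here's a minimal sketch using Python's stdlib mailbox module (the path in the commented example is hypothetical):

```python
import mailbox

def mbox_stats(path):
    """Count messages in an mbox file and measure total/average size."""
    box = mailbox.mbox(path)
    sizes = [len(msg.as_bytes()) for msg in box]
    count = len(sizes)
    total = sum(sizes)
    avg = total / count if count else 0
    return count, total, avg

# count, total, avg = mbox_stats("backups/2011.mbox")  # hypothetical path
# print(f"{count} messages, {total / 1e6:.1f} MB, avg {avg / 1e3:.1f} kB")
```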


It seems like maximum size for an email is 25 MB.

"With Gmail, you can send and receive messages up to 25 megabytes (MB) in size."

http://mail.google.com/support/bin/answer.py?answer=8770


I got the 35MB by connecting to their MX and then looking at the value returned by the SIZE ESMTP extension:

  mike@alfa:~$ telnet gmail-smtp-in.l.google.com 25
  Trying 209.85.143.27...
  Connected to gmail-smtp-in.l.google.com.
  Escape character is '^]'.
  220 mx.google.com ESMTP z4si16632304weq.140
  EHLO mail.cardwellit.com
  250-mx.google.com at your service, [178.79.145.246]
  250-SIZE 35882577
  250-8BITMIME
  250-STARTTLS
  250 ENHANCEDSTATUSCODES
Their MX states that it accepts messages up to 35882577 bytes in size, which is just under 35MB.
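Pulling that number out programmatically just means reading the SIZE line of the EHLO response; a small parsing sketch over the transcript above (Python's smtplib exposes the same value via esmtp_features after a live connection, which I'm not making here):

```python
# EHLO response lines as captured in the telnet session above
ehlo_response = """250-mx.google.com at your service, [178.79.145.246]
250-SIZE 35882577
250-8BITMIME
250-STARTTLS
250 ENHANCEDSTATUSCODES"""

def max_message_size(response):
    """Extract the SIZE ESMTP limit (in bytes) from an EHLO response."""
    for line in response.splitlines():
        parts = line[4:].split()      # drop the "250-" / "250 " prefix
        if parts and parts[0].upper() == "SIZE":
            return int(parts[1])
    return None

limit = max_message_size(ehlo_response)
print(limit, f"= {limit / 2**20:.1f} MiB")  # 35882577 = 34.2 MiB
```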


That's a great bit of technical expectation management.

"We'll accept up to 25MB!" And then when you send something 26 or 27MB because you don't notice that it's so close to the limit, the system forgives you.


More likely it is an accommodation of base64[1] (MIME) encoding of binary files, which results in a ~33% expansion (3 bytes get encoded as 4): 25MB * (4/3) ≈ 33MB. People who read that GMail is limited to 25MB will expect their 25MB photo to be accepted. This requires the absolute size limit to be set to 33MB (probably bumped up to 35MB to accommodate the HTML body and other spurious stuff like the 25 off-topic replies).

[1] http://en.wikipedia.org/wiki/Base64
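The 4/3 expansion is easy to verify; a quick demonstration, nothing Gmail-specific:

```python
import base64

raw = bytes(3_000_000)       # 3 MB of binary zeros
encoded = base64.b64encode(raw)
print(len(encoded))          # 4000000: exactly 4/3 of the input size
print(25 * 4 / 3)            # a 25 MB attachment needs ~33.3 MB on the wire
```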


Gates was right in "What've you got in there? Movies? PowerPoint presentations?" A variant of the 80/20 rule probably applies to email: 80% of the size of your inbox is caused by the 20% largest emails.


True, although it would be hard not to have such a rule:

where something is shared among a sufficiently large set of participants, there must be a number k between 50 and 100 such that "k% is taken by (100 − k)% of the participants"

http://en.wikipedia.org/wiki/Pareto_principle


Yeah, that is pretty obvious. No one would bother saying that 55% of the space is taken by 45% of the email, though. It is mentioned because of the high difference between the two values.


I think that would be worth pointing out, because it's such a huge deviation from the quasi-Zipfian behavior you'd expect.


There's no reason those numbers need to sum to 100. E.g. 90% of X could be controlled by 30% of the people using X.


If it's a continuous function, they need to sum at some point.

A lot of real world things are close enough to continuous for all practical purposes, though I'll concede that we aren't actually sending and receiving email infinitesimals.


My point is that you have two distinct variables or percentages. It's unfortunate that it gets called the "80/20 Rule" because people see that and think, Oh, the two numbers have to add to 100.

It's not a rule. It's the idea that things are not always as balanced/equal/proportioned as one might hope or expect.

For example, there's the claim that, in the USA, 25 percent of households own 87 percent of all U.S. wealth. It's an example of the Pareto Principle in the form of 87/25. 100 only plays a role insofar as neither value can exceed it (since the numbers refer to percentages).


There are an infinite number of such pairs of numbers. 87, 25 may be one of them; perhaps 83, 22 is another, and 88, 28 is another, all describing the same distribution. But of all of those pairs of numbers, there is exactly one pair that sums to 100. If you want to compare the inequality of two distributions of wealth (or whatever), that pair of numbers is a reasonable candidate metric: an 80, 20 distribution is more equal than a 90, 10 distribution, and more unequal than a 70, 30 distribution.

The Gini coefficient is another scalar that can be used to rank distributions by inequality.


There are an infinite number of such pairs of numbers. 87, 25 may be one of them; perhaps 83, 22 is another, and 88, 28 is another, all describing the same distribution.

But they don't, and that's a key point. The numbers are, in fact (in this example) 25% and 87%. Different numbers would describe a different situation. They are not percentages of the same thing; they are percentages of two different things.


You are mistaken in your assertion that those sets of numbers necessarily describe different distributions or situations.

I will explain this more carefully so that you can understand what the key points actually are. Please take the time to read and understand the explanation below.

You are correct that they are percentages of two different things.

However, if 87% of the wealth, whatever that is, belongs to the richest 25% of the population, then it's entirely possible for 88% of the wealth to simultaneously belong to the richest 28% of the population. That would just mean that 1% of the wealth belongs to the 3% of the population between the 72nd and 75th percentile, which is an entirely plausible state of affairs.

Consider, for any number X from 0 to 100, you can find a number Y that makes the statement "The richest X% of the population owns Y% of the wealth" true, without changing the distribution. Y is continuous and increases monotonically with X; and when X=0, Y=0; and when X=100, Y=100. Under those conditions, there is guaranteed to be exactly one point in [0, 100] where X = 100-Y.
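That unique crossing is easy to find numerically; a sketch with made-up holdings data (the function scans the sorted distribution for the first point where X >= 100 − Y):

```python
def pareto_point(holdings):
    """Find X where the richest X% own roughly (100 - X)% of the total."""
    w = sorted(holdings, reverse=True)
    total = sum(w)
    cum = 0.0
    for i, v in enumerate(w, 1):
        cum += v
        x = 100 * i / len(w)    # richest x% of holders
        y = 100 * cum / total   # share of the total they own
        if x >= 100 - y:        # first crossing of x = 100 - y
            return x, y

# Made-up example: 10 rich holders with 100 each, 90 with 10 each
x, y = pareto_point([100] * 10 + [10] * 90)
print(f"richest {x:.0f}% own {y:.0f}%")  # richest 35% own 66%
```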

If you want to compare two different distributions, it's helpful to oversimplify them to scalars, since otherwise you have vectors in an infinite-dimensional Hilbert space, which are tricky to compare. If you know that in the US, the richest 25% of the population controls 87% of the wealth, while in Argentina, the richest 10% of the population controls 70% of the wealth (it doesn't), you don't know which country is more unequal. It could be that the richest 10% of the population in the US controls 87% of the wealth, or 34.8% of the wealth, or anything in between. Furthermore, it could simultaneously be the case that, in the US, the richest 10% of the population controls 80% of the wealth (making the US seem more unequal), while in Argentina, the richest 25% of the population controls 90% of the wealth (making Argentina seem more unequal).

There are lots of possible choices of scalar. The smallest percentage of the population that controls 50% of the wealth is one reasonable candidate. The percentage of the wealth controlled by the richest 50% of the population is another. The Gini coefficient is a third. And that unique point of intersection where the richest X% of the population controls (100-X)% of the wealth is a fourth.

Does that clarify matters?


Thanks very much for taking the time and trouble to write this. Hopefully I'll learn something, and perhaps also make clear my own points.

However, if 87% of the wealth, whatever that is, belongs to the richest 25% of the population, then it's entirely possible for 88% of the wealth to simultaneously belong to the richest 28% of the population.

Sure. And you can look through the data to find assorted pairs like that, if you have all the data. If you don't then you're guessing.

Consider, for any number X from 0 to 100, you can find a number Y that makes the statement "The richest X% of the population owns Y% of the wealth" true, without changing the distribution. Y is continuous and increases monotonically with X; and when X=0, Y=0; and when X=100, Y=100. Under those conditions, there is guaranteed to be exactly one point in [0, 100] where X = 100-Y.

Sure, and I understand the value of normalizing data in order to compare like things. What I do not see is that all expressions of the Pareto Principle must be given in some normalized form that assumes complete knowledge of the distribution. That knowledge may not be available. That doesn't mean one cannot observe and convey an instance of the Pareto Principle.

Basically, my points are that a) examples of the Pareto Principle do not have to sum to 100. For example, if I have a team of five hackers and one writes 90% of the code, then 20% of my team does 90% of the work. It's a 90/20 thing, and that's a valid example of the Pareto Principle as given.

Could this be adjusted to some X/Y such that X+Y == 100? Here's where I may be missing your point: if all I know is that one person, hacker #1, is writing 90% of the code, how can I know the actual distributions from 20% to 40% to 60%, etc.? Suppose hacker #2 is writing the other 10% of the code; then I have a 100/40 situation. If hackers 2 and 3 are each writing 5% of the code, then that's 95/40, 100/60 (and 100/80 as well).

This is why I say shifting 90/20 to some X+Y==100 formulation is expressing something different (albeit possibly a true one). So, point b) is that some sets of numbers are both more accurate and more germane to making a particular point. E.g. 90/20 is more striking, and accurate, than perhaps 74/26, which may reflect a truth about the distribution but fails to convey anything salient (though it may be handy for comparison with some other data).

In other words, there's an important difference between making an arbitrarily true statement about a distribution, and pointing out a uniquely interesting aspect of that distribution.

My apologies if I'm still being dense, or missing your point entirely, and I do appreciate your explanation. I get the sense we've been talking past each other.


Agreed.


You could probably project that to 95/5 in the case of emails, where the expected case is minuscule and the worst case is huge. If you manage to go through a gig of inbox space in a couple of months, there have to be some massive files in there accounting for the bulk of your storage.


I'm inclined to agree with this fraction. I just used POP3 to LIST the first 360 messages in my mailbox, and the graph shows a distinct uptick around 95%.

Around 95% of my 360 emails consume only 6.5% of the space those 360 messages take.
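For anyone who wants to reproduce this, a small sketch: feed it the per-message sizes from a POP3 LIST (the sizes below are synthetic, not from my mailbox):

```python
def top_share(sizes, top_frac=0.05):
    """Fraction of total space consumed by the largest top_frac of messages."""
    s = sorted(sizes, reverse=True)
    k = max(1, int(len(s) * top_frac))
    return sum(s[:k]) / sum(s)

# Synthetic mailbox: 5 messages with big attachments, 95 plain-text ones
sizes = [1_000_000] * 5 + [3_000] * 95
print(f"top 5% of messages take {top_share(sizes):.0%} of the space")
```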


For what computers cost in 1980 compared to their utility, I'm not sure why an average household would get one.


An astute observation.

I believe it took spreadsheets for 'personal' computers to really start taking off in the smaller business world; it wasn't until - IIRC - Windows 95 that computers took off in households. I remember that you were something special prior to '95 or so if you owned a computer.


I think the number would seem like a plus to most, even though they will never get anywhere near needing it. Just knowing that you're never going to hit a space barrier while archiving all your emails is a big thing.


Yeah, well, what I'm interested in is: what are we believing right now that will sound just as ridiculous five years from now?


If this was 2004, how did the author have 2GB of storage on Gmail and use "more than half", when it launched with 1 GB and was only upgraded to 2GB in 2005? ( http://en.wikipedia.org/wiki/History_of_Gmail )


1GB over a few months? In text? That's ~500,000 pages of text. I doubt the reporters actually received that much email. There's probably something else draining space that they didn't realize (hence Gates's surprise).
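The pages figure is simple arithmetic, assuming roughly 2 kB of plain text per printed page:

```python
storage = 1_000_000_000           # 1 GB, decimal
bytes_per_page = 2_000            # ~2 kB of plain text per page (assumption)
pages = storage // bytes_per_page
print(pages)                      # 500000 pages
```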


It's a shame that we give attention whoring journalists like this guy any attention at all. More sensational bs. If I wanted sensationalist, I'd be on reddit's /r/politics.


I'm on a mobile phone right now, and I have to say, that was the most poorly laid-out and convoluted site I've seen on a phone...


Only a few months after starting this, both of us had consumed more than half of Gmail's 2-gigabyte free storage space.

Okay, I'm extremely suspicious of anyone who claims to process one gigabyte of email per month. Barring large attachments, that is a tremendous amount of sheer data. If you're getting that much email, you probably need culling tools more than you need space.

Regarding the two gigabyte limit being shocking, it was at the time. The only reason Google could pull it off is because 1) very, very few users would get anywhere near that and 2) they were already scaling data at a ridiculous rate. A traditionally desktop-oriented software company like Microsoft would not be able to offer anything similar without tremendous investment of hardware and development.


The saddest thing about this, for me, is that I work for a company where everyone (including myself) gets over their mailbox limit in a couple days...7 years.../cry


Adding "2MB email ought to be enough for anybody" to the list of things Bill Gates might have said.


But somewhere in the design of Outlook was an implicit 2GB limit: the max size of a PST file. Maybe it was an OS limit, but still, the decision to store all email in a single binary blob wasn't sufficiently forward-thinking.


I suspect that means a .PST is mmap()'d (or the Windows equivalent); 32-bit Windows gives each process a 2GB address space by default.


If that were the problem, the limit would be substantially less than 2G because of the address space needed for the DLLs, the stack, other allocated memory, and so on; and, of course, the space consumed by address space fragmentation.



