NoSQL is somewhere on that initial "peak of inflated expectations". (I don't think it's hit the top yet, but it sure is soaring.) Rails passed the peak many months ago and is somewhere to the right of the Trough of Disillusionment. [1] (I don't think it's plateaued yet: Rails still isn't quite done being invented.)
Once the hot air leaks out in a year or two, NoSQL databases will still exist, and will in fact be better understood and better built than ever, but they will no longer be a trending buzzword. That's the day that the author is devoutly wishing for.
(I, personally, find the hype cycle to be kind of fun to watch, and educational too, so I'm not as bothered by it as he is.)
---
[1] Though your mileage may vary. The world isn't perfectly connected, so there isn't just one hype cycle. It's fun watching (e.g.) Facebook sweep through the world of my parents. They get really excited by "new" technologies about two to four years after the folks on HN have moved on from them.
The Rails message is "you don't need a DBA, you don't need to know SQL, just have your developers do the default install of MySQL and we'll do the rest". That is also the NoSQL message.
Speaking as a member of the rails core team, if you heard someone say that then he made it up.
Rails until 3.0 was heavily invested in SQL simply because it's what everyone used. However, it was swimming against the stream in that Rails declared DBs to be something which is incrementally developed by the software through migrations instead of set up by DBAs per change tickets. This has been enormously successful: there is probably not a single web framework left that pretends we still live in a DBA-dominated world. Part of the shrapnel of this decision is that Rails did away with triggers, DB constraints and stored procedures, but this is simply because most high volume sites don't use these things anyways because they get very hard to scale later on.
Rails past 3.0 will work natively with any data store that you can imagine. It ships something called ActiveModel, which is a tiny interface that you can implement on top of Mongo, Cassandra, Redis. ActiveRecord is just the SQL incarnation of this interface.
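For anyone who hasn't seen the "schema developed incrementally by the software" idea, here is a minimal sketch of it in Python with SQLite. This is not Rails' migration API; the `MIGRATIONS` list, the `posts` table and the use of SQLite's `user_version` pragma as the version counter are all assumptions made for illustration.

```python
import sqlite3

# Ordered list of schema changes; each entry is applied exactly once.
# (Hypothetical example schema, not taken from any real app.)
MIGRATIONS = [
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)",
    "ALTER TABLE posts ADD COLUMN body TEXT",
]

def migrate(conn):
    """Apply every migration the database has not seen yet; return the new version."""
    version = conn.execute("PRAGMA user_version").fetchone()[0]
    for number, statement in enumerate(MIGRATIONS[version:], start=version + 1):
        conn.execute(statement)
        # Record progress so a rerun skips what was already applied.
        conn.execute(f"PRAGMA user_version = {number}")
    conn.commit()
    return conn.execute("PRAGMA user_version").fetchone()[0]

conn = sqlite3.connect(":memory:")
migrate(conn)   # brings a fresh database up to the latest schema
migrate(conn)   # running it again is a no-op
```

The point is only that the schema history lives next to the code and is replayed by the application, rather than being applied by hand per change ticket.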
Very high quality libraries based on ActiveModel already exist. Have a look at Cassandra Object as an example.
> this is simply because most high volume sites don't use these things anyways because they get very hard to scale later on.
If you start with the assumption that "you don't need a DBA" then that's probably true.
Just to give you an idea of my background, I work on a system using a commercial RDBMS that "scales" to thousands of commits/sec and tens of terabytes of data. We expect to take it to tens of thousands of commits (we already do that many reads!) and hundreds of teras with no major structural changes. One thing you learn in this game is that database agnosticism is a wild goose chase. To really scale, you need to intelligently choose a technology and use its features to the fullest and just accept that you will be "locked in". We couldn't port to another RDBMS if we tried because certain things, like our chosen database's locking strategy for example, are baked into the way we do things. It's not a matter of SQL syntax. We'd be starting again from scratch; we'd need new algorithms. But we can do things, and take things for granted, that most of the Internet peanut gallery takes to be impossible, because they start from the assumption that abstracting the database actually helps anything.
That's what I got from it too. The framework being this flexible should help a DBA fully use the database to his advantage. The project I work on has had to do some non-standard things with Rails, and its flexibility has made it much easier. In our view Rails and its defaults are a starting point providing the basic infrastructure to get the project going, not an end-all be-all solution. I see the same thing in NoSQL solutions: they try to give the developers and DBA as much flexibility as possible. They don't really seem to offer any defaults. RDBMS is basically the sensible defaults without the flexibility. Maybe the in-between can be found.
There are a rash of these articles popping up. While there are usually valid points in urging people to avoid the hype of the "next big thing", a lot of these guys seem to be bitching that they might have to learn a new skill set.
I have been in this business long enough to remember Clipper and FoxPro devs bitching about SQL when it was on the rise. This sounds about the same to me.
These articles are just link-bait with no content. The 'NoSQL' buzzword is as distasteful to me as, say, 'cloud', but that doesn't mean that articles discussing them don't have value.
The flurry of 'NoSQL' articles often cover the different approaches to data stores, their implementation, their interfaces, their management, performance, scalability, etc. That interests me, but doesn't mean I'm going to go to work on Monday and kill all the non-NoSQL dbs we have running.
Ted's point may be valid for BigTable-like databases. (I'm not saying it is, but I don't know enough about those to say so.) Those are designed for scalability and if you don't need the scalability you probably should use a RDBMS instead.
But there are other advantages of SQL-less databases that don't deal with scale. I deployed my first MongoDB app a couple weeks ago. Even though it was a small (~1 developer month) project, and neither myself nor the other developer had used MongoDB before, I still think we finished faster than if we'd used MySQL. Just like Cassandra is a premature optimization if you just need an RDBMS, an RDBMS is a premature optimization if you only need an object store.
I dunno... I find the combination of the Django ORM and South (in particular the --auto flag for auto-creating migrations) is incredibly productive. With the ORM, I can conjure up a query that answers pretty much any question I might have of my data. I've experimented with MongoDB (and a bunch with Redis) and I find I'm much more likely to end up with a query that I can't resolve without having to do a bunch of extra work.
Most of this is probably having expertise with the tools, but I find that for rapid prototyping the ability to run relational queries is really important, especially since query performance during the prototyping phase isn't really an issue.
I've found that relational ORMs force me to write convoluted and weird code for all but the simplest of joins. Maybe django's ORM is better; I've mostly used SQLAlchemy and a few others.
Switching back and forth between Mongo and SQL is not without technical cost. The syntax and semantics are quite different and translating complex queries to map/reduce is also a problem.
I agree that developing with MongoDB is really really nice. We'll see how it holds up under more load, but for prototyping data models I have not used a better stack than mongodb + mongomapper
It doesn't force you to fit non-tabular (eg. hierarchical) data into a table structure.
Also, no schema means the data structure is more malleable. With the right ORM, this fits in nicely with polymorphism: I can store objects with some common features in the same collection, but when I retrieve them from the database I get different types of objects which inherit from the same base object. mongoengine is one ORM that does this.
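A rough sketch of that polymorphism trick in plain Python, with a list of dicts standing in for the collection. This is not mongoengine's API; the `Content` base class, the `_cls` type tag and the in-memory `collection` are assumptions for illustration only.

```python
# Hypothetical stand-in for a schemaless collection: a plain list of dicts.
collection = []

class Content:
    """Base type; subclasses register themselves by class name."""
    registry = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        Content.registry[cls.__name__] = cls

    def __init__(self, **fields):
        # No schema: whatever fields are passed in become attributes.
        self.__dict__.update(fields)

    def save(self):
        # Store the fields plus a type tag so the right class can be rebuilt later.
        collection.append({"_cls": type(self).__name__, **self.__dict__})

    @staticmethod
    def load_all():
        objects = []
        for doc in collection:
            fields = {k: v for k, v in doc.items() if k != "_cls"}
            objects.append(Content.registry[doc["_cls"]](**fields))
        return objects

class BlogPost(Content):
    pass

class Photo(Content):
    pass

BlogPost(title="Hello", body="First post").save()
Photo(title="Sunset", url="sunset.jpg").save()

loaded = Content.load_all()  # one collection, two different object types back
```

Both documents share `title` but otherwise have different fields, and each comes back as its own subclass of the common base.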
The database is still aware of the fields, so MongoDB can build indices on certain fields if you wish. Admittedly I haven't deployed Mongo in an environment that really tested its performance, but we've been serving about 20k pageviews per day with no issues. Granted, this was a fairly basic application.
As for changing requirements, mongo handled those well too.
It's certainly not a silver bullet, but when I just need a basic object store the query performance trade-off is worth it.
"Did you know that Cassandra requires a restart when you change the column family definition? Yeah, the MySQL developers actually had to think out how ALTER TABLE works, but according to Cassandra, that's a hard problem that has very little business value. Right."
Really, did the MySQL people think about it? Because it takes ages to do an ALTER. Even when you are doing something like dropping an index it can lock up for hours, during which no one can do any inserts. In contrast restarting a service is no big deal.
> In contrast restarting a service is no big deal.
Yikes, restarting a service that's so essential to everything else in web infrastructure is certainly a big deal. Where I work (large 30+ million users/month site), we have batches that do all sorts of processing and DB crashes (basically equivalent to a restart) can be a major pain because it can be difficult to figure out exactly what failed and when and how to best recover. You might say, "oh just re-schedule all the batches to allow downtime", but once you have 30 developers with a hundred or so batches, that can be damn near impossible to orchestrate.
A simple stateless web-app can probably tolerate a DB restart, but Cassandra was built to scale - not to host a to-do app.
ALTER takes ages to do because of the ACID constraints of MySQL. If you want, you can sacrifice the ACID constraints by just cloning the table with the proper modifications and then dropping the table, but I'm venturing into DBA-land for which I am in no way qualified to profess knowledge.
> ALTER takes ages to do because of the ACID constraints of MySQL.
No, I don't think it has much to do with ACID. Rather, they made a simple single implementation of ALTER TABLE that copies the whole table out on any change whatsoever. Add a column? Recreate the table. Add a table comment? Recreate the table. Drop an index? Recreate the table.
They could have identified cases where in principle the table & metadata could be modified in-place, but that would be a lot harder than a simple copy. It would probably necessitate changes to the legacy architecture, which in turn would require a host of other changes.
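To make the "recreate the table on any change" strategy concrete, here is a minimal sketch using Python's sqlite3 module. It only illustrates the copy approach described above; it is not how MySQL is actually implemented, and the `add_column_by_copy` helper and `users` table are hypothetical. Column types are dropped for brevity.

```python
import sqlite3

def add_column_by_copy(conn, table, new_column, old_columns):
    """Add a column the slow way: build a new table and copy every row into it."""
    cols = ", ".join(old_columns)
    conn.execute(f"CREATE TABLE {table}_new ({cols}, {new_column})")
    # This full copy is the expensive step: its cost grows with table size,
    # no matter how trivial the schema change is.
    conn.execute(f"INSERT INTO {table}_new ({cols}) SELECT {cols} FROM {table}")
    conn.execute(f"DROP TABLE {table}")
    conn.execute(f"ALTER TABLE {table}_new RENAME TO {table}")
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id, name)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob")])
add_column_by_copy(conn, "users", "email", ["id", "name"])
rows = conn.execute("SELECT id, name, email FROM users ORDER BY id").fetchall()
```

An in-place alternative would only touch metadata for changes like this, which is the harder implementation the comment is talking about.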
No kidding, came here to say the same thing. ALTER TABLE operations in MySQL essentially copy the entire table, even if you're changing something trivial like a table comment. It's a huge pain in the ass. Restarting the service is a freakin' cakewalk in comparison.
It occurs to me that MySQL started off as a thin SQL wrapper on a NoSQL database: here, have a SELECT and WHERE, but you'd best not JOIN, and forget about transactions or referential integrity.
Then, over time, they tacked on a few more relational features, but they had yet to solve the hard problems of relational databases.
Meanwhile, the people who were originally drawn to MySQL as a dumb-and-quick datastore got frustrated with this line of development and christened the NoSQL movement. It's not so much a departure from relational databases (they were never really there), but a return to MySQL basics, w/out the SQL.
I can't believe some of what is said by people on both sides of the NoSQL arguments. Discounting use of RDF data stores, almost all of my recent work involves PostgreSQL and MongoDB. I think that it is blatantly obvious which to use in specific circumstances. I have not had to do this yet, but using Datamapper.setup, you can integrate the use of both in the same application by storing some model data in a relational database and some in MongoDB, as it makes sense to do so.
As is to be expected from this author, this is definitely on the flame-bait side of things. I submit it because I believe there is an important point here: for the vast majority of startups, going with a relatively unproven "NoSQL" database is a premature optimization and an unneeded technical risk. I disagree with the author that these databases are a flash in the pan, but their over-application is.
I dunno, I see the opposite: RDBMS's are a premature optimization. In my experience, it's /much/ easier to hack together a quick webapp in MongoDB, because you don't have to worry about relations, migrating schema, etc. Sure, it might be slower than Postgres on a billion-row table, but wait until you have a million rows before you shackle yourself to the relational constraints.
He has it backwards. You use NoSQL to 'get shit done'. When you have a billion rows, then worry about schemas. By that point you will have a much better idea, a - what said schema should look like, b - what the architecture of the Postgres, or mysql, or Oracle should look like, and c - how much money you will have to solve the problem.
I once worked in a Notes shop. Notes has no schema for documents and nothing to enforce migrating data from older documents to the current format. After a few years of customers manipulating data with various versions of the code, they had documents in such bizarre combinations of states that it was no longer possible for anyone on our dev team to inspect them and say which behavior would be right for the workflow.
Schemaless data should only be a summary of data properly maintained elsewhere, which you can regenerate at need. If your authoritative data has no schema, it will decay to garbage.
That's because you waited years to address the problem. Not only that, you also rewrote the code, as I advise. But you did not take the opportunity to address the structural data issues you were having, contrary to my advice.
My strategy is to rewrite the code, if needed, but with an eye towards addressing structural data issues. After a few months use of a web app you have a good idea of any surprising usage patterns that may appear. Readjust at that point when you are 'talking with data'.
This advice is for small startups of the HN variety, where 'customers' are a lot more important than 'authoritative' data stores initially. NoSQL systems are useful tools for mitigating the danger of doing too much engineering upfront. Many tech entrepreneurs fall victim to doing too much upfront engineering in the hopes of their data store not 'decaying to garbage', only to find that no one wants to use their product. NoSQL makes it easy to go back and migrate off the data you want to store 'on the move', once you have a better idea of how much of it there is and how it is used.
If you are not doing data migration, after n revisions to the data management code, each record can be in any of 2^n states, depending on which code revisions did or did not modify it. How many revisions can you make before your code can no longer handle some of your older data? I'd say days' worth, not months, because you're trying to iterate a lot faster than we did. And the odds of a complete rewrite understanding all your old data are even worse.
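The 2^n figure is just counting: each of the n revisions either touched a given record or didn't. A small sketch to make the growth visible (the function name is made up for illustration):

```python
from itertools import product

def possible_states(n_revisions):
    """Enumerate every touched/untouched combination across n code revisions."""
    return list(product((False, True), repeat=n_revisions))

# Three revisions already give eight distinct histories a record could have.
states_after_three = possible_states(3)

for n in (1, 3, 10, 20):
    print(n, "revisions ->", 2 ** n, "possible record states")
```

This is an upper bound; in practice many combinations never occur, but the code has no cheap way of knowing which ones do.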
If you are doing data migration, you necessarily have an old and new schema in your head. At that point you're just refusing to write it down and let the tools tell you whether the code agrees with you.
I don't really see the point of starting with NoSQL and then going for Oracle. NoSQL DBs are real tools, you know, not toys you throw away after your app gets traction. It's actually the contrary that has happened so far.
Maybe I should have been clear. You only go to SQL RDBMS if you need it. Which, in the vast majority of cases you will not. Further, you only go to SQL where you need it.
For instance, right now someone is developing a street car racing game for Facebook. The XBox kind, not the FarmVille kind. At any rate, one of the features is, obviously, playback. Now keeping all of those physics updates in an SQL database is pointless. And figuring out a schema for that data would have only gotten in the way of them getting that out the door. Throw the physics messages in a queue and write them to Cassandra. If you have even 10000 MAUs, you will easily generate billions of rows. It's just not data that really needs to go into MySQL.
I don't think Cassandra is a toy. I think you should 'get it out the door' with everything in Cassandra, and then slowly, move the business stuff off. User names, what cars they bought for instance. Stuff that is not read often. But at first, get it out the door. Don't stop to figure out a perfectly normalized schema with balanced indexes.
Isn't Cassandra slightly more complicated than MySQL? Sure adding another column is trivial, but the access patterns & indexing need to be determined first.
NoSQL might be hype. Let's get specific. Cassandra eliminates the SQL database single point of failure and hard-to-replace masters via a loose sync, "eventually consistent" protocol.
Is there some startup offering a web service that doesn't need that?
And have you ever tried to deploy an SQL database capable of thousands of miles apart syncing?
Eventually consistent is quite a different model than ACID. If you accept that, and accept that you can't rely on networks to always be up, you'll live comfortably and cost effectively.
No one has ever explained this to me: why are we partitioning this space? Why can't a single database management system:
* have individual tables, indices and views that are either relational or document-oriented, or graph- or object-based while we're at it, on a case-by-case basis,
* manage them all in a single, well-known distributed pool,
* and present a unified API to access all of them (e.g. a Structured Query Language of some sort)
* that allows tables of disjoint types to be joined in queries, with appropriate warnings when it creates non-optimized query plans?
In other words, why can't I say that my reports table should use the "relation" backend, while my messages table should use the "document" backend, and be done with it?
It's as if, when you went to a car dealership, they asked you whether you wanted to see the "cars with cigarette lighters" or "cars with automatic windows" section. Why can't my car do both?
> have individual tables, indices and views that are either relational or document-oriented, or graph- or object-based while we're at it, on a case-by-case basis
This is already the case. Nowadays, almost all relational databases (except, of course, MySQL) support XML columns. PostgreSQL supports them rudimentarily, and DB2 and MSSQL even have special storage strategies and index structures for XML, i.e. for generic tree structures, data-oriented as well as document-oriented ones.
Also, abstract data types ("encapsulation", the base of OO) are implemented in these databases, too (except, of course, in MySQL), as well as other OO features such as table inheritance and some kinds of polymorphism.
I'd love to see a comparison between using these XML engines for queries and NoSQL, then. I'm betting they'd be competitive at least to the point that, if you already had one of the supporting DBMSes set up, there would be little point in training your DBA on NoSQL as well.
I thought most of this already exists, just not on the free databases.
Having XML or JSON columns that can have their interior fields indexed would replicate what document databases do.
Also, wouldn't having a master-master relational database with a bunch of materialized views replicate what people are using Cassandra for? Is it just because the free databases don't have materialized view support?
It appears as though Ted's complaint about needing to restart Cassandra to modify ColumnFamilies (tables) is nearly obsolete. A patch for the last remaining subtask has been submitted.
I'm getting tired of both sides of this argument, I'll be happy when the whole back and forth dies :). Rarely do you see a balanced opinion. Sometimes it's people that are fanatical about the new-ish NoSQL idea. Other times, like this, it's someone so stuck in their ways they think that everything but what they like is a fad and nothing will ever change.
One of the key things I look for when I interview developers is that they can recognize the right tool for the job. Potentials that get married to a technology or language are shown the door pretty quickly.
Also, as others have pointed out, this particular article seems to not quite understand the decisions involved, to the point of getting some things backwards.
AdWords implemented on top of MySQL? Perhaps the CRM portion of AdWords (i.e., where the advertisers submit their ads and publishers view their balances) is -- it's fairly easy to partition by functionality and doesn't have extremely tight latency bounds. This isn't where real time auctions (what really distinguishes AdWords from what came before) happen.
You can be sure, however that the data used for real time ad auctions is extracted out of MySQL and into a highly customized data store (likely, a pure in memory one). It's all about using a right tool for the job. You can also be sure that you'll never see a paper on that data store, as that's their competitive edge. If you could duplicate it with off the shelf components (whether MySQL or Cassandra), Google would be toast.
Likewise, I am sure Amazon uses Oracle for their billing system and catalog submission interface, but they use specialized systems for search, shopping cart and recommendations.
For a business app that only needs to scale to the amount of paying customers (i.e., advertisers, account managers and customer support) and has no real time constraints -- but on the other hand involves complex and frequently changing business logic (e.g., where altering tables may be required) an RDBMS is the right tool for the job.
Where latency matters, data grows much faster than Moore's law (in relation to main memory size), Amdahl's law starts to matter in regards to computation (computation work load needs to be partitioned to take advantage of parallelism), and traditional caching strategies simply don't work, something else is. That situation is starting to become more and more common across web companies. You can also be sure that places like Walmart and the like employ plenty of non-relational technologies (my personal bet would be that they're likely using Coherence or Terracotta): usually, however, they're expensive and are built/configured by field engineers to be custom tailored for their workloads. When you employ a world-class engineering team, "build" starts to make more sense than "buy" when you're solving a very specific and constrained problem (e.g., fault tolerant shopping cart system).
You don't need to be of Google's size to be at that stage. Talking about scalability and performance without taking the workloads into account (e.g., "Google Facebook or Amazon" as if e-commerce, search and social networking were compatible) is also an anti-pattern: I am sure engineers at Google would laugh when you compare Facebook's scale to theirs; likewise Facebook's engineers would laugh when you compare the real time aggregation that happens on their site to what happens at Amazon; Amazon's engineers would likely tell you holiday season pager duty horror stories that would scare Facebook or Google engineers.
> a highly customized data store (likely, a pure in memory one)
That's when we stop calling it a data store, and start calling it a data structure. Data stores are where data goes when it's not part of the working set. With that definition, it's perfectly sensible for AdWords to use MySQL as its data store.
(Edit: this is a longer reply than I intended, no longer really intended as a direct reply to the parent; this is more a reflection on systems architecture of data-intensive applications).
That's a good point, but a pure in memory data structure is:
a) Not persistent to disk at all. Judging from my own experience with similar low-latency systems used in ad serving (where we called these "data servers") and other similar systems, the data is likely to be persisted to local disk and the deltas replayed to it from a MySQL db to avoid long restart times.
b) Lives within the ad server process. This is likely not true, as the ad server process will need to compose a "working set" for particular ad auctions from multiple data sources (bid price for each ad, keywords, budget/delivery/campaign specifications for each ad, keyword relevance of ad/ad campaign). Each of these data sources is likely represented by a different data structure (red-black tree for one, hash table for another, trie for yet another, graphs, B-Trees, etc...), has very different characteristics in terms of cache-locality, rate of change, size, density and comes from multiple places (some from RDBMS, others from Map/Reduce)
(Interesting side note: earlier I also wanted to say that the data structures are usually not partitioned in how they're stored, nor is computation done on them partitioned. However, in the age of parallel computing this is simply not true: there are now parallel data structures and algorithms.)
One compromise is perhaps we can call these systems "data servers" or "data structure servers" (afaik Redis does the latter). MySQL (or any other RDBMS) merely feeds these systems through some form of message oriented middleware. In this case the RDBMS (and this is an oversimplification which doesn't cover all the corner cases) is merely acting as tape: changes are played forward and not randomly accessed. The RDBMS that is the source of truth for ad-serving is never queried in real time and can easily be taken down for maintenance while ad serving continues. It doesn't even need to be highly available (if advertisers can't submit ads it would certainly be a huge and costly outage, but much less costly than if users can't see ads!).
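A minimal sketch of the "RDBMS as tape" idea, assuming a made-up change-log format of (sequence number, operation, ad id, value) tuples. The `DataServer` class and the log entries are hypothetical and not taken from any real ad system.

```python
class DataServer:
    """In-memory serving copy fed by an ordered change log (hypothetical format)."""

    def __init__(self):
        self.bids = {}          # ad_id -> current bid price
        self.applied_up_to = 0  # sequence number of the last applied change

    def apply(self, change):
        seq, op, ad_id, value = change
        if seq <= self.applied_up_to:
            return  # already applied, so replaying the same log twice is harmless
        if op == "upsert":
            self.bids[ad_id] = value
        elif op == "delete":
            self.bids.pop(ad_id, None)
        self.applied_up_to = seq

    def replay(self, log):
        # Changes are played strictly forward; the log is never queried at random.
        for change in log:
            self.apply(change)

log = [
    (1, "upsert", "ad-1", 0.50),
    (2, "upsert", "ad-2", 0.75),
    (3, "upsert", "ad-1", 0.60),
    (4, "delete", "ad-2", None),
]

server = DataServer()
server.replay(log[:2])  # the source database could go down here; serving continues
server.replay(log)      # catch up later; entries 1 and 2 are skipped
```

The serving side only ever lags the source of truth, which is the eventual consistency described in the next paragraph.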
Note, such a system is also necessarily eventually consistent (in the truest meaning of the word: customer receives an SLA which corresponds with a point where the serving component is consistent with the DBMS).
There still needs to be an efficient OLAP component to back the CRM/ERP functionality of this system, for which an RDBMS is still a good bet (combined with an off-line system, e.g., Map/Reduce, for more complex reporting and optimization). However, had an end-to-end ad-serving system been written from scratch now, would the RDBMS component serve as the primary source of truth (rather than just as the backend for the publisher/advertiser/support UI component)?
In addition, this ("write to RDBMS, serve from elsewhere") design is also very specific: writes to the "ad submission database" are rare and don't always require high availability. Consistency (between the RDBMS and serving component) can be much more eventual than would be in a Dynamo based system (where the weak "can't read my writes" eventual consistency is only a failure condition).
Now suppose you also want highly available, low-latency writes (even if not at the same frequency as reads) and you'd want to be able to read-your-writes in normal situations. This makes the "write to RDBMS, serve from something else" (effectively what popular memcache+MySQL deployments are) scenario more brittle. You now have much harder questions to answer (do I want a system that's always in a consistent state e.g., to avoid having to do quorum reads/writes? am I okay with eventual consistency as a failure scenario? etc...) but with many workloads this becomes a necessity.
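For readers unfamiliar with quorum reads/writes: with N replicas, a write acknowledged by W of them and a read that consults R of them are guaranteed to share at least one replica whenever R + W > N. The brute-force check below is only a sketch to illustrate that rule; the function name is made up.

```python
from itertools import combinations

def quorums_always_overlap(n, w, r):
    """True if every write set of size w shares a replica with every read set of size r."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )

# With three replicas, W=2 and R=2 satisfy R + W > N, while W=1 and R=1 do not.
strict = quorums_always_overlap(3, 2, 2)
sloppy = quorums_always_overlap(3, 1, 1)
```

When the sets can miss each other, a read may not see the latest write, which is the "can't read my writes" case mentioned above.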
Despite speaking at NoSQL events, I am not a big fan of the NoSQL name. Not only do these systems not intend to completely displace SQL-based RDBMS systems (and, as with the ad server example, can exist side-by-side with them), they also provide functionality that can't be provided by RDBMS systems (and not just due to scalability concerns).
I myself am not actually sold on the noSQL movement, at least not on the idea of ditching SQL entirely. It has its place, but may not be the best solution for every problem.
That said, on the author's complaint about having to restart Cassandra when doing the equivalent of an alter table: lately every time we do an alter table in MySQL (which takes hours on large tables, during which time you can do nothing with them), when the alter finally finishes, MySQL mysteriously crashes. MySQL may have given more thought to the problem, but their solution obviously has problems too.
Macs are better than PCs, C# is better than Java, Unix is better than Windows, RDBMS are better than NoSQL databases...
Why can't people just use the best tool for the job and move on...
From a personal stand-point we've switched from MySQL to Google AppEngine (and BigTable). Although I find there are some major drawbacks (e.g. joining tables) not having to worry about database servers and scalability is a major advantage. That said, if MySQL becomes the best tool for a particular feature, then let it be...
The scalability aspect of "NoSQL" is interesting, but I think possibly the more interesting part is the wide diversity of data models (key value, schema-less tables, document databases, etc)
True, some of these models are more restrictive than traditional RDBMSs to provide scalability, but I think some of them will often be useful even if scalability isn't initially a concern.
In fact the term "NoSQL" itself is more relevant to the data models than the scalability.
Well, since you need a safe way, you'll need a locking mechanism, too. And since it needs to be language agnostic, you can't just dump the internal representation of your object to disk, so you need some sort of serialization.
At that point, it's probably easier to go with an already existent object or document store.
If you start using the filesystem as a datastore that requires concurrent access you open up a whole new can of worms. You need a locking mechanism - which you'll probably implement using (wrapped) native syscalls. Not only does that break cross-platform operation, you'll also have to work on and fix (but find first, of course) bugs in the locking implementation. As you spend more and more time on this and your app starts growing, you'll find yourself spending more and more time working with the limitations of the filesystem you're using (file size limits, directory size limits, access times for files in large directories). You can hack your way around all that but then you have to face other critical tasks. Say .. backup and restore procedures. Can you do partial backup/restore operations? No? Well, get ready to write code for that too. And you preferably want to be able to do those live. Remember those locking issues you solved when you started down this road to hell? Yeah, they're back with a vengeance now.
How about a full restore? Maybe you should have implemented a replay-able log system to get that full restore up to speed with the state of the db since the time of the last backup.
Or maybe this isn't exactly the right point at which to re-invent the wheel :)
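To give a feel for where that road starts, here is about the simplest lock you can build on a filesystem: a lock file created with O_EXCL. This is a sketch under assumptions, not a recommendation; the `lock_file` helper is hypothetical, and it deliberately ignores stale locks left behind by crashed processes, which is one of the bugs you would end up chasing.

```python
import contextlib
import os
import time

@contextlib.contextmanager
def lock_file(path, timeout=5.0, poll=0.05):
    """Hold an exclusive lock by creating `path`; wait up to `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if the file already exists.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {path}")
            time.sleep(poll)
    try:
        yield
    finally:
        os.close(fd)
        os.remove(path)  # a crash before this line leaves a stale lock behind
```

Usage would look like `with lock_file("data.lock"): ...` around every read-modify-write of the data files, and that is before backups, partial restores or replay logs enter the picture.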
XML is nice for this because it supports multiple schema versions, validation, and has support in just about every language. My chief complaint with json as a cross-language serialization/interchange format is lack of a great way to validate your format. Most of the json schema definitions i've seen require your schema to follow a certain convention, which seems backwards and wrong to me.
The filesystem is the original document store. Unfortunately, most filesystems really suffer when you put $LOTS of files in the same folder, so you end up implementing a nested folder structure, and that complicates your code. Whether this is more complex than running an entire "object" store depends on the application. Also, the filesystem's addressing may not be granular enough (and so waste a lot of space) if you are going to store < $BLOCK_SIZE files.
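The nested folder structure usually ends up looking something like this: hash the key and use the first few hex digits as directory names so no single folder gets too full. A minimal sketch; the `sharded_path` helper and its `levels`/`width` parameters are assumptions for illustration.

```python
import hashlib
from pathlib import Path

def sharded_path(root, key, levels=2, width=2):
    """Map a key to root/ab/cd/<full digest> so files spread across many folders."""
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    # Take `levels` chunks of `width` hex characters from the front of the digest.
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return Path(root, *parts, digest)

path = sharded_path("store", "user:42")
# The caller still has to create the directories and write the file:
# path.parent.mkdir(parents=True, exist_ok=True); path.write_bytes(data)
```

With two levels of two hex characters there are 65,536 leaf directories, which keeps each one small, but it does nothing about the wasted space for files smaller than a block.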
You should check out vertexdb. It's designed to be used just like a filesystem, but it fixes the shortcomings of filesystems when they are used as dbs.
I found the comparison with 'Real Businesses' particularly funny, given Wal-mart has 2.1 million employees worldwide and Twitter has 75 million users... their scaling requirements are different by a factor of 35.
I'm pretty sure that Walmart's databases track way more interesting things than tweets.
I'd assume that they know where every pallet of product is located anywhere in the world and what's in it, where every truck is and what's in it, the referrer for every click on their web site, details on every purchase at every store, location of every incoming product from every supplier.....
And then I'd assume that they take all that data and de-normalize it, move it to a whole different series of databases, poke it into star schemas and warehouse it for further analysis - 'cause y'know - red wieners might be more popular than blue wieners next xmas, and they'd better anticipate that by at least 6 months, 'cause the wiener factories have to re-tool.
Listen to rm-rf. Way back when (2003-2004), I was involved in Walmart's supply chain management software. I recall a few million rows to be optimized every day (before noon, so the hundreds of trucks plying the road can move things around efficiently).
You're comparing two completely different businesses on disparate metrics.
How many 'users' does WalMart have worldwide? I'd say at least 500 million on whom they keep a purchase record. Then there's products, credit card numbers, suppliers, etc.
But in terms of their 'real business' it is not the number of customers they have that matters, but only their employees, who will be using their systems. The purchase records, products and credit card numbers are closer to Tweets than users.
I don't think the number of people doing data entry or accessing a system matters nearly as much as how much data one has to be tracked. WalMart's data needs exceed that of Twitter, they exceed the needs of Facebook. I don't know if they meet or exceed Google, but I'd imagine it's up there.
Is it just me, or does this statement make absolutely no sense whatsoever?