Hacker News

Tarsnap.

Tarsnap lets you say: tarsnap -c -f backup01302009 mysql_dir/

And you can just adjust the date each day. It gives you the luxury of a full dump (any time you want to restore, just reference backup01302009), but it only actually stores the deltas, making sure not to duplicate data that's already in backup01292009, backup01282009, and so on. Tarsnap stores the data on S3, so it's replicated across multiple data centers.
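A minimal cron-able sketch of that pattern; the directory and archive naming here are assumptions, and the tarsnap call is left commented so the naming logic can be checked on its own:

```shell
# Build today's archive name in the same MMDDYYYY style as above.
DATE=$(date +%m%d%Y)              # e.g. 01302009
ARCHIVE="backup$DATE"

# The actual backup (commented out in this sketch):
# tarsnap -c -f "$ARCHIVE" mysql_dir/

echo "$ARCHIVE"
```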

It costs a little more than S3 at 30 cents per GB, but it's metered out so that if you only use 1MB of storage, you'll only be charged 0.03 cents for that storage. You could try creating your own way of doing incrementals, but I doubt you'd get it as efficient as Colin (the math genius behind Tarsnap) and so I doubt you'd get it cheaper. Plus, this way you don't have to deal with it.

And remember, it's hard to fill up a database.* As the Django Book notes: "LJWorld.com's database - including over half a million newspaper articles dating back to 1989 - is under 2GB." So, if they were using Tarsnap, they might be storing 5 or 10GB tops at a whopping $1.50-$3 per month plus whatever the transfer of their deltas was for the month. Oh, and tarsnap compresses the data too. So, maybe they'd be paying $1 or something lower.
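Back-of-envelope check of those figures (storage only, ignoring transfer and compression), at tarsnap's 30 cents per GB-month on a 5-10GB database:

```shell
CENTS_PER_GB=30
LOW_CENTS=$(( 5 * CENTS_PER_GB ))    # 150 cents = $1.50/month
HIGH_CENTS=$(( 10 * CENTS_PER_GB ))  # 300 cents = $3.00/month
echo "$LOW_CENTS $HIGH_CENTS"
```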

* Clearly, if you hit it big time, you might not want to continue paying for tarsnap. However, if you become the next big thing, you can hire someone to deal with it for you.



This doesn't work. You can't just copy the files and expect them to be in a sane or consistent state.

You either need to a) use InnoDB hotbackup or b) use a slave, stop the slave, run the backup, and restart the slave to catch up.

At delicious we used B, plus a hot spare master, plus many slaves.

Additionally, every time a user modified his account, it would go on the queue for individual backup; the account itself (and alone) would be snapped to a file (Perl Storable, IIRC), which only got regenerated when the account changed, so we weren't re-dumping users that were inactive. A little bit of history allowed us to respond to things like "oh my god all my bookmarks are gone" and various other issues (which were usually due to API-based idiocy of some sort or another.)


Using a slave isn't foolproof either. If someone runs a malicious command, it gets replicated, and could be backed up before it's caught.


I didn't say that. Read what I wrote.

You use the slave so you can shut down the database and get a consistent file snapshot. Then you do an offline backup.
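A dry-run sketch of that procedure: quiesce the slave, take a file-level copy, then let it catch up. The hostname and paths are assumptions, and the steps are echoed rather than executed so the ordering is easy to review:

```shell
SLAVE=slave.db.example.com
STAMP=$(date +%F)
# Each step, in order; pipe this plan into sh only once you've vetted it.
PLAN=$(printf '%s\n' \
  "mysql -h $SLAVE -e 'STOP SLAVE'" \
  "mysqladmin -h $SLAVE shutdown" \
  "tar -czf /backups/mysql-$STAMP.tar.gz /var/lib/mysql" \
  "mysqld_safe --datadir=/var/lib/mysql &" \
  "mysql -h $SLAVE -e 'START SLAVE'")
echo "$PLAN"
```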


Yeah, it's true. I was a little simplistic. I usually use A, but I'm not dealing with the amount of data that delicious is.


Whenever Tarsnap is mentioned, I have to mention Duplicity which does the same thing, but is Free Software.

I use this for my personal backups, as well as backups of our work svn (fsfs) and git repositories. I use it against S3, and have found it incredibly reliable.

As a bonus, it encrypts everything but still does incremental backups. It's a really nice piece of software, and you don't have to pay anyone to use it.


...Duplicity which does the same thing...

Duplicity is not the same thing as tarsnap. Duplicity uses a full plus incrementals model compared to tarsnap's snapshot model, so with duplicity you're either going to be stuck paying to store extra versions you don't want or be stuck paying for multiple full backups. Moreover, tarsnap is considerably more secure than duplicity.

Before I started working on tarsnap, I considered using duplicity; but it simply didn't measure up.


How is tarsnap considerably more secure?


Some problems with duplicity off the top of my head -- I'm sure there are others (there always are):

1. Duplicity uses GnuPG. GnuPG has a long history of security flaws, up to and including arbitrary code execution. Yes, these specific bugs have been fixed; but the poor history doesn't inspire much confidence.

2. Duplicity uses librsync, which follows rsync's lead by making rather dubious use of hashes. In his thesis, Tridge touts the fact that 'a failed transfer is equivalent to "cracking" MD4' as a reason to trust rsync; but now that we know how weak MD4 is, it's possible to create files which rsync -- and thus Duplicity -- will never manage to back up properly.

3. When you try to restore a backup, the storage system you're using can give you your most recent backup... or it can decide to give you any previous backup you stored. Duplicity won't notice.

4. If you try to use the --sign-key option without also using the --encrypt-key option, duplicity will silently ignore --sign-key, leaving your archives unsigned. Based on comments in the duplicity source code, this seems to be intentional... but this doesn't seem to be documented anywhere, and it seems to me that this is an incredibly dumb thing to do.
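To illustrate point 4 (flag behavior as described above — verify against your duplicity version; the key IDs and target URL are placeholders, and the commands are echoed, not executed):

```shell
PLAN=$(printf '%s\n' \
  "duplicity --sign-key DEADBEEF /home/me s3+http://bucket/path" \
  "duplicity --encrypt-key CAFEBABE --sign-key DEADBEEF /home/me s3+http://bucket/path")
# First form: --sign-key is silently dropped, archives end up unsigned.
# Second form: archives are encrypted and signed.
echo "$PLAN"
```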


EBS does deltas too. Is anyone else using it? I like the ability to mount a volume or clone a volume almost instantly and mount it on another machine.


EBS does deltas, but there are a few caveats. The most important being that you need to be using EC2. For many, $72/mo plus bandwidth might be a bit much for what they're doing if it can work on a 512MB Xen instance for under $40 with a few hundred gigs of transfer included.

Beyond that, drive snapshots aren't the easiest things to do. I know that RightScale tells their customers to freeze the drive so that no changes can occur until the backup is complete. With S3 performance around 20 MBytes/sec, backing up 1GB takes around a minute. That's not bad, and since you're only doing deltas it's unlikely you'll have a huge amount to back up at any given time, but it isn't exactly good either. With file-level backup, you can do a mysqldump and then just back up that file. Eh, maybe I'm just preferring the devil I know in this situation.
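A sketch of that file-level approach: dump first, then back up the dump. The database name and paths are assumptions; --single-transaction gives a consistent InnoDB dump without holding locks (it doesn't help with MyISAM tables). Commands are echoed, not executed:

```shell
STAMP=$(date +%m%d%Y)
DUMP="/backups/mydb-$STAMP.sql.gz"
PLAN=$(printf '%s\n' \
  "mysqldump --single-transaction mydb | gzip > $DUMP" \
  "tarsnap -c -f backup$STAMP $DUMP")
echo "$PLAN"
```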

It's a little more complex to set up (doing file-level backups), but if you're going the volume route, you need to make sure you don't leave the drive in an inconsistent state.

All that said, EBS is awesome. If it fits what you're looking for, then go for it!


This is not totally accurate. EBS snapshots are basically instantaneous; it's just the copy to S3 that takes time, and Amazon performs that in the background. We use XFS on our EBS volumes (running MySQL 5 with InnoDB) and have a little Perl script (http://ec2-snapshot-xfs-mysql.notlong.com/) that does FLUSH TABLES WITH READ LOCK -> xfs_freeze -> snapshot -> xfs_freeze -u -> UNLOCK TABLES. The whole process takes a fraction of a second, and it also logs where in the binlog the snapshot was made (handy since we create new slaves from snapshots, and it reduces how much data we shuttle around).
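That sequence has to run inside one held mysql session, since the read lock is released the moment the session that took it closes; the mysql client's SYSTEM command lets the shell steps run without dropping the lock. A sketch (volume ID and mount point are placeholders; printed here rather than executed):

```shell
# The full lock/freeze/snapshot/thaw/unlock sequence as a single session.
SEQ=$(cat <<'EOF'
mysql <<'SQL'
FLUSH TABLES WITH READ LOCK;
SHOW MASTER STATUS;                 -- record the binlog position
SYSTEM xfs_freeze -f /vol
SYSTEM ec2-create-snapshot vol-12345678
SYSTEM xfs_freeze -u /vol
UNLOCK TABLES;
SQL
EOF
)
echo "$SEQ"
```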

We snapshot a slave every 10 minutes and the master once a night (just in case something totally weird happens to the slave and the sync isn't right). This is a multi-gig DB and we've had no problems.

Here is a link to a full tutorial about running MySQL on EC2 with EBS: http://developer.amazonwebservices.com/connect/entry.jspa?ex...

I wanted to also point out that a live slave is NOT a backup scheme. If someone hacks your database and runs DROP ALL FROM PRODUCTION_DATABASE you've now got a perfect copy of nothing.



