Hacker News
Amazon S3 file system with improved caching: the itch I scratched over Christmas (github.com/russross)
36 points by russross on Dec 28, 2009 | hide | past | favorite | 9 comments


This project has one of the best READMEs I've seen in recent days of looking at a lot of open source code. Good overview, well written.


Very cool. This looks like a real version of the Ruby fusefs I wrote to grok all of the s3organizer vs. s3sync vs. whatever schemes for differentiating files vs. directories in S3:

http://github.com/stephenh/s3fsr

Since I used Ruby's fusefs, nothing is streamed and it's single-threaded, limitations I assume this C++ implementation doesn't have to deal with.


Would it be possible to reuse an existing HTTP caching solution like Squid or nginx for the caching, since S3 exposes a REST API?


I'm not sure it would interact nicely with the request authentication system that S3 uses.
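To make the wrinkle concrete: every S3 REST request carries an Authorization header derived from an HMAC over the verb, date, and resource, so two fetches of the same object a second apart are not byte-identical, which complicates keying a shared HTTP cache on the signed traffic. A minimal sketch of the signing scheme S3 used at the time (Signature Version 2); the function name and arguments here are illustrative, not from s3fslite:

```python
import base64
import hashlib
import hmac

def s3_auth_header(access_key, secret_key, verb, resource, date,
                   content_md5="", content_type=""):
    """Build the Authorization header for an S3 REST request (Signature V2)."""
    # The string-to-sign concatenates the request essentials, newline-separated.
    string_to_sign = "\n".join([verb, content_md5, content_type, date, resource])
    # Sign with HMAC-SHA1 under the account's secret key, then base64-encode.
    digest = hmac.new(secret_key.encode(), string_to_sign.encode(),
                      hashlib.sha1).digest()
    return "AWS %s:%s" % (access_key, base64.b64encode(digest).decode())
```

Because the Date header is part of the string-to-sign, the signature changes from request to request, which is part of why a pass-through cache layer below the file system is simpler than caching the signed HTTP traffic itself.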

I think a generic cache layer would be a better solution. A bit of googling turns up fuse-cache, which sounds like roughly the right thing (although I haven't examined it in detail), and FS-Cache, which also sounds like a discrete cache layer that can be added to any file system. Basically, you mount it, and it passes requests through to any other mount (like s3fslite) while adding an on-disk cache layer.

I haven't tested any of these, but that seems like an approach worth pursuing.

- Russ


Very nice! I was rather disappointed with the FuseOverAmazon version I tried a couple of months ago. I will definitely give this a try. Thanks for the great documentation on how to use it as well.

Great job!


Neat! Rather than having to issue a find, perhaps a background task that primes the cache as soon as you mount?


I'm hesitant to automatically fire off that many requests, especially when they may not end up being necessary. If you are using the same machine and preserving the cache, it will already be primed each time you mount the bucket, except the first time (or any time you delete the cache database file).

Using find is just a trick I used whenever I'd corrupt the cache or change the DB schema while developing it, and then wanted to go in and test it again interactively.
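For reference, the find trick amounts to a single recursive walk of the mount point, which touches every entry once and leaves the attribute cache warm; the path here is hypothetical:

```shell
# Hypothetical mount point; substitute wherever the bucket is mounted.
MOUNT=/mnt/bucket
# One recursive walk stats every directory and file, priming the cache.
[ -d "$MOUNT" ] && find "$MOUNT" > /dev/null
```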

I should probably mention that reducing the number of requests was one of my primary goals. The first time I played with s3fs (the one I forked), my bill for the month was roughly 10% storage and bandwidth, and 90% requests (or was it 20/80?).

Anyway, thanks for the feedback; I do appreciate it!

- Russ


Does this use http to upload everything to S3?


Yes, it does. Adding https as an option is something I'll probably look into.

edit: It uses libcurl for transfers, and libcurl supports https, so getting a secure connection is as simple as adding the option:

    url=https://s3.amazonaws.com
at mount time.

I've added that to the README file.
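For anyone trying this, the full invocation would look something like the following, assuming s3fslite keeps the s3fs command-line form (bucket name, then mount point, then -o options); the bucket and mount point are placeholders:

```shell
# Mount a bucket with transfers over https instead of plain http.
s3fs mybucket /mnt/mybucket -o url=https://s3.amazonaws.com
```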



