Monday, February 6, 2012

If unused RAM is wasteful, what about unused disk space?

Linux has a few interesting behaviors that might not be obvious to the casual observer. For instance, it likes to keep its memory relatively full. This is not a bad thing. After all, something you've had in memory once might be needed again down the road. If you don't have any other use for that space, you might as well let those pages hang out and hope they serve some purpose. It's easy enough to just toss them out if you need the space for other data at a later point.

In my days of working support, this would occasionally generate a support ticket. Someone would get on a new server, look at "free" or /proc/meminfo and would wonder why "all of their RAM was in use". Of course, usually it wasn't, and they were just seeing aggressive caching at work. After all, unused memory is wasted memory.
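If you ever want to see this for yourself, the raw numbers are sitting right there in /proc/meminfo. Here's a rough little Python sketch (nobody's official tool, just arithmetic) which adds the buffer and cache figures back onto the "free" number:

```python
#!/usr/bin/env python3
# Rough sketch: add the kernel's buffer and cache figures back to the
# "free" number from /proc/meminfo, to show how much memory is really
# spoken for versus just holding cached data it can drop at any time.

def meminfo():
    values = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            values[key] = int(rest.split()[0])  # first field is the size in kB
    return values

m = meminfo()
cache = m.get("Buffers", 0) + m.get("Cached", 0)
print("MemTotal:         %8d kB" % m["MemTotal"])
print("MemFree:          %8d kB" % m["MemFree"])
print("Buffers+Cached:   %8d kB" % cache)
print("Effectively free: %8d kB" % (m["MemFree"] + cache))
```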

So now I've been thinking about disk space. Disks are getting ridiculously large. I picked up a brand new drive the other day and the smallest one worth my consideration in terms of price per GB had half a terabyte of storage space. It's almost silly just how much space there is now.

It got me thinking about the RAM situation. Linux keeps stuff in memory just in case you might need it later, because you probably have more physical RAM than you actually need at any given moment. Well, what about disk space? You probably have hundreds of gigs of free space just spinning around and around all day long.

At the same time, you probably have to wait for that honey badger video to stream in from YouTube every time you load it. How does this make sense? Where is the opportunistic caching of content onto our now-copious amounts of disk space? Sure, browsers have disk caches, but how much are they really willing to use?

Okay, sure, with a simplistic implementation, it would screw up the numbers you see in "df". Fine. But what if you had a temp filesystem which lived in the free space and had no guarantees about longevity? It would be like caching everything you fetch in the Mac trash can or the Windows recycle bin. It can go away at any time, but until then, it's still there for you.
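Just to make that concrete, here's a minimal sketch of the lazy userspace version: a scratch directory that holds whatever you fetch, and starts throwing out the oldest objects the moment the disk's free space dips below some threshold. The path and the 10% figure are made up, and unlike the filesystem-level trick I'm imagining, this version still shows up in "df". The eviction idea is the same, though.

```python
#!/usr/bin/env python3
# Minimal sketch of a best-effort disk cache: store fetched objects in a
# scratch directory, but treat free space as someone else's property --
# if the disk starts filling up, evict the oldest entries first.
# The directory and the 10% threshold are arbitrary choices.

import os
import shutil

CACHE_DIR = "/var/tmp/bestcache"      # hypothetical scratch location
MIN_FREE_FRACTION = 0.10              # keep at least 10% of the disk free

def ensure_room(path=CACHE_DIR):
    os.makedirs(path, exist_ok=True)
    usage = shutil.disk_usage(path)
    while usage.free / usage.total < MIN_FREE_FRACTION:
        entries = sorted(
            (os.path.join(path, name) for name in os.listdir(path)),
            key=os.path.getmtime,
        )
        if not entries:
            break                     # nothing left to throw away
        os.remove(entries[0])         # oldest object goes first
        usage = shutil.disk_usage(path)

def put(object_id, data):
    ensure_room()
    with open(os.path.join(CACHE_DIR, object_id), "wb") as f:
        f.write(data)

def get(object_id):
    try:
        with open(os.path.join(CACHE_DIR, object_id), "rb") as f:
            return f.read()
    except FileNotFoundError:
        return None                   # cache miss: go fetch it as usual
```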

Now I'll go another step beyond that. Imagine a whole office full of machines which are doing this. You actually have the potential for a best-effort storage cluster right there under your nose, just a few milliseconds away! You can put out a multicast request for something by ID, and if a host has the object you want, it lets you have a copy. Otherwise it says nothing. If you get no responses, you fetch it as usual.
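The client side of that could look something like this sketch. The multicast group, port, and the "HAVE?"/"HAVE!" messages are all invented for the example, and the two fetch_* stubs stand in for whatever transfer mechanism you'd actually use:

```python
#!/usr/bin/env python3
# Sketch of the "ask the office first" idea: multicast a query for an
# object ID, wait briefly for a neighbor to answer, otherwise fall back
# to fetching from the origin. Group, port, and message format are made up.

import socket

GROUP, PORT = "239.255.42.42", 4242   # arbitrary site-local multicast group
TIMEOUT = 0.05                        # 50 ms: neighbors are only a few ms away

def find_neighbor_with(object_id):
    """Return the address of a peer claiming to have object_id, or None."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(TIMEOUT)
    try:
        sock.sendto(b"HAVE? " + object_id.encode(), (GROUP, PORT))
        reply, peer = sock.recvfrom(1024)
        if reply.startswith(b"HAVE! "):
            return peer               # first responder wins
    except socket.timeout:
        pass                          # silence means nobody has it
    finally:
        sock.close()
    return None

def fetch_from_peer(peer, object_id):
    raise NotImplementedError("copy the object from the peer, e.g. plain HTTP")

def fetch_from_origin(object_id):
    raise NotImplementedError("the usual slow path out to the origin server")

def fetch(object_id):
    peer = find_neighbor_with(object_id)
    if peer:
        return fetch_from_peer(peer, object_id)
    return fetch_from_origin(object_id)
```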

Certain elements of this are not new, of course. The Squid web cache has had this ability for many years. It would poke nearby caches to see if they have a copy of an object before going out to the origin server. This would be a bit like that, and a bit like memcached. The difference is that it would use the host's disk space in such a way that it didn't actually affect the visible "free space" numbers, and it would be entirely best-effort.

You'd have to be careful not to hurt the performance of the host machines, perhaps by running the whole thing at a very low priority, but that isn't a tall order.
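On Linux, that's mostly a matter of asking for the lowest CPU and I/O priorities. A tiny sketch, leaning on the third-party psutil package for the I/O half:

```python
#!/usr/bin/env python3
# Sketch: make the cache daemon as polite as possible on its host.
# os.nice() handles the CPU side; the third-party psutil package
# wraps the Linux I/O priority call.

import os
import psutil

os.nice(19)  # drop this process to the lowest CPU priority

# Idle I/O class: only touch the disk when nobody else wants it (Linux-only).
psutil.Process().ionice(psutil.IOPRIO_CLASS_IDLE)
```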

With the kinds of resources ordinary workstation-type systems have these days, it seems wasteful not to share them somehow. This assumes an environment with a certain degree of trust, of course. But hey, if you're doing things over regular unencrypted http, you're already trusting the network. Why not benefit from it?