Writing

Software, technology, sysadmin war stories, and more.

Saturday, May 31, 2025

rsync's defaults are not always enough

rsync is one of those tools which is rather useful. It saves you from spending the time and effort on copying data which you already have. It's the backbone of many a mirror site, and it also gets used for any number of backup solutions.

There's just one problem: in the name of efficiency, it can miss certain changes. rsync normally looks at the size and modification time of a candidate file, and if they are the same at both ends, that's the end of any consideration. It won't get any further attention and it moves on to something else.
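Here's a minimal sketch of that blind spot in Python. It is a simplified model of the size-and-mtime comparison, not rsync's actual code: the function names are made up for illustration, and it compares mtimes at whole-second granularity just to keep things simple. The demo writes two files of the same length with different bytes, stamps them with the same mtime, and then asks both questions.

```python
import hashlib
import os
import tempfile


def quick_check_same(path_a, path_b):
    """Simplified stand-in for the quick check: same size and same mtime."""
    a, b = os.stat(path_a), os.stat(path_b)
    return a.st_size == b.st_size and int(a.st_mtime) == int(b.st_mtime)


def sha256_of(path):
    """Hash a file's contents in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def demo():
    """Return (quick check says same?, hashes actually match?)."""
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "src.bin")
        dst = os.path.join(d, "dst.bin")
        with open(src, "wb") as f:
            f.write(b"AAAA")
        with open(dst, "wb") as f:
            f.write(b"AAAB")  # same length, different content
        stamp = 1_700_000_000  # arbitrary timestamp, applied to both files
        os.utime(src, (stamp, stamp))
        os.utime(dst, (stamp, stamp))
        return quick_check_same(src, dst), sha256_of(src) == sha256_of(dst)


if __name__ == "__main__":
    print(demo())
```

The first value says the quick check would skip the file. The second says the contents are not the same.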

"So what", you might think. "All files change at least their mtime when someone writes to them. That's the whole point of an mtime."

And yet... I'm writing this post, and here we are.

The keen-eyed observers out there are probably already thinking "ooh, bit rot" and other things where one of the files has actually become corrupted while "at rest" for whatever reason. Those observers are right! That's totally a problem that you have to worry about, especially if you're using SSDs to hold your bits and those SSDs aren't always being powered.

But no, this is something you have to worry about *beyond* that. This is about a "sneak path" that you probably didn't consider. I didn't.

Here, let's run a little experiment. If you have an x86_64 Debian box that's relatively current and you've been backing up the whole thing via rsync for a year or two, go do something for me.

Go run your favorite file-hasher tool on /usr/lib/x86_64-linux-gnu/libfribidi.so.0.4.0 for me. Give it a sha256sum or whatever, or even md5sum if you're feeling brash. Then note the modification time on the file.

Now mount one of your backups and do the same thing on the version of the file that's on the backup device. See anything ... odd? Unusual?

Identical mtimes, identical sizes... and different hashes, right? I spotted this on a bunch of my machines after going "hmmm..." about the whole SSD-data-loss thing.
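If you'd rather not check one file at a time, the same comparison can be run over a whole tree. This is a rough sketch, assuming the live tree and the mounted backup share the same relative layout. The function names and both root paths are hypothetical, and it only looks at regular files that exist on both sides.

```python
import hashlib
import os


def file_sha256(path):
    """Hash a file's contents in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def silent_mismatches(live_root, backup_root):
    """Yield relative paths whose size and mtime match but whose bytes differ."""
    for dirpath, _dirnames, filenames in os.walk(live_root):
        for name in filenames:
            live = os.path.join(dirpath, name)
            rel = os.path.relpath(live, live_root)
            backup = os.path.join(backup_root, rel)
            # Only compare regular files present on both sides (no symlinks,
            # FIFOs, sockets, or files missing from the backup).
            if os.path.islink(live) or os.path.islink(backup):
                continue
            if not os.path.isfile(live) or not os.path.isfile(backup):
                continue
            ls, bs = os.stat(live), os.stat(backup)
            if ls.st_size != bs.st_size or int(ls.st_mtime) != int(bs.st_mtime):
                continue  # a normal run would already notice this one
            if file_sha256(live) != file_sha256(backup):
                yield rel


if __name__ == "__main__":
    # Both paths are placeholders; point them at your own trees.
    for rel in silent_mismatches("/usr/lib", "/mnt/backup/usr/lib"):
        print(rel)
```

Anything it prints is a file that a default rsync run would keep skipping. It doesn't tell you which copy is the good one.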

Clearly, something unusual happened somewhere, and it's been escaping the notice of your rsync runs ever since. I haven't gone digging into the package history for this thing to find out just when and where it happened, and (more importantly) how. It's rather unusual.

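If you're freaking out right now, there is some hope. rsync has both -I and -c, which turn off the quick method. -I (--ignore-times) stops rsync from skipping files just because the size and mtime match, so every file gets updated, while -c (--checksum) runs a checksum on the files to decide which ones need a transfer. Either way it's slower, so you won't want to do this normally, but it's not a bad idea to add this to the mix of things that you do every so many rotations.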

I should point out that the first time you do a forced-checksum run, --dry-run will let you see the changes before it blows anything away, so you can make the call as to which version is the right one! In theory, your *source* files can get corrupted, and if you just copy one of those across, you have now corrupted your backup.
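For the periodic checksum pass, something like the following could wrap the rsync invocation. Treat it as a sketch: the function names and the paths are placeholders, and it assumes rsync is on your PATH. The flags themselves are standard ones: -a (archive), -c (--checksum), -n (--dry-run), and -i (--itemize-changes). Remember that a trailing slash on the source changes what rsync copies.

```python
import subprocess


def checksum_audit_command(src, dst):
    """Build an rsync argv that reports checksum differences without copying."""
    return [
        "rsync",
        "-a",  # archive mode: recurse and preserve metadata
        "-c",  # --checksum: compare file contents, not just size and mtime
        "-n",  # --dry-run: report what would change, transfer nothing
        "-i",  # --itemize-changes: one line per item that differs
        src,
        dst,
    ]


def run_audit(src, dst):
    """Run the dry-run audit and return rsync's itemized output lines."""
    result = subprocess.run(
        checksum_audit_command(src, dst),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    # Placeholder paths; substitute your own source and backup locations.
    for line in run_audit("/usr/lib/", "/mnt/backup/usr/lib/"):
        print(line)
```

Because of -n, nothing is written to the backup. Review the list first, work out which side is correct for each file, and only then run a real transfer.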

Isn't entropy FUN?