Writing

Software, technology, sysadmin war stories, and more.

Wednesday, December 11, 2024

Circular dependencies for socket activation and WireGuard

One of the more interesting things you can do with systemd is to use the "socket activation" feature: systemd itself opens a socket of some sort for listening, and then it hands it over to your program, inetd-style. And yes, I know by saying "inetd-style" that it's not even close to being a new thing. Obviously. This is about what else you can do with it.

Like in my previous tale about systemd stuff, you can add "deny" and "allow" rules which bring another dimension of filtering to whatever you're doing. That applies to the .socket files which are part of this socket activation thing. You can even forcibly bind the socket to a specific interface, e.g.:

[Socket]
ListenStream=443
IPAddressDeny=any
IPAddressAllow=192.0.2.0/24
BindToDevice=wg0

That gives you a socket which listens on TCP port 443 and which will do some BPF shenanigans to drop traffic unless the other end is in that specific /24. Then the BindToDevice line also locks it down so it's not listening to the entire world, but instead is bound to this wg0 interface (which in this case means WireGuard).

This plus the usual ip[6]tables rules will keep things pretty narrowly defined, and that's just the way I like it.

I did this in a big way over the past year, and then never rebooted the box in question after installing such magic. Then earlier this week, I migrated that system's "personality" to new hardware, and that meant boots and reboots here and there. Wasn't it weird how it was taking almost two minutes to reboot every time? What the hell, right?

Digging into the systemd journal turned up that some of the "wg" stuff wasn't coming up, and it sure looked like a dependency cycle. A depends on B, which depends on C, which depends on D, which depends on A again? If not for the thing eventually timing out, it wouldn't have EVER booted.

I'm thankful for that timeout, since the rest of the box came up and I was able to get into that little headless monster to work on the problem.

The problem is basically this: if you have a .socket rigged up in the systemd world, you by default pick up a couple of dependencies in terms of sequencing/ordering at boot time, and one of them is "sockets.target". Your foo.socket essentially has a "Before=sockets.target", which means that sockets.target isn't reached until your socket is up and listening.

But what if your foo.socket has a BindToDevice that's pointing at WireGuard? You now have a dependency on that wg0 thing coming up, and, well, at least on Debian, that gets interesting, because it ("wg-quick@wg0" or similar) has to wait for basic.target to be done, and basic.target in turn has to wait for sockets.target to happen first.

foo.socket waits on wg waits on basic waits on sockets waits on foo.socket. There's the cycle.
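Spelled out as ordering directives, the loop looks roughly like the sketch below. This is a summary across several units, not a file you'd install anywhere. It assumes the socket is called foo.socket and the interface is wg0, as above. The lines shown are the default and implicit dependencies as described in the systemd man pages, so treat them as a guide and check them against what your own box reports.

```ini
# --- foo.socket (default dependency for any .socket unit) ---
Before=sockets.target

# --- foo.socket (implicit, because of BindToDevice=wg0) ---
BindsTo=sys-subsystem-net-devices-wg0.device
After=sys-subsystem-net-devices-wg0.device

# --- wg-quick@wg0.service (default dependency for a service) ---
# The wg0 device only shows up once this service has run.
After=basic.target

# --- basic.target (as shipped with systemd) ---
After=sockets.target
```

Note that nothing in your own foo.socket file says any of this out loud. The only line you actually wrote is BindToDevice=wg0.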

Getting out of this mess means breaking the cycle, and the way you do that is to remove the default dependencies from your .socket file, like this:

[Unit]
DefaultDependencies=no

After that, it's on you to set up the appropriate WantedBy, Wants, Before or After declarations on your .socket to make sure it's attached to the dependency graph somewhere.
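As a rough sketch of what that can look like, here's a .socket file with the defaults turned off and the ordering put back by hand. Everything in [Unit] past DefaultDependencies=no, and everything in [Install], is an assumption for illustration and not the exact file from the box in question. It assumes the tunnel is brought up by wg-quick@wg0.service, and that hanging the socket off multi-user.target is acceptable for your setup.

```ini
# foo.socket (hypothetical name, matching the example above)
[Unit]
# Drop the default Before=sockets.target (and the other defaults).
DefaultDependencies=no
# Assumption: wg-quick@wg0.service is what creates wg0 on this system.
Wants=wg-quick@wg0.service
After=wg-quick@wg0.service
# The defaults also took care of shutdown, so put that part back.
Conflicts=shutdown.target
Before=shutdown.target

[Socket]
ListenStream=443
IPAddressDeny=any
IPAddressAllow=192.0.2.0/24
BindToDevice=wg0

[Install]
# Assumption: start the socket with the normal multi-user boot
# instead of via sockets.target.
WantedBy=multi-user.target
```

If you change the [Install] section on a socket that was already enabled, you'll probably need to disable and enable it again so the symlinks land in the right place. Then it's another reboot to see whether the loop is really gone.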

I should mention that it took a LOT of rebooting, journal analysis, cursing, and generally complaining about things before I got to this point. If you're in a mess like this, "systemd-analyze dump <whatever>" is generally your friend, because it will point out the *implicit* dependencies which are just as important but which won't show up in your .socket or .service files. Then you get to sketch it out on paper, curse some more, and adjust things to not loop any more.

There doesn't seem to be a good way to spot this kind of problem before you step in it during a boot. It's certainly not the sort of thing which would stop you before you aimed a cannon directly at your foot. Apparently, "systemd-analyze verify <whatever>" will at least warn you that you have a cycle, but figuring out how you got there and what to do about it is entirely up to you. Also, if you don't remember to run that verify step, then obviously it's not going to help you. I only learned about it just now while writing up this post - far too late for the problem I was having.

I sure like the features, but the complexity can be a real challenge.