Geometric effects of certain system design choices
I have a little story about the compounding effects of bad system design decisions.
Let's say you've decided to write your services in a language which does not really do threads for whatever reason. Maybe it doesn't have threading. Maybe it has an interpreter which only ever does a single thing at a time, regardless of how many threads you have. Maybe you've created a situation where threads can't be used for some other reason even if the language supports them.
Given that you still want your server hosts to be able to take more than one request at a time, you'll probably end up doing something else by running a bunch of these server processes in parallel. This might happen by way of fork() where one of them starts up, opens sockets for listening to the network or whatever, and then kicks off a bunch of children to actually do the work.
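The prefork pattern described above can be sketched in a few lines — a minimal, illustrative demo (the message text and child count are made up, not anything from a real system): the parent opens the listening socket, fork()s children that all inherit it, and each child accept()s off the shared descriptor.

```python
import os
import socket

def run_prefork_demo(num_children=2):
    # Parent opens the listening socket *before* forking, so every
    # child inherits the same file descriptor.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("127.0.0.1", 0))   # ephemeral port; a real server picks one
    listener.listen(16)
    port = listener.getsockname()[1]

    pids = []
    for _ in range(num_children):
        pid = os.fork()
        if pid == 0:
            # Child: block in accept() on the inherited socket,
            # serve exactly one connection, then exit.
            conn, _ = listener.accept()
            conn.sendall(b"hello from pid %d\n" % os.getpid())
            conn.close()
            os._exit(0)
        pids.append(pid)

    # Parent: play client, hitting the shared socket once per child.
    replies = []
    for _ in range(num_children):
        c = socket.create_connection(("127.0.0.1", port))
        replies.append(c.recv(1024))
        c.close()
    for pid in pids:
        os.waitpid(pid, 0)
    listener.close()
    return replies
```

Each connection is answered by whichever child the kernel hands it to — which is exactly how these processes end up as fully independent workers that share nothing but that one inherited fd.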
This gives you a whole bunch of distinct processes that just happened to inherit a common file descriptor that's listening to the network. Let's further assume you've managed to work around the whole "thundering herd" problem where every single one of them wakes up for a single incoming connection, they all call accept4() or similar, but only one "wins", and the other ones do all of that thrashing for nothing.
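One common way around the herd — not necessarily what any particular shop does, and Linux-specific — is to skip the shared-fd inheritance entirely: each worker opens its *own* listening socket on the same port with SO_REUSEPORT set, and the kernel load-balances incoming connections so only one process wakes per connection.

```python
import socket

def make_reuseport_listener(port):
    # Each worker calls this independently. With SO_REUSEPORT, multiple
    # sockets may bind the same (address, port), and the kernel picks
    # exactly one of them for each incoming connection -- no herd.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    s.listen(16)
    return s
```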
You've gotten past that too. You now have many many processes sitting there trying to make the most of things. They probably need to talk to other things somewhere inside your production environment, and how are they going to do that?
Well, if you're the kind of place that made the above choices, it's not much of a stretch to guess that you're going to use HTTP as your transport internally. Your program probably uses libcurl or your language's local equivalent to make a bog-standard HTTP request within your "network" to another server process somewhere else.
I'm guessing a lot of people are nodding their heads here and are waiting for the other shoe to drop. They do all of these things, and they're working, so what's the post about?
The post is about when they stop working, and is a preview of what's ahead for people who continue down this road.
So, all of these things doing HTTP requests, right? Are they connecting to hostnames somehow? That is, are they trying to POST something to "http://other.service.internal:1234/api/v1/ListUsers"? I bet they are, and they're probably doing it on purpose, because someone somewhere is populating DNS with the IP addresses of the instances running that service. Maybe it's some "cloud DNS" thing they have, or maybe it's linked to whatever's running containers for them. They do that DNS request, get back a whole bunch of possible instances, pick one, and go to town.
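Stripped of the HTTP part, what each worker is effectively doing per request looks something like this ("other.service.internal" is the post's made-up name; "localhost" stands in for it when actually running this):

```python
import random
import socket

def pick_instance(hostname, port):
    # Ask the resolver for every instance behind the service name,
    # then pick one at random and return its address. With no caching
    # layer in front of it, getaddrinfo() can mean a real DNS query
    # over the network -- on every single request.
    answers = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    family, socktype, proto, _, sockaddr = random.choice(answers)
    return sockaddr
```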
Trouble is, those answers get big, and pretty soon they won't fit in a nice UDP response, and maybe EDNS0 isn't an option or isn't helping. The resolvers used by these HTTP clients start having to make TCP connections to figure out what's behind those hostnames. They don't cache anything, so they just go out and ask their configured resolver every time there's a request heading to somewhere else in the company.
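The back-of-envelope math on why big answers stop fitting: classic DNS over UDP caps the whole message at 512 bytes (RFC 1035), and with name compression each A record in the answer costs 16 bytes. The question-section size below assumes the post's made-up "other.service.internal" name; the exact numbers are illustrative.

```python
# Classic DNS/UDP message limit, pre-EDNS0.
UDP_LIMIT = 512
HEADER = 12        # fixed DNS header
QUESTION = 28      # "other.service.internal" encoded (24 bytes) + QTYPE + QCLASS

# Per A record, with the owner name compressed to a 2-byte pointer:
# pointer + TYPE + CLASS + TTL + RDLENGTH + 4-byte IPv4 address.
A_RECORD = 2 + 2 + 2 + 4 + 2 + 4   # 16 bytes

max_answers = (UDP_LIMIT - HEADER - QUESTION) // A_RECORD
# Somewhere around 29 instances, the server sets TC=1 and the client
# retries over TCP -- and now every lookup costs a TCP handshake too.
```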
It isn't much of a stretch to see that the first outage manifests as the DNS service completely melting down, as all of those server instances taken together amount to a giant unintentional botnet. DNS goes down, and now nobody can resolve internal instances, and so the whole company goes down. Ho hum, no more cat pictures for a while.
After this, maybe the "DNS infra" gets scaled up a bit, but eventually it becomes obvious that going to off-host resolvers is not quite the answer, either. A project is started to install a caching resolver on every system so that the http clients can now just bounce their requests off localhost port 53 instead of internal.botnet.victim port 53, and it won't all pile onto the upstream ones so much.
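What that on-host cache buys you can be sketched in a few lines — resolve once, honor a TTL, and answer repeat lookups from memory instead of hammering upstream. This is an illustrative toy, and the TTL value is made up:

```python
import socket
import time

class CachingResolver:
    def __init__(self, ttl=30.0):
        self.ttl = ttl
        self.cache = {}             # (hostname, port) -> (expiry, answers)
        self.upstream_queries = 0   # how often we actually asked upstream

    def resolve(self, hostname, port):
        key = (hostname, port)
        now = time.monotonic()
        hit = self.cache.get(key)
        if hit and hit[0] > now:
            return hit[1]           # served from cache: zero network traffic
        self.upstream_queries += 1
        answers = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
        self.cache[key] = (now + self.ttl, answers)
        return answers
```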
This probably works for a bit, but if traffic keeps growing, the next outage is not far behind. This time, the processes find themselves unable to connect to the local DNS server for some reason! They start saying something like "address already in use" when trying to connect *outward*, and that makes no sense at first. After all, doesn't that mean "I'm trying to listen on a port and someone else is there already"?
Well, sure, and you kind of are doing that here. Think about it: the system needs to make sure a given (srcip,srcport,dstip,dstport) quad is unique so it can tell connections apart. If you show up and say "give me a socket I can connect ANYWHERE", it's going to need to make sure you can't turn around and create a problem. That "0.0.0.0:31843" socket you have for your outgoing connection came at a cost: nobody else can use it, either.
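That four-tuple uniqueness is easy to see firsthand. A little demo (all local, all illustrative): two outbound connections to the same destination necessarily differ in the only free parameter left — the source port.

```python
import socket

def demo_four_tuple():
    # A throwaway local listener to connect to.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(5)
    dest = listener.getsockname()

    # Same srcip, same dstip, same dstport for both connections --
    # so the kernel must hand each one a distinct ephemeral srcport.
    a = socket.create_connection(dest)
    b = socket.create_connection(dest)
    ports = (a.getsockname()[1], b.getsockname()[1])
    for s in (a, b, listener):
        s.close()
    return ports
```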
While this outage is going on, someone might eventually strace something and notice that the EADDRINUSE is related to the process trying to bind its SOCK_STREAM file descriptor to an address. Maybe they then look in 'netstat' or 'ss' and see the disaster: every single ephemeral port is gone. Why? Well, consider what's going on. Every process is connecting from 127.0.0.1 port (something) to 127.0.0.1 port 53. Given that three of the parameters are fixed, it doesn't take long to exhaust all of those possibilities, especially given the usual TIME_WAIT and lingering stuff that sockets do. It's only a 16-bit space to begin with, and your system probably doesn't use all of it - it might only use the upper ~half, for instance: roughly 32K to 61K.
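The arithmetic behind the meltdown is grim. With source IP, destination IP, and destination port all pinned to loopback:53, the ephemeral source port is the only variable left. The range below is the typical Linux default (check /proc/sys/net/ipv4/ip_local_port_range on a real box), and the TIME_WAIT duration is the usual ballpark, not a measurement of any particular system:

```python
PORT_RANGE = (32768, 60999)                   # typical Linux default range
usable = PORT_RANGE[1] - PORT_RANGE[0] + 1    # 28232 possible four-tuples

# TIME_WAIT holds each closed connection's port for roughly 60 seconds,
# so a host only has to sustain this many loopback DNS lookups per
# second to burn the whole range before any ports free up:
rate_to_exhaust = usable / 60                 # a few hundred per second
```

A few hundred lookups per second across dozens of forked workers is nothing, which is why this outage shows up right on schedule.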
Somewhere around here, someone discovers a sysctl knob that amounts to "ignore TIME_WAIT if it's a loopback connection because it's probably safe and we won't step on a real connection since it's from us to us". Then they set that up, and now they're running again, but they feel dirty... oh so dirty.
Obviously, this too will eventually fall apart. Even if you recycle former connections as fast as possible, it seems inevitable that you will end up with all possible connections actually in use ("ESTABLISHED") simultaneously, and then what do you do?
What a world, what a world. I'm melting, melting.
Consider what had to happen to get to this point.
Someone decided to build a system that deliberately forked and thus didn't (and couldn't) share common things like a stub resolver or an RPC subsystem. This means every instance has to stand up all of this stuff on its own, and there's really no coordination going on.
Yeah, coordination. Guess what happens when a replica drops out? That one forked child on that one host finds out and maybe marks it down, but the other ones all have to do the same thing. Because there's no common RPC mechanism in use, there's no way to share this knowledge across the workers. They aren't threads, remember, they're full-on forked children.
Incidentally, you know that one request each worker makes to a downed instance out there which is how it finds out that instance is down? Yeah, I bet that request is d-e-a-d DEAD, and can't be retried. You found out that you have a dead replica out there, but you did it at the cost of "fataling" a request and hoping someone upstream from you can retry it. Even if you can technically retry, you might not have the time budget to do that whole hop-skip-jump resolve+connect thing again. Fun times!
By the way, anyone suggesting you rig something up with SysV shared memory or mmap or something else like that in order to share state across these forks is hereby banished to a realm where you can eat all the paste you want all day every day.
Doing a naive HTTP client approach isn't helping matters. Not keeping track of who's out there and having to do a DNS request for every single request going to another service is definitely not making anyone's life easier. Having this be multiplied by all of the forks of all of the workers on this disaster of a deployment is just the icing on the cake.
Any one of these sketchy design decisions by itself might have worked out fine in isolation, but when they all interact with each other, you can end up with a category five technical debt storm.
I hope you brought your waders.
Just look at all of the possibilities that have been left on the table.
A single process could have had something that kept a pool of connections available for any client to use. It wouldn't have to keep re-resolving them for every single request. It could even do keepalives to figure out if the far end is actually alive before committing a request to the depths, never to be seen again. When it does learn the hard way that an endpoint is down and just swallowed a request, all of the workers then benefit since they're all going through this shared subsystem. They don't have to repeat it needlessly.
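A minimal sketch of what such a shared subsystem might look like — resolve once, keep connections around for reuse, and mark a dead endpoint down for every caller at once. Everything here (class name, timeouts, the down-marking policy) is illustrative, not a description of any real implementation:

```python
import socket
import time

class ConnectionPool:
    def __init__(self, down_for=10.0):
        self.idle = {}          # (host, port) -> list of open, reusable sockets
        self.down_until = {}    # (host, port) -> monotonic deadline while "down"
        self.down_for = down_for

    def get(self, host, port):
        key = (host, port)
        deadline = self.down_until.get(key)
        if deadline and time.monotonic() < deadline:
            # Some earlier caller already paid the price of discovering
            # this endpoint is dead; everyone else gets to skip it.
            raise ConnectionError("endpoint marked down: %s:%d" % key)
        if self.idle.get(key):
            return self.idle[key].pop()   # reuse: no resolve, no handshake
        try:
            return socket.create_connection(key, timeout=5.0)
        except OSError:
            # One failed connection marks the endpoint down for all
            # callers sharing this pool -- the knowledge isn't trapped
            # inside a single forked worker.
            self.down_until[key] = time.monotonic() + self.down_for
            raise

    def put(self, sock):
        # Hand a healthy connection back for the next caller to reuse.
        self.idle.setdefault(sock.getpeername(), []).append(sock)
```

The point isn't this particular code; it's that threads (or an async runtime) in one process make this kind of shared state trivial, where forked workers make it impossible without the paste-eating hacks mentioned above.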
Forked workers and running your internal transport as HTTP requests might seem simple and easy, but there are big nasty monsters waiting for you down the road. Now that you know this, please try to swerve when you see them coming at you.