We have to talk about this Python, Gunicorn, Gevent thing
Even if you're in a terrible situation, you should probably try to learn from it. You never know if your purpose in life is to actually serve as a warning to others, as that "Demotivational" poster puts it.
There are a lot of things I will not do myself. I, for example, will not write a service in Python. I will not use web requests when the situation calls for RPCs. I will not use "green" (userspace) "threads" when there are actual OS-level threads and parallelization is necessary.
I've had these beliefs for a while, but they weren't rooted in experience with those specific technologies. I had seen them from afar, and decided they were no good, and besides, I had other things I was using that worked better and made me happier.
However, I actually have that experience now, and I can serve as that warning. Here's what happens when you build around Python, Gunicorn, Gevent, and web requests instead of something more sensible.
So there's this "service", right? And it's sitting there on this box. The box has 4 CPUs, so you're running 5 instances of it, because someone decided the ideal worker count was (num_cpus + 1). Exactly how does it get there?
Well, first up, something runs gunicorn, which does some housekeeping, sets up a listening socket and then calls fork(). Then, the children all import your "app" -- that is, your actual "service" code. Then they do some local housekeeping, twiddle epoll a little, and start waiting on that listening socket to go live. At this point, every one of your processes is sitting in the kernel, having called epoll_wait().
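To make that concrete, the whole arrangement usually boils down to a tiny config file plus a launcher. Here's a sketch of what that tends to look like -- the file name, port, and app path are my stand-ins, not anything from the actual service:

    # gunicorn_conf.py -- a hypothetical sketch, not the real config
    import multiprocessing

    bind = "0.0.0.0:8000"
    workers = multiprocessing.cpu_count() + 1   # the magic num_cpus + 1
    worker_class = "gevent"                     # gevent's "green thread" workers

    # started with something like: gunicorn -c gunicorn_conf.py myapp:app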
A connection arrives on the socket. Linux runs a pass down the list of listeners doing the epoll thing -- all of them! -- and tells every single one of them that something's waiting out there. They each wake up, one after another, a few nanoseconds apart.
All five processes call accept4(). There's only one connection waiting, so only one of them succeeds. Obviously, the first one to call it won, and the first one to call it was probably the first one the kernel woke up. (The kernel seems to be doing some kind of LIFO thing, so that one process keeps winning, but I digress.)
The winner proceeds to recvfrom() on the freshly accepted connection to find out what the client wants. It's some kind of "GET /blah HTTP/1.1" thing. Oh, it's a web request.
Meanwhile, the other four processes have gotten -1 with errno set to EAGAIN, because - surprise! - nothing else was waiting on that socket. They sulk a little bit, kick at the pebbles by their feet, do a little housekeeping and jostle some data around, redefine their epoll sets, and call back into epoll_wait(), having burned a bit of CPU time for exactly no benefit.
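If you want to watch that race without any of the gunicorn machinery in the way, here's a minimal sketch of the same shape -- mine, not theirs: five forked children share one listening socket, all park in epoll_wait(), and all get woken up for a single connection that only one of them can accept.

    # Thundering-herd sketch: not Gunicorn's code, just the same arrangement.
    import os
    import select
    import signal
    import socket
    import time

    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("127.0.0.1", 8000))
    listener.listen(128)
    listener.setblocking(False)

    children = []
    for _ in range(5):                        # the num_cpus + 1 workers
        pid = os.fork()
        if pid == 0:                          # child: a "worker"
            ep = select.epoll()
            ep.register(listener.fileno(), select.EPOLLIN)
            while True:
                ep.poll()                     # park here until the kernel says "go"
                try:
                    conn, _ = listener.accept()
                except BlockingIOError:       # EAGAIN: somebody else already won
                    print(f"worker {os.getpid()} woke up for nothing")
                    continue
                print(f"worker {os.getpid()} won the accept race")
                conn.close()
        children.append(pid)

    time.sleep(0.2)                           # let the workers settle in
    socket.create_connection(("127.0.0.1", 8000)).close()   # one lonely client
    time.sleep(0.2)
    for pid in children:                      # tear down the demo
        os.kill(pid, signal.SIGTERM)

On a typical Linux box you'll see one worker win and some number of its siblings report that they woke up for nothing.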
This situation, incidentally, was dubbed the "thundering herd" back in the '90s, when Linux people started freaking out about how Windows NT web servers were whipping Linux royally. People realized that maybe it wasn't a great idea to have every single Apache httpd blocking in accept() and then waking up at the same time. They came up with semaphore tricks and (much later) things like EPOLLEXCLUSIVE. It got better.
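(The eventual epoll-level fix, for completeness, is a one-flag change to the registration in the sketch above -- not that anything in this particular stack asks for it:)

    # EPOLLEXCLUSIVE (Linux 4.5+, select.EPOLLEXCLUSIVE in Python 3.6+) tells
    # the kernel to wake only one of the waiters instead of the whole herd.
    ep.register(listener.fileno(), select.EPOLLIN | select.EPOLLEXCLUSIVE)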
In this world, the '90s apparently never ended.
Let's carry on with this scenario. Process #1 is still running this request. It in turn has to fire something down the network to yet another "service". When it does, something in the "green thread" libraries notices what's going on and says "hey, you seem to be waiting on the network, so how about we go back for more work?". It pushes that original request aside and goes back to the epoll situation.
And, hey, what do you know, it actually gets another connection! It does recvfrom() on it, and sees what it wants, and now it starts fulfilling this request. It involves a bit of cranking on some data. Maybe it's manipulating strings. Maybe it's reading JSON off the filesystem, and then deserializing it into local objects. Maybe it's applying translations. (Maybe it's already done this a bunch of times, and is too stupid to keep it around from the last time.) Whatever. The point is, it's now very busy with this new request.
Meanwhile, that original request is getting old. The call it fired at the backend has since received a response, but the worker is still busy cooking on the new request, so there's been no opportunity to flip back and notice. Eventually, that new request's computations are done, and it sends back a reply: HTTP/1.1 200 OK, blah blah blah.
Then it does its weird userspace "thread" flip back to the original request's context and reads the response. It chews on this data, because, again, it's terrible JSON stuff and not anything reasonable like a strongly-typed, unambiguous, easily-deserialized message. The clock is ticking and ticking.
Finally, it sends back a reply to this original request. But, oh, what's this?
SIGPIPE.
Yep, the write() down the file descriptor for that original request returned -1 with errno=EPIPE, and it also caused a SIGPIPE signal to be fired at the process at the same time. What happened?
What happened is that the client got tired of waiting and bailed. It hung up the connection. With that connection dead on the network side of things, the write immediately failed.
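If you've never watched that up close, here's a tiny standalone demo -- mine, not the service's code -- of exactly that failure: write to a peer that's already hung up and the kernel hands you EPIPE. CPython ignores SIGPIPE by default, which is why the worker gets an error return instead of dropping dead on the spot.

    # Writing to a connection the other side already closed.
    import socket

    server_side, client_side = socket.socketpair()
    client_side.close()                       # the "client" gets bored and bails

    try:
        server_side.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n")
    except BrokenPipeError as e:
        print("reply went nowhere:", e)       # [Errno 32] Broken pipe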
The service shrugs, closes down the connection, does its usual post-request cleanup stuff, and goes back to waiting with everyone else.
Consider that the client has no idea the request actually completed. It's total Byzantine agreement stuff here. The client's only option now is to try it again and hope it finds a worker that doesn't run away in the middle of their "conversation" to take care of whatever request comes in next.
(I sure hope these requests are idempotent...)
What's going on here? Many things. So many things.
Within any given worker, only one thing is ever really going to execute at once. It's either handling one request or it's handling another. There's only one thread associated with this process. It's not like it can use multiple CPUs and go off and do work. These aren't OS threads; they're fake userspace threads. If one's running, the other one is waiting. If the waiting one has a tighter deadline, oops!
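Here's a minimal sketch of that starvation using nothing but plain gevent -- the timings are made up, the shape is the point. One greenlet "waits on the network" for 50 milliseconds; the other crunches numbers for two seconds without ever yielding.

    # Two greenlets, one interpreter: gevent only switches on I/O or explicit yields.
    import time
    import gevent

    def original_request():
        start = time.monotonic()
        gevent.sleep(0.05)        # pretend this is the call to that other "service"
        # The "backend" answered ~50 ms in, but we only get scheduled again
        # once the other greenlet stops hogging the interpreter.
        print(f"original request resumed after {time.monotonic() - start:.2f}s")

    def new_request():
        deadline = time.monotonic() + 2.0
        while time.monotonic() < deadline:    # pure-Python crunching, never yields
            pass
        print("new request finished its number crunching")

    gevent.joinall([gevent.spawn(original_request), gevent.spawn(new_request)])

The original request's answer was ready almost immediately. It just never got another turn until the cruncher was done -- about two seconds later.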
So why not have a bunch of workers? Just fork a crapload more than num_cpus plus one? Well, that's another thing. All of those workers cost memory.
But wait, you say. fork() is copy on write. You shouldn't have all of those copies of the code. And, well, you're half right. fork() is in fact copy on write, but that only helps with pages which are already mapped when the fork happens.
"Huh?" you probably think.
Go back and look. I said that it forks and then it imports your app. Your app (which is almost certainly the bulk of the code inside the Python interpreter) is not actually in memory when the fork happens. It gets loaded AFTER that point.
"Why in the hell would you fork then load, instead of load then fork?"
It's around this time that you discover that people have been doing naughty, nasty things, like causing work to occur at "import time". That is, they have code hanging out in their .py files which is not enclosed by a function or class or whatever, such that importing it runs it. That code has side-effects. Nobody ever told them to not do that. (Clearly though, nobody's ever told them not to do anything, based on the wobbly tower of bad architectural decisions I've been describing.)
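Something like this entirely made-up module is the shape of the problem:

    # settings.py -- a hypothetical example of "work at import time", not code
    # from any real service. Merely importing this module does real work.
    import sqlite3
    import tempfile

    # Both of these run the moment anyone does "import settings":
    SCRATCH_DIR = tempfile.mkdtemp(prefix="myapp-")            # side effect #1
    db = sqlite3.connect(f"{SCRATCH_DIR}/cache.db")            # side effect #2
    db.execute("CREATE TABLE IF NOT EXISTS cache (k TEXT, v TEXT)")

If the parent imported that and then forked, every worker would inherit the same scratch directory and the same open database connection, which is exactly the kind of sharing that breaks in fun and exciting ways.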
That's why they fork-then-load. That's why it takes up so much memory, and that's why you can't just have a bunch of these stupid things hanging around, each handling one request at a time and not pulling a "SHINYTHING!" and ignoring one just because another came in. There's just not enough RAM on the machine to let you do this. So, num_cpus + 1 it is.
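(gunicorn does have a load-then-fork knob, for what it's worth -- one more line in the hypothetical config from earlier -- but it's only safe when nobody's imports do the kind of thing that settings.py sketch just did:)

    # back in the hypothetical gunicorn_conf.py
    preload_app = True   # import the app once in the parent, THEN fork, so the
                         # code pages are shared copy-on-write across workers --
                         # along with any connections opened at import time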
I'm not even going to get into the whole insanity of using web requests when you really should be using a real RPC mechanism, with, oh, you know, strongly defined data types and methods which use them, timeouts, load balancing, health checks, queue depth determination, sharding, pick-two, selective LIFO, and everything else that tends to show up given enough time in a battle-tested system. That's a rant for another post entirely.
So how do you keep this kind of monster running? First, you make sure you never let it use too much of the CPU, because, empirically, anything more means it's getting distracted too much and is timing out some requests while chasing down others. You set your system to "elastically scale up" at some pitiful utilization level, like 25-30% of the entire machine.
Your cloud provider loves this. They tell you: sure, run more instances! Buy our stuff to help you manage the tens of thousands of identical underutilized virtual machines that this design gives you! It'll buy our founder a lovely ivory-handled back scratcher.
It means you have to come up with breathtakingly complicated machinery to deal with thousands of instances when the work could really be done with something far smaller, given a reasonable implementation. One of your friends points out the COST paper and you weep silently.
Have you ever learned something and realized, gee, I wish I didn't know any of that? Like, only someone stuck in a terrible situation would ever have to know it? Yeah. That's me.