Old box, dumb code, few thousand connections, no big deal
I think some of us who have been doing this a while have been doing a terrible job of showing what's possible to those folks who are somewhat newer. I get the impression that a lot of us (myself included) hit a certain point, get tired of this crap, and just want to be free of it. Sometimes I wonder what it would be like if I could just forget I knew anything about systems stuff and just disappear into the world to... I don't know, raise goldfish, or something.
But, as long as I'm here writing about this stuff, I might as well share some of what's been learned over the years. Not all of it is good, and some of it is probably just flat-out wrong or outdated, but there are other things people need to know.
First of all, it does not take "that much machine" to serve a fair number of clients. Ever since I wrote about the whole Python/Gunicorn/Gevent mess a couple of months back, people have been asking me "if not that, then what". It got me thinking about alternatives, and finally I just started writing code.
The result is something kind of unusual for me: a server without a real purpose just yet. It's more of a demonstrative thing at this point.
What I have is a server which starts up and kicks off a "listener" thread to do nothing but watch for incoming traffic. When it gets a new connection, it kicks off a "serviceworker" thread to handle it. The "listener" thread owns all of the file descriptors (listeners and clients both), and manages a single epoll set to watch over them.
When a client fd goes active, it pokes that serviceworker thread with a condvar, and it wakes up and does a read from the network. If there's a complete and usable message, it dispatches it to my wacky little RPC situation which is sitting behind it. Then it pushes the result out to the network again and goes back to waiting for another wakeup.
If you're keeping score, this means I spawn a whole OS-level thread every time I get a connection. I designed this thing around the notion of "what if we weren't afraid to create threads", and this is the result.
I wrote up a load testing tool, too. It will create any number of worker threads, each of which opens a TCP connection back to the server. Each one of those will fire a request down the pipe, wait for the response, sleep a configurable period, and then go again.
Let's say I stand up the server and a loadgen instance on the same machine. In this case it's my nine-year-old workstation box running Slackware64. I tell the load generator to hit the server (on localhost), run 2000 workers, and wait 200 milliseconds between queries.
Doing the math, every worker should run about 5 queries per second. It's not exactly 5 because the request itself takes some amount of time to complete. That means I should see about 10000 per second overall.
I start it up. What do you suppose happens to the server? It now has 2000 client connections and 10000 QPS to worry about. Does it eat the whole machine? No, far from it.
The server process has about 90 MB of RSS -- that is, physical memory. That's not bad for something with a hair over 2000 threads! Now, sure, the virtual size (VSZ) of the process is something like 20 GB, but who cares? That's all of the overhead of having that many threads around, but they aren't actually using it in physical terms, so it works.
What does the latency look like? Since this is all running over loopback, it looks stellar. Most of the time, all of the requests finish in under a millisecond. Every now and then, a handful slop over into "more than 1, less than 2" territory. That's pretty cool, right?
But, all right, that's not realistic. Queries don't come from localhost, they come from other machines on the same network. Fine. So what happens when I keep those 2000 clients running, then add 4000 more over the local gig Ethernet from my Mac laptop?
Well, first of all, the latency spreads out. Those requests going over loopback aren't all now happening in under a millisecond. Now some of them are actually making it up to 40 ms. Gasp!
The Mac, meanwhile, is seeing a nice spread of latencies, too. I have it doing some really terrible analysis to yield percentiles. If I haven't screwed this up too badly, the numbers look like this for one five-second period grabbed from the scrolling console:
p50: 24ms, p75: 31ms, p90: 42ms, p95: 49ms, p99: 68ms
If you're not familiar, this means that half (50%) of the requests finished in 24 milliseconds or less. 75% finished in 31 milliseconds or less, and so on.
This seems to be lagging my laptop a lot worse than the Linux box which is doing all of the serving. I'm ON the Linux box running X, in an X terminal, sshed into the Mac, typing this post... and it's now lagging. It's not the Linux box, either, since I can do more interactive things on it, and it's ho-hum, boring, same old same old.
The server, meanwhile, is up around 256 MB of physical memory, and is juggling just over 6000 threads. The load average on the whole machine (which is running the server, a load generator, and some other unrelated things) is about 4.4.
So what happens if I throw more load at it? I have one more vaguely-Unixy box here on my local network: a 2011 Mac Mini. How about another 4000 clients from that thing?
The server is now up to about 422 MB of physical memory and just over 10000 threads. The load average on the whole machine is about 7.8 now. The whole server is running about 32000 QPS now.
What did this do to the timing? Back on my laptop, things have slowed down a little bit. Here's another five-second snapshot:
p50: 25ms, p75: 38ms, p90: 57ms, p95: 75ms, p99: 127ms
To be quite honest, the numbers from the laptop are all over the place. It also has Firefox and at least one stupid chat client that's actually a giant piggy web browser running, plus whatever else Apple decides to kick off in the background. Still, the p99 stays under 250ms. In theory, I could run this a lot longer and get some error bars out of this whole business. I could also use sensible systems to do this testing which don't have tons of other crap running on them which add noise to my results. But hey, I'm using what I have on hand.
Now, to be clear, this all hinges on exactly what kind of work the server is doing. It's not doing TLS. It IS dealing with my own little goofy wire protocol, deserializing that into a protobuf, figuring out what it means, deserializing the inside of *that* thing into another (request-dependent) protobuf, and then dispatching it to the handler. The handler does a little work (more on that in a bit), and then the server has to serialize the response, jam it into the outer message, serialize that, wrap it up in my goofy wire protocol, and fire it down the socket.
What, then, is the workload? Well, for this test, I wrote something truly stupid called LavaLamp. It's an homage to the SGI Lavarand project which used cameras pointing at lava lamps to generate random numbers.
Mine does not use lava lamps. The name is meant to be ironic. It DOES generate pseudo-random numbers, though. The idea is to do just a little work on the CPU without being anything too interesting. It's deliberately not optimized. Every time it runs, it creates a random device, a random engine, and a uniform int distribution which is constrained between 'A' and 'Z'. Yep, that's right, this thing only spits out capital letters in ASCII.
It runs this 16 times and sends you the result. There's no caching, so every client gets their little blob fresh off the grill, so to speak.
It's kind of funny since it does actually open /dev/urandom every time through. In fact, if you don't give this thing a high enough file descriptor limit, it might just blow up when it gets that far. After all, being connected to 10000 clients each consumes a fd, the listening sockets eat a few, epoll has one, there's stdin/out/err, and now you want to open /dev/urandom every time you get a request? That's a lot of cruft!
Note: it doesn't crash when this happens. I catch the random device failure and just kill the request. Why bring down the whole server when one request failed? That's just asking for trouble.
I just wanted something that would present a little churn to the box instead of being something truly stupid like an "echo" endpoint. This does a teeny tiny amount of work and makes it that much more interesting.
So, my dumb little box from 2011 can handle 10000 persistent connections each firing a request down the pipe 5 times a second to do something truly stupid with random numbers. It does this with maybe half a gig of physical memory used, and about 75% of the CPU power available. I haven't done anything special in terms of balancing IRQs from the NIC, splitting up the accept/epoll/dispatch load, or pipelining requests. It's about as stupid as you can imagine.
The only thing that should be even mildly interesting is that I built it with a deliberate decision to not be afraid of threads. I figure, I'll stand 'em up, and let the kernel figure out when to schedule them. So far, it's been working pretty well.
This is just one example of what can be done. It's not magic. It's just engineering.