Silly benchmarks on completely untuned server code
A couple of days ago, I finally put my foot down on the topic of running Python, Gunicorn, and Gevent as the basis for your services. It generated both a lot of support and a fair bit of pushback. This tends to happen when you show up in a space where people are emotionally invested in a particular technology and say things about it.
There was one key element to some of the responses that made me wonder, though: okay, so if not that, then what? I actually didn't have a ready-made answer to this on tap. I normally don't have to solve for that when I end up somewhere.
Basically, when you work for company G or company F, you get a nice code base that's ready and waiting for you to jump in. It has a rich set of libraries, and among them are things you can use to get a basic HTTP status server going, or even speak whatever flavor of RPC that company has invented. It's no big deal to write a bit of code that handles whatever logic you need and push it out to production. It will run, and it will run well, unless you do some really terrible things in your implementation.
Now I'm out on my own again, and I got to thinking: what if I had to solve this problem for myself, now? I don't have the repos ("google3" and "fbcode") available to me any more. What if I wanted to run a simple little service and serve requests from it? How would it work, and how far would it scale?
I had done some dumb things involving wrapping $http_server_library back in the period from 2011 to 2013 when I was on my own. It let me do enough simple status pages and really crappy "RPC" (hardly) between my own programs to run things like the scanner site, or some of my more useless web-based inventions.
So I was wondering: just how far would it go now? What if I took some completely untuned code from 2011 that wraps that library and answers with a 200 and a short string? Just how far can that go?
I started it up and launched 'ab' at it. It's a simple little tool that comes with Apache and can do HTTP benchmarking. I should point out that all of these numbers are going to be terrible, filled with noise, and subject to many many caveats. The point is to give very very rough ideas of what can be done in the space. That's it.
First up, I told it to run 10000 queries. It finished in just over a second. Right there, with no tuning at all, and no parallelization in the clients, it pulled off about 8400 RPS. Not too bad.
ab -n 10000 http://localhost:8080/ --> ~8400 RPS
A quick look at tcpdump shows that it's doing this with a new TCP connection every single time, so that's one extreme: 10K clients showing up one at a time asking for stuff. How about the other extreme, where 1 client shows up once and asks for 10K things in a row? That's just ab's -k switch for HTTP keepalive.
Keepalive mode is even faster. The whole round finishes in about 400 milliseconds. Obviously, not having to stand up new connections, even those over loopback, helps. There were probably also other gains in terms of not having to spawn and destroy threads on the server.
ab -k -n 10000 http://localhost:8080/ --> ~26000 RPS
Now, obviously both of these are ridiculous. 10000 requests are not going to walk in the door single-file like the first one, and they aren't going to all arrive over a single connection from a single client like the second one (we hope).
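For a rough feel of what those two extremes look like from the client side, here is a minimal sketch in Python. It is not how ab works internally, and the function name `sequential_rps` and its parameters are made up for illustration. It sends requests one after another, either opening a fresh TCP connection each time or reusing a single one, and reports requests per second.

```python
import http.client
import time


def sequential_rps(host, port, n, keepalive):
    """Send n sequential GET / requests and return requests per second.

    keepalive=False opens a new TCP connection for every request.
    keepalive=True tries to reuse one connection for all of them.
    """
    conn = None
    start = time.perf_counter()
    for _ in range(n):
        if conn is None:
            conn = http.client.HTTPConnection(host, port, timeout=5)
        conn.request("GET", "/")
        # Drain the body so the connection can carry the next request.
        conn.getresponse().read()
        if not keepalive:
            conn.close()
            conn = None
    if conn is not None:
        conn.close()
    elapsed = time.perf_counter() - start
    return n / elapsed
```

Something like `sequential_rps("127.0.0.1", 8080, 10000, keepalive=True)` would be the loose analogue of the `-k` run. If the server closes the connection anyway, `http.client` quietly reconnects, so the keepalive number is only meaningful against a server that actually honors it. Expect the absolute numbers to be dominated by the Python client rather than the server.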
How about some concurrency? After all, in real life, clients arrive in parallel. What if we do those 100 at a time? What happens then? We're back to not using keepalive here, so every request gets its own connection again.
ab -c 100 -n 10000 http://localhost:8080/ --> ~14500 RPS
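The same idea with concurrency can be sketched with a thread pool. Again, this is an illustration under assumptions, not a reimplementation of ab: `timed_get` and `concurrent_run` are hypothetical names, and each request uses a fresh connection, matching the non-keepalive run above.

```python
import http.client
import time
from concurrent.futures import ThreadPoolExecutor


def timed_get(host, port):
    """One GET / on a fresh connection; returns latency in milliseconds."""
    start = time.perf_counter()
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", "/")
        conn.getresponse().read()
    finally:
        conn.close()
    return (time.perf_counter() - start) * 1000.0


def concurrent_run(host, port, total, concurrency):
    """Issue `total` requests with at most `concurrency` in flight.

    Returns (requests_per_second, list_of_latencies_ms).
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(timed_get, host, port) for _ in range(total)]
        latencies = [f.result() for f in futures]
    elapsed = time.perf_counter() - start
    return total / elapsed, latencies
```

`concurrent_run("127.0.0.1", 8080, total=10000, concurrency=100)` roughly mirrors `-c 100 -n 10000`. A Python thread pool is a much heavier client than ab, so it will probably hit its own ceiling before the server does.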
Turns out dealing with all of that parallelism costs a little bit. There's also a bit more variance in the latency. The 50th percentile (hereafter pxx) is 7 milliseconds. The p99 is 10 milliseconds, and the absolute longest request is 15 milliseconds. Previously, as reported by the tool, all of them were sub-millisecond.
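If you collect per-request latencies yourself, the pxx figures are easy to compute. The sketch below uses the nearest-rank definition, which is an assumption on my part: ab may calculate its percentile table differently, so small disagreements are expected.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile.

    Returns the smallest sample such that at least p percent of all
    samples are less than or equal to it. p must be in (0, 100].
    """
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    # Multiply before dividing to keep the rank calculation exact
    # for whole-number percentiles.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]
```

With a list of latencies in milliseconds, `percentile(latencies, 50)` gives the p50, `percentile(latencies, 99)` the p99, and `percentile(latencies, 100)` the slowest request.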
I should point out that I have done absolutely no tuning to this thing, or the machine it's running on. There's other stuff going on. It's serving my usual web site out to people who are clicking through from Hacker News or reddit or Blind or whatever.
What about the server side of this? Just how much oomph does this thing need out of the machine? Well, to find out, I started it under "time -v" as a really crappy way to get some measurements, then killed it after running the same test as the last one above (10K requests, 100 at once).
It says it used 160 milliseconds of user time, 760 milliseconds of system time, and had 15% of the CPU. The maximum RSS (how much actual memory it consumed) was about 30 MB.
30 megabytes. Not gigabytes, megabytes.
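A note on how to read that CPU figure: GNU time computes its percentage as user plus system time divided by elapsed wall-clock time. The sketch below just does that arithmetic with the numbers quoted above. The helper names are made up, and since the 15% is a rounded figure, the derived wall-clock time is only approximate.

```python
def cpu_percent(user_s, system_s, elapsed_s):
    """CPU percentage the way GNU time reports it: (user + sys) / elapsed."""
    return 100.0 * (user_s + system_s) / elapsed_s


def implied_elapsed(user_s, system_s, percent):
    """Invert the formula: wall-clock seconds implied by a CPU percentage."""
    return 100.0 * (user_s + system_s) / percent


# Figures from the run above: 160 ms user, 760 ms system, 15% CPU.
cpu_seconds = 0.16 + 0.76                   # about 0.92 s of CPU in total
wall = implied_elapsed(0.16, 0.76, 15.0)    # roughly 6.1 s from start to kill
```

In other words, the process burned a bit under a second of CPU over roughly six seconds of existence. The percentage therefore also reflects however long the server sat idle before and after the test, not just how hard it worked during it.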
For what it's worth, the box itself (snowgoose) is an 8-way Intel E-2174G which seems to be running most of its CPUs north of 4 GHz at the moment. It's using 15% of that.
Clearly, this thing could do a whole lot more on this box. I could take more requests, or I could do much more (read: ANY) work inside these requests. Or a little of both.
Huge caveats to add: I don't normally do benchmarking like this. I'm probably doing something wrong. "ab" is probably flawed in a dozen ways that I don't know about. The server is doing no work at all. There are no locks to contend with, or time spent on the CPU thinking about basically anything. It's not logging. It's not touching the disk. It's just dispatching a request to the single handler in the map, and that handler is setting "200", "text/html" and "Blah\n". Really.
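To make "a single handler in the map" concrete, here is a stand-in sketch. The post deliberately doesn't name the language or library used, so this is not that code. It uses Python's standard `http.server` module purely for illustration, and the names `HANDLERS`, `root`, `Dispatcher`, and `make_server` are hypothetical.

```python
# Illustrative stand-in only: NOT the language, library, or code from the post.
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def root(request):
    """The one handler: returns (status, content type, body)."""
    return 200, "text/html", b"Blah\n"


# Path -> handler map with a single entry.
HANDLERS = {"/": root}


class Dispatcher(BaseHTTPRequestHandler):
    # HTTP/1.1 lets the connection stay open between requests.
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        handler = HANDLERS.get(self.path)
        if handler is None:
            self.send_error(404)
            return
        status, content_type, body = handler(self)
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        # No logging at all, matching the caveat above.
        pass


def make_server(port=8080):
    # ThreadingHTTPServer services each connection on its own thread.
    return ThreadingHTTPServer(("127.0.0.1", port), Dispatcher)


# To try it by hand: make_server().serve_forever()
```

Once it's running, something like `ab -n 10000 http://127.0.0.1:8080/` can be pointed at it. No claim here about what numbers it will produce. The point is only how little code sits between the socket and the response in a test like this.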
Anything that did real work would undoubtedly be slower. It has to be.
But, would it be as slow as the things I've seen? No way. It's just not that expensive to spin up a thread and do some work.
Final note: I didn't mention what language or what library I used for this. To make my point, a friend did a similar test and got similar numbers with a different language/library. Then, just to make things even more ridiculous, he shoved it onto his Windows box and ran it from there. Even then, it did a stupid amount of traffic without breaking a sweat.
This kind of stuff REALLY shouldn't be that difficult. If you have the right tools and environment, it isn't. If you've never seen it work this way for yourself, you might think that the status quo is fine. That's okay. You just haven't seen the alternatives yet. That's why you have people like me around. I point it out to give you the option to try something else.
The choice is yours.