40 milliseconds of latency that just would not go away
Have you ever tried to optimize a system, only to find it just would not get any faster than some seemingly arbitrary point? Did it seem like the pieces had somehow agreed never to deliver results in less than X milliseconds, even when the system was unloaded and had a super-quick network link between the devices?
This happened to some friends of mine a couple of years ago. They had been running one version of some software for a long time, and it had been forked off from upstream. It apparently had picked up a bunch of local fixes for efficiency, correctness, and all of that good stuff. Still, the fork had managed to miss out on a bunch of upstream goodness, and so the company eventually moved back to the open-source release.
Upon doing that, they noticed that no request would complete in less than 40 milliseconds, even though the older version of the code had been beating that under the same conditions. This magic number kept showing up: 40 ms here, 40 ms there. No matter what they did, it would not go away.
I wish I had been there to find out what got them to turn the corner to the solution. Alas, that detail is missing. But we do know what they discovered: the upstream (open source) release had forgotten to deal with the Nagle algorithm.
Yep. Have you ever looked at TCP code and noticed a couple of calls to setsockopt(), one of them with TCP_NODELAY? That's why. The Nagle algorithm is on by default on Linux, and TCP_NODELAY is the option that turns it off. While it is enabled, TCP tries to collapse a bunch of tiny sends into fewer, bigger ones so it doesn't blow a lot of network bandwidth on per-packet overhead. Unfortunately, gathering the data up means holding small writes back until the data already sent has been acknowledged, and if the other end is delaying its ACKs, that wait only ends when a timeout fires on the receiving side.
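For anyone who has not seen that setsockopt() call up close, here is a minimal sketch in Python. It is not the patch my friends applied (I never saw that), just an illustration of the option itself. The helper names connect_nodelay and nagle_disabled are made up for this example; the socket calls and constants are the standard library's.

```python
import socket


def connect_nodelay(host: str, port: int, timeout: float = 5.0) -> socket.socket:
    """Open a TCP connection and turn off the Nagle algorithm on it.

    Hypothetical helper for illustration only.
    """
    sock = socket.create_connection((host, port), timeout=timeout)
    # TCP_NODELAY = 1 asks the kernel to send small writes right away
    # instead of holding them back to coalesce with later data.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def nagle_disabled(sock: socket.socket) -> bool:
    """Report whether TCP_NODELAY is currently set on this socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
```

The option is per-socket, so a server has to set it on each accepted connection as well, not just on the listening side's clients. Whether turning it on is a win depends on the traffic pattern: lots of tiny writes with Nagle off means lots of tiny packets.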
In this case, that timeout was 40 ms and that was significantly higher than what they were used to seeing with their service. In the name of keeping things running as quickly as possible, they patched it, and things went back to their prior levels of performance.
There is an interesting artifact from this story: some of the people involved made T-shirts showing the latency graph from their service both before and after the fix.
Stuff like this proves that a large part of this job is remembering a bunch of weird data points and knowing when to match this story to that problem.
Incidentally, if this kind of thing matters to you, the man page you want on a Linux box is tcp(7). There are a *lot* of little knobs in there which might affect you depending on how you are using the network. Be careful though, and don't start tuning things just because they exist. Down that path also lies madness.