Writing

Feed Software, technology, sysadmin war stories, and more.

Thursday, January 5, 2023

S p a m m y s y s c a l l s in strace dumps

I was doing some light nerd reading at lunch the other day and ran across someone who had encountered trouble with a program that was using TCP_NODELAY when perhaps it shouldn't. TCP_NODELAY is one of those things which turns off Nagle's algorithm, which is usually used to batch up a bunch of small writes so you don't spam the network with tons of tiny packets. (If this sounds familiar to long-time readers, it's because it starred in a post that made the rounds in the fall of 2020.)

All of those packets have overhead. It's not quite the same problem that it was when we had 10 megabit shared-media networks with collisions out the wazoo, but it's still not great to just waste bandwidth and CPU time on things that aren't latency-sensitive.

The problem comes when you have a program that has a bunch of stuff to put on the wire, and yet it does it with individual calls to write(). Instead of pushing (say) ~2 KB at the network with a single call, it instead spins through the buffer, writing each one individually. Now you have 2000 packets flying around, all with their headers and everything else as overhead. Having the kernel batch this up is basically saving the world from broken code.

I saw this and it reminded me of a similar bit of damage in my own life. I have some projects where I am forced to wrap another program and listen to its stdout. It doesn't have a library form, so the only way to make use of it is to go through this whole rigamarole. I get to create a pipe, then fork and have the child connect stdout to that pipe and exec the program in question. The parent process then sits there listening to the pipe for updates.

I realized that my program (the reader) was waking up FAR too often. I should be getting updates every 30-45 seconds, but it would wake up a couple of thousand times in that interval. WTF? Well, it turns out that for whatever reason, it writes to stdout (more or less) a byte at a time.

Seriously. I had to see this for myself, and attached to it with strace. It pretty much looked like this:

708589 22:46:24.174856 write(1, "\"", 1) = 1 <0.000039>
708589 22:46:24.175018 write(1, "i", 1) = 1 <0.000041>
708589 22:46:24.175187 write(1, "d", 1) = 1 <0.000040>
708589 22:46:24.175339 write(1, "\"", 1) = 1 <0.000041>
708589 22:46:24.175506 write(1, " : ", 3) = 3 <0.000048>
708589 22:46:24.175666 write(1, "12345", 5) = 5 <0.000041>
708589 22:46:24.175814 write(1, ", ", 2) = 2 <0.000041>
708589 22:46:24.175981 write(1, "\"", 1) = 1 <0.000041>
708589 22:46:24.176138 write(1, "c", 1) = 1 <0.000039>
708589 22:46:24.176279 write(1, "h", 1) = 1 <0.000040>
708589 22:46:24.176443 write(1, "a", 1) = 1 <0.000041>
708589 22:46:24.176596 write(1, "n", 1) = 1 <0.000040>
708589 22:46:24.176732 write(1, "n", 1) = 1 <0.000040>
708589 22:46:24.176875 write(1, "e", 1) = 1 <0.000043>
708589 22:46:24.177045 write(1, "l", 1) = 1 <0.000070>
708589 22:46:24.177331 write(1, "\"", 1) = 1 <0.000030>
708589 22:46:24.177454 write(1, " : ", 3) = 3 <0.000029>

That's 17 lines from a much longer log. Those 17 lines alone were displayed in about three milliseconds, and had many many more above and below. Here I was, trying to see what kind of data it was sending to me, and it was spamming me with syscalls.

If you look closely, you can see it's not quite one byte per write() but it's pretty close. Numbers go all at once for whatever reason, and those " : " strings are another curiosity.

I pretty much forgot about this until the TCP_NODELAY post crossed my path, and then was reminded of this. Clearly, short writes are pretty common.

I wonder though, do people not strace programs any more? If this was my project and I was trying to figure something out, all of the vertical scrolling would drive me crazy. When it's spewing out more than my scrollback buffer will let me access, something is wrong! I'd go to lengths to try to batch it up a little.

The one part of this process I control is the reader, so that side has some sanity-enabling hacks added to it. I wait for up to a second between checks, and even then, I call select() with a 250 msec timeout. This gives the syscall-spamming writer program a chance to finish writing to the pipe before I go and read it. This raises the chances that I'll get the whole thing from a single read() call. Their program can spin making thousands of syscalls per event. Mine makes about four: futex, futex, select, read.

...

Full disclosure: I thought about writing a program to spit the body of this post out as a series of individual bytes handed to write(), and then the post itself would have just been the output from running it in strace. Everyone would have had to read it vertically, and I bet it would have been seriously annoying for basically anyone, except the rare people who have lived through it and would find it oddly hilarious.

But, I'm trying to make a point here, so I decided to make it accessible to people who tend to read things in terms of words and sentences. I left the crazy up in the first two words of the title instead.