Monday, December 18, 2023

Smashing the stack for pain and misery

I need to remind people how easy it is to forget just one of the many gotchas of working on this ridiculous computer stuff. One missed nugget of data at a critical moment can leave you scratching your head and going "WTF" for longer than would otherwise seem reasonable.

Here's something that happened to me last week. I was working on a stupid little utility that runs on my machines and lets me keep tabs on what's going on with systemd. If it gets unhappy because any of the services have stopped running, then this thing will let me know about it. For the handful of systems I have to worry about, it gets the job done.

Now, since I'm in "holiday mode", I'm largely working on my laptop instead of sshing back to a Linux box somewhere else. This laptop is a Mac, so it's mostly compatible with what I'm doing. Obviously, it doesn't run systemd, but that wouldn't stop me from tidying up a tool in test mode. I was working on this thing, and noticed it started blowing up in strange places. Also, it was a really strange "bus error". To me, that says "binaries on NFS" or "unaligned access on some architectures". I'm not doing either sort of thing here.

gdb was not really an option at that moment for various annoying reasons, so I resorted to "debug via printf" - putting little notes to say "I got here" and whatnot. The spot where it died kept changing. I'd think I had it nailed down, and it would move!

Eventually, I got it down to something truly odd: it was blowing up in a worker thread, and it was the point where that thread started up and read in a config file from the disk. The line of code looked something like this, where I call into one of my own helper libraries:

auto raw = file::ReadFileToString(kDefaultConfigPath);

Okay, I said to myself, let's find out what's going on in that function, and I started sprinkling my "I got here" notes into there. One of those notes was at the very top of that function and just said "got into ReadFileToString". It never ran.

I removed the call to that function. It stopped crashing.

So, what's in that function that's so spooky? Well, it opens a file descriptor, does the usual sanity checks on it, and then creates a buffer that it'll pass to read()... and herein lies the problem:

  char buf[1048576];

Yep, just having that there was blowing the stack, and the bus error is how it manifested in that particular arrangement of function calls within the worker thread.

That's right, if you're already pressed for stack space and then enter a function with something like that, you might just explode. Here's a contrived example with an even bigger buffer to demonstrate it with just a single innocent-seeming function call:

mac$ cat bs.cc 
#include <stdio.h>

#include <memory>
#include <thread>

static void do_thing() {
  char buf[1048576 * 8];
  buf[0] = '\0';
}

int main(int argc, char** argv) {
  if (argc != 1) {
    printf("running in worker thread\n");

    auto worker = std::make_unique<std::thread>(&do_thing);
    worker->join();
    return 0;
  }

  printf("running in main\n");
  do_thing();
  return 0;
}

The fun part is that on a Mac, the flavor of error changes between "bus error" and "segmentation fault" just by shoveling it into a thread.

mac$ ./bs 
running in main
zsh: segmentation fault  ./bs
mac$ ./bs foo
running in worker thread
zsh: bus error  ./bs foo
mac$ 

Nice, right? Further complicating matters is that on a boring old x86_64 Linux box, it gets reported as a segmentation fault both ways.

linux$ ./bs
running in main
Segmentation fault
linux$ ./bs foo
running in worker thread
Segmentation fault

A simple twiddling of the ulimits will change the behavior ever so slightly:

linux$ ulimit -s unlimited
linux$ ./bs
running in main
linux$ ./bs foo
running in worker thread
Segmentation fault
linux$ 
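
A plausible reading of those transcripts, which the runs above do not verify by themselves: "ulimit -s" only moves the main thread's limit (RLIMIT_STACK), while a worker thread's stack size is fixed when the thread is created. As far as I know, glibc picks its default thread stack size from the soft limit at startup and falls back to 2 MiB on x86_64 when the limit is unlimited, and macOS gives secondary threads 512 KiB by default. Either way, an 8 MiB buffer does not fit. Here is a minimal sketch for checking both numbers on a given box. The function names are made up for illustration; the calls themselves are plain POSIX.

```cpp
#include <pthread.h>
#include <sys/resource.h>

#include <cstddef>

// Soft RLIMIT_STACK for this process, i.e. what `ulimit -s` adjusts.
// Comes back as RLIM_INFINITY when the limit is "unlimited", 0 on failure.
rlim_t MainStackSoftLimit() {
  struct rlimit rl;
  if (getrlimit(RLIMIT_STACK, &rl) != 0) {
    return 0;
  }
  return rl.rlim_cur;
}

// Stack size recorded in a default-initialized pthread attribute object.
// Assumption: std::thread creates its thread with default attributes, so
// this is the size a plain std::thread worker ends up with.
size_t DefaultThreadStackSize() {
  pthread_attr_t attr;
  size_t size = 0;
  if (pthread_attr_init(&attr) != 0) {
    return 0;
  }
  if (pthread_attr_getstacksize(&attr, &size) != 0) {
    size = 0;
  }
  pthread_attr_destroy(&attr);
  return size;
}
```

Printing both values before and after "ulimit -s unlimited" should show which of the two actually moved. If a worker really does need a big stack, pthread_attr_setstacksize() with a raw pthread_create() is the usual route, since std::thread has no knob for it.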

Fun fun fun. Obviously, I need to rethink the way I manage my buffers.