Software, technology, sysadmin war stories, and more. Feed
Friday, April 13, 2018

Don't blame std::string for your vector accesses

In early 2017, I wrote a summary of the sorts of stuff I had been up to. One of them was a brief description of some badness people would encounter while using std::vector in C++. Given how brief that mention was last year, it seems only appropriate to go back and add some more meat to this story.

It starts like this: someone's service goes down, and it takes down a chunk of something much bigger which depends on it. You start looking to find out what's going on, and find out that all of the instances of their service have been "crashlooping" - starting up, running for a short time, and then dying again.

The container/job management stuff keeps respawning the tasks, but they crash so quickly that hardly any progress can be made. Obviously, someone has to figure out why it's crashing and what can be done about it, so the actual debugging starts.

Eventually, someone notices that it's actually a segmentation fault (SIGSEGV), and with the help of some logging magic that captures the stack trace, they find that ... it's crashing ... in std::string? "Wow! I managed to find a bug a standard C++ library!", they think. Yeah, well, not so fast. There's probably something else going on.

Sure, gdb does say something like this...

Program received signal SIGSEGV, Segmentation fault. 0x00007ffff7b27cab in std::basic_string<char, std::char_traits<char>,
std::allocator<char> >::basic_string(std::string const&) ()

... but that's just where the car hit the wall at the end of the wreck. The metaphorical banana in the road or cell phone stuck under the brake pedal happened a bit earlier. std::string is left holding the bag and looks guilty, but it's only doing what you asked.

One stack frame up from the crash is "operator+". Okay, so we know we're doing a string and a + somehow. One more stack frame up from that, we get the call site: line 13 of crashy.cpp. What's on that line? This.

std::string x = blah[16] + "foo";

Hey, what's blah, anyway? Go a few more lines up.

std::vector<std::string> blah;

Okay, it's a vector, so that [] is a glorified array access, just like the days of using plain old C. How big is this thing, anyway? Is 16 a valid index?

This takes a lot of work, since you have to find out how that vector gets populated, and how big it was at the time of the crash. After a lot of snooping and poking at things, you determine that no, 16 is not a valid index into this thing. What does that mean, then?

Ultimately, it means you're handing a bad pointer into the "+" call, and when it tries to chase it down, things blow up. It's pretty straightforward now.

I saw this far too many times over the years with the [] operator. Sometimes, it's because the index is just a number received from the network and it's used without any sorts of bounds-checking. Ask yourself, do you have code like this?

vector<string> err_string = { "foo", "bar", ... ... ... };
string error_meaning(int errnum) {
  return err_string[errnum];

I've seen this happen when clients get back an error code from the server and decide to "helpfully" map it back to a string by using it as an index into a local vector of strings. Guess what happens the first time the server returns a new error number? All you need to do is have the server evolve slightly faster than the client and this could happen.

Another variant of this situation is to use .front() or .back() to (attempt to) access the vector, even though the vector is empty. It blows up just the same.

Still another flavor of badness is where someone has two different vectors loaded, and assumes they will always be the same size. The code looks something like this:

for (i = 0; i < x.size(); ++i) {

It's like... hold on there! i is constrained to the bounds of x, but it has no connection at all to the bounds of y! It's a really bad idea to go and use it to reach into that second vector. You don't know what kind of monsters are lurking in there. What happens the first time that x is bigger than y? That's right, the y access goes into the weeds. Boom.

Somewhere in the past few paragraphs, at least one person has started drafting a comment to HN about how you could use .at() and then it would do bounds checking and would also throw an exception for an invalid access. This is true. You could in fact use .at() and get an exception thrown when you go off the rails. Some people do exactly that.

Unfortunately, I've also seen that same code neglect to handle the thrown exception, at which point it bubbles all the way up and slays the program anyway by aborting it. So, yeah, you can do that if you're willing to pay for it, but you have to actually make sure you deal with the consequences!

SIGSEGV or SIGABRT... either way, you're toast.