Writing

Software, technology, sysadmin war stories, and more.
Wednesday, March 24, 2021

Compile times, and why "the obvious" might not be so

One of the double-edged swords about having done some of this work for a while is that I have a bunch of random things that I take for granted. These are mostly just bits of data and other knowledge that I picked up a long time ago and don't even talk about any more. This is actually a bad thing in certain situations, since everyone comes from a slightly different place in terms of what they know and do and their specialties.

To give an example, last month I wrote about some complaints I had about programs encountered in the real world. One of them was about garbage collection and how I didn't like it. Some people wrote in seemingly unaware that their language of choice had it. This blew my mind since when I think of that language, "it uses GC" is right up there next to "it runs on a virtual machine". It practically *defines* the thing for me, given that I've had to deal with it making my life interesting at times.

Given this, it made me wonder how many things I've woven into my head at some deep level that other people might benefit from seeing written down. Granted, this means that anyone else who also knows about the topic at hand might groan and go "well duh, everyone knows that", but that's the whole point. Not everyone does. Have some conversations with real people and you will realize the wide spread that exists.

One thing which got me thinking the other day was seeing a post where someone was complaining about build times. Apparently they can be really long and annoying. I got the impression they were starting from scratch every time, and wondered why anyone would do that. Then I started thinking about this double-edged sword of things taken for granted and figured "because nobody bothered to tell them about it".

So, in that vein, I am going to describe something that happens all the time when I work on stuff, and it seems completely ordinary and boring to me, but might well seem like magic to someone who hasn't seen it yet. It's not magic, though. It's just another way of doing stuff.

Let's say I'm working on something in C++. I have my code split up into a bunch of different files and directories. Usually a project will get its own directory, and then there are others which hold common code which is useful in multiple projects - logging, database connections, that sort of thing.

Take the feedback handling thing on this site for example. It's in a directory called feedback, and the entire thing lives in a file called savefeedback.cc. I've written a little about this before. This one file has a main() and it uses a whole bunch of other gunk to get its job done.

It uses my logging library, a dumb little "Result" wrapper I wrote (a story for another time), my CGI parsing stuff to handle the incoming POST data from the comment form, something to generate 4xx or 5xx errors if something goes wrong, and my stupid little MySQL client library wrapper thing to let it write to the database.

All of those other libraries have their own .cc and .h files, and they get compiled into object files - .o. base/logging.h and base/logging.cc turn into base/logging.o (in the output path, not in the source tree). The same goes for all of the other bits: cgi, mysqlclient, etc.

Once one of those object files has been generated, it doesn't get generated again until something it depends on changes. Take base/logging.o. If logging.cc or logging.h is updated, then yeah, it'll get recompiled, and a fresh .o file will drop out. Otherwise, that same object file will sit there for days, weeks, or even months, since there is no reason to rebuild it.
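To make that rule concrete, here is a minimal sketch of the "does this object file need to be rebuilt?" decision. It assumes the check is done by comparing file modification times, which is how Make does it; the post doesn't say how the build tool described here makes that call. The names is_stale and needs_rebuild are made up for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <filesystem>
#include <optional>
#include <vector>

namespace fs = std::filesystem;
using FileTime = fs::file_time_type;

// Pure decision (hypothetical helper): given the output's timestamp
// (nullopt if it has never been built) and the timestamps of everything
// it depends on, does the output need to be regenerated?
bool is_stale(std::optional<FileTime> output_time,
              const std::vector<FileTime>& input_times) {
  if (!output_time) return true;  // never built, so build it
  for (const FileTime& t : input_times) {
    if (t > *output_time) return true;  // an input changed after the last build
  }
  return false;  // every input is older than the output: leave it alone
}

// Filesystem wrapper (hypothetical helper). Assumes every input exists;
// fs::last_write_time throws fs::filesystem_error if one is missing.
bool needs_rebuild(const fs::path& output, const std::vector<fs::path>& inputs) {
  assert(!inputs.empty() && "an object file should depend on at least one source");
  std::optional<FileTime> output_time;
  if (fs::exists(output)) output_time = fs::last_write_time(output);

  std::vector<FileTime> input_times;
  for (const fs::path& in : inputs) {
    input_times.push_back(fs::last_write_time(in));
  }
  return is_stale(output_time, input_times);
}
```

With something like this, a call such as needs_rebuild("out/base/logging.o", {"base/logging.cc", "base/logging.h"}) would only say yes when one of the two sources has been touched since the object file was written. The "out" directory is an assumed output path, not one named in this post.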

Meanwhile, the thing I'm actually working on - feedback/savefeedback.cc - gets recompiled and turns into feedback/savefeedback.o constantly. Then, because it's also the topmost module with a main() in it, the linker is run to combine that .o file and the others (logging, cgi, mysqlclient, etc.) into the final binary output file.

This is usually a pretty fast process, and it's partially thanks to the fact that I don't have to start from scratch every time. I'm not even doing parallel builds here. My dumb little build tool still doesn't support that since I honestly haven't needed it in the nearly 10 years it's been around. Even though the machine itself is about as old (built in the summer of 2011), it's fast enough for my purposes.

Am I building something huge? No, not really. Am I doing ridiculous metaprogramming tricks? No. Do I have a bunch of people making changes all over the place? No. Will this be sufficient for everyone? Of course not.

I just wanted to state something that's obvious to me that maybe not everyone has heard of yet: you don't have to start from zero every time if you trust your build system.

I know this is obvious if you've used Make and really gotten in there and done it up properly so it knows what needs to go again and what can be left alone. Remember, I'm talking about people who may have never touched that thing and have no idea what I'm talking about.

If this is or was obvious to you, then this post wasn't for you. You probably walked some of the same roads as me and encountered it, too. It doesn't make us special.

If you can't trust your build system, maybe that's worth investigating. If your build process involves creating a whole environment from scratch every time (and then tearing it down again) and that's costly and/or slow, maybe that's not the way you should be doing things.

...

I'll try another pass at this, in case the above didn't land.

Let's say you have a bunch of files which get compiled into some intermediate form, and then get linked together into some final form. You write a small shell script, batch file, punched card, or whatever to do the equivalent of this:

compile a.source into a.object
compile b.source into b.object
...
link a.object, b.object, ... into binary

You then run those steps every time. This means you re-generate all of the .object files every time through, even if their inputs didn't change. This probably is not what you want. To avoid this, you need a build system which is a little more aware of what's going on and is not just a list of commands to run every time.
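As a rough illustration of what "a little more aware" could mean, here is a sketch that takes the same a/b/binary steps and works out which ones actually need to run. Everything in it is an assumption made for the example: the Step struct, the plan_rebuild name, and the use of plain numbers in place of real file timestamps. It is not how any particular build tool does it.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// One build step (hypothetical type): an output and the files it is made from.
struct Step {
  std::string output;
  std::vector<std::string> inputs;
};

// Returns the outputs that have to be regenerated, in order.
// `steps` must be listed so that a step comes after the steps that produce
// its inputs. `mtimes` maps file name -> last-modified time (any increasing
// number); an output missing from the map has never been built.
// Simplification: an input with no recorded time is not treated as a change.
std::vector<std::string> plan_rebuild(const std::vector<Step>& steps,
                                      const std::map<std::string, long>& mtimes) {
  std::set<std::string> rebuilt;   // outputs regenerated during this run
  std::vector<std::string> todo;
  for (const Step& step : steps) {
    assert(!step.output.empty() && "every step needs an output name");
    const auto out = mtimes.find(step.output);
    bool stale = (out == mtimes.end());  // never built
    for (const std::string& in : step.inputs) {
      if (stale) break;
      const auto it = mtimes.find(in);
      // An input counts as newer if it was just rebuilt in this run, or if
      // its recorded time is later than the output's.
      stale = rebuilt.count(in) > 0 ||
              (it != mtimes.end() && it->second > out->second);
    }
    if (stale) {
      rebuilt.insert(step.output);
      todo.push_back(step.output);
    }
  }
  return todo;
}
```

Feed it the three steps from the list above with everything up to date and it hands back an empty plan. Bump the time on a.source and it asks for a.object and binary, and leaves b.object alone.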

You might have a problem if every time you run your build command, it takes the same amount of time, and it's non-trivial, even if you just up-arrow, hit enter, and run it again. If two back-to-back runs with no changes in between take a very long time, worry.

You might have a good system if you can do that up-arrow+enter thing and it comes right back and does nothing since nothing changed. If you can keep doing it and have it just shrug and do nothing every time, that seems decent to me.

If you then change one file and it runs a tiny bit of work and comes back with the new result, that's great!

...

If any of this was new, consider poking at your build system to find out what's actually going on. Try instrumenting the steps to find out where you're spending time waiting on something to run. See if it scales with the number of changes applied to the inputs (source code).
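If your build is driven by code you control, instrumenting it can be as small as wrapping each step in a timer. This is just a sketch: StepTiming, run_timed and slowest_step are names invented for the example, and the step itself is whatever callable kicks off your compile or link.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <string>
#include <vector>

// How long one named step took (hypothetical type).
struct StepTiming {
  std::string name;
  std::chrono::milliseconds elapsed;
};

// Runs `step`, appends how long it took to `log`, and returns that duration.
std::chrono::milliseconds run_timed(const std::string& name,
                                    const std::function<void()>& step,
                                    std::vector<StepTiming>& log) {
  assert(step && "step must be callable");
  // steady_clock is used because it never jumps backwards.
  const auto start = std::chrono::steady_clock::now();
  step();
  const auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
      std::chrono::steady_clock::now() - start);
  log.push_back({name, elapsed});
  return elapsed;
}

// Name of the slowest recorded step, or "" if nothing has been recorded.
std::string slowest_step(const std::vector<StepTiming>& log) {
  const auto it = std::max_element(
      log.begin(), log.end(), [](const StepTiming& a, const StepTiming& b) {
        return a.elapsed < b.elapsed;
      });
  return it == log.end() ? std::string() : it->name;
}
```

Run the build twice with no changes in between and compare the two logs. If the second run's numbers look just like the first, that's the "starting from scratch every time" situation described earlier.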

Make sure something actually is slow before spending effort on trying to fix it. Measure twice, cut once.