Troubleshooting another spot of downtime
Yep, the site was down this morning. Method of detection? Certainly not any automated monitoring. Nope, in this case, I got mail from someone who was kind enough to notice and who guessed at a contact e-mail address (which worked -- thanks Lin!), and sure enough, the site was toast. Time for some troubleshooting.
Obviously, Apache was down, but why was it down? It was down because it wouldn't restart. (Duh.) But, why would it not restart? Time to try it by hand.
# service httpd start
Starting httpd: [Thu Dec 31 09:24:10 2015] [warn] module ssl_module is already loaded, skipping
(98)Address already in use: make_sock: could not bind to address [::]:443
It then followed with an "[OK]", but things were certainly not OK. Apache was not running. A quick 'ss' and a double-check with netstat confirmed: no, nothing is bound to the port, as you would expect with the web server down. So, that can only mean one thing: the web server is stupid enough to try to bind to the port twice and is conflicting with itself.
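For reference, the nothing-is-listening check is quick; a sketch of it (the exact ss filter syntax varies a little between versions, and netstat needs the net-tools package):

```shell
# Is anything actually bound to the HTTPS port? ss first, netstat as a second opinion.
ss -ltn 'sport = :443'
# An empty table (just the header line) means nothing is listening on 443.

netstat -ltn 2>/dev/null | grep ':443 ' || true
# No output here either: nothing bound, as you'd expect with the web server down.
```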
Into the conf.d directory I went, and it was obvious what I had to do.
# grep -r 443 .
./some_domain.conf: Listen 443
./ssl.conf: Listen 443
some_domain.conf is one of the files I've written myself. That's not its real name, but that's not important to the story. The point is, it has its own config for all things SSL/TLS on the site: all of the yak-shaving stuff I did earlier this year to make it all look nice on those "grade your SSL" pages.
But, what's this? ssl.conf also existed, and it was sporting an identical Listen. What's ssl.conf? That's easy: it comes with an install of the httpd rpm... or an update. Yep, an update, like the one I had just run on purpose.
So this brings us to a good couple of hours ago: I had decided it was time to upgrade things on the box, and I just let the update run to completion by itself. The first machine used to do RHEL upgrades by itself without trouble, so what could go wrong? Then I walked away without checking things, because it was late, and what could go wrong, and all of those sentiments.

From personal experience (which we'll get to later), it's clear what happened: despite my home-brewed config management stuff which sets up all of the files in and under /etc/httpd, nothing stops something from adding a new, unmanaged file to the mix. In this case, it was a file which I probably deleted ages ago while migrating to this machine, and then RPM "helpfully" put back during the update.
This re-introduced the extra Listen directive, and the rest is history.
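With hindsight, this class of problem is easy to spot before a restart bites you. A sketch of a duplicate-Listen detector, assuming the stock RHEL-ish /etc/httpd layout (adjust the paths for your distro):

```shell
# Any Listen directive that appears in more than one place across the config
# tree is trouble waiting for a restart.
confdirs="/etc/httpd/conf /etc/httpd/conf.d"
grep -rhE '^[[:space:]]*Listen\b' $confdirs 2>/dev/null \
    | awk '{$1=$1};1' | sort | uniq -d
# Any output here means two files are fighting over the same port.
```

The awk bit just normalizes whitespace so "Listen 443" and "  Listen 443" count as the same directive.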
I now have a placeholder file with that same name in there just to keep it from "helpfully" reappearing on a subsequent update.
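The placeholder itself is nothing fancy. As I understand RPM's handling of config files, an update that finds a file already sitting at that path drops its copy alongside as a .rpmnew instead of overwriting it, so an empty-but-present file is enough. A sketch (the real target on a stock box would be /etc/httpd/conf.d/ssl.conf; it's pointed at /tmp here so the sketch is harmless to run as-is):

```shell
# Real path on a stock RHEL-ish box: /etc/httpd/conf.d/ssl.conf
placeholder="${TMPDIR:-/tmp}/ssl.conf"
cat > "$placeholder" <<'EOF'
# Intentionally-empty placeholder. The real TLS config lives elsewhere.
# This file exists only so a package update can't resurrect the stock
# ssl.conf; the update should leave an ssl.conf.rpmnew next to it instead.
EOF
```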
This should sound very familiar to long-time readers... because it's the same thing which happened three -- almost four -- years ago: another RPM update resurrected a 'dead' config file and hilarity ensued.
There, it was mod_php. Here, it was SSL configs. Same effect: a dead site.
One other part of this story may be making you go "wait a minute..." here, and it should. Remember the whole bit about configuration management systems only managing the files they know about, and not noticing new ones?
"Any file in a magic directory becomes part of the active configuration."
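One way to close that gap with a home-brewed setup like this: keep a manifest of the files you expect in the magic directory, and flag anything else that shows up. A sketch, where the manifest file and both paths are made up for illustration:

```shell
# Flag anything in the magic directory that the manifest doesn't mention.
confd="/etc/httpd/conf.d"
manifest="/etc/httpd/managed-files.txt"   # one expected filename per line (hypothetical)
for f in "$confd"/*.conf; do
    [ -e "$f" ] || continue               # empty glob: nothing to check
    grep -qxF "$(basename "$f")" "$manifest" 2>/dev/null \
        || echo "unmanaged: $f"
done
```

Run that after every package update (or from cron) and a resurrected ssl.conf shows up as "unmanaged" instead of as a dead website.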
So, there's my story: nothing new under the sun, and the same failure modes as before. The only thing that's any different is how much faster I fixed things this time because they all seemed too familiar.
There's a lot of stupidity in this story. How much can you find? I'm sure someone will be along shortly to tell me how these things should work, from monitoring, to root-causing, to patching stuff in Apache, to sending things upstream.
To those people: you aren't wrong... but you don't know the whole story, either.