One screenshot explains everything in 15 seconds
A couple of weeks after leaving my job working support tickets and phone calls, I was still solving things for people. One afternoon, I was chatting with a friend back at that company and he mentioned having to go on the KVM. This basically implied that some customer's machine was so badly broken that the datacenter techs had to pull it out of the rack and hook it to a system which would let support access its console. That's never a good sign.
I naturally asked what was happening. He said some customer had a box which would not permit logins at the console. When I asked for more, he just said he thought "the firewall was flapping", and others thought "it was the server". That firewall thing made no sense to me. Why would that stop logins at the console? He said it just takes the password and sits there forever. I said that short of some really horrible LDAP-type rigging (which would be extremely uncommon for a single-whitebox hosting customer), local logins should Just Work, network or no.
He was able to boot into a rescue environment and mount the filesystem to look around, and logs seemed to indicate the box would come up normally from its own hard drive/filesystem. It just would not allow logins at the console. I figured, okay, let's find some way to be booted from that filesystem and also be in as root without going through login. Let's put in our own little crazy backdoor. I had him add this to /etc/inittab:
kb::kbrequest:/sbin/agetty -n -l /bin/bash tty12 115200
I had picked this up by snooping around on my old ServerBeach machine. They added it to their kickstarts back in those days, and it let them get a root shell at the console without needing any passwords. The trick is that by default, Alt+Up turns into a keyboard signal event to init, and it'll run whatever you put there. Then you just Alt+F12 over to the newly-created virtual console, and there's your shell.
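For anyone who hasn't stared at a sysvinit inittab in a while, each entry is four colon-separated fields: id, runlevels, action, process. The sketch below just splits that line apart to show which piece is which. It is an illustration only, not anything init itself runs, and the names InittabEntry and parse_inittab_line are made up for this example. In the process field, agetty's -n skips the login name prompt and -l names the program to run in place of login, which is why you land straight in bash.

```python
from typing import NamedTuple


class InittabEntry(NamedTuple):
    """One sysvinit inittab entry (hypothetical helper type for illustration)."""
    id: str
    runlevels: str
    action: str
    process: str


def parse_inittab_line(line: str) -> InittabEntry:
    # Format is id:runlevels:action:process. Split at most three times so
    # any colons inside the process command line are left alone.
    ident, runlevels, action, process = line.strip().split(":", 3)
    return InittabEntry(ident, runlevels, action, process)


entry = parse_inittab_line(
    "kb::kbrequest:/sbin/agetty -n -l /bin/bash tty12 115200"
)

# The runlevels field is empty here; the entry is triggered by the
# keyboard request (the "kbrequest" action), not by a runlevel change.
print(entry.action)
print(entry.process)
```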
My friend did this, then rebooted off the disk and gave it the magic poke to get the shell running, and dropped into it. I asked him to strace login, but he said "no trace output". This was driving me nuts, since I couldn't see the machine, couldn't touch the machine, and only knew what he was telling me. I needed to know more!
Finally I just asked if he could grab a screenshot of his KVM session. I needed to see what he was seeing, and not just what he was choosing to tell me. The screenshot shown here is what he sent me, minus root's shadow password entry, which I have obliterated on purpose. Believe it or not, just looking at this was enough to troubleshoot the problem when coupled with what I already knew.
As soon as I saw this, I said: the disk is full. Stop auditd. He said "but the disk doesn't show full". I said well, then auditd is broken. Stop it and try again. He did that, and all of a sudden, logins started working as they should. Naturally, this was a WTF moment for him.
Here's what happened: auditd as shipped on certain RHEL installs of the day had a hard requirement for at least some percentage of free space on its filesystem. Once free space dropped below that, it would just refuse to do anything, and logins hung right along with it. I suspect this was some brain damage down in one of the PAM libraries which would just do blocking I/O forever instead of giving up.
Oct 4 04:02:04 web audbin: threshold 20.00 exceeded for filesystem /var/log/audit.d/. - free blocks down to 18.33%
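The numbers in that log line explain the "but the disk doesn't show full" confusion: 18.33% free is nowhere near full, but it is under a 20.00% threshold. A rough sketch of that kind of check is below. It is not the audit code's actual logic, just the same comparison done with os.statvfs, and the helper names are invented for this example. Whether the original counted all free blocks or only the ones available to non-root users is an assumption here; this uses f_bfree.

```python
import os


def free_percent(path: str) -> float:
    """Percentage of free blocks on the filesystem holding `path`."""
    st = os.statvfs(path)
    if st.f_blocks == 0:
        raise ValueError(f"filesystem for {path!r} reports zero blocks")
    # Assumption: "free blocks" means f_bfree (includes root-reserved
    # blocks), not f_bavail. The real tool may have counted differently.
    return 100.0 * st.f_bfree / st.f_blocks


def below_threshold(free_pct: float, threshold_pct: float = 20.0) -> bool:
    """True when free space has dropped under the required percentage."""
    return free_pct < threshold_pct


# Values taken from the log line: 18.33% free against a 20.00% threshold.
print(below_threshold(18.33, 20.00))

# Same check against a live filesystem (path chosen only as an example).
print(below_threshold(free_percent("/")))
```

Point free_percent at whatever directory the audit logs live in to see how close a box is to tripping the same wire.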
It was made worse by the fact that auditd was usually responsible for the disk filling up! Those systems would make tons of audit files and yet would have no scheme for rotating them or otherwise keeping them from filling things up. That means given enough time or enough activity, it would create its own problem.
Learn enough about how systems behave and you too can be a strace ninja.