Why I usually run 'w' first when troubleshooting unknown machines
What's the first command you run upon jumping on a wayward Linux box to try to troubleshoot something? For me, it's almost always "w". Unless I have data pointing me in some other direction before landing on the system, I like to see that as a sort of "first snapshot" of what the box is up to before I go off and possibly do other things.
Why bother? Well, over the years, I've discovered a number of bizarre things just from the odd little bits of data which will be returned in that command's output. Here's a mock output:
     17:02:13 up 23 days,  1:08,  2 users,  load average: 0.05, 4.13, 2.11
    USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
    alice    pts/0    abc.example.com  12:14   10:37   0.78s  0.00s /bin/sh
    rachel   pts/1    xyz.example.com  17:02    0.00s  0.00s  0.00s w
Right there, I can see that okay, the box hasn't been rebooted recently, it was busy a few minutes ago but isn't too busy right now, and oh hey, some other person is on the box poking at things, and has been most of the afternoon. It looks like they last did something interactive about 10 minutes ago, and isn't that about the time that the badness started?
The next thing I'd do is to go get in touch with that person. It would be foolish to continue when the answer might be a few short chat messages away.
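If you wanted to script that first glance rather than eyeball it, a rough sketch might look like the following. It parses the mock output shown earlier, pulls out the load averages, and lists anyone else who is logged in. The function names, the `busy_threshold` value, and the "FROM column is always populated" assumption are all made up for illustration. Real "w" output varies between systems and versions, so treat this as a starting point and not a robust parser.

```python
import re
from typing import Dict, List, NamedTuple, Tuple

# The mock output from the text, laid out the way "w" prints it.
MOCK_W_OUTPUT = """\
 17:02:13 up 23 days,  1:08,  2 users,  load average: 0.05, 4.13, 2.11
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
alice    pts/0    abc.example.com  12:14   10:37   0.78s  0.00s /bin/sh
rachel   pts/1    xyz.example.com  17:02    0.00s  0.00s  0.00s w
"""

COLUMNS = ("user", "tty", "from", "login", "idle", "jcpu", "pcpu", "what")

# Matches the summary line as it appears in the mock output above.
HEADER_RE = re.compile(
    r"up\s+(?P<uptime>.+?),\s+(?P<users>\d+)\s+users?,\s+"
    r"load average:\s+(?P<l1>[\d.]+),\s+(?P<l5>[\d.]+),\s+(?P<l15>[\d.]+)"
)


class WSummary(NamedTuple):
    uptime: str
    users: int
    load_1m: float
    load_5m: float
    load_15m: float


def parse_w_header(line: str) -> WSummary:
    m = HEADER_RE.search(line)
    if m is None:
        raise ValueError(f"unrecognized w header: {line!r}")
    return WSummary(m["uptime"], int(m["users"]),
                    float(m["l1"]), float(m["l5"]), float(m["l15"]))


def parse_w(text: str) -> Tuple[WSummary, List[Dict[str, str]]]:
    lines = [ln for ln in text.splitlines() if ln.strip()]
    summary = parse_w_header(lines[0])
    rows = []
    for line in lines[2:]:  # skip the summary line and the column headings
        parts = line.split(None, len(COLUMNS) - 1)
        if len(parts) == len(COLUMNS):  # assumes FROM is never blank
            rows.append(dict(zip(COLUMNS, parts)))
    return summary, rows


def was_busy_but_calm_now(s: WSummary, busy_threshold: float = 1.0) -> bool:
    # Arbitrary heuristic: 5-minute load was high, 1-minute load is not.
    return s.load_5m >= busy_threshold and s.load_1m < busy_threshold


def other_users(rows: List[Dict[str, str]], me: str) -> List[str]:
    return [r["user"] for r in rows if r["user"] != me]


if __name__ == "__main__":
    summary, rows = parse_w(MOCK_W_OUTPUT)
    print(summary)
    print("busy earlier, calm now:", was_busy_but_calm_now(summary))
    print("also on the box:", other_users(rows, me="rachel"))
```

On a live box you would feed it the real output (for example via `subprocess.run(["w"], capture_output=True, text=True).stdout`), but the interesting part is still a human looking at who else is there and when they were last active.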
Sometimes, I get on the box, and it's just me. That is, there's just one user on board, that user is me, and I'm running my "w". Nothing else is there. Many times, I've gone off and looked at the part of syslog which captures login information. This might be /var/log/secure depending on how the system is set up. I've found that grepping for "accepted cert" is a great way to look for prior ssh connections (possibly for interactive logins) while discarding a bunch of other stuff that's relatively uninteresting.
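As a minimal sketch of that filtering step, here is the same idea as a few lines of code. The match is case-insensitive on the assumption that the capitalization in the log may not match the phrase as typed here. The default path is just the /var/log/secure example from above, and the function names are invented for this sketch. The exact wording sshd uses depends on its version and configuration, so the pattern may need adjusting on your systems.

```python
import re
from typing import Iterable, Iterator

# Same phrase as the grep described above; case-insensitive by assumption.
ACCEPTED_CERT = re.compile(r"accepted cert", re.IGNORECASE)


def accepted_cert_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the lines that mention an accepted certificate."""
    for line in lines:
        if ACCEPTED_CERT.search(line):
            yield line.rstrip("\n")


def scan_log(path: str = "/var/log/secure") -> Iterator[str]:
    """Stream a log file through the filter (usually needs root to read)."""
    with open(path, errors="replace") as fh:
        yield from accepted_cert_lines(fh)


if __name__ == "__main__":
    for hit in scan_log():
        print(hit)
```

This does nothing that grep with a case-insensitive flag wouldn't do. The only reason to write it out is that the resulting lines are then easy to hand to whatever comes next, such as pulling out the identity and source address.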
Obviously, I could also use 'last' to see who's been on the box recently, but that isn't the whole story. It's totally possible to "ssh root@box /path/to/command" and never start a login shell, which leaves no trace in the login records that 'last' reads, and yet still goes on to break something on the box. The syslog is how you'd find this.
I should mention that if it's clear that someone has been on a system doing interactive stuff, then I go look at the shell history. Ideally, the entries are timestamped or have some other way to associate them with a specific login session, but even just having the raw command history back to the beginning of time (on the box) is better than nothing.
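For bash specifically, timestamps show up in the history file only when HISTTIMEFORMAT was set in the session that wrote it. In that case each command is preceded by a line consisting of "#" and the epoch time in seconds. Here's a small sketch that pairs commands with those timestamps when they exist and leaves them as None when they don't. The function name is made up, and it deliberately ignores multi-line commands, whose continuation lines would come out as separate untimestamped entries.

```python
from datetime import datetime, timezone
from typing import Iterable, List, Optional, Tuple

HistoryEntry = Tuple[Optional[datetime], str]


def parse_bash_history(lines: Iterable[str]) -> List[HistoryEntry]:
    """Pair each command with the '#<epoch>' line before it, if there is one."""
    entries: List[HistoryEntry] = []
    pending: Optional[datetime] = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#") and line[1:].isdigit():
            # Timestamp marker written by bash when HISTTIMEFORMAT is set.
            pending = datetime.fromtimestamp(int(line[1:]), tz=timezone.utc)
            continue
        if line:
            entries.append((pending, line))
            pending = None  # a timestamp applies only to the next command
    return entries


if __name__ == "__main__":
    import os

    path = os.path.expanduser("~/.bash_history")
    with open(path, errors="replace") as fh:
        for when, command in parse_bash_history(fh):
            stamp = when.isoformat() if when else "(no timestamp)"
            print(stamp, command)
```

With timestamps in hand, lining up a command against the login time from the ssh logs is just a comparison. Without them, you're back to reading the raw list in order.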
I've been able to track down some well-meaning but ultimately flawed attempts at fixing things that then blew up and became something much bigger. The folks who I pinged about it were amazed that I somehow had managed to "guess" that a specific member of their team had been poking at a specific box, but there's really no magic or guessing. The ssh logs show who it was (from the cert identity and/or source address), and the bash logs show what they ran. From there, the rest is academic.
It would be great if all of this was already being aggregated and analyzed and you'd get some kind of warning that "hey, so and so was on this box 15 minutes ago and ran some kind of command as root", but it's been my experience that many places don't have that kind of automation and really don't care enough to ever bother. This means that I have to fall back to old-school first principles to debug things, since there's no clearer signal in those cases.
To be clear, I wish I didn't have to ssh in and start poking at boxes directly, but until we as an industry move on from running things that way, I will sometimes have to resort to tracking down changes like this.
If you want to impress me, set up a system at your company that will reimage a box within 48 hours of someone logging in as root and/or doing something privileged with sudo (or its local equivalent). If you can do that and make it stick, it will keep randos from leaving experiments on boxes which persist for months (or years...) and make things unnecessarily interesting for others down the road.
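The bookkeeping half of that idea is small. As a sketch, assume something upstream already produces (host, time) pairs for root logins and privileged sudo use. That feed is hypothetical, as are the function names here. The remaining job is to work out each host's deadline from its earliest such event and list the hosts that are due. Actually triggering the reimage is the hard part and is not shown.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Tuple

REIMAGE_WINDOW = timedelta(hours=48)  # the 48 hours from the text


def reimage_deadlines(
    events: Iterable[Tuple[str, datetime]]
) -> Dict[str, datetime]:
    """Map each host to a deadline based on its earliest privileged event."""
    deadlines: Dict[str, datetime] = {}
    for host, when in events:
        deadline = when + REIMAGE_WINDOW
        if host not in deadlines or deadline < deadlines[host]:
            deadlines[host] = deadline
    return deadlines


def hosts_due(deadlines: Dict[str, datetime], now: datetime) -> List[str]:
    """Hosts whose deadline has already passed, in sorted order."""
    return sorted(h for h, d in deadlines.items() if d <= now)


if __name__ == "__main__":
    # Made-up events purely for illustration.
    events = [
        ("web1", datetime(2024, 1, 1, 9, 0)),
        ("web1", datetime(2024, 1, 2, 9, 0)),
        ("db1", datetime(2024, 1, 3, 9, 0)),
    ]
    print(hosts_due(reimage_deadlines(events), now=datetime(2024, 1, 3, 12, 0)))
```

Keying the deadline off the earliest event means a later login doesn't push the clock back, which is what keeps experiments from lingering.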
Or, hey, somehow automate things to where nobody ever needs to ssh in. If you can swing that, then you will be way ahead of some outfits that should know better but can't manage to make it happen due to the sheer inertia. If you can make it happen while your organization is still relatively small, so much the better.
May your outages be few, and your logs filled with helpful data.