Software, technology, sysadmin war stories, and more.
Monday, September 14, 2020

Thunderbolt 3 dock plus 2020 MacBook equals crash

If you have a 2020 MacBook Pro or Air and are trying to use a Thunderbolt 3 dock, and your machine keeps dying, it's NOT you. It's not your machine. It's ALL OF THEM. Go look at reddit if you don't believe me.

Let me back up here a little. Warning: intense navel-gazing ahead.

For the past few years, I've been doing my thing on my daily driver "early 2015" MBP which was working fine. It was the last model before the touchbar models came out, and I purchased it explicitly (in 2017) for that, figuring it would bridge me over the bad times.

Earlier this year, right around the time the world was locking down, it started going crazy. The video would flash seemingly randomly, exposing some windows in a nonsensical fashion. The graphics would hang for seconds at a time, or it would just freeze completely. The logs were filled with cruft about "GPU hangs", and clearly it had gotten itself into a state it could not recover from.

While this was going on, I realized Apple had just come out with post-touchbar models that actually had a real keyboard again, and ordered a 2020 MBA, figuring it would take over, and the old machine would just be a loss. I thought the old machine had some deep hardware issue, and since it had just exited the 3 year AppleCare timeframe (bought in 2017, remember), that was that.

That brings us up to around the last time I wrote about this. While waiting the ~month for the new machine to come in, I shifted everything onto my old Mini and tried to hang in there as best I could. Things were slow and weird but I managed to keep writing and generally working on stuff for clients.

The first week of June, I found this thing that said that Apple's "2020-002 security update" patch had completely hosed the GPU on Broadwell machines, and guess what? That is exactly what I had. These brave souls had determined that if you reinstalled the OS over the Internet, it would get a base install just prior to that update. Then, you could block installation of that update somehow, and it would stay out. Your machine would remain stable.

Yes, really.

This got me looking again, and by this point, 2020-003 had come out, and it claimed to have a fix. What luck! I installed it, put in a fresh battery, and started using it like normal. That, too, was fine. The long nightmare of having it barf every time I looked at it funny was over. My old solid machine was back.

And, you know what? My brand new (and not cheap) machine came in the mail direct from the factory the very next day.

I have the strangest luck with these things.

Well, given that I paid for it, I might as well try it, and it turned out to be a decent little machine. With the use of a Thunderbolt 3 to Thunderbolt 2 adapter, I was even able to still use my 2014 vintage TB monitor on this new machine. Things seemed to be going well.

Then, I took it too far. I flew too close to the sun. I thought it was my time to finally enjoy the "single cable solution" USB-C/Thunderbolt 3 thing that people have been talking about for years. I never got it to work on my company machine at that pink mustache factory, but surely this time would be different. I ordered a dock that would do everything all in one place: Ethernet, an SD card reader, charging, you name it. A single cable would connect the laptop and so it would be nice and quick to go between my desk and my couch.

Ever since then, it's been no end of stupidity every time I return the machine to the dock. It seems like if it's away from the dock for long enough, perhaps at the couch overnight, then when I bring it back, it can't sync up again. If I then try to say "oh, well, time for a 'Microsoft fix' as we used to call it in the '90s" and reboot the box, it actually hangs during the shutdown and falls over dead.

That's right, the machine fails to shut down, and instead crashes. When it comes back up, it says "your machine was shut down due to a problem" which effectively means "I panicked last time I was running" and gives me the pop-up to type in some comments and fire them off at Apple.

I started out nice, but I have been adding more and more venom and snark to these reports as time has gone by.

A couple of days ago, I decided to take matters into my own hands and see if it's somehow related to the specific Belkin dock I had purchased. Yes, even though this is clearly the computer's fault, I wanted to see if it was being triggered by that one dock or something else. You see, no matter what craziness might be present out there, it is NOT OKAY for the machine to freak out and lock up. That means they aren't being sufficiently defensive in their coding and Jon Postel is spinning around in a grave somewhere.

So, I ordered another dock. This one is from OWC, a long-time hotbed of Apple fan-people and whatnot. I figured if anything's going to be well-supported, this would be the one. It landed a couple of days later, and I swapped it into place.

Nothing changed. The thing is just as stupid as it ever was.

This has ratcheted up my level of bile and venom to "oh, I'm writing about this now", and so, here we are.

Clearly, creating both the hardware AND the software is not enough for them to get their shit together. They have managed to ship a TB3 implementation that can be led completely astray when certain hotplug events happen, possibly with certain other power management modes happening in between.

Also, when it does go stupid, all kinds of interesting things happen. The "Thunderbolt" screen on the System Information thing (the icon with calipers over the chip) won't even load sometimes when this is going on. It's like the TB subsystem is so screwed up that it can't even answer a query for a dumb little userspace tool.

Of course, since the hood is welded shut on these things, there is exactly nothing I can do about it. I trust them to get this stuff right, because my hands, feet, arms, legs, and everything else are tied when running in their ecosystem.

On a Linux box, I would have probably bisected it to a bad kernel release and then gone in deep with the printks to look to see exactly where things went bad. I could get it down to a minimal repro case. Then I could find the seven missing characters and fix it. (Seriously, most "bus access" problem stuff tends to be the most stupid code bugs you can imagine.)
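For anyone who hasn't done it, "bisecting" is just a binary search over an ordered list of releases. Here's a minimal sketch of the idea, not any real tool: the release names are made up, and `is_bad` is a hypothetical stand-in for "boot that kernel and see if the dock hangs it". It assumes the oldest release is known good, the newest is known bad, and once things break they stay broken.

```python
from typing import Callable, Sequence


def first_bad(releases: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest release for which is_bad() is true.

    Assumes releases are ordered oldest to newest, the first one is known
    good, the last one is known bad, and every release after the first bad
    one is also bad.
    """
    lo, hi = 0, len(releases) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(releases[mid]):
            hi = mid  # the breakage is at mid or earlier
        else:
            lo = mid  # the breakage is after mid
    return releases[hi]


# Hypothetical release names; index 3 stands in for the release that broke things.
releases = ["5.4.0", "5.4.1", "5.4.2", "5.4.3", "5.4.4", "5.4.5"]
culprit = first_bad(releases, lambda r: releases.index(r) >= 3)
print(culprit)
```

Each test cuts the remaining range in half, so even a few hundred candidate releases only take a handful of boots. The catch is that every step needs a system you can rebuild and boot at will.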

But no. All I can do here is embarrass them until they do something about it.

I mean, it worked before.

Let's see if it'll work again. This has gone far enough.

October 3, 2020: This post has an update.