Software, technology, sysadmin war stories, and more.
Saturday, August 28, 2021

Big company tale: six months for a list and a button

One of the projects I did at a big company once upon a time was evolving organically, and we eventually realized we needed a "dashboard" of sorts. That is, we wanted an internally-hosted page that would let anyone load it up and see what was going on with the service. It was intended to be simple: ping our server, ask for the status, and then render the response into a simple list. Later on, we knew we were going to want to add the ability to send a "panic" signal to our server from this page, so that anyone in the company could hit the "big red button" if things went haywire.

This, then, is a story of trying to make that status page happen at a big company. I'm going to use dates here to give some idea of how long a single task can drag on. We'll start from January 1st to make it easy: dates are correct relative to each other here.

January 1: we put up a terrible hack: a shell script that runs a Python script that talks to the service to get the status and then dumps out raw HTML to a file in someone's public_html path.
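For a sense of scale, a hack like that can be tiny. What follows is only a rough sketch, not the script we actually ran: `fetch_status()` is a stand-in that returns made-up data instead of talking to a real service, and the item fields (`name`, `state`) and the output path are assumptions for illustration.

```python
import html
import os
import tempfile


def fetch_status():
    """Hypothetical stand-in for the real call to the service."""
    return [
        {"name": "frontend", "state": "ok"},
        {"name": "storage <east>", "state": "degraded"},
    ]


def render_html(items):
    """Render status items as a bare-bones HTML list, escaping every value."""
    rows = "\n".join(
        "  <li>{}: {}</li>".format(html.escape(i["name"]), html.escape(i["state"]))
        for i in items
    )
    return "<html><body>\n<ul>\n{}\n</ul>\n</body></html>\n".format(rows)


def write_atomically(path, text):
    """Write to a temp file, then rename, so nobody loads a half-written page."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(text)
    # mkstemp creates the file as 0600; make it readable by the web server.
    os.chmod(tmp, 0o644)
    os.replace(tmp, path)
```

A wrapper would then call something like `write_atomically("status.html", render_html(fetch_status()))` every couple of minutes, with the path pointed at wherever the web server looks.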

January 29, early: there's this team that nominally owns dashboards, and they got wind of us wanting a dashboard. They want to be the ones to do it, so we meet with them to convey the request. We make a mockup of the list and the eventual big red button to give them some idea of what it should look like.

January 29, late: asked the "dashboard team" manager via chat whether they had been able to get the network stuff talking to our server yet. No reply.

February 13: random encounter with that manager. Asked about it. Our dashboard "is on the roadmap now".

February 14: an added detail: "no telling when, though".

February 20: still waiting around for something to happen.

March 20: I mention to the rest of the group how I'm losing faith in that dashboard team. "Pretty sure we're going to need to hack up a terrible page to get them moving on this".

March 24: talking to the group again: "we still don't have a status page", and "how long does it take to make a single page?"...

March 25, early morning: I hit the wall and start doing it myself. It means writing code in a language I never use, with frameworks I have never seen before, dealing with data structures that are completely foreign. It's slow going. I have to ask questions that probably seem stupid to most people because this frontend stuff is SO not my domain. My first iteration doesn't even talk to the server: it just has a bunch of data hard-coded and is all about writing the renderer to spit out a list.

March 25, mid-morning: someone notices that one of the necessary steps to talk to the server from the frontend side of the world was never done, because we're having to do it right now to get my terrible new code to work. That is, you have to basically copy across some RPC definition type stuff to talk cross-systems, and they would have needed that as soon as they started messing around in the problem space.

The fact it's never been copied across means that nobody ever even started looking at things. It's one of those things that takes like five minutes and lets you continue with the rest of the project. This spawns the notion of a "canary macguffin" - a critical early step in the project that never happened, so you can tell that nobody ever got that far.

March 25, early afternoon: having synced the RPC stuff, the network I/O now works, and the terrible code written in this crazy moon-man language and its frameworks is now talking to production and getting Real Data.

March 25, mid afternoon: and now it's a page that other people can load up from my testing server instead of being a lump of stuff on disk that only I can run.

March 26, morning: all of the "finishing touches" that need to exist on an internal page are added: security context stuff, permission domain stuff, that sort of thing. The code is split into functions so it won't be a giant stream-of-consciousness top-to-bottom blob of garbage. Various people take pity on me and help me understand how to make it sort server-side so it doesn't load up, then freeze the browser while some JS code sorts it on the client. They also help me understand a bunch of data structure/framework stuff that is completely foreign to me.
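The server-side sorting idea is simple enough to sketch. This is not the code that shipped (that lived in a different language and framework entirely); it's a minimal illustration, assuming each status item has a `name` and a `state`, and assuming a made-up severity order where the worst states float to the top.

```python
# Hypothetical severity ranking: lower numbers sort first.
SEVERITY = {"down": 0, "degraded": 1, "ok": 2}


def sort_for_display(items):
    """Order items before rendering so the browser never has to sort anything.

    Sorts by severity first, then by name. States not listed in SEVERITY
    are placed after all known ones.
    """
    return sorted(
        items,
        key=lambda i: (SEVERITY.get(i["state"], len(SEVERITY)), i["name"]),
    )
```

The point is just that the list arrives already in order, so the page can render it as-is instead of loading everything and then stalling while a client-side script rearranges it.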

March 26, mid-afternoon: code ships and is online for anyone in the company to see. It's just a status page (no big red button), but this means we can now kill the terrible shell+python thing that's been running every two minutes in a screen session all this time.

March 26, late afternoon: I told the dashboard team that we went and did it ourselves. I am advised that the person nominally assigned the task "hasn't even started designing it yet".

March 30: dashboard team manager randomly drops by my desk and is suddenly *very* interested in the terrible page we wrote, and asks what else we need. I advise that we need the big red button that lobs a "panic" RPC at our server. Manager advises they will "bring the details to (the assignee)".

April 8: it seems there's now a mock-up of sorts from the team. Inside the group, we start talking about that situation where if you mail the $open_source_project mailing list asking for help with a legitimate problem, nothing happens, but if you make up a shitty version of something and fire it off, then suddenly 50 million people show up and go OI! DO IT THIS WAY! But, three months earlier when you politely asked for help, zip, nothing, nada, zilch.

April 14: someone points out they've Done Something to the page, and oh no, what have they done? The existing page now has this godawful rendering of a very large piece of equipment. Put it this way, if the project's codename was "bulldozer", there was now a little graphical bulldozer up at the top of the screen, complete with all of the other crap that you'd expect to see around a bulldozer.

Also, this isn't just a PNG or something. It's not some stock artwork, and it's not something someone drew. Oh no. This thing is a whole pile of CSS crap that manages to spit out a *dynamic rendering* of the damn thing.

So nothing happens for months, then two weeks after we ship something terrible to show how it's done, they now have time to go and screw around with this ridiculous (and ugly) thing? What?

The group's chatter continues. "This is what they spent time on?", and "We don't have a big red button, but we sure as hell have a CSS-ified bulldozer", and "how long do you think that took", and finally "if it was longer than 30 minutes, we got ripped off".

April 27: the all-CSS-bulldozer-thing disappears from the top for those of us in the group, because they do something to exclude us from seeing the new rendering, so at least we don't have to look at the damn thing and be reminded of how badly this is going.

May 14: still nothing useful to report. The page is now in tatters: what was a single file is now split across multiple things of different types: frontend framework A, frontend scripting language B, and so on. We ponder just reverting the whole mess to get it back to a simple single file that we actually understood and could work on.

June 3: meeting with the dashboard team manager in which we ask why they put a CSS bulldozer on top of the page and still haven't given us the big red button, which we actually need to keep production safe. The response: the person assigned to the project went off to spend a month doing something else on some other team.

I may have said something like "it's like you asked me to clean up the parking lot and I just decided to paint my nails first".

We pointed out that this was not a request for priority. We are just going to end up doing it ourselves. We just can't understand why all of this stuff was done and just dumped there. It was one simple file... and now it's *five*. The official answer is "this is the only way we can maintain it".

Finally we pointed out that this is customer feedback: i.e., you've already lost my business, please don't rush to save it now. You should take this feedback and recalibrate so as not to leave future people in the lurch like what happened here.

Someone noted that it would probably have been okay if the person went "here's your button and by the way we added pretty things". Instead, it turned into "look how I amused myself, made you wait for months, and did nothing to improve your experience, but I had fun and that's the important part".

June 4: manager goes and writes the panic button thing.

June 13: turns out, no, wait, the button is there, but it doesn't ask for confirmation (as we had asked), and... it doesn't send the RPC to our server, so it doesn't actually DO anything. It's just an image or something!
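Both of those requirements are the sort of thing a tiny check can pin down. Here is a hedged sketch of the behavior we wanted, with every name invented for illustration: `handle_panic_click` is a hypothetical click handler, and `FakeServer` stands in for our service so the check can count how many "panic" calls actually arrive.

```python
def handle_panic_click(confirmed, send_panic_rpc):
    """Send the panic signal only after the user has explicitly confirmed."""
    if not confirmed:
        return "cancelled"
    send_panic_rpc()
    return "sent"


class FakeServer:
    """Hypothetical stand-in for the real service; it just counts panic calls."""

    def __init__(self):
        self.panics = 0

    def panic(self):
        self.panics += 1


def check_button():
    """Return True if the handler respects confirmation and really calls the server."""
    server = FakeServer()
    if handle_panic_click(False, server.panic) != "cancelled" or server.panics != 0:
        return False
    if handle_panic_click(True, server.panic) != "sent" or server.panics != 1:
        return False
    return True
```

A button that is "just an image" fails the second half of `check_button()`: the click goes through, but the server's counter never moves.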

June 17: someone on the team reports that the button now works.

So, whenever you wonder what it's like at a big company... sometimes, it's like this! And, hey, sometimes it's even worse!