Sunday, September 20, 2020

How a hypothetical learning project might evolve

Do you ever wonder how some projects go from nothing to something, and guess at the twists and turns that must have happened during development? I came up with a hypothetical situation here that's inspired by a bunch of things that have happened in the past, but this doesn't describe any one product... knowingly, at least.

If you're the sort of person who reads descriptions of problems and tries to design your own solution in your head as you go, this might be a fun one to try. See how long you can stick with your original approach before you have to change gears to accommodate the changing requirements. Don't cheat by skipping to the end. Assume this whole thing evolved organically over the course of several months, and there was very little, if any, "look ahead" on the part of the customers/users.

The "prompt" for this idea was basically this: someone comes up to me and says they're more interested in working on "backend" stuff. Let's say we're already past the point where I ask them "what have you tried so far", and instead I'm going to give them an assignment. They ask for it to be realistic, so they might be prepared for "the real world" some day. And so, this is what followed.

...

Okay, how about you write something that'll take a host name and a port and do a (TCP) connection to it to see if it's up? Just say if it's up or down. We'll be using it for some goofy host monitoring stuff, and this'll get you started.

Oh, and don't get stuck. Give up after 30 seconds or so. The host might be down, and you'll sit there for a long time otherwise.

Make sure you handle both v4 and v6.
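To make that concrete, here's roughly what a first cut might look like in Python, if that happened to be the language you reached for. The check() name is just something I made up for this sketch; socket.create_connection() walks the getaddrinfo() results for you, so it covers both v4 and v6, and the timeout keeps it from sitting there forever.

    import socket
    import sys

    def check(host, port, timeout=30.0):
        # create_connection() resolves the name and tries each address
        # (v4 and v6) in turn until one connects or they all fail.
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    if __name__ == "__main__":
        host, port = sys.argv[1], int(sys.argv[2])
        print("up" if check(host, port) else "down")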

Now, do you see how you have to do a DNS query to turn a host name into an IP address? How long does that take? It might be quick but it's never zero. There's always SOME delay. Can you instrument things so we can say how long you spent on the DNS part? Say something like "DNS delay 15 msec" before you say up or down.

You got that? Nice. How about the connection itself. That's not going to finish immediately, either. If it succeeds, can you say how long it took to happen? You can say that instead of up. If it's down, still just say down.
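Sticking with the Python sketch, one way to split out the two delays is to call getaddrinfo() yourself and wrap a monotonic timer around the lookup and then around the connect. Something like this, with check_timed() being another made-up name:

    import socket
    import time

    def check_timed(host, port, timeout=30.0):
        t0 = time.monotonic()
        try:
            infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        except socket.gaierror:
            print("down")
            return False
        print(f"DNS delay {(time.monotonic() - t0) * 1000:.0f} msec")

        for family, socktype, proto, _canon, addr in infos:
            s = socket.socket(family, socktype, proto)
            s.settimeout(timeout)
            t1 = time.monotonic()
            try:
                s.connect(addr)
            except OSError:
                s.close()
                continue
            s.close()
            print(f"connected in {(time.monotonic() - t1) * 1000:.0f} msec")
            return True
        print("down")
        return False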

Next up, how about multiple ports? Can I give you multiple ports for this host, and then you'll check all of them? You can do them serially. For now.

That works? Now how about doing them in parallel? That way we'll get the results sooner. Don't worry about limits on concurrency for now. They're our machines so it's okay if we effectively portscan them.
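With the single-target check() from the earlier sketch, fanning out over one host's ports is mostly a matter of handing the work to a thread pool. A rough version, leaning on concurrent.futures:

    from concurrent.futures import ThreadPoolExecutor

    def check_ports(host, ports, timeout=30.0):
        # One worker per port, so a port that times out doesn't hold up the rest.
        with ThreadPoolExecutor(max_workers=max(1, len(ports))) as pool:
            results = pool.map(lambda p: check(host, p, timeout), ports)
            for port, up in zip(ports, results):
                print(f"{host}:{port} {'up' if up else 'down'}")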

Good, good, now, well, I'd like to check multiple hosts. You can do the individual hosts serially, then do the ports for that host in parallel, same as you're doing now. So check all of host A's ports, and then host B's ports, and so on. Oh, and they might not be the same ports for each host.

After that, yep, you guessed it, can you make it do all hosts, all ports, in parallel? That'll really get us cooking.
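The all-hosts, all-ports version is the same trick with a flattened list of (host, port) pairs. Here, targets is a hypothetical dict mapping each host to its own list of ports, since they might not be the same:

    from concurrent.futures import ThreadPoolExecutor

    def check_everything(targets, timeout=30.0):
        pairs = [(host, port) for host, ports in targets.items() for port in ports]
        with ThreadPoolExecutor(max_workers=max(1, len(pairs))) as pool:
            results = pool.map(lambda pair: check(pair[0], pair[1], timeout), pairs)
            for (host, port), up in zip(pairs, results):
                print(f"{host}:{port} {'up' if up else 'down'}")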

Ready for the next step? Can I get you to be persistent and run as a service, so you can have a cache? Like, can I ask you to check a bunch of stuff, and if you already have those answers, you don't do fresh connection attempts? You can set the cache interval from a single number of seconds that we'll specify for all items. (I won't try to make you track requests for different cache item lifetimes. Yet. Heh heh.)

Remember that the wall clock can go backwards, so use a monotonic clock for aging things out.
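Once it's a long-running service, the cache itself can be as dull as a dict keyed by (host, port), as long as the aging uses time.monotonic() and not the wall clock. A sketch, with all of the service plumbing (how requests actually arrive) left out:

    import time

    class ResultCache:
        def __init__(self):
            self._entries = {}  # (host, port) -> (up, stored_at)

        def get(self, host, port, ttl):
            entry = self._entries.get((host, port))
            if entry is None:
                return None
            up, stored_at = entry
            if time.monotonic() - stored_at > ttl:
                return None  # too old; caller should do a fresh check
            return up

        def put(self, host, port, up):
            self._entries[(host, port)] = (up, time.monotonic())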

Right, now, sometimes we really do have to punch all the way through to the host and not use the cache. Can we set a field in a request that says "do this right now on demand" that'll ignore the cache? Make sure it updates the cache afterward, since, hey, we paid for it, and we might as well keep the data around, right?

Great! When we do an on-demand request, can we log something if what we had cached for that target turned out to not match what the on-demand check found? Say it was polled earlier and was cached as down, but the on-demand request found it up. That's notable, right?
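Put together, the request path at this stage might look something like the following sketch, where check() and ResultCache are the earlier pieces and on_demand stands in for whatever field ends up in the request:

    import logging

    def handle_request(cache, host, port, ttl, on_demand=False):
        cached = cache.get(host, port, ttl)
        if not on_demand and cached is not None:
            return cached
        fresh = check(host, port)
        if on_demand and cached is not None and cached != fresh:
            logging.info("cache had %s:%d as %s, on-demand check says %s",
                         host, port,
                         "up" if cached else "down",
                         "up" if fresh else "down")
        cache.put(host, port, fresh)  # we paid for it, keep it around
        return fresh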

Brilliant. So... data. People love data. Can we keep track of how many on-demand requests we've served for a given host/port? It can be just in-memory for now, so if your service restarts, the numbers will go away.
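The in-memory version of the stats can be as simple as a Counter keyed by target, bumped from the handler above:

    from collections import Counter

    on_demand_counts = Counter()

    def record_on_demand(host, port):
        on_demand_counts[(host, port)] += 1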

This next one, yeah, you knew this was coming. Can we make the statistics persist across restarts of your service so when you roll out a new version it doesn't all go back to zero?
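One low-tech way to survive restarts is to dump the counters to a file now and then (and on shutdown), and read them back at startup. The JSON format and path here are made up for the sketch; a real service might reach for sqlite or some external store instead.

    import json
    from collections import Counter

    STATS_PATH = "on_demand_counts.json"  # hypothetical location

    def save_counts(counts):
        with open(STATS_PATH, "w") as f:
            json.dump({f"{host}:{port}": n for (host, port), n in counts.items()}, f)

    def load_counts():
        try:
            with open(STATS_PATH) as f:
                raw = json.load(f)
        except FileNotFoundError:
            return Counter()
        counts = Counter()
        for key, n in raw.items():
            host, port = key.rsplit(":", 1)
            counts[(host, int(port))] = n
        return counts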

How about a top N on-demand target leaderboard, or hall of shame, or something? We want to see who's using it the most, for all time.

You know what? Some of those in the "all time" list are old news. Let's make a way to only give the top N in the past X days. Maybe we want the top 10 requested targets for the past 30 days. That type of thing.
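That last twist is where a bare counter stops being enough: "top N in the past X days" means remembering when each on-demand request happened, not just how many there were. One sketch is to keep a list of timestamped events (persisted the same way as the counters) and aggregate at query time. Wall-clock time is fine here, since we're bucketing history by day rather than aging cache entries.

    import time
    from collections import Counter

    events = []  # (unix_time, host, port) tuples

    def record_on_demand_event(host, port):
        events.append((time.time(), host, port))

    def top_targets(n=10, days=30):
        cutoff = time.time() - days * 86400
        recent = Counter((host, port) for ts, host, port in events if ts >= cutoff)
        return recent.most_common(n)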

...

Still with me? If you were designing it in your head, how did you do? Did you keep the metrics in your process the whole time? Did you ship them off to somewhere else? If so, where did it flip from one to the other? Was it when it went from ephemeral to stateful?

How about the resolver? Did you start with gethostbyname or did you stick with the C library's getaddrinfo()? Or, did you punt and go with something else? Did you notice that getaddrinfo blocks? How are you going to handle parallelizing that and not getting stuck behind it when the resolver is down or slow?
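In Python, socket.getaddrinfo() blocks the calling thread the same way the C call underneath it does, and it doesn't take a timeout. One workaround (of several) is to push each lookup onto a worker thread and give up on the future after a deadline, so a wedged resolver can't take the whole service down with it:

    import socket
    from concurrent.futures import ThreadPoolExecutor
    from concurrent.futures import TimeoutError as FutureTimeout

    _resolver_pool = ThreadPoolExecutor(max_workers=32)

    def resolve(host, port, deadline=5.0):
        future = _resolver_pool.submit(
            socket.getaddrinfo, host, port, type=socket.SOCK_STREAM)
        try:
            return future.result(timeout=deadline)
        except (FutureTimeout, OSError):
            # An abandoned lookup still ties up a worker until it returns,
            # but at least the caller gets to move on.
            return []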

Do you want to attempt happy eyeballs support or just take multiple DNS RRs for a given target one at a time?
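The simpler of those two options is just walking the getaddrinfo() results in order with a short per-attempt timeout, rather than racing v6 against v4 the way Happy Eyeballs (RFC 8305) does. A sketch:

    import socket

    def connect_first_working(host, port, per_attempt_timeout=5.0):
        for family, socktype, proto, _canon, addr in socket.getaddrinfo(
                host, port, type=socket.SOCK_STREAM):
            s = socket.socket(family, socktype, proto)
            s.settimeout(per_attempt_timeout)
            try:
                s.connect(addr)
                return s  # caller is responsible for closing it
            except OSError:
                s.close()
        return None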

See all of those crazy things to worry about? I think I'd actually like to do something like that with some willing students. I'd send them down the road to something interesting (like monitoring port "aliveness"), and throw enough curve balls at them to force them to encounter and solve those kinds of problems.

Thoughts? Send me feedback if so.