Software, technology, sysadmin war stories, and more.
Friday, December 11, 2020

Feedback: errors, filesystems, magic numbers, and more

It's reader feedback time!


An anonymous reader says:

When I was a student, error checking was not optional. If your project failed in any way, you would get an automatic 0.

For example, write an FTP server in C. What if the examiner uses LD_PRELOAD to swap in a malloc that returns NULL on every Nth call? If your program doesn't handle the error correctly, you get a 0, no matter how well your FTP server works under "normal" circumstances.

Back then, it was a nice way to learn how to work and write robust programs.
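
For anyone who hasn't run into that grading trick, here's a rough Python sketch of the same idea. In C you'd preload a shim over malloc; here a made-up `FlakyAllocator` plays that role and hands back `None` on every Nth request, standing in for NULL. Every name below is invented for illustration, and it's an analogy rather than the real LD_PRELOAD mechanism.

```python
class FlakyAllocator:
    """Hands out buffers, but returns None on every Nth request (a stand-in for NULL)."""

    def __init__(self, fail_every):
        self.fail_every = fail_every
        self.calls = 0

    def alloc(self, size):
        self.calls += 1
        if self.calls % self.fail_every == 0:
            return None  # simulated allocation failure
        return bytearray(size)


def copy_message(allocator, payload):
    """Copy payload into a fresh buffer, or return None if the allocation failed."""
    buf = allocator.alloc(len(payload))
    if buf is None:
        # This is the part being graded: notice the failure and back out cleanly.
        return None
    buf[:] = payload
    return bytes(buf)


if __name__ == "__main__":
    allocator = FlakyAllocator(fail_every=3)
    for attempt in range(6):
        print(attempt, copy_message(allocator, b"hello"))
```

Code that skips the `None` check blows up on the third request, which is the point of the exercise.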

I'm down with that, but I wonder how many programming class teachers are operating at a high enough level to be able to go that far. I say this because I had a terrible experience when taking a C++ class in college back when dinosaurs walked the earth.

For context, it was using Borland C++ on Windows. We had some book I can't remember that came with a floppy (see, I told you it was a long time ago), and we were supposed to Do Stuff with the code. The only thing is... it reliably crashed the program when run. I'm talking about an old-school Windows 3.1 general protection fault popup.

The teacher's response was to "delete the destructor".

Keep in mind, this was the *introductory* course, and we obviously weren't doing anything nutty like attempting threads or really anything else interesting. I wish I had kept those lab files so I could go back and show everyone what the problem was. Unfortunately, that's all gone.

The one thing which stuck with me was the teacher just punting on the thing: not even trying to understand it, use it as a learning experience for all of us, or supply a better version that didn't break.

Instead, it was just one more log on the bonfire of "it does that".


Another person asks:

What filesystems do you prefer on Linux? Do you regularly use RAID? Do you use some commercial NAS solution for your collection of cat pictures or do you have something like FreeNAS and ZFS running on commodity hardware?

I prefer ext4fs. That's been my progression: ext2, then ext3, now ext4. I don't think I ran the original extfs or minixfs very long back in the old old days of playing with Slackware installs.

One thing I will not run is btrfs, aka ENOSPCfs. I saw it fail again, and again, and again, and again, AND AGAIN, and again, and again. Then I saw it fail some more. The promised features never materialized, and the drama and surprising behaviors never ceased. All costs, no benefits? I'll leave that on the shelf, a few spots down from reiserfs (which was plenty sketchy BEFORE you know who did you know what).

I don't use RAID. I mean, I used to. I ran AMI MegaRAID-derived stuff on a box about 20 years ago for a RAID-5. It worked, I guess. It wasn't easy to maintain, given their proclivity for sending you into the BIOS, or sending you into a terrible ncurses-ish tool to do things to the array.

A few years after that, I had a 3ware RAID setup because it seemed nice to be able to do RAID and not have to pay the SCSI tax for a home machine. I even went as far as buying their 4-disk cage so I could hotswap (!) things if needed. I had more disks fail out of that than anything else before or since.

Not long after that, I got to see how things like GFS work, and realized that I'm a lot happier with non-POSIX solutions for storing lots of data that's usually at rest, or is just constantly receiving append traffic. It doesn't solve all problems, but it's way better than using RAID and/or NFS.

So yeah, if the single disk in snowgoose gets cooked, y'all are going to see some downtime until I deal with it. Considering the value of what's on that box, it's not worth the drama of running an array and paying for the extra hardware every month.

It'd be a different story if I had actual money flowing through here, but I don't, so it just doesn't need all of those nines.

Most of my disks wind up outliving the machines they're mounted in. They get migrated from one case to the next, and keep running until I need to upgrade them for size or performance reasons.

My "cat pictures" and anything else of consequence gets preserved by existing on a bunch of machines at once, most of which have external backup devices. There are varying degrees of storage, like "online" (plugged in, turned on, on the network, always awake), "standby" (plugged in, turned on, usually asleep), "offline" (at hand but not plugged in normally), and "offsite" (somewhere far away). I rotate the media when possible. This year has made it tough but I find a way to cope.

I built some stuff to make this easier. It's not the sort of thing that would really work for anyone else, but it does the job for me.


Victor asks:

After having worked with a bunch of popular languages, and getting pushback from many colleagues and managers when trying to raise the bar for what we allow in the code base, I sympathise. It would be interesting to know what you think a minimum bar of entry should be for language features to avoid such footguns, and whether a conglomeration of features from different languages would be sufficient for safe, modern systems development. For example, Rust-style exhaustive `match` statements seem to be excellent moves in the right direction.

My ponderings over the past couple of days since writing that post have centered around "maybe more coding is not the answer". It seems like we might start with a problem statement that leads us to create a program. Then maybe we figure out the logic and all of the inputs, outputs, states, what triggers the transitions, and what happens when moving between them. Only then do we turn around and try to write something that actually honors that design.

There are so many problems with this approach. Even if you nail the logic/design part, it's easy to screw up the process of writing code to cover it. You might forget a part. You might write it backwards. You could put in something that's not on the original plan.

... and that's just the errors you can get during the *initial* build. This says nothing about the inevitable changes that will follow, and how well they can be added to the overall system without breaking the world.

This then got me thinking more about declarative things, such that most of the work is not just a tangled series of steps I'm laying out to the computer. The logic itself is not written as steps, so why should that part of the actual program wind up that way?
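
To make that a little more concrete, here's a minimal sketch of what "the design as data" might look like. The states and events are made up for illustration. The interesting bit is that the table gets checked for holes before anything runs, which is a poor man's version of the exhaustive `match` Victor mentioned.

```python
from enum import Enum, auto
from itertools import product


class State(Enum):
    IDLE = auto()
    RUNNING = auto()
    DONE = auto()


class Event(Enum):
    START = auto()
    FINISH = auto()
    RESET = auto()


# The design, written down as data: (current state, event) -> next state.
TRANSITIONS = {
    (State.IDLE, Event.START): State.RUNNING,
    (State.IDLE, Event.FINISH): State.IDLE,
    (State.IDLE, Event.RESET): State.IDLE,
    (State.RUNNING, Event.START): State.RUNNING,
    (State.RUNNING, Event.FINISH): State.DONE,
    (State.RUNNING, Event.RESET): State.IDLE,
    (State.DONE, Event.START): State.DONE,
    (State.DONE, Event.FINISH): State.DONE,
    (State.DONE, Event.RESET): State.IDLE,
}


def missing_transitions(table):
    """Return every (state, event) pair the table forgot to cover."""
    return [pair for pair in product(State, Event) if pair not in table]


# Refuse to load at all if the design has a hole in it.
_missing = missing_transitions(TRANSITIONS)
if _missing:
    raise RuntimeError(f"unhandled transitions: {_missing}")


def step(state, event):
    """Advance the machine by looking the answer up instead of branching."""
    return TRANSITIONS[(state, event)]
```

Forgetting a case now fails at load time instead of showing up later as a branch nobody wrote. It does nothing about the table itself being wrong, of course.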

I might write a whole post about this at some point. The topic is still simmering in my head.


Another reader writes in:

In regards to fixing "fork() can fail", I wonder if adopting the "fail fast" mantra would be a way to address it. That is, crash the entire program if fork fails. The hard part then becomes providing some other mechanism for cases where you can properly handle the failure, without that also being abused. Makes me think of all of the empty catch blocks in Java code.

In the original situation from 2014, nuking the program outright would have been interesting. It would have prevented the program in question (a "stats" collector running as root everywhere) from accomplishing its job. This shifts the problem from "murdering everything else on the low-mem boxes" to "the stats collector keeps being killed".
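
Here's roughly what "fail fast" could look like, sketched in Python on a Unix box. Python's `os.fork()` raises `OSError` on failure, where C's fork() returns -1, and a -1 handed to kill() means "every process I'm allowed to signal". The helper names (`fork_or_die`, `kill_child`) and the injectable `fork`/`kill` parameters are assumptions of mine, there so the failure path can be exercised without actually exhausting the machine.

```python
import os
import signal
import sys


def fork_or_die(fork=os.fork):
    """Fork, and exit loudly if it fails instead of limping on without a child."""
    try:
        return fork()
    except OSError as exc:
        # Leave something behind for whoever ends up reading the logs.
        print(f"fatal: fork failed: {exc}", file=sys.stderr)
        raise SystemExit(1) from exc


def kill_child(pid, sig=signal.SIGTERM, kill=os.kill):
    """Signal one child. Refuse pids <= 0, which address whole groups of processes."""
    if pid <= 0:
        raise ValueError(f"refusing to signal pid {pid}")
    kill(pid, sig)
```

The second helper is belt and suspenders: even if a bad pid sneaks through from somewhere else, it never reaches `os.kill()`.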

You'd hope that the trip through the "you screwed up and now I'm gonna kill you" handler would leave some kind of telemetry behind, and that would eventually bubble up to the team which owned that service. Bigger companies where I've worked have ways to aggregate error logs and other things like this from their various bits of infra. Teams which are high-functioning tend to stay on top of them and keep it clean.

However, the mere existence of a list of things which is broken is never going to assure things being cleaned up. I've seen just as many well-meaning "defect tracker" type things sitting there ignored by people who don't care and don't have a reason to. There's nothing driving them to notice and deal with it, and so they don't. The incentives are wrongly oriented, and that's one way it manifests.

As someone who's actually built a thing or two which looked for problems before they could bite a customer and seen them be completely unappreciated and ignored, it would not surprise me if the "crash dashboard" type thing also just sat there doing nothing.

And yeah, I can also see people learning the absolute minimum way to squelch the alerts by "handling" the error code and just swallowing it. Python, Java, C++, PHP, you name it: people will find a way to just do an "except: pass" or whatever the local idiom is.

If there's nothing and nobody noticing this and trying to stop it and/or revert the existing ones, then it's kind of inevitable that the badness will sneak in EVERYWHERE.

This is why I think we need a system which basically hates you until you treat it right. It has to be smart enough to know that you're trying to fake it out, and it'll say something like "okay, but I'm telling!" and then add it to a list of dumb things done in the code that got compiled into the binary.
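
Nothing like that exists as far as I know, but a toy runtime version is easy enough to sketch. Everything here is hypothetical: `swallow` behaves like an `except: pass`, only it writes down what it ate, and `corners_cut` lets you ask about it afterward. The real thing would want to do this at build time, which this does not attempt.

```python
import contextlib

# Hypothetical registry: every swallowed error gets written down here.
CORNERS_CUT = []


@contextlib.contextmanager
def swallow(reason):
    """Suppress an exception like `except: pass` would, but keep a record of it."""
    try:
        yield
    except Exception as exc:
        CORNERS_CUT.append({"reason": reason, "error": repr(exc)})


def corners_cut():
    """Answer the question: which errors did this program decide to ignore?"""
    return list(CORNERS_CUT)


if __name__ == "__main__":
    with swallow("stats file is optional"):
        raise FileNotFoundError("stats.conf")
    print(corners_cut())
```

It's still swallowing the error, but at least now it's telling.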

Imagine if you could *ask* some program, app, or web site which corners were cut when they built it. Then you could decide whether you still wanted to proceed or not. That'd almost be interesting, until you discovered just how low the bar for quality really is, and what most people will take as acceptable.

After all, the fact that things are crap is not new. People know this stuff is crap. They still buy it, perhaps because there are no non-crap alternatives.

There's some game theory lesson to be learned here, no doubt. Who wants to show up with an actually good alternative and collect an entire market?

Quality! It still means something.


In response to yesterday's CentOS 8 post, one reader says:

I think CentOS 8 does offer kickstarts? We do PXE kickstart installs all the time.

Sure they do. The actual OS obviously can do a kickstart. But I was looking for an option where the dedicated server vendor (IBM Cloud, aka Softlayer) would do it for me. This is usually the right way to do things when you're in the dedicated world, since then you get added to their RHN entitlement or whatever it's called now. (It's been almost 15 years since I had to worry about such things in my line of work, thankfully.)

Trying to do your own install and then get it inserted into their tracking is basically a one-way trip to dramaville. I *like* never using the support mechanism with this place. I am my own support tech.


On the topic of magic numbers, one reader shares one of theirs:

Loved the magic number article! Here's one I've come across - 3.579 MHz. This is the color subcarrier for NTSC, and since every TV set needs this frequency reference, these crystals are very easy to come by. 3.579 happens to be in the 80M ham band, so a lot of homebrew QRP rigs use this frequency.

That's useful. You find an application that forces the price down on something, and then capitalize on it with a totally different application. This is kind of what happened with the RTLSDR sticks: someone made a nice tuner and sampling chip for the purposes of DVB-T TV reception, and someone else found out you could interrogate it and get raw samples across a wide swath.

Now you have $20 SDRs instead of shelling out $1000 for a USRP. I mean, a USRP is still an amazing thing, and there are still reasons to get them, but if all you want to do is listen to the cops, the dogcatcher, or track some planes, you no longer have to drop major coin on it.
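
Back to that crystal for a second. The NTSC color subcarrier is exactly 315/88 MHz, which is where the "3.579" comes from. A quick sanity check that it really does land in the 80 meter band, taking the US allocation of 3.5 to 4.0 MHz as a given:

```python
from fractions import Fraction

# NTSC color subcarrier: exactly 315/88 MHz.
SUBCARRIER_MHZ = Fraction(315, 88)

# US 80 meter amateur allocation, assumed here to be 3.5 to 4.0 MHz.
BAND_80M_MHZ = (Fraction(7, 2), Fraction(4))


def in_band(freq_mhz, band):
    """True if freq_mhz sits between the band edges, inclusive."""
    low, high = band
    return low <= freq_mhz <= high


if __name__ == "__main__":
    print(f"{float(SUBCARRIER_MHZ):.6f} MHz")
    print("inside 80 m:", in_band(SUBCARRIER_MHZ, BAND_80M_MHZ))
```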


Also on the topic of magic numbers, Patrick writes:

Reading your magic list of magic numbers triggered some memories, thanks for that. But is it really 3G which will go offline starting 2022, or rather 2G?

It's 3G. 2G's sunset happened at the beginning of 2017 for AT&T's network. Digging around turns up some web pages of various degrees of sketchiness, but it looks like they'll drop 3G in February of 2022.

Verizon, meanwhile, dropped 2G at the end of 2019 and is torching 3G in a couple of weeks, at the end of 2020... again, according to at least one web page which is *not* Verizon itself. Take with a grain of salt.

I'm talking about North America here, but I imagine other places have their own timelines for similar things.

It's kind of interesting when you consider this is happening even though the telcos have recently come into possession of massive amounts of UHF space. All of the TV stations are getting "repacked" so they will all fit below channel 37. Channel 37 itself is left unused as a "guard band" for radio astronomy, and everything past that is now the domain of some wireless carrier. Exactly who gets what bands in what places is a function of their wheeling and dealing with the FCC over the past few years.

If you're old enough, you probably remember that the UHF band in the US for TV used to go up to channel 83 until 1983, coincidentally enough. Then it was pulled back to a top end of 69, and channels 70-83 were handed over to the cellular providers. This explains why you could pick up AMPS phone calls on those upper channels with a little fooling around with your fine tuning knob - they were just analog FM transmissions, totally compatible with what your TV wanted to see as part of an NTSC transmission.

People freaked out about scanners listening to their phone calls when any kid with a suitably-old television could totally listen in (and probably did). Sheesh.

Anyway, at some point, they pulled it back even more, and now TV runs from 2-13 in the VHF band and 14-36 in the UHF band. Everything else has been reallocated.
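
If you want to see how much spectrum that adds up to, the arithmetic is simple: US UHF TV channels are 6 MHz wide, starting with channel 14 at 470 MHz. The helper below is just that formula with a made-up name, nothing official.

```python
def uhf_channel_edges_mhz(channel):
    """Return (low, high) band edges in MHz for a US UHF TV channel (14 to 83)."""
    if not 14 <= channel <= 83:
        raise ValueError(f"not a US UHF TV channel: {channel}")
    low = 470 + 6 * (channel - 14)
    return low, low + 6


if __name__ == "__main__":
    for ch in (14, 36, 37, 69, 83):
        low, high = uhf_channel_edges_mhz(ch)
        print(f"channel {ch}: {low}-{high} MHz")
```

By that math, the original band ran from 470 to 890 MHz, the 1983 cut stopped it at 806 MHz, and the repack stops TV at 608 MHz, the bottom edge of channel 37.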

Some TV stations have shut down their transmitters entirely and are now riding as a "subchannel", multiplexed into the stream of some other station. Some of them are in truly miserable positions now, and can't be received in areas which formerly were able to get them just fine. (KRON, I'm looking at you.) I'm guessing they care more about their cable and/or satellite coverage than their supposed broadcast area.

There's a reason they keep telling you to "rescan" if you have an antenna: this stuff is NOT holding still.


Finally, regarding the Signal bug from last month, it seems that a fix was committed. Thanks to the anonymous contributor for the tip.

Next Signal funtimes: try sending a message that begins with a bunch of spaces and then contains content, like this:

                    ^ this thing right here

That'll go out without any spaces before the caret. Try it!
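
I haven't looked at what Signal actually does to the message, so treat this as a guess at the general shape of the problem and not a diagnosis. It's what happens any time something "cleans up" input by trimming both ends. The function names are made up.

```python
ORIGINAL = " " * 20 + "^ this thing right here"


def trim_both_ends(text):
    """Hypothetical cleanup step: drop leading and trailing whitespace."""
    return text.strip()


def trim_trailing_only(text):
    """Gentler alternative: leave the leading spaces (and the alignment) alone."""
    return text.rstrip()


if __name__ == "__main__":
    print(repr(trim_both_ends(ORIGINAL)))
    print(repr(trim_trailing_only(ORIGINAL)))
```

The first version is fine for a name field on a form. It's less fine when the leading spaces are the whole point of the message.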