Monday, February 28, 2022

When Command-V (or CTRL-V) is your "copilot"

A reader wrote in asking what I thought about the whole GitHub "copilot" thing. If you haven't heard of it yet, it's apparently this thing that was trained on a whole mess of repositories and their code, and you can "prompt" it and it'll spit out code that may or may not do what you want. It doesn't seem to know about licensing (BSD, GPL, MIT, that sort of thing), or plagiarism, or any of that other good stuff.

When this hit the news last summer, I had some thoughts about it but held back at the time. Now, maybe it's just the caffeine talking, but I feel like going for it. So, here we go.

This is actually a story about somewhere I used to work, and someone who was on one of my teams back in the day. One day, I heard that one of the people in a far-away office had created some kind of tool that would let you look at the status of different machines - that is, Linux boxes. I don't remember the finer points of it or why this person even bothered, but they had this thing and wanted to get the code reviewed in order to check it in.

It was written in a language which was not common for that company, particularly back then. Most backend stuff was written in C++, with a significant amount in Java. A lot of people did dumb glue things in Python (and paid the price long-term, but I digress). We even had people hacking on site integrity stuff in Haskell and doing linters in OCaml. Frontend stuff was largely derived from PHP. This project was in Go, and so was quite the unusual thing to see in this environment.

Maybe that's why I got to looking at it more closely. It just felt "off". Something about the code didn't feel right beyond the fact that it was in a new-to-me language, in a new-to-the-company language, and for a brand new project. Maybe it didn't really match up with the problem it was supposed to solve at the company.

Finally, I just grabbed some bits of code that looked fairly specific and hit up the web to see if they came from anywhere. I figured it might be Stack Overflow or something. It wasn't.

This stuff was cribbed straight from some GitHub project. It had no attribution, and it had no license. This was a problem in so many ways. This company was rather particular about how and when it would add external code to the tree. They had deep pockets and were worried about license violations, lawsuits, and (worse still) patent violations and the associated lawsuits from *those*.

There was this whole thing you'd do in order to bring in an external library. First, it had to be suitably licensed. You'd run it by the local "open source czar" who would approve it. Then you'd add it to this "third-party" part of the tree with all kinds of info about where it came from and how it was licensed. It would then be built that way, and you could pull it in from there.

None of this happened here. This code was ripped off wholesale and was just jammed straight into the project.

Now do you see why I tell this story in relation to a question about Copilot? It's like, yes, there's now a coder tool which does this. But... there have been "coder tool" *people* who have been ripping off code and jamming it deep into their company's code bases for a very long time. This might make it easier to do and even harder to prove when it happens, but it is by no means a new situation.