Dr. Mirman's Accelerometer

Finding confusing code in the Linux kernel: An interview w/ Prof. Justin Cappos

Matthew Mirman

The Linux kernel is 28 million lines of code.

These are 28 million opportunities for attack.

In possibly one of the most widely depended-on projects in the world.

Humans alone can't possibly defend this. A defender needs to find every vulnerability. An attacker only needs to find one.

The future must automate defending it.

A few months ago I spoke with Prof. Justin Cappos at NYU Tandon School of Engineering about automating code quality.

His answers surprised me.

I've personally always hated linters. I'm very scientific. Very demure. Very mindful. Very logical.

Linters seemed arbitrary. Rules made up by pedants.

Justin's lab had an idea: Maybe we could measure the importance of these rules scientifically. We can actually figure out if a line of code is statistically confusing.

Why is confusing code a bigger problem than not enough whitespace?

Simple: It's harder to debug what you can't understand.

Using these techniques, they found 3.6 million instances of confusing code patterns in open repositories.

Hopefully these techniques lead to a safer kernel, and safer code for you!

Watch on Youtube

Accelerometer Podcast
Accelerometer Youtube

Anarchy
Anarchy Discord
Anarchy LLM-VM
Anarchy Twitter
Anarchy LinkedIn
Matthew Mirman LinkedIn

Speaker 1:

If you design a really good defensive system, you just don't know that you prevented an attack. One of the key things that you really should do is you should design resilience into your system. We found 3.6 million confusing code patterns across 14 large, widely used projects. A significant percentage of the lines of code in these projects contain these patterns, and they're patterns that have been shown to be confusing. What is a confusing pattern? A little tiny bit of C code that, if you write it and show it to a programmer, a statistically significant number of programmers think it does something different than what the compiler actually has it do on your system.

Speaker 2:

That's every line of code that I have ever written.

Speaker 1:

Well, you'd be a good case study for us to draw these from, and actually that's one of the ways we found these in the first place. You know, it might surprise you to know this, but the way that people used to try to figure out what was confusing and what was not, by and large, was they would go and ask one of these graybeard programmers who had written impressive code. They'd say: what do you think is confusing? And they would think for a while and write down a list, and this was just sort of how it was done. There were a few things that were tested here and there, but by and large they asked people, and people said: oh, go-tos are bad, and this is bad, and don't do this, and do that. But there wasn't any real rigorous foundation for a lot of these assumptions. So what we did is we took code from the International Obfuscated C Code Contest, which you may not have heard of, but it's a contest where people compete to write the most confusing possible C code you can imagine, and the examples are amazing. There's one in ASCII art: a little picture of a train, and when you run it, it says choo choo. It's all fantastically hard to figure out. You look at it and you have no idea what a lot of these would do. That's the point: to be confusing.

Speaker 1:

And so from these we had this hypothesis that a lot of this confusion wasn't at the program scale. It wasn't the whole program that made it confusing; it was that there are these little indivisible bits of confusion sprinkled throughout the code. We call them atoms, because you don't usually split the atom, right? So what we did is we developed a series of 20 of these patterns and conducted a big study where we had a whole bunch of programmers test them. What we found is that, to varying degrees of significance, 15 of these patterns had measurable effect sizes on programmer confusion. And then, following on from that, we actually went and found 3.6 million instances of those patterns in popular code: the Linux kernel, Git, Apache, SVN, different database programs, all kinds of things like this. And we found bugs. We found serious bugs that were caused by the fact that the code was more confusing than it should be, just from these very simple, teeny, tiny patterns.

Speaker 2:

Can you give us an example of one of those patterns?

Speaker 1:

Yeah. So a really simple example of this: let's say you have a program that says int b equals 2. You also have an int a, and then you say a equals b++. Then printf, with your parentheses and the percent-d and everything like that, comma a. So you're just printing out a. B started as 2, and you said a equals b++. So what should the value of a be?
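A minimal, runnable version of the example being described verbally (the exact formatting is my reconstruction), together with the de-atomized form that comes up later in the conversation:

```c
/* The post-increment atom: a = b++ assigns b's OLD value to a,
   and only then increments b. */
#include <stdio.h>

int main(void) {
    int b = 2;
    int a;
    a = b++;            /* a gets 2; b becomes 3 afterwards */
    printf("%d\n", a);  /* prints 2, not 3 */

    /* The non-confusing equivalent mentioned later: a = b; b++; */
    int b2 = 2;
    int a2 = b2;
    b2++;
    printf("%d\n", a2); /* also prints 2, with no surprise */
    return 0;
}
```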

Speaker 2:

So these are like the corner cases they teach you in compilers classes, where you're literally testing your compiler for them. So why not just, like, survey computer science schools and say: can you give us the fuzzing tests that all the students handed in for their compilers exams?

Speaker 1:

Well, this pattern is really, really common in code. It's extremely common; it comes up hundreds of thousands of times in very popular C code, over and over again. So this isn't even like a compiler fuzz case. There are patterns that, believe me, are like compiler fuzz cases. Like, for instance, did you know that in C, if you have an array a, you're used to writing a, then an open bracket, then two, then a close bracket. But you can write two, then an open bracket, then a, then a close bracket. Because of the way C resolves array indices, it's just as valid to have the subscript on the outside as the array; they're just added together, so you can effectively switch them. So there are things like that that are definitely not going to come up in normal code that normal people write.
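A small runnable illustration of the swapped-subscript curiosity just described (the array name and values are my own):

```c
/* In C, e1[e2] is defined as *(e1 + e2), and addition commutes,
   so a[2] and 2[a] are the same expression. */
#include <stdio.h>

int main(void) {
    int a[] = {10, 20, 30};
    printf("%d\n", a[2]);  /* 30 */
    printf("%d\n", 2[a]);  /* also 30 */
    return 0;
}
```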

Speaker 2:

Okay, okay, that specific one: I was working on JavaScriptCore for a while, and yeah, that comes up. How do you find these patterns?

Speaker 1:

Well, the way we found them originally was we went through the International Obfuscated C Code Contest and derived them from it, and then we did the rigorous testing to make sure that they were statistically confusing. And I don't usually find...

Speaker 2:

How do you like, given an existing pattern, how do you find that in some arbitrary code?

Speaker 1:

Yep. So what we do is we basically write code to go and scan the code base. You know, there are languages like CodeQL, there are things like Clojure, there are great tools like Coccinelle, which the Linux kernel uses. They all have abilities to go and look through code and try to find things, and so we've written classifiers that try to find these atom patterns using those tools.

Speaker 2:

How much is left to generalize here?

Speaker 1:

There's absolutely a ton. One of the really interesting things that happened out of this work is a lot of times when you do academic work, you write a paper and that's it. And here I really wanted to take this to industry but I felt it was always too early. But the academic community loved it. Like we got best paper awards for our first two papers on this and then after that my student graduated, went away and I thought, oh, you know, this is a good idea. It's a shame that we didn't carry it further. But there are literally dozens and dozens of papers that have cited it.

Speaker 1:

Recently, people have done their PhD theses confirming parts of this. We even had a study where someone hooked EEG sensors up to programmers' heads and scanned their brainwaves as they were shown different instances of these patterns, and they found different brain activity for some of these patterns, which is fascinating to me. That there's something biologically fundamental to the way we perceive different C code patterns and stuff like that is just amazing to me. And we've also seen amazing people from around the world take this and look for new atoms, look for atoms in different languages, and extend things out. So I really think we're in the very, very early days of what I think will be a very promising area of work that can have amazing impact: making languages more understandable, finding bugs in existing code, just making it so much easier for people to program and understand programs.

Speaker 2:

Now, is using these confusing patterns necessarily a bad thing?

Speaker 1:

Yes... well, I should clarify that. I won't say that every time you have a pattern it is always bad. I do not mean that, and I do not mean that every population of programmers will perceive every pattern the same. If you're conditioned to see a pattern many times, then (I believe this without evidence, without having done the test) you get used to it. The example I gave, a = b++: after seeing that hundreds and hundreds of times, over and over and over again, it gets more internalized into you.

Speaker 1:

But on the other hand, I think that the pattern a = b; b++, which is the non-confusing way to write that exact same thing, is much less confusing, much easier, has a much quicker uptake, and even requires less cognitive load from an experienced person. But there's always this thing where, you know, people love their list comprehensions in Python. They love their regexes, right? They love that code that's write once, read never, partly because it makes them feel clever and cool. But if you want code that's readable, then yes, you really should factor atoms out.

Speaker 2:

Do you think it's possible to design a language which has no confusing patterns?

Speaker 1:

Yes, it certainly is.

Speaker 2:

It feels a lot like Lojban, the constructed language that was designed to be unambiguous.

Speaker 1:

Oh, you can actually do this in a much easier way. I would argue, for anyone who took a complexity theory class and learned the language of a Turing machine, that that very simple language of what you do with the tape head (move left, move right, halt, whatever) is not confusing. When you see it, you know what to do. Now, on the other hand, that doesn't let you actually comprehend what the program is doing, because it's at such a low level, right? You're not able to even deal with integers; you're having to deal with these weird abstractions at a much lower level. And so, while I think a language can avoid containing atoms, I think it's not surprising that many higher-level languages do contain them, because they provide abstractions that are easier for the programmer to understand.

Speaker 2:

If that's really the case, though, isn't the definition of the atom not inclusive enough?

Speaker 1:

Well, if an atom is supposed to contain all confusion, then you're correct. But we know there are patterns and things you can write that we do not currently include in our definition of atoms, and for a good reason. The nice thing about looking at atoms and focusing on them is that they're largely unstudied by the broader community and largely misunderstood, so they're things that kind of go under the radar. Whereas, for instance, people have an idea that if you replace all your variable names with just a letter of the alphabet, that is going to be confusing, and they sort of know that things like variable shadowing, or shadowing built-ins and other stuff, are going to be confusing. They understand those types of things. So I don't feel like we have to grow our definition of atoms to include that. I think atoms are a very nice, self-contained class within the broader class of things that are confusing.

Speaker 2:

So, in response to the multiple millions of confusing instances you found: how did you deal with that? That's more than you can handle by going to every individual person and saying, please fix this.

Speaker 1:

Yes. So I'll tell you what we did not do, which is we did not go and fire off pull requests to change all of this (which we could have automatically generated), and any researcher who's thinking about doing something like that should not.

Speaker 1:

If you want to work with a community, especially a big open source community, you have to work with them, gain their trust, show you're going to be around a while, do good things slowly and do good things that have high impact and low kind of false positive slash annoyance rates.

Speaker 1:

And so more recently we started to work with some of the Linux kernel folks. There's an absolutely wonderful tool out of INRIA by Julia Lawall called Coccinelle that we've really started to explore, and we're looking forward to digging more into the tooling the Linux kernel developers use: going very surgically after very confusing, very rare atoms in the code that are very likely to have bugs associated with them, and fixing them manually, slow-walking this through the Linux kernel. Then over time we can start to build up trust and show the community we're not just here to give them a pull request that changes every 23rd line of the kernel or whatever it is. We're actually here to try to slowly make things better and understand their needs, and we'll slowly try to progress to a good final point, which will probably not be the removal of all atoms, but the removal of the worst offenders in the worst cases.

Speaker 2:

So before this I wasn't even aware of the concept of a confusing atom, let alone that one confusing atom could be more confusing than another. How do you quantify that?

Speaker 1:

Well, to quantify it, what you do is you just run a user study. We took, I think, 71 programmers, and we showed them different examples that had an atom, alongside equivalent code without the atom. If you have one pattern where, let's say, 50 of the 71 got it wrong, and another pattern where 40 of the 71 got it wrong, then when you do a statistical analysis they will have different effect sizes. There's a statistical measure called the effect size, and the effect size tells you how strong the effect of the change is. Some atoms have a very, very strong effect size.
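The interview doesn't say which effect-size statistic the study used; as one hedged illustration, here is Cohen's h, a common effect-size measure for comparing two proportions, applied to the hypothetical 50-of-71 versus 40-of-71 numbers above:

```c
/* Illustrative only: Cohen's h for two proportions. The actual study may
   have used a different statistic. Compile with -lm. */
#include <math.h>
#include <stdio.h>

/* h = 2*asin(sqrt(p1)) - 2*asin(sqrt(p2));
   rough guide: |h| ~ 0.2 small, 0.5 medium, 0.8 large */
static double cohens_h(double p1, double p2) {
    return 2.0 * asin(sqrt(p1)) - 2.0 * asin(sqrt(p2));
}

int main(void) {
    double atom_a = 50.0 / 71.0;  /* 50 of 71 programmers answered wrong */
    double atom_b = 40.0 / 71.0;  /* 40 of 71 answered wrong */
    printf("Cohen's h: %.3f\n", cohens_h(atom_a, atom_b));  /* ~0.29 */
    return 0;
}
```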

Speaker 2:

What's the largest-effect atom you've seen?

Speaker 1:

The largest-effect atom, ironically, is one that some programmers don't fall for.

Speaker 1:

I think the Linux kernel has this one everywhere, and there it's probably not a common source of confusion. It deals with what happens if you do things like put a leading zero in front of numbers. If you say printf, then %d, and you have comma 013, many programmers believe that prints 13, when actually the number that prints out is 11, because C converts it into octal. So it's very confusing. And, by the way, this isn't a C-only thing. In legacy JavaScript, if a literal starts with a zero and contains only octal digits, it's octal, but if it contains a non-octal digit, like a nine, then it's actually treated as decimal. So JavaScript does really weird things in this regard. But this is a common source of confusion for programmers, because a lot of them don't have this memorized, whereas I think a lot of the Linux kernel developers, who are used to doing a lot of bit masking and other odd things with numbers, are probably just used to it: aha, I know this is octal, and so on.
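A compact demonstration of the literal-encoding atom just described, with the JavaScript contrast noted in a comment:

```c
/* A leading zero makes a C integer literal octal. */
#include <stdio.h>

int main(void) {
    printf("%d\n", 13);   /* 13 */
    printf("%d\n", 013);  /* 11: octal, 1*8 + 3, not thirteen */
    /* 019 would not compile in C: 9 is not an octal digit. Legacy
       JavaScript instead silently treated literals like 019 as decimal. */
    return 0;
}
```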

Speaker 2:

So, on a related note, what's the craziest bug that you've seen as a result of one of these confusing patterns?

Speaker 1:

I don't want to answer that, because I don't want to pick on any project.

Speaker 2:

Okay, sure. How do you suggest people go and build projects that aren't confusing at all?

Speaker 1:

I mean, not confusing at all is a very high bar. But if you want to try to minimize it, there are many different sources of confusion, and there are things that people are sort of generally aware of. Of course you want to do things like run your linter. Your linter will catch a lot of confusing variable names that are too short, or things like that. It'll catch issues with shadowing: shadowing built-ins, shadowing other things. It'll catch a lot of other potential type-mismatch issues in certain cases, and other stuff like this.
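As a small illustration of the shadowing case a linter (or a compiler flag like gcc's -Wshadow) will flag; the function and names here are my own example:

```c
/* Variable shadowing: the inner `count` hides the outer one, so the
   function quietly returns 0. gcc -Wshadow and most linters flag this. */
#include <stdio.h>

static int count_even(const int *xs, int n) {
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (xs[i] % 2 == 0) {
            int count = 0;  /* shadows the outer count */
            count++;        /* increments the wrong variable */
        }
    }
    return count;           /* always 0 */
}

int main(void) {
    int xs[] = {2, 4, 5};
    printf("%d\n", count_even(xs, 3));  /* prints 0, not the expected 2 */
    return 0;
}
```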

Speaker 1:

But the two parts of confusion I think people aren't sufficiently thinking about are, first, at the very micro level, like the atoms we've been talking about, where there are programming patterns that you maybe need to work out of your repertoire.

Speaker 1:

So use fewer regular expressions, do not use nested list comprehensions for sure, and probably use list comprehensions a little more sparingly, when I'm talking about higher-level languages; and for C we have a whole host of little things, the atoms I've been discussing. But at the broader level, the other part of confusion that I think is a little underrated is API-level confusion. The way APIs are designed, the way they're defined, what they do with state, how they handle corner cases and things like this is extremely confusing for most programmers. Some of this is a feature of the language; some of this is just a feature of the way the API itself was designed. Like, when's the last time you checked the return value of a print statement? Right? People often don't expect things like that. They don't expect their APIs to fail or behave in weird ways, and we've done a whole other series of work looking at all sorts of failures and problems that occur when APIs, system calls, and other things like this don't behave as programmers expect.
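For what it's worth, printf really does have a return value that almost nobody inspects; a tiny sketch:

```c
/* printf returns the number of characters written, or a negative
   value on error (e.g., stdout is a closed pipe or a full disk). */
#include <stdio.h>

int main(void) {
    int n = printf("hello\n");
    if (n < 0) {
        return 1;  /* the print actually failed */
    }
    fprintf(stderr, "printf wrote %d characters\n", n);  /* 6 */
    return 0;
}
```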

Speaker 2:

So we have now some millions of confusions found, and you mentioned that you found a lot of bugs as a result of these confusions. I assume that basically every open-source project on the internet has got to be affected by these confusions, and by the bugs that result from them. You work on the software pipeline, right? How can we trust any piece of software?

Speaker 1:

I mean, you really can't, to be honest. You shouldn't. And so one of the key things that you really should do is design resilience into your system, design verification in. That's why a lot of the work we've done has really focused on not just assuming that you have a correct software pipeline, not just assuming that everything is going to go well. You have systems like in-toto that provide attestations for the different actions you take as you make your software and do things with it. And you have systems like TUF, which frankly assumes that someone is going to break into your repository, and tries to minimize the potential damage they could cause: if they compromise a developer account, if they gain root access on your repo, if they get into your HSM and get a signing key you had in there. And it gives you a secure path to recover from it. So what we really need is more systems that provide strong verification against adversaries, including internal adversaries, throughout the pipeline, and also ones that fail gracefully as they're compromised.

Speaker 2:

Can you elaborate a little about what attestations are?

Speaker 1:

So an attestation is in some ways like a signature, only an attestation also indicates the things that were used as part of the operation, the things that came out, and more about what happened. To use an example: let's say you download a piece of software from Microsoft. It will be signed by Microsoft with some sort of code-signing key. That tells you Microsoft says this is okay, but it really tells you almost nothing else.

Speaker 1:

If you use attestations, you'll generate them throughout your software supply chain. So, for instance, when you go to sign a Git tag to say, I want to build this and push this release out, you will create a signed attestation that says: these are the source files I'm taking in, I'm making this tag over them, and this is what's being produced. Then your build system will say: these are the files I took in from that tag, this is the operation I'm doing, here are the binaries I produced. And then your packaging system will take those binaries and package them up and say: this is the thing I produced, this is the actual package.

Speaker 1:

Also, things will happen like tests being run, and the fact that the test was run will be attested to. Then all of that information can be checked to make sure that your software went through all of the correct steps, that every process was done by the right people, that nothing was changed in between, and nothing was substituted. So for a situation like the SolarWinds attack, where someone was able to get into the build process and substitute a different source file in, if the in-toto attestation were generated by the compiler itself as part of the process, that would have actually detected that the substitution was made, and the attack would not have been effective in that specific case. So it has tremendous potential to provide a much higher degree of security, especially as you start using more hardened compilers and hardening other parts of the infrastructure.

Speaker 2:

What happens if there is an exploit in the attestation software?

Speaker 1:

Sure. What happens if there's an exploit in the attestation software is basically the same as what happens if there's an exploit in your signature software, which is: all bets are basically off.

Speaker 1:

And so what happens as part of that is that the code gets extremely heavily reviewed, and attacks do come up at times. You see flaws occasionally in signing-algorithm implementations and things, but they're relatively rare. I think the kind of 1980s crypto-nerd imagination of how attacks would be done was that it was going to be mathematical attacks on RSA and everything else, and it turned out for a long time instead to be effectively implementation attacks on crypto, along with attacks on source code. The attacks on source code became more and more common, and now people are very aware of even side-channel prevention and other things that are much more subtle, like power-analysis attacks. So now that we have trusted enclaves built by Intel into a lot of processors, or ARM TrustZone, or things like that, the level of trust you should have in your signing and attestation generation is pretty high.

Speaker 2:

How do you see AI changing your field?

Speaker 1:

So AI is a really messy thing. I think anybody who makes a prediction about AI is bound to be wrong, so let me go ahead and make my wrong prediction, while making it clear I have very little confidence in it. I think there are some things that are super obvious, which is that when it comes to deep fakes and convincing people of political messages and doing other stuff, from a psychological standpoint it's an absolute nightmare. I think AI is causing so many problems related to cybercrime and tampering and stuff like this that it's just going to be bad. There are people who think that AI is going to be able to design better algorithms or do other things like this. I think that is a little more far-fetched. At least where we are today, I don't believe AI is going to come up with, like, a better factorization algorithm.

Speaker 1:

Primality testing, yeah. I think AI is pretty good if you give it a very narrow, scoped problem: then you have very specialized AI that can do things like, for instance, play chess better than humans, right? But that is scoped to a very specific problem. When it comes to telling it to do something like design an encryption algorithm, or design a system that has a security property, or things like that, I think AI is going to be quite a ways off for quite a long time. And if it does get to the point where it can do that more effectively than humans, I don't know what jobs are left, to be honest. I think we're all going to be working in Amazon warehouses picking things out of bins, because robotic claws haven't quite gotten there yet. But maybe the AI will design better robotic claws.

Speaker 2:

So who knows? But you haven't seen it applied to, say, finding confusing patterns in more general cases yet?

Speaker 1:

We had someone who was interested in applying those techniques, and we were very interested to see it. So I don't know if or when that work is going to happen, but I'm very excited to see it. And by no means am I trying to say people shouldn't try to use AI for these types of things. I'm just saying that the way we designed and formulated our study, how we knew to find a way to get ground truth, come up with things, do the work and everything: I don't think AI is going to be at that level.

Speaker 1:

I've tried it. I've said, you know: hey, write a grant proposal, write a pitch for a paper. And it puts words in the right places, and it would probably fool a non-computer-scientist, or, you know, make a panel waste their time reviewing something they shouldn't review. But it's not coherent. It doesn't have an idea in it. It's, I think, quite easy to distinguish, at least for now.

Speaker 2:

Yeah. What issues do you see in the AI software pipeline?

Speaker 1:

AI software has a lot of issues, and a lot of these are very widely known. There are a lot of issues with poisoning of AI pipelines, with people going through and inserting malicious data. Or: what do you do if you have an image model that you've trained on images, and all of a sudden you find out that one of those images was of child pornography or something else like that, and it's in a model you've spent tons and tons of computational hours on? Or even what if it wasn't, but the model starts to generate that kind of content and you're not sure? So there are a lot of problems with those types of aspects. There are also problems of knowing how you got the stuff into your AI model, and knowing a model was correctly trained.

Speaker 1:

Given the fact that AI models are almost always trained using GPUs, and GPUs have lots of data races in them just by the way they work, by their parallelism, it's natural that when you train the same model on the same data, you get different things out at different times.

Speaker 1:

And so this sort of non-reproducibility of AI models is, I think, something of a looming disaster, at least from the standpoint of science. Because if I write a paper and say I have this great AI model that does X, and then someone else tries to replicate it, and they make an AI model using the same techniques and maybe even the same data source, and their AI model sucks, they can't say with any certainty that what I did is actually flawed. They can just say they didn't get the same result. And, you know, after you try it a few times it's somewhat less plausible, but it's still possible. So I'd like to see more ability to reproduce, which might require changes in the way that AI models themselves are trained.
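One low-level reason GPU parallelism changes results: floating-point addition is not associative, so the order in which a parallel reduction combines terms (an order that races and scheduling make nondeterministic) changes the answer. A minimal single-threaded sketch of the effect:

```c
/* Float addition is not associative, so reduction order matters. */
#include <stdio.h>

int main(void) {
    float a = 1e8f, b = -1e8f, c = 1.0f;
    float left  = (a + b) + c;  /* 0.0f + 1.0f = 1.0f */
    float right = a + (b + c);  /* c is absorbed: b + c rounds to b, giving 0.0f */
    printf("(a+b)+c = %g\n", left);   /* 1 */
    printf("a+(b+c) = %g\n", right);  /* 0 */
    return 0;
}
```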

Speaker 2:

Do you think reproducibility is an important enough problem to design chips around, like for the neuromorphic-accelerator startups to be working on?

Speaker 1:

I think there will be a market for reproducibility. I don't claim to know what that is.

Speaker 2:

I guess if you're training... like, are you willing to spend $200,000 on a chip specifically for the reproducibility?

Speaker 1:

No, you're definitely not willing to spend $200,000 more. But if it can get built into every chip, or half of the chips, and the half that have it cost $10 more, I bet the cloud providers are going to buy a lot of those, because I think there will be some demand.

Speaker 2:

And is reproducibility something that really exists in traditional software? I mean, certainly for small traditional software I can imagine reproducibility is very attainable. But for something like the Linux kernel... I mean, maybe the Linux kernel is also a bad example, since it's probably the single most studied piece of software in the world.

Speaker 1:

Okay, so there's the Reproducible Builds project, mostly done by folks in the Debian community, that we've participated in as well. It has actually built a substantial percentage of the Debian packages, including, I believe, the Linux kernel, in a reproducible way. I would want to look it all up, but I think it's close to 70-something, 80-something percent of Debian packages that are built reproducibly. By the way, did you know that if you just take Hello World in Java and you build it twice, you will get different binaries out?
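The interview doesn't go into why the two Java builds differ (embedded timestamps are one classic culprit). For a C-flavored illustration of the same class of problem: the __DATE__ and __TIME__ macros bake the build machine's clock into the binary, so two otherwise-identical builds are not bit-for-bit equal. (The Reproducible Builds project's SOURCE_DATE_EPOCH convention exists to pin these down.)

```c
/* build_stamp.c: compile this twice a minute apart and the two
   binaries differ, because __DATE__/__TIME__ embed the wall clock. */
#include <stdio.h>

int main(void) {
    printf("built on %s at %s\n", __DATE__, __TIME__);
    return 0;
}
```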

Speaker 2:

I did not know that about Java. But it doesn't surprise you?

Speaker 1:

Yeah. Okay, it surprises a lot of people.

Speaker 2:

The same is true of most programming languages... well, Java's got the full JVM. I know for C the difference in binaries depending on the compiler is usually pretty small, and probably depends on, like, the register allocation, because, I don't know, some register allocation...

Speaker 1:

It can be that, among other things. There's a whole bunch of factors that can go into it. There are lots of things that cause reproducibility problems in code.

Speaker 2:

I mean, a lot of compilers these days have machine-learning-based optimizers.

Speaker 1:

Yeah, and so when you have things like that, the optimizer itself, when it runs, can be reproducible, right? And it could be the case... well, unless it depends on a giant neural network that's using your GPU.

Speaker 1:

But yeah, it's usually when the model is built that it becomes nondeterministic, I believe. If you're running the model itself on the GPU as you categorize things, then you will have that problem. If you run the model in a single-threaded way on your processor, then that's less of a concern. It's just that the model itself is probably the most...

Speaker 2:

Most people are not using GPU-optimized compilers.

Speaker 1:

Probably not, yeah. But it's a problem, and there's some debate about how serious the problem is. I think it's clear that in a world where you can't do reproducible builds, it's very hard to know there isn't a backdoor in software. Because, frankly, the delta between checking to see if the thing is bit-for-bit identical and doing anything else is massive. Even if there's one tiny field you also need to check, even if it might just be the order of things in the zip file that got put in a different order, any of that is just not something that almost anyone will do. Almost zero people on earth will check any of that. So bit-for-bit identical is really, I think, where industry and others need to go.
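The bit-for-bit check itself is the easy part; a minimal sketch of comparing two build artifacts byte by byte (everything subtler, like ignoring zip entry order, is where verification falls apart):

```c
/* bitcmp.c: exit 0 if the two files are bit-for-bit identical. */
#include <stdio.h>

int main(int argc, char **argv) {
    if (argc != 3) {
        fprintf(stderr, "usage: %s file1 file2\n", argv[0]);
        return 2;
    }
    FILE *f1 = fopen(argv[1], "rb");
    FILE *f2 = fopen(argv[2], "rb");
    if (!f1 || !f2) { perror("fopen"); return 2; }
    int c1, c2;
    do {
        c1 = fgetc(f1);
        c2 = fgetc(f2);
        if (c1 != c2) { puts("DIFFERENT"); return 1; }
    } while (c1 != EOF);  /* c2 reached EOF at the same time */
    puts("bit-for-bit identical");
    return 0;
}
```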

Speaker 2:

So in order to get that reproducibility, all of the dependencies need to be reproducible too, presumably?

Speaker 1:

As long as you pull the dependencies from the same source, then you're fine. If you're rebuilding the dependencies yourself, then they do have to be reproducible, you're correct. So it sort of depends on the ecosystem. In an ecosystem where you pull all source, you'll have that problem. In an ecosystem where you pull binaries and then build, you're probably okay.

Speaker 2:

So that 30% of packages that aren't reproducible: those could be popular dependencies, popular packages?

Speaker 1:

Yes, some of them very much are. When we've worked with the Debian community and others (it's not just Debian, but when we worked with the Reproducible Builds community) to try to address these issues, we wanted to target the most popular ones, but often they had already been tried multiple times. That was, I think, one of the first things that community did: go after a lot of the more popular ones. And, as is unsurprising, the popular ones tend to be the most complicated and have the most custom build environments, the most custom setups. They just tend to be very... what's the word I'm looking for? I know the word. Complex? No, no, no, it's like... bespoke. Bespoke, yes. They tend to have very bespoke environments, and so it's not surprising that they have a lot of the reproducibility problems, and it's hard to get them to change.

Speaker 2:

What do you suggest people do to get their projects to be more reproducible? Because this is a very abstract concept to me. I imagine just checking whether it's reproducible means building it numerous times and literally checking, but how do you go about designing it to be reproducible?

Speaker 1:

There are a lot of things that you basically have to do, and most of it is actually in your toolchain, not so much in your code. There are places in your code where you perhaps need to make a change, but by and large it's your toolchain. You need to make sure that the way you read in files from a directory, for instance, is not in the order that the file system gives you the directory entries, but in some canonicalized order, like alphabetical order, because different file systems will store the same set of files and directory entries in different orders, and you'll run into all sorts of other interesting problems. In fact, we've done a whole massive bit of work looking at interesting operating-system-specific differences in software. Did you know that about 60% of the bugs found in a software project's lifecycle, for the average released software, come after the software is released?
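A sketch of the directory-order point in C, assuming a POSIX system: readdir() hands back entries in whatever order the file system stores them, while scandir() with alphasort yields a canonical alphabetical order.

```c
/* List directory entries in canonical (alphabetical) order instead of
   file-system order, one small ingredient of a reproducible build. */
#define _DEFAULT_SOURCE  /* for scandir/alphasort on glibc */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    struct dirent **entries;
    int n = scandir(".", &entries, NULL, alphasort);
    if (n < 0) { perror("scandir"); return 1; }
    for (int i = 0; i < n; i++) {
        printf("%s\n", entries[i]->d_name);
        free(entries[i]);
    }
    free(entries);
    return 0;
}
```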

Speaker 2:

Are there any major incidents that have been attributed to people not checking signatures?

Speaker 1:

I mean, the fact is, nobody checks them. So, certainly, if you had checked signatures when someone substituted a malicious package, as has happened many times with PyPI, then you would have caught it. But the problem is that nobody's signing their packages and nobody's checking signatures.

Speaker 2:

So I guess, on the flip side... it's hard to attribute anybody not getting hacked, I guess. Are there cases where people were getting hacked so regularly that they started adopting signatures and it stopped?

Speaker 1:

Oh yeah. Well, a good example of that is what happened with Let's Encrypt: how we went from HTTPS being only like 20% of websites to now it's like over 80 or 90%. The overwhelming default now is that when you go to a website, you're doing TLS. So that's great.

Speaker 2:

Yeah, and I guess that has prevented a huge number of attacks?

Speaker 1:

Yeah, and it's hard to know, because one of the things about designing a really good defensive system is that a lot of times you just don't know that you prevented an attack. It's one of the really frustrating things about TUF and in-toto in particular. Both are quite widely used across lots of different domains, and when we try to argue that they're effective, we go and look at lots of past attacks, dozens of past attacks, and we find that in many cases none of those attacks would have been able to happen had people been using TUF and in-toto. That doesn't mean they're perfect; it just means they would have prevented the sorts of things that occurred. So I don't know how many attacks they have prevented.

Speaker 2:

What project are you most proud of?

Speaker 1:

Wow, that's asking, like, which child do I like the most? I'm really proud of everything I've done, even the things that failed. The thing I'm most proud of overall, though, is that I've never really taken on a project just to write a paper, or just to try to get grant money, or just to do whatever. I've always really focused on how I can make a positive difference in the world, and I've always believed very firmly in everything I worked on. At times I found out we were wrong later; either the market moved somewhere else, or we misunderstood how pressing the concern was, or something else came along. But I'm very proud that I always came into it with the intention of actually helping people.

Speaker 2:

How did you get started in this line of work?

Speaker 1:

Well, to be honest, my mom tells stories about things I did when I was little that kind of clued her that I had a security mindset and was thinking in an adversarial way about things. So I don't know. I've actually never taken a security class, because they didn't offer one when I was a graduate student at the University of Arizona, but I've always been fascinated by it and always loved to build things that have interesting properties, and many of those have happened to have security properties. So I've just sort of been pulling on threads repeatedly, never giving up, and this is where it took me.

Speaker 2:

I guess your parents must have done something related to engineering?

Speaker 1:

Well, my dad effectively did statistics and things related to yield for different factories and manufacturing, and my mom was an elementary school teacher. So... a little bit, but not too much either. My mom did always love math and taught, you know, fourth- and fifth-grade math and things like that.

Speaker 2:

But yeah, I say that because I don't think my parents or grandparents would even know that I got a degree related to security research. So what drives you now?

Speaker 1:

Same as always. I see so many problems, and I see them as so intensely important in the world today. In certain cases, like some of the work we do with automotive and Internet of Things devices, literally people will die if we do not do a good job of solving problems in an effective way. And even though the line is a little more indirect when you look at things like what's impacted if Git security is poor, or if there are other software supply chain vulnerabilities, I think the line from that to actually impacting people's lives is not very far. So I think we should absolutely be leaving the world a better place, and that's what I aim to do.

Speaker 2:

Have you seen vulnerabilities that have actually affected people's lives?

Speaker 1:

Yeah, there are a lot of vulnerabilities and lots of attacks that have done this. There are numerous incidents of ransomware hitting hospitals. There are incidents of people going after different infrastructure. We had all of the attacks that happened with allegedly North Korean hackers going after infrastructure; they've gone after infrastructure around the world, but particularly in South Korea, and things like this. There are absolutely people who have died due to cyber attacks, and it's not that any one country is responsible for this. Heck,

Speaker 1:

some of the biggest cyber attacks have been carried out by the United States, like the Stuxnet and Flame malware attacks in Iran and other things like that. But it's a very scary world, and my hope is that people get a handle on things, especially around infrastructure that is more Internet-of-Things-like: vehicles, automobiles, airplanes, military vehicles, medical devices. I think there are ways in which things can be improved, and I'm very pleased to see that a lot of the industry is starting to move in that direction with increasing urgency over time.

Speaker 2:

When we released some of our first work on making neural networks robust to adversarial examples, the first people to use it were not people trying to make neural networks safer; it was people trying to make CAPTCHA-detecting systems, you know, more robust. Do you ever worry that the work you're doing to make software safer is going to be used to actually attack it more robustly?

Speaker 1:

In some ways, us presenting a design that has good security properties is a roadmap for attackers to look at why we did our design the way we did and figure out why other designs are not as secure. But that, I think, isn't an excuse for us not to provide something good and then work our damnedest to try to get people to use it. And fortunately, in a lot of these cases, we've been very effective. But yes, at some level you can only help people so much and try to guide them so much, and at some point some other interest is going to take over, some corporate interest or some business interest or some other reason why a company may want to go a different direction. And then, when they get hacked, someone's going to reach out to me, and I'm going to say: well, I don't know, they've known about this problem for five years; it's all on the record. So I feel bad about it, but it's definitely not better to just have everyone have crappy security, because people will still figure it out.

Speaker 2:

It reminds me of the 90s, just expecting that your computer would get hacked, expecting that you would download a virus. And right now, it's weird: you expect to not have to worry about it.

Speaker 1:

In general you don't, and the reason certain things have gotten much better... one of them is that you actually get operating system updates and software updates very frequently.

Speaker 1:

You probably use Chrome or Brave or some other software like that, and it'll give you an update every few days to patch the latest vulnerabilities. And most attacks, most of these issues, are just not zero-days. They're usually known vulnerabilities that would be fixed if you had just bothered to update. The other thing is that organizations have done a pretty good job of learning how to write more secure code; the testing and things have gotten a lot better, and this is due to a lot of effort, both from academia but really from industry. Microsoft especially has done an enormous amount of work on testing and making things available to the community, and it's been very helpful in building more secure software. And other organizations and companies came along and also did tremendous things in that regard.

Speaker 2:

What would you suggest for people who want to make their systems more secure right now? How would you suggest they do that?

Speaker 1:

So if you're a developer, the OpenSSF has standards and guidelines that you should follow. And if you're developing an open-source software project, you can also go through several different processes; there's a badging process and other things you can use to check your software and find out what deficiencies it has. You can apply for different levels of badging that you can put on your GitHub page for your project, so people can know what level of standards you've met. If you're a home user, just a normal person, for your own hygiene: do your software updates, that's absolutely number one. Use a password manager, that's number two. And use multi-factor authentication, not through SMS if you can avoid it. If you do those three things, in general you'll be in a pretty good place.

Speaker 2:

Well, thank you very much for coming on. Thank you.
