What does it take to build a fully autonomous AI system that can find, verify, and patch vulnerabilities in open-source software? Michael Brown, Principal Security Engineer at Trail of Bits, joins us to go behind the scenes of the 3-year DARPA AI Cyber Challenge (AICC), where his team's agent, "Buttercup," won second place.
Michael, a self-proclaimed "AI skeptic," shares his surprise at how capable LLMs were at generating high-quality patches. He also shares the most critical lesson from the competition: "AI was actually the commodity." The real differentiator wasn't the AI model itself, but the "best of both worlds" approach: robust engineering, intelligent scaffolding, and using "AI where it's useful and conventional stuff where it's useful."
This is a great listen for any engineering or security team building AI solutions. We cover the multi-agent architecture of Buttercup, the real-world costs, and the open-source future of this technology.
Questions asked:
00:00 Introduction: The DARPA AI Hacking Challenge
03:00 Who is Michael Brown? (Trail of Bits AI/ML Research)
04:00 What is the DARPA AI Cyber Challenge (AICC)?
04:45 Why did the AICC take 3 years to run?
07:00 The AICC Finals: Trail of Bits takes 2nd place
07:45 The AICC Goal: Autonomously find AND patch open source
10:45 Competition Rules: No "virtual patching"
11:40 AICC Scoring: Finding vs. Patching
14:00 The competition was fully autonomous
14:40 The 3-month sprint to build Buttercup v1
15:45 The origin of the name "Buttercup" (The Princess Bride)
17:40 The original (and scrapped) concept for Buttercup
20:15 The critical difference: Finding vs. Verifying a vulnerability
26:30 LLMs were allowed, but were they the key?
28:10 Choosing LLMs: Using OpenAI for patching, Anthropic for fuzzing
30:30 What was the biggest surprise? (An AI skeptic is blown away)
32:45 Why the latest models weren't always better
35:30 The #1 lesson: The importance of high-quality engineering
39:10 Scaffolding vs. AI: What really won the competition?
40:30 Key Insight: AI was the commodity, engineering was the differentiator
41:40 The "Best of Both Worlds" approach (AI + conventional tools)
43:20 Pro Tip: Don't ask AI to "boil the ocean"
45:00 Buttercup's multi-agent architecture (Engineer, Security, QA)
47:30 Can you use Buttercup for your enterprise? (The $100k+ cost)
48:50 Buttercup is open source and runs on a laptop
51:30 The future of Buttercup: Connecting to OSS-Fuzz
52:45 How Buttercup compares to commercial tools (RunSybil, XBOW)
53:50 How the 1st place team (Team Atlanta) won
56:20 Where to find Michael Brown & Buttercup
Ashish Rajan: [00:00:00] What is AICC and why does it take three years to run this thing?
Michael Brown: The goal of the competition is for competitors to build fully autonomous AI driven systems that can find and patch vulnerabilities in open source software.
Caleb Sima: It was fully autonomous, right? You guys cannot debug it or fix it or do anything, right?
Michael Brown: While the actual competition's running, you're just drinking coffee. I have to be honest, I've been a bit of an AI/ML skeptic for a long time, and I've been really blown away at how good these models have actually become. But ultimately, AI was actually the commodity. The answer is, one, you have to put a lot of engineering work in, and two, use AI where it's useful and use conventional stuff where it's useful.
Ashish Rajan: One of the biggest AI security challenges just finished at BlackHat, or Hacker Summer Camp, a couple of months ago. And yes, in this particular challenge, you had to have autonomous AI find vulnerabilities, verify the vulnerabilities, and patch them, all three done autonomously by an AI agent. All DARPA, the agency behind running this, would do is send API requests to trigger the challenges, and from that point onwards, once the round starts, it just [00:01:00] lets the system take care of the entire thing.
I was fortunate enough to speak to Michael Brown, who is a principal security engineer and the head of AI/ML security research at Trail of Bits, which won second place with their project, Buttercup. Yes, we talk about why the project was called Buttercup as well, and how much they ended up spending on LLM models.
What were the secrets behind the scenes? Obviously, all of the projects were made open source, so all of you can access them, and I'll leave a link in the show notes for you to check out the winner as well as the third-place finisher, who came up through the semifinals. By the way, the whole thing took three years.
I asked the question: why did it take three years, when so much has happened with LLMs in between? So we went into the weeds of this, as well as how the LLMs worked, what the scaffolding was like, what they would do differently, and what they learned along the way. So Caleb and I had a chance to distill a lot of that information from Michael in this episode.
All that information would be helpful for you if you are trying to use AI for your own cybersecurity functions within an enterprise. You can understand the scaffolding and how the engineering works [00:02:00] behind these projects and use that internally. If you know someone else who's trying to do similar things with their cybersecurity function in their organization, definitely share this with them.
We also talk about the reality of this as well, so they might want to know how realistic it is for someone to build this for themselves and how much investment is required to build an autonomous, multi-agent AI security system in their organization. So definitely share this episode with them. And as always, if you have been listening to or watching episodes of the AI Security Podcast, and this is maybe your second or third or fourth time, we really appreciate your support.
Thank you so much for supporting us. And if you can take one moment to make sure you're subscribed on YouTube or LinkedIn, if that's where you watch these episodes, or to follow and subscribe on iTunes or Spotify if that's where you're listening, it only takes a few seconds, and thank you so much for supporting us in the work we do.
I hope you enjoy this episode and I'll talk to you soon. Peace. Hello, and welcome to another episode of the AI Security Podcast. I've got Caleb, obviously, and Michael Brown. Michael, welcome to the show, man.
Michael Brown: Yeah, thanks for having me. It's great to be [00:03:00] here.
Ashish Rajan: Maybe to kick it off, I'm super excited to hear about everything that has been leading up to this moment for you and Trail of Bits as well.
So if you can start off with a bit of introduction about your professional background. What are you up to these days? How'd you get there?
Michael Brown: Sure. Yeah, so my name's Michael Brown. I'm a principal security engineer at Trail of Bits, and I lead our company's AI/ML security research group. We focus really on two kinds of security. We focus on using AI/ML techniques, including large language models and also more good old-fashioned AI techniques, to try to solve security problems that have traditionally been really hairy and challenging for conventional methods to solve. And then we also do some work at the intersection of trying to secure AI/ML-based systems, so building security tools for them, like AI bill of materials tools.
Ashish Rajan: Frontier models, you mean, or actual AI systems used by companies as well?
Michael Brown: No, it's systems that integrate AI into them. So larger systems that use AI as one or more sub-components.
Ashish Rajan: Awesome.
And I [00:04:00] think to set the context, because how we ended up meeting was the AICC, what is AICC and why does it take three years to run this thing?
Michael Brown: Yeah. So the AI Cyber Challenge is a large open competition that's been run by DARPA for the last two years. They announced it around this time, two summers ago. And it's been a two-round competition sponsored by DARPA and ARPA-H, challenging different teams to create fully autonomous, AI-driven systems that are capable of finding and patching vulnerabilities within open source software.
Caleb Sima: I also want to rewind a little bit. Wait, it takes three years.
How does this work? So what's the, you know, what are the rules, what's the background and how do you progress? It's like, is everyone
Ashish Rajan: progressing together?
Caleb Sima: Do you enter only once? So you have to enter three years ago and then you can no longer enter? Or, like, what's the... it sounds like a talent show. Sorry.
Yeah, there's a lot of questions around
Ashish Rajan: this. There's a lot of questions here. Sorry.
Michael Brown: Yeah, sure. So the way [00:05:00] DARPA created and structured the challenge: shortly after they announced it, they announced that the competition was open and that competitors were free to register.
While the registration process was ongoing, there was actually a kind of round zero of the competition, where small businesses were encouraged to submit white papers or concepts for the type of system they wanted to build under this competition. DARPA selected the top seven concepts and gave them a $1 million prize to essentially act as seed funding.
This is kind of important because small businesses like Trail of Bits have a lot of talent, but they don't always have the financial flexibility. We're running a business; we can't just take a bunch of people off revenue-generating work, so to speak, to go off and compete. In some cases we could maybe find one or two people we could do that with, but not a team large enough to compete in something at this scale. So DARPA had this competition, [00:06:00] Trail of Bits submitted a concept paper, and I was the lead author and lead designer for the system. I built it, or at least designed it, with my co-author, Ian Smith. We submitted this at the very beginning of the competition and won, so we actually had seed money to help us build the first version of Buttercup, which competed in the semifinals. The semifinals were held in 2024 at Defcon, and this is where all of the open competition teams competed: both the seven small business teams that had been given seed money and also 32 other independent, at-risk teams.
So from these 39 total teams, the top seven scoring teams advanced to the finals, which were held this summer at Defcon. So you can kind of think of it as spanning three hacker [00:07:00] summer camps: it was announced at the 2023 edition of hacker summer camp, the semifinal round occurred in 2024, and the finals occurred in 2025. And then at the very end of this, DARPA crowned three winners, and Trail of Bits was one of those winning teams. We came in second place, along with our fellow competitors at Team Atlanta, which is a group from Georgia Tech, GTRI, KAIST, and Samsung Research, as well as the third-place finisher, a company named Theori, which is predominantly based in South Korea but also has a presence in Austin, Texas.
Caleb Sima: And what was the competition? What's the goal?
Michael Brown: Yeah, so the goal of the competition is for competitors to build fully autonomous AI driven systems that can find and patch vulnerabilities in open source software. So the idea here is that the government and, and pretty much everybody has a really strong vested interest in making sure that open source software is secure.
Otherwise we get stuff like Log4j, where we all [00:08:00] start using these really popular, really easy to use, and frankly thanklessly maintained software packages, then something goes haywire, and it turns out everyone spends an entire month scrambling to figure out: okay, do we actually use this thing? Where do we use it? Are we vulnerable? And then we all incur these huge costs to remediate these vulnerabilities. Largely because these software packages, once again, are thanklessly maintained by one or two primary contributors, and they don't have the time. They have day jobs, they have families, they have the myriad other things that distract us all day to day. We don't have time to spend all day trying to be proactive about hunting bugs in these libraries. So the way DARPA structured the competition is they would give us challenge problems in the form of an open source software repository.
And it had a reliable build script and a reliable fuzzing harness, or dynamic analysis harness. The idea was that the cyber reasoning system, this fully autonomous [00:09:00] system, can't have any human interaction whatsoever other than us turning it on and placing it into a ready state. Then their competition API sends these challenge problems. They are basically a Git repo with some amount of Git history, with the ability to build and rebuild the software with different types of sanitizers built in, and also the ability to run inputs against these challenge problems. The idea being that if you find a vulnerability, you can then prove that it exists, because you have some crashing test case or some input you can feed into the program that demonstrates the issue.
And then after that, you're tasked with creating a patch that remediates the vulnerability. It's also held to a higher standard: it has to not only fix the vulnerability, but also not break any existing functionality of the program. And to a certain degree, while this wasn't actually involved in the scoring, it was talked about a lot: there was a heavy emphasis and desire for these patches to be reasonably close to something [00:10:00] that you could submit as a pull request to an open source repository. The idea was that we didn't want these obviously AI-generated or sloppy-type submissions. We wanted them to be something that would actually reduce the burden on open source software maintainers, giving them something they could realistically take as a 95% solution for a PR, to a problem that was demonstrated to actually exist and that could be reproduced. So the idea is to have a really, really low false positive tolerance, in other words high precision, for automated vulnerability discovery.
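To make that flow concrete, here's a minimal sketch of the kind of check being described: a proof-of-vulnerability input has to crash the sanitizer-instrumented build, and the candidate patch has to both stop that crash and leave existing functionality intact. The build and harness commands below are hypothetical placeholders for illustration, not DARPA's actual competition API.

```python
import subprocess

def run(cmd: str) -> int:
    """Run a shell command and return its exit code."""
    return subprocess.run(cmd, shell=True).returncode

def check_submission(repo: str, pov_input: str, patch_file: str) -> str:
    """Illustrative check of a (proof-of-vulnerability, patch) bundle.
    Assumes hypothetical build.sh / run_harness.sh / run_tests.sh scripts."""
    # 1. Build the unpatched target with a sanitizer; the PoV input must crash it.
    run(f"cd {repo} && ./build.sh --sanitizer=address")
    if run(f"cd {repo} && ./run_harness.sh {pov_input}") == 0:
        return "rejected: input does not demonstrate the vulnerability"

    # 2. Apply the candidate patch and rebuild.
    if run(f"cd {repo} && git apply {patch_file}") != 0:
        return "rejected: patch does not apply"
    run(f"cd {repo} && ./build.sh --sanitizer=address")

    # 3. The same input must no longer crash the patched build...
    if run(f"cd {repo} && ./run_harness.sh {pov_input}") != 0:
        return "rejected: patch does not fix the crash"

    # 4. ...and the existing functionality tests must still pass.
    if run(f"cd {repo} && ./run_tests.sh") != 0:
        return "rejected: patch breaks existing functionality"

    return "accepted"
```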
Caleb Sima: Does that also mean things that are not, quote unquote, code related or config related would be out of bounds, right? Like if you create some other process or system to virtually patch something, that is eliminated as part of the rules.
Michael Brown: Yeah, those kinds of things weren't allowed. You actually had to do focused and targeted vulnerability identification and patching. So you weren't [00:11:00] allowed to just scan the entire code base and add a little extra security code here and there and get credit for fixing the bugs. Because, as with everything we all want in software development, we want small PRs that handle very specific, well-documented issues with traceability through them. You have to give credit to DARPA and ARPA-H for the way they structured the competition. They've really made it so that the tools that we and our fellow competitors built are actually pretty close to being usable. They aren't these blue-sky, if-only-the-rest-of-the-real-world-looked-like-this solutions. It was very much rooted in: okay, what's the reality of the situation affecting the security of the open source software community?
Caleb Sima: So how did they score it? What were the metrics you guys had to achieve?
Michael Brown: Yeah, so in the semifinals it was pretty simple. If you found a vulnerability, you got two points, and if you could fix the vulnerability, which required [00:12:00] fixing the original vulnerability and not breaking anything else along the way, you would get six points. So patching was heavily weighted versus vulnerability discovery, and that was pretty simple scoring. The finals introduced some other scoring wrinkles. In the semifinals you had to find a vulnerability before you were allowed to patch. They relaxed this a little bit in the finals, where you could try to patch a vulnerability even if you couldn't prove that it existed. But in order to give you an extra bonus for being able to prove a vulnerability existed, they kept the same scoring, where you'd still get two points for a vulnerability and six points for a patch, but if you could bundle, or associate, a proof of vulnerability (the crashing test case) along with the patch that fixes it, you got bonus points. So essentially, if you were able to do all the steps in the chain, you were able to get additional points. They also added a couple of other wrinkles where the longer your cyber reasoning system took to find the vulnerability and patch it, the more you would [00:13:00] bleed away fractions of a point over time. So basically, if you were to find something in the first half of the processing window for a challenge problem, say you have 12 hours to process a challenge, anything you find within the first six hours is worth the full amount of points. But if you find something at the 11th hour and 59th minute...
Caleb Sima: Yeah.
Michael Brown: That vulnerability was only worth a point, and the patch similarly less. It's like...
Caleb Sima: It's like a budget that kind of gets drained, like a compute budget almost, but a point budget. Yeah, that's interesting. That's an interesting way of doing it.
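As a rough illustration of the scoring just described: two points for a proven vulnerability, six for a valid patch, a bonus for bundling the proof with the patch, and value bleeding away in the second half of the processing window. The bonus size and the exact decay curve aren't spelled out in the episode, so the numbers below are assumptions chosen to match the "two points decays toward one point at the last minute" example.

```python
def score_submission(found_pov: bool, valid_patch: bool, bundled: bool,
                     elapsed_hours: float, window_hours: float = 12.0,
                     bundle_bonus: float = 2.0) -> float:
    """Hypothetical reconstruction of the finals scoring described above.
    bundle_bonus and the linear decay to half value are assumptions."""
    points = 0.0
    if found_pov:
        points += 2.0           # proving a vulnerability exists
    if valid_patch:
        points += 6.0           # patch fixes the bug without breaking functionality
    if bundled and found_pov and valid_patch:
        points += bundle_bonus  # extra credit for tying the proof to the patch

    # Full value in the first half of the window, then a linear bleed-off
    # down to half value at the very end (so 2 points decays toward 1).
    half = window_hours / 2.0
    if elapsed_hours > half:
        overrun = min(elapsed_hours - half, half)
        points *= 1.0 - 0.5 * (overrun / half)
    return points

# e.g. a bundled PoV + patch submitted at hour 11 of a 12-hour window:
# score_submission(True, True, True, elapsed_hours=11.0)  ->  about 5.8 points
```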
Michael Brown: They just put an emphasis on trying to get things done more quickly. And I think that's because they wanted to be able to show that the technology we were building would perform better than what exists today. Probably the most successful tool for improving open source software security today is OSS-Fuzz, and to a certain degree the cyber reasoning systems we built are built on top of OSS-Fuzz. So they wanted to be able to show that not only could we do more with [00:14:00] AI if we added it into the mix, but also that we'd be able to do things faster.
Caleb Sima: And it was fully autonomous, right? So this was: call the API, the thing just cranks and submits. You guys cannot debug it or fix it or do anything, right?
Michael Brown: Yeah, it's super weird. It's like the exact opposite of a CTF-style competition, where you don't really know what's gonna be there, so you do a little bit of preparation work, but it all matters in the 12 hours you have to do the competition. This is the exact opposite: all the preparation was paid for upfront, and while the actual competition's running, you're doing nothing.
Caleb Sima: Yes, which is the way we all imagine AI should be, though. It allows you to... I mean...
Ashish Rajan: You finished your project last hacker summer camp, 'cause you just had to. I mean, were you allowed to make changes between the semifinals and the finals?
Michael Brown: Yeah, so the development cycles were actually pretty short. For the semifinals, like I said, they were running this white paper concept competition, and it took them a little while to get that out. It also took them a little while to get the [00:15:00] infrastructure for the competition up and running. So we actually weren't able to really do work on the first version of our cyber reasoning system, which we call Buttercup, until about April. And the due date was mid-July, so we functionally had three and a half months to build the first version of Buttercup and have it compete. Things were a little bit more forgiving the second time around, because DARPA already had some of their infrastructure built up.
And we'd also done this for a round, so things were just a little bit easier. But even then, we weren't able to start development of the second version of Buttercup until January of this year, and the due date this time around was at the end of June. So we built the second version of Buttercup, the one that took second place, in only six months of calendar time. It's very compact.
Caleb Sima: I gotta ask: Buttercup. How did it get named that?
Michael Brown: Man, I wish I had a great story for this, but I just don't. Sometimes it's just the way it worked out. So here's the story. When I originally [00:16:00] wrote the concept paper, I was calling it Patchy, 'cause my kids were obsessed with the movie WALL-E at the time, and it just seemed natural. But as time went on, we started thinking about, okay, what are we actually gonna name this thing? We were trying to come up with a name that was memorable, that was easy to pronounce, and that, frankly, would just be easy to market with.
And, um, somebody was a fan of Princess Bride on the team.
Caleb Sima: I knew it. I knew it was Princess Bride. That was my prediction, that it had something to do with The Princess Bride.
Ashish Rajan: Wait, wait, context for people without kids. What's The Princess Bride?
Michael Brown: The Princess Bride is a movie that came out like in the eighties and it's like a cult classic.
Like most cult classics, it did poorly in theaters but is beloved by the generations after. It's actually a story within a story, a grandfather telling his grandson a bedtime story, about a fantasy princess and her one true love, and his journey to find her and rescue her from the powers that be.
Caleb Sima: And I would also very much like to note, Ashish, that The Princess Bride is not just for [00:17:00] people with kids.
In fact, the first time I saw The Princess Bride, I definitely did not have kids. And you clearly haven't seen it, so you have to go watch it today. Tonight, you need to go watch it. It's a classic. In fact, during the pandemic, have you seen this, Michael? When everyone was in lockdown, they got a whole bunch of celebrities, like 50 to 80 of them, and recreated the entire Princess Bride word for word in shorts, from home. Did you see this, Michael?
Michael Brown: I did. I did. We were definitely desperate for content during those dark days.
Caleb Sima: It was awesome.
Ashish Rajan: It was phenomenal, though. Yeah. So, Buttercup, which is obviously now open source: what was your original concept, out of curiosity?
Michael Brown: Yeah, the original concept was probably a little more ambitious. We had a lot more moving parts, but we had to write the concept before DARPA really defined what the competition was going to require and what the rules were gonna be. So [00:18:00] they kind of just said: we want to autonomously find and patch vulnerabilities in open source software, give us your wild-ass ideas, and if we vibe with one of them, we'll give you a million bucks. I'm kind of overstating it, or saying it a little more simply, but that's kind of what it was. So it's funny, whenever I give talks on this topic, I show the original concept, and it's got blocks with lines drawn between them, this beautiful flow diagram, and then I show a version of what we actually built, and like two thirds of it has a big X through it and got thrown out.
Largely because, for example, one of the design decisions DARPA made was that they required us to prove that a vulnerability exists. So you can't just run a static analyzer, have it say, yeah, there are bugs on these 100 lines of code, and then just accept a false positive rate of 80%. We've already got that; that's not really pushing the boundaries, and it doesn't help us, because people ignore those anyway, since alert fatigue is a thing. They really wanted to make sure that when these cyber reasoning systems found something, it actually [00:19:00] existed, so you had to prove it. That meant a lot of the components I originally put into the design to do static analysis just went out the window, because they weren't going to be doing what the system needed to do. We still use them in more of a support role for our fuzzing efforts, but they became a secondary or supporting character versus a first-class workflow through the system.
There's also some stuff we wanted to do that we thought was clever, but DARPA deemed it too clever and made it against the rules. This was things like making specialized databases of context for the LLMs to use. They deemed that to be a little bit too much like studying just for the test. And to their credit, they did this because they wanted to make sure the cyber reasoning systems weren't overfit to just the problems they were gonna show us; they wanted this to have a reasonable chance of working in the real world when the whole competition was done. So I'm actually kind of glad they did it.
But we were fully prepared to exploit a [00:20:00] weakness or softness in the rules if it had been allowed to persist. So some of that stuff came out. But ultimately, the way Buttercup works now is as a pipeline system. We broke the problem down into four sub-problems: finding a vulnerability; generating context about the vulnerability so that you can do patching; patching the vulnerability, which is the third thing we have to do; and finally an orchestration component, which is effectively and efficiently allocating your resources to those three stages. The problem kind of funnels down, because you have to find a vulnerability first, or at least in the semifinals you did. We basically kept that constraint on ourselves for the finals because we knew it would help us score more points. So we have a lot of resources upfront for vulnerability discovery.
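A highly simplified sketch of that pipeline shape: four sub-problems wired in sequence, with an orchestrator allocating work. The names and stub bodies are placeholders for illustration, not Buttercup's actual code (which is open source and considerably more involved).

```python
from dataclasses import dataclass, field

@dataclass
class Candidate:
    """A potential vulnerability flowing through the pipeline stages."""
    harness: str
    crashing_input: bytes | None = None
    context: dict = field(default_factory=dict)
    patch: str | None = None

def find_vulnerabilities(challenge):
    """Stage 1 (stub): fuzzing, helped along by LLM-generated seed inputs."""
    yield Candidate(harness="fuzz_target_1", crashing_input=b"\x00" * 8)

def build_context(candidate):
    """Stage 2 (stub): gather the crash trace and relevant source for patching."""
    return {"stack_trace": "...", "suspect_function": "parse_header"}

def generate_patch(candidate):
    """Stage 3 (stub): LLM-assisted patch generation, validated against the harness."""
    return "diff --git a/parse.c b/parse.c ..."

def orchestrate(challenge, max_candidates: int = 10):
    """Stage 4: allocate effort across the other stages; discovery gets resources up front."""
    results = []
    for candidate in find_vulnerabilities(challenge):
        candidate.context = build_context(candidate)
        candidate.patch = generate_patch(candidate)
        results.append(candidate)
        if len(results) >= max_candidates:
            break
    return results
```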
Caleb Sima: Can I pause just real quick? I'm a little bit confused. Maybe there's a difference between finding a vulnerability versus confirming the vulnerability, like you'd say. How would you issue a patch without finding the vulnerability? I guess that's the thing I'm a little bit [00:21:00] confused on.
Michael Brown: Yeah, so like I said, in the semifinals you couldn't do that. But in the finals you could, and in fact the third-place team, Theori, actually did this to a certain degree. It's actually part of the reason why we were able to edge them out for second place. If you look at some of the numbers DARPA's released (we don't have all of them, so they're not a hundred percent accurate), accuracy was also a factor in the scoring: you lost points for being less accurate, for submitting faulty patches or faulty vulnerabilities. Your score would go down if you submitted too many of them. Our accuracy was about 90% and theirs was about 45%. That was because they were making some patches against static analysis alerts that they couldn't verify were true positives or false positives.
Caleb Sima: Oh, so it's the verification part that you could skip, per se. You could just say: it spits out a bunch of junk, I'll just automatically throw a bunch of patches at it. And that's sort of what causes that. Okay.
Michael Brown: Yeah. I don't wanna mischaracterize their system; they certainly didn't do that. But basically they [00:22:00] had a system that would look at different static analysis tools, and they would try to identify vulnerabilities without actually exercising them with a fuzzer.
Caleb Sima: Right.
Michael Brown: Right. And then they would go through this series of validations to try and come up with what they consider a high probability of being a true positive. But because there's no human involved, you can never be completely sure, because typically, in the process today, if you have a really high-quality alert on something, you'll have a human verify it, and then it's relatively quick for them to show that, okay, that actually is a real issue. When you have an AI system that has to be fully autonomous, you don't have that luxury. So there was always a risk. That's a risk they took, and it ultimately cost them on their accuracy.
And honestly, if you look at it, we only beat them by, I think, about 10 points. So with the accuracy adjustment for being only 45% accurate versus 90%, if all other things were the same, if [00:23:00] their process did not help them find any additional vulnerabilities or generate any additional patches, then we would've beaten them on accuracy alone.
But I think the design decisions they made helped them find a few more things. So it's not as cut and dried as: we were accurate, they weren't, and that's why they went to third place. It's possible they beat out the fourth-place team precisely because they found some things others didn't, because they were willing to take some risks. We don't really know the full story yet.
Caleb Sima: And what you're saying is that the way they took risks in finding things meant they were able to find more than you, because they were like, we'll take the hit on the accuracy score in order to have more coverage capability. Is that sort of what you're saying?
Michael Brown: Yeah, it's entirely possible. Once again, I don't have their data; I haven't looked through it. DARPA hasn't released competitor data. All DARPA's given us is our own data, so we know how well we did and we can look at that, but they haven't given us everyone else's yet. But I think it's entirely possible, and entirely likely, that there was a [00:24:00] vulnerability they found that was really hard to prove, really hard to find an input for that would actually make it all the way through the program and trigger the vulnerability.
Caleb Sima: But they patched it. Yeah. They patched it and they scored six points for it.
Michael Brown: Nonetheless, yes, even though they couldn't find a way to actually trigger it. So it's entirely possible, and likely. I think they're really smart guys. They did a great job.
Caleb Sima: And if you think about it, in that scenario, if there are no humans, there is no quote unquote fatigue from looking at multiple patches, and just patching it, as long as it's easy, still works. You might as well just patch it. Qualifications apply. Yeah.
Michael Brown: Well, it gets back to what's a realistic outcome, though. I can tell you right now, as an open source maintainer, if I had this big library that everybody uses and I'm the only person who does anything with it, and you send me a pull request that says, hey, there's a vulnerability here and I patched it, but you don't have any way to prove to me that the vulnerability actually exists, that goes to the bottom of my priority pile. So that's part of the reason why in our system we [00:25:00] prioritize always finding a proof of vulnerability: one, because the score encouraged it, and two, frankly, it's just a more realistic, more useful approach in the real world.
Caleb Sima: world?
The way I sort of think about it is you are right in the sense of where you guys created something that I think tomorrow could make a huge impact in the everyday existing open source repos.
That approach alternatively their approach. Alternatively, you know, if you think about okay, you know, a six months approach from now or a year approach from now where a repo could ostensibly be completely automated in that sense and managed in that sense.
It's an interesting view from both ends, right?
Michael Brown: Yeah. And I think going forward, you kind of want both. Right now we have this huge backlog of code that's probably got all kinds of vulnerabilities in it that no one has time to look at, attackers or defenders. Well, attackers have more time, so eventually they find them first. [00:26:00] Or defenders rely on automated tools, but they have even less time. So early on, as we start applying these systems, systems like ours that prioritize proof of vulnerability through dynamic analysis are gonna find a lot of bugs and help people fix a lot of bugs. But eventually you're going to get to the point, once these systems have been used long enough and have caught up on the backlog, where the only bugs that remain are the bugs that are hard to trigger. In which case you do want a system like Theori's to come in and help you find those bugs that are really difficult to find a crashing input for, and really difficult to prove, because they still exist. It's just that now we need something that doesn't give us an 80% false positive rate; we need something that gives us more like a 55% false positive rate, which I guess is around the vicinity of where their system is. So for them to get the false positive rate on speculative patching down significantly from 80% is still a pretty astounding achievement, and definitely something they should be proud of.
Ashish Rajan: [00:27:00] I was gonna say, 'cause you guys were not allowed to have a knowledge source for shoving in all the known vulnerabilities, how were you validating accuracy?
Michael Brown: We did have one. We actually did have a knowledge source for all of the known vulnerabilities, and it's the large language model.
This kind of information is freely available on the internet. Every large language model that was available to us in the competition, meaning commercially available models from Anthropic, OpenAI, Google, and a couple of other providers, was allowed, and for the finals you were also allowed to use local or custom AI models. We did not, and that's largely because the targets we were looking at, open source software, are very well represented inside the models that are commercially available. And with the team we had to build this cyber reasoning system, we weren't gonna go build a better large language model than all of OpenAI or all of Anthropic; they have whole billion-dollar companies going [00:28:00] after those things, so we did not need to replace that. If we had different targets, stuff that was more niche or not well represented in a large language model, we'd have to change that out a little bit. But ultimately, all the knowledge we needed about these individual programs, and the vulnerability information about them, was ingested during training for GPT-4, Claude Sonnet, or whatever model; it's all in there. Any commercially available model, especially because code is such a high-value proposition for generative AI, it's all in there.
Ashish Rajan: you could identify, verify using the existing models, but was it doing like a, hey, for this particular task I'm just gonna use Claude? Or like, or what is it to your point, quote unquote autonomous in sense how, in how it was picking the LLM model it needs?
Michael Brown: Yeah, so we actually did not let our system pick whatever model it was going to use. We chose to request a roughly equal [00:29:00] amount of resources from OpenAI and Anthropic for our system, and the way it worked was that we had two major areas where we used large language models. One was obviously generating patches; this is where we used OpenAI's models. And then we had another portion of the system, built into our vulnerability discovery engine, where we used a large language model to help us generate inputs to feed into the fuzzer, to make it more efficient, help it saturate the fuzzing harnesses faster, and also generate inputs that were closer to vulnerability triggers than the fuzzing engine would produce on its own. Fuzzing engines are there to find bugs and vulnerabilities; we cared just about vulnerabilities, so we were trying to weight the coin flip that comes with randomized mutational fuzzing more towards vulnerabilities. We used Anthropic for that purpose. We roughly used each of them about 50-50, and that was kind of hard-coded into our system. If one of those providers was down upstream, we would fail over to the other [00:30:00] one. In our testing for failover, OpenAI was slightly better at patching and Anthropic was slightly better at seed generation, but they both had pretty solid performance on the other task. And during the competition this actually happened: one of the major providers was down for a period of time and we had to rely on the other provider.
But this happened during an exhibition round, not during the actual competition.
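As a toy illustration of the kind of hard-coded routing and failover just described, one provider is primary for patching, the other for seed generation, and each backs the other up. The provider names, prompts, and the call_provider stand-in are placeholders, not Buttercup's actual integration.

```python
import random

# Hard-coded task routing: one provider is primary for patching, the other for
# fuzzing-seed generation, and each acts as the fallback for the other.
PRIMARY = {"patching": "openai", "seed_generation": "anthropic"}
FALLBACK = {"openai": "anthropic", "anthropic": "openai"}

def call_provider(provider: str, prompt: str) -> str:
    """Stand-in for a real LLM API call; randomly simulates an upstream outage."""
    if random.random() < 0.05:
        raise ConnectionError(f"{provider} is unavailable")
    return f"[{provider} response to: {prompt[:40]}...]"

def run_task(task: str, prompt: str) -> str:
    """Send the prompt to the task's primary provider, failing over if it's down."""
    provider = PRIMARY[task]
    try:
        return call_provider(provider, prompt)
    except ConnectionError:
        return call_provider(FALLBACK[provider], prompt)

# e.g. run_task("patching", "Here is the crash report and the suspect function...")
```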
Ashish Rajan: Oh, right, right. But still,
Michael Brown: I saw your expression. I knew exactly what the next question was gonna be. Oh my God. Yeah, I was just like, oh, wait.
Ashish Rajan: Which one, I mean... I don't wanna ask which one. Sorry. Caleb, you go first, man.
Caleb Sima: Yeah, I was gonna ask: as you went through this learning experience, from your prototype and pitch to actual execution, I'd love to know maybe a couple of the top things that really surprised you, that you did not expect. [00:31:00] And maybe the number one thing you really learned through this process?
Michael Brown: Yeah, so I have to be honest, I've been a bit of an AI/ML skeptic for a long time. I've been working in it for a decade and I feel like I've been a skeptic the entire time, but I kind of continuously find ways to refine my skepticism.
So, my background: I started off doing mostly conventional software analysis and conventional cybersecurity problems. Then, around the time I was in grad school, a lot of the work coming out was very much, hey, let's use AI to do this. And it was either data scientists who knew nothing about cybersecurity trying to apply modeling techniques to problems they didn't understand, or it was software security experts trying to apply modeling techniques that they didn't understand to a problem that they did. And a lot of the results were really rough, with obvious flaws. As a person who had studied both, during the portion of the PhD program that I finished at Georgia Tech, I kind of [00:32:00] understood both sides, and when I started working in the space I really focused on trying to understand the problem, understand the modeling techniques that were a good fit, and put them together. So I was actually pretty skeptical, certainly back in 2023 when this was first announced, that large language models were going to be all that successful at patching. But I've been really blown away at how good they've actually become, and honestly two years is like six lifetimes for large language model development cycles right now. So I was really pleasantly surprised with how capable the models were at creating high-quality patches and at improving fuzzer efficiency along the way. I didn't think we'd get the performance that we did, but we did.
Caleb Sima: Do you see that continuing to improve month after month or quarter after quarter as you've been running it, where a year ago patching or fuzzing was not as good as it is today?
Michael Brown: So I feel like my surprise with this was more immediate, during the [00:33:00] semifinals, and I don't think it's really gotten a ton better with the models since then. We were using GPT-3.5 and GPT-4 during the semifinals. The reasoning models like o1 and the thinking models from Anthropic have come out since, but we actually ended up not using any of them in the finals at all. The regular models were actually better, because we had already built a lot of the reasoning, or the logical analysis, into the cyber reasoning system; we didn't need the large language model to try and approximate it. So right now we're not in a situation where a new model comes out, we drop it in, and Buttercup does 10% better. We did upgrade our models after testing them to make sure their performance was no worse than what we saw before. So it's entirely possible there have been some incremental changes, but honestly, the patching performance that blew me away, we saw that in the semifinals and it's kind of continued at the same level.
Caleb Sima: Huh, [00:34:00] interesting. It sort of speaks to what everyone's talking about, a plateau in the models, and whether, practically speaking, that is true. And you didn't even use the latest models.
Michael Brown: Yeah. Another thing to keep in mind is that the problem of dealing with open source software involves data that's represented really well, and at scale, within the training data for these large language models. So if you had to pick a domain in which a large language model was gonna perform best, it would be open source software. In other domains I expect the performance will probably drop off a lot. There's a big difference between patching, I don't know, SQLite, when the model has been trained on all of the code and every vulnerability that's ever been in SQLite, versus, say, taking the firmware off of my Linksys router and trying to patch a vulnerability at the packet-processing level. That software looks very different [00:35:00] than this kind of software does. So there is a grain of salt to be taken with this: it's gonna take a lot of hard work to adapt this stuff to other domains. But open source software is so huge, there's so much of it, and the security is so questionable, that it doesn't take anything away from this competition or the things that we've learned. It's more me just saying, here's a warning: you're not gonna be able to immediately go build a business around Buttercup for X.
Ashish Rajan: Oh, damn it, half the people listening to this were gonna do that. Yeah.
Michael Brown: Yeah. This is the warning, free advice. We made it open source, and if you wanna go do Buttercup for X, we're cheering you on. But I'm just saying it's gonna be harder than you think, unless X bears a strong similarity to open source software that's freely, publicly available for the commercial AI companies to be able to train on.
So the, the biggest surprise then for you was, hey, you're a little bit more of an AI optimist for LLMs walking outta this.
Michael Brown: Yeah. What,
Caleb Sima: yeah. What about some of the things you learned when you think about the journey? [00:36:00]
Michael Brown: Yeah, let's see. We learned the importance of high-quality engineering. This was always something we knew was gonna be an advantage for us, because we're more professional engineers than, like, grad students. Several of the teams that participated in this, particularly the open teams, were university teams, so grad students going after PhDs or master's degrees. Very smart, very great, but they aren't doing software engineering as a discipline in their day jobs. At Trail of Bits, we like to think we hold ourselves to a pretty high standard for the open source tools we create; we want them to be maintainable. A lot of research tools that come out of academia kind of bit rot after about six months. Ours bit rot after about a year and six months, so we're not perfect, but we like to think they hold up a little bit longer. But yeah, we started talking to some of the competitors at Defcon, and they would tell us about what happened during the exhibition rounds for the finals.
I didn't really get to this earlier, but the detail here is that [00:37:00] there were three practice rounds before the final round where the scores were actually tabulated. This was so that you had the chance to make sure your system wouldn't explode. We had one of those rounds where our system exploded. We were generating file names that were too long for a particular component; when it was trying to save things to disk, it would just nuke that component. And it was early enough in the pipeline that it meant nothing worked. So for a while there we were about to sharpen up our resumes and try to go get jobs at McDonald's flipping burgers. We'd built this complicated system and it would just puke after dealing with three challenges. But fortunately, when we got a chance to dig into it, we saw that it was a pretty simple problem to fix. So this happened to us once out of three exhibition rounds and one final round. When we talked to our competitors, this was happening like every single round for them, and it actually happened to a couple of the competitors in the final competition.
Caleb Sima: Oh, wow. Um, just bugs in their workflow basically.
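As a small illustration of the kind of defensive engineering that bug implies, capping generated file names before they hit the filesystem, here's a sketch; it's illustrative, not Buttercup's actual fix.

```python
import hashlib
import os

MAX_NAME = 200  # stay well under common filesystem limits (e.g. 255 bytes)

def safe_filename(name: str) -> str:
    """Cap overly long generated file names by hashing the tail, so one oversized
    name can't crash the component that writes results to disk."""
    if len(name) <= MAX_NAME:
        return name
    digest = hashlib.sha256(name.encode()).hexdigest()[:16]
    return f"{name[:MAX_NAME - 17]}_{digest}"

def save_result(directory: str, name: str, data: bytes) -> str:
    """Write data under a length-safe file name and return the path used."""
    path = os.path.join(directory, safe_filename(name))
    with open(path, "wb") as f:
        f.write(data)
    return path
```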
Michael Brown: Yeah. So one of the teams that [00:38:00] I think came in second to last, or third to last, did better than we did at finding the vulnerabilities, but they had a couple of bugs in their patching system and they patched almost none of them. So their score was much lower than a lot of other teams'. That's something that could have potentially put them in competition for second or third place, maybe even first place. But without the engineering, without the diligence, you build a buggy system and the bugs eventually come home to roost. They came home to roost on us too, so I want to be clear that I'm not trying to throw stones at anybody; we had our own problems. And then there's also the seventh-place team, Team Lacrosse. They had some pretty strong performance in the semifinals, certainly nothing to balk at, but for some reason, and I'm not actually sure why, they haven't really said. The only reason I know about the other team is because they published something about it on their blog.
They haven't really, like, the only reason I know about the other team is because they published something about it on Oh, right, the blog. But for the, for team lacrosse, like they something clearly went wrong. They only scored like 10 points in the entire thing. They only found like one vulnerability and patched it.
And so something [00:39:00] clearly went wrong. Those guys are smart guys. We've worked with them before and we know that their system should have performed better, but something went wrong or I just don't know what it is yet. So those guys are way too smart for that to have been the outcome. Just naturally.
So, uh, so yeah, I mean, like, it, it literally all the way through, the thing that I learned is, is that our focus on engineering paid off well because we, we thought we were being sloppy. And it turns out that we actually had it together more than most people, which is to say we made slightly fewer mistakes than other folks did.
Ashish Rajan: I think engineering is an interesting one, 'cause Caleb and I interviewed a couple of people, Jason S. and Daniel Miessler, and we were talking about this competition. One of the things we spoke about, we call it the state of security, red team version or whatever, was trying to figure out: is it the scaffolding or is it the AI that wins in this race we're all in? And engineering is an interesting angle on the same question. You obviously looked at the other teams as well, with [00:40:00] whatever information they'd shared publicly and otherwise, and the question we had was: what wins in the long run? Is it the fact that I, as a company, should create the scaffolding and the engineering principles we've developed over all these years for how a system should run and scale? Or is it, to what Caleb was saying earlier, that in the future AI would just go: thank you, Michael, let me take it over from here?
Michael Brown: Yeah. So actually the answer is more than that. The answer is: one, it's scaffolding, but two, it's also not forgetting that conventional approaches to problem solving exist. There are a couple of things I mean by that. I'll address the scaffolding part first, 'cause it's probably easier and clearer.
Everyone had access to the same budget for large language models and everyone had access to the same large language models that were available. And I think only one team even tried to do a custom model. And from what I understand, it was not a big differentiator for them in terms of score. So ultimately AI was actually the commodity.
It was the thing that everybody used, everybody had the same access to.
Ashish Rajan: Yeah.
Michael Brown: The budgets [00:41:00] were set by DARPA. So this wasn't a thing where poor little Trail of Bits only has $10,000 of our own money that we're willing to spend on LLM credits for this competition, and then somebody else can just swoop in and spend $3 million,
Ashish Rajan: right.
Michael Brown: And make sure that we win the $4 million first prize or whatever. So everybody had the same footing when it came to that. The scaffolding was super important; I kind of just talked about how important the engineering was. But if you look at the strategies the different teams used, the top two teams, Team Atlanta and ourselves, had what we called a best-of-both-worlds approach. Both of our groups have been doing research in vulnerability discovery and vulnerability remediation for a long period of time.
And we've been doing it since before AI became incredibly popular and before the large language model became the predominant form of the technology. So we had previously spent lots of time building better fuzzers, building better static analyzers, all that kind of work. If you look at our strategies, we both very much combined [00:42:00] AI with the existing tools, and for each sub-problem in this long chain of things that have to go right to find and patch a vulnerability, we picked the strongest-performing tool at that stage and used it. Our competitors at Theori came in third. They used a little bit more AI; they say they were a little more AI-forward, so I'm just going to use their words to describe it, but they de-emphasized the traditional stuff and focused a little bit more on trying to let the AI solve the problem. And that ended up only getting them third place. So I think it was a pretty strong indication that when it comes to building good systems that use AI, the answer is: one, you have to put a lot of engineering work in to make sure they're stable and robust, otherwise the whole house of cards comes tumbling down; and two, use AI where it's useful and use conventional stuff where it's useful.
And two, use AI where it's useful and use conventional stuff where it's useful. Um, a good example of this is, you know, if I'm trying to figure out, if I were work at Amazon and I'm trying to figure out what's the best route for my driver to [00:43:00] take to deliver this particular set of packages, I can ask ai, but it's not really well suited for that problem.
It might, might hallucinate an address or it might forget about a package or something. Whereas like this is pretty clearly a traveling salesman problem and we have good algorithms for solving the traveling salesman problem. So this is not something you probably want to go apply AI for. You probably just use the existing solution that's probably like gonna cost you less in terms of resources and time, uh, to go use.
So to the, I think the A ICC was a really good lesson and. Using AI where it makes sense and not using AI just like everywhere. For the sake of using it,
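To make that example concrete, here's a minimal conventional-algorithm sketch: a nearest-neighbor heuristic for a small delivery-routing (traveling salesman) instance. It's deterministic, cheap, and needs no LLM; the point is that a classic algorithm is the right tool here, not that this is how any real routing system works.

```python
import math

def nearest_neighbor_route(depot, stops):
    """Greedy nearest-neighbor heuristic for a small delivery route.
    Good enough to illustrate 'use the conventional algorithm'; real routing
    would use a proper solver, not an LLM."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    route, current, remaining = [depot], depot, list(stops)
    while remaining:
        nxt = min(remaining, key=lambda p: dist(current, p))  # closest unvisited stop
        remaining.remove(nxt)
        route.append(nxt)
        current = nxt
    route.append(depot)  # return to the depot
    return route

# e.g. nearest_neighbor_route((0, 0), [(2, 3), (5, 1), (1, 7)])
# -> [(0, 0), (2, 3), (5, 1), (1, 7), (0, 0)]
```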
Caleb Sima: I'd love to get you a little bit into the weeds. For some of the listeners: what are some of the tips and things you learned that you'd convey to others about the way you've managed to best work with AI? The tips and tricks, get a little bit into the weeds for people.
Michael Brown: Yeah. So the most [00:44:00] important thing I can say about using AI successfully is: try not to get it to boil the ocean. If you ask it to do too much, you're going to end up with something that has to satisfy a lot of stages and do everything right, and AI, and machine learning more broadly, don't do this well. When you use AI or machine learning to solve a problem, there is going to be some chance that it gets things wrong; this is inherent to how AI/ML modeling works. So when you ask a large language model, particularly a reasoning model, hey, here's my big problem, go solve it, even if it does a good job of breaking everything down into 15 steps, the model still has to get all 15 steps right, and the earlier it makes a mistake, the more that mistake compounds over time. So every place we've used large language models effectively in doing really challenging cybersecurity work, it's when we've identified a good problem for AI, done a lot of work contextualizing it, making sure it had the right information to be successful, and then added [00:45:00] post-step validation, so that whatever comes out of that stage of the pipeline, if we have a way to validate it, we do, and we keep those errors from compounding along the way.
So yeah, if I were to get into the weeds: build a system like Buttercup, a system that uses AI in small areas, and really tightly define the problem it has to solve. A good example of this is our patching. The patching system we use in Buttercup is actually a multi-agent system, one that we built before those became commonplace, or before there was a lot of infrastructure support for building systems like these. You can ask a large language model: hey, here's a vulnerability I discovered, here's some information about the code, go write me a patch. But it turns out you get better results when you break it down and have an agent that is told it's a software engineer, and you say: hey, here's a bug report from a user, go fix this bug. And then you have a separate persona, another LLM, that thinks it's a security [00:46:00] engineer, that says: hey, check this, does this actually fix this security problem? Then you have a separate component, a quality assurance engineer, a QA persona, whose only job is to say: hey, make sure this doesn't break anything else, and here are the tools you can use. So when you only ask the LLM to perform a small, atomic function, the chances of being successful at that function, and of having broken the problem down well enough that you can effectively validate what the LLM produces as output, become way, way higher. It's not the answer everybody wants to hear. Everybody wants to hear that AI is gonna solve all the problems and we're all gonna put on those nice flowing Greek robes, eat grapes, and let our robot armies till the fields and go build all of the accounting systems that need to be built.
But the reality is, like, AI is super, super useful, but it's probably gonna require a lot more work than most of us anticipate.
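Here is a minimal, hypothetical sketch of that engineer / security / QA persona split. It is not Buttercup's actual patcher; the prompts, the `call_llm` stand-in, and the APPROVED/REJECTED convention are assumptions made purely for illustration.

```python
# Hypothetical sketch of the engineer / security / QA persona breakdown.
# Each persona gets one small, well-defined job instead of "go fix security."

def call_llm(system_prompt: str, user_prompt: str) -> str:
    """Stand-in for your model call of choice."""
    raise NotImplementedError

def propose_patch(bug_report: str, code_context: str) -> str:
    # Persona 1: a software engineer fixing a bug report.
    return call_llm(
        "You are a software engineer. Fix the reported bug with a minimal patch.",
        f"Bug report:\n{bug_report}\n\nRelevant code:\n{code_context}",
    )

def security_review(patch: str, bug_report: str) -> str:
    # Persona 2: a security engineer checking the fix actually closes the hole.
    return call_llm(
        "You are a security engineer. Answer APPROVED or REJECTED with reasons.",
        f"Does this patch actually fix the vulnerability?\n{bug_report}\n\n{patch}",
    )

def qa_review(patch: str) -> str:
    # Persona 3: a QA engineer checking nothing else breaks.
    return call_llm(
        "You are a QA engineer. Answer APPROVED or REJECTED with reasons.",
        f"Could this patch break existing behavior?\n\n{patch}",
    )

def patch_pipeline(bug_report: str, code_context: str, attempts: int = 3):
    for _ in range(attempts):
        patch = propose_patch(bug_report, code_context)
        if (security_review(patch, bug_report).startswith("APPROVED")
                and qa_review(patch).startswith("APPROVED")):
            return patch
    return None  # no acceptable patch; surface the bug report to a human

```

In practice the QA step would lean on conventional tooling, rebuilding the project and re-running the test suite and the crashing input, rather than trusting a model's opinion, which is exactly the "use conventional stuff where it's useful" point.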
Caleb Sima: Well, you can still have that, as long as [00:47:00] every persona or agent is focused on its area of expertise, per your comment, right? You can still spawn a hundred QA engineers, which is way more than you'd be able to hire normally.
Ashish Rajan: So you'd probably split the ten across one business unit instead of the hundred across the entire company, I guess. Yeah, that's right.
Caleb Sima: But I guess all you really need then is AI managers, which still don't quite exist yet.
Ashish Rajan: Michael, the thought I had when you mentioned the patcher: you said Buttercup is great for open source projects, but probably not for enterprise applications, which are likely a lot more complex and have a lot of, let's say, bandages that have been put on over the years that no one touches and no one wants to touch.
I saw that C and Java were among the languages that were looked at. I'm curious about someone listening or watching who wanted to upgrade Buttercup into their own internal enterprise version. There's obviously this question [00:48:00] that this almost proves people don't need to buy an AI tool; they could build their own if they had the right investment from the company, whatever the company's objective may be. If the problem is "we spend a lot of money on these vendors and I want to be able to do it myself," well, here there were three teams doing it, with a million dollars to spend on this and whatever amount for LLM models.
For people trying to transform this into an enterprise version they run internally: a lot of banks have pen testing teams and vulnerability management teams, and this is almost a combination of both, at least in my mind, so those people are instantly going, "Can I just use this?" What's the leap between what you guys created and what they would have to create?
Michael Brown: Yeah, like I said, it depends largely on what kind of software your organization produces. If you're building firmware for obscure industrial [00:49:00] control system applications, then Buttercup is probably going to struggle quite a bit, and you're going to have to do a lot more work to make it work internally.
But if your company makes software written in the most commonly used languages in those ecosystems, your code follows design patterns that are well represented in the open source software ecosystem, and/or it relies on a lot of open source software, then the barrier is much lower. To a certain degree you need to get your organization to buy in to resources being spent, because if you're going to do this at scale, it costs a lot of money.
One of the things we did after the competition was over, but before we went to DEF CON, was spend a month making a version of Buttercup that you can run on your laptop. That's one of the versions we made open source, and it's the version we maintain. As part of the terms and conditions for participating in AICC, we had to make the semi-final and final versions of Buttercup available. We made those available, but we don't maintain them. [00:50:00] We made the laptop version available so that anyone can use it, and we're hoping that's the version people eventually take inside their organizations and use. But if you want to scale that up to your enterprise software, you need to deploy it on cloud infrastructure and give it lots of resources, both compute and a healthy LLM budget, for it to produce results.
None of this stuff came for free, which is an important point. Each team was given something close to a hundred thousand dollars, actually a little more than that, between LLM credits and compute credits on Microsoft Azure, which was the cloud platform everything ran on in the competition.
The best-performing team discovered something like 40-plus vulnerabilities, and I think patched around 30 of them, but they spent something like $110,000 to do it. So the cost was cheap: over a six-day period, it was less than one employee makes in a year to find and patch those vulnerabilities.
But it also wasn't something that a startup with [00:51:00] zero money can just go do and run forever.
Ashish Rajan: And there's the long-tail effort as well, because you guys have been working on it for some time. It's not like you heard the podcast and tomorrow you have Buttercup in your enterprise.
Michael Brown: Yeah, exactly. And there are certain types of software, certain types of systems you might want to target, that Buttercup won't work on right away. If you don't have the source code and you're dealing with binary code: Buttercup writes patches in source, and it relies on being able to analyze the source code.
So if you're dealing with reverse engineering or other applications like that, you're basically going to have to retarget it for a different level of program representation.
Caleb Sima: We're running short on time, but maybe I'll ask two more questions. First, where is the future of Buttercup? Do you foresee this getting to the stage where, open source or even closed source, AI just continuously assesses and patches issues for engineers?
Michael Brown: Yeah. One of the things that's [00:52:00] interesting about Buttercup is that we built heavily on OSS-Fuzz, which is already doing a great job of fuzzing the open source software ecosystem and reporting bugs.
The problem is that bug reports don't necessarily get fixed. So I think the future of Buttercup is taking the good work being done by these efforts to secure huge software ecosystems and going one step further: beyond "Hey, there's a problem here" to "Hey, there's a problem here, here's the fix, and it cost me $150 to do the whole thing." That's the end state I want to see Buttercup get to.
In the meantime, our short-term plans are to keep maintaining Buttercup and making it available. We're happy to see people try it, and we're available to help. If you're trying to make this work within your organization and you want help adapting it, please give me or the rest of the team here at Trail of Bits a call. We'd love to talk more about that.
Caleb Sima: And one more question: how do you think Buttercup compares to a lot [00:53:00] of the security products and vendors out there that are effectively saying the same thing? "We identify all these vulns in code and auto-submit PRs to patch them."
Michael Brown: We're really different from some of the other offerings out there. The ones that come to mind are RunSybil and XBOW. They're a lot more LLM-focused, but their problem is also a little less constrained than ours was, and they're focusing on different spaces.
I think both of those products focus more on web-vuln-type stuff, while we've been focusing on memory corruption, the top 25 CWEs, and Java programs, so they certainly cover different areas. What I'm hoping Buttercup becomes is a template for how to use AI to solve problems in a way that merges the 35 years of ongoing research into conventional techniques for finding and fixing vulnerabilities.
We don't want to forget all of that exists just because we have AI now. There are certain problems AI works [00:54:00] really well for, and others the conventional stuff works really well for, so I'm hoping we'll see a shift.
Caleb Sima: And what about the team that won, Team Atlanta? How did they compare? What did they do differently?
Michael Brown: Honestly, they had a bigger team. At Trail of Bits we had about 12 different people work on the project over the course of those two years, everybody to some degree part-time; we had other projects we were doing for other clients. Team Atlanta had a pretty similar approach to ours, but if you look at their website, I think they credit something like 42 people who worked on this. They had people across Georgia Tech, KAIST, which is a research university in South Korea, and Samsung Research, so they had big tech corporations in there.
They also had grad students, who are notoriously underpaid and very hardworking, working on this. It's actually kind of funny if you look at the sizes of the teams: Theori was a slightly smaller team than ours, so there's almost a linear scaling term in the amount of human effort that went in.
If you look at how much budget was consumed by Team Atlanta, the [00:55:00] complexity of their system, how many extra building blocks they have that do certain types of things, and how well those are specialized, a large part of the reason they did better was that they were able to do more inside those three-and-a-half or six-month windows.
And once again, I think the LLM stuff was kind of a commodity: a lot of the areas where they were able to add things we just didn't have time to build were conventional approaches. I did most of my PhD at Georgia Tech, which is where a lot of their grad students came from.
I actually know Taesoo Kim, the leader of the team that won, and it was great to see my alma mater win too, by the way. So I was not sad at all. If I was going to get beat by one team, I wanted it to be them, because now I can subtly take some street cred from having gone to the same university.
As a small business we were eligible for seed money early on, which was a huge advantage early on. But later on, if you're a university and you have access to lots and lots of people, and you also don't have to worry quite so much about the [00:56:00] ROI on this kind of thing, because academia has very different incentives than business, and for good reason,
then it changes the dynamics of the competition. If we had put 42 people on this, we would have bankrupted the company, because we received about $3 million before the final: $1 million for our concept white paper and a $2 million prize for winning in the semi-finals.
There's no way you're going to hire 42 engineers who are able to go work elsewhere in the tech industry and stay under a $3 million budget. It's not happening.
Ashish Rajan: No, all good. Unless Caleb has any more questions, we're going to wrap up there. Where can people find information about Trail of Bits, and where can they connect with you and learn more about using Buttercup, maybe for their own version of an autonomous AI agent in their organization?
Michael Brown: Yeah, super easy. Go visit trailofbits.com. We have a long-running blog series over the last couple of years that documents our journey in the AICC, [00:57:00] so if you're just learning about this for the first time, there's a running commentary you can take a look at.
Some of the posts on our blog also link to Buttercup, but it's easy to find: it's on GitHub. Go to the Trail of Bits organization and look up Buttercup, and that's where you'll find our cyber reasoning system. We've already had people from outside of Trail of Bits contribute PRs that we've merged, mostly small bug fixes, but we're hoping the community will adopt it and take off with it.
We welcome your contributions and your interactions, and like I said, if you work at an organization and you're looking at deploying this capability internally, we've made the core stuff available and we're happy to help you adapt it for your internal use. You can reach out to me or other folks; you can find us on GitHub, you can find us on our website. We've got everything
Ashish Rajan: out there. Awesome. I'll put those in the show notes as well. Thank you so much for your time; that's all we had time for. I look forward to maybe another conversation when Buttercup 2.0 comes out.
Michael Brown: Yeah, absolutely. Love to come back. Thanks so much for having me.
Ashish Rajan: Thank you. Thanks, Michael. Thank you for [00:58:00] watching or listening to this episode of the AI Security Podcast. This was brought to you by Techriot.io. If you want to hear or watch more episodes of the AI Security Podcast, check them out at aisecuritypodcast.com.
And in case you're interested in learning more about cloud security, you should check out our sister podcast, Cloud Security Podcast, which is available at cloudsecuritypodcast.tv. Thank you for tuning in, and I'll see you in the next episode. Peace.
