Mutation testing in the age of LLMs

The last conference I attended was Wroclaw RB. The conference had workshops in the first part, and I was really excited about the mutation testing workshop.

Google, in one of their research papers, claimed that mutation testing might have prevented 7 out of 10 high-priority production bugs if this testing practice had been widely used.

“Mutants are coupled with 70% of high-priority bugs, for which mutation testing would have reported a live, fault-coupled mutant on the bug-introducing change.”

If you are a small business owner like me, you will take that edge over competitors any time.

My experience with mutation testing in other languages has been rather unstable. It felt like a great idea, and you could surface some improvements, but it was not something that I could run reliably on CI or even locally. It was even harder to sell the idea of 100% mutant coverage to the team in the pre-LLM era. A lot of teams struggled with automated tests in general and were very far away from such a tough metric.

But the author of the mutant gem, Markus, has been using mutation testing in production for years. And there have been a lot of rumors about Ruby teams that practice mutation testing with his gem as well.

It felt like, in the age of LLMs, running full mutation coverage for projects should be easier. And Markus’s workshop proved it.

At first we tried mutation testing manually, and after that did the same, but with LLMs. The LLM was capable of doing all the reasoning to effectively kill mutants, with very little input from my side.
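To make the idea concrete, here is a minimal sketch of what one manual round can look like. The `Order` class, the threshold of 100 and both test helpers are made up for illustration; they are not taken from the workshop material, and the mutant gem generates and runs such mutations automatically instead of by hand.

```ruby
# Hypothetical code under test: free shipping starts at a total of exactly 100.
class Order
  def initialize(total)
    @total = total
  end

  def free_shipping?
    @total >= 100
  end
end

# A hand-made mutant: identical, except `>=` is mutated to `>`.
class MutatedOrder < Order
  def free_shipping?
    @total > 100
  end
end

# Weak test: only checks totals far away from the boundary.
def weak_test(order_class)
  order_class.new(150).free_shipping? && !order_class.new(50).free_shipping?
end

# Stronger test: additionally pins down the boundary value.
def strong_test(order_class)
  weak_test(order_class) && order_class.new(100).free_shipping?
end

puts weak_test(Order)          # true: passes on the original code
puts weak_test(MutatedOrder)   # true: still passes, so the mutant is alive
puts strong_test(Order)        # true: passes on the original code
puts strong_test(MutatedOrder) # false: fails on the mutant, so the mutant is killed
```

A live mutant leaves two options: add the missing test (as `strong_test` does here), or decide that the mutated code is the simpler, equally correct version and keep it.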

I came home energized, with a feeling that I had found a magic pill for my LLM-powered development. I relied on LLMs a lot, but never trusted their output. I meticulously went over each line they generated, and my main responsibility was to delete as much code as possible (and I removed a lot).

Markus showed us toy examples. But how would it work on real Ruby projects? I had 3 open source projects to test this on before making a “go or no-go” decision for commercial projects.

Mutating OSS

My side project heavily depends on 3 open-source Ruby projects, so I picked them to experiment with the mutant gem.

webhukhs gem

Here is the PR: https://github.com/skatkov/webhukhs/pull/13

The webhukhs gem is a webhooks processing engine for Rails. This is a very stable gem that had already been tested on multiple projects.

I mostly expected to see increased test coverage. But I was surprised to find a lot of great simplifications that I had completely missed.

- .is_a?() returns true when the object is an instance of the given class or of any of its subclasses (or when its class includes the given module). .instance_of?() returns true only for an exact class match. So in my case, instance_of? is more precise.
- Type casting was unneeded in a lot of places (e.g. .to_s or .to_json).
- .force_encoding(Encoding::BINARY) was replaced with the simpler .b. I didn’t like this particular simplification. I prefer the explicit version. But I guess it’s very easy to mess up the arguments here? So I went along with it, but it would be awesome if such cases could be configured.
- hash[:key] was replaced with hash.fetch(:key). This is what I do myself all the time, but I completely missed one case.
- rescue didn’t require assignment to the e variable. Great one, fewer variable allocations.
- Rails.error.report(e, handled: true, severity: :error) was simplified to Rails.error.report(e, severity: :error). handled: true is the default value, so there was no need to pass this argument.
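A few of these simplifications are easy to try in plain Ruby. The snippet below is only a sketch: `Handler`, `SpecialHandler`, `PAYLOAD`, `EVENT` and `event_type` are hypothetical names, not code from the webhukhs gem.

```ruby
# Hypothetical classes and data, only for illustration.
class Handler; end
class SpecialHandler < Handler; end

SPECIAL = SpecialHandler.new
puts SPECIAL.is_a?(Handler)        # true: subclasses match too
puts SPECIAL.instance_of?(Handler) # false: only an exact class match counts

# String#b returns a binary (ASCII-8BIT) copy;
# force_encoding relabels the receiver in place, hence the dup here.
PAYLOAD = "caf\u00e9"
puts PAYLOAD.b == PAYLOAD.dup.force_encoding(Encoding::BINARY) # true
puts PAYLOAD.encoding                                          # UTF-8: .b left the original untouched

# Hash#[] silently returns nil for a missing key, Hash#fetch raises KeyError.
EVENT = { id: 1 }.freeze

def event_type(event)
  event.fetch(:type)
rescue KeyError # no `=> e` needed when the exception object is not used
  :unknown
end

p EVENT[:type]      # nil
p event_type(EVENT) # :unknown
```

One difference worth keeping in mind: .b never modifies the string it is called on, while force_encoding does, so the two are interchangeable only where that side effect is not relied upon.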

Even though we didn’t surface any bugs, it still felt like a good enough improvement! So I decided to proceed further and see if mutant would bring more benefit in other projects.

yard-markdown gem

Here is the PR: https://github.com/skatkov/yard-markdown/pull/33 (there are more PRs that followed after)

Unlike the previous, polished gem, this one was hacked together to “just work”. A proof of concept, in a way. There was no design or architecture; most of the code was in a single file that came with a template and glue code that made it work together.

There could be so many data permutations that this plugin could break in various unexpected ways. And I already knew that there were at least two bugs in this gem, but even with decent tests it was scary to touch. Would mutant find them?

In my LLM prompt, I did say: “It would be great if the code had better organization and this gem became easier to maintain long term.” I was not expecting much, but tried to be upfront about the desired results.

And I lost track of all the small code simplifications that this brought. The LLM, with mutant assistance, stopped only twice to ask for my help. In both cases it was because it had found a logic bug. The LLM found all the bugs I knew about.

Not only that, the code was neatly organized. It was more maintainable. I was not afraid to touch it anymore. I threw in yard-lint to improve code comments! This was a natural “oh, shit, what just happened?” moment.

I could not believe my own eyes: mutant had surfaced bugs and the LLM had properly signaled that something funny was going on. This was a success! I couldn’t wait to onboard another similar project and couldn’t go to sleep without doing that.

rdoc-markdown gem

Here is a PR: https://github.com/skatkov/rdoc-markdown/pull/58 (this had multiple follow-up PRs)

While the code for this gem is rather simple, the conversion to markdown is rather hacky: HTML gets converted into markdown. This entire thing is riddled with bugs, but I currently see no other way to integrate with rdoc and achieve a more stable solution.

On some gems this plugin just threw errors…

Mutant caught around 8 bugs and I managed to resolve 6. After reaching 100% mutation coverage, there are no gems left on which the plugin simply fails. Yet there are more bugs left… mutant made a lot of them visible (though not all).

Unfortunately, there is not much that I can do in the rdoc-markdown gem itself to resolve all issues. RDoc itself requires more work.

At this stage I was sold, and immediately wrote to Markus that I needed that license.

Lessons learned

I have onboarded 6 Ruby projects into mutation testing at the time of writing. It took me roughly two days to onboard each project, but with a bigger Rails project it took me a couple of weeks of gradual conversion.

Mutant is now a staple of my workflow when I deal with Ruby projects. It helps with LLM fatigue. I still look into the code LLMs produce, but more to get a feeling for what architecture we’re building here.

- In the pre-LLM era there was an argument that the costs of maintaining 100% mutant test coverage were too high. But with LLMs, this argument is completely void. LLMs can make the right call about where to simplify and where to add more tests. In some hairy spots, the LLM gives up, and this is where I pay attention, as this is most likely a bug.
- At first, the LLM powered by mutant went overboard with adding tests. A lot of new tests felt repetitive and excessive. I still have not found a way to deal with it to my satisfaction, but there are a couple of ways to improve this:
  - minitest/rails seemed too lax with assertions. The fact that nil is treated as a falsy value allowed a lot of live mutations that needed killing with additional tests. As it turns out, minitest-strict solves exactly that problem and allowed me to eliminate a lot of new tests.
  - I can ask the LLM to attempt to delete some repetitive tests, while keeping a close eye on coverage.
  - Maybe in the future we can have mutation testing for non-production code?
- I got a little anxious seeing so much existing production code changed by LLMs. They were removing guard clauses, some checks… “this surely will backfire somehow”, I thought to myself. But I haven’t seen any new bugs introduced after mutant-powered refactoring did its job. ZERO issues!
- Emboldened by the success in Ruby, I tried mutation testing in other languages. And I understood that writing a generic mutation testing framework is not that hard, but making it usable is really hard. It’s also computationally heavy. Incremental mutation testing of only the code that has changed is a killer feature of the mutant gem. So that $30/month fee for the mutant gem? Worth it! I don’t curse at every random error and don’t have the urge to fix the mutation testing framework itself.
- As a web developer, I have never really seen all my CPUs work to their full potential. The mutant gem changes that, to the point that the browser started fighting for resources. Parallelization of agents with mutant has to be done carefully; one has to be sure not to execute multiple mutant runs at the same time. Getting a beefy server and running mutant there seems like a good idea that I haven’t tried so far. Relying on GitHub Actions is not enough now either.
- Sometimes I’m in prototyping mode, trying to figure out whether a library or approach is workable, and in that moment mutant slows everything down and derails the entire process. In my head, “mutant” is a final polish step, and to draw this distinction I’ve created a “prototype” mode in my opencode, with explicit instructions not to execute any mutation testing.
- I’ve been following Markus, the mutant gem author, and his work for some time. He is one of the original thinkers among developers, not someone who just blindly follows “best practices”. And he goes out of his way to provide support and share his experience. It was an enriching experience for me personally just to have a couple of random chats with him. I’m standing here and questioning a lot of my personal choices as a result (and hopefully that makes me a slightly better developer now).
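The point about lax assertions can be reproduced with plain Minitest. This is only a sketch: `admin?`, `mutated_admin?`, `AssertionHost` and `passes?` are hypothetical helpers, the mutant is written by hand, and the snippet does not use or reproduce the minitest-strict API.

```ruby
require "minitest"

# Minimal host object so Minitest assertions can run outside a test runner.
class AssertionHost
  include Minitest::Assertions
  attr_accessor :assertions

  def initialize
    @assertions = 0
  end
end

# Hypothetical code under test.
def admin?(user)
  user[:role] == "admin"
end

# Hand-made mutant: returns nil instead of false for non-admins.
def mutated_admin?(user)
  user[:role] == "admin" || nil
end

# Runs the block against a fresh host; true if every assertion passed.
def passes?
  yield AssertionHost.new
  true
rescue Minitest::Assertion
  false
end

GUEST = { role: "guest" }.freeze

# Truthiness-based assertion: nil is as good as false, so the mutant survives.
lax = passes? { |t| t.refute mutated_admin?(GUEST) }
# Exact comparison: nil is not false, so the mutant is killed.
exact = passes? { |t| t.assert_equal false, mutated_admin?(GUEST) }
# The original code passes the exact comparison.
original = passes? { |t| t.assert_equal false, admin?(GUEST) }

puts lax      # true
puts exact    # false
puts original # true
```

With truthiness-based assertions, every such nil-versus-false mutant needs an extra test to kill it; exact comparisons kill them in the tests that already exist.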

In conclusion, I want to share a single thought with the world. No matter what kind of language you’re dealing with, take a day off and try a mutation testing framework.

I have continued my experiments with advanced testing approaches: I’m experimenting with fuzz testing and property testing. bombadil’s experimental support for TUIs makes me especially excited. But let’s leave this part for future articles.