AI code review tools: which one actually catches bugs?
We seeded 50 real bugs into PRs and ran them past CodeRabbit, Greptile, Korbit, and Qodo Merge. Catch rates inside.
Every AI code reviewer claims to catch bugs humans miss. We tested that claim by planting 50 known bugs across 30 PRs and seeing how many each tool flagged.
The bug mix
We used a balanced mix: 10 logic bugs, 10 security issues, 10 performance regressions, 10 race conditions, and 10 subtle correctness bugs (off-by-one, wrong operator, swapped arguments).
Each bug was real: we took it from a public CVE, a post-mortem, or an open-source bug tracker, then transplanted it into a synthetic PR with realistic surrounding changes.
Catch rates
- CodeRabbit: 38 / 50 (76%). Best overall, strong on security and logic.
- Greptile: 34 / 50 (68%). Best repo-context awareness; caught bugs the others missed because it understood neighboring files.
- Qodo Merge: 31 / 50 (62%). Best free option; strong on correctness, weaker on performance.
- Korbit: 28 / 50 (56%). Best feedback tone; lower catch rate but the cleanest comments to act on.
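The percentages above are simply bugs caught divided by the 50 bugs seeded. A minimal sketch of that arithmetic, using the counts from the list (the names `SEEDED_BUGS`, `caught`, and `catch_rate` are ours, chosen for illustration):

```python
# Bugs caught out of 50 seeded, copied from the list above.
SEEDED_BUGS = 50
caught = {"CodeRabbit": 38, "Greptile": 34, "Qodo Merge": 31, "Korbit": 28}


def catch_rate(caught_count: int, total: int = SEEDED_BUGS) -> float:
    """Return the catch rate as a percentage of seeded bugs."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * caught_count / total


rates = {tool: catch_rate(count) for tool, count in caught.items()}

# Print tools from highest to lowest catch rate.
for tool, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{tool}: {rate:.0f}%")
```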
False positives matter too
A tool that flags everything looks great on catch rate and is useless in practice. We tracked false positives per PR: Greptile gave the cleanest signal, followed by Korbit. CodeRabbit caught the most bugs but also generated the most noise.
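To see why catch rate alone misleads, it helps to put false positives per PR and precision next to it. The sketch below is one way to compute both. The `ReviewRun` class and every number in it are hypothetical and chosen only to illustrate the point; they are not measurements from our test.

```python
from dataclasses import dataclass


@dataclass
class ReviewRun:
    true_positives: int   # comments that pointed at a seeded bug
    false_positives: int  # comments that pointed at nothing real
    prs: int              # number of PRs reviewed

    @property
    def false_positives_per_pr(self) -> float:
        return self.false_positives / self.prs

    @property
    def precision(self) -> float:
        # Share of all comments that were worth reading.
        total = self.true_positives + self.false_positives
        return self.true_positives / total if total else 0.0


# Hypothetical numbers for illustration only, not results from this test.
flag_everything = ReviewRun(true_positives=50, false_positives=450, prs=30)
selective = ReviewRun(true_positives=30, false_positives=15, prs=30)

for name, run in [("flag_everything", flag_everything), ("selective", selective)]:
    print(f"{name}: {run.false_positives_per_pr:.1f} false positives per PR, "
          f"precision {run.precision:.0%}")
```

In this made-up case the flag-everything tool catches every bug, yet nine out of ten of its comments are noise.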
What none of them caught
Race conditions were brutal. Even the best tool caught only 4 of the 10. AI reviewers still don't reason well about concurrency, and you should not rely on them for it.
They also struggled with bugs that required understanding the product rather than the code, for example a price calculation that was technically correct but violated a business rule documented only in a spec.
How to use them well
Treat AI code review as a second pass, not a replacement for human review. Tune each tool's configuration to cut noise, route its comments to a single thread instead of inline if your team finds them distracting, and pay attention when two different tools flag the same line.
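If you run more than one tool, spotting agreement can be automated. The sketch below assumes you have already collected each tool's comments as `(tool, file_path, line_number)` tuples; how you export them depends on the tool and is not covered here. The function name, file paths, and sample comments are hypothetical.

```python
from collections import defaultdict


def lines_flagged_by_multiple(comments, min_tools=2):
    """Return locations flagged by at least `min_tools` distinct tools.

    `comments` is an iterable of (tool, file_path, line_number) tuples.
    """
    tools_by_location = defaultdict(set)
    for tool, path, line in comments:
        # A set means repeated comments from one tool count only once.
        tools_by_location[(path, line)].add(tool)
    return {
        location: sorted(tools)
        for location, tools in tools_by_location.items()
        if len(tools) >= min_tools
    }


# Hypothetical comments for illustration only.
comments = [
    ("CodeRabbit", "billing/invoice.py", 42),
    ("Greptile", "billing/invoice.py", 42),
    ("CodeRabbit", "billing/invoice.py", 88),
    ("Korbit", "api/handlers.py", 17),
    ("CodeRabbit", "billing/invoice.py", 42),  # duplicate from the same tool
]

agreed = lines_flagged_by_multiple(comments)
for (path, line), tools in agreed.items():
    print(f"{path}:{line} flagged by {', '.join(tools)}")
```

Matching on exact line numbers is the simplest choice. Tools sometimes anchor the same finding a line or two apart, so a small tolerance window may work better in practice.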