back to articles
AI
Jun 14, 202612 min read

How I Use Cross-Vendor AI Models to Review My Code When It Matters

A different vendor's model reviews what my primary AI built, but only on the changes where the stakes justify it. Here is the cross-vendor review system I run, and what it does not fix.

Meta LayerCode ReviewCross-Vendor

Anyone who builds with AI agents has experienced this small hesitation before. The agent finishes, says it's done, and sounds completely sure. And somewhere in the back of your head sits a question you can't quite answer: can I actually trust this?

This is not an unfounded worry. I have watched an agent write several files that all stated the same thing about an external tool, confidently and consistently. Every one of them was wrong in the same way. The files matched each other, so the work looked correct from the inside, but nobody had checked the claim against the outside world. Internal consistency is not the same as being right.

My answer to this scenario has two pieces that only work together. A different vendor's model reviews the work cold, sharing none of the context that produced it, under one rule that keeps the review honest. And I don't run that review on everything: a tiering system fires it automatically, only on the changes important enough to justify it. The selectivity is what keeps it sustainable, and what keeps it from becoming theater.

Tools used

AI tooling
  • Claude Code
  • Codex CLI
  • Antigravity CLI
Governing rules
  • Cross-vendor review rubric
  • Stake matrix
Harness
  • cross-vendor-review.sh dispatcher
  • Pre-commit gate

The model is the worst judge of its own work

When I started building more seriously with AI, I would just ask the same model to review what it had built. It almost never pushed back, even when I tried to reduce the sycophancy rate. It would constantly tell me the work looked good, sometimes it would point out a few small things, but that's it. For a while, I thought this was enough… until I realized that I got the same "everything looks good" whether the work was fine or not.

It makes sense when I look at how I got there. The model was reviewing its own code. It still had the whole conversation in context, all the assumptions it made on the first pass, the same idea of what I asked for. Basically, it had already convinced itself that this was right, so asking it to check again was mostly asking it to agree with itself. It tends to do that anyway, and simply sounds more sure the second time.

I tried using fresh sessions with the same model. It helped a little, but not that much. It was still the same model, with roughly the same blind spots the initial session had. Switching to a different model from the same vendor was a bit better, but not by much. Models from one vendor share a lot of the same training, so they tend to carry the same blind spots too. What truly moved things was handing the work to a different vendor altogether, like Codex (OpenAI) or Antigravity (Google) instead of Claude. Other vendors' models were trained differently and missed different things, which meant they caught different things. They never went through the discussions where I convinced myself into a decision, and they had nothing invested in the work being right, so they read the change cold and tell it as it is.

There was one thing I had to set up to keep this fair though. Every model in my system reads the same rules I wrote, so the reviewers judge the work against the proper standard I hold my product to, not their own preferences. The other side of that is a reviewer with no context doesn't know what I was going for, so it sometimes flags things that were on purpose. That is the problem the next part aims to solve.

The rule that keeps a cold review honest

Giving the work to a cold reviewer creates a new problem of its own. A model with no context and no history will flag a lot, and a lot of it is just preference. It might have named something slightly differently, split a function in a different way, adopted a different approach… If I just blindly accepted everything it said, I'd have traded a yes-man for a reviewer that buries a few issues that matter under many that don't. It would also lead to scope-creep, where the first model simply says yes to everything the second model brings back… Sometimes leading to a different outcome altogether.

So the whole review runs on a single rule. Every observation has to point to something concrete: a failing or missing test, a lint rule, a specific line in my own project rules, or a line in the brief that was sent with the work. If the reviewer can name the anchor, then the observation is considered and we deal with it. If it can't, it's tossed aside. The missing anchor is the signal to drop it.

The anchor-or-decline rule

Every observation the reviewer raises has to cite a concrete anchor: a failing or missing test, a lint rule, a named section of my project rules, or a line in the brief I sent over. If it can name one, I deal with it. If it can't, the observation is declined by default. The missing anchor is the decline signal.

What I like about this approach is that it puts the burden on the reviewer, not on me or the first model. It cannot just say it doesn't like something. It has to tell me which test, rule, or line the work actually breaks. That constraint is what turns the cold model's stack of opinions into a short list of things that are truly worth fixing.

There's one hole I had to patch though. Every anchor I listed points internally, at my own repo: my tests, my rules, my brief. A second model reading only the repo can confirm that everything is consistent with everything else and still completely miss something that was wrong about the outside world. That's the situation from the start of this article. Three files agreed on an install command, the repo was perfectly consistent with itself, but the package didn't exist.

So one part of the rubric, the checklist of anchors the reviewer measures the work against, points outward instead. Any claim about the world, a package name, a URL, a version, a command-line flag, doesn't get to be confirmed by the file that asserts it. It has to be checked by running something real against the world. For that previous situation, it took a single command: npm view on the name, which came back with a 404. The repo couldn't have caught that… Only a real-world test could.

Not every change earns a cross-vendor review

If I ran a full cross-vendor review on every change, the mechanism would collapse under its own weight. The review itself runs between the models, I am not taking an active part in this, I just get a summary at the end. But if every single commit were to produce a summary for me to sign off, it wouldn't be sustainable and I would most likely stop reading them and start simply accepting them. A gate that is run on everything eventually stops meaning anything. The selectivity isn't cutting corners, it's what keeps the review worth doing at all.

There's a more direct cost too. To keep more than one vendor on hand, I have to pay for a few plans at the same time. In my case, Claude Max, ChatGPT Pro and Google AI Pro. And every cross-vendor pass costs time on top of money. A second model has to read the whole change cold and walk the rubric before I get anything back, so a change the first model wrapped up in seconds can sit in review for minutes. On something important, those minutes are nothing. On a typo, they're ridiculous.

As such, every change gets sorted into one of four tiers, and the tier decides how much review it gets. Most of the time, I'm not even making that call by hand: once a part of the codebase is classified, changes to it route to the same tier automatically. When something genuinely new shows up, it goes through a short decision tree, top to bottom, and stops at the first yes.

Yes

No

Yes

No

Yes

No

Production-irreversible?
money, auth, data migration

Tier 1
Different vendor required
Hold for explicit ship it

Quality-load-bearing?
user-facing, shipped copy

Tier 2
Different vendor required
Hold for ship it

Routine code change?
private helper, dep bump

Tier 3
Same-vendor self-review
Auto-merge on green CI

Tier 4
Markdown, comments
Hooks only, auto-merge

How a change gets tiered. Walk top to bottom, stop at the first yes.

The criterion that decides the tier is its reversibility, how hard the change is to undo once it's live. The things I can't cleanly walk back, money, auth, a data migration, sit at the top and always go to a different vendor before they ship. The further down the tree, the less review the change needs, down to markdown and comments where the commit hooks are the only gate.

When I'm not sure where something lands, my personal rule is to round up. Calling something Tier 2 when it was really Tier 3 costs me one extra review pass. Calling it Tier 3 when it was really Tier 1 could lead to real, meaningful negative consequences. Erring up is cheap, erring down is the expensive mistake.

I think this is the part that matters the most for anyone leading products, not just building side projects. The instinct with AI output is to treat it all the same: review all of it because you don't trust it, or accept everything as is because it's fast. Tiering lets us do neither. We spend our scrutiny where the stakes are high and let the trivial stuff go through. That's a call about risk management, not a technical one.

How it actually runs

The review runs in two stages, and the order matters.

To start, the first model checks its own work, but not from the session that wrote the code. That would just result in running the review in the echo chamber mentioned at the start of this article. It re-opens the work in a fresh context with the rubric in front of it. This stage is allowed to run commands, so this is where the claims about the outside world actually get tested. If a line says a package installs a certain way, the first model runs the command and pastes the result back as the anchor. The package that didn't exist gets caught right here, the moment npm view comes back with a 404.

Then, the same work goes to a model from a different vendor, and this is the part I had to lock down. The cross-vendor reviewer runs read-only. It can read, grep, and search the repo, but it can't run a single command. That's on purpose. A reviewer that could run commands on every pass would pose a much bigger risk than I want sitting in the loop. So it doesn't run the external checks itself, it flags them. The first model runs those in the next round. Spotting and running are split along the line of which stage can run commands.

flagged claims to run

Stage 1: First model self-checks
same vendor, fresh context, can run commands
runs every external-claim command

Stage 2: Cross-vendor review
different vendor, read-only, can't run commands
flags external claims, cites anchors

Summary to me
anchored issues to fix +
claims to verify

The two-stage review. The first model can run the command checks; the cross-vendor reviewer is read-only and can only flag them.

And read-only isn't a ceremonial request, the dispatcher enforces it. Claude gets launched with its tools restricted to reading; Codex (or Antigravity) runs in a read-only sandbox.

How the reviewer is locked to read-only
# Claude, restricted to read-only tools
claude -p "$PROMPT" --allowedTools "Read,Grep,Glob"

# Codex, sandboxed read-only
codex exec --sandbox read-only "$PROMPT"
bash

But here's the thing I keep coming back to: a setting that says "read-only" is just a claim until you check it. One of these tools, Antigravity, used to have a read-only setting that didn't actually work. It looked safe, but the reviewer could still run anything. So before trusting it, the dispatcher checks the tool's version and refuses to use it if it's too old. Same lesson as the rest of this article: don't trust a claim, check it.

Where it doesn't help

I've already covered the money and the time cost, so here's the part that's easier to gloss over: the things this setup doesn't fix.

The biggest one is that the whole thing runs on convention, not enforcement. The tiering, the anchor rules, the cross-vendor pass, none of it forces itself to happen. It only works if it actually gets run. I learned that the hard way with a different protocol in the same system, one whose entire job is catching when two documents disagree with each other… A count was wrong, it sat there live and public, and the protocol built to catch exactly that kind of drift didn't catch it, because catching it depended on someone running the skill, and nobody did. The system lowers the odds of a mistake getting through, but it doesn't make it zero.

The cold reviewer has a built-in blind spot too, the same one from earlier. It doesn't know what I was trying to do, so it spends some of its attention flagging things I did on purpose, then I have to sort the real catches from the noise. The anchor rule makes it easier to manage, but it doesn't put it on autopilot.

There's also a whole category of problems it simply cannot touch. A model from a different vendor can tell me a function breaks a test, or a claim is wrong about the world. But it can't tell me whether the feature itself should exist, if the copy actually resonates with the person reading it, or if I'm solving the right problem in the first place. Those are judgment calls, and no amount of cross-vendor review can replace judgment. It checks the work against anchors. Whether it's the right work in the first place is still on us.

Closing thoughts

A few things stand out coming out of this.

The biggest one is that none of this is really about the models themselves. It's about supervision. The whole point is to stay in the loop when judgment really matters and stay out of it when it doesn't. A different vendor reading the work cold is what catches the blind spots, and the tiering system is what keeps the whole thing sustainable and worth running. Neither half works without the other.

It's also a different bet than the one I see a lot in the industry right now. Most of the pitch is about more autonomy, agents that run longer on their own. I find I trust the setup more when I aim for the opposite: less autonomy on the things that are really important, more on the things that are less important. That's a decision about risk management, not a limitation I'm working around.

The part I'd like to point to, though, is what this unlocks. When we lean on a single model, the complexity of what we can confidently build is capped by how far we're willing to trust that one model's word. The moment a different vendor is checking the work where it counts, that ceiling moves. We can take on bigger, higher-stakes projects with AI because we're no longer betting everything on one model being right. It doesn't replace judgment, it protects it, and that's what lets us point it at harder problems.

This is one piece of a larger system I've been building, the meta layer around how I work with AI. I'll write about the other pieces over time. You can follow the rest of the meta layer series as I publish it.

More articles

All articles

About the author

Maxim St-Hilaire is a Staff Product Manager at Udemy leading MarTech product strategy. Programmer-turned-PM. Bilingual, based in Toronto.

Read more about Maxim