# Claude Code Fusion
July 2, 2026 · Engineering, AI

Turning one Claude Code session into a tech lead, cheap Claude tiers into workers, and the Codex and Grok CLIs into peer engineers billed to their own subscriptions.
I pay for three coding subscriptions: Claude, Codex, and Grok. For months the Claude one did all the work while the other two sat idle, and the most expensive model on the Claude plan spent its quota running test suites and renaming variables. The wrong model was doing the wrong job on the wrong bill.

[claude-code-fusion](https://github.com/okisdev/claude-code-fusion) is what I built to fix that. It turns a Claude Code session into a tech lead that plans, delegates, judges, and synthesizes, but almost never implements. Cheaper Claude models do the mechanical work. The Codex and Grok CLIs, already installed and already paid for, act as peer engineers whose tokens bill to their own subscriptions. The whole thing ships as a Claude Code plugin marketplace.

This post is the story of building it, including the parts that got thrown away.

## What didn't survive first contact

The obvious architectures died in the first research pass.

Proxy routers (the claude-code-router family) were ruled out immediately. Since Anthropic tightened subscription OAuth enforcement in 2026, only the official, unproxied Claude Code binary can authenticate with subscription credentials. Route traffic through a proxy and you fall back to API key billing, and you lose the top model tier entirely. The two things I was optimizing for, subscription quota and the strongest available orchestrator, are exactly what a proxy destroys.

An MCP server for multi-model access died on the project's own core metric: keeping the orchestrator's context lean. An MCP server parks its schema permanently in context and returns full peer transcripts straight into the caller's window. Either the orchestrator swallows all of that, or you wrap every call in a subagent anyway, at which point the MCP server is just a subprocess with worse failure modes. The CLIs already speak JSON, resume sessions by id, and fan out fine.

So the design verdict was: build a plugin, but only half of the system should be a plugin. The Grok integration needed real code. The orchestration layer needed none; model-pinned subagents plus a routing policy in a rules file, all native Claude Code mechanisms.

## The shape

Four layers. An orchestrator that only coordinates, three Claude workers pinned by role, two peer engineers from other vendors, and a panel that convenes the peers when a wrong answer would be expensive.

The orchestrator runs on the strongest available model (the settings pin is `best[1m]`, which floats to whatever that currently means) and holds one hard rule: if a tool call is producing the artifact the user asked for, that call belongs to a worker. The orchestrator's own tool calls are coordination overhead only; a quick peek to phrase a better brief, a read-only check on a worker's claim, final review of a diff.

Under it sit three Claude workers, pinned by role rather than by model name: `deep-reasoner` (Opus at maximum effort, for architecture and root cause work), `fast-worker` (Sonnet, for well-specified mechanical execution), and `trivial-worker` (true Haiku, for single-file trivia). Beside them, two peer engineers: the Codex CLI through its official plugin, and the Grok CLI through the companion runtime this repo ships. Each peer has a lane: Grok is the fast lane (its default model is quick, so it takes volume: self-contained packages, drafts, large-context reads, research digests), Codex is the deep lane (gnarly implementations, stubborn debugging, adversarial review). And their value is not just extra capacity. They are not Claude, and a second opinion from a different model family fails differently.

The routing boundary took the longest to phrase correctly, and it ended up being the sentence I care most about in the whole policy: difficulty is not the delegation boundary; ambiguity is. Hard-but-mechanical work (migrations, test authoring, API removals) delegates freely. Interpreting vague intent, cross-cutting product decisions, any task whose brief would itself require judgment to write: those stay in the main loop until the ambiguity is resolved, and then the resolved version delegates.

## Wrapping a CLI you don't control

The Grok half is real code: a dependency-free Node companion runtime that wraps the Grok CLI's headless mode. It handles background jobs, session threading (resume by uuid; the named-session flag looks like it should work and doesn't), adversarial reviews with one corrective retry when the JSON comes back malformed, and best-of-n implementation tournaments. Everything the orchestrator needs comes back as machine-parseable footer lines: `state: done|error|cancelled`, and on failure a `failure: <kind>` line.

Those failure kinds drive a circuit breaker in the routing policy. The orchestrator parses the line instead of inferring the outcome from prose, and each kind has its own consequence:

Two details in the runtime turned out to be load-bearing. The foreground timeout is 570 seconds because the Bash tool that invokes the companion times out at 600, and the companion must always reap the process and write the job record before Claude Code kills the call. And the whole thing is tested against a fake: `tests/fake-grok`, a small executable that replays eleven canned behaviors (bad JSON then good JSON, hangs that trap SIGTERM, auth errors) so the test suite runs with no network and no real Grok anywhere near it.

## The permission surprise

The best bug was not in my code. During hardening I ran `grok inspect` and found Grok loading permission rules straight out of `~/.claude/settings.json`. My "read-only consult mode" was not read-only: inherited allows let a consult run write files and, worse, execute nested `grok` commands through Bash. A consultation could recursively launch itself, on quota.

The fix rests on one property I verified rather than assumed: deny beats allow. Every consult run now carries explicit denies (`Edit`, `Write`, `Bash(grok*)`, `Bash(claude*)`, `Bash(codex*)`, `Bash(node*)`), and every run except best-of-n passes `--no-subagents`, because Grok also auto-discovers agent definitions in `~/.claude/agents` and would happily spawn my own workers back at me.

## The panel

For decisions where a wrong answer is expensive, `/fusion:panel` composes one neutral brief and sends it to Codex and Grok in parallel, blind. Neither engine sees the other's answer, not even in follow-ups. The orchestrator then adjudicates with a structured judge pass before writing any prose: consensus, contradictions, partial coverage, unique insights, and blind spots, all attributed by engine.

Blindness is the whole point. When the panel reviewed a design question about its own sibling feature (should the stop-time review gate deduplicate reruns by diff hash), both engines independently found the same flaw: a naive diff hash misses content changes in untracked files, because the review payload lists those by path only. Two engines converging on one subtle flaw without seeing each other is a far stronger signal than either alone, and much stronger than one engine agreeing with itself twice.

## The system audited itself

Before the repo went public, the system got pointed at itself twice, and both passes earned their keep.

The first was a full-stack acceptance run, all four executors live at once: Haiku inventorying the command files, Sonnet running the test suite, Opus auditing the circuit breaker policy for logical holes, and Grok reviewing the timeout handling, followed by a real review and a real panel. That run caught a genuine fail-open. The optional stop gate (Grok reviews your diff when the session tries to stop, and can block it) parsed strictly: it expected `BLOCK:` on the first line of the verdict. Grok, being Grok, prefaced its verdict with a sentence of chatter. Real dangerous diff, correct verdict, silently allowed through. The parser now scans every line, there is a regression test, and a live replay confirmed the block fires.

The second was the pre-release audit proper: four tracks in parallel. A claims ledger checking every statement in the README against the code. A consistency pass. A live blind Codex review. A live blind Grok review. All adjudicated panel-style into one fix batch. And mid-audit, the Grok track itself failed with `failure: auth` (my default Grok model had switched to a variant that rejected a parameter). The circuit breaker, which another track was auditing at that exact moment, degraded and rerouted as designed. The mechanism validated itself while under review.

## The idle bench

Shipping turned out not to be the end of the story. A few sessions later I went back and read the transcripts of the system actually running (including the session that produced the first draft of this post), and found the failure mode I hadn't designed for: Claude was still doing almost all the execution. When real work came up, a Claude worker took it, every time. The peers only appeared where the design had explicitly framed them: audits, panels, second opinions. Two paid subscriptions, mostly idle.

The diagnosis was structural, not accidental. The rules positioned the peers as reviewers, so the orchestrator reached for them at review moments and nowhere else. Their agent descriptions introduced them as a "second implementation or diagnosis pass", so they lost every selection contest for first-pass execution. Delegating to a native Claude worker is one tool call; delegating to a peer means composing a self-contained brief and waiting minutes, and nothing in the policy pushed back against that friction. And web search was hardcoded off in every companion call, which quietly gutted Grok's value for research.

The fix was a reframe, not a knob. The policy's Role section became an Operating model, and its opening line is now my favorite sentence in the repo: run the session like the founder of a well staffed startup; you decide, employees execute, and the bench is deep and already paid for. Under it, the rules that actually change behavior: the orchestrator's tokens are the most expensive in the company. Bias to fan out; five or more concurrent delegations is a normal working state, and under-dispatching, not over-dispatching, is the failure mode to watch for. Keep every engine drawing. Trust, then verify. The peers got lanes instead of a review corner (the lanes in the diagram near the top of this post arrived only with this rewrite), plus a balance check: when several delegations have gone to Claude workers while the peers sit idle, the next eligible package goes to a peer.

Two more consequences fell out of the same complaint.

First: slash commands defeat the point. If I have to type `/fusion:panel` myself, the tech lead isn't making the decisions; I am. Digging into why some things self-trigger and others never do produced the most useful mechanical insight of the whole project: Claude Code's auto-delegation is driven by description competition, not by rules files. The Grok agent fired on its own because its description happened to be trigger-shaped ("proactively use when..."); the panel almost never fired because it existed only as a sentence in the rules. So every description got rewritten to compete for its moment, and the policy gained an auto-invocation section that maps plain language ("which approach", "be exhaustive", "audit everything") to the right machinery.

Second: `/fusion:ultra`. I wanted ultracode-style intensity, a whole fleet attacking one task, without spending Claude quota on the fleet. Before building it I had a subagent read the source of UltraCode-Shim, the project that inspired the ask, and the finding was a nice full-circle moment: it gets its power from a MITM proxy that impersonates the official client and borrows the ChatGPT OAuth token, the exact architecture family this project ruled out on day one. What got built instead works with the grain: a request to go deep decomposes into six to eight independent facets (up to about twelve), fans them all out in one message with Grok taking the bulk and Codex the hardest two or three, and comes back as one synthesized deliverable, contradictions surfaced rather than averaged away. A size gate refuses to convene a fleet for a one-file change.

## Named for the pattern, not the model

The project spent its first days as claude-router. The rename came from an uncomfortable question I asked myself mid-build: what happens when the current top model becomes unavailable, or is superseded? A system named for routing to one specific orchestrator has its identity coupled to that model's lifespan.

Around the same time I read about OpenRouter's Fusion feature and Cognition's Devin Fusion, and realized the pattern I had converged on (independent perspectives, blind, adjudicated by a judge) was the same one other teams had discovered independently. The pattern deserved the name; the model didn't. claude-router became claude-code-fusion, and the technical consequence followed directly: roles bind to alias tiers (`best`, `opus`, `sonnet`), not to model IDs, so a same-tier release slots into its role with zero configuration. The one full-ID pin left in the system (true Haiku on `trivial-worker`, because an environment remap hijacks the `haiku` alias) is exactly the kind of rot `/fusion:doctor` exists to flag.

## Enforced versus requested

The most honest thing in the README is a split I now want in every agent system's documentation: what is runtime-enforced versus what is prompt-requested. The sandbox flags, the deny lists, the timeout ladder, the failure classification: enforced, in code, covered by tests. The routing table, the panel blindness, the circuit breaker policy: requested, in prose, dependent on the orchestrator model actually following instructions. Both layers are real, but they fail differently, and pretending prose is enforcement is how agent systems end up trusted more than they deserve.

## Coda

This post went through the system too. Subagents dug the story out of the original session transcripts, mapped this site's authoring conventions, and built the interactive diagrams above from specs, while the orchestrator wrote the prose and judged the results.

The marketplace lives at [okisdev/claude-code-fusion](https://github.com/okisdev/claude-code-fusion):

```bash
claude plugin marketplace add okisdev/claude-code-fusion
claude plugin install grok@claude-code-fusion
claude plugin install fusion@claude-code-fusion
```

Then `/fusion:setup` once per machine, and `/grok:setup` if Grok is on your PATH.

If you already pay for more than one coding subscription, the expensive question is no longer which model is best. It is which model should be doing the job in front of you, and on whose bill.