Feature · Comparison Mode

One prompt. Every server. Every model.

Run the same task across any combination of MCP servers and language models. See outputs, tool-call traces, latency, and failures side by side. It replaces the eval workflow you currently fake with five terminal tabs.

Side-by-side, every dimension

Three axes: server, model, prompt.

Pick any combination. Hold the prompt fixed; vary servers and models. Hold the model fixed; vary servers. Hold the server fixed; vary models. MCPOrbit runs every cell of the grid concurrently and shows you the outputs, tool-call traces, latency, error class, and token cost in one view.

Run claude-opus-4.7 vs gpt-5 on the same MCP server
Run two builds of the same server through the same model
Mix and match — N×M without writing a runner script
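
The grid described above is a Cartesian product of servers and models with the prompt held fixed. Here is a minimal sketch of that fan-out. It assumes a hypothetical `run_cell` coroutine standing in for a single (server, model) run. This is not MCPOrbit's API, and the server and model names are only placeholders.

```python
import asyncio
import itertools
import time
from dataclasses import dataclass


@dataclass
class CellResult:
    server: str
    model: str
    output: str
    latency_ms: float


async def run_cell(server: str, model: str, prompt: str) -> CellResult:
    """Hypothetical stand-in for one (server, model) run; does no real I/O."""
    start = time.perf_counter()
    await asyncio.sleep(0)  # placeholder for the model call and its tool calls
    latency_ms = (time.perf_counter() - start) * 1000
    return CellResult(server, model, f"{model} via {server}: {prompt}", latency_ms)


async def run_grid(servers, models, prompt):
    # One cell per (server, model) pair: N servers x M models.
    cells = itertools.product(servers, models)
    # gather() schedules every cell concurrently and preserves input order.
    return await asyncio.gather(*(run_cell(s, m, prompt) for s, m in cells))


results = asyncio.run(
    run_grid(["linear", "postgres"], ["claude-opus-4.7", "gpt-5"], "List open issues")
)
for r in results:
    print(r.server, r.model, f"{r.latency_ms:.1f}ms")
```

A 2×2 grid yields four cells. The same loop with a hand-written runner is what the N×M bullet refers to.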

Comparison · 1 prompt × 2 targets · live

linear

claude-opus-4.7 · 287ms

> tools/call list_issues

{ "team": "growth", "status": "open" }

< response

{ "issues": [4 items], "cursor": "..." }

Matched expected output

postgres

gpt-5-mini · 412ms

> tools/call list_issues

{ "team": "growth", "status": "open" }

< response

{ "issues": [4 items], "cursor": "..." }

Output differs

avg latency 349ms · tool calls 6 · errors 0 / 1 · token cost $0.012

Use cases

Three jobs comparison mode does well.

Vetting a new server before integration

Run your real prompts against it. See if the tool selection matches what the README claims.

Regression testing during development

Snapshot a working build. Re-run after a change. Diff the structured outputs and the tool-call traces.

Benchmarking model providers

Same MCP server, three model vendors. Decide based on cost, latency, and behaviour, not on vibes.
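
For the regression-testing case, the core operation is a field-level diff between a saved run and a fresh one. The sketch below shows one way to do that. It assumes each run is a plain dict holding the tool-call trace and the structured output. That record shape is hypothetical and is not MCPOrbit's snapshot format.

```python
def flatten(value, prefix=""):
    """Flatten nested dicts/lists into {path: leaf} so diffs are per field.

    Empty dicts and lists contribute no paths in this simple version.
    """
    if isinstance(value, dict):
        items = [(f"{prefix}.{k}" if prefix else str(k), v) for k, v in value.items()]
    elif isinstance(value, list):
        items = [(f"{prefix}[{i}]", v) for i, v in enumerate(value)]
    else:
        return {prefix: value}
    flat = {}
    for path, child in items:
        flat.update(flatten(child, path))
    return flat


def diff_runs(baseline, candidate):
    """Report which field paths were added, removed, or changed."""
    a, b = flatten(baseline), flatten(candidate)
    return {
        "added": sorted(b.keys() - a.keys()),
        "removed": sorted(a.keys() - b.keys()),
        "changed": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }


# Hypothetical run records: a snapshot of a working build and a re-run.
baseline = {
    "tool_calls": [{"name": "list_issues", "arguments": {"team": "growth", "status": "open"}}],
    "output": {"issues": 4},
}
candidate = {
    "tool_calls": [{"name": "list_issues", "arguments": {"team": "growth"}}],
    "output": {"issues": 3},
}
report = diff_runs(baseline, candidate)
print(report)
```

Diffing the trace alongside the output matters here: in this example the changed issue count is explained by the dropped `status` argument.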

The JSON-RPC log

Every message between MCPOrbit and a server, inspectable.

The log panel shows the live JSON-RPC stream — request, response, timestamp, error class. Right-click any row to copy as curl. When something diverges, the log is the first place you look.

Full request/response with timestamps
Error class surfaced inline (InvalidParams, MethodNotFound, …)
Copy-as-curl on any row
Filter by method, status, or server
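
To give a sense of what copy-as-curl might produce, here is a rough sketch that renders a JSON-RPC request as a curl command. It assumes the server is reachable over HTTP at a placeholder URL. The URL, the headers chosen, and the `to_curl` helper are illustrative assumptions and not MCPOrbit's actual output. A real session may also need session or auth headers.

```python
import json
import shlex


def to_curl(url: str, request: dict) -> str:
    """Render a JSON-RPC request as a shell-safe curl command."""
    body = json.dumps(request, separators=(",", ":"))
    parts = [
        "curl", "-sS", "-X", "POST", url,
        "-H", "Content-Type: application/json",
        "-H", "Accept: application/json, text/event-stream",
        "--data", body,
    ]
    # shlex.quote escapes each argument so the JSON body survives the shell.
    return " ".join(shlex.quote(p) for p in parts)


# A tools/call request in the JSON-RPC 2.0 envelope; the id is arbitrary.
request = {
    "jsonrpc": "2.0",
    "id": 3,
    "method": "tools/call",
    "params": {"name": "list_issues", "arguments": {"team": "growth", "status": "open"}},
}
command = to_curl("http://localhost:8080/mcp", request)  # placeholder URL
print(command)
```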

JSON-RPC log · linear/list_issues

→ initialize { clientInfo: 'mcporbit/0.5' }

← initialize [ok] { serverInfo: 'linear/1.2', capabilities: {…} }

→ tools/list {}

← tools/list [ok] { tools: [12 entries] }

→ tools/call { name: 'list_issues', arguments: {…} }

← tools/call [err] InvalidParams: 'team' must be a UUID

→ tools/call { name: 'list_issues', arguments: {…} } // retry

← tools/call [ok] { issues: [4], cursor: '…' } 287ms
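
Error classes such as InvalidParams and MethodNotFound correspond to the standard error codes in the JSON-RPC 2.0 specification. The sketch below maps those codes to class names and filters log rows by method, status, or server. The row shape and both helper functions are assumptions made for illustration. They are not MCPOrbit's internal log format.

```python
from typing import Optional

# Standard error codes defined by the JSON-RPC 2.0 specification.
JSONRPC_ERROR_CLASSES = {
    -32700: "ParseError",
    -32600: "InvalidRequest",
    -32601: "MethodNotFound",
    -32602: "InvalidParams",
    -32603: "InternalError",
}


def error_class(response: dict) -> Optional[str]:
    """Return a class name for an error response, or None for a success."""
    error = response.get("error")
    if error is None:
        return None
    return JSONRPC_ERROR_CLASSES.get(error.get("code"), "UnknownError")


def filter_rows(rows, method=None, status=None, server=None):
    """Keep rows matching every filter given; status is 'ok' or 'err'."""
    kept = []
    for row in rows:
        row_status = "err" if error_class(row["response"]) else "ok"
        if method is not None and row["method"] != method:
            continue
        if status is not None and row_status != status:
            continue
        if server is not None and row["server"] != server:
            continue
        kept.append(row)
    return kept


# Hypothetical log rows, loosely modelled on the panel above.
rows = [
    {"server": "linear", "method": "tools/list",
     "response": {"jsonrpc": "2.0", "id": 2, "result": {"tools": []}}},
    {"server": "linear", "method": "tools/call",
     "response": {"jsonrpc": "2.0", "id": 3,
                  "error": {"code": -32602, "message": "'team' must be a UUID"}}},
]
failed = filter_rows(rows, status="err")
print([error_class(r["response"]) for r in failed])
```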

Honest caveats

What comparison mode is not.

Not an automated eval framework.

MCPOrbit doesn't grade outputs against a rubric. It surfaces them so you can decide.

Not a load-test runner.

Comparison mode fans one prompt at a time out across your configurations; it does not generate sustained request volume. If you need 10k requests, you want a load tester.

Not a CI plugin (yet).

Drift testing has the snapshot/diff primitive, but headless CI integration is on the roadmap, not shipped.

Comparison Mode FAQ

Can I save a comparison configuration?

Yes. Each comparison is a saved configuration with a name, the participating server-model pairs, and the prompt. Reopen any time.

Are tool calls executed against real servers?

Yes. There's no simulation. If your MCP server has side effects, comparison mode will trigger them. Use a staging server when comparing destructive tools.

Do model API calls cost real money?

Yes. MCPOrbit uses your own API keys. Cost is on you. We surface token counts in the log panel so you can budget.

Can I export comparison results?

JSON export ships today. Markdown export is on the roadmap.

What's the practical limit?

No hard cap. The practical limit is whatever your model providers and your patience can handle. We've seen teams run 6×3 grids comfortably.