Feature · Comparison Mode
One prompt. Every server. Every model.
Run the same task across any combination of MCP servers and language models. See outputs, tool-call traces, latency, and failures side by side. The eval workflow you currently fake with five terminal tabs.
Side-by-side, every dimension
Three axes: server, model, prompt.
Pick any combination. Hold the prompt fixed; vary servers and models. Hold the model fixed; vary servers. Hold the server fixed; vary models. MCPOrbit runs every cell of the grid concurrently and shows you the outputs, tool-call traces, latency, error class and token cost in one view.
- Run claude-opus-4.7 vs gpt-5 on the same MCP server
- Run two builds of the same server through the same model
- Mix and match — N×M without writing a runner script
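As a rough illustration of what the grid replaces, here is a minimal sketch of a hand-written N×M runner. It is not MCPOrbit's implementation: `CellResult`, `run_cell`, `run_grid`, and `compare` are hypothetical names, and `run_cell` is a placeholder where a real script would call a model provider and an MCP server.

```python
import asyncio
import itertools
import time
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CellResult:
    """One cell of the comparison grid (hypothetical shape)."""
    server: str
    model: str
    output: str
    latency_ms: float
    error_class: Optional[str] = None


async def run_cell(server: str, model: str, prompt: str) -> CellResult:
    """Placeholder for a single run; a real runner would call the model and the MCP server here."""
    start = time.perf_counter()
    output, error = "", None
    try:
        await asyncio.sleep(0)  # stands in for the real network round trips
        output = f"[{model} via {server}] {prompt}"
    except Exception as exc:  # record the failure instead of aborting the whole grid
        error = type(exc).__name__
    latency_ms = (time.perf_counter() - start) * 1000
    return CellResult(server, model, output, latency_ms, error)


async def run_grid(servers: List[str], models: List[str], prompt: str) -> List[CellResult]:
    """Hold the prompt fixed and run every (server, model) pair concurrently."""
    cells = itertools.product(servers, models)
    # gather() returns results in the same order the coroutines were passed in.
    return list(await asyncio.gather(*(run_cell(s, m, prompt) for s, m in cells)))


def compare(servers: List[str], models: List[str], prompt: str) -> List[CellResult]:
    """Synchronous entry point for scripts."""
    return asyncio.run(run_grid(servers, models, prompt))
```

Calling `compare(["linear", "postgres"], ["claude-opus-4.7", "gpt-5-mini"], "List open issues")` returns four results, one per cell, ordered server-major.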
linear · claude-opus-4.7 · 287ms
> tools/call list_issues
{ "team": "growth", "status": "open" }
< response
{ "issues": [4 items], "cursor": "..." }
postgres · gpt-5-mini · 412ms
> tools/call list_issues
{ "team": "growth", "status": "open" }
< response
{ "issues": [4 items], "cursor": "..." }
Use cases
Three jobs comparison mode does well.
Vetting a new server before integration
Run your real prompts against it. See if the tool selection matches what the README claims.
Regression testing during development
Snapshot a working build. Re-run after a change. Diff the structured outputs and the tool-call traces.
Benchmarking model providers
Same MCP server, three model vendors. Decide based on cost, latency, and behaviour, not on vibes.
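For the regression-testing job, the snapshot-and-diff idea can be approximated in a few lines. This is a sketch under assumptions, not MCPOrbit's snapshot format: it assumes a trace is a list of dicts with hypothetical `tool` and `arguments` keys, and `normalize` and `diff_traces` are illustrative names.

```python
import difflib
import json
from typing import Any, Dict, List

Trace = List[Dict[str, Any]]


def normalize(trace: Trace) -> List[str]:
    """Serialize each tool call as one JSON line with sorted keys, so argument order never shows up as a diff."""
    return [
        json.dumps({"tool": call["tool"], "arguments": call["arguments"]}, sort_keys=True)
        for call in trace
    ]


def diff_traces(baseline: Trace, candidate: Trace) -> List[str]:
    """Return unified-diff lines between two tool-call traces; an empty list means no drift."""
    return list(
        difflib.unified_diff(
            normalize(baseline),
            normalize(candidate),
            fromfile="baseline",
            tofile="candidate",
            lineterm="",
        )
    )
```

Normalizing before diffing keeps the comparison about behaviour (which tools were called, with which arguments, in which order) rather than about serialization noise.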
The JSON-RPC log
Every message between MCPOrbit and a server, inspectable.
The log panel shows the live JSON-RPC stream — request, response, timestamp, error class. Right-click any row to copy as curl. When something diverges, the log is the first place you look.
- Full request/response with timestamps
- Error class surfaced inline (InvalidParams, MethodNotFound, …)
- Copy-as-curl on any row
- Filter by method, status, or server
JSON-RPC log · linear/list_issues
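To make the log rows concrete, here is a minimal sketch of what sits behind one of them: a JSON-RPC 2.0 `tools/call` request and a mapping from the error codes reserved by the JSON-RPC 2.0 spec to class names like the ones shown inline. The helper names (`build_tool_call`, `error_class`) and the `ServerError` / `ApplicationError` labels are assumptions for illustration, not MCPOrbit's internals.

```python
from typing import Any, Dict, Optional

# Error codes reserved by the JSON-RPC 2.0 specification.
JSONRPC_ERROR_CLASSES = {
    -32700: "ParseError",
    -32600: "InvalidRequest",
    -32601: "MethodNotFound",
    -32602: "InvalidParams",
    -32603: "InternalError",
}


def build_tool_call(request_id: int, tool: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Build the JSON-RPC 2.0 envelope for an MCP tools/call request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


def error_class(response: Dict[str, Any]) -> Optional[str]:
    """Return a short class name for an error response, or None for a success."""
    error = response.get("error")
    if error is None:
        return None
    code = error.get("code")
    if code in JSONRPC_ERROR_CLASSES:
        return JSONRPC_ERROR_CLASSES[code]
    # -32000 to -32099 is reserved for implementation-defined server errors.
    if isinstance(code, int) and -32099 <= code <= -32000:
        return "ServerError"  # label chosen for this sketch
    return "ApplicationError"  # label chosen for this sketch
```

For example, `build_tool_call(1, "list_issues", {"team": "growth", "status": "open"})` produces the request behind the `list_issues` rows in the comparison view above.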
Honest caveats
What comparison mode is not.
Not an automated eval framework.
MCPOrbit doesn't grade outputs against a rubric. It surfaces them so you can decide.
Not a load-test runner.
Comparison mode runs one prompt at a time, fanned out concurrently across your configurations. If you need 10k requests, you want a load tester.
Not a CI plugin (yet).
Drift testing has the snapshot/diff primitive, but headless CI integration is on the roadmap, not shipped.
Comparison Mode FAQ
Can I save a comparison configuration?
Yes. Each comparison is a saved configuration with a name, the participating server-model pairs, and the prompt. Reopen any time.
Are tool calls executed against real servers?
Yes. There's no simulation. If your MCP server has side effects, comparison mode will trigger them. Use a staging server when comparing destructive tools.
Do model API calls cost real money?
Yes. MCPOrbit uses your own API keys. Cost is on you. We surface token counts in the log panel so you can budget.
Can I export comparison results?
JSON export ships today. Markdown export is on the roadmap.
What's the practical limit?
No hard cap. The practical limit is whatever your model providers and your patience can handle. We've seen teams run 6×3 grids comfortably.
