Model Comparison Leaderboard
Same tools. Same prompts. Same infrastructure. The only variable is the model.
MiroFlow provides a standardized evaluation environment where every model gets the same tools, the same prompt templates, and the same infrastructure. This makes cross-model comparison fair and reproducible.
Cross-Model Performance
All results below were produced using MiroFlow with identical configurations — only `provider_class` and `model_name` differ.
Coming Soon
Benchmark results will be updated after comprehensive testing with v1.7. Stay tuned.
Why These Comparisons Are Fair
MiroFlow controls every variable except the model itself:
| Variable | How It's Controlled |
|---|---|
| MCP Tools | All models use the same tool set (search, code sandbox, file reading, etc.) configured via identical YAML files |
| Prompt Templates | Same YAML + Jinja2 prompt templates across all models |
| Verifiers | Each benchmark uses the same automated verifier (exact match, LLM-judge, or custom) regardless of model |
| Multi-Run Aggregation | Results are averaged over multiple runs with statistical reporting (mean, std dev, min/max) |
| Infrastructure | Same MCP server configurations, same API retry/rollback logic, same IO processing pipeline |
The framework is the constant. The model is the variable.
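The multi-run aggregation row above can be illustrated in a few lines. This is a minimal sketch of mean/std-dev/min/max reporting over per-run accuracy scores, not MiroFlow's actual reporting code, and the scores shown are made up:

```python
import statistics

def aggregate_runs(run_scores: list[float]) -> dict[str, float]:
    """Summarize per-run accuracy scores into the statistics reported
    on the leaderboard: mean, sample standard deviation, min, and max."""
    return {
        "mean": statistics.mean(run_scores),
        "std_dev": statistics.stdev(run_scores) if len(run_scores) > 1 else 0.0,
        "min": min(run_scores),
        "max": max(run_scores),
    }

# Example: accuracy across 4 independent runs (illustrative numbers)
print(aggregate_runs([0.72, 0.68, 0.70, 0.74]))
```

Reporting the spread alongside the mean matters because agentic benchmarks are noisy; a single run can over- or under-state a model by several points.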
Test Your Own Model
Add any OpenAI-compatible model to the leaderboard in three steps, plus an optional fourth to share your results:
Step 1: Create an LLM Client (if needed)
For OpenAI-compatible APIs, the built-in OpenAIClient works as-is — no new client is needed.
For custom APIs, implement a new client with the @register decorator. See Add New Model.
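The overall shape of a custom client might look like the following. The `@register` decorator and `OpenAIClient` name come from the text above, but the registry mechanics, class layout, and method signature here are illustrative assumptions, not MiroFlow's actual API; consult the Add New Model guide for the real interface:

```python
# Illustrative sketch only: the registry and method details below are
# assumptions, not MiroFlow's actual API. See "Add New Model" for the
# real interface.

CLIENT_REGISTRY: dict[str, type] = {}

def register(cls: type) -> type:
    """Hypothetical decorator that makes a client selectable by name
    via `provider_class` in a benchmark config."""
    CLIENT_REGISTRY[cls.__name__] = cls
    return cls

@register
class MyCustomClient:
    """A client for a provider whose API is not OpenAI-compatible."""
    def __init__(self, model_name: str, api_key: str):
        self.model_name = model_name
        self.api_key = api_key

    def create_completion(self, messages: list[dict]) -> str:
        # Call your provider's API here and return the assistant text.
        raise NotImplementedError

# A config line `provider_class: MyCustomClient` would then resolve here:
print("MyCustomClient" in CLIENT_REGISTRY)  # True
```

The registry pattern is what lets the YAML config stay declarative: the framework looks up the class by the string in `provider_class` rather than importing it by hand.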
Step 2: Copy a Benchmark Config and Change the LLM
```yaml
# Copy any existing benchmark config, e.g.:
# config/benchmark_gaia-validation-165_mirothinker.yaml
# Change only these two lines:
main_agent:
  llm:
    provider_class: OpenAIClient  # Your client
    model_name: your-model-name   # Your model
```
Step 3: Run the Benchmark
```bash
bash scripts/benchmark/mirothinker/gaia-validation-165_mirothinker_8runs.sh
# (or adapt the script for your config)
```
Results are automatically evaluated by the benchmark verifier and aggregated across runs.
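As a concrete illustration of the simplest verifier type, an exact-match check (one of the verifier kinds listed in the table above) normalizes both answers before comparing. This is a generic sketch, not MiroFlow's verifier implementation:

```python
def exact_match(prediction: str, ground_truth: str) -> bool:
    """Case- and whitespace-insensitive exact-match verification,
    the simplest of the automated verifier types."""
    def norm(s: str) -> str:
        return " ".join(s.strip().lower().split())
    return norm(prediction) == norm(ground_truth)

print(exact_match("  Paris ", "paris"))       # True
print(exact_match("42 km", "42 kilometers"))  # False
```

Because the same verifier runs regardless of which model produced the answer, scoring differences across models reflect the models themselves, not grading drift.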
Step 4 (Optional): Submit a PR
Add your config and results to the repository. We welcome community-contributed model evaluations.
MiroFlow vs Other Frameworks
Coming soon — framework comparison results will be added after v1.7 testing is complete.
Reproduce Any Result
Every result in the tables above can be reproduced from a config file. Follow the benchmark-specific guides:
- GAIA: Prerequisites · MiroThinker · Claude 3.7 · GPT-5 · Text-Only
- BrowseComp: English · Chinese
- HLE: Full · Text-Only
- Other: FutureX · xBench-DS · FinSearchComp · WebWalkerQA
Documentation Info
Last Updated: March 2026 · Doc Contributor: Team @ MiroMind AI