Model Comparison Leaderboard
Same tools. Same prompts. Same infrastructure. The only variable is the model.
MiroFlow provides a standardized evaluation environment where every model gets the same tools, the same prompt templates, and the same infrastructure. This makes cross-model comparison fair and reproducible.
Cross-Model Performance
All results below were produced using MiroFlow with identical configurations — only `provider_class` and `model_name` differ.
Coming Soon
Benchmark results will be updated after comprehensive testing with v1.7. Stay tuned.
Why These Comparisons Are Fair
MiroFlow controls every variable except the model itself:
| Variable | How It's Controlled |
|---|---|
| MCP Tools | All models use the same tool set (search, code sandbox, file reading, etc.) configured via identical YAML files |
| Prompt Templates | Same YAML + Jinja2 prompt templates across all models |
| Verifiers | Each benchmark uses the same automated verifier (exact match, LLM-judge, or custom) regardless of model |
| Multi-Run Aggregation | Results are averaged over multiple runs with statistical reporting (mean, std dev, min/max) |
| Infrastructure | Same MCP server configurations, same API retry/rollback logic, same IO processing pipeline |
The framework is the constant. The model is the variable.
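The multi-run aggregation row above can be illustrated in a few lines. This is a minimal sketch of mean/std-dev/min/max reporting over per-run accuracy scores, not MiroFlow's actual reporting code, and the scores shown are made up:

```python
import statistics

def aggregate_runs(run_scores: list[float]) -> dict[str, float]:
    """Summarize per-run accuracy scores into the statistics reported
    on the leaderboard: mean, sample standard deviation, min, and max."""
    return {
        "mean": statistics.mean(run_scores),
        "std_dev": statistics.stdev(run_scores) if len(run_scores) > 1 else 0.0,
        "min": min(run_scores),
        "max": max(run_scores),
    }

# Example: accuracy across 4 independent runs (illustrative numbers)
print(aggregate_runs([0.72, 0.68, 0.70, 0.74]))
```

Reporting the spread alongside the mean matters because agentic benchmarks are noisy; a single run can over- or under-state a model by several points.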
Test Your Own Model
Add any OpenAI-compatible model to the leaderboard in three steps, plus an optional fourth to share your results:
Step 1: Create an LLM Client (if needed)
For OpenAI-compatible APIs, the built-in OpenAIClient works as-is — no new client is needed.
For custom APIs, implement a new client with the @register decorator. See Add New Model.
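The overall shape of a custom client might look like the following. The `@register` decorator and `OpenAIClient` name come from the text above, but the registry mechanics, class layout, and method signature here are illustrative assumptions, not MiroFlow's actual API; consult the Add New Model guide for the real interface:

```python
# Illustrative sketch only: the registry and method details below are
# assumptions, not MiroFlow's actual API. See "Add New Model" for the
# real interface.

CLIENT_REGISTRY: dict[str, type] = {}

def register(cls: type) -> type:
    """Hypothetical decorator that makes a client selectable by name
    via `provider_class` in a benchmark config."""
    CLIENT_REGISTRY[cls.__name__] = cls
    return cls

@register
class MyCustomClient:
    """A client for a provider whose API is not OpenAI-compatible."""
    def __init__(self, model_name: str, api_key: str):
        self.model_name = model_name
        self.api_key = api_key

    def create_completion(self, messages: list[dict]) -> str:
        # Call your provider's API here and return the assistant text.
        raise NotImplementedError

# A config line `provider_class: MyCustomClient` would then resolve here:
print("MyCustomClient" in CLIENT_REGISTRY)  # True
```

The registry pattern is what lets the YAML config stay declarative: the framework looks up the class by the string in `provider_class` rather than importing it by hand.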
Step 2: Copy a Benchmark Config and Change the LLM
```yaml
# Copy any existing benchmark config, e.g.:
# config/benchmark_gaia-validation-165_mirothinker.yaml
# Change only these two lines:
main_agent:
  llm:
    provider_class: OpenAIClient  # Your client
    model_name: your-model-name   # Your model
```
Step 3: Run the Benchmark
```bash
bash scripts/benchmark/mirothinker/gaia-validation-165_mirothinker_8runs.sh
# (or adapt the script for your config)
```
Results are automatically evaluated by the benchmark verifier and aggregated across runs.
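As a concrete illustration of the simplest verifier type, an exact-match check (one of the verifier kinds listed in the table above) normalizes both answers before comparing. This is a generic sketch, not MiroFlow's verifier implementation:

```python
def exact_match(prediction: str, ground_truth: str) -> bool:
    """Case- and whitespace-insensitive exact-match verification,
    the simplest of the automated verifier types."""
    def norm(s: str) -> str:
        return " ".join(s.strip().lower().split())
    return norm(prediction) == norm(ground_truth)

print(exact_match("  Paris ", "paris"))       # True
print(exact_match("42 km", "42 kilometers"))  # False
```

Because the same verifier runs regardless of which model produced the answer, scoring differences across models reflect the models themselves, not grading drift.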
Step 4 (Optional): Submit a PR
Add your config and results to the repository. We welcome community-contributed model evaluations.
MiroFlow vs Other Frameworks
Coming soon — framework comparison results will be added after v1.7 testing is complete.
Reproduce Any Result
Every result in the tables above can be reproduced from a config file. Follow the benchmark-specific guides:
- GAIA: Prerequisites · MiroThinker · Claude 3.7 · GPT-5 · Text-Only
- BrowseComp: English · Chinese
- HLE: Full · Text-Only
- Other: FutureX · xBench-DS · FinSearchComp · WebWalkerQA
Documentation Info
Last Updated: March 2026 · Doc Contributor: Team @ MiroMind AI