Evaluation Methodology
MiroFlow provides a standardized evaluation framework for fair, reproducible model comparison across 9+ benchmarks. For cross-model results, see the Model Comparison Leaderboard.
Featured Results: MiroThinker


Supported Benchmarks
| Benchmark | Category | Verifier Type | Metrics |
|---|---|---|---|
| GAIA (Validation + Test) | General Agent | Exact match + normalization | Pass@1 accuracy |
| HLE / HLE Text-Only | Language Understanding | LLM judge | Accuracy |
| BrowseComp (EN + ZH) | Web Search | Exact match | Accuracy |
| xBench-DeepSearch | Deep Search | Exact match | Accuracy |
| FutureX | Future Prediction | Custom verifier | Ranking |
| FinSearchComp | Finance | Custom verifier | Accuracy |
| WebWalkerQA | Web Navigation | Exact match | Accuracy |
| FRAMES-Test | Multi-hop QA | LLM judge | Accuracy |
| SimpleQA | Simple QA | Exact match | Accuracy |
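Several benchmarks in the table are scored by exact match, and GAIA adds a normalization step. As a rough illustration of what that kind of verifier can look like, here is a minimal sketch. The function names and the specific normalization rules (case folding, thousands separators, punctuation stripping) are assumptions made for this example and are not MiroFlow's actual verifier code.

```python
import re
import string
import unicodedata


def normalize_answer(text: str) -> str:
    """Hypothetical normalization rules, chosen only for illustration."""
    text = unicodedata.normalize("NFKC", text).strip().lower()
    # Drop thousands separators inside numbers, e.g. "1,234" -> "1234".
    text = re.sub(r"(?<=\d),(?=\d{3}\b)", "", text)
    # Remove punctuation, keeping the decimal point and minus sign.
    drop = string.punctuation.replace(".", "").replace("-", "")
    text = text.translate(str.maketrans("", "", drop))
    # Strip trailing periods and collapse runs of whitespace.
    text = text.rstrip(".")
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> bool:
    """True when both answers are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(reference)


def pass_at_1(predictions: list, references: list) -> float:
    """Fraction of tasks whose single prediction matches its reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must align one-to-one")
    if not references:
        raise ValueError("at least one task is required")
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return hits / len(references)
```

Under these assumed rules, `exact_match("  Paris. ", "paris")` and `exact_match("1,234", "1234")` both count as matches, while benchmarks that rely on an LLM judge or a custom verifier would need a different scoring path.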
Controlled Variables
Every benchmark evaluation in MiroFlow controls the following variables to ensure fair comparison:
| Variable | How It's Controlled |
|---|---|
| MCP Tools | Identical tool configurations across all models: the same search, code sandbox, file reading, and other tools |
| Prompt Templates | Same YAML + Jinja2 templates rendered with the same context variables |
| Verifiers | Each benchmark has a dedicated verifier implementation used for all models |
| IO Pipeline | Same input preprocessing (file content, hints, message formatting) and output extraction (summary, boxed answer) |
| Rollback Logic | Same error recovery parameters (max_consecutive_rollbacks, max_duplicate_rollbacks) |
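One way to picture these controls is as a single settings object that must be identical for every model under test. The sketch below is an assumption-laden illustration, not MiroFlow's configuration schema: the `ControlledSettings` class, its field names (other than `max_consecutive_rollbacks` and `max_duplicate_rollbacks`, which appear in the table), and the fingerprint check are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ControlledSettings:
    # Hypothetical field names, except the two rollback parameters.
    mcp_tools: tuple
    prompt_template: str
    verifier: str
    max_consecutive_rollbacks: int
    max_duplicate_rollbacks: int


def fingerprint(settings: ControlledSettings) -> str:
    """Stable hash of the settings, so two runs can be compared cheaply."""
    payload = json.dumps(asdict(settings), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def mismatched_models(runs: dict) -> list:
    """Return the models whose settings differ from the first run's."""
    if not runs:
        return []
    fingerprints = {model: fingerprint(s) for model, s in runs.items()}
    baseline = next(iter(fingerprints.values()))
    return sorted(m for m, fp in fingerprints.items() if fp != baseline)
```

An empty result from `mismatched_models` would indicate that every model was evaluated under the same controlled settings. Any model listed in the result ran with a different tool set, template, verifier, or rollback limit than the first run.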
Multi-Run Evaluation
Benchmark scripts support automated multi-run evaluation for statistical reliability:
- Parallel execution: Multiple evaluation runs execute concurrently
- Result aggregation: Scores are collected and aggregated automatically
- Statistical reporting: Mean, standard deviation, and min/max are reported across runs
Example benchmark script:
```bash
# Runs 8 evaluation passes on GAIA validation with MiroThinker
bash scripts/benchmark/mirothinker/gaia-validation-165_mirothinker_8runs.sh
```
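The parallel execution and statistical reporting steps can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions and not the logic of the benchmark scripts themselves: `run_many` and `summarize_runs` are hypothetical helpers, the caller supplies the function that performs one evaluation pass, and the sample (n - 1) standard deviation is an assumed choice.

```python
import statistics
from concurrent.futures import ThreadPoolExecutor


def run_many(run_once, n_runs: int) -> list:
    """Execute n_runs evaluation passes concurrently.

    run_once is a caller-supplied callable that takes the run index
    and returns that run's score. Results keep run-index order.
    """
    with ThreadPoolExecutor(max_workers=n_runs) as executor:
        return list(executor.map(run_once, range(n_runs)))


def summarize_runs(scores: list) -> dict:
    """Aggregate per-run scores into mean, standard deviation, min, and max."""
    if not scores:
        raise ValueError("at least one run score is required")
    return {
        "runs": len(scores),
        "mean": statistics.mean(scores),
        # Assumption: sample standard deviation, undefined for one run.
        "std": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }
```

With placeholder scores of `[0.5, 0.75, 1.0]`, `summarize_runs` yields a mean of 0.75 and a standard deviation of 0.25. Reporting the spread alongside the mean makes it easier to judge whether a difference between two models exceeds run-to-run noise.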
Reproduce Results
Follow the benchmark-specific guides in the sidebar to reproduce each result. Each guide includes dataset preparation, configuration, and execution steps.
See the Model Comparison Leaderboard for cross-model results and framework comparison.