GAIA Validation
MiroFlow achieves state-of-the-art performance on the GAIA validation benchmark, which tests complex reasoning tasks that require multi-step problem solving, information synthesis, and tool use.
More details: GAIA: a benchmark for General AI Assistants (arXiv:2311.12983)
About the GAIA Dataset
What is GAIA?
GAIA (General AI Assistants) is a benchmark designed to evaluate AI agents' ability to perform complex reasoning tasks that require multiple skills, including web browsing, file manipulation, data analysis, and multi-step problem solving.
Performance Comparison
State-of-the-Art Performance
MiroFlow achieves state-of-the-art (SOTA) performance among open-source agent frameworks on the GAIA validation set.
Key Performance Metrics
- Pass@3: 81.8% (a task counts as solved if any of three independent runs answers correctly)
- Majority Vote: 82.4% (scoring the most common answer across the three runs)
- Pass@1 (best@3): 74.5% (accuracy of the best single run)
- Pass@1 (avg@3): 72.2% (mean accuracy across the three runs)
Reproducibility Guarantee
MiroFlow's results are fully reproducible, unlike those of frameworks with unclear evaluation methods. Note that Hugging Face access was disabled during inference to prevent the agent from retrieving answers directly.
Setup and Evaluation Guide
Complete Reproduction Instructions
This section provides step-by-step instructions for reproducing our GAIA validation benchmark results with the open-source framework.
Step 1: Prepare the GAIA Validation Dataset
Choose one of the following methods to obtain the GAIA validation dataset:
Method 1: Direct Download (Recommended)
No Authentication Required
This method does not require HuggingFace tokens or access permissions.
cd data
wget https://huggingface.co/datasets/miromind-ai/MiroFlow-Benchmarks/resolve/main/gaia-val.zip
unzip gaia-val.zip
# Unzip passcode: pf4*
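Optionally, sanity-check the extracted dataset. This sketch assumes the archive unpacks to a gaia-val directory containing a metadata.jsonl file with one line per task; adjust the paths if the actual layout differs:

# Hypothetical layout check: expect one metadata line per validation task (165)
ls gaia-val
wc -l gaia-val/metadata.jsonl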
Method 2: Using the prepare-benchmark command
Prerequisites Required
This method requires HuggingFace dataset access and token configuration.
First, you need to request access and configure your environment:
- Request Dataset Access: Visit https://huggingface.co/datasets/gaia-benchmark/GAIA and request access
- Configure Environment: Add your Hugging Face token to the .env file, as shown in the sketch below

Getting Your Hugging Face Token
- Go to https://huggingface.co/settings/tokens
- Create a new token with at least "Read" permissions
- Add your token to the .env file
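For reference, a minimal .env entry; the variable name HF_TOKEN follows the usual Hugging Face convention but is an assumption here, so match whatever name MiroFlow's .env template actually uses:

# Hypothetical example: token for gated dataset access (variable name assumed)
HF_TOKEN="hf_your-token-here"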
Then download the dataset:
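The exact prepare-benchmark syntax may vary between releases; a plausible invocation, assuming the subcommand takes the benchmark name as an argument (the name gaia-val is also an assumption, so check the command's help output):

# Hypothetical sketch: download via the framework's own CLI
uv run main.py prepare-benchmark gaia-val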
Step 2: Configure API Keys
Required API Configuration
Set up the required API keys for model access and tool functionality. Update the .env file to include the following keys:
# Search and web scraping capabilities
SERPER_API_KEY="your-serper-api-key"
JINA_API_KEY="your-jina-api-key"
# Code execution environment
E2B_API_KEY="your-e2b-api-key"
# Primary LLM provider (Claude-3.7-Sonnet via OpenRouter)
OPENROUTER_API_KEY="your-openrouter-api-key"
OPENROUTER_BASE_URL="https://openrouter.ai/api/v1"
# Vision understanding capabilities
ANTHROPIC_API_KEY="your-anthropic-api-key"
GEMINI_API_KEY="your-gemini-api-key"
# LLM judge, reasoning, and O3 hints
OPENAI_API_KEY="your-openai-api-key"
OPENAI_BASE_URL="https://api.openai.com/v1"
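Before launching a run, it can help to confirm that every required key is present in the .env file; a minimal check with standard shell tools:

# Report any key from the list above that is missing from .env
for key in SERPER_API_KEY JINA_API_KEY E2B_API_KEY OPENROUTER_API_KEY \
           ANTHROPIC_API_KEY GEMINI_API_KEY OPENAI_API_KEY; do
  grep -q "^${key}=" .env || echo "missing: ${key}"
done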
Why OpenRouter?
We use Claude-3.7-Sonnet through the OpenRouter backend as the primary LLM provider because OpenRouter offers better response rates and improved reliability compared to direct API access.
Step 3: Run the Evaluation
Execute the evaluation using the following command:
uv run main.py common-benchmark \
--config_file_name=agent_gaia-validation \
output_dir="logs/gaia-validation/$(date +"%Y%m%d_%H%M")"
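The $(date ...) expansion produces a timestamped directory such as logs/gaia-validation/20250922_1430. Note this path: resuming an interrupted run (Step 4) requires passing the same output_dir.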
Step 4: Monitor Progress and Resume
Progress Tracking
You can monitor the evaluation progress in real time by inspecting the run's output directory. Replace $PATH_TO_LOG in the example below with your actual output directory path.
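The log layout inside the output directory is an assumption here, so treat this as a sketch and adjust the patterns to what your run actually produces:

# Hypothetical sketch: follow log files as the run writes them
tail -f "$PATH_TO_LOG"/*.log
# or watch the number of result files grow as tasks complete
watch -n 30 "ls $PATH_TO_LOG | wc -l"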
Resume Capability
If the evaluation is interrupted, you can resume from where it left off by specifying the same output directory:
uv run main.py common-benchmark \
--config_file_name=agent_gaia-validation \
output_dir="logs/gaia-validation/20250922_1430"
Execution Traces
Complete Execution Traces
We have released our complete execution traces for the gaia-validation dataset on Hugging Face. The collection covers a full run of all 165 tasks with an overall accuracy of 73.94% (122 of 165 tasks correct).
You can download them using the following command:
wget https://huggingface.co/datasets/miromind-ai/MiroFlow-Benchmarks/resolve/main/gaia_validation_miroflow_trace_public_20250825.zip
unzip gaia_validation_miroflow_trace_public_20250825.zip
# Unzip passcode: pf4*
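For a quick overview of the downloaded traces, you can count the per-task files; the extracted directory name and layout are assumptions based on the archive name, so adjust as needed:

# Hypothetical sketch: expect roughly one trace file per task (165 total)
find gaia_validation_miroflow_trace_public_20250825 -type f | wc -l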
Documentation Info
Last Updated: September 2025 · Doc Contributor: Team @ MiroMind AI