FinSearchComp
MiroFlow's evaluation on the FinSearchComp benchmark measures its ability to search for and analyze financial information, including the multi-step reasoning required in complex financial research scenarios.
More details: FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning
Dataset Overview
FinSearchComp Dataset
The FinSearchComp dataset consists of financial search and analysis tasks that require comprehensive research capabilities including:
- Financial data retrieval and analysis
- Market research and company analysis
- Investment decision support
- Financial news and report interpretation
- Time-sensitive financial information gathering
Key Dataset Characteristics
- Total Tasks: 635 (across T1, T2, T3 categories)
- Task Types:
  - T1: Time-Sensitive Data Fetching
  - T2: Financial Analysis and Research
  - T3: Complex Historical Investigation
- Answer Format: Detailed financial analysis and research reports
- Ground Truth: Available for T2 and T3 tasks; for T1 tasks it changes dynamically
- Evaluation: Judge-based evaluation with correctness assessment
Quick Start Guide
Quick Start Instructions
This section provides step-by-step instructions to run the FinSearchComp benchmark and prepare submission results. Note: This is a quick start guide for running the benchmark, not for reproducing exact submitted results.
Step 1: Prepare the FinSearchComp Dataset
Dataset Setup
Use the integrated prepare-benchmark command to download and process the dataset:
This will create the standardized dataset at data/finsearchcomp/standardized_data.jsonl.
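As a quick sanity check after preparation, the standardized file can be inspected with a few lines of Python. The sketch below counts records per task category. It assumes each JSONL record carries an identifier in a field named `task_id` (a hypothetical name, not confirmed by this guide) and that the identifier contains a tag such as `(T1)`, as the trace file names later in this guide suggest.

```python
import json
from collections import Counter
from pathlib import Path

# Category tags as they appear in the task names used in this guide.
CATEGORY_TAGS = ("T1", "T2", "T3")


def category_of(task_id: str) -> str:
    """Return 'T1', 'T2' or 'T3' if the id contains '(T1)' etc., else 'unknown'."""
    for tag in CATEGORY_TAGS:
        if f"({tag})" in task_id:
            return tag
    return "unknown"


def count_categories(jsonl_path, id_field="task_id"):
    """Count records per category in a JSONL file (one JSON object per line).

    `id_field` is an assumption about the record schema; adjust it to the
    field that actually holds the task identifier.
    """
    counts = Counter()
    with Path(jsonl_path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            record = json.loads(line)
            counts[category_of(str(record.get(id_field, "")))] += 1
    return counts
```

Calling `count_categories("data/finsearchcomp/standardized_data.jsonl")` returns a `Counter`; a large `unknown` count would indicate that the assumed field name or tag format does not match the real schema.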
Step 2: Configure API Keys
API Key Configuration
Set up the required API keys for model access and tool functionality. Update the .env file to include the following keys:
# For searching and web scraping
SERPER_API_KEY="xxx"
JINA_API_KEY="xxx"
# For Linux sandbox (code execution environment)
E2B_API_KEY="xxx"
# We use Claude 3.7 Sonnet for financial analysis via OpenRouter
OPENROUTER_API_KEY="xxx"
OPENROUTER_BASE_URL="https://openrouter.ai/api/v1"
# Used for hint generation and final answer extraction
OPENAI_API_KEY="xxx"
OPENAI_BASE_URL="https://api.openai.com/v1"
# Used for Claude vision understanding
ANTHROPIC_API_KEY="xxx"
# Used for Gemini vision
GEMINI_API_KEY="xxx"
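Before starting a long run, it can help to confirm that none of the keys above is missing or still set to the `"xxx"` placeholder. The following is a minimal sketch, not part of MiroFlow: it parses simple `KEY="value"` lines itself rather than relying on any particular dotenv loader, and the helper names `parse_env` and `missing_keys` are illustrative.

```python
# Keys listed in the .env example above.
REQUIRED_KEYS = [
    "SERPER_API_KEY",
    "JINA_API_KEY",
    "E2B_API_KEY",
    "OPENROUTER_API_KEY",
    "OPENROUTER_BASE_URL",
    "OPENAI_API_KEY",
    "OPENAI_BASE_URL",
    "ANTHROPIC_API_KEY",
    "GEMINI_API_KEY",
]


def parse_env(text):
    """Parse simple KEY=value / KEY="value" lines; comments and blanks are skipped.

    This is a deliberately small parser and does not handle multi-line
    values, `export` prefixes, or variable interpolation.
    """
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, placeholder="xxx"):
    """Return required keys that are absent, empty, or still the placeholder."""
    return [
        key
        for key in REQUIRED_KEYS
        if not values.get(key) or values.get(key) == placeholder
    ]
```

For example, `missing_keys(parse_env(open(".env", encoding="utf-8").read()))` returns the list of keys that still need real values; an empty list means every key in the example has been filled in.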
Step 3: Run the Evaluation
Evaluation Execution
Execute the following command to run evaluation on the FinSearchComp dataset:
uv run main.py common-benchmark --config_file_name=agent_finsearchcomp_claude37sonnet benchmark=finsearchcomp output_dir="logs/finsearchcomp/$(date +"%Y%m%d_%H%M")"
Progress Monitoring and Resume
To check the progress while running:
If you need to resume an interrupted evaluation, specify the same output directory to continue from where you left off.
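A rough way to gauge progress by hand is to count the trace files written to the output directory so far. The sketch below is an illustration rather than the project's own progress checker. It assumes trace files match `task_*.json` and carry a `(T1)`, `(T2)` or `(T3)` tag in their names, that each task produces one trace file, and it uses the 635-task total quoted in the dataset overview. A trace file existing does not by itself prove the task finished successfully.

```python
import re
from collections import Counter
from pathlib import Path

# Matches the category tag embedded in trace file names, e.g. "(T2)".
CATEGORY_RE = re.compile(r"\((T[123])\)")


def trace_counts(output_dir):
    """Count task_*.json trace files per category in an output directory."""
    counts = Counter()
    for path in Path(output_dir).glob("task_*.json"):
        match = CATEGORY_RE.search(path.name)
        counts[match.group(1) if match else "unknown"] += 1
    return counts


def progress_line(counts, total_tasks=635):
    """Format a one-line summary; total_tasks defaults to the dataset size."""
    done = sum(counts.values())
    percent = 100.0 * done / total_tasks
    return f"{done}/{total_tasks} traces written ({percent:.1f}%)"
```

Typical use would be `print(progress_line(trace_counts("logs/finsearchcomp/<run directory>")))`, with the placeholder replaced by the output directory passed to the run.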
Step 4: Extract Results
Result Extraction
When the evaluation completes, the results are written automatically to the output directory:
- benchmark_results.jsonl: Detailed results for each task
- benchmark_results_pass_at_1_accuracy.txt: Summary accuracy statistics
- task_*_attempt_1.json: Individual task execution traces
Evaluation Notes
Task Type Considerations
The FinSearchComp dataset includes different task types with varying evaluation criteria:
- T1 Tasks: Time-Sensitive Data Fetching tasks are excluded from correctness evaluation due to outdated ground truth, but completion is still tracked
- T2 Tasks: Financial Analysis tasks are evaluated for correctness and quality
- T3 Tasks: Complex Historical Investigation tasks require comprehensive research and analysis
Output Analysis
The evaluation generates detailed execution traces showing:
- Research process for each financial task
- Information gathering from multiple sources
- Financial calculations and analysis
- Comprehensive reports with insights and recommendations
Directory Structure
After running evaluations, you'll find the following structure:
logs/finsearchcomp/agent_finsearchcomp_claude37sonnet_YYYYMMDD_HHMM/
├── benchmark_results.jsonl              # Task results summary
├── benchmark_results_pass_at_1_accuracy.txt  # Accuracy statistics
├── task_(T1)Time_Sensitive_Data_Fetching_*.json  # T1 task traces
├── task_(T2)Financial_Analysis_*.json   # T2 task traces
├── task_(T3)Complex_Historical_Investigation_*.json  # T3 task traces
└── output.log                           # Execution log
Task Categories Breakdown
The progress checker provides detailed statistics:
- Total Tasks: Complete count across all categories
- Completed Tasks: Successfully finished tasks
- Correct Tasks: Tasks with judge_result "CORRECT" (T2 and T3 only)
- Category Breakdown: Separate counts for T1, T2, and T3 tasks
- Accuracy Metrics: Pass@1 accuracy for evaluable tasks
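The statistics listed above can be approximated from `benchmark_results.jsonl` with a short script. This is a sketch under stated assumptions, not the progress checker's actual implementation: the `judge_result` field and its `"CORRECT"` value come from this guide, while the `task_id` field name and the `(T1)`/`(T2)`/`(T3)` tag inside it are assumed. Accuracy here is computed over all T2 and T3 records present in the file, which may differ from how the official summary treats unfinished tasks.

```python
import json
import re

# Matches the category tag assumed to be embedded in task ids, e.g. "(T3)".
CATEGORY_RE = re.compile(r"\((T[123])\)")


def load_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def summarize(records, id_field="task_id"):
    """Per-category counts plus pass@1 accuracy over T2 and T3 records.

    T1 records are counted but never scored, mirroring the note that T1
    tasks are excluded from correctness evaluation.
    """
    stats = {"T1": 0, "T2": 0, "T3": 0, "correct": 0}
    for record in records:
        match = CATEGORY_RE.search(str(record.get(id_field, "")))
        if not match:
            continue  # record without a recognizable category tag
        category = match.group(1)
        stats[category] += 1
        if category != "T1" and record.get("judge_result") == "CORRECT":
            stats["correct"] += 1
    evaluable = stats["T2"] + stats["T3"]
    stats["pass_at_1"] = stats["correct"] / evaluable if evaluable else None
    return stats
```

`summarize(load_jsonl("<output_dir>/benchmark_results.jsonl"))` returns a dictionary with the category counts, the number of correct T2/T3 tasks, and `pass_at_1` (or `None` when no evaluable records are found).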
Usage Examples
Single Run Evaluation
uv run main.py common-benchmark --config_file_name=agent_finsearchcomp_claude37sonnet benchmark=finsearchcomp output_dir="logs/finsearchcomp/$(date +"%Y%m%d_%H%M")"
Limited Task Testing
uv run main.py common-benchmark --config_file_name=agent_finsearchcomp_claude37sonnet benchmark=finsearchcomp benchmark.execution.max_tasks=5 output_dir="logs/finsearchcomp/$(date +"%Y%m%d_%H%M")"
Custom Agent Configuration
uv run main.py common-benchmark --config_file_name=agent_gaia-validation benchmark=finsearchcomp output_dir="logs/finsearchcomp/$(date +"%Y%m%d_%H%M")"
Multiple Runs for Reliability
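When the evaluation is repeated into separate output directories, the per-run pass@1 values can be summarized to see how stable the result is. The sketch below assumes the accuracies have already been collected as plain floats (for instance by reading each run's summary by hand); the function name `summarize_runs` is illustrative and not part of MiroFlow.

```python
from statistics import mean, stdev


def summarize_runs(accuracies):
    """Summarize pass@1 accuracies (floats in [0, 1]) from repeated runs.

    Returns the run count, mean, sample standard deviation, minimum and
    maximum. With a single run the standard deviation is reported as 0.0,
    since a sample deviation is undefined for one value.
    """
    if not accuracies:
        raise ValueError("at least one accuracy value is required")
    spread = stdev(accuracies) if len(accuracies) > 1 else 0.0
    return {
        "runs": len(accuracies),
        "mean": mean(accuracies),
        "stdev": spread,
        "min": min(accuracies),
        "max": max(accuracies),
    }
```

Reporting the mean together with the spread across runs gives a more reliable picture than a single pass@1 figure, because search results and model outputs can vary between runs.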
Documentation Info
Last Updated: September 2025 · Doc Contributor: Team @ MiroMind AI