FinSearchComp

MiroFlow's evaluation on the FinSearchComp benchmark measures its ability to search for, retrieve, and analyze financial information, exercising multi-step reasoning on realistic, expert-level financial research tasks.

More details: FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning


Dataset Overview

FinSearchComp Dataset

The FinSearchComp dataset consists of financial search and analysis tasks that require comprehensive research capabilities including:

  • Financial data retrieval and analysis
  • Market research and company analysis
  • Investment decision support
  • Financial news and report interpretation
  • Time-sensitive financial information gathering

Key Dataset Characteristics

  • Total Tasks: 635 (across T1, T2, T3 categories)
  • Task Types:
    • T1: Time-Sensitive Data Fetching
    • T2: Financial Analysis and Research
    • T3: Complex Historical Investigation
  • Answer Format: Detailed financial analysis and research reports
  • Ground Truth: Available for T2 and T3 tasks; changes dynamically over time for T1 tasks
  • Evaluation: Judge-based evaluation with correctness assessment

Quick Start Guide

Quick Start Instructions

This section provides step-by-step instructions to run the FinSearchComp benchmark and prepare submission results. Note: This is a quick start guide for running the benchmark, not for reproducing exact submitted results.

Step 1: Prepare the FinSearchComp Dataset

Dataset Setup

Use the integrated prepare-benchmark command to download and process the dataset:

Download FinSearchComp Dataset
uv run main.py prepare-benchmark get finsearchcomp

This will create the standardized dataset at data/finsearchcomp/standardized_data.jsonl.
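
To verify the download, you can peek at the first record of the standardized file. The following is a minimal Python sketch; the exact fields of each record are not documented here, so it simply prints whichever keys are present.

Inspect the First Dataset Record (Python)
import json

# Path produced by the prepare-benchmark command above
path = "data/finsearchcomp/standardized_data.jsonl"

with open(path, "r", encoding="utf-8") as f:
    first = json.loads(f.readline())

# Print the available fields with a truncated preview of each value
for key, value in first.items():
    print(f"{key}: {str(value)[:80]}")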

Step 2: Configure API Keys

API Key Configuration

Set up the required API keys for model access and tool functionality. Update the .env file to include the following keys:

.env Configuration
# For searching and web scraping
SERPER_API_KEY="xxx"
JINA_API_KEY="xxx"

# For Linux sandbox (code execution environment)
E2B_API_KEY="xxx"

# We use Claude 3.7 Sonnet for financial analysis via OpenRouter
OPENROUTER_API_KEY="xxx"
OPENROUTER_BASE_URL="https://openrouter.ai/api/v1"

# Used for hint generation and final answer extraction
OPENAI_API_KEY="xxx"
OPENAI_BASE_URL="https://api.openai.com/v1"

# Used for Claude vision understanding
ANTHROPIC_API_KEY="xxx"

# Used for Gemini vision
GEMINI_API_KEY="xxx"

Step 3: Run the Evaluation

Evaluation Execution

Execute the following command to run evaluation on the FinSearchComp dataset:

Run FinSearchComp Evaluation
uv run main.py common-benchmark --config_file_name=agent_finsearchcomp_claude37sonnet benchmark=finsearchcomp output_dir="logs/finsearchcomp/$(date +"%Y%m%d_%H%M")"

Progress Monitoring and Resume

To check progress while the evaluation is running:

Check Progress
uv run utils/progress_check/check_finsearchcomp_progress.py $PATH_TO_LOG

If you need to resume an interrupted evaluation, specify the same output directory to continue from where you left off.

Resume Evaluation, e.g.
uv run main.py common-benchmark --config_file_name=agent_finsearchcomp_claude37sonnet benchmark=finsearchcomp output_dir=${PATH_TO_LOG}

Step 4: Extract Results

Result Extraction

After evaluation completion, the results are automatically generated in the output directory:

  • benchmark_results.jsonl: Detailed results for each task
  • benchmark_results_pass_at_1_accuracy.txt: Summary accuracy statistics
  • task_*_attempt_1.json: Individual task execution traces

Evaluation Notes

Task Type Considerations

The FinSearchComp dataset includes different task types with varying evaluation criteria:

  • T1 Tasks: Time-Sensitive Data Fetching tasks are excluded from correctness evaluation due to outdated ground truth, but completion is still tracked (see the sketch after this list)
  • T2 Tasks: Financial Analysis tasks are evaluated for correctness and quality
  • T3 Tasks: Complex Historical Investigation tasks require comprehensive research and analysis
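
If you want to recompute the T2/T3 accuracy yourself rather than rely on the generated summary file, the sketch below shows one way to do it. It assumes each line of benchmark_results.jsonl carries a task identifier (called task_id here, which is an assumption) beginning with the category prefix seen in the trace filenames, plus the judge_result field mentioned under Task Categories Breakdown; adjust the field names to match your actual output.

Recompute T2/T3 Accuracy (Python sketch)
import json

# Placeholder run directory; substitute your actual output_dir
results_path = "logs/finsearchcomp/<run_dir>/benchmark_results.jsonl"

evaluable, correct = 0, 0
with open(results_path, "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        task_id = str(record.get("task_id", ""))  # assumed field name
        if task_id.startswith("(T1)"):
            continue  # T1 tasks are excluded from correctness evaluation
        evaluable += 1
        if record.get("judge_result") == "CORRECT":
            correct += 1

print(f"T2/T3 accuracy: {correct}/{evaluable} = {correct / max(evaluable, 1):.2%}")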

Output Analysis

The evaluation generates detailed execution traces showing the following (a quick way to inspect a single trace is sketched after this list):

  • Research process for each financial task
  • Information gathering from multiple sources
  • Financial calculations and analysis
  • Comprehensive reports with insights and recommendations
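
To take a quick look at one of these traces, you can load a task_*_attempt_1.json file and list its top-level keys. This is a minimal sketch that avoids assuming any specific field names, since the trace structure is not documented here; the run directory path is a placeholder.

Peek at a Task Trace (Python)
import glob
import json

# Placeholder run directory; substitute your actual output_dir
trace_files = sorted(glob.glob("logs/finsearchcomp/<run_dir>/task_*_attempt_1.json"))
if not trace_files:
    raise SystemExit("No task traces found; check the run directory path")

with open(trace_files[0], "r", encoding="utf-8") as f:
    trace = json.load(f)

# Show which sections the trace contains before digging into any of them
print(trace_files[0])
print(list(trace.keys()))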

Directory Structure

After running evaluations, you'll find the following structure:

logs/finsearchcomp/agent_finsearchcomp_claude37sonnet_YYYYMMDD_HHMM/
├── benchmark_results.jsonl                            # Task results summary
├── benchmark_results_pass_at_1_accuracy.txt           # Accuracy statistics
├── task_(T1)Time_Sensitive_Data_Fetching_*.json       # T1 task traces
├── task_(T2)Financial_Analysis_*.json                 # T2 task traces
├── task_(T3)Complex_Historical_Investigation_*.json   # T3 task traces
└── output.log                                         # Execution log

Task Categories Breakdown

The progress checker provides detailed statistics:

  • Total Tasks: Complete count across all categories
  • Completed Tasks: Successfully finished tasks
  • Correct Tasks: Tasks with judge_result "CORRECT" (T2 and T3 only)
  • Category Breakdown: Separate counts for T1, T2, and T3 tasks
  • Accuracy Metrics: Pass@1 accuracy for evaluable tasks

Usage Examples

Single Run Evaluation

Basic Evaluation
uv run main.py common-benchmark --config_file_name=agent_finsearchcomp_claude37sonnet benchmark=finsearchcomp output_dir="logs/finsearchcomp/$(date +"%Y%m%d_%H%M")"

Limited Task Testing

Test with Limited Tasks
uv run main.py common-benchmark --config_file_name=agent_finsearchcomp_claude37sonnet benchmark=finsearchcomp benchmark.execution.max_tasks=5 output_dir="logs/finsearchcomp/$(date +"%Y%m%d_%H%M")"

Custom Agent Configuration

Different Agent Setup
uv run main.py common-benchmark --config_file_name=agent_gaia-validation benchmark=finsearchcomp output_dir="logs/finsearchcomp/$(date +"%Y%m%d_%H%M")"

Multiple Runs for Reliability

Multiple Runs
NUM_RUNS=5 ./scripts/run_evaluate_multiple_runs_finsearchcomp.sh
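
If you prefer not to use the script, the following Python sketch illustrates the same idea: repeat the single-run command several times, each with its own output directory. The actual script may handle naming and result aggregation differently, so treat this only as an illustration.

Multiple Runs (Python sketch)
import subprocess
from datetime import datetime

NUM_RUNS = 5  # mirrors the NUM_RUNS environment variable used by the script

for i in range(NUM_RUNS):
    # Per-run suffix is an illustrative choice to keep runs in separate directories
    output_dir = f"logs/finsearchcomp/{datetime.now():%Y%m%d_%H%M}_run{i + 1}"
    subprocess.run(
        [
            "uv", "run", "main.py", "common-benchmark",
            "--config_file_name=agent_finsearchcomp_claude37sonnet",
            "benchmark=finsearchcomp",
            f"output_dir={output_dir}",
        ],
        check=True,
    )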

Documentation Info

Last Updated: September 2025 · Doc Contributor: Team @ MiroMind AI