xbench-DeepSearch
The xbench benchmark is an evaluation framework designed to measure both the intelligence frontier and real-world utility of AI agents. It consists of complementary tracks that test core model capabilities like reasoning, tool use, memory, and workflows grounded in business and professional settings. Its DeepSearch sub-track measures agentsβ ability to conduct open-domain information retrieval, combining fact finding, comparison, and synthesis through multi-step search and tool use.
See more details at xbench official website and xbench-DeepSearch Eval Card.
Setup and Evaluation Guide
Step 1: Download the xbench-DeepSearch Dataset
Direct Download (Recommended)
Dataset Setup
Use the integrated prepare-benchmark command to download and process the dataset:
By default, this will create the standardized dataset at data/xbench-ds/standardized_data.jsonl.
Step 2: Configure API Keys
Required API Configuration
Set up the required API keys for model access and tool functionality. Update the .env file to include the following keys:
# Search and web scraping capabilities
SERPER_API_KEY="your-serper-api-key"
JINA_API_KEY="your-jina-api-key"
# Code execution environment
E2B_API_KEY="your-e2b-api-key"
# Primary LLM provider (Claude-3.7-Sonnet via OpenRouter)
OPENROUTER_API_KEY="your-openrouter-api-key"
OPENROUTER_BASE_URL="https://openrouter.ai/api/v1"
# Vision understanding capabilities
ANTHROPIC_API_KEY="your-anthropic-api-key"
GEMINI_API_KEY="your-gemini-api-key"
# LLM as judge, reasoning, and hint generation
OPENAI_API_KEY="your-openai-api-key"
OPENAI_BASE_URL="https://api.openai.com/v1"
Step 3: Run the Evaluation
uv run main.py common-benchmark \
  --config_file_name=agent_xbench-ds_claude37sonnet \
  output_dir="logs/xbench-ds/$(date +"%Y%m%d_%H%M")"
Step 4: Monitor Progress and Resume
Progress Tracking
You can monitor the evaluation progress in real-time:
Replace $PATH_TO_LOG with your actual output directory path.
Resume Capability
If the evaluation is interrupted, you can resume from where it left off by specifying the same output directory:
uv run main.py common-benchmark \
  --config_file_name=agent_xbench-ds_claude37sonnet \
  output_dir="logs/xbench-ds/20250922_1430"
Post-Processing for Enhanced Performance
Test-Time Scaling for Improved Reliability
Test-time scaling can significantly improve the reliability of model responses. Instead of simple majority voting, we employ a comprehensive parallel thinking approach that:
- Aggregates final summary steps from each agent run before outputting results
- Uses another agent (o3 by default) to make final decisions based on equivalence and source reliability criteria
- Provides more robust and accurate final answers
Execute the following command to run multiple xbench-DeepSearch evaluations and automatically enable parallel thinking for enhanced performance.
bash scripts/run_evaluate_mulitple_runs_xbench-ds.sh
Running Parallel Thinking Analysis alone
After completing evaluations (single or multiple runs), you can apply parallel thinking post-processing to aggregate and generate the final result.
uv run utils/util_llm_parallel_thinking.py \
  --benchmark xbench-ds \
  --results_dir "logs/xbench-ds/20250922_1430"
The program automatically reads results from each run in the specified directory and performs aggregated analysis. The final output files are generated in the results_dir:
- llm_parallel_thinking_Nruns.json- Detailed analysis results
- llm_parallel_thinking_accuracy_Nruns.txt- Final accuracy
Where N represents the total number of experimental runs (minimum of 1).
Documentation Info
Last Updated: September 2025 Β· Doc Contributor: Team @ MiroMind AI