Futurex-Online
MiroFlow's evaluation on the Futurex-Online benchmark demonstrates its capabilities on future event prediction tasks.
Dataset Overview
Futurex-Online Dataset
The Futurex-Online dataset consists of 61 prediction tasks covering various future events including:
- Political events (referendums, elections)
- Sports outcomes (football matches)
- Legal proceedings
- Economic indicators
Key Dataset Characteristics
- Total Tasks: 61
- Task Type: Future event prediction
- Answer Format: Boxed answers (\boxed{Yes/No} or \boxed{A/B/C})
- Ground Truth: Not available (prediction tasks)
- Resolution Date: Around 2025-09-21 (GMT+8)
Quick Start Guide
Quick Start Instructions
This section provides step-by-step instructions to run the Futurex-Online benchmark and prepare submission results. Since this is a prediction dataset without ground truth, we focus on execution traces and response generation. Note: This is a quick start guide for running the benchmark, not for reproducing exact submitted results.
Step 1: Prepare the Futurex-Online Dataset
Dataset Setup
Use the integrated prepare-benchmark command to download and process the dataset:
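The exact arguments can vary between versions; a plausible invocation (the benchmark-name argument below is an assumption, so verify against the CLI help in your checkout) is:
# Download and standardize the Futurex-Online dataset (argument form is assumed; check your CLI help)
uv run main.py prepare-benchmark benchmark=futurex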
This will create the standardized dataset at data/futurex/standardized_data.jsonl.
Step 2: Configure API Keys
API Key Configuration
Set up the required API keys for model access and tool functionality. Update the .env file to include the following keys:
# For searching and web scraping
SERPER_API_KEY="xxx"
JINA_API_KEY="xxx"
# For Linux sandbox (code execution environment)
E2B_API_KEY="xxx"
# We use Claude-3.7-Sonnet via the OpenRouter backend to initialize the LLM
OPENROUTER_API_KEY="xxx"
OPENROUTER_BASE_URL="https://openrouter.ai/api/v1"
# Used for Claude vision understanding
ANTHROPIC_API_KEY="xxx"
# Used for Gemini vision
GEMINI_API_KEY="xxx"
# Used for the LLM judge, reasoning, o3 hints, etc.
OPENAI_API_KEY="xxx"
OPENAI_BASE_URL="https://api.openai.com/v1"
Step 3: Run the Evaluation
Evaluation Execution
Execute the following command to run the evaluation on the Futurex-Online dataset. This uses the basic agent_quickstart_1 configuration for quick start purposes.
uv run main.py common-benchmark --config_file_name=agent_quickstart_1 benchmark=futurex output_dir="logs/futurex/$(date +"%Y%m%d_%H%M")"
Progress Monitoring and Resume
To check progress while the evaluation is running, use the progress-check utility (the same one used in the multiple-runs workflow later in this guide):
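# Replace the placeholder with your actual output directory
uv run python utils/progress_check/check_futurex_progress.py logs/futurex/<your_output_dir>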
If you need to resume an interrupted evaluation, specify the same output directory to continue from where you left off.
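For example, assuming the earlier run wrote to a directory such as logs/futurex/20250921_1030 (a hypothetical timestamp), rerun the same command with that path:
# Re-running with the same output_dir continues from where the previous run left off
uv run main.py common-benchmark --config_file_name=agent_quickstart_1 benchmark=futurex output_dir="logs/futurex/20250921_1030"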
Step 4: Extract Results
Result Extraction
After evaluation completion, extract the results using the provided utility:
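A typical single-run invocation, based on the flag usage shown later in this guide (the output-directory placeholder is illustrative), looks like:
# Extract predictions from a single run directory
uv run python utils/extract_futurex_results.py logs/futurex/<your_output_dir> --single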
This will generate:
- futurex_results.json: Detailed results for each task
- futurex_summary.json: Summary statistics
- futurex_predictions.csv: Predictions in CSV format
Sample Task Examples
Political Prediction
Task: "Will the 2025 Guinea referendum pass? (resolved around 2025-09-21 (GMT+8))"
Expected Format: \boxed{Yes} or \boxed{No}
Sports Prediction
Task: "Brighton vs. Tottenham (resolved around 2025-09-21 (GMT+8))
A. Brighton win on 2025-09-20
B. Brighton vs. Tottenham end in a draw
C. Tottenham win on 2025-09-20"
Expected Format: \boxed{A}, \boxed{B}, or \boxed{C}
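To illustrate how answers in this format can be parsed, here is a minimal Python sketch; it is illustrative only and not the project's actual parsing code:
import re

def extract_boxed_answer(response: str) -> str | None:
    """Return the content of the last boxed answer in a model response, if any."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    return matches[-1].strip() if matches else None

# Works for both Yes/No and A/B/C style answers
print(extract_boxed_answer(r"Given recent form, the most likely outcome is \boxed{B}"))  # -> B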
Multiple Runs and Voting
Improving Prediction Accuracy
For better prediction accuracy, you can run multiple evaluations and use voting mechanisms to aggregate results. This approach helps reduce randomness and improve the reliability of predictions. Note: This is a quick start approach; production submissions may use more sophisticated configurations.
Step 1: Run Multiple Evaluations
Use the multiple runs script to execute several independent evaluations:
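# Default invocation (3 independent runs; see the customization options below)
./scripts/run_evaluate_multiple_runs_futurex.sh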
This script will:
- Run 3 independent evaluations by default (configurable with NUM_RUNS)
- Execute all tasks in parallel for efficiency
- Generate separate result files for each run in run_1/, run_2/, etc.
- Create a consolidated futurex_submission.jsonl file with voting results
Step 2: Customize Multiple Runs
You can customize the evaluation parameters:
# Run 5 evaluations with limited tasks for testing
NUM_RUNS=5 MAX_TASKS=10 ./scripts/run_evaluate_multiple_runs_futurex.sh
# Use different agent configuration
AGENT_SET=agent_gaia-validation ./scripts/run_evaluate_multiple_runs_futurex.sh
# Adjust concurrency for resource management
MAX_CONCURRENT=3 ./scripts/run_evaluate_multiple_runs_futurex.sh
Step 3: Voting and Aggregation
After multiple runs, the system automatically:
- Extracts predictions from all runs using utils/extract_futurex_results.py
- Applies majority voting to aggregate predictions across runs
- Generates a submission file in the format required by the FutureX platform
- Provides voting statistics showing the prediction distribution across runs
The voting process works as follows:
- Majority Vote: Most common prediction across all runs wins
- Tie-breaking: If tied, chooses the prediction that appeared earliest across all runs
- Vote Counts: Tracks how many runs predicted each option
- Confidence Indicators: High agreement indicates more reliable predictions
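For illustration, a minimal Python sketch of this majority-vote logic with earliest-appearance tie-breaking (not the actual implementation in utils/extract_futurex_results.py) could look like:
from collections import Counter

def majority_vote(predictions: list[str]) -> tuple[str, dict[str, int]]:
    """Aggregate per-run predictions: the most common option wins, ties go to the earliest appearance."""
    counts = Counter(predictions)  # e.g. {"No": 2, "Yes": 1}
    best = max(counts.values())
    # Among the top-voted options, keep the one that appeared first in run order (run_1, run_2, ...)
    winner = next(p for p in predictions if counts[p] == best)
    return winner, dict(counts)

print(majority_vote(["No", "Yes", "No"]))  # ('No', {'No': 2, 'Yes': 1})
print(majority_vote(["Yes", "No"]))        # tie -> ('Yes', {'Yes': 1, 'No': 1})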
Step 4: Analyze Voting Results
Check the generated files for voting analysis:
# View submission file with voting results
cat logs/futurex/agent_quickstart_1_*/futurex_submission.jsonl
# Check individual run results
ls logs/futurex/agent_quickstart_1_*/run_*/
# Check progress and voting statistics
uv run python utils/progress_check/check_futurex_progress.py logs/futurex/agent_quickstart_1_*
Manual Voting Aggregation
You can also manually run the voting aggregation:
# Aggregate multiple runs with majority voting
uv run python utils/extract_futurex_results.py logs/futurex/agent_quickstart_1_* --aggregate
# Force single run mode (if needed)
uv run python utils/extract_futurex_results.py logs/futurex/agent_quickstart_1_*/run_1 --single
# Specify custom output file
uv run python utils/extract_futurex_results.py logs/futurex/agent_quickstart_1_* -o my_voted_predictions.jsonl
Voting Output Format
The voting aggregation generates a submission file with the following format:
{"id": "687104310a994c0060ef87a9", "prediction": "No", "vote_counts": {"No": 2}}
{"id": "68a9b46e961bd3003c8f006b", "prediction": "Yes", "vote_counts": {"Yes": 2}}
The output includes:
- id: Task identifier
- prediction: Final voted prediction (without the \boxed{} wrapper)
- vote_counts: Dictionary showing how many runs predicted each option
For example, "vote_counts": {"No": 2} means 2 out of 2 runs predicted "No", indicating high confidence.
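To inspect agreement programmatically, a small Python sketch that reads the submission file and flags tasks where the runs disagreed (the path below is a hypothetical example) might look like:
import json

# Hypothetical path; point this at your actual submission file
with open("logs/futurex/agent_quickstart_1_20250921_1030/futurex_submission.jsonl") as f:
    for line in f:
        record = json.loads(line)
        votes = record["vote_counts"]
        agreement = votes[record["prediction"]] / sum(votes.values())
        if agreement < 1.0:  # runs disagreed on this task
            print(f"{record['id']}: {record['prediction']} ({agreement:.0%} agreement)")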
Evaluation Notes
No Ground Truth Available
Since Futurex-Online is a prediction dataset, there are no ground truth answers available for evaluation. The focus is on:
- Response generation quality
- Reasoning process documentation
- Prediction confidence and methodology
Output Analysis
The evaluation generates detailed execution traces showing:
- Research process for each prediction
- Information gathering from web sources
- Reasoning chains leading to predictions
- Final boxed answers in required format
Directory Structure
After running multiple evaluations, you'll find the following structure:
logs/futurex/agent_quickstart_1_YYYYMMDD_HHMM/
├── futurex_submission.jsonl              # Final voted predictions
├── run_1/                                # First run results
│   ├── benchmark_results.jsonl           # Individual task results
│   ├── benchmark_results_pass_at_1_accuracy.txt
│   └── task_*_attempt_1.json             # Detailed execution traces
├── run_2/                                # Second run results
│   └── ... (same structure as run_1)
├── run_1_output.log                      # Run 1 execution log
└── run_2_output.log                      # Run 2 execution log
Documentation Info
Last Updated: September 2025 · Doc Contributor: Team @ MiroMind AI