Data Agent Benchmark

Current AI tools often fail at enterprise data analysis. We're building an open benchmark to show where and why.

This project is supported and maintained by UCB EPIC Data Lab and PromptQL.

The Problem

Current AI tools consistently fail on production data analysis tasks:

Business analysts struggle with AI-generated SQL that misunderstands business logic
Data teams fight with agents that can't handle multi-database queries
Engineers build data agents that work on demos but fail on real datasets
Organizations waste resources on solutions that don't scale

Why We Created This Repository

We need your help to build a benchmark that reflects real enterprise data challenges, not toy problems. By submitting the problems you actually face and the techniques you've tried, you're helping create the first crowdsourced benchmark for AI data analysis. Everything will be open source so anyone can run evaluations and see what actually works.

What to contribute

You can submit two kinds of issues:

Problem — A real enterprise analysis task where AI failed
Technique — An approach you believe works better (semantic layers, RAG, agents, tool calling, etc.)

🔒 Privacy & scope
Do not share production data, credentials, or PII/PHI. We need only enough business context and data shape (schemas, scale, relationships) for UCB to synthesize a representative case internally. See SANITIZATION.md.

What “enough detail” means (for UCB to reproduce)

When you submit a Problem, please give:

Business question — the exact ask (e.g., “Calculate churn from Salesforce + product usage for Q3 FY2024”)
Time window & calendars — e.g., “Q3 FY2024 (FY starts July), timezone UTC; late data up to T+3 days”
Data sources & backends — named systems/tables or objects (no data), scale (“~10M rows”), freshness
Entities & identifiers — account_id, user_id, email domain, etc., plus which system is source of truth
Join logic & rules — how tables/systems should be stitched; known exclusions (test users, cancelled orders)
Expected output shape — columns & types, grain, and an example row format (values optional)
Failure mode — what the AI did wrong (e.g., wrong fiscal calendar; joined on company name; ignored SCD)
Tools tried — which LLMs/agents/semantic layers, and any relevant settings
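To illustrate why the "Time window & calendars" field matters, here is a minimal sketch that resolves the example window "Q3 FY2024 (FY starts July)" into concrete dates. It rests on two assumptions that are not stated in this README and that a real submission should spell out: a fiscal year is named after the calendar year in which it ends, and "late data up to T+3 days" means records may arrive until three days after the window closes. The function names are hypothetical.

```python
from datetime import date, datetime, time, timedelta, timezone


def fiscal_quarter_window(fiscal_year: int, quarter: int, fy_start_month: int = 7):
    """Return (start, end_exclusive) calendar dates for a fiscal quarter.

    Assumption: the fiscal year is named after the calendar year in which it
    ends, so FY2024 with a July start runs 2023-07-01 through 2024-06-30.
    """
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1..4")
    if not 1 <= fy_start_month <= 12:
        raise ValueError("fy_start_month must be 1..12")

    # Calendar year in which the fiscal year begins.
    start_year = fiscal_year if fy_start_month == 1 else fiscal_year - 1

    # Zero-based month offsets counted from January of start_year.
    first = (fy_start_month - 1) + 3 * (quarter - 1)
    last = first + 3
    start = date(start_year + first // 12, first % 12 + 1, 1)
    end_exclusive = date(start_year + last // 12, last % 12 + 1, 1)
    return start, end_exclusive


def late_data_cutoff(end_exclusive: date, lag_days: int = 3) -> datetime:
    """UTC timestamp after which late-arriving rows are ignored (assumed meaning of T+3)."""
    midnight_utc = datetime.combine(end_exclusive, time.min, tzinfo=timezone.utc)
    return midnight_utc + timedelta(days=lag_days)


start, end = fiscal_quarter_window(2024, 3)
print(start, end, late_data_cutoff(end).isoformat())
```

Under the opposite naming convention (fiscal year named after the year in which it starts), the same request would land a full year later, which is exactly the kind of "wrong fiscal calendar" failure that a submission should make reproducible.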

For Techniques, include: where it applies, input/output contract, requirements (e.g., semantic layer), and any observed results or known limits.

How to contribute

👉 Open an issue: Create an issue
- Choose Problem or Technique
- Fill out the form (no uploads of proprietary data, please)
- Browse the categories we track in CATEGORIES.md
- See detailed example submissions in EXAMPLES.md

What happens next (triage → recreate → evaluate)

We label and triage incoming issues:

type: problem / type: technique
category: … (from our taxonomy)
status: needs info → accepted → in synthesis → evaluated → published
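The status progression can be read as a simple linear pipeline. The sketch below models it that way, purely as an illustration: it assumes each issue moves forward exactly one step at a time, which this README does not state, and the helper names are hypothetical. docs/TRIAGE.md remains the authoritative definition of the labels.

```python
# Status values in the order listed in this README.
STATUSES = ("needs info", "accepted", "in synthesis", "evaluated", "published")


def status_label(status: str) -> str:
    """Format a status as the issue label text, e.g. 'status: accepted'."""
    if status not in STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    return f"status: {status}"


def next_status(current: str):
    """Return the following status, or None once an issue is published."""
    i = STATUSES.index(current)  # raises ValueError for unknown statuses
    return STATUSES[i + 1] if i + 1 < len(STATUSES) else None


def is_valid_transition(old: str, new: str) -> bool:
    """Assumed rule: only single forward steps are allowed."""
    return new in STATUSES and next_status(old) == new


print(status_label("needs info"), "->", status_label(next_status("needs info")))
```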

Once a submission has essence-level detail (enough to reproduce the problem without real data), the UCB team will:

Recreate the case internally (private benchmark repo)
Run techniques over the recreated case
Share results and lessons learned publicly

See docs/TRIAGE.md for label definitions and status meanings.

Example Challenge Categories

We've drafted some challenge areas as a starting point, but the list isn't exhaustive and we need your input:

Single-Source Structured Analytics - SQL generation, joins, aggregations
Cross-Source Federation - Combining heterogeneous databases
Unstructured Data Integration - Extracting structure from text/documents
Production Control Flow - Multi-step workflows, error handling
Business Term Disambiguation - Handling ambiguous definitions
Temporal & Currency Metrics - Time-series, FX conversion, SCD
Entity Resolution - De-duplication across systems
Time-Series & Cohorts - Funnels, retention, attribution
External API Integration - Rate limits, pagination, schema drift
Governance & Compliance - Row-level security, PII masking
Advanced Analytics - Statistical tests, graph algorithms
Implicit Relationship Discovery - Fuzzy matching, key derivation

See CATEGORIES.md for detailed descriptions and EXAMPLES.md for real failure cases.
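As a small illustration of the Entity Resolution and Implicit Relationship Discovery categories, the sketch below shows why joining two systems on raw company name is fragile. The records, the suffix list, and the function name are all made up for this example; real cases would normally rely on a stable identifier such as account_id from the system of record.

```python
import re

# Hypothetical legal suffixes to ignore; a real list would be much longer.
LEGAL_SUFFIXES = {"inc", "llc", "ltd", "corp", "co", "gmbh"}


def normalize_company(name: str) -> str:
    """Derive a crude join key: lowercase, strip punctuation, drop trailing legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]+", " ", name.lower()).split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


# Made-up rows from two systems that describe the same customer.
crm_names = ["Acme, Inc.", "Globex Corp"]
billing_names = ["ACME", "Globex Holdings"]

exact_matches = set(crm_names) & set(billing_names)
key_matches = {normalize_company(n) for n in crm_names} & {
    normalize_company(n) for n in billing_names
}

print(exact_matches)  # the raw names share no exact match
print(key_matches)    # the derived key recovers "acme" but still misses Globex
```

Even the normalized key is only a heuristic: it misses "Globex Holdings" and could wrongly merge unrelated companies that share a name, which is why submissions should state the intended join logic explicitly.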

Sample Dataset

Are you building a data analysis agent and curious about the kinds of queries and data it will need to handle? An example dataset and query set is available in src/query_yelp/. It includes multi-source data, nested JSON, missing values, and entity resolution challenges that mirror real enterprise problems.

Current Status

We're actively collecting real problems from practitioners, testing initial techniques across the benchmark suite, and building an automated evaluation framework. Watch this repository for regular updates on technique performance and new insights.

Ready to contribute? Submit a Problem or Submit a Technique.

ucbepic/data-agent-benchmark-study
