Ethan0456 commented on Oct 24, 2024

This PR integrates the DiscoveryBench benchmark into OpenHands, enabling evaluation of the agent's capability for multi-step, data-driven discovery tasks across domains such as sociology, biology, and engineering.
https://github.com/allenai/discoverybench/
https://x.com/mbodhisattwa/status/1811524569410531333

  • Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below:

With this integration, users can benchmark the performance of OpenHands agents on real-world and synthetic discovery tasks, measuring their success in generating hypotheses, analyzing data, and reasoning through complex workflows.


Give a summary of what the PR does, explaining any non-trivial design decisions

  • This PR integrates DiscoveryBench into OpenHands by adding a structured flow that allows the OpenHands agent to interact with DiscoveryBench tasks.

  • Non-trivial design decisions:

    • Cloning the DiscoveryBench repository: Instead of pulling the dataset from Hugging Face, we clone the repo to ensure that we always have the latest tasks and updates from the upstream repository.
    • process_instance function: This function encapsulates the logic to execute each instance, parse the agent's hypothesis, and evaluate it against the gold hypothesis (a minimal sketch of this flow follows the list).
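
The snippet below is a minimal, self-contained sketch of the per-instance flow described above, not the PR's actual code: the name process_instance comes from this PR, but the instance fields ("query", "gold_hypothesis", "instance_id") and the helper bodies are illustrative placeholders (the real evaluation uses DiscoveryBench's hypothesis scorer rather than string matching, and the agent runs inside a Docker sandbox).

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    instance_id: str
    agent_hypothesis: str
    gold_hypothesis: str
    correct: bool


def run_agent(query: str) -> str:
    # Placeholder: in the real harness the OpenHands agent runs inside a
    # Docker sandbox and returns its final message.
    return f"Hypothesis: answer to '{query}'"


def parse_hypothesis(raw_output: str) -> str:
    # Placeholder: extract the hypothesis text from the agent's final message.
    prefix = "Hypothesis:"
    if prefix in raw_output:
        return raw_output.split(prefix, 1)[-1].strip()
    return raw_output.strip()


def evaluate_hypothesis(agent_hypothesis: str, gold_hypothesis: str) -> bool:
    # Placeholder: the real evaluation compares against DiscoveryBench's
    # gold hypothesis with its own scorer, not exact string matching.
    return agent_hypothesis.strip().lower() == gold_hypothesis.strip().lower()


def process_instance(instance: dict) -> EvalResult:
    # Run the agent, parse its hypothesis, and score it against the gold one.
    raw_output = run_agent(instance["query"])
    agent_hypothesis = parse_hypothesis(raw_output)
    correct = evaluate_hypothesis(agent_hypothesis, instance["gold_hypothesis"])
    return EvalResult(
        instance_id=str(instance["instance_id"]),
        agent_hypothesis=agent_hypothesis,
        gold_hypothesis=instance["gold_hypothesis"],
        correct=correct,
    )
```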

How we structured everything in run_infer.py

  • run_infer.py is the entry point for running the evaluation. Here's how the process is structured (a condensed sketch follows the list):
    • DiscoveryBench setup: First, the script clones the DiscoveryBench repository and loads its dataset into a pandas DataFrame for easy processing of the instances.
    • Agent environment: For each task, a Docker container is spun up with all the necessary libraries, ensuring that each task runs in a clean environment.
    • Agent inference: The OpenHands agent is invoked to process the task within this environment, producing a hypothesis.
    • Result parsing: After receiving the agent’s hypothesis, we parse it and compare it against the “gold” hypothesis provided by DiscoveryBench.
    • Logging and output: The result for each task is logged into the test_result dictionary, which is ultimately written to an output.jsonl file for analysis and review.
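
The sketch below condenses the flow in the list above into a runnable skeleton, under stated assumptions: the repository URL is real, but the clone location, column names, output path, and the trivial process_instance stub are illustrative and stand in for the PR's actual loading, sandboxing, and evaluation code.

```python
import json
import subprocess
from pathlib import Path

import pandas as pd

REPO_URL = "https://github.com/allenai/discoverybench.git"
CLONE_DIR = Path("discoverybench")   # assumed local checkout location
OUTPUT_FILE = Path("output.jsonl")   # one JSON object per evaluated instance


def clone_discoverybench() -> None:
    # Clone the upstream repo so the latest tasks and gold hypotheses are used.
    if not CLONE_DIR.exists():
        subprocess.run(["git", "clone", REPO_URL, str(CLONE_DIR)], check=True)


def load_instances() -> pd.DataFrame:
    # Placeholder loader: the real script walks the repo's task metadata and
    # builds one DataFrame row per instance.
    records = [
        {"instance_id": "demo-0", "query": "example query", "gold_hypothesis": "example gold"},
    ]
    return pd.DataFrame.from_records(records)


def process_instance(instance: dict) -> dict:
    # Placeholder: see the per-instance sketch earlier in this description
    # for the run/parse/evaluate steps.
    return {"instance_id": instance["instance_id"], "agent_hypothesis": "", "correct": False}


def main() -> None:
    clone_discoverybench()
    instances = load_instances()
    with OUTPUT_FILE.open("w") as f:
        for _, instance in instances.iterrows():
            # In the real harness each instance runs in its own Docker sandbox.
            result = process_instance(instance.to_dict())
            f.write(json.dumps(result) + "\n")


if __name__ == "__main__":
    main()
```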
