
@Ethan0456 commented on Oct 30, 2024

End-user friendly description of the problem this fixes or functionality that this introduces

This PR integrates the DiscoveryBench benchmark into OpenHands, enabling evaluation of an agent's ability to perform multi-step, data-driven discovery tasks across domains such as sociology, biology, and engineering.

  • Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below:

With this integration, users can benchmark the performance of OpenHands agents on real-world and synthetic discovery tasks, measuring their success at generating hypotheses, analyzing data, and reasoning through complex workflows. Here are the results for the DiscoveryBench test split with gpt-4o and CodeActAgent:

| Metric | Value |
| --- | --- |
| Average Recall Context | 0.267 |
| Average Mean Accuracy Score | 0.112 |
| Average Final Score | 0.103 |
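
For reference, these averages can be recomputed from the per-instance records in output.jsonl. The sketch below is only illustrative: the keys assumed under test_result (recall_context, mean_accuracy_score, final_score) are placeholders, not necessarily the exact names the harness writes.

```python
# Illustrative aggregation over output.jsonl; the test_result field names
# (recall_context, mean_accuracy_score, final_score) are assumed, not exact.
import json


def average_scores(path: str = "output.jsonl") -> dict:
    with open(path) as f:
        records = [json.loads(line) for line in f if line.strip()]

    def avg(key: str) -> float:
        values = [r["test_result"][key] for r in records if key in r.get("test_result", {})]
        return sum(values) / len(values) if values else 0.0

    return {
        "avg_recall_context": avg("recall_context"),
        "avg_mean_accuracy_score": avg("mean_accuracy_score"),
        "avg_final_score": avg("final_score"),
    }


if __name__ == "__main__":
    print(average_scores())
```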

Give a summary of what the PR does, explaining any non-trivial design decisions

  • This PR integrates DiscoveryBench into OpenHands by adding a structured evaluation flow that allows the OpenHands agent to interact with DiscoveryBench tasks.

  • Non-trivial design decisions:

    • Cloning the DiscoveryBench repository: Instead of using the Hugging Face dataset, we clone the repository so we always pick up the latest version and updates from upstream.
    • process_instance function: This function encapsulates the logic to execute each instance, parse the agent's hypothesis, and evaluate it against the gold hypothesis.
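
A minimal sketch of this per-instance flow follows. It is an illustration under assumptions, not the PR's actual implementation: run_agent and evaluate_hypothesis are placeholders for the OpenHands controller call and DiscoveryBench's own evaluator, and the "Hypothesis:" marker used for parsing is assumed.

```python
# Sketch only: run_agent and evaluate_hypothesis are placeholders for the
# OpenHands controller call and DiscoveryBench's evaluator, respectively.
from typing import Callable


def parse_hypothesis(agent_output: str) -> str:
    """Extract the final hypothesis, assuming the agent emits a 'Hypothesis:' marker."""
    marker = "Hypothesis:"
    if marker in agent_output:
        return agent_output.split(marker, 1)[1].strip()
    return agent_output.strip()


def process_instance(
    instance: dict,
    run_agent: Callable[[dict], str],
    evaluate_hypothesis: Callable[[str, str], dict],
) -> dict:
    raw_output = run_agent(instance)                # agent works inside its sandboxed environment
    hypothesis = parse_hypothesis(raw_output)       # parse the agent's final answer
    scores = evaluate_hypothesis(hypothesis, instance["gold_hypothesis"])  # compare with gold
    return {"instance_id": instance["instance_id"], "test_result": scores}
```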

How we structured everything in run_infer.py

  • run_infer.py is the entry point for running the evaluation. Here's how the process is structured (a condensed sketch follows the list):
    • DiscoveryBench setup: First, the script clones the DiscoveryBench repository and loads its dataset into a pandas DataFrame for easy processing of the instances.
    • Agent environment: For each task, a Docker container is spun up with all the necessary libraries, ensuring that each task runs in a clean environment.
    • Agent configuration: Function calling is disabled, while the Jupyter and browsing delegate options are enabled for CodeActAgent.
    • Agent inference: The OpenHands agent is invoked to process the task within this environment, producing a hypothesis.
    • Result parsing: After receiving the agent’s hypothesis, we parse it and compare it against the “gold” hypothesis provided by DiscoveryBench.
    • Logging and output: The result for each task is logged into the test_result dictionary, which is ultimately written to an output.jsonl file for analysis and review.
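
Putting those steps together, here is a condensed sketch of the overall flow. The repository URL, metadata file path, and per-instance handling are assumptions for illustration, not the exact values used in run_infer.py.

```python
# Condensed sketch of the run_infer.py flow; repo URL, metadata path, and
# per-instance handling are illustrative assumptions.
import json
import subprocess

import pandas as pd

REPO_URL = "https://github.com/allenai/discoverybench.git"  # assumed upstream repo


def load_instances(clone_dir: str = "discoverybench") -> pd.DataFrame:
    subprocess.run(["git", "clone", "--depth", "1", REPO_URL, clone_dir], check=True)
    # Assume a metadata CSV that lists one row per task instance.
    return pd.read_csv(f"{clone_dir}/eval/metadata.csv")


def run_evaluation(df: pd.DataFrame, process_instance, out_path: str = "output.jsonl") -> None:
    with open(out_path, "w") as out:
        for _, row in df.iterrows():
            result = process_instance(row.to_dict())   # sandboxed run + parsing + scoring
            out.write(json.dumps(result) + "\n")       # one JSON record per task
```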

Link of any specific issues this addresses

Link of Older PR this addresses
