
@Ethan0456 commented on Oct 30, 2024

End-user friendly description of the problem this fixes or functionality that this introduces

This PR integrates the DiscoveryBench benchmark into OpenHands, enabling evaluation of an agent's ability to perform multi-step, data-driven discovery tasks across domains such as sociology, biology, and engineering.

  • Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below:

With this integration, users can benchmark the performance of OpenHands agents on real-world and synthetic discovery tasks, measuring their success at generating hypotheses, analyzing data, and reasoning through complex workflows. Here are the results for the DiscoveryBench test split with gpt-4o and CodeActAgent:

| Metric | Value |
| --- | --- |
| Average Recall Context | 0.267 |
| Average Mean Accuracy Score | 0.112 |
| Average Final Score | 0.103 |
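
For reference, these averages can be recomputed from the per-instance records in output.jsonl. The sketch below is only illustrative: the keys assumed under test_result (recall_context, mean_accuracy_score, final_score) are placeholders, not necessarily the exact names the harness writes.

```python
# Illustrative aggregation over output.jsonl; the test_result field names
# (recall_context, mean_accuracy_score, final_score) are assumed, not exact.
import json


def average_scores(path: str = "output.jsonl") -> dict:
    with open(path) as f:
        records = [json.loads(line) for line in f if line.strip()]

    def avg(key: str) -> float:
        values = [r["test_result"][key] for r in records if key in r.get("test_result", {})]
        return sum(values) / len(values) if values else 0.0

    return {
        "avg_recall_context": avg("recall_context"),
        "avg_mean_accuracy_score": avg("mean_accuracy_score"),
        "avg_final_score": avg("final_score"),
    }


if __name__ == "__main__":
    print(average_scores())
```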

Give a summary of what the PR does, explaining any non-trivial design decisions

  • This PR integrates DiscoveryBench into OpenHands by adding a structured evaluation flow that allows the OpenHands agent to interact with DiscoveryBench tasks.

  • Non-trivial design decisions:

    • Cloning the DiscoveryBench repository: Instead of using the Hugging Face dataset, we clone the repository so we always pick up the latest version and updates from upstream.
    • process_instance function: This function encapsulates the logic to execute each instance, parse the agent's hypothesis, and evaluate it against the gold hypothesis.
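
A minimal sketch of this per-instance flow follows. It is an illustration under assumptions, not the PR's actual implementation: run_agent and evaluate_hypothesis are placeholders for the OpenHands controller call and DiscoveryBench's own evaluator, and the "Hypothesis:" marker used for parsing is assumed.

```python
# Sketch only: run_agent and evaluate_hypothesis are placeholders for the
# OpenHands controller call and DiscoveryBench's evaluator, respectively.
from typing import Callable


def parse_hypothesis(agent_output: str) -> str:
    """Extract the final hypothesis, assuming the agent emits a 'Hypothesis:' marker."""
    marker = "Hypothesis:"
    if marker in agent_output:
        return agent_output.split(marker, 1)[1].strip()
    return agent_output.strip()


def process_instance(
    instance: dict,
    run_agent: Callable[[dict], str],
    evaluate_hypothesis: Callable[[str, str], dict],
) -> dict:
    raw_output = run_agent(instance)                # agent works inside its sandboxed environment
    hypothesis = parse_hypothesis(raw_output)       # parse the agent's final answer
    scores = evaluate_hypothesis(hypothesis, instance["gold_hypothesis"])  # compare with gold
    return {"instance_id": instance["instance_id"], "test_result": scores}
```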

How we structured everything in run_infer.py

  • run_infer.py is the entry point for running the evaluation. Here's how the process is structured (a condensed sketch follows the list):
    • DiscoveryBench setup: First, the script clones the DiscoveryBench repository and loads its dataset into a pandas DataFrame for easy processing of the instances.
    • Agent environment: For each task, a Docker container is spun up with all the necessary libraries, ensuring that each task runs in a clean environment.
    • Agent configuration: Function calling is disabled, while the Jupyter and browsing delegate options are enabled for CodeActAgent.
    • Agent inference: The OpenHands agent is invoked to process the task within this environment, producing a hypothesis.
    • Result parsing: After receiving the agent’s hypothesis, we parse it and compare it against the “gold” hypothesis provided by DiscoveryBench.
    • Logging and output: The result for each task is logged into the test_result dictionary, which is ultimately written to an output.jsonl file for analysis and review.
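
Putting those steps together, here is a condensed sketch of the overall flow. The repository URL, metadata file path, and per-instance handling are assumptions for illustration, not the exact values used in run_infer.py.

```python
# Condensed sketch of the run_infer.py flow; repo URL, metadata path, and
# per-instance handling are illustrative assumptions.
import json
import subprocess

import pandas as pd

REPO_URL = "https://github.com/allenai/discoverybench.git"  # assumed upstream repo


def load_instances(clone_dir: str = "discoverybench") -> pd.DataFrame:
    subprocess.run(["git", "clone", "--depth", "1", REPO_URL, clone_dir], check=True)
    # Assume a metadata CSV that lists one row per task instance.
    return pd.read_csv(f"{clone_dir}/eval/metadata.csv")


def run_evaluation(df: pd.DataFrame, process_instance, out_path: str = "output.jsonl") -> None:
    with open(out_path, "w") as out:
        for _, row in df.iterrows():
            result = process_instance(row.to_dict())   # sandboxed run + parsing + scoring
            out.write(json.dumps(result) + "\n")       # one JSON record per task
```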

Link of any specific issues this addresses

Link of Older PR this addresses
