[ROADMAP] DiscoveryBench Integration #2

@Ethan0456

🛰️ DiscoveryBench Integration

This issue tracks the integration of the DiscoveryBench benchmark into OpenHands. DiscoveryBench includes real-world and synthetic scientific discovery tasks that help assess agents' capabilities in multi-step, complex problem-solving. The goal is to measure how well OpenHands agents handle data-driven scientific discovery workflows.

📋 Tasks

1. Clone and set up DiscoveryBench repository

  • Clone the DiscoveryBench Git repository and install dependencies.

2. Create dataset for evaluation

  • Create a custom function that creates a dataset from the cloned repository.
  • Prepare the dataset for evaluation.

3. Generate evaluation metadata and process each instance

  • Create metadata using the make_metadata function, including dataset and task info.
  • Use the process_instance method to prepare evaluation queries for each dataset instance.

4. Set up runtime

  • Create the runtime environment for experimentation.
  • Initialize the runtime by copying the necessary data files into the container.
  • Start OpenHands with the instance query and the data inside the container.

5. Run the evaluation workflow

  • Extract the results generated by the OpenHands agents.
  • Analyze the results, comparing generated hypotheses to gold-standard outputs.

6. Compile final results into test result dictionary

  • Save all metrics and results into the test_result dictionary for final analysis.

7. Log and save evaluation outputs

  • Ensure all outputs are logged and stored for reporting.

8. Validate the integration

  • Perform end-to-end validation of DiscoveryBench within OpenHands to ensure correct functionality.
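
As a rough illustration of tasks 2, 5, and 6, the sketch below builds a dataset from a cloned repository, compares a generated hypothesis to the gold one, and collects the outcome in a test_result dictionary. Everything here is an assumption made for illustration: the metadata_*.json file layout, the queries / qid / question / gold_hypothesis keys, and the helper names build_dataset, token_overlap, and build_test_result are hypothetical, and the token-overlap score is only a placeholder, not DiscoveryBench's own scoring. It does not call make_metadata, process_instance, or any OpenHands API.

```python
import json
import tempfile
from pathlib import Path


def build_dataset(repo_root):
    """Collect one instance per query from metadata files under repo_root.

    Hypothetical layout: <repo_root>/**/metadata_*.json, each file holding
    {"queries": [{"qid": ..., "question": ..., "gold_hypothesis": ...}]}.
    """
    instances = []
    for meta_path in sorted(Path(repo_root).rglob("metadata_*.json")):
        with open(meta_path, encoding="utf-8") as f:
            meta = json.load(f)
        for query in meta.get("queries", []):
            instances.append({
                "instance_id": f"{meta_path.parent.name}_{meta_path.stem}_{query['qid']}",
                "data_dir": str(meta_path.parent),  # files to copy into the container
                "query": query["question"],
                "gold_hypothesis": query["gold_hypothesis"],
            })
    return instances


def token_overlap(generated, gold):
    """Jaccard overlap of lowercased word sets; a placeholder scorer only."""
    gen_tokens = set(generated.lower().split())
    gold_tokens = set(gold.lower().split())
    if not gen_tokens and not gold_tokens:
        return 1.0
    return len(gen_tokens & gold_tokens) / len(gen_tokens | gold_tokens)


def build_test_result(instance, generated_hypothesis):
    """Bundle the query, both hypotheses, and the score for one instance."""
    return {
        "instance_id": instance["instance_id"],
        "query": instance["query"],
        "gold_hypothesis": instance["gold_hypothesis"],
        "generated_hypothesis": generated_hypothesis,
        "overlap_score": token_overlap(
            generated_hypothesis, instance["gold_hypothesis"]
        ),
    }


if __name__ == "__main__":
    # Made-up example data written to a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        task_dir = Path(tmp) / "example_task"
        task_dir.mkdir()
        metadata = {"queries": [{
            "qid": 0,
            "question": "How does education relate to income?",
            "gold_hypothesis": "income rises with education",
        }]}
        (task_dir / "metadata_0.json").write_text(
            json.dumps(metadata), encoding="utf-8"
        )
        dataset = build_dataset(tmp)
        result = build_test_result(dataset[0], "income grows with education")
        print(json.dumps(result, indent=2))
```

Keeping the gold hypothesis and the generated one side by side in each test_result entry makes it easy to swap the placeholder scorer for the benchmark's real evaluation later without changing the surrounding flow.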

Labels: enhancement
