Status: Open
Labels: enhancement (New feature or request)
🛰️ DiscoveryBench Integration
This issue tracks the integration of the DiscoveryBench benchmark into OpenHands. DiscoveryBench includes real-world and synthetic scientific discovery tasks that will help assess the agents' capabilities in multi-step, complex problem-solving. The benchmark aims to provide comprehensive insights into how well OpenHands agents handle data-driven scientific discovery workflows.
📋 Tasks
1. Clone and set up DiscoveryBench repository
- Clone the DiscoveryBench Git repository and install dependencies.
2. Create dataset for evaluation
- Create a custom function that creates a dataset from the cloned repository.
- Prepare the dataset for evaluation.
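The dataset-creation step might look like the sketch below. The assumed repository layout (task directories containing `metadata_*.json` files that describe each query) and the `create_dataset` helper name are illustrative, not the actual DiscoveryBench structure:

```python
import json
from pathlib import Path

def create_dataset(repo_root: str) -> list:
    """Walk a cloned DiscoveryBench checkout and build evaluation instances.

    Assumes each task directory holds one or more metadata_*.json files
    (hypothetical layout; adjust to the real repository structure).
    """
    instances = []
    for meta_path in sorted(Path(repo_root).rglob("metadata_*.json")):
        with open(meta_path) as f:
            meta = json.load(f)
        instances.append({
            # Derive a stable id from the task folder and metadata file name.
            "instance_id": f"{meta_path.parent.name}/{meta_path.stem}",
            "task_dir": str(meta_path.parent),
            "metadata": meta,
        })
    return instances
```

Each instance keeps a pointer back to its task directory so the runtime step can locate the data files later.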
3. Generate evaluation metadata and process each instance
- Create metadata using the `make_metadata` function, including dataset and task info.
- Use the `process_instance` method to prepare evaluation queries for each dataset instance.
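A minimal sketch of how these two pieces could fit together. The dataclass fields, the prompt template, and the instance format are assumptions for illustration; only the `make_metadata` / `process_instance` names come from the task list above:

```python
from dataclasses import dataclass, field

@dataclass
class EvalMetadata:
    # Simplified stand-in for the harness's metadata object (illustrative fields).
    agent_class: str
    dataset: str
    max_iterations: int
    details: dict = field(default_factory=dict)

def make_metadata(agent_class: str, dataset: str,
                  max_iterations: int = 30, **details) -> EvalMetadata:
    """Bundle run configuration for the evaluation (sketch only)."""
    return EvalMetadata(agent_class, dataset, max_iterations, details)

def process_instance(instance: dict, metadata: EvalMetadata) -> dict:
    """Turn one DiscoveryBench instance into an evaluation query (hypothetical format)."""
    query = instance["metadata"].get("query", "")
    return {
        "instance_id": instance["instance_id"],
        "prompt": f"Analyze the provided datasets and answer: {query}",
        "agent_class": metadata.agent_class,
    }
```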
4. Set up runtime
- Create the runtime environment for experimentation.
- Initialize the runtime by copying the necessary data files into the container.
- Start OpenHands with the instance query and the data inside the container.
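The file-copy part of runtime initialization could be sketched as below. A real integration would use the runtime's own file-transfer API; plain `shutil` stands in here so the sketch is runnable outside OpenHands, and the CSV-only glob is an assumption about the data format:

```python
import shutil
from pathlib import Path

def initialize_runtime(task_dir: str, sandbox_workspace: str) -> list:
    """Copy the instance's data files into the sandbox workspace.

    Stand-in for the runtime's file-transfer call; returns the copied
    file names so the caller can verify the workspace contents.
    """
    dest = Path(sandbox_workspace)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for data_file in sorted(Path(task_dir).glob("*.csv")):
        shutil.copy(data_file, dest / data_file.name)
        copied.append(data_file.name)
    return copied
```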
5. Run the evaluation workflow
- Extract the results generated by the OpenHands agents.
- Analyze the results, comparing generated hypotheses to gold-standard outputs.
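As a placeholder for the comparison step, a crude lexical overlap score is sketched below. DiscoveryBench's own evaluation compares hypotheses semantically, which this simple token measure does not capture; it only illustrates where a real scorer would plug in:

```python
import re

def hypothesis_overlap(generated: str, gold: str) -> float:
    """Jaccard overlap between token sets of two hypotheses (placeholder metric)."""
    def tokens(s: str) -> set:
        return set(re.findall(r"[a-z0-9]+", s.lower()))
    g, r = tokens(generated), tokens(gold)
    return len(g & r) / len(g | r) if g | r else 0.0
```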
6. Compile final results into test result dictionary
- Save all metrics and results into the `test_result` dictionary for final analysis.
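The per-instance `test_result` assembly might look like this; the field names and the 0.5 success threshold are assumptions, not the benchmark's definition:

```python
from typing import Optional

def compile_test_result(instance_id: str, hypothesis: str, score: float,
                        error: Optional[str] = None) -> dict:
    """Assemble the per-instance test_result dictionary (illustrative fields)."""
    return {
        "instance_id": instance_id,
        "generated_hypothesis": hypothesis,
        "score": score,
        # Success criterion is an assumption for the sketch.
        "success": error is None and score >= 0.5,
        "error": error,
    }
```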
7. Log and save evaluation outputs
- Ensure all outputs are logged and stored for reporting.
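Logging could append each `test_result` as one JSON line, a common format for OpenHands evaluation outputs; the exact output file name and location are assumptions:

```python
import json

def log_results(results: list, output_path: str) -> None:
    """Append each result dict as one JSON line (JSONL) to output_path."""
    with open(output_path, "a") as f:
        for r in results:
            f.write(json.dumps(r, sort_keys=True) + "\n")
```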
8. Validate the integration
- Perform end-to-end validation of DiscoveryBench within OpenHands to ensure correct functionality.