Add MLflow GenAI Evaluation Dataset export/import support #219
base: master
Conversation
```
assert result is not None
assert result[0] == imported_name
```
With the whole lazy nature of the records attribute on a dataset object, would it be worthwhile to validate that the data actually loaded (records were populated) in the new location just to be safe?
Good point. Added. It now also explicitly checks the records of the imported dataset object.
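For reference, a minimal sketch of such a check, assuming the imported dataset is looked up by name via mlflow.genai.search_datasets() (as the import code does) and that `records` is the lazily loaded attribute discussed here:

```python
import mlflow.genai

def assert_records_imported(imported_name, expected_num_records):
    # Look up the imported dataset by name (search_datasets() returns a
    # paged list of dataset objects).
    matches = [d for d in mlflow.genai.search_datasets() if d.name == imported_name]
    assert len(matches) == 1

    # Accessing .records forces the lazy load, confirming the data actually
    # landed in the destination tracking server.
    records = matches[0].records
    assert records is not None
    assert len(records) == expected_num_records
```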
```
                            comma-delimited list (e.g., 'dataset1,dataset2'),
                            or file path ending with '.txt' containing dataset
                            names (one per line). [required]
--experiment-ids TEXT       Comma-separated list of experiment IDs to filter
```
This can be a P1, but I feel we should add support for names as well. I have done the same for logged-models and traces; I will work on this in my next PR to add it for these 3 objects.
And can we use --evaluation-datasets and --experiment-ids at the same time?
Yes, I agree. Adding support for names in a follow-up PR would be helpful.
You mean like '--evaluation-datasets all --experiment-ids 1,2' ? -> Yes, that works.
It exports all datasets filtered by the specified experiments. If you specify specific dataset names instead of 'all', --experiment-ids is ignored. I'll update the docs to make this clearer.
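Roughly, the selection semantics look like this (a simplified sketch of the behaviour described above, not the actual implementation; the get_* helpers are hypothetical):

```python
def select_datasets(evaluation_datasets, experiment_ids=None):
    # 'all' exports every dataset, optionally narrowed by --experiment-ids.
    if evaluation_datasets == "all":
        if experiment_ids:
            return get_datasets_for_experiments(experiment_ids)  # hypothetical helper
        return get_all_datasets()  # hypothetical helper
    # Explicit dataset names take precedence; --experiment-ids is ignored here.
    names = [name.strip() for name in evaluation_datasets.split(",")]
    return [get_dataset_by_name(name) for name in names]  # hypothetical helper
```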
##### Export evaluation datasets for specific experiments

```
export-evaluation-datasets \
```
Like here, adding specific prompts from specific experiments.
See above
#### Examples

##### Import with original name
Should we provide an option for the customer to specify a destination experiment name?
Could be an item for a follow-up PR. I will add this to the list of potential follow-up changes in the PR description.
In general, evaluation datasets can be associated with multiple experiments (many-to-many), so we'd also need to provide an option to specify multiple experiments.
Otherwise, users can still manually update associations after import using the add_dataset_to_experiments() function.
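For example, something like the following, assuming add_dataset_to_experiments() is exposed under mlflow.genai and takes the dataset ID plus the destination experiment IDs (illustrative only):

```python
import mlflow.genai

# Find the imported dataset by name and re-link it to the experiments it
# should belong to on the destination tracking server (name and IDs are illustrative).
dataset = next(d for d in mlflow.genai.search_datasets() if d.name == "my_eval_dataset")
mlflow.genai.add_dataset_to_experiments(
    dataset_id=dataset.dataset_id,
    experiment_ids=["10", "20"],
)
```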
##### Import with evaluation dataset deletion (if dataset exists, delete it first)

```
import-evaluation-datasets \
```
I believe it will import to the same experiment names as on the source tracking server.
The import uses experiment IDs, not names. If the source dataset was linked to experiment 1, it'll link to experiment 1 in the destination, which of course could be the wrong experiment if the destination isn't empty. That's why import-all works better (imports experiments first so IDs match up).
I've updated the docs to make this more clear.
We can add automatic ID mapping to handle non-empty destinations in a follow-up PR.
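That follow-up mapping could look roughly like this sketch, which resolves destination experiment IDs by experiment name using the standard MlflowClient APIs (the helper name and overall flow are illustrative, not the actual implementation):

```python
from mlflow.tracking import MlflowClient

def map_experiment_ids(src_experiment_ids, src_client: MlflowClient, dst_client: MlflowClient):
    # Translate source experiment IDs into destination experiment IDs by
    # matching on experiment name, so dataset associations survive the import.
    mapped = []
    for src_id in src_experiment_ids:
        src_exp = src_client.get_experiment(src_id)
        dst_exp = dst_client.get_experiment_by_name(src_exp.name)
        if dst_exp is None:
            continue  # destination experiment missing; could be created here instead
        mapped.append(dst_exp.experiment_id)
    return mapped
```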
```
res_datasets = None
try:
    _logger.info("Exporting evaluation datasets...")
    res_datasets = export_evaluation_datasets(
```
I think evaluation datasets are at the experiment level. Correct me if I am wrong. If that's the case, I think we should have this logic in export_experiment.py, similar to traces and logged models, so the common logic is shared between bulk-all and single/bulk experiment operations.
I think they should be treated as tracking server level and independent of experiments (similarly to registered models), as they're reusable across experiments (stored in their own DB table with many-to-many associations), and you can create them with experiment_id=[] to have no experiment link at all. Traces and logged models must belong to exactly one experiment.
This is correct (they are a top-level domain entity and can be attached to multiple experiments).
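For illustration, a sketch of that distinction, assuming mlflow.genai.create_dataset accepts a name and an experiment_id list as described above (parameter names may differ slightly by MLflow version):

```python
import mlflow.genai

# A dataset with no experiment association at all -- it lives purely at the
# tracking-server level.
standalone = mlflow.genai.create_dataset(name="regression_suite", experiment_id=[])

# A dataset attached to several experiments at once (many-to-many), which is
# why export/import treats these objects like registered models and prompts.
shared = mlflow.genai.create_dataset(name="shared_suite", experiment_id=["1", "2"])
```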
```
@click.command()
@opt_output_dir
@click.option("--evaluation-datasets",
```
If possible, let's move the additional info to click_options.
Done.
```
    type=str,
    required=True
)
@click.option("--experiment-ids",
```
Let's utilize the options from click_options; same for threads.
Done.
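For context, shared option definitions in click_options.py could look roughly like this, alongside the existing opt_output_dir (names and help text here are illustrative, mirroring the flags shown above):

```python
import click

opt_experiment_ids = click.option("--experiment-ids",
    help="Comma-separated list of experiment IDs to filter evaluation datasets.",
    type=str,
    required=False
)

opt_use_threads = click.option("--use-threads",
    help="Process export/import in parallel using multiple threads.",
    type=bool,
    default=False,
    show_default=True
)
```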
```
# Import evaluation datasets if they exist (returns dict with status)
evaluation_datasets_res = None
evaluation_datasets_dir = os.path.join(input_dir, "evaluation_datasets")
if os.path.exists(evaluation_datasets_dir):
```
Same as export_all.
```
# Import datasets
if use_threads:
    results = _import_datasets_threaded(dataset_dirs, delete_dataset)
```
Single function here as well.
Done.
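For reference, the collapsed version could look roughly like this (an illustrative sketch, assuming a per-directory _import_dataset() helper):

```python
from concurrent.futures import ThreadPoolExecutor

def _import_datasets(dataset_dirs, delete_dataset, use_threads=False):
    # One code path for both modes: a thread pool when use_threads is set,
    # a plain sequential loop otherwise.
    if use_threads:
        with ThreadPoolExecutor() as executor:
            futures = [
                executor.submit(_import_dataset, ddir, delete_dataset)
                for ddir in dataset_dirs
            ]
            return [f.result() for f in futures]
    return [_import_dataset(ddir, delete_dataset) for ddir in dataset_dirs]
```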
```
def export_evaluation_dataset(
    dataset_name=None,
    dataset_id=None,
    output_dir=None,
```
The default shouldn't be None.
Changed output_dir to be a required parameter. For dataset_name and dataset_id, I kept the None defaults because one of them is required. The function validates this and raises an error if neither is provided.
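In other words, roughly (a sketch of the validation, assuming the project's MlflowExportImportException; the exact error type and wording may differ):

```python
from mlflow_export_import.common import MlflowExportImportException

def export_evaluation_dataset(output_dir, dataset_name=None, dataset_id=None):
    # output_dir is always required; at least one of dataset_name / dataset_id
    # must identify which dataset to export.
    if not dataset_name and not dataset_id:
        raise MlflowExportImportException(
            "Either 'dataset_name' or 'dataset_id' must be provided")
    ...
```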
BenWilson2 left a comment
Looks fantastic! Excellent work and great attention to the nuances of the base implementations for eval datasets / records!
```
existing_dataset = None
try:
    import mlflow.genai
    datasets = list(mlflow.genai.search_datasets())
```
Non-blocking nit: the paged list type returned by search_datasets() is already a subclass of list, so the list() wrapping isn't strictly necessary.
This PR adds exporting and importing of GenAI Evaluation Datasets for hosted/open-source tracking servers >=3.4.0, in single or bulk mode.
Evaluation datasets are tracking-server level objects (not experiment-scoped). They are exported when a >=3.4.0 MLflow client is used. In bulk mode, evaluation datasets are exported along with experiments, runs, or models using export-all, and imported using import-all.
Features
Exporting
- `export-evaluation-datasets` to export evaluation datasets for all or specified dataset names
- `export-evaluation-dataset` to export a single evaluation dataset based on name or ID
- `export-all` exports evaluation datasets when MLflow >=3.4.0

Structure
Evaluation datasets are tracking-server level objects (like registered models and prompts), exported at the root level of the export directory.
Importing
- `import-evaluation-datasets` to import all evaluation datasets from a directory
- `import-evaluation-dataset` to import a single evaluation dataset from a directory
- `import-all` (bulk import)
- `--delete-evaluation-dataset` flag to replace existing datasets

Export/Import-All
Since evaluation datasets are tracking-server level objects (not experiment-scoped), they follow the same pattern as registered models and prompts:
- `export-all` exports evaluation datasets at the tracking server level
- `import-all` imports evaluation datasets at the tracking server level

Additional Changes
- Sample output in `samples/oss_mlflow/bulk/evaluation_datasets/`
- Tests in `tests/open_source/test_evaluation_datasets.py`

Requirements
Note: Evaluation dataset support requires:
- MLflow >=3.4.0
- A database-backed tracking store (evaluation datasets are not supported with FileStore)
The export/import will be skipped with a warning message if the MLflow version doesn't support evaluation datasets or if using FileStore.
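Conceptually, that gate amounts to something like this (a simplified sketch; the actual check in the PR may differ):

```python
import mlflow
from packaging.version import Version

def evaluation_datasets_supported(tracking_uri: str) -> bool:
    # Evaluation datasets need an MLflow client/server >= 3.4.0 and a
    # database-backed tracking store; file-based stores are skipped.
    if Version(mlflow.__version__) < Version("3.4.0"):
        return False
    if tracking_uri.startswith("file:"):
        return False
    return True
```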
Testing
Tested exporting evaluation datasets using a self-hosted MLflow tracking server with a PostgreSQL backend. Validated that there are no breaking changes on <3.4.0 tracking servers by running the existing tests. Version checks gracefully skip evaluation dataset operations when not supported.
Other
Enhancements for Prompt Feature
Exception Improvement (export_prompt.py)
- `mlflow.load_prompt` API: handle `ImportError` and `AttributeError`

Duplicate Handling (import_prompt.py)
- `--delete-prompt` flag to replace existing prompts

Follow-up Work
- Add support for experiment names (not just IDs) in CLI parameters for evaluation datasets, logged models, and traces.
- Translate evaluation dataset experiment associations during import (e.g., a dataset with source experiments [1,2,3] → destination experiments [10,20,30]) to prevent incorrect experiment references.