Conversation

@odorfer commented Nov 18, 2025

Export and import of GenAI evaluation datasets for hosted/open-source tracking servers >= 3.4.0, in single or bulk mode.

Evaluation datasets are tracking-server-level objects (not experiment-scoped). They are exported only when an MLflow client >= 3.4.0 is used. In bulk mode, evaluation datasets are exported along with experiments, runs, and models using export-all, and imported using import-all.

Features

Exporting

  • Added CLI command export-evaluation-datasets to export all evaluation datasets or a specified list of dataset names (example invocations below)
  • Added CLI command export-evaluation-dataset to export a single evaluation dataset by name or ID
  • Evaluation datasets are automatically exported with export-all when MLflow >= 3.4.0 is used
  • Supports filtering by experiment IDs to export only datasets associated with specific experiments
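
For illustration, typical invocations might look like the following. The output paths are just examples, and the --dataset-name flag of the single-dataset command is assumed from the export_evaluation_dataset() signature discussed further down; the other option names come from the CLI help shown in this PR.

```
# Export all evaluation datasets, optionally filtered by experiment
export-evaluation-datasets \
  --evaluation-datasets all \
  --experiment-ids 1,2 \
  --output-dir out

# Export a single evaluation dataset by name
export-evaluation-dataset \
  --dataset-name my-eval-dataset \
  --output-dir out/my-eval-dataset
```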

Structure

Evaluation datasets are tracking-server-level objects (like registered models and prompts) and are exported at the root level of the output directory:

output_dir/
└── evaluation_datasets/
    ├── evaluation_datasets_summary.json
    ├── dataset-name-1_abc123/
    │   └── evaluation_dataset.json
    └── dataset-name-2_def456/
        └── evaluation_dataset.json

Importing

  • Added CLI command import-evaluation-datasets to import all evaluation datasets from a directory (example invocations below)
  • Added CLI command import-evaluation-dataset to import a single evaluation dataset from a directory
  • Evaluation datasets are automatically imported with import-all (bulk import)
  • Duplicate detection: the import is skipped if a dataset with the same name already exists
  • Optional --delete-evaluation-dataset flag to replace existing datasets
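
A minimal sketch of the corresponding invocations, assuming the --input-dir convention the other import commands in this repo use (directory names illustrative):

```
# Import all exported evaluation datasets
import-evaluation-datasets \
  --input-dir out/evaluation_datasets

# Replace an existing dataset with the same name instead of skipping it
import-evaluation-datasets \
  --input-dir out/evaluation_datasets \
  --delete-evaluation-dataset
```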

Export/Import-All

Since evaluation datasets are tracking-server-level objects (not experiment-scoped), they follow the same pattern as registered models and prompts:

  • export-all exports evaluation datasets at the tracking server level
  • import-all imports evaluation datasets at the tracking server level
  • This is different from experiment-scoped objects like traces and logged models
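
For example, assuming the usual --output-dir/--input-dir options of export-all and import-all (directory name illustrative):

```
export-all --output-dir out    # writes out/evaluation_datasets/ next to experiments and models
import-all --input-dir out     # experiments are imported first, then evaluation datasets
```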

Additional Changes

  • Updated README, README_bulk, and README_single for evaluation dataset operations
  • Added sample evaluation dataset exports in samples/oss_mlflow/bulk/evaluation_datasets/
  • Added test in tests/open_source/test_evaluation_datasets.py

Requirements

Note: Evaluation dataset support requires:

  • MLflow 3.4.0 or higher
  • SQL-based tracking backend (SQLite, PostgreSQL, MySQL)
  • FileStore is not supported

The export/import will be skipped with a warning message if the MLflow version doesn't support evaluation datasets or if a FileStore backend is used.
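
A minimal sketch of the kind of guard this implies; the helper name and the FileStore heuristic are illustrative, not the exact code in this PR:

```
import logging
import mlflow
from packaging.version import Version

_logger = logging.getLogger(__name__)

def _evaluation_datasets_supported():
    """Return False (with a warning) when evaluation datasets cannot be exported/imported."""
    if Version(mlflow.__version__) < Version("3.4.0"):
        _logger.warning("Skipping evaluation datasets: MLflow >= 3.4.0 is required.")
        return False
    uri = mlflow.get_tracking_uri()
    if uri.startswith("file:"):
        # Rough heuristic: a file-based store cannot back evaluation datasets,
        # which require a SQL tracking backend.
        _logger.warning("Skipping evaluation datasets: FileStore backends are not supported.")
        return False
    return True
```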

Testing

Tested exporting evaluation datasets using a self-hosted MLflow tracking server with a PostgreSQL backend. Validated that there are no breaking changes on <3.4.0 tracking servers by running the existing tests. Version checks gracefully skip evaluation dataset operations when they are not supported.

Other

Enhancements for Prompt Feature

Exception Improvement (export_prompt.py)

  • Fixed overly broad exception handling that caused a fallback to the deprecated mlflow.load_prompt API
  • Changed from catching all exceptions to catching only ImportError and AttributeError (see the sketch below)
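
Roughly, the narrowed handling looks like this (a sketch only; the mlflow.genai.load_prompt import path is an assumption based on the description above):

```
import mlflow

def _load_prompt(name_or_uri):
    """Prefer the newer prompt API; fall back to the deprecated one only when the
    new module/attribute is missing, instead of swallowing every exception."""
    try:
        from mlflow.genai import load_prompt  # newer API path (assumed)
        return load_prompt(name_or_uri)
    except (ImportError, AttributeError):
        return mlflow.load_prompt(name_or_uri)  # deprecated fallback for old clients
```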

Duplicate Handling (import_prompt.py)

  • Added graceful handling of duplicate prompts during import
  • Skips the import with a warning message if the prompt already exists (preserving version numbers)
  • Provides clear guidance to use the --delete-prompt flag to replace existing prompts (see the sketch below)
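
The control flow described above, sketched with the registry calls abstracted behind callables (the exact prompt-registry client API varies across MLflow versions, so the real module talks to the registry directly):

```
import logging

_logger = logging.getLogger(__name__)

def import_prompt_if_absent(name, prompt_exists, do_import, do_delete, delete=False):
    """prompt_exists/do_import/do_delete are caller-supplied callables in this sketch."""
    if prompt_exists(name):
        if not delete:
            _logger.warning(
                f"Prompt '{name}' already exists - skipping import to preserve its "
                "version numbers. Use --delete-prompt to replace it."
            )
            return None
        do_delete(name)
    return do_import(name)
```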

Follow-up Work

  • Add support for experiment names (not just IDs) in CLI parameters for evaluation datasets, logged models, and traces.

  • Translate evaluation dataset experiment associations during import (e.g., dataset with source exp [1,2,3] → dataset with destination exp [10,20,30]) to prevent incorrect experiment references


assert result is not None
assert result[0] == imported_name

Member

With the whole lazy nature of the records attribute on a dataset object, would it be worthwhile to validate that the data actually loaded (records were populated) in the new location just to be safe?

Author

Good point. Added. The test now also explicitly checks the records of the imported dataset object.
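
For reference, a minimal version of that check might look like this (helper and attribute names follow the discussion above; the actual test in tests/open_source/test_evaluation_datasets.py may differ):

```
def assert_records_imported(source_dataset, imported_dataset):
    # Listing .records forces the lazy load on both sides before comparing.
    src = list(source_dataset.records)
    dst = list(imported_dataset.records)
    assert len(dst) > 0
    assert len(dst) == len(src)
```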

comma-delimited list (e.g., 'dataset1,dataset2'),
or file path ending with '.txt' containing dataset
names (one per line). [required]
--experiment-ids TEXT Comma-separated list of experiment IDs to filter
Collaborator

This can be a P1, but I feel we should add support for names as well. I have done the same for logged-models and traces; I will work on this in my next PR to add it for these three objects.

Also, can we use --evaluation-datasets and --experiment-ids at the same time?

Author

Yes, I agree. Adding support for names in a follow-up PR would be helpful.

You mean like '--evaluation-datasets all --experiment-ids 1,2'? Yes, that works.
It exports all datasets filtered by the specified experiments. If you specify particular dataset names instead of 'all', --experiment-ids is ignored. I'll update the docs to make this clearer.


##### Export evaluation datasets for specific experiments
```
export-evaluation-datasets \
Collaborator

like here, adding specific prompts from the specific experiments.

Author

See above


#### Examples

##### Import with original name
Collaborator

Should we provide an option for the user to specify a destination experiment name?

Author (Nov 28, 2025)

Could be an item for a follow-up PR. I will add this to the list of potential follow-up changes in the PR description.

In general, evaluation datasets can be associated with multiple experiments (many-to-many), so we'd also need an option to specify multiple experiments.

Otherwise, users can still manually update the associations after import using the add_dataset_to_experiments() function.
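
For example, something along these lines (the add_dataset_to_experiments() module path and signature are assumed; mlflow.genai.search_datasets() is the lookup used elsewhere in this PR, and the name and experiment IDs are illustrative):

```
import mlflow.genai
from mlflow.genai import datasets as genai_datasets  # module path assumed

# Find the imported dataset by name, then attach it to the destination experiments.
dataset = next(d for d in mlflow.genai.search_datasets() if d.name == "my-eval-dataset")
genai_datasets.add_dataset_to_experiments(
    dataset_id=dataset.dataset_id,
    experiment_ids=["10", "20", "30"],
)
```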


##### Import with evaluation dataset deletion (if dataset exists, delete it first)
```
import-evaluation-datasets \
Collaborator

I believe it will import to the same experiment names as on the source tracking server.

Author

The import uses experiment IDs, not names. If the source dataset was linked to experiment 1, it'll link to experiment 1 in the destination, which of course could be the wrong experiment if the destination isn't empty. That's why import-all works better (imports experiments first so IDs match up).
I've updated the docs to make this clearer.

We can add automatic ID mapping to handle non-empty destinations in a follow-up PR.
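
A rough sketch of what that mapping could look like, matching experiments by name (a hypothetical helper, not part of this PR):

```
import mlflow

def translate_experiment_ids(source_ids, source_id_to_name):
    """Map source experiment IDs to destination IDs by matching experiment names.

    source_id_to_name would come from the export metadata; experiments that do not
    exist on the destination are simply dropped here (the caller should warn).
    """
    client = mlflow.MlflowClient()
    dest_ids = []
    for src_id in source_ids:
        name = source_id_to_name.get(src_id)
        exp = client.get_experiment_by_name(name) if name else None
        if exp is not None:
            dest_ids.append(exp.experiment_id)
    return dest_ids
```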

res_datasets = None
try:
_logger.info("Exporting evaluation datasets...")
res_datasets = export_evaluation_datasets(
Collaborator

I think evaluation datasets are at the experiment level; correct me if I am wrong. If that's the case, I think we should have this logic in export_experiment.py, similar to traces and logged models. That would give us common logic for bulk-all and single/bulk experiment operations.

Author (Nov 28, 2025)

I think they should be treated as tracking-server level and independent of experiments (similar to registered models): they're reusable across experiments (stored in their own DB table with many-to-many associations), and you can create them with experiment_id=[] to have no experiment link at all. Traces and logged models, by contrast, must belong to exactly one experiment.
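
For example, roughly (the create_dataset module path and signature here are assumed from the point above; check the MLflow 3.4+ docs for the exact API):

```
from mlflow.genai import datasets as genai_datasets  # module path assumed

# A dataset created with no experiment association at all, illustrating that
# evaluation datasets live at the tracking-server level rather than inside an experiment.
dataset = genai_datasets.create_dataset(
    name="standalone-eval-dataset",
    experiment_id=[],
)
```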

Member

This is correct (they are a top-level domain entity and can be attached to multiple experiments).


@click.command()
@opt_output_dir
@click.option("--evaluation-datasets",
Collaborator

If possible, let's move the additional info to click_options.

Author

Done.

type=str,
required=True
)
@click.option("--experiment-ids",
Collaborator

Let's use the shared option from click_options. Same for threads.

Author

Done

# Import evaluation datasets if they exist (returns dict with status)
evaluation_datasets_res = None
evaluation_datasets_dir = os.path.join(input_dir, "evaluation_datasets")
if os.path.exists(evaluation_datasets_dir):
Collaborator

Same as export_all.


# Import datasets
if use_threads:
results = _import_datasets_threaded(dataset_dirs, delete_dataset)
Collaborator

Use a single function here as well.

Author

Done.

def export_evaluation_dataset(
dataset_name=None,
dataset_id=None,
output_dir=None,
Collaborator

The default shouldn't be None.

Author

Changed output_dir to be a required parameter. For dataset_name and dataset_id, I kept the None defaults because only one of them needs to be provided; the function validates this and raises an error if neither is given.
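
For context, that validation can be as simple as the following (illustrative, not the exact code):

```
def export_evaluation_dataset(output_dir, dataset_name=None, dataset_id=None):
    """Export a single evaluation dataset identified by name or ID (sketch)."""
    if dataset_name is None and dataset_id is None:
        raise ValueError("Either dataset_name or dataset_id must be provided")
    ...  # resolve the dataset and write evaluation_dataset.json under output_dir
```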

@odorfer force-pushed the feature/evaluation-datasets branch 2 times, most recently from 5c204b0 to ac16f4f on December 1, 2025 at 16:42
@odorfer force-pushed the feature/evaluation-datasets branch from ac16f4f to b963923 on December 2, 2025 at 22:15
@BenWilson2 (Member) left a comment

Looks fantastic! Excellent work and great attention to the nuances of the base implementations for eval datasets / records!

existing_dataset = None
try:
import mlflow.genai
datasets = list(mlflow.genai.search_datasets())
Member

non-blocking nit: The paged list type is a subclass of list
