Conversation

@odorfer commented Nov 18, 2025

Export and import of GenAI evaluation datasets for hosted/open-source tracking servers >= 3.4.0, in single or bulk mode.

Evaluation datasets are tracking-server-level objects (not experiment-scoped). They are exported only when an MLflow client >= 3.4.0 is used. In bulk mode, evaluation datasets are exported along with experiments, runs, and models using export-all, and imported using import-all.

Features

Exporting

  • Added CLI command export-evaluation-datasets to export all evaluation datasets or a specified list of dataset names (example invocations below)
  • Added CLI command export-evaluation-dataset to export a single evaluation dataset by name or ID
  • Evaluation datasets are automatically exported with export-all when MLflow >= 3.4.0 is used
  • Supports filtering by experiment IDs to export only datasets associated with specific experiments
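
For illustration, typical invocations might look like the following. The output paths are just examples, and the --dataset-name flag of the single-dataset command is assumed from the export_evaluation_dataset() signature discussed further down; the other option names come from the CLI help shown in this PR.

```
# Export all evaluation datasets, optionally filtered by experiment
export-evaluation-datasets \
  --evaluation-datasets all \
  --experiment-ids 1,2 \
  --output-dir out

# Export a single evaluation dataset by name
export-evaluation-dataset \
  --dataset-name my-eval-dataset \
  --output-dir out/my-eval-dataset
```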

Structure

Evaluation datasets are tracking-server-level objects (like registered models and prompts) and are exported at the root level of the output directory:

output_dir/
└── evaluation_datasets/
    ├── evaluation_datasets_summary.json
    ├── dataset-name-1_abc123/
    │   └── evaluation_dataset.json
    └── dataset-name-2_def456/
        └── evaluation_dataset.json

Importing

  • Added CLI command import-evaluation-datasets to import all evaluation datasets from a directory (example invocations below)
  • Added CLI command import-evaluation-dataset to import a single evaluation dataset from a directory
  • Evaluation datasets are automatically imported with import-all (bulk import)
  • Duplicate detection: the import is skipped if a dataset with the same name already exists
  • Optional --delete-evaluation-dataset flag to replace existing datasets
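
A minimal sketch of the corresponding invocations, assuming the --input-dir convention the other import commands in this repo use (directory names illustrative):

```
# Import all exported evaluation datasets
import-evaluation-datasets \
  --input-dir out/evaluation_datasets

# Replace an existing dataset with the same name instead of skipping it
import-evaluation-datasets \
  --input-dir out/evaluation_datasets \
  --delete-evaluation-dataset
```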

Export/Import-All

Since evaluation datasets are tracking-server-level objects (not experiment-scoped), they follow the same pattern as registered models and prompts:

  • export-all exports evaluation datasets at the tracking server level
  • import-all imports evaluation datasets at the tracking server level
  • This is different from experiment-scoped objects like traces and logged models
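
For example, assuming the usual --output-dir/--input-dir options of export-all and import-all (directory name illustrative):

```
export-all --output-dir out    # writes out/evaluation_datasets/ next to experiments and models
import-all --input-dir out     # experiments are imported first, then evaluation datasets
```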

Additional Changes

  • Updated README, README_bulk, and README_single for evaluation dataset operations
  • Added sample evaluation dataset exports in samples/oss_mlflow/bulk/evaluation_datasets/
  • Added test in tests/open_source/test_evaluation_datasets.py

Requirements

Note: Evaluation dataset support requires:

  • MLflow 3.4.0 or higher
  • SQL-based tracking backend (SQLite, PostgreSQL, MySQL)
  • FileStore is not supported

The export/import will be skipped with a warning message if the MLflow version doesn't support evaluation datasets or if a FileStore backend is used.
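
A minimal sketch of the kind of guard this implies; the helper name and the FileStore heuristic are illustrative, not the exact code in this PR:

```
import logging
import mlflow
from packaging.version import Version

_logger = logging.getLogger(__name__)

def _evaluation_datasets_supported():
    """Return False (with a warning) when evaluation datasets cannot be exported/imported."""
    if Version(mlflow.__version__) < Version("3.4.0"):
        _logger.warning("Skipping evaluation datasets: MLflow >= 3.4.0 is required.")
        return False
    uri = mlflow.get_tracking_uri()
    if uri.startswith("file:"):
        # Rough heuristic: a file-based store cannot back evaluation datasets,
        # which require a SQL tracking backend.
        _logger.warning("Skipping evaluation datasets: FileStore backends are not supported.")
        return False
    return True
```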

Testing

Tested exporting evaluation datasets using a self-hosted MLflow tracking server with a PostgreSQL backend. Validated that there are no breaking changes on <3.4.0 tracking servers by running the existing tests. Version checks gracefully skip evaluation dataset operations when they are not supported.

Other

Enhancements for Prompt Feature

Exception Improvement (export_prompt.py)

  • Fixed overly broad exception handling that caused a fallback to the deprecated mlflow.load_prompt API
  • Changed from catching all exceptions to catching only ImportError and AttributeError (see the sketch below)
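
Roughly, the narrowed handling looks like this (a sketch only; the mlflow.genai.load_prompt import path is an assumption based on the description above):

```
import mlflow

def _load_prompt(name_or_uri):
    """Prefer the newer prompt API; fall back to the deprecated one only when the
    new module/attribute is missing, instead of swallowing every exception."""
    try:
        from mlflow.genai import load_prompt  # newer API path (assumed)
        return load_prompt(name_or_uri)
    except (ImportError, AttributeError):
        return mlflow.load_prompt(name_or_uri)  # deprecated fallback for old clients
```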

Duplicate Handling (import_prompt.py)

  • Added graceful handling of duplicate prompts during import
  • Skips the import with a warning message if the prompt already exists (preserving version numbers)
  • Provides clear guidance to use the --delete-prompt flag to replace existing prompts (see the sketch below)
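
The control flow described above, sketched with the registry calls abstracted behind callables (the exact prompt-registry client API varies across MLflow versions, so the real module talks to the registry directly):

```
import logging

_logger = logging.getLogger(__name__)

def import_prompt_if_absent(name, prompt_exists, do_import, do_delete, delete=False):
    """prompt_exists/do_import/do_delete are caller-supplied callables in this sketch."""
    if prompt_exists(name):
        if not delete:
            _logger.warning(
                f"Prompt '{name}' already exists - skipping import to preserve its "
                "version numbers. Use --delete-prompt to replace it."
            )
            return None
        do_delete(name)
    return do_import(name)
```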

Follow-up Work

  • Add support for experiment names (not just IDs) in CLI parameters for evaluation datasets, logged models, and traces.

  • Translate evaluation dataset experiment associations during import (e.g., dataset with source exp [1,2,3] → dataset with destination exp [10,20,30]) to prevent incorrect experiment references


assert result is not None
assert result[0] == imported_name

Member

With the whole lazy nature of the records attribute on a dataset object, would it be worthwhile to validate that the data actually loaded (records were populated) in the new location just to be safe?

Author

Good point. Added. The test now also explicitly checks the records of the imported dataset object.
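
For reference, a minimal version of that check might look like this (helper and attribute names follow the discussion above; the actual test in tests/open_source/test_evaluation_datasets.py may differ):

```
def assert_records_imported(source_dataset, imported_dataset):
    # Listing .records forces the lazy load on both sides before comparing.
    src = list(source_dataset.records)
    dst = list(imported_dataset.records)
    assert len(dst) > 0
    assert len(dst) == len(src)
```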

comma-delimited list (e.g., 'dataset1,dataset2'),
or file path ending with '.txt' containing dataset
names (one per line). [required]
--experiment-ids TEXT Comma-separated list of experiment IDs to filter
Collaborator

This can be a P1, but I feel we should add support for names as well. I have done the same for logged-models and traces; I will work on this in my next PR to add it for these three objects.

Also, can we use --evaluation-datasets and --experiment-ids at the same time?

Author

Yes, I agree. Adding support for names in a follow-up PR would be helpful.

You mean like '--evaluation-datasets all --experiment-ids 1,2'? Yes, that works.
It exports all datasets filtered by the specified experiments. If you specify particular dataset names instead of 'all', --experiment-ids is ignored. I'll update the docs to make this clearer.


##### Export evaluation datasets for specific experiments
```
export-evaluation-datasets \
Collaborator

like here, adding specific prompts from the specific experiments.

Author

See above


#### Examples

##### Import with original name
Collaborator

Should we provide an option for the user to specify a destination experiment name?

Author (Nov 28, 2025)

Could be an item for a follow-up PR. I will add this to the list of potential follow-up changes in the PR description.

In general, evaluation datasets can be associated with multiple experiments (many-to-many), so we'd also need an option to specify multiple experiments.

Otherwise, users can still manually update the associations after import using the add_dataset_to_experiments() function.
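
For example, something along these lines (the add_dataset_to_experiments() module path and signature are assumed; mlflow.genai.search_datasets() is the lookup used elsewhere in this PR, and the name and experiment IDs are illustrative):

```
import mlflow.genai
from mlflow.genai import datasets as genai_datasets  # module path assumed

# Find the imported dataset by name, then attach it to the destination experiments.
dataset = next(d for d in mlflow.genai.search_datasets() if d.name == "my-eval-dataset")
genai_datasets.add_dataset_to_experiments(
    dataset_id=dataset.dataset_id,
    experiment_ids=["10", "20", "30"],
)
```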


##### Import with evaluation dataset deletion (if dataset exists, delete it first)
```
import-evaluation-datasets \
Collaborator

I believe it will import to the same experiment names as on the source tracking server.

Author

The import uses experiment IDs, not names. If the source dataset was linked to experiment 1, it'll link to experiment 1 in the destination, which of course could be the wrong experiment if the destination isn't empty. That's why import-all works better (imports experiments first so IDs match up).
I've updated the docs to make this clearer.

We can add automatic ID mapping to handle non-empty destinations in a follow-up PR.
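
A rough sketch of what that mapping could look like, matching experiments by name (a hypothetical helper, not part of this PR):

```
import mlflow

def translate_experiment_ids(source_ids, source_id_to_name):
    """Map source experiment IDs to destination IDs by matching experiment names.

    source_id_to_name would come from the export metadata; experiments that do not
    exist on the destination are simply dropped here (the caller should warn).
    """
    client = mlflow.MlflowClient()
    dest_ids = []
    for src_id in source_ids:
        name = source_id_to_name.get(src_id)
        exp = client.get_experiment_by_name(name) if name else None
        if exp is not None:
            dest_ids.append(exp.experiment_id)
    return dest_ids
```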

res_datasets = None
try:
_logger.info("Exporting evaluation datasets...")
res_datasets = export_evaluation_datasets(
Collaborator

I think evaluation datasets are at the experiment level; correct me if I am wrong. If that's the case, I think we should have this logic in export_experiment.py, similar to traces and logged models. That would give us common logic for bulk-all and single/bulk experiment operations.

Author (Nov 28, 2025)

I think they should be treated as tracking-server level and independent of experiments (similar to registered models): they're reusable across experiments (stored in their own DB table with many-to-many associations), and you can create them with experiment_id=[] to have no experiment link at all. Traces and logged models, by contrast, must belong to exactly one experiment.
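
For example, roughly (the create_dataset module path and signature here are assumed from the point above; check the MLflow 3.4+ docs for the exact API):

```
from mlflow.genai import datasets as genai_datasets  # module path assumed

# A dataset created with no experiment association at all, illustrating that
# evaluation datasets live at the tracking-server level rather than inside an experiment.
dataset = genai_datasets.create_dataset(
    name="standalone-eval-dataset",
    experiment_id=[],
)
```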

Member

This is correct (they are a top-level domain entity and can be attached to multiple experiments).


@click.command()
@opt_output_dir
@click.option("--evaluation-datasets",
Collaborator

If possible, let's move the additional info to click_options.

Author

Done.

type=str,
required=True
)
@click.option("--experiment-ids",
Collaborator

Let's use the shared option from click_options. Same for threads.

Author

Done

# Import evaluation datasets if they exist (returns dict with status)
evaluation_datasets_res = None
evaluation_datasets_dir = os.path.join(input_dir, "evaluation_datasets")
if os.path.exists(evaluation_datasets_dir):
Collaborator

Same as export_all.


# Import datasets
if use_threads:
results = _import_datasets_threaded(dataset_dirs, delete_dataset)
Collaborator

Use a single function here as well.

Author

Done.

def export_evaluation_dataset(
dataset_name=None,
dataset_id=None,
output_dir=None,
Collaborator

The default shouldn't be None.

Author

Changed output_dir to be a required parameter. For dataset_name and dataset_id, I kept the None defaults because only one of them needs to be provided; the function validates this and raises an error if neither is given.
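
For context, that validation can be as simple as the following (illustrative, not the exact code):

```
def export_evaluation_dataset(output_dir, dataset_name=None, dataset_id=None):
    """Export a single evaluation dataset identified by name or ID (sketch)."""
    if dataset_name is None and dataset_id is None:
        raise ValueError("Either dataset_name or dataset_id must be provided")
    ...  # resolve the dataset and write evaluation_dataset.json under output_dir
```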

@odorfer force-pushed the feature/evaluation-datasets branch 2 times, most recently from 5c204b0 to ac16f4f on December 1, 2025 at 16:42
@odorfer force-pushed the feature/evaluation-datasets branch from ac16f4f to b963923 on December 2, 2025 at 22:15
@BenWilson2 (Member) left a comment

Looks fantastic! Excellent work and great attention to the nuances of the base implementations for eval datasets / records!

existing_dataset = None
try:
import mlflow.genai
datasets = list(mlflow.genai.search_datasets())
Member

non-blocking nit: The paged list type is a subclass of list
