Drop the deprecated binary format. #11307

Merged: 30 commits on Jul 17, 2025

Conversation

trivialfis
Member

@trivialfis trivialfis commented Mar 4, 2025

Close #7547.

  • Drop support for the deprecated binary format.
  • Add compatibility tests for categorical features.
  • Add compatibility tests for AFT survival training.
  • Use the same set of models for Python and R tests.

todos:

  • Test error handling.
  • Warning messages consistency.
  • New test models.
  • Update R tests for using the same set of models.

Models
xgboost_model_compatibility_tests-3.0.2.zip

Remove.

Basic model test.

Cleanup.

cli.

adaptive test.
@trivialfis
Member Author

trivialfis commented Jul 11, 2025

@hcho3 Could you please help re-upload the test models to s3, with binary models removed? Also, the RMM build is failing ;-(

@hcho3
Collaborator

hcho3 commented Jul 11, 2025

@trivialfis Should we also remove the RDS model files from xgboost_r_model_compatibility_test.zip? They are probably using the legacy binary format.

@trivialfis
Member Author

Please do, @hcho3.

@hcho3
Collaborator

hcho3 commented Jul 11, 2025

Done

@trivialfis
Member Author

Thank you!

@trivialfis
Member Author

@hcho3 Could you please share the latest download link?

@trivialfis trivialfis requested a review from Copilot July 12, 2025 11:03
Copilot

This comment was marked as outdated.

@trivialfis
Member Author

Let me generate some new models using 3.0 instead

@trivialfis
Member Author

trivialfis commented Jul 13, 2025

Models generated by the new script using 3.0.2 and 2.1.4:
xgboost_model_compatibility_test.zip

@hcho3
Collaborator

hcho3 commented Jul 13, 2025

Are we dropping compatibility for JSON models from XGBoost 1.x?

As for the download link, I uploaded the zip file to the same S3 bucket as before. Were you not able to access it?

@trivialfis
Member Author

trivialfis commented Jul 13, 2025

Let me work on tests for some older models.

@trivialfis
Member Author

@hcho3
Collaborator

hcho3 commented Jul 14, 2025

My bad, let me change the access setting.

@trivialfis trivialfis changed the title from "[WIP] Drop the deprecated binary format." to "Drop the deprecated binary format." Jul 15, 2025
@trivialfis
Member Author

I uploaded new models. After this PR, Python and R will test with the same set of models. The old models are still in S3, just in case we need them. Also, we can't remove the R test models until the CRAN package is updated.

@trivialfis trivialfis requested a review from Copilot July 15, 2025 08:38
@trivialfis trivialfis marked this pull request as ready for review July 15, 2025 08:38
@trivialfis trivialfis requested a review from hcho3 July 15, 2025 08:38
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR removes support for the deprecated binary model format and fully transitions to JSON and UBJSON serialization. It also adds compatibility tests for categorical features and AFT survival, and standardizes the set of test models used in both Python and R.

  • Remove legacy Load/Save implementations for the old binary format across C++ code.
  • Update CLI and C API to recognize .ubj and .json extensions exclusively.
  • Refactor Python and R tests to cover categorical features, AFT survival, and use a unified model set.
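
To make the extension-based dispatch in the second bullet concrete, here is a minimal sketch in Python. It is not the code from src/cli_main.cc: `model_format` and `SUPPORTED_FORMATS` are hypothetical names, and both the case-insensitive match and the choice to raise on any other extension are assumptions made for illustration. This thread does not show how the real CLI or C API reacts to an unknown extension.

```python
from pathlib import Path

# Hypothetical mapping, not part of XGBoost: file extension -> format name.
SUPPORTED_FORMATS = {".json": "json", ".ubj": "ubjson"}


def model_format(path: str) -> str:
    """Return the serialization format implied by the file extension."""
    # Assumption: extensions are matched case-insensitively.
    ext = Path(path).suffix.lower()
    try:
        return SUPPORTED_FORMATS[ext]
    except KeyError:
        # Assumption: anything other than .json or .ubj is rejected outright.
        raise ValueError(
            f"Unsupported model extension {ext!r}; "
            f"expected one of {sorted(SUPPORTED_FORMATS)}"
        ) from None
```

With this sketch, `model_format("model.ubj")` yields `"ubjson"`, while a path ending in `.bin` raises `ValueError`.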

Reviewed Changes

Copilot reviewed 29 out of 29 changed files in this pull request and generated 1 comment.

File | Description
src/learner.cc | Dropped old binary serialization code and updated deprecation warnings.
src/cli_main.cc | Updated CLI to save/load .ubj and .json formats only.
tests/python/test_model_compatibility.py | Refactored compatibility test harness and download logic.
tests/python/generate_models.py | Extended model generator with categorical and AFT survival cases.
Comments suppressed due to low confidence (2)

tests/python/test_model_compatibility.py:13

  • The test uses xgboost.Booster and other xgboost APIs but does not import the xgboost module. Add import xgboost (or alias) at the top of the file to avoid NameError during test execution.
from xgboost import testing as tm

src/cli_main.cc:347

  • [nitpick] This line is misaligned in the else if (ext == "ubj") block. Adjust the indentation to match the surrounding 6-space indent for consistency with project style.
      learner->LoadModel(in);

@trivialfis
Member Author

@hcho3 Could you please help take a look when you are available?

@trivialfis
Member Author

I need to let the R check skip the compatibility check when the link is down to make CRAN tests more resilient.
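
The skip-when-the-link-is-down idea can be sketched as follows. This is Python rather than R, and it is not the code from this PR: `fetch_models`, `require_models`, and the injectable `opener` parameter are hypothetical names introduced for illustration. The only assumption about the real tests is the behaviour described above, namely that an unreachable download should turn into a skip and not a failure.

```python
import unittest
import urllib.request
from typing import Callable, Optional


def fetch_models(
    url: str,
    opener: Callable = urllib.request.urlopen,
    timeout: float = 10.0,
) -> Optional[bytes]:
    """Return the archive bytes, or None when the download fails."""
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.read()
    except OSError:
        # urllib.error.URLError and TimeoutError are both OSError subclasses.
        return None


def require_models(url: str, opener: Callable = urllib.request.urlopen) -> bytes:
    """Return the archive bytes, or skip the calling test if unavailable."""
    data = fetch_models(url, opener)
    if data is None:
        raise unittest.SkipTest(f"Test models unavailable at {url}")
    return data
```

Passing the opener as a parameter keeps the helper testable without network access: a fake opener that raises `urllib.error.URLError` exercises the skip path.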

Collaborator

@hcho3 hcho3 left a comment

Left some questions, but otherwise looks good

Comment on lines -137 to -141
def write_versions():
    versions = {'numpy': np.__version__,
                'xgboost': version}
    with open(os.path.join(target_dir, 'version'), 'w') as fd:
        fd.write(str(versions))
Collaborator

Why are we dropping write_versions()? Is it because the file name for each model artifact already contains the version number?

Member Author

Yes. Also, the models we test come from multiple versions (1-3).

auto ext = common::FileExtension(path);
auto read_file = [&]() {
  auto str = common::LoadSequentialFile(path);
  CHECK_GE(str.size(), 3);  // "{}\0"
Collaborator

Does common::LoadSequentialFile return a string with \0? I thought the terminating \0 won't be part of the string?

Member Author

It returns a vector of char as it consumes binary inputs. I removed the requirement for \0 and added a small test for the JSON parser.
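
As a rough illustration of the size question, the smallest JSON object, `{}`, is 2 bytes, and a C-style terminator makes it 3. The sketch below is Python, not XGBoost's C++ JSON parser, and `parse_model_json` is a hypothetical helper. It assumes that only NUL bytes may trail the document, and shows a parser that accepts the buffer with or without the terminator.

```python
import json
from typing import Any


def parse_model_json(buf: bytes) -> Any:
    """Parse a JSON document from raw bytes, with or without a trailing NUL."""
    # Assumption: only C-style NUL terminators may follow the document.
    text = buf.rstrip(b"\x00")
    if len(text) < 2:  # the shortest valid object is "{}"
        raise ValueError("Buffer is too short to hold a JSON object")
    return json.loads(text)
```

Both `parse_model_json(b"{}")` and `parse_model_json(b"{}\x00")` yield an empty dict, so the caller does not need to guarantee the terminator.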

@trivialfis trivialfis merged commit 29ada72 into dmlc:master Jul 17, 2025
78 of 82 checks passed
@trivialfis trivialfis deleted the drop-binary branch July 17, 2025 16:52
Development

Successfully merging this pull request may close these issues.

[Roadmap] Phasing out the support for old binary format.
2 participants