
Conversation

@nikita-savelyevv
Collaborator

What does this PR do?

In this PR the "gsm8k" dataset option is added for applying data-aware quantization to causal language models. Calibrating on "gsm8k" in some cases provides better results than calibrating on "wikitext2":

| Model | Precision | Calibration Dataset | seqlen | gsm8k_cot_llama (strict-match) | gsm8k_cot_llama (flexible-extract) |
|---|---|---|---|---|---|
| Llama-3.1-8B-Instruct | nf4 per-channel | wikitext2 | 32 | 78.70% | 78.77% |
| Llama-3.1-8B-Instruct | nf4 per-channel | gsm8k | 64 | 80.29% | 80.82% |
| Llama-3.1-8B-Instruct | nf4 per-channel | gsm8k | 128 | 79.30% | 79.53% |
| Llama-3.1-8B-Instruct | nf4 per-channel | gsm8k | 256 | 81.20% | 81.50% |
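For reference, a minimal sketch of how the new option might be used from Python, assuming "gsm8k" plugs into the existing `dataset` argument of `OVWeightQuantizationConfig` in optimum-intel (the model name and sample count below are illustrative, not taken from this PR):

```python
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

# Hedged sketch: "gsm8k" is assumed to be accepted wherever the predefined
# calibration dataset names (e.g. "wikitext2") are already accepted.
quantization_config = OVWeightQuantizationConfig(
    bits=4,
    dataset="gsm8k",   # the new calibration dataset option from this PR
    num_samples=128,   # illustrative number of calibration samples
)

model = OVModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    export=True,
    quantization_config=quantization_config,
)
model.save_pretrained("llama-3.1-8b-instruct-4bit-gsm8k")
```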

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@nikita-savelyevv nikita-savelyevv marked this pull request as ready for review December 5, 2025 12:55
@nikita-savelyevv
Collaborator Author

@ljaljushkin, please take a look.

Contributor

@ljaljushkin ljaljushkin left a comment

Thanks, Nikita! LGTM

Member

@IlyasMoutawwakil IlyasMoutawwakil left a comment

LGTM! Isn't the perf increase to be expected, since you are calibrating on the test data/distribution?

@nikita-savelyevv
Collaborator Author

nikita-savelyevv commented Dec 5, 2025

> LGTM! Isn't the perf increase to be expected, since you are calibrating on the test data/distribution?

That's a good question! To be honest, some experiments show higher accuracy when calibrating on wikitext2, so that does not seem to be the case here. For example:

| Model | Precision | Calibration Dataset | Seq. length | gsm8k_cot_llama (strict-match) | gsm8k_cot_llama (flexible-extract) |
|---|---|---|---|---|---|
| Llama-3.1-8B-Instruct | int4 | wikitext2 | 32 | 83.17% | 83.32% |
| Llama-3.1-8B-Instruct | int4 | gsm8k | 256 | 81.50% | 81.80% |
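For context, the gsm8k_cot_llama metrics in these tables come from a task in lm-evaluation-harness, which reports both strict-match and flexible-extract accuracy. A hedged sketch of such an evaluation (the exact flags behind the numbers above are not stated in this thread, so the chat-template settings are assumptions):

```python
from lm_eval import simple_evaluate

# Hedged sketch: evaluate a model on the gsm8k_cot_llama task, which reports
# both strict-match and flexible-extract exact-match scores.
results = simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct",
    tasks=["gsm8k_cot_llama"],
    apply_chat_template=True,   # assumed: the *_llama task variant expects chat formatting
    fewshot_as_multiturn=True,  # assumed setting, not confirmed by this thread
)
print(results["results"]["gsm8k_cot_llama"])
```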

Collaborator

@rkazants rkazants left a comment

Is it possible and reasonable to specify several datasets (e.g., wikitext2 and gsm8k) simultaneously for calibration? Can it provide higher accuracy?

I think it is reasonable to update the documentation and mention in which cases to use the gsm8k dataset.

@nikita-savelyevv
Collaborator Author

nikita-savelyevv commented Dec 9, 2025

> Is it possible and reasonable to specify several datasets (e.g., wikitext2 and gsm8k) simultaneously for calibration? Can it provide higher accuracy?

Thanks for the idea! In theory it could, but no experiments have been done in this direction. A hedged sketch of how one might combine two calibration sets is given at the end of this comment.

> I think it is reasonable to update the documentation and mention in which cases to use the gsm8k dataset.

Experiments are still in progress to determine when exactly it is beneficial to use the gsm8k dataset. For now the results are mixed, but in some cases gsm8k provides better quality. With this PR we add another option for users to try when improving quantized model quality.
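As mentioned above, a minimal sketch of mixing two calibration sets by hand, assuming `dataset` also accepts a plain list of strings as a custom calibration dataset (as the optimum-intel docs describe); the split sizes are arbitrary:

```python
from datasets import load_dataset
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

# Hedged sketch: hand-build a mixed wikitext2 + gsm8k calibration set and pass
# it as a list of strings instead of a predefined dataset name.
wikitext = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
gsm8k = load_dataset("gsm8k", "main", split="train")

calibration_texts = (
    [t for t in wikitext["text"] if t.strip()][:64]  # arbitrary 64 samples each
    + list(gsm8k["question"])[:64]
)

quantization_config = OVWeightQuantizationConfig(bits=4, dataset=calibration_texts)
model = OVModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    export=True,
    quantization_config=quantization_config,
)
```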

@nikita-savelyevv nikita-savelyevv merged commit b90e045 into main Dec 9, 2025
25 of 26 checks passed
@nikita-savelyevv nikita-savelyevv deleted the ns/gsm8k-dataset branch December 9, 2025 11:25