Validate tokenizer and model alignment before training #2074
base: main
Changes from 3 commits
59dc807
faa2122
50b10b8
883c281
cdc1f1b
31853e8
24557cf
@@ -468,3 +468,34 @@ def get_moe_model_nparams_and_flops(
     nparams = nparams - nparams_embedding

     return nparams, num_flops_per_token


+def validate_tokenizer_model_alignment(

Suggested change:
-def validate_tokenizer_model_alignment(
+def validate_tokenizer_model_compatibility(
since we no longer require them to be identical
@tianyu-l That makes sense. I’ve updated the function name accordingly, and also reverted the removal of eos_id. Thanks for the suggestion!
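For readers following the rename: the function body is not shown in this thread, so the following is only a minimal sketch of what a "compatibility" (rather than "identity") check could look like. It assumes the check compares vocabulary sizes and that a model vocabulary larger than the tokenizer's is acceptable. The name `check_vocab_compatibility` and both parameters are hypothetical and are not taken from the PR.

```python
def check_vocab_compatibility(tokenizer_vocab_size: int, model_vocab_size: int) -> None:
    """Hypothetical check, not the PR's implementation.

    Assumption: the model's vocabulary may be larger than the tokenizer's,
    but every token ID the tokenizer can emit must be a valid index into
    the model's embedding table.
    """
    if tokenizer_vocab_size > model_vocab_size:
        raise ValueError(
            f"Tokenizer vocab size ({tokenizer_vocab_size}) exceeds "
            f"model vocab size ({model_vocab_size}); some token IDs "
            "would index past the embedding table."
        )


# A larger model vocabulary passes; identical sizes are not required.
check_vocab_compatibility(tokenizer_vocab_size=100, model_vocab_size=128)
```

Under these assumptions the sizes only need to satisfy an inequality, which matches the reviewer's point that the two no longer have to be identical.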
Outdated
Shouldn't the type hint be written without the surrounding quotation marks?
@wwwjn Thanks for pointing this out! I originally wrote it this way to avoid potential circular import issues. (like https://github.com/pytorch/torchtitan/blob/main/torchtitan/components/metrics.py#L496)
However, after testing on Python 3.10 and above, it seems to work fine without the quotation marks, so I’ve updated the code accordingly.
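For context on the quoted-annotation pattern discussed here, this is a small sketch of one common way to keep annotations unquoted while still avoiding an import cycle. It is not the code from this PR: `decimal.Decimal` is only a stand-in for a type whose module would cause a circular import, and `describe` is a hypothetical function.

```python
from __future__ import annotations  # annotations are stored as strings, not evaluated

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # This import is seen by static type checkers only; it never runs,
    # so it cannot participate in a circular import at runtime.
    from decimal import Decimal  # stand-in for the problematic import


def describe(value: Decimal) -> str:
    # The annotation is unquoted, yet Decimal is never looked up at runtime.
    return f"value={value}"


print(describe(3))
```

Without the `__future__` import, an unquoted annotation is evaluated when the function is defined, so the annotated name has to be importable at runtime at that point.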
This seems to be a different usage of eos_id; let's revert this change.