Update Evaluation Logic to Latest lm_eval (0.4.8) and Support Automatic Benchmark Evals w/o Validation Set (#1348)
Open: Kyle1668 wants to merge 12 commits into `main` from `update_lm_eval`.
Commits:

- `6065d05` Update eval code
- `c918e04` Resolve W&B error
- `4383212` Add eval results to gitignore and specify latest lm_eval in reqs file
- `0cc037c` Add limit neox arg
- `a2a5084` Update ci to Python 3.9
- `6207d7b` Resolved challenge with task groups
- `04eabe7` Update docs
- `e8c33e5` Make type handling clearer for task groups
- `6133de9` Run eval tasks even when no validation set is provided
- `a4c8e00` Removed commented out code
- `c60d595` Run pre-commit
- `2841bb0` Resolved potential transformer engine bug
First changed file (the repo `.gitignore`, per commit `4383212`):

```diff
@@ -155,3 +155,6 @@ src/
 # test data files
 tests/data/*.bin
 tests/data/*.idx
+
+# evaluation results
+*eval_results*.json
```
Second changed file (the auto-generated argument documentation). First hunk:

```diff
@@ -14,19 +14,23 @@ LR Scheduler Arguments
     Learning rate decay function. Choose from 'constant', 'linear', 'cosine', 'exponential'.

 - **lr_decay_iters**: int

     Default = None

-    Number of iterations to decay learning rate over. If None, defaults to
-    --train-iters or the equivalent inferred value from train_epochs.
+    Number of iterations to decay learning rate over, If None defaults to
+    --train-iters or the equivalent inferred valued from train_epochs.

 - **lr_decay_fraction**: float

     Default = None

-    Effective fraction of training over which to decay lr. Overrides lr_decay_iters.
-    Useful when specifying train_epochs.
+    Effective fraction of training over which to decay lr, overrides lr_decay_iters, useful when specifying train_epochs

 - **min_lr**: float
```
```diff
@@ -82,6 +86,14 @@ Logging Arguments

+- **wandb_run_name**: str
+
+    Default = None
+
+    Weights and Biases run name for the current experiment
+
 - **wandb_team**: str

     Default = None
```
```diff
@@ -116,7 +128,7 @@ Logging Arguments

 - **git_hash**: str

-    Default = 62c9738a
+    Default = bb881f3b

     current git hash of repository
```
```diff
@@ -186,6 +198,22 @@ Logging Arguments

+- **comet_experiment**: Any
+
+    Default = None
+
+    Initialized comet experiment object used to log data
+
+- **peak_theoretical_tflops**: float
+
+    Default = None
+
+    The peak hardware flops with which to compute MFU and HFU, in units of teraflops. Automatic detection is more trouble than it's worth, so this is left to the user. Helpful table listed at https://github.com/stas00/ml-engineering/tree/master/compute/accelerator#tflops-comparison-table
+
 - **log_interval**: int

     Default = 100
```
```diff
@@ -215,8 +243,7 @@ Logging Arguments
     Default = False

     Log the frob norm of the gradients to wandb / tensorboard (useful for debugging).
-    (N.B - this will only work with pp = 0 for now, as we don't have access to the gradients of the model because
-    deepspeed.)
+    (N.B - this will only work with pp = 0 for now, as we don't have access to the gradients of the model because deepspeed.)
```
````diff
@@ -272,8 +299,8 @@ Logging Arguments

     Default = False

-    Enable nsys profiling. When using this option,
-    nsys options should be specified in commandline.
+    Enable nsys and pytorch profiling. When using this option with nsys,
+    nsys options should be directly specified in commandline.
     An example nsys commandline is
     ```
     nsys profile -s none -t nvtx,cuda -o <path/to/output_file>
````
```diff
@@ -402,11 +429,11 @@ Model Arguments

-- **norm**: typing.Literal['layernorm', 'rmsnorm', 'scalenorm', 'te_rmsnorm', 'te_layernorm']
+- **norm**: typing.Literal['layernorm', 'rmsnorm', 'non_parametric_layernorm', 'scalenorm', 'te_rmsnorm', 'te_layernorm']

     Default = layernorm

-    Normalization layer to use. Choose from "layernorm", "rmsnorm", "scalenorm", "te_rmsnorm", "te_layernorm".
+    Normalization layer to use. Choose from "layernorm", "rmsnorm", "non_parametric_layernorm", "scalenorm", "te_rmsnorm", "te_layernorm".
```
The next, large hunk adds weight-serving, online-data, and TransformerEngine options:

```diff
@@ -843,6 +870,124 @@ Model Arguments

+- **serve_model_weights**: bool
+
+    Default = False
+
+    If true, serve model weight pointers over a socket connection
+
+- **weight_server_port**: typing.Union[int, typing.List[int]]
+
+    Default = 6000
+
+    Port(s) to serve model weights over
+    If an integer is provided, the port for each GPU will be 6000 + global rank
+    If a list is provided, the ports will be used in order, e.g. rank0 will be weight_server_port[0]
+
+- **online_dataserver_ips**: typing.Union[str, typing.List[str]]
+
+    Default = localhost
+
+    ip addresses to connect to for online data serving, defaults to localhost
+
+- **online_dataserver_ports**: typing.Union[int, typing.List[int]]
+
+    Default = 10000
+
+    Port(s) to connect to for online data serving, defaults to 10000
+
+- **te_columnparallel**: bool
+
+    Default = False
+
+    Use TransformerEngine for RowParallelLinear layer.
+
+- **te_rowparallel**: bool
+
+    Default = False
+
+    Use TransformerEngine for ColumnParallelLinear layer.
+
+- **te_layernorm_mlp**: bool
+
+    Default = False
+
+    Use TransformerEngine for LayerNormMLP layer.
+
+- **te_mha**: bool
+
+    Default = False
+
+    Use TransformerEngine for MultiheadAttention layer.
```
||
- **te_fp8_format**: typing.Literal['e4m3', 'hybrid'] | ||
|
||
Default = hybrid | ||
|
||
Controls the FP8 data format used during forward and backward pass by TransformerEngine. | ||
Hybrid uses E4M3 during forward pass, E5M2 during backward pass. | ||
|
||
|
||
|
||
- **te_fp8_wgrad**: bool | ||
|
||
Default = True | ||
|
||
When set to False, override FP8 config options and do the wgrad computation | ||
in higher precision. | ||
|
||
|
||
|
||
- **te_fp8_amax_history_len**: int | ||
|
||
Default = 1 | ||
|
||
The length of the amax history window used for scaling factor computation. | ||
|
||
|
||
|
||
- **te_fp8_amax_compute_algo**: str | ||
|
||
Default = most_recent | ||
|
||
Algorithm used for choosing the `amax` value for the scaling factor computation. There are 2 | ||
predefined choices: `max` chooses the largest `amax` in the history window, while `most_recent` | ||
always chooses the most recently seen value. | ||
|
||
|
||
|
||
- **te_fp8_margin**: int | ||
|
||
Default = 0 | ||
|
||
Margin for the scaling factor computation. | ||
|
||
|
||
|
||
- **te_fp8_mha**: bool | ||
|
||
Default = False | ||
|
||
When set to True, use the FP8 implementation of Multi Head Attention. | ||
|
||
|
||
|
||
- **dim_att**: int | ||
|
||
Default = None | ||
|
```diff
@@ -866,6 +1011,7 @@ Model Arguments
     Dimension of the feed-forward network for RWKV. If not set, calculated based on hidden_size and expansion_factor.
+
 ## NeoXArgsOptimizer

 Optimizer Arguments
```
```diff
@@ -1095,14 +1241,6 @@ Misc. Arguments

-- **save_iters**: list
-
-    Default = None
-
-    Set during training
-
 - **global_num_gpus**: int

     Default = None
```
```diff
@@ -1307,6 +1445,14 @@ Text Generation arguments

+- **eval_task_limit**: int
+
+    Default = None
+
+    Limit the number of examples per lm_eval_harness task
+
 - **moe_top_k**: int

     Default = 1
```

A review comment on `eval_task_limit` notes: "This is the only new argument in this PR. The updates elsewhere to this file are from running" [comment truncated in source].
```diff
@@ -1727,19 +1873,19 @@ Training Arguments

-- **dataset_impl**: typing.Literal['gpt2', 'pairwise']
+- **dataset_impl**: typing.Literal['gpt2', 'pairwise', 'online']

     Default = gpt2

-    Dataset implementation, can be one of "gpt2" or "pairwise"
+    Dataset implementation, can be one of "gpt2", "pairwise", or "online"

-- **train_impl**: typing.Literal['normal', 'dpo', 'rm', 'kto']
+- **train_impl**: typing.Literal['normal', 'dpo', 'rm', 'kto', 'reinforce']

     Default = normal

-    Training implementation, can be one of "normal", "dpo", "kto", or "rm"
+    Training implementation, can be one of "normal", "dpo", "kto", "reinforce", or "rm"
```
```diff
@@ -1791,6 +1937,16 @@ Training Arguments

+- **z_loss**: float
+
+    Default = 0.0
+
+    Z-loss parameter, only implemented for RM training currently.
+    https://arxiv.org/pdf/2204.02311
+    https://arxiv.org/pdf/2309.10305
+
 - **kto_beta**: float

     Default = 0.1
```
```diff
@@ -1799,6 +1955,39 @@ Training Arguments

+- **fp32_reinforce**: bool
+
+    Default = True
+
+    Whether to cast logits to fp32 for Reinforce loss calculation.
+
+- **kl_impl**: typing.Literal['abs', 'mse', 'kl', 'full']
+
+    Default = mse
+
+    KL divergence implementation, can be one of "abs", "mse", "kl", or "full"
+
+- **kl_div_beta**: float
+
+    Default = 0.1
+
+    Beta value for KL divergence in Reinforce loss calculation.
+
+- **reinforce_leave_one_out**: bool
+
+    Default = False
+
+    Whether to use reinforce leave one out for training
+    (from https://arxiv.org/abs/2402.14740 and https://api.semanticscholar.org/CorpusID:198489118)
+
 - **allow_chopped**: bool

     Default = True
```
```diff
@@ -1875,7 +2064,7 @@ Training Arguments

-- **checkpoint_factor**: int
+- **checkpoint_factor**: typing.Union[int, float]

     Default = None
```