Closed
Description
Describe the bug
What the bug is and how to reproduce it, preferably with screenshots.
NPROC_PER_NODE=8 \
swift rlhf \
--rlhf_type dpo \
--model ${MODEL} \
--train_type full \
--dataset ${dataset} \
--load_from_cache_file true \
--split_dataset_ratio 0.01 \
--torch_dtype bfloat16 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--learning_rate 1e-5 \
--gradient_accumulation_steps 1 \
--eval_steps 160 \
--save_steps 160 \
--logging_steps 5 \
--warmup_ratio 0.05 \
--dataloader_num_workers 8 \
--dataset_num_proc 8 \
--save_total_limit 10 \
--save_only_model true \
--output_dir ${SAVE} \
--deepspeed zero3 \
--attn_impl flash_attn \
--max_length 131072 \
--use_liger_kernel true \
--sequence_parallel_size 2
Training with the script above, the train loop runs fine, but the eval loop fails with the following error:
[rank0]: File "/cpfs02/user/zhengyuxiang.zyx/ms-swift/swift/cli/rlhf.py", line 7, in <module>
[rank0]: rlhf_main()
[rank0]: File "/cpfs02/user/zhengyuxiang.zyx/ms-swift/swift/llm/train/rlhf.py", line 233, in rlhf_main
[rank0]: return SwiftRLHF(args).main()
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/cpfs02/user/zhengyuxiang.zyx/ms-swift/swift/llm/base.py", line 49, in main
[rank0]: result = self.run()
[rank0]: ^^^^^^^^^^
[rank0]: File "/cpfs02/user/zhengyuxiang.zyx/ms-swift/swift/ray/base.py", line 170, in wrapper
[rank0]: return func(self, *args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/cpfs02/user/zhengyuxiang.zyx/ms-swift/swift/llm/train/sft.py", line 206, in run
[rank0]: return self.train(trainer)
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/cpfs02/user/zhengyuxiang.zyx/ms-swift/swift/llm/train/sft.py", line 254, in train
[rank0]: trainer.train(trainer.args.resume_from_checkpoint)
[rank0]: File "/cpfs02/user/zhengyuxiang.zyx/ms-swift/swift/trainers/mixin.py", line 815, in train
[rank0]: res = super().train(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/site-packages/transformers/trainer.py", line 2325, in train
[rank0]: return inner_training_loop(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/site-packages/transformers/trainer.py", line 2756, in _inner_training_loop
[rank0]: self._maybe_log_save_evaluate(
[rank0]: File "/cpfs02/user/zhengyuxiang.zyx/ms-swift/swift/trainers/mixin.py", line 888, in _maybe_log_save_evaluate
[rank0]: super()._maybe_log_save_evaluate(tr_loss, *args, **kwargs)
[rank0]: File "/usr/local/lib/python3.11/site-packages/transformers/trainer.py", line 3221, in _maybe_log_save_evaluate
[rank0]: metrics = self._evaluate(trial, ignore_keys_for_eval)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/site-packages/transformers/trainer.py", line 3170, in _evaluate
[rank0]: metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/site-packages/transformers/trainer.py", line 4489, in evaluate
[rank0]: output = eval_loop(
[rank0]: ^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py", line 1919, in evaluation_loop
[rank0]: initial_output = super().evaluation_loop(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/site-packages/transformers/trainer.py", line 4685, in evaluation_loop
[rank0]: losses, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/cpfs02/user/zhengyuxiang.zyx/ms-swift/swift/trainers/rlhf_trainer/dpo_trainer.py", line 185, in prediction_step
[rank0]: return super().prediction_step(model, inputs, *args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py", line 1846, in prediction_step
[rank0]: loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="eval")
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py", line 1684, in get_batch_loss_metrics
[rank0]: model_output = self.concatenated_forward(model, batch)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/cpfs02/user/zhengyuxiang.zyx/ms-swift/swift/trainers/rlhf_trainer/dpo_trainer.py", line 109, in concatenated_forward
[rank0]: per_token_logps, mean_all_logits, loss_mask = self.get_per_token_logps(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/cpfs02/user/zhengyuxiang.zyx/ms-swift/swift/trainers/rlhf_trainer/rlhf_mixin.py", line 139, in get_per_token_logps
[rank0]: raise ValueError(f'Logits (batch and sequence length dim) {logits.shape[:-1]}'
[rank0]: ValueError: Logits (batch and sequence length dim) torch.Size([4, 2654])and labels must have the same shape {labels.shape}
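A side note on the message itself: it prints `{labels.shape}` literally and has no space before "and", so the actual label shape is missing from the log. That pattern is what Python produces when two adjacent string literals are concatenated and only the first one carries the `f` prefix. The sketch below reproduces the symptom with plain tuples; the function names and the label shape `(4, 100)` are hypothetical and are not taken from ms-swift, and the real code at `rlhf_mixin.py` line 139 may differ.

```python
def buggy_message(logits_shape, labels_shape):
    # Only the first literal is an f-string. The second is a plain string,
    # so "{labels_shape}" is emitted verbatim instead of being interpolated.
    return (f'Logits (batch and sequence length dim) {logits_shape}'
            'and labels must have the same shape {labels_shape}')


def fixed_message(logits_shape, labels_shape):
    # Both literals are f-strings, and a trailing space separates the halves.
    return (f'Logits (batch and sequence length dim) {logits_shape} '
            f'and labels must have the same shape {labels_shape}')


def check_shapes(logits_shape, labels_shape):
    # logits_shape is assumed to be (batch, seq_len, vocab);
    # labels_shape is assumed to be (batch, seq_len).
    if tuple(logits_shape[:-1]) != tuple(labels_shape):
        raise ValueError(fixed_message(tuple(logits_shape[:-1]), tuple(labels_shape)))


if __name__ == '__main__':
    # Hypothetical label shape, chosen only to trigger the mismatch.
    print(buggy_message((4, 2654), (4, 100)))
    print(fixed_message((4, 2654), (4, 100)))
```

With the second literal interpolated, the log would show both shapes, which would make it easier to tell whether the labels were split along the sequence dimension in the same way as the logits during evaluation.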
Your hardware and system info
Write your system info here, such as CUDA version, OS, GPU model, and torch version.
CUDA 12.2, Python 3.11, GPU H20, torch 2.8, ms-swift 3.11.0.dev0
Additional context
Add any other context about the problem here.