A bunch of issues have gone a bit stale, and @SunMarc + @muellerzr are short on bandwidth!
We would therefore love community support to resolve the following:
Help needed
- [2023-12-04 11:52:08,378] [INFO] [autotuner.py:1110:run_after_tuning] No optimal DeepSpeed configuration found by autotuning. #27830
- Training hangs at the first gradient syncing of an MoE model while using deepspeed #30911
- DDP error with load_best_model_at_end enabled #30702
- Trainer do not move the model to GPU when doing evaluation with FSDP #30239
- Please correct the following DeepSpeed config values that mismatch TrainingArguments values: scheduler.params.total_num_steps=0 vs hf num_training_steps (calculated)= 260 #29348
- Failed to load universal_checkpoint with deepspeed integreation #33157
- `dataloader_persistent_workers=True` causes fork-bomb due to repeated creation of `eval_dataloader` #28469 (see the sketch after this list)
- Question about quantized model with zero3 #30663
- Cannot restore FSDP checkpoint with LOCAL_STATE_DICT #30811
- torchrun breaks with load_model_at_end and with metric_for_best_model=eval_f1 on question_answering example #30819
- Galore finetuning #stopped #31313
- Trainer having issues with DataLoaderShard when running with torchrun #31457
- Cache problem while runing on multiple nodes with GPU #30859
- RuntimeError: Expected a 'mps:0' generator device but found 'cpu' #31897
- [Finetuning OneFormer] Seems not to use multiple GPUs, with both DataParallel and Accelerate #30340
- Load fsdp+lora checkpoint error #31892
- Observed_masks not behaving as expected #28914
- resume_from_checkpoint may still fail with auto_find_batch_size #29518
- Jamba-v01 Model + Deepspeed Zero3 lead to "RuntimeError: Detected mismatch between collectives on ranks." #30277
- Weird text encoder NaNs specifically for FSDP + multi GPU #33376
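For context on the `dataloader_persistent_workers` item above (#28469): the report is that `Trainer` builds a fresh eval dataloader on every evaluation, so persistent workers accumulate over a run. Below is a minimal, hedged sketch of the configuration involved, assuming a recent transformers release where `eval_strategy` is the accepted argument name; `output_dir="out"` and the step values are placeholders, and the full reproduction lives in the linked issue.

```python
# Minimal sketch of the configuration #28469 is about (not the full reproduction).
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",                    # placeholder output directory
    dataloader_num_workers=4,            # persistent workers only matter with >0 workers
    dataloader_persistent_workers=True,  # keeps DataLoader worker processes alive between uses
    eval_strategy="steps",               # Trainer builds an eval dataloader at each evaluation
    eval_steps=50,                       # placeholder evaluation interval
)
# Reported behaviour: because Trainer constructs a new eval DataLoader on each
# evaluation call, persistent workers from earlier evaluations are never reclaimed,
# so worker processes pile up over a long training run.
```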
Replied with a potential fix and being followed up
- Cannot find the best model after training #31734 followed by @irislin1006
- DeepSpeed ZeRO stage3+Qwen2/Qwen2-57B-A14B-Instruct: RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu) #32312 followed by @irislin1006
- [Bug - I think] `data_seed` in `TrainingArguments` is unused #31818 followed by @MekkCyber
- Memory leak when using CLIPTextModel #31439 followed by @nnilayy
- [Trainer.train] learning rate logging inconsistency: learning rate for the future step is logged #28124 followed by @muellerzr and @WizKnight
- Issues occuring during parallel evaluation (using Trainer.evaluate) #30767 followed by @SunMarc
- Multi-GPU setup: indices should be either on cpu or on the same device as the indexed tensor (cuda:1) #33147 followed by @SunMarc
- Encounter error when loading checkpoint generated by latest accelerate>=0.34.0 #33400 followed by @SunMarc
- `resume_from_checkpoint` function fails because "There seems to be not a single sample in your epoch_iterator" #26413 followed by @muupan, @muellerzr and @SunMarc
- It's an AlignModel or Deepspeed Zero3 bug. #28808 followed by @Ben-Schneider-code
- Batch is empty when fine-tuning flan-t5 using LoRA #31357 followed by @MekkCyber
- Trainer doesn't save evaluation metrics. #33733 resolved by the author
- Error on TPU: Invalid --2a886c8_slice_builder_worker_addresses specified. Expected 4 worker addresses, got 1. #30330 followed by @SunMarc
- Title: CUDA RuntimeError: Unspecified Launch Failure during Training #30913 followed by @MekkCyber
- ERROR in run_hp_search_optuna when trying to use multi-GPU #27487 followed by @SunMarc
- Training Resumes with Increased Loss Despite Checkpoint Loading #33336 followed by @MekkCyber
- tensor size mismatch with larger gradient_accumulation_steps and fewer training data #25695 followed by @MekkCyber
- Stuck on Initializing Transformers Model with FSDP (Fully Sharded Data Parallel) using meta device #31278 followed by @muellerzr
- error need either state_dict or a save folder #32427 followed by @muellerzr and @SunMarc
- Using accelerate launch FDSP cause weight saved after 2nd time onwards to be incomplete #31034 followed by @muellerzr
- Resuming from checkpoint runs into OOM #30822 followed by @muellerzr
- Transformer.Trainer fails in creating optimizer for optim adamw_torch_fused when launched with deepspeed. #31867 followed by @Ben-Schneider-code and @SunMarc