
Training the anomaly generation module crashed at the very end #112

@DqqHns

Description


Has anyone else run into this? I was training the anomaly generation module and it only crashed at the very end, after 6 days of training:

Epoch 428: 100%|█| 2331/2331 [18:55<00:00, 2.05it/s, loss=0.00671, v_num=0, train/loss_simple_step=0.00161, train/loss_vlb_step=5.32e-6, train/loss_step=0.00161, global_
Average Epoch time: 1135.78 seconds
Average Peak memory 17159.02MiB
Epoch 429: 0%| | 0/2331 [00:00<?, ?it/s, loss=0.00671, v_num=0, train/loss_simple_step=0.00161, train/loss_vlb_step=5.32e-6, train/loss_step=0.00161, global_step=1e+6, t
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
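The tokenizers warning above is harmless here, but it can be silenced as the message suggests by setting the environment variable before any tokenizer is created. A minimal sketch (the placement at the top of main.py is an assumption):

```python
import os

# Set before any Hugging Face tokenizer is instantiated, so that
# forked DataLoader workers do not trigger the parallelism guard.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```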
Epoch 429: 18%|▏| 430/2331 [03:13<14:16, 2.22it/s, loss=0.00793, v_num=0, train/loss_simple_step=0.00195, train/loss_vlb_step=6.72e-6, train/loss_step=0.00195, global_st
Epoch 429, global step 999999: val/loss_simple_ema was not in top 1
Average Epoch time: 197.85 seconds
Average Peak memory 17159.02MiB
Epoch 429: 18%|▏| 430/2331 [03:18<14:36, 2.17it/s, loss=0.00793, v_num=0, train/loss_simple_step=0.00195, train/loss_vlb_step=6.72e-6, train/loss_step=0.00195, global_st
Saving latest checkpoint...

Traceback (most recent call last):
File "main.py", line 871, in
trainer.test(model, data)
File "/root/miniconda3/envs/Anomalydiffusion/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 904, in test
return self._call_and_handle_interrupt(self._test_impl, model, dataloaders, ckpt_path, verbose, datamodule)
File "/root/miniconda3/envs/Anomalydiffusion/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 682, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/root/miniconda3/envs/Anomalydiffusion/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 947, in _test_impl
results = self._run(model, ckpt_path=self.tested_ckpt_path)
File "/root/miniconda3/envs/Anomalydiffusion/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1115, in _run
verify_loop_configurations(self, model)
File "/root/miniconda3/envs/Anomalydiffusion/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py", line 38, in verify_loop_configurations
__verify_eval_loop_configuration(trainer, model, "test")
File "/root/miniconda3/envs/Anomalydiffusion/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py", line 182, in __verify_eval_loop_configuration
raise MisconfigurationException(f"No {loader_name}() method defined to run Trainer.{trainer_method}.")
pytorch_lightning.utilities.exceptions.MisconfigurationException: No test_dataloader() method defined to run Trainer.test.
(Anomalydiffusion) root@autodl-container-857740b613-a811c0e1:~/autodl-tmp/anomalydiffusion-master#
