Inconsistent Training Throughput Across Epochs #2449
              
caojiaolong asked this question in Q&A (Unanswered)

I attempted to reproduce MobileNetV4 using this [configuration](https://gist.github.com/rwightman/f6705cb65c03daeebca8aa129b1b94ad#file-mnv4_hm_r384_e550_ix_gpu8-yaml) on ImageNet-1K with 8×RTX 3090 GPUs and 100 CPUs. However, I noticed that certain epochs and iterations experience significantly lower throughput than others. Is this behavior expected?

In the screenshots below, the throughput during epochs 114 and 115 is noticeably lower than in epoch 116—around 2K images per second instead of the usual 4K. This slowdown occurs randomly in other epochs as well.

My training script:

Has anyone encountered similar issues or found a potential cause?
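One way to narrow down the cause is to time the input pipeline on its own: if iterating the DataLoader alone shows the same intermittent dips, the slowdown is in data loading (disk or shared-storage throughput, page cache pressure, worker contention) rather than on the GPUs. Below is a minimal sketch, assuming a standard torchvision `ImageFolder` pipeline; the dataset path, batch size, and worker count are placeholders to adjust to the actual setup.

```python
import time

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Placeholders -- adjust to the real dataset location and loader settings.
DATA_DIR = "/path/to/imagenet/train"
BATCH_SIZE = 256
NUM_WORKERS = 12
WINDOW = 50        # batches per measurement window
MAX_BATCHES = 1000  # enough batches to see whether dips recur

def main():
    transform = transforms.Compose([
        transforms.RandomResizedCrop(384),  # 384px, per the config name
        transforms.ToTensor(),
    ])
    dataset = datasets.ImageFolder(DATA_DIR, transform=transform)
    loader = DataLoader(
        dataset,
        batch_size=BATCH_SIZE,
        shuffle=True,
        num_workers=NUM_WORKERS,
        pin_memory=True,
        persistent_workers=True,
    )

    # Iterate the loader with no model attached and report images/sec over
    # fixed windows; sustained dips here point at the input pipeline.
    start = time.time()
    for i, (images, targets) in enumerate(loader):
        if (i + 1) % WINDOW == 0:
            elapsed = time.time() - start
            print(f"batches {i + 1 - WINDOW}-{i}: "
                  f"{WINDOW * BATCH_SIZE / elapsed:.0f} images/sec")
            start = time.time()
        if i + 1 >= MAX_BATCHES:
            break

if __name__ == "__main__":
    main()
```

If the loader-only numbers stay flat while the full training run still dips, other things worth checking are jobs sharing the node, GPU clocks and thermals (`nvidia-smi dmon`), and periodic work such as checkpointing or per-epoch evaluation.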
              
            Replies: 2 comments
- I also plotted the average speed (epochs per minute) and noticed that the speed gradually decreases as training progresses—a very strange phenomenon.
- It appears that the training speed fluctuates—starting fast, then slowing down, and then speeding up again.
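To make the per-epoch trend described in the comments above easier to correlate with system state, it can help to log wall-clock duration and images/sec for every epoch to a file. A minimal sketch; `run_one_epoch`, `loader`, and the surrounding loop are hypothetical placeholders for the real training code:

```python
import json
import time

def log_epoch_throughput(epoch, num_images, epoch_seconds, logfile="epoch_speed.jsonl"):
    """Append one record per epoch so slow epochs can later be lined up
    against nvidia-smi / iostat / job-scheduler logs from the same window."""
    record = {
        "epoch": epoch,
        "seconds": round(epoch_seconds, 1),
        "images_per_sec": round(num_images / epoch_seconds, 1),
        "finished_at": time.strftime("%Y-%m-%d %H:%M:%S"),
    }
    with open(logfile, "a") as f:
        f.write(json.dumps(record) + "\n")

# Hypothetical usage inside the training loop -- run_one_epoch and loader
# stand in for the real epoch step and DataLoader:
#
# for epoch in range(num_epochs):
#     t0 = time.time()
#     run_one_epoch(model, loader, optimizer)
#     log_epoch_throughput(epoch, len(loader.dataset), time.time() - t0)
```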
            
  