Issues with Training using SLURM/Distributed Learning #2075
              
Unanswered

dslisleedh asked this question in General

Replies: 1 comment 2 replies
-
@dslisleedh I'm not so sure your solution will end up faster, based on tests I performed in the past. Ignoring the warning, I'd be curious to see the actual throughput numbers between the two options.

EDIT: quick check on a convnext model running in convolutional mode: the norm impl you have above (which was, I think, close to the original impl for convnext) yields a throughput of just over 800 im/sec on a local distributed train test and does not have the stride bucket warning. My current implementation (with the warning) is 1500 im/sec.
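For reference, a minimal sketch of a permute-based `LayerNorm2d` of the kind being compared here, i.e. permute to NHWC, apply `F.layer_norm`, permute back (an illustration under that assumption, not necessarily timm's exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LayerNorm2dPermute(nn.LayerNorm):
    """LayerNorm for NCHW tensors applied via permute -> F.layer_norm -> permute.

    The permutes return non-contiguous views, but the fused F.layer_norm kernel
    only has to reduce over the last dimension, which is typically faster than
    a manual channels-first reduction.
    """

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.permute(0, 2, 3, 1)  # NCHW -> NHWC view
        x = F.layer_norm(x, self.normalized_shape, self.weight, self.bias, self.eps)
        return x.permute(0, 3, 1, 2)  # back to NCHW


# Example usage: norm = LayerNorm2dPermute(64); y = norm(torch.randn(2, 64, 56, 56))
```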
-
Hello, I would like to express my gratitude for your exceptional work.

However, I encountered some difficulties during training with SLURM and distributed learning. I opted not to use the training code from timm: my primary training code is based on BasicSR, and I only imported a few layers from timm (`LayerNorm2d` and `DropPath`). During training, the following warning appeared:

`UserWarning: Grad strides do not match bucket view strides.`

Consequently, the model's performance was poor, as expected. Given that my model is fully convolutional and doesn't involve memory-rearrangement operations, this initially puzzled me. After some investigation, I discovered that `LayerNorm2d` in timm uses `permute` without `contiguous`. To address this, I modified my code to use a custom `LayerNorm2d`, as shown below, and this change resolved the problem.

Although I am not certain how timm addresses this issue while supporting SLURM training, I wanted to share my solution for those who, like me, only use layers from timm.
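A minimal sketch of a channels-first `LayerNorm2d` along these lines, normalizing over the channel dimension directly so no `permute` is needed (an illustrative reconstruction in the spirit of the ConvNeXt channels-first norm, not necessarily the exact code from the question):

```python
import torch
import torch.nn as nn


class LayerNorm2d(nn.Module):
    """Channels-first LayerNorm for NCHW tensors, computed without permute.

    Normalizing directly over dim=1 keeps the output in the same contiguous
    memory layout as the input, avoiding the non-contiguous intermediates that
    the question identifies as the trigger for the DDP stride warning.
    """

    def __init__(self, num_channels: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(num_channels))
        self.bias = nn.Parameter(torch.zeros(num_channels))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W); normalize over the channel dimension
        u = x.mean(dim=1, keepdim=True)
        s = (x - u).pow(2).mean(dim=1, keepdim=True)
        x = (x - u) / torch.sqrt(s + self.eps)
        return self.weight[:, None, None] * x + self.bias[:, None, None]
```

Another option the same diagnosis suggests would be to keep the permute-based form but call `.contiguous()` after permuting back to NCHW, though (per the reply above) the throughput trade-off is worth measuring either way.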