[DRAFT] Implement EMA of the model #1005

sophie-xhonneux · 2025-09-29T16:31:36Z

Description

Keeps an exponential moving average (EMA) of the model weights. This is a standard machine learning trick to improve results, and is particularly common in diffusion models.
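For readers unfamiliar with the technique, here is a minimal sketch of an EMA update in plain Python. It is not the code in this PR: the class name `EmaWeights`, the `decay` parameter, and the `shadow` attribute are hypothetical and chosen only for illustration, and parameters are modelled as lists of floats rather than tensors.

```python
from typing import Dict, List

# Hypothetical stand-in for a model's parameters: name -> flat list of values.
Params = Dict[str, List[float]]


class EmaWeights:
    """Keeps a shadow copy of parameters, updated as an exponential moving average."""

    def __init__(self, params: Params, decay: float = 0.999) -> None:
        if not 0.0 <= decay < 1.0:
            raise ValueError("decay must be in [0, 1)")
        self.decay = decay
        # Copy the values so later changes to the live parameters do not leak in.
        self.shadow: Params = {name: list(values) for name, values in params.items()}

    def update(self, params: Params) -> None:
        """Apply shadow <- decay * shadow + (1 - decay) * params, elementwise."""
        d = self.decay
        for name, values in params.items():
            shadow = self.shadow[name]
            for i, value in enumerate(values):
                shadow[i] = d * shadow[i] + (1.0 - d) * value


# Toy usage: pretend one optimizer step moved the live weights.
live = {"w": [0.0, 0.0]}
ema = EmaWeights(live, decay=0.5)
live["w"] = [1.0, 2.0]
ema.update(live)
print(ema.shadow["w"])
```

In a training loop, `update` would typically be called once after each optimizer step, and the shadow weights used for evaluation or sampling. How the averaged weights should be kept consistent across ranks under FSDP2 is left open here, since the PR has only been tested on 1 GPU.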

Issue Number

#1004

Is this PR a draft? Mark it as draft.

Checklist before asking for review

I have performed a self-review of my code
My changes comply with basic sanity checks:
- I have fixed formatting issues with ./scripts/actions.sh lint
- I have run unit tests with ./scripts/actions.sh unit-test
- I have documented my code and I have updated the docstrings.
- I have added unit tests, if relevant
I have tried my changes with data and code:
- I have run the integration tests with ./scripts/actions.sh integration-test
- (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
- (bigger changes and experiments) I have shared a hedgedoc in the github issue with all the configurations and runs for these experiments
I have informed and aligned with people impacted by my change:
- for config changes: the MatterMost channels and/or a design doc
- for changes of dependencies: the MatterMost software development channel


sophie-xhonneux · 2025-09-29T16:32:28Z

This is based on FSDP2. It is still very much a draft and has only been tested on 1 GPU; the multi-GPU case is a TODO.

sophie-xhonneux and others added 30 commits August 5, 2025 12:01

Save current state

cf1f829

Save current state

5017f46

Barebone FSDP2 prototype TODO save checkpoints

305658e

First version of saving model

1db9e19

Fix save_model

38226bd

Merge branch 'develop' into sophiex/dev/fsdp2

f9183d9

Log everything and log to files

24e865b

Remove redundant path creation

f2562b9

Allow for both slurm and torchrun + fewer log files

3eb5bec

Cleaning up init_ddp

3ba291b

Ruff

7f0a088

Attempt to avoid duplicate logging

748021b

FSDP2 with mixed precision policy

181d170

Ruff

0176658

Clean up and logging

44e1062

Try to get loggers to behave as we want

cb58cda

Makes ruff unhappy but works

f877ab0

Fixed ruff issue

a398ffa

Fixed problems with multi-node training.

2f8ab49

Fix for interactive/non-DDP runs

27bd8ba

No idea why, but this seems to work so far

c4b47c4

Committing simply so it is saved, obviously needs cleanup

Still works! So which is it memory or the grad scaler?

4b0fd83

Also still works, I now strongly suspect the amp.gradscaler

ca4e56a

This still works, I have no clue anymore why but whatever it works

f4ecf2c

now....

Enable loading model from absolute paths

6426614

Enable loading for 1 GPU only

df97c31

Fix 1 GPU train continue

0669dc1

Merge branch 'develop' into sophiex/dev/fsdp2

9426c0f

Appease ruff

beceba2

Fix saving the model more regularly and perf logging

ee7e619

clessig and others added 9 commits September 22, 2025 11:52

Fixed problem when training with 2 nodes.

3b3a754

Fix data loader seed

76ac336

Appease ruff

fecfe66

Shouldn't overwrite with_fsdp like this

5170ea5

Potential fix for FSDP2 issue

a092f05

with different ranks using different model parts

Fix loss scaling and logging of dummy data loss

f90b030

Clean up

0bed983

Appease ruff

7790924

Start implementing EMA, works for 1 GPU

5495860

github-project-automation bot added this to WeatherGen-dev Sep 29, 2025
