Fix 2.8 issue per sample grad #3460

svekars · 2025-07-14T20:22:59Z

Fixes https://github.com/pytorch/tutorials/actions/runs/16273680500/job/45947334370?pr=3416#step:9:2304

pytorch-bot · 2025-07-14T20:23:03Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/3460

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit d67bcb8 with merge base 9a44439 ():

NEW FAILURES - The following jobs have failed:

Build tutorials / pytorch_tutorial_build_worker (10, 15, linux.g5.4xlarge.nvidia.gpu) (gh)
Process completed with exit code 2.
Build tutorials / pytorch_tutorial_build_worker (2, 15, linux.g5.4xlarge.nvidia.gpu) (gh)
RuntimeError: Expected input at *args[0].shape[0] to be equal to 1, but got 2. If you meant for this dimension to be dynamic, please re-export and specify dynamic_shapes (e.g. with Dim.DYNAMIC)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

intermediate_source/per_sample_grads.py

svekars · 2025-07-17T17:50:53Z

@albanD can you take a look?

mikaylagawarecki

I'm quite confused by this change, as well as why this fails with the 2.8 RC :/ is the issue that we need to increase the tolerance?

mikaylagawarecki · 2025-07-17T20:01:58Z

intermediate_source/per_sample_grads.py

+    print(f"Parameter {name}: max difference = {max_diff}")
+
+    # Optional: still assert for very large differences that might indicate real problems
+    assert max_diff < 0.5, f"Extremely large difference in {name}: {max_diff}"


Why did we change this to not use allclose anymore?

mikaylagawarecki · 2025-07-17T20:11:50Z

intermediate_source/per_sample_grads.py

+
+    # Print differences instead of asserting
+    max_diff = (per_sample_grad - ft_per_sample_grad).abs().max().item()
+    print(f"Parameter {name}: max difference = {max_diff}")


Why are we printing this? On a side note, could you share what this prints with the 2.8 RC?

mikaylagawarecki · 2025-07-17T20:12:13Z

intermediate_source/per_sample_grads.py

+    idx = list(model.named_parameters()).index((name, model.get_parameter(name)))
+    per_sample_grad = per_sample_grads[idx]
+
+    # Check if shapes match and reshape if needed
+    if per_sample_grad.shape != ft_per_sample_grad.shape and per_sample_grad.numel() == ft_per_sample_grad.numel():
+        ft_per_sample_grad = ft_per_sample_grad.view(per_sample_grad.shape)


I'm a bit confused by this part

Is the issue that torch.allclose now fails due to the ordering of per_sample_grad and ft_per_sample_grad being different during zip?

Fix 2.8 issue per sample grad

280521f

meta-cla bot added the cla signed label Jul 14, 2025

Update per_sample_grads.py

19e68c8

svekars changed the base branch from main to RC-TEST-2.8 July 14, 2025 21:09

svekars added the 2.8 label Jul 14, 2025

Fix pendulum.py issues, updated to use newer APIs

311059c

svekars commented Jul 14, 2025

View reviewed changes

intermediate_source/per_sample_grads.py Show resolved Hide resolved

svekars and others added 4 commits July 14, 2025 16:03

Update intermediate_source/per_sample_grads.py

9cce023

Update per_sample_grads.py

2d8bda9

Update per_sample_grads.py

5fc349e

Update per_sample_grads.py

d67bcb8

svekars requested a review from albanD July 17, 2025 17:50

mikaylagawarecki reviewed Jul 17, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix 2.8 issue per sample grad #3460

Fix 2.8 issue per sample grad #3460

svekars commented Jul 14, 2025

Uh oh!

pytorch-bot bot commented Jul 14, 2025 •

edited

Loading

Uh oh!

Uh oh!

svekars commented Jul 17, 2025

Uh oh!

mikaylagawarecki left a comment

Uh oh!

mikaylagawarecki Jul 17, 2025

Uh oh!

mikaylagawarecki Jul 17, 2025

Uh oh!

mikaylagawarecki Jul 17, 2025

Uh oh!

Uh oh!

Fix 2.8 issue per sample grad #3460

Are you sure you want to change the base?

Fix 2.8 issue per sample grad #3460

Conversation

svekars commented Jul 14, 2025

Uh oh!

pytorch-bot bot commented Jul 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/3460

❌ 2 New Failures

Uh oh!

Uh oh!

svekars commented Jul 17, 2025

Uh oh!

mikaylagawarecki left a comment

Choose a reason for hiding this comment

Uh oh!

mikaylagawarecki Jul 17, 2025

Choose a reason for hiding this comment

Uh oh!

mikaylagawarecki Jul 17, 2025

Choose a reason for hiding this comment

Uh oh!

mikaylagawarecki Jul 17, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

pytorch-bot bot commented Jul 14, 2025 •

edited

Loading