@CYHSM CYHSM commented Oct 27, 2025

What does this PR do?

Adds QK-Norm to the self-attention block. Without QK-Norm, the attention logits are computed as (Q @ K^T) / sqrt(d_h), which is equivalent to (||q_i|| * ||k_j|| * cos(θ_ij)) / sqrt(d_h) via the geometric form of the dot product. This means the model can increase the spread between logits either by scaling the q or k vectors (magnitude) or by adjusting the angle between them (direction). QK-Norm constrains the magnitude updates and steers the model towards directional updates, which improves training stability (see this paper for more details).
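For reference, a minimal sketch of the idea, assuming a generic PyTorch attention module (the class and argument names are illustrative, not the repo's actual ones; only use_qk_norm and the switch to nn.RMSNorm come from this PR):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Illustrative self-attention block with optional QK-Norm."""

    def __init__(self, d_model: int, n_heads: int, use_qk_norm: bool = False):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.proj = nn.Linear(d_model, d_model, bias=False)
        # QK-Norm: normalize queries and keys per head before the dot product,
        # so logit magnitude is bounded and updates are pushed towards direction.
        self.q_norm = nn.RMSNorm(self.d_head) if use_qk_norm else nn.Identity()
        self.k_norm = nn.RMSNorm(self.d_head) if use_qk_norm else nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (b, t, d_model) -> (b, n_heads, t, d_head)
        q = q.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        q, k = self.q_norm(q), self.k_norm(k)  # no-op when use_qk_norm=False
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(out.transpose(1, 2).reshape(b, t, -1))
```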

Here are the results for the runs with and without QK-Norm (r2 denotes a second run; s = slow and f = fast RMSNorm variants; the g_* columns are gradient statistics). The first entry runs at around 25 samples/s with torch > 2.9.0.

| run | lr | qk-norm | norm | loss | g_mean | g_max | g_std | samples/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qk_rms | 1.5e-4 | X | RMS(s) | 3.219 | 0.726 | 15.75 | 0.336 | 15.12 |
| qk_ln_r2 | 1.5e-4 | X | LN | 3.229 | 0.777 | 79.09 | 1.085 | 19.11 |
| base_ln_r2 | 1.5e-4 |  | LN | 3.266 | 0.820 | 64.74 | 1.067 | 22.90 |
| qk_ln | 1.5e-4 | X | LN | 3.328 | 0.883 | 95.37 | 1.348 | 20.56 |
| base_ln | 1.5e-4 |  | LN | 3.349 | 0.979 | 127.36 | 1.745 | 23.51 |
| qk_rms_lre3 | 1.5e-3 | X | RMS(f) | 3.381 | 0.273 | 40.39 | 0.621 | 24.80 |
| qk_rms_lre2 | 1.5e-2 | X | RMS(f) | 3.788 | 0.375 | 113.46 | 1.241 | 24.86 |
| base_lre3 | 1.5e-3 |  | RMS(f) | 5.906 | 42.87 | 15521.78 | 190.23 | 27.80 |
| base_lre2 | 1.5e-2 |  | RMS(f) | 6.723 | 3.612 | 281.83 | 7.516 | 28.15 |

Compiled:

| run | lr | qk-norm | norm | loss | g_mean | g_max | g_std | samples/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| base_ln | 1.5e-4 |  | LN | - | - | - | - | 29.1 |
| base_rms | 1.5e-4 |  | RMS(f) | - | - | - | - | 30.8 |
| qk_ln | 1.5e-4 | X | LN | - | - | - | - | 26.3 |
| qk_rms | 1.5e-4 | X | RMS(f) | - | - | - | - | 28.2 |

And the loss curves for the extreme LR values:
[figure: loss curves for the 1.5e-3 and 1.5e-2 runs]

General Changes

  • Add QK norm to the config parameters
  • Add the QK norm calculation to the attention blocks
  • Remove the manual RMSNorm implementation and replace it with PyTorch's nn.RMSNorm
  • Add a test that checks whether the output with and without QK norm differs (see the sketch after this list)
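A sketch of what such a test could look like, reusing the hypothetical SelfAttention module from above (the actual test in the repo may be structured differently):

```python
import torch

def test_qk_norm_changes_output():
    torch.manual_seed(0)
    x = torch.randn(2, 8, 64)
    base = SelfAttention(d_model=64, n_heads=4, use_qk_norm=False)
    qk = SelfAttention(d_model=64, n_heads=4, use_qk_norm=True)
    # Copy the shared projection weights so the normalization is the only difference.
    qk.load_state_dict(base.state_dict(), strict=False)
    with torch.no_grad():
        assert not torch.allclose(base(x), qk(x))
```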

Breaking Changes

  • Configs need to be updated to include the new use_qk_norm option, although it defaults to false, so existing configs keep their current behavior
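For illustration, the new option slots into the attention config roughly like this (the surrounding field names are hypothetical; only use_qk_norm comes from this PR):

```python
from dataclasses import dataclass

@dataclass
class AttentionConfig:
    # Hypothetical config shape; the real keys may differ apart from use_qk_norm.
    n_heads: int = 8
    d_model: int = 512
    use_qk_norm: bool = False  # defaults to False, so old configs keep their behavior
```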

Checklist before submitting final PR

  • My PR is minimal and addresses one issue in isolation
  • I have merged the latest version of the target branch into this feature branch
  • I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
  • I have run a sample config for model training
  • I have checked that all tests run through (python tests/tests.py) - some still fail, possibly due to the torch nightly build (checking now)
  • I have updated the internal changelog (CHANGELOG_DEV.md)
