kahfizulkifli

Description of PR

In our testing with the latest transformers-neuronx version 1ade6d7, we found a case similar to 69d039d in the MLP output calculation: using the Llama-3 model, the outputs with --tp_degree=1 and with --tp_degree=2 --collectives_layout="BSH" differ significantly from each other.

Steps to reproduce the bug:

  1. Set up the transformers-neuronx library on an AWS Trainium machine
  2. Download the Llama-3 pretrained model from Hugging Face: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct/tree/main
  3. Edit the config.json in the Llama-3 folder to use 1 layer ("num_hidden_layers": 1)
  4. Run the test script provided in two modes: baseline and distributed with collectives layout BSH
# Baseline mode
python llama_driver.py run <model_path_folder> --tp_degree=1
# Distributed mode
python llama_driver.py run <model_path_folder> --tp_degree=2 --collectives_layout="BSH" 
  5. Check that the results of the two runs differ from one another (see the comparison sketch below)
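
For example, assuming each run saves its logits with torch.save to hypothetical files logits_tp1.pt and logits_tp2_bsh.pt (the driver script does not do this out of the box), the two runs can be compared with a short sketch like:

# Hypothetical comparison: adjust the paths to however the logits are dumped
import torch

baseline = torch.load("logits_tp1.pt")          # from --tp_degree=1
distributed = torch.load("logits_tp2_bsh.pt")   # from --tp_degree=2 --collectives_layout="BSH"

# The two runs should agree within numerical tolerance; before the fix they do not
max_diff = (baseline - distributed).abs().max()
print(f"max |logit difference| = {max_diff:.6f}")
print("allclose:", torch.allclose(baseline, distributed, atol=1e-3, rtol=1e-3))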

Here are sample generated sequences and logits that we got from running the two modes.

Output and logits for the baseline

generated_sequence= tensor([[128000,   9906,     11,    358,   2846,    264,   4221,   1646,     11,
          93918, 104979,  98859,  19697,  75279, 116579,  35967,  74179,  44221,
          62243, 127834,  81366,    100,  20934,    246,  27105, 110590,  37986, … ]])

tensor([-3.1151, -1.2302, -0.7817,  ..., -0.7651, -0.7653, -0.7651])
tensor([-3.1151, -1.2302, -0.7817,  ..., -0.7651, -0.7653, -0.7651])
tensor([-3.1151, -1.2302, -0.7817,  ..., -0.7651, -0.7653, -0.7651])
tensor([-3.1151, -1.2302, -0.7817,  ..., -0.7651, -0.7653, -0.7651])

Output and logits for TP=2 with collectives layout BSH

generated_sequence= tensor([[128000,   9906,     11,    358,   2846,    264,   4221,   1646,     11,
          32084,  22580, 107442,  26792,  41875,  12857, 105316,  85889,  77251,
          13056,  72690,  53945, 116250,  97314,    354,  86216,  58562,  92932, … ]])

tensor([-2.1269,  0.7309, -0.6974,  ..., -0.8386, -0.8388, -0.8388])
tensor([-2.8376, -0.1400,  0.1552,  ..., -0.0865, -0.0867, -0.0868])
tensor([-2.5670,  0.1932,  0.2707,  ..., -0.1999, -0.1999, -0.2001])
tensor([-2.6546,  0.0942, -0.0927,  ..., -0.0946, -0.0946, -0.0948])

In 69d039d, the output of the dot is reshaped into (s, b, h) and then transposed with permutation (1, 0, 2) to obtain (b, s, h).

We propose a similar modification for the BSH collectives_layout path: the initial output of the dot is laid out as (s * b, h), not (b * s, h), so we need to convert (s * b, h) => (b, s, h) by reshaping to (s, b, h) and then transposing with (1, 0, 2).
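
For intuition, here is a small stand-alone sketch in plain PyTorch (not the actual HLO builder code in transformers-neuronx) showing that interpreting a sequence-major (s * b, h) buffer directly as (b * s, h) scrambles the batch dimension, while reshape-then-transpose recovers the correct (b, s, h) layout:

# Layout illustration only; the s, b, h values are arbitrary
import torch

s, b, h = 4, 2, 3
flat = torch.arange(s * b * h).reshape(s * b, h)   # rows ordered sequence-major, i.e. (s, b)

wrong = flat.reshape(b, s, h)                      # treats the buffer as (b * s, h): mixes batch entries
right = flat.reshape(s, b, h).permute(1, 0, 2)     # (s, b, h) -> transpose (1, 0, 2) -> (b, s, h)

print(torch.equal(wrong, right))  # False: the two interpretations disagree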

We tested the fix, and the outputs with --tp_degree=2 --collectives_layout="BSH" now match those with --tp_degree=1.

Your insights are very much appreciated. We will continue following up on this issue until it is resolved.

Credits to @wenboqian for providing the initial direction for detecting and fixing the bug.
