kahfizulkifli

Description of PR

In our testing with the latest transformers-neuronx version 1ade6d7, we found a case similar to 69d039d in the MLP output calculation: using the Llama-3 model, the outputs with --tp_degree=1 and with --tp_degree=2 --collectives_layout="BSH" differ significantly from each other.

Steps to reproduce the bug:

  1. Set up the transformers-neuronx library on an AWS Trainium machine
  2. Download the Llama-3 pretrained model from Hugging Face: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct/tree/main
  3. Edit the config.json in the Llama-3 folder to use 1 layer ("num_hidden_layers": 1)
  4. Run the test script provided in two modes: baseline and distributed with collectives layout BSH
# Baseline mode
python llama_driver.py run <model_path_folder> --tp_degree=1
# Distributed mode
python llama_driver.py run <model_path_folder> --tp_degree=2 --collectives_layout="BSH" 
  5. Check that the results of the two runs differ from one another (see the comparison sketch below)
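
For example, assuming each run saves its logits with torch.save to hypothetical files logits_tp1.pt and logits_tp2_bsh.pt (the driver script does not do this out of the box), the two runs can be compared with a short sketch like:

# Hypothetical comparison: adjust the paths to however the logits are dumped
import torch

baseline = torch.load("logits_tp1.pt")          # from --tp_degree=1
distributed = torch.load("logits_tp2_bsh.pt")   # from --tp_degree=2 --collectives_layout="BSH"

# The two runs should agree within numerical tolerance; before the fix they do not
max_diff = (baseline - distributed).abs().max()
print(f"max |logit difference| = {max_diff:.6f}")
print("allclose:", torch.allclose(baseline, distributed, atol=1e-3, rtol=1e-3))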

Here are sample generated sequences and logits that we got from running the two modes.

Output and logits for the baseline

generated_sequence= tensor([[128000,   9906,     11,    358,   2846,    264,   4221,   1646,     11,
          93918, 104979,  98859,  19697,  75279, 116579,  35967,  74179,  44221,
          62243, 127834,  81366,    100,  20934,    246,  27105, 110590,  37986, … ]])

tensor([-3.1151, -1.2302, -0.7817,  ..., -0.7651, -0.7653, -0.7651])
tensor([-3.1151, -1.2302, -0.7817,  ..., -0.7651, -0.7653, -0.7651])
tensor([-3.1151, -1.2302, -0.7817,  ..., -0.7651, -0.7653, -0.7651])
tensor([-3.1151, -1.2302, -0.7817,  ..., -0.7651, -0.7653, -0.7651])

Output and logits for TP=2 with collectives layout BSH

generated_sequence= tensor([[128000,   9906,     11,    358,   2846,    264,   4221,   1646,     11,
          32084,  22580, 107442,  26792,  41875,  12857, 105316,  85889,  77251,
          13056,  72690,  53945, 116250,  97314,    354,  86216,  58562,  92932, … ]])

tensor([-2.1269,  0.7309, -0.6974,  ..., -0.8386, -0.8388, -0.8388])
tensor([-2.8376, -0.1400,  0.1552,  ..., -0.0865, -0.0867, -0.0868])
tensor([-2.5670,  0.1932,  0.2707,  ..., -0.1999, -0.1999, -0.2001])
tensor([-2.6546,  0.0942, -0.0927,  ..., -0.0946, -0.0946, -0.0948])

In 69d039d, the output of the dot is reshaped into (s, b, h) and then transposed with permutation (1, 0, 2) to obtain (b, s, h).

We propose a similar modification for the BSH collectives_layout path: the initial output of the dot is laid out as (s * b, h), not (b * s, h), so we need to convert (s * b, h) => (b, s, h) by reshaping to (s, b, h) and then transposing with (1, 0, 2).
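
For intuition, here is a small stand-alone sketch in plain PyTorch (not the actual HLO builder code in transformers-neuronx) showing that interpreting a sequence-major (s * b, h) buffer directly as (b * s, h) scrambles the batch dimension, while reshape-then-transpose recovers the correct (b, s, h) layout:

# Layout illustration only; the s, b, h values are arbitrary
import torch

s, b, h = 4, 2, 3
flat = torch.arange(s * b * h).reshape(s * b, h)   # rows ordered sequence-major, i.e. (s, b)

wrong = flat.reshape(b, s, h)                      # treats the buffer as (b * s, h): mixes batch entries
right = flat.reshape(s, b, h).permute(1, 0, 2)     # (s, b, h) -> transpose (1, 0, 2) -> (b, s, h)

print(torch.equal(wrong, right))  # False: the two interpretations disagree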

We tested the fix, and the outputs with --tp_degree=2 --collectives_layout="BSH" now match those with --tp_degree=1.

Your insights are very much appreciated. We will continue following up on this issue until it is resolved.

Credits to @wenboqian for providing the initial direction for detecting and fixing the bug.
