
Qwen3 MoE CPU is slower than full precision #1275

Open
@SearchSavior

Description


Hello!

Thank you team for the awesome work on getting Qwen3 support out.

Did some testing today against full precision and got some pretty terrible results compared to what I think we should expect from an MoE.

Test code:

import time
from threading import Thread
from transformers import AutoTokenizer, TextIteratorStreamer
# from optimum.intel.openvino import OVModelForCausalLM
from transformers import AutoModelForCausalLM


model_id = "" # Can be a local path or an HF id
#ov_config = {"PERFORMANCE_HINT": "LATENCY", "NUM_STREAMS": 4}

print("Loading model...")
load_time = time.perf_counter()
# model = OVModelForCausalLM.from_pretrained(
model = AutoModelForCausalLM.from_pretrained(
    model_id,
   # export=False,
   # ov_config=ov_config,
    device_map="auto",
    torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
load_time = time.perf_counter() - load_time
print(f"Model loaded in {load_time:.3f} seconds.") 

text_prompt = "We really should join the OpenArc Discord"
conversation = [
    {
        "role": "user",
        "content": text_prompt
    }
]
text_prompt_templated = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text=text_prompt_templated, return_tensors="pt")
input_token_count = inputs['input_ids'].shape[1]

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
generation_kwargs = dict(inputs, streamer=streamer, max_new_tokens=128)
thread = Thread(target=model.generate, kwargs=generation_kwargs)

first_token_received = False
generate_start = 0.0
first_token = 0.0
ttft = 0.0
generated_text = ""

generate_start = time.perf_counter()
thread.start()

for new_text in streamer:
    if not first_token_received:
        first_token = time.perf_counter()
        ttft = first_token - generate_start
        first_token_received = True

    print(new_text, end='', flush=True)
    generated_text += new_text

thread.join()
generate_end = time.perf_counter()

generation_time = generate_end - generate_start

num_tokens_generated = len(tokenizer.encode(generated_text))

# Default to 0.0 so the report below does not raise a NameError if nothing was generated
tokens_per_second = 0.0
average_token_latency = 0.0
if generation_time > 0 and num_tokens_generated > 0:
    tokens_per_second = num_tokens_generated / generation_time
    average_token_latency = generation_time / num_tokens_generated

print("\nPerformance Report:")
print("-"*50)
print(f"Input Tokens    : {input_token_count:>9}")
print(f"Output Tokens   : {num_tokens_generated:>9}")
print("")
print(f"Load Time       : {load_time:>9.3f} sec (Model Load Time)")
print(f"TTFT            : {ttft:>9.3f} sec (Time To First Token)")
print(f"Generation Time : {generation_time:>9.3f} sec (Total Generation Time)")
print(f"Throughput      : {tokens_per_second:>9.2f} t/s (Tokens Per Second)")
print(f"Avg Latency     : {average_token_latency:>9.3f} sec (Average Token Latency)")
print("-"*50)

OpenVINO Results:

Model: https://huggingface.co/Echo9Zulu/Qwen3-30B-A3B-int8_asym-ov/tree/main
Converted with:

optimum-cli export openvino -m "" --task text-generation-with-past --weight-format int8_asym ""
Input Tokens    :        16
Output Tokens   :       128

Load Time       :   863.396 sec (Model Load Time)
TTFT            :     6.474 sec (Time To First Token)
Generation Time :    86.857 sec (Total Generation Time)
Throughput      :      1.47 t/s (Tokens Per Second)
Avg Latency     :     0.679 sec (Average Token Latency)

Model: https://huggingface.co/Echo9Zulu/Qwen3-30B-A3B-nf4-ov/tree/main
Converted with:

optimum-cli export openvino -m "" --task text-generation-with-past --weight-format nf4 --ratio 1 --group-size 128 --backup-precision int8_asym --sensitivity-metric weight_quantization_error ""
Input Tokens    :        16
Output Tokens   :       128

Load Time       :   963.818 sec (Model Load Time)
TTFT            :    15.691 sec (Time To First Token)
Generation Time :   119.668 sec (Total Generation Time)
Throughput      :      1.07 t/s (Tokens Per Second)
Avg Latency     :     0.935 sec (Average Token Latency)

Full Precision (transformers) Results:

Input Tokens    :        16
Output Tokens   :       128

Load Time       :    21.417 sec (Model Load Time)
TTFT            :     3.609 sec (Time To First Token)
Generation Time :   133.330 sec (Total Generation Time)
Throughput      :      0.96 t/s (Tokens Per Second)
Avg Latency     :     1.042 sec (Average Token Latency)

Some Notes

Machine:
2x Xeon 6242
768 GB DDR4 ECC
Bare Metal
Ubuntu 22.04

  • This test was only one-shot. Timing multiple generations with the model kept in memory may give better numbers.
  • TTFT is faster in both the int8_asym and nf4 cases, as expected, but throughput should be higher.
  • All generations reached 128 tokens because the model did not finish its CoT, so no EOS token was emitted.
  • It may be worth testing on Tiber to see Max 1550 performance (keeping the model in memory).
  • Later on I will check openvino_model.xml to see whether the NNCF algorithms degraded the model by quantizing layers that should be in a different weight format; a rough sketch of how I plan to inspect that is below.
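
Sketch of the inspection mentioned in the last bullet (untested; the path is assumed to be the local folder the int8_asym model was exported to). It only uses the stock openvino Python API to count weight constants per element type:

from collections import Counter
import openvino as ov

core = ov.Core()
# Assumed local path of the exported int8_asym model
ov_model = core.read_model("Qwen3-30B-A3B-int8_asym-ov/openvino_model.xml")

# Count weight constants per element type to see what NNCF actually produced
precisions = Counter()
for node in ov_model.get_ops():
    if node.get_type_name() == "Constant":
        precisions[str(node.get_output_element_type(0))] += 1

print(precisions)  # e.g. how many constants ended up as u8 / i8 / nf4 / f16 / f32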

I want to help diagnose this but don't know where to start. In #1214 a 'tiny' model was mentioned. How are those used, and are they on the hub? My feeling is that we might need a more nuanced custom export config, but I'm not sure; there is probably something simpler to try first.
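
To make that concrete, this is roughly the kind of custom export config I was picturing, using OVWeightQuantizationConfig from optimum-intel. The ignored_scope patterns for the MoE router/gate layers are hypothetical placeholders I have not checked against the exported graph:

from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

# Hypothetical "more nuanced" config: keep routing/gate layers out of int8,
# otherwise matching the int8_asym export above.
quant_config = OVWeightQuantizationConfig(
    bits=8,
    sym=False,  # int8_asym, as in the optimum-cli run above
    ignored_scope={"patterns": [".*gate.*", ".*router.*"]},  # placeholder layer-name patterns
)

model = OVModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-30B-A3B",  # assumed source checkpoint id
    export=True,
    quantization_config=quant_config,
)
model.save_pretrained("Qwen3-30B-A3B-int8_asym-custom-ov")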

Again, thank you team for the awesome work!!!
