Description
Hello!
Thank you team for the awesome work on getting Qwen3 support out.
I did some testing today comparing the OpenVINO exports against full precision and got some pretty terrible results relative to what I think we should expect from an MoE.
Test code:
```python
import time
from threading import Thread

from transformers import AutoTokenizer, TextIteratorStreamer
# from optimum.intel.openvino import OVModelForCausalLM
from transformers import AutoModelForCausalLM

model_id = ""  # Can be a local path or an HF id
# ov_config = {"PERFORMANCE_HINT": "LATENCY", "NUM_STREAMS": 4}

print("Loading model...")
load_time = time.perf_counter()
# model = OVModelForCausalLM.from_pretrained(
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    # export=False,
    # ov_config=ov_config,
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
load_time = time.perf_counter() - load_time
print(f"Model loaded in {load_time:.3f} seconds.")

text_prompt = "We really should join the OpenArc Discord"
conversation = [
    {
        "role": "user",
        "content": text_prompt
    }
]

text_prompt_templated = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text=text_prompt_templated, return_tensors="pt")
input_token_count = inputs['input_ids'].shape[1]

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
generation_kwargs = dict(inputs, streamer=streamer, max_new_tokens=128)
thread = Thread(target=model.generate, kwargs=generation_kwargs)

first_token_received = False
generate_start = 0.0
first_token = 0.0
ttft = 0.0
generated_text = ""

generate_start = time.perf_counter()
thread.start()

for new_text in streamer:
    if not first_token_received:
        first_token = time.perf_counter()
        ttft = first_token - generate_start
        first_token_received = True
    print(new_text, end='', flush=True)
    generated_text += new_text

thread.join()
generate_end = time.perf_counter()
generation_time = generate_end - generate_start
num_tokens_generated = len(tokenizer.encode(generated_text))

if generation_time > 0 and num_tokens_generated > 0:
    tokens_per_second = num_tokens_generated / generation_time
    average_token_latency = generation_time / num_tokens_generated

    print("\nPerformance Report:")
    print("-" * 50)
    print(f"Input Tokens : {input_token_count:>9}")
    print(f"Output Tokens : {num_tokens_generated:>9}")
    print("")
    print(f"Load Time : {load_time:>9.3f} sec (Model Load Time)")
    print(f"TTFT : {ttft:>9.3f} sec (Time To First Token)")
    print(f"Generation Time : {generation_time:>9.3f} sec (Total Generation Time)")
    print(f"Throughput : {tokens_per_second:>9.2f} t/s (Tokens Per Second)")
    print(f"Avg Latency : {average_token_latency:>9.3f} sec (Average Token Latency)")
    print("-" * 50)
```
OpenVINO Results:
Model: https://huggingface.co/Echo9Zulu/Qwen3-30B-A3B-int8_asym-ov/tree/main
Converted with:
```
optimum-cli export openvino -m "" --task text-generation-with-past --weight-format int8_asym ""
```
```
Input Tokens    :       16
Output Tokens   :      128
Load Time       :  863.396 sec (Model Load Time)
TTFT            :    6.474 sec (Time To First Token)
Generation Time :   86.857 sec (Total Generation Time)
Throughput      :     1.47 t/s (Tokens Per Second)
Avg Latency     :    0.679 sec (Average Token Latency)
```
Model: https://huggingface.co/Echo9Zulu/Qwen3-30B-A3B-nf4-ov/tree/main
Converted with:
```
optimum-cli export openvino -m "" --task text-generation-with-past --weight-format nf4 --ratio 1 --group-size 128 --backup-precision int8_asym --sensitivity-metric weight_quantization_error ""
```
```
Input Tokens    :       16
Output Tokens   :      128
Load Time       :  963.818 sec (Model Load Time)
TTFT            :   15.691 sec (Time To First Token)
Generation Time :  119.668 sec (Total Generation Time)
Throughput      :     1.07 t/s (Tokens Per Second)
Avg Latency     :    0.935 sec (Average Token Latency)
```
Full Precision Results:
```
Input Tokens    :       16
Output Tokens   :      128
Load Time       :   21.417 sec (Model Load Time)
TTFT            :    3.609 sec (Time To First Token)
Generation Time :  133.330 sec (Total Generation Time)
Throughput      :     0.96 t/s (Tokens Per Second)
Avg Latency     :    1.042 sec (Average Token Latency)
```
Some Notes
Machine:
- 2x Xeon 6242
- 768GB DDR4 ECC
- Bare metal
- Ubuntu 22.04
- This test was only one-shot; running more generations with the model kept in memory might give better numbers.
- Throughput is higher for both int8_asym and nf4 than full precision, as expected, but it should be much higher than this (and TTFT is actually slower for both quantized exports).
- All generations hit the 128-token cap because the model never finished its chain of thought, so no EOS token was emitted.
- Maybe it's worth testing on Tiber to see Max 1550 performance (keeping the model in memory).
- Later on I will check openvino_model.xml to see whether the NNCF algorithms nuked the model by quantizing layers that should have stayed in a different weight format; a quick sketch for checking this is below.
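One way to do that check is to read the IR back with the OpenVINO runtime and count the precisions of the Constant (weight) nodes. A minimal sketch, assuming the `openvino` Python package is installed and `openvino_model.xml` sits in the export directory:

```python
# Sketch: count the element types of Constant nodes in the exported IR to see
# which precisions the weights actually ended up in (e.g. u8 / i8 / u4 / nf4 / f16 / f32).
from collections import Counter
import openvino as ov

core = ov.Core()
ov_model = core.read_model("openvino_model.xml")  # path inside the exported model dir

precisions = Counter()
for node in ov_model.get_ops():
    if node.get_type_name() == "Constant":
        precisions[node.get_element_type().get_type_name()] += 1

print(precisions)
```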
I want to help diagnose this but don't know where to start. In #1214 a 'tiny' model was mentioned. How are those used, and are they on the Hub? My feeling is that we might need a more nuanced custom export config (a rough sketch of what that could look like is below), but I'm not sure; there is probably something simpler to try first.
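For the custom export config idea, this is roughly what a Python-side export with an explicit weight quantization config could look like. A sketch assuming current optimum-intel; the parameter values and output directory are illustrative, not a known-good recipe for Qwen3-30B-A3B:

```python
# Sketch: export with an explicit OVWeightQuantizationConfig instead of the
# optimum-cli defaults, so individual knobs (bits, sym, ratio, group_size)
# can be A/B tested per experiment. Values below are placeholders.
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

quant_config = OVWeightQuantizationConfig(
    bits=8,
    sym=False,   # int8_asym, to match the CLI export above
    ratio=1.0,   # quantize all eligible layers
)

ov_model = OVModelForCausalLM.from_pretrained(
    model_id,                          # original HF id / local path
    export=True,                       # convert to IR during loading
    quantization_config=quant_config,
)
ov_model.save_pretrained("Qwen3-30B-A3B-int8_asym-custom-ov")  # hypothetical output dir
```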
Again, thank you team for the awesome work!!!