Description
Hello!
Thank you team for the awesome work on getting Qwen3 support out.
I did some testing today comparing the OpenVINO exports against full precision and got some pretty terrible results relative to what I think we should expect from an MoE.
Test code:
```python
import time
from threading import Thread

from transformers import AutoTokenizer, TextIteratorStreamer
# from optimum.intel.openvino import OVModelForCausalLM
from transformers import AutoModelForCausalLM

model_id = ""  # Can be a local path or an HF id
# ov_config = {"PERFORMANCE_HINT": "LATENCY", "NUM_STREAMS": 4}

print("Loading model...")
load_time = time.perf_counter()
# model = OVModelForCausalLM.from_pretrained(
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    # export=False,
    # ov_config=ov_config,
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
load_time = time.perf_counter() - load_time
print(f"Model loaded in {load_time:.3f} seconds.")

text_prompt = "We really should join the OpenArc Discord"
conversation = [
    {
        "role": "user",
        "content": text_prompt
    }
]

text_prompt_templated = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text=text_prompt_templated, return_tensors="pt")
input_token_count = inputs['input_ids'].shape[1]

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
generation_kwargs = dict(inputs, streamer=streamer, max_new_tokens=128)
thread = Thread(target=model.generate, kwargs=generation_kwargs)

first_token_received = False
generate_start = 0.0
first_token = 0.0
ttft = 0.0
generated_text = ""

generate_start = time.perf_counter()
thread.start()

for new_text in streamer:
    if not first_token_received:
        first_token = time.perf_counter()
        ttft = first_token - generate_start
        first_token_received = True
    print(new_text, end='', flush=True)
    generated_text += new_text

thread.join()
generate_end = time.perf_counter()
generation_time = generate_end - generate_start
num_tokens_generated = len(tokenizer.encode(generated_text))

if generation_time > 0 and num_tokens_generated > 0:
    tokens_per_second = num_tokens_generated / generation_time
    average_token_latency = generation_time / num_tokens_generated

    print("\nPerformance Report:")
    print("-" * 50)
    print(f"Input Tokens : {input_token_count:>9}")
    print(f"Output Tokens : {num_tokens_generated:>9}")
    print("")
    print(f"Load Time : {load_time:>9.3f} sec (Model Load Time)")
    print(f"TTFT : {ttft:>9.3f} sec (Time To First Token)")
    print(f"Generation Time : {generation_time:>9.3f} sec (Total Generation Time)")
    print(f"Throughput : {tokens_per_second:>9.2f} t/s (Tokens Per Second)")
    print(f"Avg Latency : {average_token_latency:>9.3f} sec (Average Token Latency)")
    print("-" * 50)
```
OpenVINO Results:
Model: https://huggingface.co/Echo9Zulu/Qwen3-30B-A3B-int8_asym-ov/tree/main
Converted with:
```
optimum-cli export openvino -m "" --task text-generation-with-past --weight-format int8_asym ""
```
```
Input Tokens    :       16
Output Tokens   :      128
Load Time       :  863.396 sec (Model Load Time)
TTFT            :    6.474 sec (Time To First Token)
Generation Time :   86.857 sec (Total Generation Time)
Throughput      :     1.47 t/s (Tokens Per Second)
Avg Latency     :    0.679 sec (Average Token Latency)
```
Model: https://huggingface.co/Echo9Zulu/Qwen3-30B-A3B-nf4-ov/tree/main
Converted with:
```
optimum-cli export openvino -m "" --task text-generation-with-past --weight-format nf4 --ratio 1 --group-size 128 --backup-precision int8_asym --sensitivity-metric weight_quantization_error ""
```
```
Input Tokens    :       16
Output Tokens   :      128
Load Time       :  963.818 sec (Model Load Time)
TTFT            :   15.691 sec (Time To First Token)
Generation Time :  119.668 sec (Total Generation Time)
Throughput      :     1.07 t/s (Tokens Per Second)
Avg Latency     :    0.935 sec (Average Token Latency)
```
Full Precision Results:
```
Input Tokens    :       16
Output Tokens   :      128
Load Time       :   21.417 sec (Model Load Time)
TTFT            :    3.609 sec (Time To First Token)
Generation Time :  133.330 sec (Total Generation Time)
Throughput      :     0.96 t/s (Tokens Per Second)
Avg Latency     :    1.042 sec (Average Token Latency)
```
Some Notes
Machine:
- 2x Xeon 6242
- 768GB DDR4 ECC
- Bare metal
- Ubuntu 22.04
- This test was only one-shot; running more generations with the model kept in memory might give better numbers.
- Throughput is higher for both int8_asym and nf4 than full precision, as expected, but it should be much higher than this (and TTFT is actually slower for both quantized exports).
- All generations hit the 128-token cap because the model never finished its chain of thought, so no EOS token was emitted.
- Maybe it's worth testing on Tiber to see Max 1550 performance (keeping the model in memory).
- Later on I will check openvino_model.xml to see whether the NNCF algorithms nuked the model by quantizing layers that should have stayed in a different weight format; a quick sketch for checking this is below.
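One way to do that check is to read the IR back with the OpenVINO runtime and count the precisions of the Constant (weight) nodes. A minimal sketch, assuming the `openvino` Python package is installed and `openvino_model.xml` sits in the export directory:

```python
# Sketch: count the element types of Constant nodes in the exported IR to see
# which precisions the weights actually ended up in (e.g. u8 / i8 / u4 / nf4 / f16 / f32).
from collections import Counter
import openvino as ov

core = ov.Core()
ov_model = core.read_model("openvino_model.xml")  # path inside the exported model dir

precisions = Counter()
for node in ov_model.get_ops():
    if node.get_type_name() == "Constant":
        precisions[node.get_element_type().get_type_name()] += 1

print(precisions)
```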
I want to help diagnose this but don't know where to start. In #1214 a 'tiny' model was mentioned. How are those used, and are they on the Hub? My feeling is that we might need a more nuanced custom export config (a rough sketch of what that could look like is below), but I'm not sure; there is probably something simpler to try first.
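For the custom export config idea, this is roughly what a Python-side export with an explicit weight quantization config could look like. A sketch assuming current optimum-intel; the parameter values and output directory are illustrative, not a known-good recipe for Qwen3-30B-A3B:

```python
# Sketch: export with an explicit OVWeightQuantizationConfig instead of the
# optimum-cli defaults, so individual knobs (bits, sym, ratio, group_size)
# can be A/B tested per experiment. Values below are placeholders.
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

quant_config = OVWeightQuantizationConfig(
    bits=8,
    sym=False,   # int8_asym, to match the CLI export above
    ratio=1.0,   # quantize all eligible layers
)

ov_model = OVModelForCausalLM.from_pretrained(
    model_id,                          # original HF id / local path
    export=True,                       # convert to IR during loading
    quantization_config=quant_config,
)
ov_model.save_pretrained("Qwen3-30B-A3B-int8_asym-custom-ov")  # hypothetical output dir
```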
Again, thank you team for the awesome work!!!