Commit efb4633

Merge branch 'main' into issue-1927-type-hints
2 parents: 57255e6 + 175d562

2 files changed: 6 additions & 6 deletions


.gitignore

Lines changed: 1 addition & 1 deletion
```diff
@@ -126,7 +126,7 @@ venv.bak/
 
 # mkdocs documentation
 /site
-docs/.cache/
+docs/.cache/*
 
 # mypy
 .mypy_cache/
```

docs/index.md

Lines changed: 5 additions & 5 deletions
```diff
@@ -20,7 +20,7 @@ Review the [LLM Compressor v0.8.0 release notes](https://github.com/vllm-project
 !!! info "Support for multiple modifiers in oneshot compression runs"
     LLM Compressor now supports using multiple modifiers in oneshot compression runs such as applying both AWQ and GPTQ in a single model.
 
-    Using multiple modifiers is an advanced usage of LLM Compressor and an active area of research. See [Non-uniform Quantization](examples/quantization_non_uniform/) for more detail and example usage.
+    Using multiple modifiers is an advanced usage of LLM Compressor and an active area of research. See [Non-uniform Quantization](/examples/quantization_non_uniform/) for more detail and example usage.
 
 !!! info "Quantization and calibration support for Qwen3 models"
     Quantization and calibration support for Qwen3 Next models has been added to LLM Compressor.
@@ -39,16 +39,16 @@ Review the [LLM Compressor v0.8.0 release notes](https://github.com/vllm-project
 ## Recent Updates
 
 !!! info "QuIP and SpinQuant-style Transforms"
-    The newly added [`QuIPModifier`](../examples/transform/quip_example.py) and [`SpinQuantModifier`](../examples/transform/spinquant_example.py) allow you to quantize models after injecting hadamard weights into the computation graph, reducing quantization error and greatly improving accuracy recovery for low bit-weight and activation quantization.
+    The newly added [`QuIPModifier` and `SpinQuantModifier`](/examples/transform) transforms allow you to quantize models after injecting hadamard weights into the computation graph, reducing quantization error and greatly improving accuracy recovery for low bit-weight and activation quantization.
 
 !!! info "DeepSeekV3-style Block Quantization Support"
-    Allows for more efficient compression of large language models without needing a calibration dataset. Quantize a Qwen3 model to [W8A8](../examples/quantization_w8a8_fp8/fp8_block_example.py).
+    Allows for more efficient compression of large language models without needing a calibration dataset. Quantize a Qwen3 model to [W8A8](/examples/quantization_w8a8_fp8/).
 
 !!! info "FP4 Quantization - now with MoE and non-uniform support"
-    Quantize weights and activations to FP4 and seamlessly run the compressed model in vLLM. Model weights and activations are quantized following the [NVFP4 configuration](https://github.com/neuralmagic/compressed-tensors/blob/f5dbfc336b9c9c361b9fe7ae085d5cb0673e56eb/src/compressed_tensors/quantization/quant_scheme.py#L104). See examples of [FP4 activation support](../examples/quantization_w4a4_fp4/llama3_example.py), [MoE support](../examples/quantization_w4a4_fp4/qwen_30b_a3b.py), and [Non-uniform quantization support](../examples/quantization_non_uniform/README.md) where some layers are selectively quantized to FP8 for better recovery. You can also mix other quantization schemes, such as INT8 and INT4.
+    Quantize weights and activations to FP4 and seamlessly run the compressed model in vLLM. Model weights and activations are quantized following the [NVFP4 configuration](https://github.com/neuralmagic/compressed-tensors/blob/f5dbfc336b9c9c361b9fe7ae085d5cb0673e56eb/src/compressed_tensors/quantization/quant_scheme.py#L104). See examples of [FP4 activation support](/examples/quantization_w4a4_fp4/), [MoE support](/examples/quantization_w4a4_fp4/), and [Non-uniform quantization support](/examples/quantization_non_uniform/) where some layers are selectively quantized to FP8 for better recovery. You can also mix other quantization schemes, such as INT8 and INT4.
 
 !!! info "Llama4 Quantization Support"
-    Quantize a Llama4 model to [W4A16](../examples/quantization_w4a16) or [NVFP4](../examples/quantization_w4a4_fp4/llama4_example.py). The checkpoint produced can seamlessly run in vLLM.
+    Quantize a Llama4 model to [W4A16](/examples/quantization_w4a16/) or [NVFP4](/examples/quantization_w4a4_fp4/). The checkpoint produced can seamlessly run in vLLM.
 
 For more information, check out the [latest release on GitHub](https://github.com/vllm-project/llm-compressor/releases/latest).
```
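The docs/index.md hunk above highlights multi-modifier oneshot runs (for example, AWQ and GPTQ applied to different parts of one model). The sketch below shows roughly what such a run looks like with the llmcompressor API. It is not part of this commit, and the model ID, calibration dataset, layer targets, and schemes are illustrative assumptions; the linked quantization_non_uniform example is the authoritative reference.

```python
# Hedged sketch of a multi-modifier oneshot run (not from this commit).
# Assumptions: model ID, dataset name, regex layer targets, and schemes are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# A oneshot recipe can hold more than one modifier; here AWQ covers the attention
# projections and GPTQ covers the MLP projections (targets are illustrative only).
recipe = [
    AWQModifier(
        targets=["re:.*self_attn.*proj"],
        scheme="W4A16",
        ignore=["lm_head"],
    ),
    GPTQModifier(
        targets=["re:.*mlp.*proj"],
        scheme="W8A8",
        ignore=["lm_head"],
    ),
]

oneshot(
    model=model,
    dataset="open_platypus",          # small calibration dataset (assumed choice)
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

# save_compressed writes a compressed-tensors checkpoint that vLLM can load.
model.save_pretrained("Meta-Llama-3-8B-Instruct-awq-gptq", save_compressed=True)
tokenizer.save_pretrained("Meta-Llama-3-8B-Instruct-awq-gptq")
```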
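The FP4 item in the same hunk points at the NVFP4 preset in compressed-tensors. A minimal, hedged sketch of such a run follows; the QuantizationModifier import and the "NVFP4" scheme name follow the linked examples, while the model, dataset, and calibration settings are assumptions.

```python
# Hedged sketch of NVFP4 (FP4 weight and activation) quantization.
# Assumptions: model ID, dataset, and calibration settings are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# NVFP4 quantizes Linear weights and activations to FP4; activation scales are
# calibrated, so a small calibration dataset is passed to oneshot.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(
    model=model,
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained("Meta-Llama-3-8B-Instruct-NVFP4", save_compressed=True)
tokenizer.save_pretrained("Meta-Llama-3-8B-Instruct-NVFP4")
```

The resulting directory can then be served with vLLM in the usual way, as the updated documentation describes.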
