
Conversation

@dreadnode-renovate-bot[bot] (Contributor) commented on Aug 21, 2025:

This PR contains the following updates:

| Package | Change                |
| ------- | --------------------- |
| vllm    | `^0.5.0` -> `^0.10.0` |
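For reference, a minimal sketch of what the corresponding pyproject.toml change might look like, assuming the project declares vllm as a caret-ranged Poetry dependency under `[tool.poetry.dependencies]` (the section name and surrounding entries are assumptions, not taken from this repository):

```toml
[tool.poetry.dependencies]
# Before this PR: the constraint allowed any 0.5.x release (>=0.5.0,<0.6.0).
# vllm = "^0.5.0"

# After this PR: the constraint allows any 0.10.x release (>=0.10.0,<0.11.0).
vllm = "^0.10.0"
```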

Release Notes

vllm-project/vllm (vllm)

v0.10.1

Compare Source

Highlights

The v0.10.1 release includes 727 commits from 245 committers (105 new contributors).

Model Support
  • New model families: GPT-OSS with comprehensive tool calling and streaming support (#​22327, #​22330, #​22332, #​22335, #​22339, #​22340, #​22342), Command-A-Vision (#​22660), mBART (#​22883), and SmolLM3 using Transformers backend (#​22665).
  • Vision-language models: Official Eagle multimodal support with Llama4 backend (#​20788), Step3 vision-language models (#​21998), Gemma3n multimodal (#​20495), MiniCPM-V 4.0 (#​22166), HyperCLOVAX-SEED-Vision-Instruct-3B (#​20931), Emu3 with Transformers backend (#​21319), Intern-S1 (#​21628), and Prithvi in online serving mode (#​21518).
  • Enhanced existing models: NemotronH support (#22349), Ernie 4.5 Base 0.3B model name change (#21735), GLM-4.5 series improvements (#22215), Granite models with fused MoE configurations (#21332) and quantized checkpoint loading (#22925), Ultravox support for Llama 4 and Gemma 3 backends (#17818), Mamba1 and Jamba model support in V1 (without CUDA graphs) (#21249).
  • Advanced model capabilities: Qwen3 EPLB (#​20815) and dual-chunk attention support (#​21924), Qwen native Eagle3 target support (#​22333).
  • Architecture expansions: Encoder-only models without KV-cache enabling BERT-style architectures (#​21270), expanded tensor parallelism support in Transformers backend (#​22651), tensor parallelism for Deepseek_vl2 vision transformer (#​21494), and tensor/pipeline parallelism with Mamba2 kernel for PLaMo2 (#​19674).
  • V1 engine compatibility: Extended support for additional pooling models (#​21747) and Step3VisionEncoder distributed processing option (#​22697).
Engine Core
  • CUDA graph performance: Full CUDA graph support with separate attention routines, adding FA2 and FlashInfer compatibility (#​20059), plus 6% end-to-end throughput improvement from Cutlass MLA (#​22763).
  • Attention system advances: Multiple attention metadata builders per KV cache specification (#​21588), tree attention backend for v1 engine (experimental) (#​20401), FlexAttention encoder-only support (#​22273), upgraded FlashAttention 3 with attention sink support (#​22313), and multiple attention groups for KV sharing patterns (#​22672).
  • Speculative decoding optimizations: N-gram speculative decoding with single KMP token proposal algorithm (#​22437), explicit EAGLE3 interface for enhanced compatibility (#​22642).
  • Default behavior improvements: Pooling models now default to chunked prefill and prefix caching (#​20930), disabled chunked local attention by default for Llama4 for better performance (#​21761).
  • Extensibility and configuration: Model loader plugin system (#​21067), custom operations support for FusedMoe (#​22509), rate limiting with bucket algorithm for proxy server (#​22643), torch.compile support for bailing MoE (#​21664).
  • Performance optimizations: Improved startup time by disabling C++ compilation of symbolic shapes (#​20836), enhanced headless models for pooling in Transformers backend (#​21767).
Hardware & Performance
  • NVIDIA Blackwell (SM100) optimizations: CutlassMLA as default backend (#​21626), FlashInfer MoE per-tensor scale FP8 backend (#​21458), SM90 CUTLASS FP8 GEMM with kernel tuning and swap AB support (#​20396).
  • NVIDIA RTX 5090/RTX PRO 6000 (SM120) support: Block FP8 quantization (#​22131) and CUTLASS NVFP4 4-bit weights/activations support (#​21309).
  • AMD ROCm platform enhancements: Flash Attention backend for Qwen-VL models (#​22069), AITER HIP block quantization kernels (#​21242), reduced device-to-host transfers (#​22683), and optimized kernel performance for small batch sizes 1-4 (#​21350).
  • Attention and compute optimizations: FlashAttention 3 attention sinks performance boost (#​22478), Triton-based multi-dimensional RoPE replacing PyTorch implementation (#​22375), async tensor parallelism for scaled matrix multiplication (#​20155), optimized FlashInfer metadata building (#​21137).
  • Memory and throughput improvements: Mamba2 reduced device-to-device copy overhead (#​21075), fused Triton kernels for RMSNorm (#​20839, #​22184), improved multimodal hasher performance for repeated image prompts (#​22825), multithreaded async multimodal loading (#​22710).
  • Parallelization and MoE optimizations: Guided decoding throughput improvements (#​21862), balanced expert sharding for MoE models (#​21497), expanded fused kernel support for topk softmax (#​22211), fused MoE for nomic-embed-text-v2-moe (#​18321).
  • Hardware compatibility and kernels: ARM CPU build fixes for systems without BF16 support (#​21848), Machete memory-bound performance improvements (#​21556), FlashInfer TRT-LLM prefill attention kernel support (#​22095), optimized reshape_and_cache_flash CUDA kernel (#​22036), CPU transfer support in NixlConnector (#​18293).
  • Specialized CUDA kernels: GPT-OSS activation functions (#​22538), RLHF weight loading acceleration (#​21164).
Quantization
  • Advanced quantization techniques: MXFP4 and bias support for Marlin kernel (#​22428), NVFP4 GEMM FlashInfer backends (#​22346), compressed-tensors mixed-precision model loading (#​22468), FlashInfer MoE support for NVFP4 (#​21639).
  • Hardware-optimized quantization: Dynamic 4-bit quantization with Kleidiai kernels for CPU inference (#​17112), TensorRT-LLM FP4 quantization optimized for MoE low-latency inference (#​21331).
  • Expanded model quantization support: BitsAndBytes quantization for InternS1 (#​21953) and additional MoE models (#​21370, #​21548), Gemma3n quantization compatibility (#​21974), calibration-free RTN quantization for MoE models (#​20766), ModelOpt Qwen3 NVFP4 support (#​20101).
  • Performance and compatibility improvements: CUDA kernel optimization for Int8 per-token group quantization (#​21476), non-contiguous tensor support in FP8 quantization (#​21961), automatic detection of ModelOpt quantization formats (#​22073).
  • Breaking change: Removed AQLM quantization support (#22943); users should migrate to alternative quantization methods.
API & Frontend
  • OpenAI API compatibility: Unix domain socket support for local communication (#​18097), improved error response format matching upstream specification (#​22099), aligned tool_choice="required" behavior with OpenAI when tools list is empty (#​21052).
  • New API capabilities: Dedicated LLM.reward interface for reward models (#​21720), chunked processing for long inputs in embedding models (#​22280), AsyncLLM proper response handling for aborted requests (#​22283).
  • Configuration and environment: Multiple API keys support for enhanced authentication (#​18548), custom vLLM tuned configuration paths (#​22791), environment variable control for logging statistics (#​22905), multimodal cache size (#​22441), and DeepGEMM E8M0 scaling behavior (#​21968).
  • CLI and tooling improvements: V1 API support for run-batch command (#​21541), custom process naming for better monitoring (#​21445), improved help display showing available choices (#​21760), optional memory profiling skip for multimodal models (#​22950), enhanced logging of non-default arguments (#​21680).
  • Tool and parser support: HermesToolParser for models without special tokens (#​16890), multi-turn conversation benchmarking tool (#​20267).
  • Distributed serving enhancements: Enhanced hybrid distributed serving with multiple API servers in load balancing mode (#​21510), request_id support for external load balancers (#​21009).
  • User experience enhancements: Improved error messaging for multimodal items (#​22114), per-request pooling control via PoolingParams (#​20538).
Dependencies
  • FlashInfer updates: Updated to v0.2.8 for improved performance (#21385), moved to an optional dependency installed with pip install vllm[flashinfer] for more flexible installation (#21959); a Poetry sketch of opting into this extra follows this list.
  • Mamba SSM restructuring: Updated to version 2.2.5 (#​21421), removed from core requirements to reduce installation complexity (#​22541).
  • Docker and deployment: Docker-aware precompiled wheel support for easier containerized deployment (#​21127, #​22106).
  • Python package updates: OpenAI Python dependency updated to latest version for API compatibility (#​22316).
  • Dependency optimizations: Removed xformers requirement for Mistral-format Pixtral and Mistral3 models (#​21154), deprecation warnings added for old DeepGEMM version (#​22194).
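Because FlashInfer is now an optional extra rather than a core requirement, a project that relies on the FlashInfer backend has to request it explicitly. A minimal Poetry sketch, assuming the dependency lives under `[tool.poetry.dependencies]` and that the extra is actually wanted in this repository (both are assumptions):

```toml
[tool.poetry.dependencies]
# Opt into the FlashInfer extra now that it is no longer installed by default;
# this is the Poetry equivalent of `pip install "vllm[flashinfer]"`.
vllm = { version = "^0.10.0", extras = ["flashinfer"] }
```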
V0 Deprecation

Important: As part of the ongoing V0 engine cleanup, several breaking changes have been introduced:

  • CLI flag updates: Replaced --task with --runner and --convert options (#​21470), deprecated --disable-log-requests in favor of --enable-log-requests for clearer semantics (#​21739), renamed --expand-tools-even-if-tool-choice-none to --exclude-tools-when-tool-choice-none for consistency (#​20544).
  • API cleanup: Removed previously deprecated arguments and methods as part of ongoing V0 engine codebase cleanup (#​21907).

What's Changed


Configuration

📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 Automerge: Enabled.

Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.


  • If you want to rebase/retry this PR, check this box

This PR has been generated by Renovate Bot.

| datasource | package | from  | to     |
| ---------- | ------- | ----- | ------ |
| pypi       | vllm    | 0.5.5 | 0.10.1 |
@dreadnode-renovate-bot[bot] requested a review from a team as a code owner on August 21, 2025 at 20:04.
@dreadnode-renovate-bot[bot] (Contributor, Author) commented:

⚠️ Artifact update problem

Renovate failed to update an artifact related to this branch. You probably do not want to merge this PR as-is.

♻ Renovate will retry this branch, including artifacts, only when one of the following happens:

  • any of the package files in this branch needs updating, or
  • the branch becomes conflicted, or
  • you click the rebase/retry checkbox if found above, or
  • you rename this PR's title to start with "rebase!" to trigger it manually

The artifact failure details are included below:

File name: poetry.lock
Updating dependencies
Resolving dependencies...


The current project's supported Python range (>=3.10,<3.14) is not compatible with some of the required packages Python requirement:
  - vllm requires Python <3.13,>=3.9, so it will not be installable for Python >=3.13,<3.14
  - vllm requires Python <3.13,>=3.9, so it will not be installable for Python >=3.13,<3.14
  - vllm requires Python <3.13,>=3.9, so it will not be installable for Python >=3.13,<3.14

Because no versions of vllm match >0.10.0,<0.10.1 || >0.10.1,<0.10.1.1 || >0.10.1.1,<0.11.0
 and vllm (0.10.0) requires Python <3.13,>=3.9, vllm is forbidden.
And because vllm (0.10.1) requires Python <3.13,>=3.9, vllm is forbidden.
So, because vllm (0.10.1.1) requires Python <3.13,>=3.9
 and rigging depends on vllm (^0.10.0), version solving failed.

  * Check your dependencies Python requirement: The Python requirement can be specified via the `python` or `markers` properties

    For vllm, a possible solution would be to set the `python` property to ">=3.10,<3.13"
For vllm, a possible solution would be to set the `python` property to ">=3.10,<3.13"
For vllm, a possible solution would be to set the `python` property to ">=3.10,<3.13"

    https://python-poetry.org/docs/dependency-specification/#python-restricted-dependencies,
    https://python-poetry.org/docs/dependency-specification/#using-environment-markers


@dreadnode-renovate-bot[bot] added the type/digest (Dependency digest updates) and area/python (Changes to Python package configuration and dependencies) labels on Aug 21, 2025.