
Conversation

@dreadnode-renovate-bot[bot] (Contributor) commented on Aug 21, 2025:

This PR contains the following updates:

| Package | Change                |
| ------- | --------------------- |
| vllm    | `^0.5.0` -> `^0.10.0` |
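For reference, a minimal sketch of what the corresponding pyproject.toml change might look like, assuming the project declares vllm as a caret-ranged Poetry dependency under `[tool.poetry.dependencies]` (the section name and surrounding entries are assumptions, not taken from this repository):

```toml
[tool.poetry.dependencies]
# Before this PR: the constraint allowed any 0.5.x release (>=0.5.0,<0.6.0).
# vllm = "^0.5.0"

# After this PR: the constraint allows any 0.10.x release (>=0.10.0,<0.11.0).
vllm = "^0.10.0"
```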

Release Notes

vllm-project/vllm (vllm)

v0.10.1

Compare Source

Highlights

The v0.10.1 release includes 727 commits from 245 committers (105 new contributors).

Model Support
  • New model families: GPT-OSS with comprehensive tool calling and streaming support (#​22327, #​22330, #​22332, #​22335, #​22339, #​22340, #​22342), Command-A-Vision (#​22660), mBART (#​22883), and SmolLM3 using Transformers backend (#​22665).
  • Vision-language models: Official Eagle multimodal support with Llama4 backend (#​20788), Step3 vision-language models (#​21998), Gemma3n multimodal (#​20495), MiniCPM-V 4.0 (#​22166), HyperCLOVAX-SEED-Vision-Instruct-3B (#​20931), Emu3 with Transformers backend (#​21319), Intern-S1 (#​21628), and Prithvi in online serving mode (#​21518).
  • Enhanced existing models: NemotronH support (#22349), Ernie 4.5 Base 0.3B model name change (#21735), GLM-4.5 series improvements (#22215), Granite models with fused MoE configurations (#21332) and quantized checkpoint loading (#22925), Ultravox support for Llama 4 and Gemma 3 backends (#17818), Mamba1 and Jamba model support in V1 (without CUDA graphs) (#21249).
  • Advanced model capabilities: Qwen3 EPLB (#​20815) and dual-chunk attention support (#​21924), Qwen native Eagle3 target support (#​22333).
  • Architecture expansions: Encoder-only models without KV-cache enabling BERT-style architectures (#​21270), expanded tensor parallelism support in Transformers backend (#​22651), tensor parallelism for Deepseek_vl2 vision transformer (#​21494), and tensor/pipeline parallelism with Mamba2 kernel for PLaMo2 (#​19674).
  • V1 engine compatibility: Extended support for additional pooling models (#​21747) and Step3VisionEncoder distributed processing option (#​22697).
Engine Core
  • CUDA graph performance: Full CUDA graph support with separate attention routines, adding FA2 and FlashInfer compatibility (#​20059), plus 6% end-to-end throughput improvement from Cutlass MLA (#​22763).
  • Attention system advances: Multiple attention metadata builders per KV cache specification (#​21588), tree attention backend for v1 engine (experimental) (#​20401), FlexAttention encoder-only support (#​22273), upgraded FlashAttention 3 with attention sink support (#​22313), and multiple attention groups for KV sharing patterns (#​22672).
  • Speculative decoding optimizations: N-gram speculative decoding with single KMP token proposal algorithm (#​22437), explicit EAGLE3 interface for enhanced compatibility (#​22642).
  • Default behavior improvements: Pooling models now default to chunked prefill and prefix caching (#​20930), disabled chunked local attention by default for Llama4 for better performance (#​21761).
  • Extensibility and configuration: Model loader plugin system (#​21067), custom operations support for FusedMoe (#​22509), rate limiting with bucket algorithm for proxy server (#​22643), torch.compile support for bailing MoE (#​21664).
  • Performance optimizations: Improved startup time by disabling C++ compilation of symbolic shapes (#​20836), enhanced headless models for pooling in Transformers backend (#​21767).
Hardware & Performance
  • NVIDIA Blackwell (SM100) optimizations: CutlassMLA as default backend (#​21626), FlashInfer MoE per-tensor scale FP8 backend (#​21458), SM90 CUTLASS FP8 GEMM with kernel tuning and swap AB support (#​20396).
  • NVIDIA RTX 5090/RTX PRO 6000 (SM120) support: Block FP8 quantization (#​22131) and CUTLASS NVFP4 4-bit weights/activations support (#​21309).
  • AMD ROCm platform enhancements: Flash Attention backend for Qwen-VL models (#​22069), AITER HIP block quantization kernels (#​21242), reduced device-to-host transfers (#​22683), and optimized kernel performance for small batch sizes 1-4 (#​21350).
  • Attention and compute optimizations: FlashAttention 3 attention sinks performance boost (#​22478), Triton-based multi-dimensional RoPE replacing PyTorch implementation (#​22375), async tensor parallelism for scaled matrix multiplication (#​20155), optimized FlashInfer metadata building (#​21137).
  • Memory and throughput improvements: Mamba2 reduced device-to-device copy overhead (#​21075), fused Triton kernels for RMSNorm (#​20839, #​22184), improved multimodal hasher performance for repeated image prompts (#​22825), multithreaded async multimodal loading (#​22710).
  • Parallelization and MoE optimizations: Guided decoding throughput improvements (#​21862), balanced expert sharding for MoE models (#​21497), expanded fused kernel support for topk softmax (#​22211), fused MoE for nomic-embed-text-v2-moe (#​18321).
  • Hardware compatibility and kernels: ARM CPU build fixes for systems without BF16 support (#​21848), Machete memory-bound performance improvements (#​21556), FlashInfer TRT-LLM prefill attention kernel support (#​22095), optimized reshape_and_cache_flash CUDA kernel (#​22036), CPU transfer support in NixlConnector (#​18293).
  • Specialized CUDA kernels: GPT-OSS activation functions (#​22538), RLHF weight loading acceleration (#​21164).
Quantization
  • Advanced quantization techniques: MXFP4 and bias support for Marlin kernel (#​22428), NVFP4 GEMM FlashInfer backends (#​22346), compressed-tensors mixed-precision model loading (#​22468), FlashInfer MoE support for NVFP4 (#​21639).
  • Hardware-optimized quantization: Dynamic 4-bit quantization with Kleidiai kernels for CPU inference (#​17112), TensorRT-LLM FP4 quantization optimized for MoE low-latency inference (#​21331).
  • Expanded model quantization support: BitsAndBytes quantization for InternS1 (#​21953) and additional MoE models (#​21370, #​21548), Gemma3n quantization compatibility (#​21974), calibration-free RTN quantization for MoE models (#​20766), ModelOpt Qwen3 NVFP4 support (#​20101).
  • Performance and compatibility improvements: CUDA kernel optimization for Int8 per-token group quantization (#​21476), non-contiguous tensor support in FP8 quantization (#​21961), automatic detection of ModelOpt quantization formats (#​22073).
  • Breaking change: Removed AQLM quantization support (#22943); users should migrate to alternative quantization methods.
API & Frontend
  • OpenAI API compatibility: Unix domain socket support for local communication (#​18097), improved error response format matching upstream specification (#​22099), aligned tool_choice="required" behavior with OpenAI when tools list is empty (#​21052).
  • New API capabilities: Dedicated LLM.reward interface for reward models (#​21720), chunked processing for long inputs in embedding models (#​22280), AsyncLLM proper response handling for aborted requests (#​22283).
  • Configuration and environment: Multiple API keys support for enhanced authentication (#​18548), custom vLLM tuned configuration paths (#​22791), environment variable control for logging statistics (#​22905), multimodal cache size (#​22441), and DeepGEMM E8M0 scaling behavior (#​21968).
  • CLI and tooling improvements: V1 API support for run-batch command (#​21541), custom process naming for better monitoring (#​21445), improved help display showing available choices (#​21760), optional memory profiling skip for multimodal models (#​22950), enhanced logging of non-default arguments (#​21680).
  • Tool and parser support: HermesToolParser for models without special tokens (#​16890), multi-turn conversation benchmarking tool (#​20267).
  • Distributed serving enhancements: Enhanced hybrid distributed serving with multiple API servers in load balancing mode (#​21510), request_id support for external load balancers (#​21009).
  • User experience enhancements: Improved error messaging for multimodal items (#​22114), per-request pooling control via PoolingParams (#​20538).
Dependencies
  • FlashInfer updates: Updated to v0.2.8 for improved performance (#21385), moved to an optional dependency installed with pip install vllm[flashinfer] for more flexible installation (#21959); a Poetry sketch of opting into this extra follows this list.
  • Mamba SSM restructuring: Updated to version 2.2.5 (#​21421), removed from core requirements to reduce installation complexity (#​22541).
  • Docker and deployment: Docker-aware precompiled wheel support for easier containerized deployment (#​21127, #​22106).
  • Python package updates: OpenAI Python dependency updated to latest version for API compatibility (#​22316).
  • Dependency optimizations: Removed xformers requirement for Mistral-format Pixtral and Mistral3 models (#​21154), deprecation warnings added for old DeepGEMM version (#​22194).
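Because FlashInfer is now an optional extra rather than a core requirement, a project that relies on the FlashInfer backend has to request it explicitly. A minimal Poetry sketch, assuming the dependency lives under `[tool.poetry.dependencies]` and that the extra is actually wanted in this repository (both are assumptions):

```toml
[tool.poetry.dependencies]
# Opt into the FlashInfer extra now that it is no longer installed by default;
# this is the Poetry equivalent of `pip install "vllm[flashinfer]"`.
vllm = { version = "^0.10.0", extras = ["flashinfer"] }
```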
V0 Deprecation

Important: As part of the ongoing V0 engine cleanup, several breaking changes have been introduced:

  • CLI flag updates: Replaced --task with --runner and --convert options (#​21470), deprecated --disable-log-requests in favor of --enable-log-requests for clearer semantics (#​21739), renamed --expand-tools-even-if-tool-choice-none to --exclude-tools-when-tool-choice-none for consistency (#​20544).
  • API cleanup: Removed previously deprecated arguments and methods as part of ongoing V0 engine codebase cleanup (#​21907).

What's Changed


Configuration

📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 Automerge: Enabled.

Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.


  • If you want to rebase/retry this PR, check this box

This PR has been generated by Renovate Bot.

| datasource | package | from  | to     |
| ---------- | ------- | ----- | ------ |
| pypi       | vllm    | 0.5.5 | 0.10.1 |
@dreadnode-renovate-bot[bot] requested a review from a team as a code owner on August 21, 2025 at 20:04.
@dreadnode-renovate-bot[bot] (Contributor, Author) commented:

⚠️ Artifact update problem

Renovate failed to update an artifact related to this branch. You probably do not want to merge this PR as-is.

♻ Renovate will retry this branch, including artifacts, only when one of the following happens:

  • any of the package files in this branch needs updating, or
  • the branch becomes conflicted, or
  • you click the rebase/retry checkbox if found above, or
  • you rename this PR's title to start with "rebase!" to trigger it manually

The artifact failure details are included below:

File name: poetry.lock
Updating dependencies
Resolving dependencies...


The current project's supported Python range (>=3.10,<3.14) is not compatible with some of the required packages Python requirement:
  - vllm requires Python <3.13,>=3.9, so it will not be installable for Python >=3.13,<3.14
  - vllm requires Python <3.13,>=3.9, so it will not be installable for Python >=3.13,<3.14
  - vllm requires Python <3.13,>=3.9, so it will not be installable for Python >=3.13,<3.14

Because no versions of vllm match >0.10.0,<0.10.1 || >0.10.1,<0.10.1.1 || >0.10.1.1,<0.11.0
 and vllm (0.10.0) requires Python <3.13,>=3.9, vllm is forbidden.
And because vllm (0.10.1) requires Python <3.13,>=3.9, vllm is forbidden.
So, because vllm (0.10.1.1) requires Python <3.13,>=3.9
 and rigging depends on vllm (^0.10.0), version solving failed.

  * Check your dependencies Python requirement: The Python requirement can be specified via the `python` or `markers` properties

    For vllm, a possible solution would be to set the `python` property to ">=3.10,<3.13"
For vllm, a possible solution would be to set the `python` property to ">=3.10,<3.13"
For vllm, a possible solution would be to set the `python` property to ">=3.10,<3.13"

    https://python-poetry.org/docs/dependency-specification/#python-restricted-dependencies,
    https://python-poetry.org/docs/dependency-specification/#using-environment-markers


@dreadnode-renovate-bot[bot] added the type/digest (Dependency digest updates) and area/python (Changes to Python package configuration and dependencies) labels on Aug 21, 2025.