
Conversation

dreadnode-renovate-bot[bot] (Contributor) commented Oct 7, 2025

This PR contains the following updates:

| Package | Change             |
| ------- | ------------------ |
| vllm    | ^0.5.0 -> ^0.11.0  |

GitHub Vulnerability Alerts

CVE-2025-24357

Description

The vllm/model_executor/weight_utils.py file implements hf_model_weights_iterator to load model checkpoints downloaded from Hugging Face. It uses the torch.load function with the weights_only parameter left at its default value of False. There is a security warning at https://pytorch.org/docs/stable/generated/torch.load.html: when torch.load unpickles malicious pickle data, it will execute arbitrary code during unpickling.

Impact

This vulnerability can be exploited to execute arbitrary code and OS commands on the machine of any victim who fetches the pretrained repo remotely.

Note that most models now use the safetensors format, which is not vulnerable to this issue.
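
For checkpoints that are still loaded with torch.load, a minimal sketch of the safer pattern is shown below (the path is a placeholder; see GHSA-ggpf-24jw-3fcw later in this PR for why this is only a reliable mitigation on PyTorch >= 2.6.0):

import torch

# Restrict unpickling to tensor data; a checkpoint carrying arbitrary pickled
# objects is rejected instead of executing code during load.
state_dict = torch.load("pytorch_model.bin", map_location="cpu", weights_only=True)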

References

CVE-2025-25183

Summary

Maliciously constructed prompts can lead to hash collisions, resulting in prefix cache reuse, which can interfere with subsequent responses and cause unintended behavior.

Details

vLLM's prefix caching makes use of Python's built-in hash() function. As of Python 3.12, the behavior of hash(None) has changed to be a predictable constant value. This makes it more feasible that someone could try to exploit hash collisions.

Impact

The impact of a collision would be the reuse of cache entries generated from different content. Given knowledge of prompts in use and predictable hashing behavior, someone could intentionally populate the cache using a prompt known to collide with another prompt in use.

Solution

We address this problem by initializing hashes in vLLM with a value that is no longer constant and predictable. It will be different each time vLLM runs. This restores the behavior of Python versions prior to 3.12.

Using a hashing algorithm that is less prone to collisions (such as sha256) would be the best way to avoid the possibility of a collision. However, it would impact both performance and memory footprint. Hash collisions may still occur, though they are no longer straightforward to predict.

To give an idea of the likelihood of a collision, for randomly generated hash values (assuming the hash generation built into Python is uniformly distributed), with a cache capacity of 50,000 messages and an average prompt length of 300, a collision will occur on average once every 1 trillion requests.
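
A minimal sketch of the seeding idea described above (names are illustrative, not vLLM's actual code): derive a per-process random value once at startup and mix it into block hashes, so that hash(None) no longer contributes a predictable constant.

import os

# Per-process random seed, regenerated on every vLLM start.
NONE_HASH = int.from_bytes(os.urandom(8), "big")

def hash_block(parent_hash, token_ids):
    # Fall back to the random seed instead of hash(None) for the first block.
    parent = NONE_HASH if parent_hash is None else parent_hash
    return hash((parent, tuple(token_ids)))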

References

CVE-2025-29770

Impact

The outlines library is one of the backends used by vLLM to support structured output (a.k.a. guided decoding). Outlines provides an optional cache for its compiled grammars on the local filesystem. This cache has been on by default in vLLM. Outlines is also available by default through the OpenAI compatible API server.

The affected code in vLLM is vllm/model_executor/guided_decoding/outlines_logits_processors.py, which unconditionally uses the cache from outlines. vLLM should have this off by default and allow administrators to opt-in due to the potential for abuse.

A malicious user can send a stream of very short decoding requests with unique schemas, resulting in an addition to the cache for each request. This can result in a Denial of Service if the filesystem runs out of space.

Note that even if vLLM was configured to use a different backend by default, it is still possible to choose outlines on a per-request basis using the guided_decoding_backend key of the extra_body field of the request.

This issue applies to the V0 engine only. The V1 engine is not affected.

Patches

The fix is to disable this cache by default since it does not provide an option to limit its size. If you want to use this cache anyway, you may set the VLLM_V0_USE_OUTLINES_CACHE environment variable to 1.
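
A hedged sketch of the opt-in check (the helper name is hypothetical, not vLLM's code): the on-disk grammar cache is only used when the operator explicitly sets the environment variable.

import os

def outlines_cache_enabled() -> bool:
    # Disabled by default; administrators must opt in explicitly.
    return os.environ.get("VLLM_V0_USE_OUTLINES_CACHE", "0") == "1"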

Workarounds

There is no way to work around this issue in existing versions of vLLM other than preventing untrusted access to the OpenAI compatible API server.

References

GHSA-ggpf-24jw-3fcw

Description

GHSA-rh4j-5rhw-hr54 reported a vulnerability where loading a malicious model could result in code execution on the vLLM host. The fix, which specified weights_only=True in calls to torch.load(), does not solve the problem on PyTorch versions prior to 2.6.0.

PyTorch has issued a new CVE about this problem: GHSA-53q9-r3pm-6pq6

This means that versions of vLLM using PyTorch before 2.6.0 are vulnerable to this problem.

Background Knowledge

When users install vLLM according to the official manual, the PyTorch version is pinned in the requirements.txt file, so a default installation of vLLM pulls in PyTorch 2.5.1. (The original report includes screenshots of the install instructions, the requirements.txt pin, and the resolved PyTorch version.)

In CVE-2025-24357, weights_only=True was used for patching, but this is not sufficient: using weights_only=True on PyTorch 2.5.1 and earlier is unsafe. The original report demonstrates this against the torch.load interface with a proof-of-concept screenshot.

Fix

Update the PyTorch version to 2.6.0.
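
As an illustrative guard (a sketch, not part of vLLM), a deployment could assert the minimum PyTorch version before relying on weights_only=True:

import torch
from packaging.version import Version

# weights_only=True is only a reliable mitigation on PyTorch 2.6.0 or newer.
assert Version(torch.__version__) >= Version("2.6.0"), (
    "Upgrade PyTorch to >= 2.6.0 before loading untrusted checkpoints"
)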

Credit

This vulnerability was found by Ji'an Zhou and Li'shuo Song.

CVE-2025-30202

Impact

In a multi-node vLLM deployment, vLLM uses ZeroMQ for some multi-node communication purposes. The primary vLLM host opens an XPUB ZeroMQ socket and binds it to ALL interfaces. While the socket is always opened for a multi-node deployment, it is only used when doing tensor parallelism across multiple hosts.

Any client with network access to this host can connect to this XPUB socket unless its port is blocked by a firewall. Once connected, these arbitrary clients will receive all of the same data broadcasted to all of the secondary vLLM hosts. This data is internal vLLM state information that is not useful to an attacker.

By potentially connecting to this socket many times and not reading data published to them, an attacker can also cause a denial of service by slowing down or potentially blocking the publisher.
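
To illustrate the exposure (host and port are placeholders; this is not vLLM code), any reachable client can attach a plain SUB socket to the XPUB endpoint:

import zmq

ctx = zmq.Context()
sub = ctx.socket(zmq.SUB)
sub.setsockopt_string(zmq.SUBSCRIBE, "")      # receive everything that is published
sub.connect("tcp://PRIMARY_VLLM_HOST:PORT")   # the randomly chosen XPUB port

# Reading yields the broadcast metadata; never calling recv() instead lets the
# unread backlog build up and slow or block the publisher.
msg = sub.recv()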

Detailed Analysis

The XPUB socket in question is created here:

https://github.com/vllm-project/vllm/blob/c21b99b91241409c2fdf9f3f8c542e8748b317be/vllm/distributed/device_communicators/shm_broadcast.py#L236-L237

Data is published over this socket via MessageQueue.enqueue() which is called by MessageQueue.broadcast_object():

https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/device_communicators/shm_broadcast.py#L452-L453

https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/device_communicators/shm_broadcast.py#L475-L478

The MessageQueue.broadcast_object() method is called by the GroupCoordinator.broadcast_object() method in parallel_state.py:

https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/parallel_state.py#L364-L366

The broadcast over ZeroMQ is only done if the GroupCoordinator was created with use_message_queue_broadcaster set to True:

https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/parallel_state.py#L216-L219

The only case where GroupCoordinator is created with use_message_queue_broadcaster is the coordinator for the tensor parallelism group:

https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/parallel_state.py#L931-L936

To determine what data is broadcast to the tensor parallelism group, we must continue tracing. GroupCoordinator.broadcast_object() is called by GroupCoordinator.broadcast_tensor_dict():

https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/parallel_state.py#L489

which is called by broadcast_tensor_dict() in communication_op.py:

https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/communication_op.py#L29-L34

If we look at _get_driver_input_and_broadcast() in the V0 worker_base.py, we'll see how this tensor dict is formed:

https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/worker/worker_base.py#L332-L352

but the data actually sent over ZeroMQ is the metadata_list portion that is split from this tensor_dict. The tensor parts are sent via torch.distributed and only metadata about those tensors is sent via ZeroMQ.

https://github.com/vllm-project/vllm/blob/54a66e5fee4a1ea62f1e4c79a078b20668e408c6/vllm/distributed/parallel_state.py#L61-L83

Patches

Workarounds

Prior to the fix, your options include:

  1. Do not expose the vLLM host to a network where any untrusted connections may reach the host.
  2. Ensure that only the other vLLM hosts are able to connect to the TCP port used for the XPUB socket. Note that the port used is chosen at random.

References

CVE-2025-46570

This issue arises from the prefix caching mechanism, which may expose the system to a timing side-channel attack.

Description

When a new prompt is processed, if the PageAttention mechanism finds a matching prefix chunk, the prefill process speeds up, which is reflected in the TTFT (Time to First Token). Our tests revealed that the timing differences caused by matching chunks are significant enough to be recognized and exploited.

For instance, if the victim has submitted a sensitive prompt or if a valuable system prompt has been cached, an attacker sharing the same backend could attempt to guess the victim's input. By measuring the TTFT based on prefix matches, the attacker could verify if their guess is correct, leading to potential leakage of private information.

Unlike token-by-token sharing mechanisms, vLLM’s chunk-based approach (PageAttention) processes tokens in larger units (chunks). In our tests, with chunk_size=2, the timing differences became noticeable enough to allow attackers to infer whether portions of their input match the victim's prompt at the chunk level.
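
As a hedged sketch of the measurement itself (endpoint and model name are placeholders), an attacker sharing the backend only needs to time the first streamed token of a guessed prefix:

import time
from openai import OpenAI

client = OpenAI(base_url="http://VLLM_HOST:8000/v1", api_key="EMPTY")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="MODEL_NAME",
    messages=[{"role": "user", "content": "guessed shared prefix ..."}],
    stream=True,
    max_tokens=1,
)
next(iter(stream))                      # wait for the first token
ttft = time.perf_counter() - start      # a shorter TTFT suggests a prefix-cache hit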

Environment

  • GPU: NVIDIA A100 (40G)
  • CUDA: 11.8
  • PyTorch: 2.3.1
  • OS: Ubuntu 18.04
  • vLLM: v0.5.1
    Configuration: We launched vLLM using the default settings and adjusted chunk_size=2 to evaluate the TTFT.

Leakage

We conducted our tests using LLaMA2-70B-GPTQ on a single device. We analyzed the timing differences when prompts shared prefixes of 2 chunks, and plotted the corresponding ROC curves. Our results suggest that timing differences can be reliably used to distinguish prefix matches, demonstrating a potential side-channel vulnerability.
[Figure: combined ROC curves for 2-chunk prefix matching (roc_curves_combined_block_2) omitted.]

Results

In our experiment, we analyzed the response time differences between cache hits and misses in vLLM's PageAttention mechanism. Using ROC curve analysis to assess the distinguishability of these timing differences, we observed the following results:

  • With a 1-token prefix, the ROC curve yielded an AUC value of 0.571, indicating that even with a short prefix, an attacker can reasonably distinguish between cache hits and misses based on response times.
  • When the prefix length increases to 8 tokens, the AUC value rises significantly to 0.99, showing that the attacker can almost perfectly identify cache hits with a longer prefix.

Fixes

CVE-2025-48956

Summary

A Denial of Service (DoS) vulnerability can be triggered by sending a single HTTP GET request with an extremely large header to an HTTP endpoint. This results in server memory exhaustion, potentially leading to a crash or unresponsiveness. The attack does not require authentication, making it exploitable by any remote user.

Details

The vulnerability leverages the abuse of HTTP headers. By setting a header such as X-Forwarded-For to a very large value (e.g., "A" * 5_800_000_000), the server's HTTP parser or application logic may attempt to load the entire request into memory, overwhelming system resources.
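
An illustrative reproduction sketch (host, port, and endpoint path are placeholders, and the header size is reduced here; the report used roughly 5.8 GB):

import http.client

conn = http.client.HTTPConnection("VLLM_HOST", 8000)
# A single oversized header; unpatched servers buffer it entirely in memory.
conn.request("GET", "/health", headers={"X-Forwarded-For": "A" * 10_000_000})
print(conn.getresponse().status)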

Impact

Type of vulnerability: Denial of Service (DoS). Any remote, unauthenticated user with network access to the HTTP endpoint can trigger it.

Resolution

Upgrade to a version of vLLM that includes appropriate HTTP limits by default, or run a proxy in front of vLLM that provides protection against this issue.

CVE-2025-59425

Summary

The API key support in vLLM performed validation using a method that was vulnerable to a timing attack. This could potentially allow an attacker to discover a valid API key using an approach more efficient than brute force.

Details

https://github.com/vllm-project/vllm/blob/4b946d693e0af15740e9ca9c0e059d5f333b1083/vllm/entrypoints/openai/api_server.py#L1270-L1274

API key validation used a string comparison whose runtime grows with the number of leading characters of the provided API key that are correct. By analyzing timing data across many attempts, an attacker can determine when they have found the next correct character in the key sequence.
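
A minimal sketch of a constant-time check (not vLLM's actual patch): secrets.compare_digest takes time independent of how many leading characters match, unlike == on strings.

import secrets

def api_key_is_valid(provided: str, expected: str) -> bool:
    # Comparison time does not depend on the position of the first mismatch.
    return secrets.compare_digest(provided.encode(), expected.encode())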

Impact

Deployments relying on vLLM's built-in API key validation are vulnerable to authentication bypass using this technique.

CVE-2025-61620

Summary

A resource-exhaustion (denial-of-service) vulnerability exists in multiple endpoints of the OpenAI-Compatible Server due to the ability to specify Jinja templates via the chat_template and chat_template_kwargs parameters. If an attacker can supply these parameters to the API, they can cause a service outage by exhausting CPU and/or memory resources.

Details

When using an LLM as a chat model, the conversation history must be rendered into a text input for the model. In Hugging Face Transformers, this rendering is performed using a Jinja template. The OpenAI-Compatible Server launched by vllm serve exposes a chat_template parameter that lets users specify that template. In addition, the server accepts a chat_template_kwargs parameter to pass extra keyword arguments to the rendering function.

Because Jinja templates support programming-language-like constructs (loops, nested iterations, etc.), a crafted template can consume extremely large amounts of CPU and memory and thereby trigger a denial-of-service condition.

Importantly, simply forbidding the chat_template parameter does not fully mitigate the issue. The implementation constructs a dictionary of keyword arguments for apply_hf_chat_template and then updates that dictionary with the user-supplied chat_template_kwargs via dict.update. Since dict.update can overwrite existing keys, an attacker can place a chat_template key inside chat_template_kwargs to replace the template that will be used by apply_hf_chat_template.

# vllm/entrypoints/openai/serving_engine.py#L794-L816
_chat_template_kwargs: dict[str, Any] = dict(
    chat_template=chat_template,
    add_generation_prompt=add_generation_prompt,
    continue_final_message=continue_final_message,
    tools=tool_dicts,
    documents=documents,
)
_chat_template_kwargs.update(chat_template_kwargs or {})

request_prompt: Union[str, list[int]]
if isinstance(tokenizer, MistralTokenizer):
    ...
else:
    request_prompt = apply_hf_chat_template(
        tokenizer=tokenizer,
        conversation=conversation,
        model_config=model_config,
        **_chat_template_kwargs,
    )
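
For illustration only (model name is a placeholder), a request body exploiting the dict.update override described above could look like this:

payload = {
    "model": "MODEL_NAME",
    "messages": [{"role": "user", "content": "hi"}],
    # dict.update lets this nested key replace the server-side chat template.
    "chat_template_kwargs": {
        "chat_template": "{% for i in range(10**7) %}{{ i }}{% endfor %}",
    },
}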

Impact

If an OpenAI-Compatible Server exposes endpoints that accept chat_template or chat_template_kwargs from untrusted clients, an attacker can submit a malicious Jinja template (directly or by overriding chat_template inside chat_template_kwargs) that consumes excessive CPU and/or memory. This can result in a resource-exhaustion denial-of-service that renders the server unresponsive to legitimate requests.

Fixes

CVE-2025-6242

Summary

A Server-Side Request Forgery (SSRF) vulnerability exists in the MediaConnector class within the vLLM project's multimodal feature set. The load_from_url and load_from_url_async methods fetch and process media from user-provided URLs without adequate restrictions on the target hosts. This allows an attacker to coerce the vLLM server into making arbitrary requests to internal network resources.

This vulnerability is particularly critical in containerized environments like llm-d, where a compromised vLLM pod could be used to scan the internal network, interact with other pods, and potentially cause denial of service or access sensitive data. For example, an attacker could make the vLLM pod send malicious requests to an internal llm-d management endpoint, leading to system instability by falsely reporting metrics like the KV cache state.

Vulnerability Details

The core of the vulnerability lies in the MediaConnector.load_from_url method and its asynchronous counterpart. These methods accept a URL string to fetch media content (images, audio, video).

https://github.com/vllm-project/vllm/blob/119f683949dfed10df769fe63b2676d7f1eb644e/vllm/multimodal/utils.py#L97-L113

The function directly processes URLs with http, https, and file schemes. An attacker can supply a URL pointing to an internal IP address or a localhost endpoint. The vLLM server will then initiate a connection to this internal resource.

  • HTTP/HTTPS Scheme: An attacker can craft a request like {"image_url": "http://127.0.0.1:8080/internal_api"}. The vLLM server will send a GET request to this internal endpoint.
  • File Scheme: The _load_file_url method attempts to restrict file access to a subdirectory defined by --allowed-local-media-path. While this is a good security measure for local file access, it does not prevent network-based SSRF attacks.

Impact in llm-d Environments

The risk is significantly amplified in orchestrated environments such as llm-d, where multiple pods communicate over an internal network.

  1. Denial of Service (DoS): An attacker could target internal management endpoints of other services within the llm-d cluster. For instance, if a monitoring or metrics service is exposed internally, an attacker could send malformed requests to it. A specific example is an attacker causing the vLLM pod to call an internal API that reports a false KV cache utilization, potentially triggering incorrect scaling decisions or even a system shutdown.

  2. Internal Network Reconnaissance: Attackers can use the vulnerability to scan the internal network for open ports and services by providing URLs like http://10.0.0.X:PORT and observing the server's response time or error messages.

  3. Interaction with Internal Services: Any unsecured internal service becomes a potential target. This could include databases, internal APIs, or other model pods that might not have robust authentication, as they are not expected to be directly exposed.

Delegating this security responsibility to an upper-level orchestrator like llm-d is problematic. The orchestrator cannot easily distinguish between legitimate requests initiated by the vLLM engine for its own purposes and malicious requests originating from user input, thus complicating traffic filtering rules and increasing management overhead.

Proposed Mitigation

To address this vulnerability, it is essential to restrict the URLs that the MediaConnector can access. The principle of least privilege should be applied.

It is recommended to implement a configurable allowlist or denylist for domains and IP addresses.

  • Allowlist: The most secure approach is to allow connections only to a predefined list of trusted domains. This could be configured via a command-line argument, such as --allowed-media-domains. By default, this list could be empty, forcing administrators to explicitly enable external media fetching.

  • Denylist: Alternatively, a denylist could block access to private IP address ranges (127.0.0.1, 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) and other sensitive domains.

A check should be added at the beginning of the load_from_url methods to validate the parsed hostname against this list before any connection is made.

Example Implementation Idea:

import socket
from ipaddress import ip_address, ip_network
from urllib.parse import urlparse

# Denylist from the proposal above.
PRIVATE_IP_RANGES = ["127.0.0.1", "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"]

# In MediaConnector.__init__
self.allowed_domains = set(config.get("allowed_media_domains", []))
self.denied_ip_ranges = [ip_network(r) for r in PRIVATE_IP_RANGES]

# In MediaConnector.load_from_url
url_spec = urlparse(url)
hostname = url_spec.hostname

# Validate the hostname against the allowlist before any connection is made.
if self.allowed_domains and hostname not in self.allowed_domains:
    raise ValueError(f"Domain {hostname} is not in the allowed list.")

# Resolve the hostname and reject private or otherwise denied IP ranges.
resolved_ip = ip_address(socket.gethostbyname(hostname))
if any(resolved_ip in network for network in self.denied_ip_ranges):
    raise ValueError(f"Access to private IP address {resolved_ip} is forbidden.")

Integrating this control directly into vLLM empowers administrators to enforce security policies at the source, creating a more secure deployment by default and reducing the burden on higher-level infrastructure management.


Release Notes

vllm-project/vllm (vllm)

v0.11.0

Compare Source

Highlights

This release features 538 commits, 207 contributors (65 new contributors)!

  • This release completes the removal of the V0 engine. V0 engine code, including AsyncLLMEngine, LLMEngine, MQLLMEngine, all attention backends, and related components, has been removed. V1 is now the only engine in the codebase.
  • This release turns on FULL_AND_PIECEWISE as the default CUDA graph mode. This should provide better out-of-the-box performance for most models, particularly fine-grained MoEs, while preserving compatibility with models that only support PIECEWISE mode.

Note: In v0.11.0 (and v0.10.2), --async-scheduling will produce gibberish output in some cases such as preemption and others. This functionality is correct in v0.10.1. We are actively fixing it for the next version.

Model Support
Engine Core
Hardware & Performance
Large Scale Serving & Performance
  • Dual-Batch Overlap (DBO): Overlapping computation mechanism (#​23693), DeepEP high throughput + prefill (#​24845).
  • Data Parallelism: torchrun launcher (#​24899), Ray placement groups (#​25026), Triton DP/EP kernels (#​24588).
  • EPLB: Hunyuan V1 (#​23078), Mixtral (#​22842), static placement (#​23745), reduced overhead (#​24573).
  • Disaggregated serving: KV transfer metrics (#​22188), NIXL MLA latent dimension (#​25902).
  • MoE: Shared expert overlap optimization (#​24254), SiLU kernel for DeepSeek-R1 (#​24054), Enable Allgather/ReduceScatter backend for NaiveAllToAll (#​23964).
  • Distributed: NCCL symmetric memory with 3-4% throughput improvement (#​24532), enabled by default for TP (#​25070).
Quantization
API & Frontend
Security
Dependencies
V0 Deprecation

What's Changed


Configuration

📅 Schedule: Branch creation - "" (UTC), Automerge - At any time (no schedule defined).

🚦 Automerge: Enabled.

Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.


  • If you want to rebase/retry this PR, check this box

This PR has been generated by Renovate Bot.

| datasource | package | from  | to     |
| ---------- | ------- | ----- | ------ |
| pypi       | vllm    | 0.5.5 | 0.11.0 |
dreadnode-renovate-bot[bot] requested a review from a team as a code owner on October 7, 2025 at 20:04
dreadnode-renovate-bot[bot] added the type/digest (Dependency digest updates) and area/python (Changes to Python package configuration and dependencies) labels on Oct 7, 2025