92 changes: 92 additions & 0 deletions pocs/linux/kernelctf/CVE-2023-5717_lts_cos/docs/exploit.md
@@ -0,0 +1,92 @@
# CVE-2023-5717
## Overview
A `poll_list` is allocated next to a vulnerable array in `kmalloc-4096` and has its `next` pointer incremented, causing a misaligned free in `kmalloc-2048` when `next` is freed. This lets us overlap a `simple_xattr` with a `netlink_sock` to execute a ROP chain and escape the container.

## Performance Counters Background

The `perf_event_open()` syscall is used to measure information about a target process. The `perf_event_attr` argument specifies what to measure in the `.type` and `.config` fields. Since the kernelCTF instances are virtualized, only events that use the software PMU (`.type = PERF_TYPE_SOFTWARE` or `.type = PERF_TYPE_BREAKPOINT`) can be created. We will use events with `.type = PERF_TYPE_SOFTWARE` and `.config = PERF_COUNT_SW_PAGE_FAULTS` to count the number of page faults in the exploit process.

`perf_event_open()` returns a file descriptor which can be `read()` to get the event count. The perf event will only perform measurements while it is active. It can be activated and deactivated with `ioctl()` or the `.disabled` field in `perf_event_attr`.
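
As a benign illustration of the interface described above, the following sketch counts the page faults caused by touching freshly mapped memory. The helper name `count_page_faults` is hypothetical and is not taken from the exploit source; the attribute fields and `ioctl()` requests are the ones documented in the `perf_event_open(2)` man page. It returns -1 if the counter cannot be opened, for example when a sandbox blocks the syscall.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: count page faults taken while touching `pages`
 * freshly mapped anonymous pages. Returns -1 on any failure. */
static long long count_page_faults(size_t pages)
{
    assert(pages > 0);

    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_SOFTWARE;           /* software PMU */
    attr.config = PERF_COUNT_SW_PAGE_FAULTS;  /* count page faults */
    attr.disabled = 1;                        /* start inactive */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    /* pid = 0, cpu = -1: measure the calling thread on any CPU. */
    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
    if (fd < 0)
        return -1;

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = pages * page;
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        close(fd);
        return -1;
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);      /* activate the event */
    for (size_t i = 0; i < pages; i++)
        ((volatile char *)buf)[i * page] = 1; /* first touch faults */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);     /* deactivate the event */

    uint64_t count = 0;
    ssize_t got = read(fd, &count, sizeof(count));

    munmap(buf, len);
    close(fd);
    return got == (ssize_t)sizeof(count) ? (long long)count : -1;
}
```

The exact count depends on the kernel's paging behavior, so callers should treat it as a lower-bound style measurement rather than an exact number of touched pages.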

Events can be created as part of a group by passing the file descriptor of a group leader event to `perf_event_open()`. Events in the group will only be measured if the group leader is also active. If an event in the group has the `PERF_FORMAT_GROUP` flag in the `.read_format` field of its `perf_event_attr`, a `read()` on it outputs an array with the counts of all events in the group. The out-of-bounds increment occurs while preparing this array in `perf_read_group()`.
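
The size of that array follows from the group read layout documented in `perf_event_open(2)`: one `u64` for the event count, optional time fields, then one entry per event. The sketch below reproduces that arithmetic. The function name `group_read_size` is hypothetical, and the newer `PERF_FORMAT_LOST` flag (which adds another `u64` per event on recent kernels) is deliberately not modeled.

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bytes needed for a PERF_FORMAT_GROUP read of a
 * group with `nr_events` events (leader included), per the documented
 * layout. PERF_FORMAT_LOST is not modeled here. */
static size_t group_read_size(uint64_t read_format, size_t nr_events)
{
    assert(read_format & PERF_FORMAT_GROUP);

    size_t header = 1;    /* u64 nr */
    size_t per_event = 1; /* u64 value */

    if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
        header++;         /* u64 time_enabled */
    if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
        header++;         /* u64 time_running */
    if (read_format & PERF_FORMAT_ID)
        per_event++;      /* u64 id */

    return sizeof(uint64_t) * (header + nr_events * per_event);
}
```

With only `PERF_FORMAT_GROUP` set, a group of 511 events needs 8 * (1 + 511) = 4096 bytes.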

## Setting up events

We create an event group large enough that `perf_read_group()` will allocate a 4096-byte array after one event is removed. The vulnerability will then cause the 8 bytes after the end of the array to be incremented. The fds of this group's events are stored in `sib_fds[]`. The events in this group are created with the `inherit` attribute, meaning that new threads will be created with a copy of each event. These copies are added to the original `perf_event`'s `child_list`.

A new thread is immediately created and the event group `read_fds` is opened in it. The child events inherited by this thread will be the first ones seen by `perf_release_kernel()`. They share `ctx_lock` with the events in `read_fds`, so we can make `perf_release_kernel()` sleep by performing an operation on `read_fds`. This will be used to lengthen the race window.

## Preparing the heap

The vulnerable array is allocated and deallocated in the same syscall that performs the OOB increment, so we need to make sure that an object we want to increment is already allocated in front of the next open slot when we trigger the vulnerability.

The target of the increment is the `next` field of a `poll_list` object. A `poll_list` is allocated by the `poll()` syscall, which takes an array of file descriptors and sleeps until there is activity on one of them. The `do_sys_poll()` function allocates a list of `poll_list`s large enough to hold all of the passed file descriptors and event information. The head `poll_list` is 256 bytes on the stack, after which each subsequent `poll_list` is allocated on the heap with a max size of 4096 bytes. Since we want the target `poll_list` to have a `next` pointer, it has to be 4096 bytes large.
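
The sizes involved can be checked with simple arithmetic. The sketch below models how the file descriptors passed to `poll()` are split into heap chunks after the on-stack list is full. All names are hypothetical, and the constants are assumptions for x86-64 with 4096-byte pages: a 16-byte `poll_list` header and 8-byte `pollfd` entries.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sizes for x86-64 with 4 KiB pages (not read from kernel headers). */
enum {
    POLL_LIST_HEADER = 16,  /* next pointer + len, padded */
    POLLFD_SIZE      = 8,   /* int fd + short events + short revents */
    STACK_BYTES      = 256, /* on-stack head poll_list */
    PAGE_BYTES       = 4096,
    STACK_ENTRIES    = (STACK_BYTES - POLL_LIST_HEADER) / POLLFD_SIZE, /* 30 */
    PAGE_ENTRIES     = (PAGE_BYTES - POLL_LIST_HEADER) / POLLFD_SIZE   /* 510 */
};

/* Hypothetical helper: write the byte size of each heap-allocated
 * poll_list for a poll() call with `nfds` descriptors into `sizes`
 * (at most `max_chunks` entries) and return how many were written. */
static size_t poll_list_heap_sizes(size_t nfds, size_t *sizes, size_t max_chunks)
{
    assert(sizes != NULL);

    /* The first STACK_ENTRIES descriptors live in the on-stack list. */
    size_t todo = nfds > STACK_ENTRIES ? nfds - STACK_ENTRIES : 0;
    size_t n = 0;

    while (todo > 0 && n < max_chunks) {
        size_t len = todo < PAGE_ENTRIES ? todo : PAGE_ENTRIES;
        sizes[n++] = POLL_LIST_HEADER + len * POLLFD_SIZE;
        todo -= len;
    }
    return n;
}
```

Under these assumptions, a full heap chunk holds 510 entries and is exactly 4096 bytes, and any remaining entries go into a smaller final chunk whose size decides which `kmalloc` cache serves it.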

After spraying `simple_xattr`s to fill pre-existing slabs in `kmalloc-4096`, a thread is created for each `poll()` call. Enough file descriptors are passed for three `poll_list`s to be used in each call: the on-stack 256-byte `poll_list`, a target `poll_list` in `kmalloc-4096`, and a `poll_list` in `kmalloc-2048` which is used for the misaligned free. Three pipes are created for these threads to wait on. Every sixteenth thread waits on the first pipe and the rest wait on the other two. Enough threads are created to allocate more slabs of `poll_list`s than there are `poll_list`s per slab.

Writing to the first pipe will free every sixteenth `poll_list` in `kmalloc-4096`, leaving a hole in each slab. The active slab may have more than one free slot, so it is filled by spraying `simple_xattr`s. Some will be allocated in the holes in the other slabs, but we created enough slabs to account for this.

Now the next allocation in `kmalloc-4096` will be in a slab with one free slot and fifteen `poll_list`s, giving it at least a one in sixteen chance of being behind a `poll_list`.

## Triggering the vulnerability

Four threads are involved in the race, two of which call `close()` and `read()` to trigger the vulnerability while the other two control the timing of the race. An additional thread is created at the start of each race attempt to inherit the events in `sib_fds`. A number of page faults equal to the desired OOB increment are performed in this thread.

When one of the `sib_fds` is closed, `perf_release_kernel()` will remove it from the `sibling_list` and then remove each child event from the `sibling_list`s of the child threads. It will take a new per-thread `ctx_lock` and release the previous one as it goes from one child event to the next. We want to `read()` an event in `sib_fds` after the `ctx_lock` of the main thread is released and before the recently created child event is deleted.

There is another child event which will be deleted first since it was added earlier (when we created a thread during setup). This child shares its `ctx_lock` with the `read_fds` events, so we can make `perf_release_kernel()` sleep by making sure it is called during a `read()` on an event in `read_fds`. This lengthens the race window, and was also necessary to exploit the vulnerability when the container only had one CPU available. The `read()` can be made to sleep by passing a buffer that is in the process of being deallocated by `fallocate()`.

When `perf_read_group()` is called on the vulnerable event group, the `values` array will be too small to fit every event in the child group. The count of the last child event will be added to the start of the next slot after `values`, which should contain a `poll_list`. The number of increments is set such that when a `simple_xattr` is allocated in the misaligned slot, its header will be within the original slot and its data will overlap with the next slot.

The operations in the race should happen in this order:

1. A new thread is created and performs `NUM_INCS` page faults, setting the counts of its inherited events.
2. `falloc_thread` begins to deallocate a file mapped buffer.
3. The main thread calls `read()` on `read_fds[0]`. The read will cross a page boundary into the memory being deallocated by `falloc_thread` and fault while holding the `ctx_lock` of the child thread.
4. `close_thread` detects that the read is in progress and calls `close()` on an event in `sib_fds`. `perf_release_kernel()` will remove the event from the group's `sibling_list` and then sleep while waiting on the `ctx_lock` taken by the faulting read.
5. `read_thread` calls `read()` on `sib_fds[0]`, causing `perf_read_group()` to perform the out-of-bounds increment.

If the race fails, a new event is added to `sib_fds` before trying again to maintain the size of the vulnerable array.

## Misaligned free

Each of the `poll()` threads waits on two fds, one stored in its first `poll_list` and the other in the third. We have hopefully just incremented the `next` pointer of the second `poll_list` in one of these `poll()` threads, which will corrupt the header of the third `poll_list` and cause the fd stored in it to be ignored. Writing to this fd will free all of the `poll_list`s except the corrupted ones. The misaligned free can then be performed by writing to the other fd.

## Leaking kernel base

We want to allocate a `simple_xattr` in the misaligned slot and then allocate a `netlink_sock` over it, letting us leak the `netlink_sock`. When the `simple_xattr` is allocated it will overwrite the adjacent slot, so we need to make sure that the object allocated there can be safely freed after being overwritten. A `poll_list` works since it can be safely freed as long as the first 8 bytes are zero.

We first spray `poll_list`s in `kmalloc-2048`. The misaligned `next` pointer is then freed and a `simple_xattr` is allocated in its place, overlapping an adjacent `poll_list`. The sprayed `poll_list`s are freed and `netlink_sock`s are sprayed to replace them. One of them will be allocated over the `value` field of the misaligned `simple_xattr`. The kernel base is then calculated by reading the `netlink_sock`'s `sk_write_space` function pointer.

## ROP

After leaking the kernel base, the misaligned `simple_xattr` is freed and another `simple_xattr` containing a ROP chain is allocated over the `netlink_sock`. The `sk_write_space` pointer is replaced by the stack pivot

```
push rdi ; pop rsp ; pop r13 ; jmp __x86_return_thunk
```
on LTS and
```
push rbx ; pop rsp ; pop rbp ; jmp __x86_return_thunk
```
on COS. The rest of the chain is placed at the beginning of the `netlink_sock`:
```
0
pop rdi ; jmp __x86_return_thunk
0
prepare_kernel_cred()
pop rcx ; jmp __x86_return_thunk  
commit_creds()
mov rdi, rax ; jmp __x86_indirect_thunk_rcx
pop rdi ; jmp __x86_return_thunk
1
find_task_by_pid()
pop rsi ; jmp __x86_return_thunk
init_nsproxy
pop rcx ; jmp __x86_return_thunk  
switch_task_namespace()
mov rdi, rax ; jmp __x86_indirect_thunk_rcx
do_sys_vfork()
msleep()
```
The chain is executed by calling `setsockopt()` with `SO_SNDBUF`.
17 changes: 17 additions & 0 deletions pocs/linux/kernelctf/CVE-2023-5717_lts_cos/docs/vulnerability.md
@@ -0,0 +1,17 @@
There is a heap OOB write in Performance Events in `perf_read_group()`, which is reached when `read()` is called on an event created with the `PERF_FORMAT_GROUP` flag set in its `read_format` attribute.

`perf_read_group()` allocates a buffer using the passed event's `read_size`, then fills it by iterating over the event's `sibling_list` and its children's `sibling_list`s. If a child event has a larger `sibling_list` than the parent, it will overflow the allocated buffer. Memory after the end of the buffer will be incremented and possibly written to depending on the event's `read_format`.

This can be achieved by racing `perf_read_group()` with `perf_release_kernel()`, which is called when the last reference to an event's file descriptor is closed. `perf_release_kernel()` will first remove an event from its `sibling_list`, then release all locks before removing the children from their `sibling_list`s. This leaves a window during which the children's `sibling_list`s are longer than the parent's. If `perf_read_group()` is run on another event in the group during this window, the buffer overflow described above will happen.
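
The size mismatch can be pictured with a toy user-space model. This is not kernel code: `struct toy_group` and `toy_read_group` are hypothetical names, the buffer has no header fields, and the function only counts the accesses that would fall past the end of the buffer instead of performing them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for an event group: a list of per-event counts. */
struct toy_group {
    size_t nr_events;
    const uint64_t *counts;
};

/* Accumulate parent and child counts into `values`, which was sized
 * from the parent only (`capacity` slots). Returns how many additions
 * would have landed out of bounds; those additions are skipped. */
static size_t toy_read_group(const struct toy_group *parent,
                             const struct toy_group *child,
                             uint64_t *values, size_t capacity)
{
    assert(parent && child && values);

    const struct toy_group *groups[2] = { parent, child };
    size_t out_of_bounds = 0;

    for (size_t g = 0; g < 2; g++) {
        for (size_t i = 0; i < groups[g]->nr_events; i++) {
            if (i < capacity)
                values[i] += groups[g]->counts[i];
            else
                out_of_bounds++; /* the vulnerable code adds here anyway */
        }
    }
    return out_of_bounds;
}
```

If the parent has already dropped an event but the child has not, the child's group is one event longer than the buffer, and the model reports exactly one out-of-bounds addition.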

Steps to trigger the vulnerability:

- Create a group of events with the `PERF_FORMAT_GROUP` flag set in their `read_format` attribute and their `inherit` attribute set to 1.
- Create a separate event with `inherit` set to 0. This will prevent the kernel from doing an optimization that makes the vulnerability harder to trigger.
- Fork off a thread from the monitored process. This will create the inherited group of child events.
- Call `read()` on one event in the group and `close()` on another.
- Repeat until successful.

The vulnerability was fixed in `32671e3799ca ("perf: Disallow mis-matched inherited group reads")` in version 6.5.9. This commit says the bug was introduced with `fa8c269353d5 ("perf/core: Invert perf_read_group() loops")` in version 4.4 (backported to 3.16). However, the method for exploiting it described above relies on a later commit which reworked `perf_release_kernel()` (commit `c6e5b73242d2 ("perf: Synchronously clean up child events")` from version 4.5), and there may not be a way to exploit it earlier than this.

The vulnerability can be prevented by blocking the `perf_event_open()` syscall.
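
One way a process can apply this mitigation to itself and its descendants is a seccomp filter. The following is a minimal sketch using the raw seccomp BPF interface; the function name `block_perf_event_open` is hypothetical. It assumes a single-ABI process: a production filter should also check `seccomp_data.arch`, because syscall numbers differ between ABIs.

```c
#include <assert.h>
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: make perf_event_open() fail with EPERM for this
 * process and its future children. Returns 0 on success, -1 on error. */
static int block_perf_event_open(void)
{
    struct sock_filter filter[] = {
        /* Load the syscall number. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* If it is perf_event_open, fall through; otherwise skip one. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_perf_event_open, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };
    assert(prog.len == 4);

    /* Required so an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) == 0 ? 0 : -1;
}
```

Container runtimes usually expose the same control through their seccomp profile, which is generally preferable to per-process filters when it is available.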
@@ -0,0 +1,4 @@
CFLAGS = -Wno-incompatible-pointer-types -Wno-format -Wno-int-conversion -static
LDLIBS = -lpthread

exploit: exploit.c
	gcc $(CFLAGS) -o $@ $< $(LDLIBS)
Binary file not shown.