Commit 2329d5c

peaktocreek authored and akpm00 committed
mseal: update mseal.rst

Pedro Falcato's optimization [1] for checking sealed VMAs, which replaces
the can_modify_mm() function with an in-loop check, necessitates an update
to the mseal.rst documentation to reflect this change.

Furthermore, the document has received offline comments regarding the code
sample and suggestions for sentence clarification to enhance reader
comprehension.

[1] https://lore.kernel.org/linux-mm/20240817-mseal-depessimize-v3-0-d8d2e037df30@gmail.com/

Update doc after in-loop change: mprotect/madvise can have
partially updated and munmap is atomic.

Fix indentation and clarify some sections to improve readability.

Link: https://lkml.kernel.org/r/20241008040942.1478931-2-jeffxu@chromium.org
Fixes: df2a7df ("mm/munmap: replace can_modify_mm with can_modify_vma")
Fixes: 4a2dd02 ("mm/mprotect: replace can_modify_mm with can_modify_vma")
Fixes: 3807567 ("mm/mremap: replace can_modify_mm with can_modify_vma")
Fixes: 23c57d1 ("mseal: replace can_modify_mm_madv with a vma variant")
Signed-off-by: Jeff Xu <jeffxu@chromium.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Elliott Hughes <enh@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <groeck@chromium.org>
Cc: Jann Horn <jannh@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jorge Lucangeli Obes <jorgelo@chromium.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: Pedro Falcato <pedro.falcato@gmail.com>
Cc: Stephen Röttger <sroettger@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Theo de Raadt" <deraadt@openbsd.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

1 parent e05411d commit 2329d5c
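
The commit message says that, after the in-loop change, mprotect can leave a range partially updated when it reaches a sealed VMA. The sketch below is one way to observe that from user space; it is not part of the patch. It assumes Linux on a 64-bit CPU and assumes that 462 is the mseal syscall number when the libc headers do not define `__NR_mseal`. The helper names `is_writable` and `probe_partial_mprotect` are hypothetical. The probe reports what the running kernel does and does not assume either outcome.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: mseal is syscall 462 when the headers are too old to say so. */
#ifndef __NR_mseal
#define __NR_mseal 462
#endif

/*
 * Returns 1 if the kernel can store a byte at addr, 0 otherwise.
 * read(2) from a pipe is used so that a read-only page produces an
 * EFAULT error instead of a SIGSEGV in this process.
 */
static int is_writable(void *addr)
{
    int fds[2];
    char byte = 'x';
    int ok = 0;

    if (pipe(fds) != 0)
        return 0;
    if (write(fds[1], &byte, 1) == 1)
        ok = (read(fds[0], addr, 1) == 1);
    close(fds[0]);
    close(fds[1]);
    return ok;
}

/*
 * Return values (hypothetical convention for this sketch):
 *  -1  setup failed or an unexpected error occurred
 *   0  mseal is not available (ENOSYS, or EPERM e.g. on 32-bit)
 *   1  mprotect failed with EPERM and the unsealed page was left unchanged
 *   2  mprotect failed with EPERM but the unsealed page was updated
 *   3  mprotect succeeded despite the seal
 */
static int probe_partial_mprotect(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = (size_t)page;

    char *ptr = mmap(NULL, 2 * len, PROT_READ,
                     MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    if (ptr == MAP_FAILED)
        return -1;

    /* Sealing only the second page leaves an unsealed VMA in front of it. */
    if (syscall(__NR_mseal, ptr + len, len, 0UL) != 0) {
        int err = errno;
        munmap(ptr, 2 * len);
        return (err == ENOSYS || err == EPERM) ? 0 : -1;
    }

    /* Ask for read/write over both pages; the sealed page must refuse. */
    int rc = mprotect(ptr, 2 * len, PROT_READ | PROT_WRITE);
    int err = errno;
    int result;

    if (rc == 0)
        result = 3;
    else if (err != EPERM)
        result = -1;
    else
        result = is_writable(ptr) ? 2 : 1;

    munmap(ptr, len); /* only the unsealed first page can be unmapped */
    return result;
}
```

Calling `probe_partial_mprotect()` from a small `main()` and printing the result is enough to try it. The sealed second page stays mapped until the process exits, which is the lifetime effect the updated document warns about.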

1 file changed: +148 -159 lines changed
Documentation/userspace-api/mseal.rst

@@ -23,177 +23,166 @@ applications can additionally seal security critical data at runtime.
 A similar feature already exists in the XNU kernel with the
 VM_FLAGS_PERMANENT flag [1] and on OpenBSD with the mimmutable syscall [2].
 
-User API
-========
-mseal()
------------
-The mseal() syscall has the following signature:
-
-``int mseal(void addr, size_t len, unsigned long flags)``
-
-**addr/len**: virtual memory address range.
-
-The address range set by ``addr``/``len`` must meet:
-- The start address must be in an allocated VMA.
-- The start address must be page aligned.
-- The end address (``addr`` + ``len``) must be in an allocated VMA.
-- no gap (unallocated memory) between start and end address.
-
-The ``len`` will be paged aligned implicitly by the kernel.
-
-**flags**: reserved for future use.
-
-**return values**:
-
-- ``0``: Success.
-
-- ``-EINVAL``:
-- Invalid input ``flags``.
-- The start address (``addr``) is not page aligned.
-- Address range (``addr`` + ``len``) overflow.
-
-- ``-ENOMEM``:
-- The start address (``addr``) is not allocated.
-- The end address (``addr`` + ``len``) is not allocated.
-- A gap (unallocated memory) between start and end address.
-
-- ``-EPERM``:
-- sealing is supported only on 64-bit CPUs, 32-bit is not supported.
-
-- For above error cases, users can expect the given memory range is
-unmodified, i.e. no partial update.
-
-- There might be other internal errors/cases not listed here, e.g.
-error during merging/splitting VMAs, or the process reaching the max
-number of supported VMAs. In those cases, partial updates to the given
-memory range could happen. However, those cases should be rare.
-
-**Blocked operations after sealing**:
-Unmapping, moving to another location, and shrinking the size,
-via munmap() and mremap(), can leave an empty space, therefore
-can be replaced with a VMA with a new set of attributes.
-
-Moving or expanding a different VMA into the current location,
-via mremap().
-
-Modifying a VMA via mmap(MAP_FIXED).
-
-Size expansion, via mremap(), does not appear to pose any
-specific risks to sealed VMAs. It is included anyway because
-the use case is unclear. In any case, users can rely on
-merging to expand a sealed VMA.
-
-mprotect() and pkey_mprotect().
-
-Some destructive madvice() behaviors (e.g. MADV_DONTNEED)
-for anonymous memory, when users don't have write permission to the
-memory. Those behaviors can alter region contents by discarding pages,
-effectively a memset(0) for anonymous memory.
-
-Kernel will return -EPERM for blocked operations.
-
-For blocked operations, one can expect the given address is unmodified,
-i.e. no partial update. Note, this is different from existing mm
-system call behaviors, where partial updates are made till an error is
-found and returned to userspace. To give an example:
-
-Assume following code sequence:
-
-- ptr = mmap(null, 8192, PROT_NONE);
-- munmap(ptr + 4096, 4096);
-- ret1 = mprotect(ptr, 8192, PROT_READ);
-- mseal(ptr, 4096);
-- ret2 = mprotect(ptr, 8192, PROT_NONE);
-
-ret1 will be -ENOMEM, the page from ptr is updated to PROT_READ.
-
-ret2 will be -EPERM, the page remains to be PROT_READ.
-
-**Note**:
-
-- mseal() only works on 64-bit CPUs, not 32-bit CPU.
-
-- users can call mseal() multiple times, mseal() on an already sealed memory
-is a no-action (not error).
-
-- munseal() is not supported.
-
-Use cases:
-==========
+SYSCALL
+=======
+mseal syscall signature
+-----------------------
+``int mseal(void \* addr, size_t len, unsigned long flags)``
+
+**addr**/**len**: virtual memory address range.
+The address range set by **addr**/**len** must meet:
+- The start address must be in an allocated VMA.
+- The start address must be page aligned.
+- The end address (**addr** + **len**) must be in an allocated VMA.
+- no gap (unallocated memory) between start and end address.
+
+The ``len`` will be paged aligned implicitly by the kernel.
+
+**flags**: reserved for future use.
+
+**Return values**:
+- **0**: Success.
+- **-EINVAL**:
+* Invalid input ``flags``.
+* The start address (``addr``) is not page aligned.
+* Address range (``addr`` + ``len``) overflow.
+- **-ENOMEM**:
+* The start address (``addr``) is not allocated.
+* The end address (``addr`` + ``len``) is not allocated.
+* A gap (unallocated memory) between start and end address.
+- **-EPERM**:
+* sealing is supported only on 64-bit CPUs, 32-bit is not supported.
+
+**Note about error return**:
+- For above error cases, users can expect the given memory range is
+unmodified, i.e. no partial update.
+- There might be other internal errors/cases not listed here, e.g.
+error during merging/splitting VMAs, or the process reaching the maximum
+number of supported VMAs. In those cases, partial updates to the given
+memory range could happen. However, those cases should be rare.
+
+**Architecture support**:
+mseal only works on 64-bit CPUs, not 32-bit CPUs.
+
+**Idempotent**:
+users can call mseal multiple times. mseal on an already sealed memory
+is a no-action (not error).
+
+**no munseal**
+Once mapping is sealed, it can't be unsealed. The kernel should never
+have munseal, this is consistent with other sealing feature, e.g.
+F_SEAL_SEAL for file.
+
+Blocked mm syscall for sealed mapping
+-------------------------------------
+It might be important to note: **once the mapping is sealed, it will
+stay in the process's memory until the process terminates**.
+
+Example::
+
+*ptr = mmap(0, 4096, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
+rc = mseal(ptr, 4096, 0);
+/* munmap will fail */
+rc = munmap(ptr, 4096);
+assert(rc < 0);
+
+Blocked mm syscall:
+- munmap
+- mmap
+- mremap
+- mprotect and pkey_mprotect
+- some destructive madvise behaviors: MADV_DONTNEED, MADV_FREE,
+MADV_DONTNEED_LOCKED, MADV_FREE, MADV_DONTFORK, MADV_WIPEONFORK
+
+The first set of syscalls to block is munmap, mremap, mmap. They can
+either leave an empty space in the address space, therefore allowing
+replacement with a new mapping with new set of attributes, or can
+overwrite the existing mapping with another mapping.
+
+mprotect and pkey_mprotect are blocked because they changes the
+protection bits (RWX) of the mapping.
+
+Certain destructive madvise behaviors, specifically MADV_DONTNEED,
+MADV_FREE, MADV_DONTNEED_LOCKED, and MADV_WIPEONFORK, can introduce
+risks when applied to anonymous memory by threads lacking write
+permissions. Consequently, these operations are prohibited under such
+conditions. The aforementioned behaviors have the potential to modify
+region contents by discarding pages, effectively performing a memset(0)
+operation on the anonymous memory.
+
+Kernel will return -EPERM for blocked syscalls.
+
+When blocked syscall return -EPERM due to sealing, the memory regions may
+or may not be changed, depends on the syscall being blocked:
+
+- munmap: munmap is atomic. If one of VMAs in the given range is
+sealed, none of VMAs are updated.
+- mprotect, pkey_mprotect, madvise: partial update might happen, e.g.
+when mprotect over multiple VMAs, mprotect might update the beginning
+VMAs before reaching the sealed VMA and return -EPERM.
+- mmap and mremap: undefined behavior.
+
+Use cases
+=========
 - glibc:
 The dynamic linker, during loading ELF executables, can apply sealing to
-non-writable memory segments.
-
-- Chrome browser: protect some security sensitive data-structures.
+mapping segments.
 
-Notes on which memory to seal:
-==============================
+- Chrome browser: protect some security sensitive data structures.
 
-It might be important to note that sealing changes the lifetime of a mapping,
-i.e. the sealed mapping won’t be unmapped till the process terminates or the
-exec system call is invoked. Applications can apply sealing to any virtual
-memory region from userspace, but it is crucial to thoroughly analyze the
-mapping's lifetime prior to apply the sealing.
+When not to use mseal
+=====================
+Applications can apply sealing to any virtual memory region from userspace,
+but it is *crucial to thoroughly analyze the mapping's lifetime* prior to
+apply the sealing. This is because the sealed mapping *won’t be unmapped*
+until the process terminates or the exec system call is invoked.
 
 For example:
+- aio/shm
+aio/shm can call mmap and munmap on behalf of userspace, e.g.
+ksys_shmdt() in shm.c. The lifetimes of those mapping are not tied to
+the lifetime of the process. If those memories are sealed from userspace,
+then munmap will fail, causing leaks in VMA address space during the
+lifetime of the process.
+
+- ptr allocated by malloc (heap)
+Don't use mseal on the memory ptr return from malloc().
+malloc() is implemented by allocator, e.g. by glibc. Heap manager might
+allocate a ptr from brk or mapping created by mmap.
+If an app calls mseal on a ptr returned from malloc(), this can affect
+the heap manager's ability to manage the mappings; the outcome is
+non-deterministic.
+
+Example::
+
+ptr = malloc(size);
+/* don't call mseal on ptr return from malloc. */
+mseal(ptr, size);
+/* free will success, allocator can't shrink heap lower than ptr */
+free(ptr);
+
+mseal doesn't block
+===================
+In a nutshell, mseal blocks certain mm syscall from modifying some of VMA's
+attributes, such as protection bits (RWX). Sealed mappings doesn't mean the
+memory is immutable.
 
-- aio/shm
-
-aio/shm can call mmap()/munmap() on behalf of userspace, e.g. ksys_shmdt() in
-shm.c. The lifetime of those mapping are not tied to the lifetime of the
-process. If those memories are sealed from userspace, then munmap() will fail,
-causing leaks in VMA address space during the lifetime of the process.
-
-- Brk (heap)
-
-Currently, userspace applications can seal parts of the heap by calling
-malloc() and mseal().
-let's assume following calls from user space:
-
-- ptr = malloc(size);
-- mprotect(ptr, size, RO);
-- mseal(ptr, size);
-- free(ptr);
-
-Technically, before mseal() is added, the user can change the protection of
-the heap by calling mprotect(RO). As long as the user changes the protection
-back to RW before free(), the memory range can be reused.
-
-Adding mseal() into the picture, however, the heap is then sealed partially,
-the user can still free it, but the memory remains to be RO. If the address
-is re-used by the heap manager for another malloc, the process might crash
-soon after. Therefore, it is important not to apply sealing to any memory
-that might get recycled.
-
-Furthermore, even if the application never calls the free() for the ptr,
-the heap manager may invoke the brk system call to shrink the size of the
-heap. In the kernel, the brk-shrink will call munmap(). Consequently,
-depending on the location of the ptr, the outcome of brk-shrink is
-nondeterministic.
-
-
-Additional notes:
-=================
 As Jann Horn pointed out in [3], there are still a few ways to write
-to RO memory, which is, in a way, by design. Those cases are not covered
-by mseal(). If applications want to block such cases, sandbox tools (such as
-seccomp, LSM, etc) might be considered.
+to RO memory, which is, in a way, by design. And those could be blocked
+by different security measures.
 
 Those cases are:
 
-- Write to read-only memory through /proc/self/mem interface.
-- Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
-- userfaultfd.
+- Write to read-only memory through /proc/self/mem interface (FOLL_FORCE).
+- Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
+- userfaultfd.
 
 The idea that inspired this patch comes from Stephen Röttger’s work in V8
 CFI [4]. Chrome browser in ChromeOS will be the first user of this API.
 
-Reference:
-==========
-[1] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
-
-[2] https://man.openbsd.org/mimmutable.2
-
-[3] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com
-
-[4] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc
+Reference
+=========
+- [1] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
+- [2] https://man.openbsd.org/mimmutable.2
+- [3] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com
+- [4] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc
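
The munmap example added to mseal.rst in this diff is a fragment. A compilable variant might look like the sketch below, which is not part of the patch. It assumes Linux, assumes that 462 is the mseal syscall number when the libc headers do not define `__NR_mseal`, and uses a raw `syscall()` because a libc may not provide an `mseal()` wrapper. The names `sys_mseal` and `seal_then_unmap` are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: mseal is syscall 462 when the headers are too old to say so. */
#ifndef __NR_mseal
#define __NR_mseal 462
#endif

/* Thin wrapper around the raw syscall; returns 0 or -1 with errno set. */
static int sys_mseal(void *addr, size_t len, unsigned long flags)
{
    return (int)syscall(__NR_mseal, addr, len, flags);
}

/*
 * Return values (hypothetical convention for this sketch):
 *  -1  setup failed or an unexpected result was seen
 *   0  mseal is not available (ENOSYS, or EPERM e.g. on 32-bit)
 *   1  the mapping was sealed and munmap was rejected with EPERM
 */
static int seal_then_unmap(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = (size_t)page;

    /* One page-aligned, read-only anonymous page. */
    void *ptr = mmap(NULL, len, PROT_READ,
                     MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    if (ptr == MAP_FAILED)
        return -1;

    if (sys_mseal(ptr, len, 0) != 0) {
        int err = errno;
        munmap(ptr, len); /* not sealed, so this cleanup can succeed */
        return (err == ENOSYS || err == EPERM) ? 0 : -1;
    }

    /* The mapping is sealed: munmap is expected to fail with EPERM. */
    if (munmap(ptr, len) == 0)
        return -1;
    return (errno == EPERM) ? 1 : -1;
}
```

Calling `seal_then_unmap()` from a small `main()` is enough to try it. On a kernel with mseal support the page stays mapped until the process exits, so this pattern only suits mappings meant to live that long.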
