@@ -23,177 +23,166 @@ applications can additionally seal security critical data at runtime.
  A similar feature already exists in the XNU kernel with the
  VM_FLAGS_PERMANENT flag [1] and on OpenBSD with the mimmutable syscall [2].

- User API
- ========
- mseal()
- -----------
- The mseal() syscall has the following signature:
-
- ``int mseal(void addr, size_t len, unsigned long flags)``
-
- **addr/len**: virtual memory address range.
-
- The address range set by ``addr``/``len`` must meet:
- - The start address must be in an allocated VMA.
- - The start address must be page aligned.
- - The end address (``addr`` + ``len``) must be in an allocated VMA.
- - no gap (unallocated memory) between start and end address.
-
- The ``len`` will be paged aligned implicitly by the kernel.
-
- **flags**: reserved for future use.
-
- **return values**:
-
- - ``0``: Success.
-
- - ``-EINVAL``:
- - Invalid input ``flags``.
- - The start address (``addr``) is not page aligned.
- - Address range (``addr`` + ``len``) overflow.
-
- - ``-ENOMEM``:
- - The start address (``addr``) is not allocated.
- - The end address (``addr`` + ``len``) is not allocated.
- - A gap (unallocated memory) between start and end address.
-
- - ``-EPERM``:
- - sealing is supported only on 64-bit CPUs, 32-bit is not supported.
-
- - For above error cases, users can expect the given memory range is
- unmodified, i.e. no partial update.
-
- - There might be other internal errors/cases not listed here, e.g.
- error during merging/splitting VMAs, or the process reaching the max
- number of supported VMAs. In those cases, partial updates to the given
- memory range could happen. However, those cases should be rare.
-
- **Blocked operations after sealing**:
- Unmapping, moving to another location, and shrinking the size,
- via munmap() and mremap(), can leave an empty space, therefore
- can be replaced with a VMA with a new set of attributes.
-
- Moving or expanding a different VMA into the current location,
- via mremap().
-
- Modifying a VMA via mmap(MAP_FIXED).
-
- Size expansion, via mremap(), does not appear to pose any
- specific risks to sealed VMAs. It is included anyway because
- the use case is unclear. In any case, users can rely on
- merging to expand a sealed VMA.
-
- mprotect() and pkey_mprotect().
-
- Some destructive madvice() behaviors (e.g. MADV_DONTNEED)
- for anonymous memory, when users don't have write permission to the
- memory. Those behaviors can alter region contents by discarding pages,
- effectively a memset(0) for anonymous memory.
-
- Kernel will return -EPERM for blocked operations.
-
- For blocked operations, one can expect the given address is unmodified,
- i.e. no partial update. Note, this is different from existing mm
- system call behaviors, where partial updates are made till an error is
- found and returned to userspace. To give an example:
-
- Assume following code sequence:
-
- - ptr = mmap(null, 8192, PROT_NONE);
- - munmap(ptr + 4096, 4096);
- - ret1 = mprotect(ptr, 8192, PROT_READ);
- - mseal(ptr, 4096);
- - ret2 = mprotect(ptr, 8192, PROT_NONE);
-
- ret1 will be -ENOMEM, the page from ptr is updated to PROT_READ.
-
- ret2 will be -EPERM, the page remains to be PROT_READ.
-
- **Note**:
-
- - mseal() only works on 64-bit CPUs, not 32-bit CPU.
-
- - users can call mseal() multiple times, mseal() on an already sealed memory
- is a no-action (not error).
-
- - munseal() is not supported.
-
- Use cases:
- ==========
+ SYSCALL
+ =======
+ mseal syscall signature
+ -----------------------
+ ``int mseal(void *addr, size_t len, unsigned long flags)``
+
+ **addr**/**len**: virtual memory address range.
+    The address range set by **addr**/**len** must meet:
+       - The start address must be in an allocated VMA.
+       - The start address must be page aligned.
+       - The end address (**addr** + **len**) must be in an allocated VMA.
+       - No gap (unallocated memory) between start and end address.
+
+    The ``len`` will be page aligned implicitly by the kernel.
+
+ **flags**: reserved for future use.
+
+ **Return values**:
+    - **0**: Success.
+    - **-EINVAL**:
+       * Invalid input ``flags``.
+       * The start address (``addr``) is not page aligned.
+       * Address range (``addr`` + ``len``) overflow.
+    - **-ENOMEM**:
+       * The start address (``addr``) is not allocated.
+       * The end address (``addr`` + ``len``) is not allocated.
+       * A gap (unallocated memory) between start and end address.
+    - **-EPERM**:
+       * Sealing is supported only on 64-bit CPUs; 32-bit is not supported.
+
+ **Note about error return**:
+    - For the above error cases, users can expect the given memory range to be
+      unmodified, i.e. no partial update.
+    - There might be other internal errors/cases not listed here, e.g.
+      error during merging/splitting VMAs, or the process reaching the maximum
+      number of supported VMAs. In those cases, partial updates to the given
+      memory range could happen. However, those cases should be rare.
+
+ **Architecture support**:
+    mseal only works on 64-bit CPUs, not 32-bit CPUs.
+
+ **Idempotent**:
+    Users can call mseal multiple times. Calling mseal on already sealed memory
+    is a no-op (not an error).
+
+ **no munseal**
+    Once a mapping is sealed, it can't be unsealed. The kernel should never
+    have munseal; this is consistent with other sealing features, e.g.
+    F_SEAL_SEAL for files.
+
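The documented return values and the idempotent behavior can be exercised with a small probe. The following is a minimal sketch and not part of the patch: it assumes a Linux system, invokes the syscall through syscall(2) because the C library may not provide an mseal wrapper, and assumes syscall number 462 when the headers do not define ``__NR_mseal``. The names ``sys_mseal()``, ``fails_with()`` and ``check_mseal_returns()`` are hypothetical and used only for illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: mseal is syscall 462 when the libc headers do not define it. */
#ifndef __NR_mseal
#define __NR_mseal 462
#endif

/* Hypothetical raw wrapper: returns 0, or -1 with errno set. */
static int sys_mseal(void *addr, size_t len, unsigned long flags)
{
    return (int)syscall(__NR_mseal, addr, len, flags);
}

/* Hypothetical helper: 1 if mseal(addr, len, flags) fails with errno == want. */
static int fails_with(void *addr, size_t len, unsigned long flags, int want)
{
    errno = 0;
    return sys_mseal(addr, len, flags) == -1 && errno == want;
}

/*
 * Returns -1 if the setup fails or mseal is unavailable, otherwise the
 * number of checks that differ from the documented return values.
 */
int check_mseal_returns(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, 2 * page, PROT_READ,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int mismatches = 0;

    if (p == MAP_FAILED)
        return -1;
    /* Unmap the second page so that it is known to be unallocated. */
    if (munmap(p + page, page) != 0)
        return -1;
    /* First seal; a failure here is treated as "mseal not available". */
    if (sys_mseal(p, page, 0) != 0) {
        munmap(p, page);
        return -1;
    }

    mismatches += !fails_with(p, page, 1UL, EINVAL);     /* non-zero flags */
    mismatches += !fails_with(p + 1, page, 0, EINVAL);    /* unaligned start */
    mismatches += !fails_with(p + page, page, 0, ENOMEM); /* unallocated range */
    mismatches += sys_mseal(p, page, 0) != 0;             /* sealing again succeeds */
    return mismatches;
}
```

A caller could treat -1 as "mseal is not available here" and any positive value as a deviation from the list above. Note that the first page stays sealed, so it remains mapped for the rest of the process's lifetime.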
+ Blocked mm syscall for sealed mapping
+ -------------------------------------
+ It is important to note: **once the mapping is sealed, it will stay in the
+ process's memory until the process terminates or the exec system call is invoked**.
+
+ Example::
+
+       ptr = mmap(0, 4096, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+       rc = mseal(ptr, 4096, 0);
+       /* munmap will fail */
+       rc = munmap(ptr, 4096);
+       assert(rc < 0);
+
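The example above can be turned into a self-contained function. This is a sketch under stated assumptions rather than a reference implementation: it calls mseal through syscall(2), falls back to syscall number 462 if the headers lack ``__NR_mseal``, and reports "unavailable" instead of failing when the kernel or sandbox does not support the call. The function name ``seal_then_unmap()`` is hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: mseal is syscall 462 when the libc headers do not define it. */
#ifndef __NR_mseal
#define __NR_mseal 462
#endif

/*
 * Returns 1 if the page was sealed and munmap was refused with EPERM,
 * 0 if mseal is unavailable, and -1 on any unexpected result.
 */
int seal_then_unmap(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    void *ptr = mmap(NULL, page, PROT_READ,
                     MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

    if (ptr == MAP_FAILED)
        return -1;

    if (syscall(__NR_mseal, ptr, page, 0UL) != 0) {
        /* Not sealed, so the mapping can still be released normally. */
        munmap(ptr, page);
        return 0;
    }

    errno = 0;
    if (munmap(ptr, page) == 0)
        return -1; /* a sealed mapping should not be unmappable */
    return errno == EPERM ? 1 : -1;
}
```

Checking ``errno`` for EPERM, rather than only ``rc < 0``, distinguishes a seal-related refusal from other munmap failures.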
+ Blocked mm syscall:
+    - munmap
+    - mmap
+    - mremap
+    - mprotect and pkey_mprotect
+    - some destructive madvise behaviors: MADV_DONTNEED, MADV_FREE,
+      MADV_DONTNEED_LOCKED, MADV_DONTFORK, MADV_WIPEONFORK
+
+ The first set of syscalls to block is munmap, mremap, mmap. They can
+ either leave an empty space in the address space, therefore allowing
+ replacement with a new mapping with a new set of attributes, or can
+ overwrite the existing mapping with another mapping.
+
+ mprotect and pkey_mprotect are blocked because they change the
+ protection bits (RWX) of the mapping.
+
+ Certain destructive madvise behaviors, specifically MADV_DONTNEED,
+ MADV_FREE, MADV_DONTNEED_LOCKED, and MADV_WIPEONFORK, can introduce
+ risks when applied to anonymous memory by threads lacking write
+ permissions. Consequently, these operations are prohibited under such
+ conditions. The aforementioned behaviors have the potential to modify
+ region contents by discarding pages, effectively performing a memset(0)
+ operation on the anonymous memory.
+
+ The kernel returns -EPERM for blocked syscalls.
+
+ When a blocked syscall returns -EPERM due to sealing, the memory regions may
+ or may not be changed, depending on the syscall being blocked:
+
+    - munmap: munmap is atomic. If one of the VMAs in the given range is
+      sealed, none of the VMAs are updated.
+    - mprotect, pkey_mprotect, madvise: partial update might happen, e.g.
+      when mprotect is applied over multiple VMAs, it might update the
+      beginning VMAs before reaching the sealed VMA and returning -EPERM.
+    - mmap and mremap: undefined behavior.
+
+ Use cases
+ =========
  - glibc:
    The dynamic linker, during loading ELF executables, can apply sealing to
-   non-writable memory segments.
-
- - Chrome browser: protect some security sensitive data-structures.
+   mapping segments.

- Notes on which memory to seal:
- ==============================
+ - Chrome browser: protect some security sensitive data structures.

- It might be important to note that sealing changes the lifetime of a mapping,
- i.e. the sealed mapping won’t be unmapped till the process terminates or the
- exec system call is invoked. Applications can apply sealing to any virtual
- memory region from userspace, but it is crucial to thoroughly analyze the
- mapping's lifetime prior to apply the sealing.
+ When not to use mseal
+ =====================
+ Applications can apply sealing to any virtual memory region from userspace,
+ but it is *crucial to thoroughly analyze the mapping's lifetime* prior to
+ applying the sealing. This is because the sealed mapping *won’t be unmapped*
+ until the process terminates or the exec system call is invoked.

  For example:
+ - aio/shm
+   aio/shm can call mmap and munmap on behalf of userspace, e.g.
+   ksys_shmdt() in shm.c. The lifetimes of those mappings are not tied to
+   the lifetime of the process. If those memories are sealed from userspace,
+   then munmap will fail, causing leaks in VMA address space during the
+   lifetime of the process.
+
+ - ptr allocated by malloc (heap)
+   Don't use mseal on the memory ptr returned from malloc().
+   malloc() is implemented by an allocator, e.g. by glibc. The heap manager
+   might allocate a ptr from brk or from a mapping created by mmap.
+   If an app calls mseal on a ptr returned from malloc(), this can affect
+   the heap manager's ability to manage the mappings; the outcome is
+   non-deterministic.
+
+   Example::
+
+       ptr = malloc(size);
+       /* don't call mseal on ptr returned from malloc. */
+       mseal(ptr, size, 0);
+       /* free will succeed, but the allocator can't shrink the heap below ptr */
+       free(ptr);
+
+ mseal doesn't block
+ ===================
+ In a nutshell, mseal blocks certain mm syscalls from modifying some of a VMA's
+ attributes, such as protection bits (RWX). A sealed mapping doesn't mean the
+ memory is immutable.

- - aio/shm
-
- aio/shm can call mmap()/munmap() on behalf of userspace, e.g. ksys_shmdt() in
- shm.c. The lifetime of those mapping are not tied to the lifetime of the
- process. If those memories are sealed from userspace, then munmap() will fail,
- causing leaks in VMA address space during the lifetime of the process.
-
- - Brk (heap)
-
- Currently, userspace applications can seal parts of the heap by calling
- malloc() and mseal().
- let's assume following calls from user space:
-
- - ptr = malloc(size);
- - mprotect(ptr, size, RO);
- - mseal(ptr, size);
- - free(ptr);
-
- Technically, before mseal() is added, the user can change the protection of
- the heap by calling mprotect(RO). As long as the user changes the protection
- back to RW before free(), the memory range can be reused.
-
- Adding mseal() into the picture, however, the heap is then sealed partially,
- the user can still free it, but the memory remains to be RO. If the address
- is re-used by the heap manager for another malloc, the process might crash
- soon after. Therefore, it is important not to apply sealing to any memory
- that might get recycled.
-
- Furthermore, even if the application never calls the free() for the ptr,
- the heap manager may invoke the brk system call to shrink the size of the
- heap. In the kernel, the brk-shrink will call munmap(). Consequently,
- depending on the location of the ptr, the outcome of brk-shrink is
- nondeterministic.
-
-
- Additional notes:
- =================
  As Jann Horn pointed out in [3], there are still a few ways to write
- to RO memory, which is, in a way, by design. Those cases are not covered
- by mseal(). If applications want to block such cases, sandbox tools (such as
- seccomp, LSM, etc) might be considered.
+ to RO memory, which is, in a way, by design. Those cases could be blocked
+ by different security measures.

  Those cases are:

- - Write to read-only memory through /proc/self/mem interface.
- - Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
- - userfaultfd.
+ - Write to read-only memory through /proc/self/mem interface (FOLL_FORCE).
+ - Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
+ - userfaultfd.
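The first case in the list can be illustrated without mseal at all, since it concerns read-only memory in general. The following is a minimal sketch, assuming a 64-bit Linux process with /proc mounted; the function name ``write_ro_via_proc_mem()`` is hypothetical, and the write may be refused on systems that restrict this interface, in which case the function reports 0.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 1 if a byte in a read-only private mapping was changed through
 * /proc/self/mem, 0 if the write was refused or /proc is unavailable,
 * and -1 if the setup fails.
 */
int write_ro_via_proc_mem(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    /* volatile: force a real reload of p[0] after the out-of-band write */
    volatile char *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                            MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    int changed = 0;
    int fd;

    if (p == MAP_FAILED)
        return -1;
    p[0] = 'A';
    /* Drop write permission; direct stores to p would now fault. */
    if (mprotect((void *)p, page, PROT_READ) != 0) {
        munmap((void *)p, page);
        return -1;
    }

    fd = open("/proc/self/mem", O_RDWR);
    if (fd >= 0) {
        /* The file offset is the virtual address to write to. */
        if (pwrite(fd, "B", 1, (off_t)(uintptr_t)p) == 1)
            changed = (p[0] == 'B');
        close(fd);
    }
    munmap((void *)p, page);
    return changed;
}
```

Because sealing only stops the listed mm syscalls from changing the mapping, a result of 1 here is consistent with the statement that a sealed mapping is not necessarily immutable.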

  The idea that inspired this patch comes from Stephen Röttger’s work in V8
  CFI [4]. Chrome browser in ChromeOS will be the first user of this API.

- Reference:
- ==========
- [1] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
-
- [2] https://man.openbsd.org/mimmutable.2
-
- [3] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com
-
- [4] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc
+ Reference
+ =========
+ - [1] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
+ - [2] https://man.openbsd.org/mimmutable.2
+ - [3] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com
+ - [4] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc