= Userfaultfd =

== Objective ==

Userfaults allow the implementation of on-demand paging from userland
and more generally they allow userland to take control of various
memory page faults, something otherwise only the kernel code could do.

For example, userfaults allow a proper and more optimal implementation
of the PROT_NONE+SIGSEGV trick.

== Design ==

Userfaults are delivered and resolved through the userfaultfd syscall.

The userfaultfd (aside from registering and unregistering virtual
memory ranges) provides two primary functionalities:

1) read/POLLIN protocol to notify a userland thread of the faults
   happening

2) various UFFDIO_* ioctls that can manage the virtual memory regions
   registered in the userfaultfd, allowing userland to efficiently
   resolve the userfaults it receives via 1) or to manage the virtual
   memory in the background

The real advantage of userfaults compared to regular virtual memory
management via mremap/mprotect is that userfault operations never
involve heavyweight structures like vmas (in fact the userfaultfd
runtime load never takes the mmap_sem for writing).

Vmas are not suitable for page- (or hugepage-) granular fault tracking
when dealing with virtual address spaces that could span
Terabytes. Too many vmas would be needed for that.

The userfaultfd, once opened by invoking the syscall, can also be
passed using unix domain sockets to a manager process, so the same
manager process could handle the userfaults of a multitude of
different processes without them being aware of what is going on
(well of course unless they later try to use the userfaultfd
themselves on the same region the manager is already tracking, which
is a corner case that would currently return -EBUSY).

== API ==

When first opened the userfaultfd must be enabled by invoking the
UFFDIO_API ioctl with uffdio_api.api set to UFFD_API (or a later API
version), which specifies the read/POLLIN protocol userland intends
to speak on the UFFD and the uffdio_api.features userland requires.
The UFFDIO_API ioctl, if successful (i.e. if the requested
uffdio_api.api is spoken also by the running kernel and the requested
features are going to be enabled), will return into
uffdio_api.features and uffdio_api.ioctls two 64bit bitmasks of
respectively all the available features of the read(2) protocol and
the generic ioctls available.

Once the userfaultfd has been enabled, the UFFDIO_REGISTER ioctl
should be invoked (if present in the returned uffdio_api.ioctls
bitmask) to register a memory range in the userfaultfd by setting the
uffdio_register structure accordingly. The uffdio_register.mode
bitmask specifies to the kernel which kinds of faults to track for
the range (UFFDIO_REGISTER_MODE_MISSING would track missing
pages). The UFFDIO_REGISTER ioctl will return the
uffdio_register.ioctls bitmask of ioctls that are suitable to resolve
userfaults on the registered range. Not all ioctls will necessarily
be supported for all memory types, depending on the underlying
virtual memory backend (anonymous memory vs tmpfs vs real file-backed
mappings).

Userland can use the uffdio_register.ioctls to manage the virtual
address space in the background (to add or potentially also remove
memory from the userfaultfd registered range). This means a userfault
could trigger just before userland maps the user-faulted page in the
background.

The primary ioctl to resolve userfaults is UFFDIO_COPY. It atomically
copies a page into the userfault registered range and wakes up the
blocked userfaults (unless uffdio_copy.mode &
UFFDIO_COPY_MODE_DONTWAKE is set). The other ioctls work similarly to
UFFDIO_COPY. They're atomic in the sense that nothing can see a
half-copied page, since the access will keep userfaulting until the
copy has finished.

== QEMU/KVM ==

QEMU/KVM is using the userfaultfd syscall to implement postcopy live
migration. Postcopy live migration is one form of memory
externalization consisting of a virtual machine running with part or
all of its memory residing on a different node in the cloud. The
userfaultfd abstraction is generic enough that not a single line of
KVM kernel code had to be modified in order to add postcopy live
migration to QEMU.

Guest async page faults, FOLL_NOWAIT and all other GUP features work
just fine in combination with userfaults. Userfaults trigger async
page faults in the guest scheduler so those guest processes that
aren't waiting for userfaults (i.e. network bound) can keep running in
the guest vcpus.

It is generally beneficial to run one pass of precopy live migration
just before starting postcopy live migration, in order to avoid
generating userfaults for readonly guest regions.

The implementation of postcopy live migration currently uses one
single bidirectional socket but in the future two different sockets
will be used (to reduce the latency of the userfaults to the minimum
possible without having to decrease /proc/sys/net/ipv4/tcp_wmem).

The QEMU in the source node writes all pages that it knows are missing
in the destination node into the socket, and the migration thread of
the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE
ioctls on the userfaultfd in order to map the received pages into the
guest (UFFDIO_ZEROPAGE is used if the source page was a zero page).

A different postcopy thread in the destination node listens with
poll() on the userfaultfd in parallel. When a POLLIN event is
generated after a userfault triggers, the postcopy thread reads from
the userfaultfd and receives the fault address (or -EAGAIN in case the
userfault was already resolved and woken by a UFFDIO_COPY|ZEROPAGE run
by the parallel QEMU migration thread).

After the QEMU postcopy thread (running in the destination node) gets
the userfault address, it writes the information about the missing
page into the socket. The QEMU source node receives the information,
roughly "seeks" to that page address, and continues sending all
remaining missing pages from that new page offset. Soon after that
(just the time to flush the tcp_wmem queue through the network) the
migration thread in the QEMU running in the destination node will
receive the page that triggered the userfault and it'll map it as
usual with UFFDIO_COPY|ZEROPAGE (without actually knowing if it was
spontaneously sent by the source or if it was an urgent page
requested through a userfault).

By the time the userfaults start, the QEMU in the destination node
doesn't need to keep any per-page state bitmap relative to the live
migration around, and a single per-page bitmap has to be maintained
in the QEMU running in the source node to know which pages are still
missing in the destination node. The bitmap in the source node is
checked to find which missing pages to send in round robin, and we
seek over it when receiving incoming userfaults. After sending each
page, of course, the bitmap is updated accordingly. It's also useful
to avoid sending the same page twice (in case the userfault is read
by the postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the
migration thread).