CFQ (Complete Fairness Queueing)
===============================

The main aim of the CFQ scheduler is to provide a fair allocation of disk
I/O bandwidth to all the processes which request I/O operations.

CFQ maintains a per-process queue for the processes which issue synchronous
I/O requests. In case of asynchronous requests, all the requests from all
processes are batched together according to their process's I/O priority.

CFQ ioscheduler tunables
========================

slice_idle
----------
This specifies how long CFQ should idle for the next request on certain cfq
queues (for sequential workloads) and service trees (for random workloads)
before the queue is expired and CFQ selects the next queue to dispatch from.

By default slice_idle is a non-zero value. That means by default we idle on
queues/service trees. This can be very helpful on highly seeky media like
single-spindle SATA/SAS disks where we can cut down on the overall number of
seeks and see improved throughput.

Setting slice_idle to 0 will remove all idling at the queue/service tree
level and one should see an overall improved throughput on faster storage
devices like multiple SATA/SAS disks in a hardware RAID configuration. The
downside is that the isolation provided from WRITES also goes down and the
notion of IO priority becomes weaker.

So depending on storage and workload, it might be useful to set slice_idle=0.
In general, for SATA/SAS disks and software RAID of SATA/SAS disks, keeping
slice_idle enabled should be useful. For any configuration where there are
multiple spindles behind a single LUN (host-based hardware RAID controller or
storage arrays), setting slice_idle=0 might end up in better throughput and
acceptable latencies.
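
As an illustration, the tunable can be changed at run time through sysfs.
The sketch below is only an example: it assumes the usual
/sys/block/<device>/queue/iosched/ layout, that CFQ is the active scheduler
for the device, and uses "sda" as a placeholder device name; writing the
file requires root privileges.

    # Minimal sketch: inspect and change slice_idle for one device.
    # "sda" is only a placeholder; adjust to your system and run as root.
    from pathlib import Path

    def set_cfq_tunable(device, name, value):
        path = Path("/sys/block") / device / "queue" / "iosched" / name
        old = path.read_text().strip()
        path.write_text(str(value))
        return old

    old = set_cfq_tunable("sda", "slice_idle", 0)   # disable queue idling
    print("slice_idle: %s -> 0" % old)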

back_seek_max
-------------
This specifies, in Kbytes, the maximum "distance" for backward seeking.
The distance is the amount of space from the current head location to the
sectors that are backward in terms of distance.

This parameter allows the scheduler to anticipate requests in the "backward"
direction and consider them as being the "next" if they are within this
distance from the current head location.

back_seek_penalty
-----------------
This parameter is used to compute the cost of backward seeking. If the
backward distance of a request is just 1/back_seek_penalty of a "front"
request's distance, then the seeking costs of the two requests are considered
equivalent.

So the scheduler will not bias toward one or the other request (otherwise the
scheduler will bias toward the front request). The default value of
back_seek_penalty is 2.
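
To make these two parameters concrete, here is a small sketch of the
decision heuristic as described above (an illustration of the documented
behaviour, not the kernel code): a backward candidate is considered only if
it lies within back_seek_max, and its distance is charged back_seek_penalty
times when compared against a forward candidate.

    # Illustration of the documented back-seek heuristic (not the kernel
    # implementation). Distances are in KB from the current head position.
    def prefer_backward(front_dist_kb, back_dist_kb,
                        back_seek_max_kb, back_seek_penalty=2):
        if back_dist_kb > back_seek_max_kb:
            return False                 # backward request is out of range
        # A backward seek is charged back_seek_penalty times its distance,
        # so back_dist == front_dist / penalty means equivalent cost.
        return back_dist_kb * back_seek_penalty < front_dist_kb

    # Example values only: a 16384 KB backward range, default penalty of 2.
    print(prefer_backward(1000, 400, back_seek_max_kb=16384))   # True
    print(prefer_backward(1000, 600, back_seek_max_kb=16384))   # False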

fifo_expire_async
-----------------
This parameter is used to set the timeout of asynchronous requests. The
default value of this is 248ms.

fifo_expire_sync
----------------
This parameter is used to set the timeout of synchronous requests. The
default value of this is 124ms. To favor synchronous requests over
asynchronous ones, this value should be decreased relative to
fifo_expire_async.

group_idle
----------
This parameter forces idling at the CFQ group level instead of the CFQ
queue level. It was introduced after a bottleneck was observed on higher-end
storage, where idling on a sequential queue allowed dispatch from only a
single queue at a time. The idea with this parameter is that it can be run
with slice_idle=0 and group_idle=8, so that idling does not happen on
individual queues in the group but happens overall on the group, and thus
still keeps the IO controller working.
Not idling on individual queues in the group will dispatch requests from
multiple queues in the group at the same time and achieve higher throughput
on higher-end storage.

The default value for this parameter is 8ms.
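
For example, the combination described above could be applied through sysfs
in the same way as the earlier slice_idle sketch (again assuming the
standard iosched sysfs layout and using a placeholder device name):

    # Hypothetical example: per-queue idling off, group idling on, as
    # described above. "sda" is a placeholder device name; run as root.
    from pathlib import Path

    iosched = Path("/sys/block/sda/queue/iosched")
    (iosched / "slice_idle").write_text("0")  # no idling on individual queues
    (iosched / "group_idle").write_text("8")  # idle at the group level instead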

low_latency
-----------
This parameter is used to enable/disable the low latency mode of the CFQ
scheduler. If enabled, CFQ tries to recompute the slice time for each process
based on the target_latency set for the system. This favors fairness over
throughput. Disabling low latency (setting it to 0) ignores target latency,
allowing each process in the system to get a full time slice.

By default low latency mode is enabled.

target_latency
--------------
This parameter is used to calculate the time slice for a process if cfq's
latency mode is enabled. It will ensure that sync requests have an estimated
latency. But if the sequential workload is higher (e.g. sequential reads),
then to meet the latency constraint, throughput may decrease because of the
shorter time each process has to issue I/O requests before the cfq queue is
switched.

Though this can be overcome by disabling low_latency, it may increase
the read latency for some applications. This parameter allows changing
target_latency through the sysfs interface, which can provide a balance
between throughput and read latency.

The default value for target_latency is 300ms.
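
As a rough back-of-the-envelope illustration of the trade-off (a simplified
model, not the kernel's actual slice computation): with many competing
queues, keeping the overall round trip near target_latency implies shorter
per-queue slices.

    # Simplified model only -- not the kernel's actual computation. It just
    # illustrates why meeting target_latency with many competing processes
    # means shorter slices (and hence lower per-process throughput).
    def approx_slice_ms(nr_competing, slice_sync=100, target_latency=300):
        full = slice_sync                    # slice with low_latency disabled
        shrunk = max(1, target_latency // max(1, nr_competing))
        return min(full, shrunk)

    for n in (1, 3, 10, 30):
        print(n, "queues ->", approx_slice_ms(n), "ms per slice")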

slice_async
-----------
This parameter is the same as slice_sync but for the asynchronous queue. The
default value is 40ms.

slice_async_rq
--------------
This parameter is used to limit the number of asynchronous requests dispatched
to the device request queue within a queue's slice time. The maximum number
of requests allowed to be dispatched also depends upon the I/O priority. The
default value for this is 2.

slice_sync
----------
When a queue is selected for execution, the queue's I/O requests are only
executed for a certain amount of time (time_slice) before switching to another
queue. This parameter is used to calculate the time slice of the synchronous
queue.

time_slice is computed using the following equation:
time_slice = slice_sync + (slice_sync/5 * (4 - prio)). To increase the
time_slice of the synchronous queue, increase the value of slice_sync. The
default value is 100ms.
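
For example, with the default slice_sync of 100ms, evaluating the formula
above across the I/O priority levels gives:

    # Direct evaluation of the documented formula:
    #   time_slice = slice_sync + (slice_sync/5 * (4 - prio))
    def time_slice_ms(prio, slice_sync=100):
        return slice_sync + (slice_sync // 5) * (4 - prio)

    for prio in range(8):              # ioprio 0 (highest) .. 7 (lowest)
        print("prio", prio, "->", time_slice_ms(prio), "ms")
    # prio 0 -> 180 ms, prio 4 -> 100 ms, prio 7 -> 40 ms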

quantum
-------
This specifies the number of requests dispatched to the device queue. Within
a queue's time slice, a request will not be dispatched if the number of
requests in the device exceeds this parameter. This parameter is used for
synchronous requests.

In case of storage with several disks, this setting can limit the parallel
processing of requests. Therefore, increasing the value can improve
performance, although this can cause the latency of some I/O to increase due
to the larger number of in-flight requests.

CFQ Group scheduling
====================

CFQ supports blkio cgroup and has "blkio." prefixed files in each
blkio cgroup directory. It is weight-based and there are four knobs
for configuration - weight[_device] and leaf_weight[_device].
Internal cgroup nodes (the ones with children) can also have tasks in
them, so the former two configure how much proportion the cgroup as a
whole is entitled to at its parent's level, while the latter two
configure how much proportion the tasks in the cgroup have compared to
its direct children.

Another way to think about it is assuming that each internal node has
an implicit leaf child node which hosts all the tasks whose weight is
configured by leaf_weight[_device]. Let's assume a blkio hierarchy
composed of five cgroups - root, A, B, AA and AB - with the following
weights where the names represent the hierarchy.

        weight leaf_weight
 root :  125    125
 A    :  500    750
 B    :  250    500
 AA   :  500    500
 AB   : 1000    500

root never has a parent, making its weight meaningless. For backward
compatibility, weight is always kept in sync with leaf_weight. B, AA
and AB have no children and thus their tasks have no child cgroups to
compete with. They always get 100% of what the cgroup won at the
parent level. Considering only the weights which matter, the hierarchy
looks like the following.

          root
       /    |   \
      A     B    leaf
     500   250   125
   /  |  \
  AA  AB  leaf
 500 1000  750

If all cgroups have active IOs and are competing with each other, disk
time will be distributed like the following.

Distribution below root. The total active weight at this level is
A:500 + B:250 + root-leaf:125 = 875.

 root-leaf :   125 /  875      =~ 14%
 A         :   500 /  875      =~ 57%
 B(-leaf)  :   250 /  875      =~ 28%

A has children and further distributes its 57% among the children and
the implicit leaf node. The total active weight at this level is
AA:500 + AB:1000 + A-leaf:750 = 2250.

 A-leaf    : ( 750 / 2250) * A =~ 19%
 AA(-leaf) : ( 500 / 2250) * A =~ 12%
 AB(-leaf) : (1000 / 2250) * A =~ 25%
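
The same percentages can be reproduced mechanically; the short sketch below
recomputes the distribution for this example hierarchy (each level splits
its parent's share in proportion to the active weights at that level):

    # Recompute the example distribution above: each level splits its
    # parent's share in proportion to the active weights at that level.
    def split(parent_share, weights):
        total = sum(weights.values())
        return {name: parent_share * w / total for name, w in weights.items()}

    top = split(1.0, {"A": 500, "B": 250, "root-leaf": 125})
    under_a = split(top["A"], {"AA": 500, "AB": 1000, "A-leaf": 750})

    for name, share in {**top, **under_a}.items():
        print("%-9s =~ %d%%" % (name, round(share * 100)))
    # Matches the percentages above, up to rounding.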

CFQ IOPS Mode for group scheduling
===================================
The basic CFQ design is to provide priority-based time slices. A higher
priority process gets a bigger time slice and a lower priority process gets a
smaller time slice. Measuring time becomes harder if the storage is fast and
supports NCQ, and it would be better to dispatch multiple requests from
multiple cfq queues in the request queue at a time. In such a scenario, it is
not possible to measure the time consumed by a single queue accurately.

What is possible though is to measure the number of requests dispatched from
a single queue and also allow dispatch from multiple cfq queues at the same
time. This effectively becomes fairness in terms of IOPS (IO operations per
second).

If one sets slice_idle=0 and the storage supports NCQ, CFQ internally switches
to IOPS mode and starts providing fairness in terms of number of requests
dispatched. Note that this mode switching takes effect only for group
scheduling. For non-cgroup users nothing should change.

CFQ IO scheduler Idling Theory
===============================
Idling on a queue is primarily about waiting for the next request to come
on the same queue after completion of a request. In this process CFQ will not
dispatch requests from other cfq queues even if requests are pending there.

The rationale behind idling is that it can cut down on the number of seeks
on rotational media. For example, if a process is doing dependent
sequential reads (the next read will come only after completion of the
previous one), then not dispatching requests from another queue should help,
as we did not move the disk head and kept on dispatching sequential IO from
one queue.

CFQ has the following service trees, and the various queues are put on these
trees.

        sync-idle    sync-noidle    async

All cfq queues doing synchronous sequential IO go on to the sync-idle tree.
On this tree we idle on each queue individually.

All synchronous non-sequential queues go on the sync-noidle tree. Also any
requests which are marked with REQ_NOIDLE go on this service tree. On this
tree we do not idle on individual queues; instead we idle on the whole group
of queues or the tree. So if there are 4 queues waiting for IO to dispatch,
we will idle only once the last queue has dispatched its IO and there is
no more IO on this service tree.

All async writes go on the async service tree. There is no idling on async
queues.

CFQ has some optimizations for SSDs: if it detects non-rotational
media which can support a higher queue depth (multiple requests in
flight at a time), then it cuts down on the idling of individual queues and
all the queues move to the sync-noidle tree and only tree idle remains. This
tree idling provides isolation with buffered write queues on the async tree.

FAQ
===
Q1. Why idle at all on queues marked with REQ_NOIDLE?

A1. We only do tree idle (all queues on the sync-noidle tree) on queues marked
with REQ_NOIDLE. This helps in providing isolation from all the sync-idle
queues. Otherwise, in the presence of many sequential readers, other
synchronous IO might not get a fair share of the disk.

For example, if there are 10 sequential readers doing IO and they get
100ms each, and a REQ_NOIDLE request comes in, it will be scheduled
roughly after 1 second. If after completion of the REQ_NOIDLE request we
do not idle, and after a couple of milliseconds another REQ_NOIDLE
request comes in, it will again be scheduled after 1 second. Repeat this
and notice how a workload can lose its disk share and suffer due to
multiple sequential readers.

fsync can generate dependent IO where a bunch of data is written in the
context of fsync, and later some journaling data is written. Journaling
data comes in only after fsync has finished its IO (at least for ext4
that seemed to be the case). Now if one decides not to idle on the fsync
thread due to REQ_NOIDLE, then the next journaling write will not get
scheduled for another second. A process doing small fsyncs will suffer
badly in the presence of multiple sequential readers.

Hence doing tree idling on threads using the REQ_NOIDLE flag on requests
provides isolation from multiple sequential readers, while at the same
time we do not idle on individual threads.

Q2. When to specify REQ_NOIDLE?
A2. Whenever one is doing a synchronous write and not expecting more writes
to be dispatched from the same context soon, one should be able to specify
REQ_NOIDLE on the writes, and that should work well for most cases.