Commit | Line | Data |
---|---|---|
5efb1d54 AH |
1 | Intel Processor Trace |
2 | ===================== | |
3 | ||
4 | Overview | |
5 | ======== | |
6 | ||
7 | Intel Processor Trace (Intel PT) is an extension of Intel Architecture that | |
8 | collects information about software execution such as control flow, execution | |
9 | modes and timings and formats it into highly compressed binary packets. | |
10 | Technical details are documented in the Intel 64 and IA-32 Architectures | |
11 | Software Developer Manuals, Chapter 36 Intel Processor Trace. | |
12 | ||
13 | Intel PT is first supported in Intel Core M and 5th generation Intel Core | |
14 | processors that are based on the Intel micro-architecture code name Broadwell. | |
15 | ||
16 | Trace data is collected by 'perf record' and stored within the perf.data file. | |
17 | See below for options to 'perf record'. | |
18 | ||
19 | Trace data must be 'decoded' which involves walking the object code and matching | |
20 | the trace data packets. For example a TNT packet only tells whether a | |
21 | conditional branch was taken or not taken, so to make use of that packet the | |
22 | decoder must know precisely which instruction was being executed. | |
23 | ||
24 | Decoding is done on-the-fly. The decoder outputs samples in the same format as | |
25 | samples output by perf hardware events, for example as though the "instructions" | |
26 | or "branches" events had been recorded. Presently 3 tools support this: | |
27 | 'perf script', 'perf report' and 'perf inject'. See below for more information | |
28 | on using those tools. | |
29 | ||
30 | The main distinguishing feature of Intel PT is that the decoder can determine | |
31 | the exact flow of software execution. Intel PT can be used to understand why | |
32 | and how software got to a certain point, or behaved a certain way. The | |
33 | software does not have to be recompiled, so Intel PT works with debug or release | |
34 | builds, however the executed images are needed - which makes use in JIT-compiled | |
35 | environments, or with self-modified code, a challenge. Also symbols need to be | |
36 | provided to make sense of addresses. | |
37 | ||
38 | A limitation of Intel PT is that it produces huge amounts of trace data | |
39 | (hundreds of megabytes per second per core) which takes a long time to decode, | |
40 | for example two or three orders of magnitude longer than it took to collect. | |
41 | Another limitation is the performance impact of tracing, something that will | |
42 | vary depending on the use-case and architecture. | |
43 | ||
44 | ||
45 | Quickstart | |
46 | ========== | |
47 | ||
48 | It is important to start small. That is because it is easy to capture vastly | |
49 | more data than can possibly be processed. | |
50 | ||
51 | The simplest thing to do with Intel PT is userspace profiling of small programs. | |
52 | Data is captured with 'perf record' e.g. to trace 'ls' userspace-only: | |
53 | ||
54 | perf record -e intel_pt//u ls | |
55 | ||
56 | And profiled with 'perf report' e.g. | |
57 | ||
58 | perf report | |
59 | ||
60 | Tracing kernel space as well presents a problem, namely kernel self-modifying | |
61 | code. A fairly good kernel image is available in /proc/kcore but to get an | |
62 | accurate image a copy of /proc/kcore needs to be made under the same conditions | |
63 | as the data capture. A script perf-with-kcore can do that, but beware that the | |
64 | script makes use of 'sudo' to copy /proc/kcore. If you have perf installed | |
65 | locally from the source tree you can do: | |
66 | ||
67 | ~/libexec/perf-core/perf-with-kcore record pt_ls -e intel_pt// -- ls | |
68 | ||
69 | which will create a directory named 'pt_ls' and put the perf.data file and | |
70 | copies of /proc/kcore, /proc/kallsyms and /proc/modules into it. Then to use | |
71 | 'perf report' becomes: | |
72 | ||
73 | ~/libexec/perf-core/perf-with-kcore report pt_ls | |
74 | ||
75 | Because samples are synthesized after-the-fact, the sampling period can be | |
76 | selected for reporting. e.g. sample every microsecond | |
77 | ||
78 | ~/libexec/perf-core/perf-with-kcore report pt_ls --itrace=i1usge | |
79 | ||
80 | See the sections below for more information about the --itrace option. | |
81 | ||
82 | Beware that the smaller the period, the more samples are produced, and the | |
83 | longer it takes to process them. | |
84 | ||
85 | Also note that the coarseness of Intel PT timing information will start to | |
86 | distort the statistical value of the sampling as the sampling period becomes | |
87 | smaller. | |
88 | ||
89 | To represent software control flow, "branches" samples are produced. By default | |
90 | a branch sample is synthesized for every single branch. To get an idea what | |
91 | data is available you can use the 'perf script' tool with no parameters, which | |
92 | will list all the samples. | |
93 | ||
94 | perf record -e intel_pt//u ls | |
95 | perf script | |
96 | ||
97 | An interesting field that is not printed by default is 'flags' which can be | |
98 | displayed as follows: | |
99 | ||
100 | perf script -Fcomm,tid,pid,time,cpu,event,trace,ip,sym,dso,addr,symoff,flags | |
101 | ||
102 | The flags are "bcrosyiABEx" which stand for branch, call, return, conditional, | |
103 | system, asynchronous, interrupt, transaction abort, trace begin, trace end, and | |
104 | in transaction, respectively. | |
105 | ||
106 | While it is possible to create scripts to analyze the data, an alternative | |
107 | approach is available to export the data to a postgresql database. Refer to | |
108 | script export-to-postgresql.py for more details, and to script | |
109 | call-graph-from-postgresql.py for an example of using the database. | |
110 | ||
111 | As mentioned above, it is easy to capture too much data. One way to limit the | |
112 | data captured is to use 'snapshot' mode which is explained further below. | |
113 | Refer to 'new snapshot option' and 'Intel PT modes of operation' further below. | |
114 | ||
115 | Another problem that will be experienced is decoder errors. They can be caused | |
116 | by inability to access the executed image, self-modified or JIT-ed code, or the | |
117 | inability to match side-band information (such as context switches and mmaps) | |
118 | which results in the decoder not knowing what code was executed. | |
119 | ||
120 | There is also the problem of perf not being able to copy the data fast enough, | |
121 | resulting in lost data because the buffer was full. See 'Buffer handling' below | |
122 | for more details. | |
123 | ||
124 | ||
125 | perf record | |
126 | =========== | |
127 | ||
128 | new event | |
129 | --------- | |
130 | ||
131 | The Intel PT kernel driver creates a new PMU for Intel PT. PMU events are | |
132 | selected by providing the PMU name followed by the "config" separated by slashes. | |
133 | An enhancement has been made to allow default "config" e.g. the option | |
134 | ||
135 | -e intel_pt// | |
136 | ||
137 | will use a default config value. Currently that is the same as | |
138 | ||
139 | -e intel_pt/tsc,noretcomp=0/ | |
140 | ||
141 | which is the same as | |
142 | ||
143 | -e intel_pt/tsc=1,noretcomp=0/ | |
144 | ||
9d1bf02a AH |
145 | Note there are now new config terms - see section 'config terms' further below. |
146 | ||
5efb1d54 AH |
147 | The config terms are listed in /sys/devices/intel_pt/format. They are bit |
148 | fields within the config member of the struct perf_event_attr which is | |
149 | passed to the kernel by the perf_event_open system call. They correspond to bit | |
150 | fields in the IA32_RTIT_CTL MSR. Here is a list of them and their definitions: | |
151 | ||
9d1bf02a AH |
152 | $ grep -H . /sys/bus/event_source/devices/intel_pt/format/* |
153 | /sys/bus/event_source/devices/intel_pt/format/cyc:config:1 | |
154 | /sys/bus/event_source/devices/intel_pt/format/cyc_thresh:config:19-22 | |
155 | /sys/bus/event_source/devices/intel_pt/format/mtc:config:9 | |
156 | /sys/bus/event_source/devices/intel_pt/format/mtc_period:config:14-17 | |
157 | /sys/bus/event_source/devices/intel_pt/format/noretcomp:config:11 | |
158 | /sys/bus/event_source/devices/intel_pt/format/psb_period:config:24-27 | |
159 | /sys/bus/event_source/devices/intel_pt/format/tsc:config:10 | |
5efb1d54 AH |
160 | |
161 | Note that the default config must be overridden for each term i.e. | |
162 | ||
163 | -e intel_pt/noretcomp=0/ | |
164 | ||
165 | is the same as: | |
166 | ||
167 | -e intel_pt/tsc=1,noretcomp=0/ | |
168 | ||
169 | So, to disable TSC packets use: | |
170 | ||
171 | -e intel_pt/tsc=0/ | |
172 | ||
173 | It is also possible to specify the config value explicitly: | |
174 | ||
175 | -e intel_pt/config=0x400/ | |
176 | ||
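The explicit value 0x400 follows from the bit positions in the format listing above: tsc occupies bit 10 of config. The sketch below composes a config value from named terms; the FORMAT table is copied from that listing, while the function name make_config is hypothetical and only illustrates the arithmetic, not how perf itself parses terms.

```python
# Bit positions copied from the format listing: name -> (low bit, high bit).
FORMAT = {
    "cyc": (1, 1),
    "cyc_thresh": (19, 22),
    "mtc": (9, 9),
    "mtc_period": (14, 17),
    "noretcomp": (11, 11),
    "psb_period": (24, 27),
    "tsc": (10, 10),
}


def make_config(**terms):
    """Pack term values into a single config integer (hypothetical helper)."""
    config = 0
    for name, value in terms.items():
        low, high = FORMAT[name]
        width = high - low + 1
        # Reject values that do not fit in the term's bit field.
        if value < 0 or value >= (1 << width):
            raise ValueError("%s=%d does not fit in %d bit(s)" % (name, value, width))
        config |= value << low
    return config


if __name__ == "__main__":
    # tsc=1 sets bit 10 only, giving 0x400.
    print(hex(make_config(tsc=1, noretcomp=0)))
```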
177 | Note that, as with all events, the event is suffixed with event modifiers: | |
178 | ||
179 | u userspace | |
180 | k kernel | |
181 | h hypervisor | |
182 | G guest | |
183 | H host | |
184 | p precise ip | |
185 | ||
186 | 'h', 'G' and 'H' are for virtualization which is not supported by Intel PT. | |
187 | 'p' is also not relevant to Intel PT. So only options 'u' and 'k' are | |
188 | meaningful for Intel PT. | |
189 | ||
190 | perf_event_attr is displayed if the -vv option is used e.g. | |
191 | ||
192 | ------------------------------------------------------------ | |
193 | perf_event_attr: | |
194 | type 6 | |
195 | size 112 | |
196 | config 0x400 | |
197 | { sample_period, sample_freq } 1 | |
198 | sample_type IP|TID|TIME|CPU|IDENTIFIER | |
199 | read_format ID | |
200 | disabled 1 | |
201 | inherit 1 | |
202 | exclude_kernel 1 | |
203 | exclude_hv 1 | |
204 | enable_on_exec 1 | |
205 | sample_id_all 1 | |
206 | ------------------------------------------------------------ | |
207 | sys_perf_event_open: pid 31104 cpu 0 group_fd -1 flags 0x8 | |
208 | sys_perf_event_open: pid 31104 cpu 1 group_fd -1 flags 0x8 | |
209 | sys_perf_event_open: pid 31104 cpu 2 group_fd -1 flags 0x8 | |
210 | sys_perf_event_open: pid 31104 cpu 3 group_fd -1 flags 0x8 | |
211 | ------------------------------------------------------------ | |
212 | ||
213 | ||
9d1bf02a AH |
214 | config terms |
215 | ------------ | |
216 | ||
217 | The June 2015 version of Intel 64 and IA-32 Architectures Software Developer | |
218 | Manuals, Chapter 36 Intel Processor Trace, defined new Intel PT features. | |
219 | Some of the features are reflected in new config terms. All the config terms are | |
220 | described below. | |
221 | ||
222 | tsc Always supported. Produces TSC timestamp packets to provide | |
223 | timing information. In some cases it is possible to decode | |
224 | without timing information, for example a per-thread context | |
225 | that does not overlap executable memory maps. | |
226 | ||
227 | The default config selects tsc (i.e. tsc=1). | |
228 | ||
229 | noretcomp Always supported. Disables "return compression" so a TIP packet | |
230 | is produced when a function returns. Causes more packets to be | |
231 | produced but might make decoding more reliable. | |
232 | ||
233 | The default config does not select noretcomp (i.e. noretcomp=0). | |
234 | ||
235 | psb_period Allows the frequency of PSB packets to be specified. | |
236 | ||
237 | The PSB packet is a synchronization packet that provides a | |
238 | starting point for decoding or recovery from errors. | |
239 | ||
240 | Support for psb_period is indicated by: | |
241 | ||
242 | /sys/bus/event_source/devices/intel_pt/caps/psb_cyc | |
243 | ||
244 | which contains "1" if the feature is supported and "0" | |
245 | otherwise. | |
246 | ||
247 | Valid values are given by: | |
248 | ||
249 | /sys/bus/event_source/devices/intel_pt/caps/psb_periods | |
250 | ||
251 | which contains a hexadecimal value, the bits of which represent | |
252 | valid values e.g. bit 2 set means value 2 is valid. | |
253 | ||
254 | The psb_period value is converted to the approximate number of | |
255 | trace bytes between PSB packets as: | |
256 | ||
257 | 2 ^ (value + 11) | |
258 | ||
259 | e.g. value 3 means 16KiB between PSBs | |
260 | ||
261 | If an invalid value is entered, the error message | |
262 | will give a list of valid values e.g. | |
263 | ||
264 | $ perf record -e intel_pt/psb_period=15/u uname | |
265 | Invalid psb_period for intel_pt. Valid values are: 0-5 | |
266 | ||
267 | If MTC packets are selected, the default config selects a value | |
268 | of 3 (i.e. psb_period=3) or the nearest lower value that is | |
269 | supported (0 is always supported). Otherwise the default is 0. | |
270 | ||
271 | If decoding is expected to be reliable and the buffer is large | |
272 | then a large PSB period can be used. | |
273 | ||
274 | Because a TSC packet is produced with PSB, the PSB period can | |
275 | also affect the granularity of timing information in the absence | |
276 | of MTC or CYC. | |
277 | ||
278 | mtc Produces MTC timing packets. | |
279 | ||
280 | MTC packets provide finer grain timestamp information than TSC | |
281 | packets. MTC packets record time using the hardware crystal | |
282 | clock (CTC) which is related to TSC packets using a TMA packet. | |
283 | ||
284 | Support for this feature is indicated by: | |
285 | ||
286 | /sys/bus/event_source/devices/intel_pt/caps/mtc | |
287 | ||
288 | which contains "1" if the feature is supported and | |
289 | "0" otherwise. | |
290 | ||
291 | The frequency of MTC packets can also be specified - see | |
292 | mtc_period below. | |
293 | ||
294 | mtc_period Specifies how frequently MTC packets are produced - see mtc | |
295 | above for how to determine if MTC packets are supported. | |
296 | ||
297 | Valid values are given by: | |
298 | ||
299 | /sys/bus/event_source/devices/intel_pt/caps/mtc_periods | |
300 | ||
301 | which contains a hexadecimal value, the bits of which represent | |
302 | valid values e.g. bit 2 set means value 2 is valid. | |
303 | ||
304 | The mtc_period value is converted to the MTC frequency as: | |
305 | ||
306 | CTC-frequency / (2 ^ value) | |
307 | ||
308 | e.g. value 3 means one eighth of CTC-frequency | |
309 | ||
310 | Where CTC is the hardware crystal clock, the frequency of which | |
311 | can be related to TSC via values provided in cpuid leaf 0x15. | |
312 | ||
313 | If an invalid value is entered, the error message | |
314 | will give a list of valid values e.g. | |
315 | ||
316 | $ perf record -e intel_pt/mtc_period=15/u uname | |
317 | Invalid mtc_period for intel_pt. Valid values are: 0,3,6,9 | |
318 | ||
319 | The default value is 3 or the nearest lower value | |
320 | that is supported (0 is always supported). | |
321 | ||
322 | cyc Produces CYC timing packets. | |
323 | ||
324 | CYC packets provide even finer grain timestamp information than | |
325 | MTC and TSC packets. A CYC packet contains the number of CPU | |
326 | cycles since the last CYC packet. Unlike MTC and TSC packets, | |
327 | CYC packets are only sent when another packet is also sent. | |
328 | ||
329 | Support for this feature is indicated by: | |
330 | ||
331 | /sys/bus/event_source/devices/intel_pt/caps/psb_cyc | |
332 | ||
333 | which contains "1" if the feature is supported and | |
334 | "0" otherwise. | |
335 | ||
336 | The number of CYC packets produced can be reduced by specifying | |
337 | a threshold - see cyc_thresh below. | |
338 | ||
339 | cyc_thresh Specifies how frequently CYC packets are produced - see cyc | |
340 | above for how to determine if CYC packets are supported. | |
341 | ||
342 | Valid cyc_thresh values are given by: | |
343 | ||
344 | /sys/bus/event_source/devices/intel_pt/caps/cycle_thresholds | |
345 | ||
346 | which contains a hexadecimal value, the bits of which represent | |
347 | valid values e.g. bit 2 set means value 2 is valid. | |
348 | ||
349 | The cyc_thresh value represents the minimum number of CPU cycles | |
350 | that must have passed before a CYC packet can be sent. The | |
351 | number of CPU cycles is: | |
352 | ||
353 | 2 ^ (value - 1) | |
354 | ||
355 | e.g. value 4 means 8 CPU cycles must pass before a CYC packet | |
356 | can be sent. Note a CYC packet is still only sent when another | |
357 | packet is sent, not at, e.g. every 8 CPU cycles. | |
358 | ||
359 | If an invalid value is entered, the error message | |
360 | will give a list of valid values e.g. | |
361 | ||
362 | $ perf record -e intel_pt/cyc,cyc_thresh=15/u uname | |
363 | Invalid cyc_thresh for intel_pt. Valid values are: 0-12 | |
364 | ||
365 | CYC packets are not requested by default. | |
366 | ||
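The arithmetic in the psb_period, mtc_period and cyc_thresh descriptions above can be checked with a short sketch. All function names are hypothetical. The mask strings "3f" and "249" are assumed examples chosen to match the "0-5" and "0,3,6,9" error messages shown above, and the 24 MHz CTC frequency is an assumed figure for illustration only. Treating cyc_thresh value 0 as "no minimum" is also an assumption, since the formula in the text is only meaningful for values of 1 or more.

```python
def valid_values(caps_hex):
    """Decode a caps bitmask (hexadecimal string): bit N set means value N is valid."""
    mask = int(caps_hex, 16)
    return [bit for bit in range(mask.bit_length()) if mask & (1 << bit)]


def psb_period_bytes(value):
    """Approximate number of trace bytes between PSB packets: 2 ^ (value + 11)."""
    return 2 ** (value + 11)


def mtc_frequency(ctc_hz, value):
    """MTC packet frequency: CTC-frequency / (2 ^ value)."""
    return ctc_hz / (2 ** value)


def cyc_thresh_cycles(value):
    """Minimum CPU cycles before a CYC packet can be sent: 2 ^ (value - 1).

    Assumption: value 0 is treated as no minimum (0 cycles).
    """
    if value < 1:
        return 0
    return 2 ** (value - 1)


if __name__ == "__main__":
    print(valid_values("3f"))            # assumed psb_periods mask
    print(valid_values("249"))           # assumed mtc_periods mask
    print(psb_period_bytes(3))           # 16KiB, as in the text
    print(mtc_frequency(24000000, 3))    # one eighth of an assumed 24 MHz CTC
    print(cyc_thresh_cycles(4))          # 8 cycles, as in the text
```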
9d1bf02a | 367 | |
5efb1d54 AH |
368 | new snapshot option |
369 | ------------------- | |
370 | ||
9d1bf02a AH |
371 | The difference between full trace and snapshot from the kernel's perspective is |
372 | that in full trace we don't overwrite trace data that the user hasn't collected | |
373 | yet (and indicated that by advancing aux_tail), whereas in snapshot mode we let | |
374 | the trace run and overwrite older data in the buffer so that whenever something | |
375 | interesting happens, we can stop it and grab a snapshot of what was going on | |
376 | around that interesting moment. | |
377 | ||
5efb1d54 AH |
378 | To select snapshot mode a new option has been added: |
379 | ||
380 | -S | |
381 | ||
382 | Optionally it can be followed by the snapshot size e.g. | |
383 | ||
384 | -S0x100000 | |
385 | ||
386 | The default snapshot size is the auxtrace mmap size. If neither auxtrace mmap size | |
387 | nor snapshot size is specified, then the default is 4MiB for privileged users | |
388 | (or if /proc/sys/kernel/perf_event_paranoid < 0), 128KiB for unprivileged users. | |
389 | If an unprivileged user does not specify mmap pages, the mmap pages will be | |
390 | reduced as described in the 'new auxtrace mmap size option' section below. | |
391 | ||
392 | The snapshot size is displayed if the option -vv is used e.g. | |
393 | ||
394 | Intel PT snapshot size: %zu | |
395 | ||
396 | ||
397 | new auxtrace mmap size option | |
398 | ----------------------------- | |
399 | ||
400 | Intel PT buffer size is specified by an addition to the -m option e.g. | |
401 | ||
402 | -m,16 | |
403 | ||
404 | selects a buffer size of 16 pages i.e. 64KiB. | |
405 | ||
406 | Note that the existing functionality of -m is unchanged. The auxtrace mmap size | |
407 | is specified by the optional addition of a comma and the value. | |
408 | ||
409 | The default auxtrace mmap size for Intel PT is 4MiB/page_size for privileged users | |
410 | (or if /proc/sys/kernel/perf_event_paranoid < 0), 128KiB for unprivileged users. | |
411 | If an unprivileged user does not specify mmap pages, the mmap pages will be | |
412 | reduced from the default 512KiB/page_size to 256KiB/page_size, otherwise the | |
413 | user is likely to get an error as they exceed their mlock limit (Max locked | |
414 | memory as shown in /proc/self/limits). Note that perf does not count the first | |
415 | 512KiB (actually /proc/sys/kernel/perf_event_mlock_kb minus 1 page) per cpu | |
416 | against the mlock limit so an unprivileged user is allowed 512KiB per cpu plus | |
417 | their mlock limit (which defaults to 64KiB but is not multiplied by the number | |
418 | of cpus). | |
419 | ||
420 | In full-trace mode, powers of two are allowed for buffer size, with a minimum | |
421 | size of 2 pages. In snapshot mode, it is the same but the minimum size is | |
422 | 1 page. | |
423 | ||
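The buffer size rules above (a power of two, at least 2 pages in full-trace mode or 1 page in snapshot mode) can be sketched as a simple check. The function names are hypothetical and the 4KiB page size is an assumption that matches the "-m,16 selects 64KiB" example.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def auxtrace_mmap_ok(pages, snapshot=False):
    """Return True if 'pages' is an acceptable auxtrace mmap size (hypothetical check)."""
    minimum = 1 if snapshot else 2
    is_power_of_two = pages > 0 and (pages & (pages - 1)) == 0
    return is_power_of_two and pages >= minimum


def pages_to_kib(pages, page_size=PAGE_SIZE):
    """Convert a page count, as given to -m,N, into KiB."""
    return pages * page_size // 1024


if __name__ == "__main__":
    print(pages_to_kib(16))                    # the -m,16 example
    print(auxtrace_mmap_ok(16))                # power of two, above the minimum
    print(auxtrace_mmap_ok(1))                 # too small for full-trace mode
    print(auxtrace_mmap_ok(1, snapshot=True))  # allowed in snapshot mode
```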
424 | The mmap size and auxtrace mmap size are displayed if the -vv option is used e.g. | |
425 | ||
426 | mmap length 528384 | |
427 | auxtrace mmap length 4198400 | |
428 | ||
429 | ||
430 | Intel PT modes of operation | |
431 | --------------------------- | |
432 | ||
433 | Intel PT can be used in 2 modes: | |
434 | full-trace mode | |
435 | snapshot mode | |
436 | ||
437 | Full-trace mode traces continuously e.g. | |
438 | ||
439 | perf record -e intel_pt//u uname | |
440 | ||
441 | Snapshot mode captures the available data when a signal is sent e.g. | |
442 | ||
443 | perf record -v -e intel_pt//u -S ./loopy 1000000000 & | |
444 | [1] 11435 | |
445 | kill -USR2 11435 | |
446 | Recording AUX area tracing snapshot | |
447 | ||
448 | Note that the signal sent is SIGUSR2. | |
449 | Note that "Recording AUX area tracing snapshot" is displayed because the -v | |
450 | option is used. | |
451 | ||
452 | The 2 modes cannot be used together. | |
453 | ||
454 | ||
455 | Buffer handling | |
456 | --------------- | |
457 | ||
458 | There may be buffer limitations (i.e. single ToPa entry) which means that actual | |
459 | buffer sizes are limited to powers of 2 up to 4MiB (MAX_ORDER). In order to | |
460 | provide other sizes, and in particular an arbitrarily large size, multiple | |
461 | buffers are logically concatenated. However an interrupt must be used to switch | |
462 | between buffers. That has two potential problems: | |
463 | a) the interrupt may not be handled in time so that the current buffer | |
464 | becomes full and some trace data is lost. | |
465 | b) the interrupts may slow the system and affect the performance | |
466 | results. | |
467 | ||
468 | If trace data is lost, the driver sets 'truncated' in the PERF_RECORD_AUX event | |
469 | which the tools report as an error. | |
470 | ||
471 | In full-trace mode, the driver waits for data to be copied out before allowing | |
472 | the (logical) buffer to wrap-around. If data is not copied out quickly enough, | |
473 | again 'truncated' is set in the PERF_RECORD_AUX event. If the driver has to | |
474 | wait, the intel_pt event gets disabled. Because it is difficult to know when | |
475 | that happens, perf tools always re-enable the intel_pt event after copying out | |
476 | data. | |
477 | ||
478 | ||
479 | Intel PT and build ids | |
480 | ---------------------- | |
481 | ||
482 | By default "perf record" post-processes the event stream to find all build ids | |
483 | for executables for all addresses sampled. Deliberately, Intel PT is not | |
484 | decoded for that purpose (it would take too long). Instead the build ids for | |
485 | all executables encountered (due to mmap, comm or task events) are included | |
486 | in the perf.data file. | |
487 | ||
488 | To see buildids included in the perf.data file use the command: | |
489 | ||
490 | perf buildid-list | |
491 | ||
492 | If the perf.data file contains Intel PT data, that is the same as: | |
493 | ||
494 | perf buildid-list --with-hits | |
495 | ||
496 | ||
497 | Snapshot mode and event disabling | |
498 | --------------------------------- | |
499 | ||
500 | In order to make a snapshot, the intel_pt event is disabled using an IOCTL, | |
501 | namely PERF_EVENT_IOC_DISABLE. However doing that can also disable the | |
502 | collection of side-band information. In order to prevent that, a dummy | |
503 | software event has been introduced that permits tracking events (like mmaps) to | |
504 | continue to be recorded while intel_pt is disabled. That is important to ensure | |
505 | there is complete side-band information to allow the decoding of subsequent | |
506 | snapshots. | |
507 | ||
508 | A test has been created for that. To find the test: | |
509 | ||
510 | perf test list | |
511 | ... | |
512 | 23: Test using a dummy software event to keep tracking | |
513 | ||
514 | To run the test: | |
515 | ||
516 | perf test 23 | |
517 | 23: Test using a dummy software event to keep tracking : Ok | |
518 | ||
519 | ||
520 | perf record modes (nothing new here) | |
521 | ------------------------------------ | |
522 | ||
523 | perf record essentially operates in one of three modes: | |
524 | per thread | |
525 | per cpu | |
526 | workload only | |
527 | ||
528 | "per thread" mode is selected by -t or by --per-thread (with -p or -u or just a | |
529 | workload). | |
530 | "per cpu" is selected by -C or -a. | |
531 | "workload only" mode is selected by not using the other options but providing a | |
532 | command to run (i.e. the workload). | |
533 | ||
534 | In per-thread mode an exact list of threads is traced. There is no inheritance. | |
535 | Each thread has its own event buffer. | |
536 | ||
537 | In per-cpu mode all processes (or processes from the selected cgroup i.e. -G | |
538 | option, or processes selected with -p or -u) are traced. Each cpu has its own | |
539 | buffer. Inheritance is allowed. | |
540 | ||
541 | In workload-only mode, the workload is traced but with per-cpu buffers. | |
542 | Inheritance is allowed. Note that you can now trace a workload in per-thread | |
543 | mode by using the --per-thread option. | |
544 | ||
545 | ||
546 | Privileged vs non-privileged users | |
547 | ---------------------------------- | |
548 | ||
549 | Unless /proc/sys/kernel/perf_event_paranoid is set to -1, unprivileged users | |
550 | have memory limits imposed upon them. That affects what buffer sizes they can | |
551 | have as outlined above. | |
552 | ||
553 | Unless /proc/sys/kernel/perf_event_paranoid is set to -1, unprivileged users are | |
554 | not permitted to use tracepoints which means there is insufficient side-band | |
555 | information to decode Intel PT in per-cpu mode, and potentially workload-only | |
556 | mode too if the workload creates new processes. | |
557 | ||
558 | Note also, that to use tracepoints, read-access to debugfs is required. So if | |
559 | debugfs is not mounted or the user does not have read-access, it will again not | |
560 | be possible to decode Intel PT in per-cpu mode. | |
561 | ||
562 | ||
563 | sched_switch tracepoint | |
564 | ----------------------- | |
565 | ||
566 | The sched_switch tracepoint is used to provide side-band data for Intel PT | |
567 | decoding. sched_switch events are automatically added. e.g. the second event | |
568 | shown below | |
569 | ||
570 | $ perf record -vv -e intel_pt//u uname | |
571 | ------------------------------------------------------------ | |
572 | perf_event_attr: | |
573 | type 6 | |
574 | size 112 | |
575 | config 0x400 | |
576 | { sample_period, sample_freq } 1 | |
577 | sample_type IP|TID|TIME|CPU|IDENTIFIER | |
578 | read_format ID | |
579 | disabled 1 | |
580 | inherit 1 | |
581 | exclude_kernel 1 | |
582 | exclude_hv 1 | |
583 | enable_on_exec 1 | |
584 | sample_id_all 1 | |
585 | ------------------------------------------------------------ | |
586 | sys_perf_event_open: pid 31104 cpu 0 group_fd -1 flags 0x8 | |
587 | sys_perf_event_open: pid 31104 cpu 1 group_fd -1 flags 0x8 | |
588 | sys_perf_event_open: pid 31104 cpu 2 group_fd -1 flags 0x8 | |
589 | sys_perf_event_open: pid 31104 cpu 3 group_fd -1 flags 0x8 | |
590 | ------------------------------------------------------------ | |
591 | perf_event_attr: | |
592 | type 2 | |
593 | size 112 | |
594 | config 0x108 | |
595 | { sample_period, sample_freq } 1 | |
596 | sample_type IP|TID|TIME|CPU|PERIOD|RAW|IDENTIFIER | |
597 | read_format ID | |
598 | inherit 1 | |
599 | sample_id_all 1 | |
600 | exclude_guest 1 | |
601 | ------------------------------------------------------------ | |
602 | sys_perf_event_open: pid -1 cpu 0 group_fd -1 flags 0x8 | |
603 | sys_perf_event_open: pid -1 cpu 1 group_fd -1 flags 0x8 | |
604 | sys_perf_event_open: pid -1 cpu 2 group_fd -1 flags 0x8 | |
605 | sys_perf_event_open: pid -1 cpu 3 group_fd -1 flags 0x8 | |
606 | ------------------------------------------------------------ | |
607 | perf_event_attr: | |
608 | type 1 | |
609 | size 112 | |
610 | config 0x9 | |
611 | { sample_period, sample_freq } 1 | |
612 | sample_type IP|TID|TIME|IDENTIFIER | |
613 | read_format ID | |
614 | disabled 1 | |
615 | inherit 1 | |
616 | exclude_kernel 1 | |
617 | exclude_hv 1 | |
618 | mmap 1 | |
619 | comm 1 | |
620 | enable_on_exec 1 | |
621 | task 1 | |
622 | sample_id_all 1 | |
623 | mmap2 1 | |
624 | comm_exec 1 | |
625 | ------------------------------------------------------------ | |
626 | sys_perf_event_open: pid 31104 cpu 0 group_fd -1 flags 0x8 | |
627 | sys_perf_event_open: pid 31104 cpu 1 group_fd -1 flags 0x8 | |
628 | sys_perf_event_open: pid 31104 cpu 2 group_fd -1 flags 0x8 | |
629 | sys_perf_event_open: pid 31104 cpu 3 group_fd -1 flags 0x8 | |
630 | mmap size 528384B | |
631 | AUX area mmap length 4194304 | |
632 | perf event ring buffer mmapped per cpu | |
633 | Synthesizing auxtrace information | |
634 | Linux | |
635 | [ perf record: Woken up 1 times to write data ] | |
636 | [ perf record: Captured and wrote 0.042 MB perf.data ] | |
637 | ||
638 | Note, the sched_switch event is only added if the user is permitted to use it | |
639 | and only in per-cpu mode. | |
640 | ||
641 | Note also, the sched_switch event is only added if TSC packets are requested. | |
642 | That is because, in the absence of timing information, the sched_switch events | |
643 | cannot be matched against the Intel PT trace. | |
644 | ||
645 | ||
646 | perf script | |
647 | =========== | |
648 | ||
649 | By default, perf script will decode trace data found in the perf.data file. | |
650 | This can be further controlled by new option --itrace. | |
651 | ||
652 | ||
653 | New --itrace option | |
654 | ------------------- | |
655 | ||
656 | Having no option is the same as | |
657 | ||
658 | --itrace | |
659 | ||
660 | which, in turn, is the same as | |
661 | ||
662 | --itrace=ibxe | |
663 | ||
664 | The letters are: | |
665 | ||
666 | i synthesize "instructions" events | |
667 | b synthesize "branches" events | |
668 | x synthesize "transactions" events | |
669 | c synthesize branches events (calls only) | |
670 | r synthesize branches events (returns only) | |
671 | e synthesize tracing error events | |
672 | d create a debug log | |
673 | g synthesize a call chain (use with i or x) | |
f14445ee | 674 | l synthesize last branch entries (use with i or x) |
d1706b39 | 675 | s skip initial number of events |
5efb1d54 AH |
676 | |
677 | "Instructions" events look like they were recorded by "perf record -e | |
678 | instructions". | |
679 | ||
680 | "Branches" events look like they were recorded by "perf record -e branches". "c" | |
681 | and "r" can be combined to get calls and returns. | |
682 | ||
683 | "Transactions" events correspond to the start or end of transactions. The | |
684 | 'flags' field can be used in perf script to determine whether the event is a | |
685 | transaction start, commit or abort. | |
686 | ||
687 | Error events are new. They show where the decoder lost the trace. Error events | |
688 | are quite important. Users must know if what they are seeing is a complete | |
689 | picture or not. | |
690 | ||
691 | The "d" option will cause the creation of a file "intel_pt.log" containing all | |
692 | decoded packets and instructions. Note that this option slows down the decoder | |
693 | and that the resulting file may be very large. | |
694 | ||
695 | In addition, the period of the "instructions" event can be specified. e.g. | |
696 | ||
697 | --itrace=i10us | |
698 | ||
699 | sets the period to 10us i.e. one instruction sample is synthesized for each 10 | |
700 | microseconds of trace. Alternatives to "us" are "ms" (milliseconds), | |
701 | "ns" (nanoseconds), "t" (TSC ticks) or "i" (instructions). | |
702 | ||
703 | "ms", "us" and "ns" are converted to TSC ticks. | |
704 | ||
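As a rough illustration of that conversion, the sketch below turns a period given in "ms", "us" or "ns" into TSC ticks. It is a simplification under stated assumptions: the function name is hypothetical, the TSC frequency is passed in as a plain number (3 GHz is only an example), and the conversion that perf actually performs may differ in detail.

```python
# Nanoseconds per unit, for the time units accepted by --itrace.
UNIT_NS = {"ms": 1000000, "us": 1000, "ns": 1}


def period_to_tsc_ticks(period, unit, tsc_hz):
    """Convert a time period to TSC ticks (simplified, hypothetical helper)."""
    nanoseconds = period * UNIT_NS[unit]
    # Integer arithmetic: ticks = nanoseconds * ticks-per-second / 1e9.
    return nanoseconds * tsc_hz // 1000000000


if __name__ == "__main__":
    # --itrace=i10us on an assumed 3 GHz TSC.
    print(period_to_tsc_ticks(10, "us", 3000000000))
```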
705 | The timing information included with Intel PT does not give the time of every | |
706 | instruction. Consequently, for the purpose of sampling, the decoder estimates | |
707 | the time since the last timing packet based on 1 tick per instruction. The time | |
708 | on the sample is *not* adjusted and reflects the last known value of TSC. | |
709 | ||
710 | For Intel PT, the default period is 100us. | |
711 | ||
e1791347 AH |
712 | Setting it to a zero period means "as often as possible". |
713 | ||
714 | In the case of Intel PT that is the same as a period of 1 and a unit of | |
715 | 'instructions' (i.e. --itrace=i1i). | |
716 | ||
5efb1d54 AH |
717 | Also the call chain size (default 16, max. 1024) for instructions or |
718 | transactions events can be specified. e.g. | |
719 | ||
720 | --itrace=ig32 | |
721 | --itrace=xg32 | |
722 | ||
f14445ee AH |
723 | Also the number of last branch entries (default 64, max. 1024) for instructions or |
724 | transactions events can be specified. e.g. | |
725 | ||
726 | --itrace=il10 | |
727 | --itrace=xl10 | |
728 | ||
729 | Note that last branch entries are cleared for each sample, so there is no overlap | |
730 | from one sample to the next. | |
731 | ||
5efb1d54 AH |
732 | To disable trace decoding entirely, use the option --no-itrace. |
733 | ||
d1706b39 AK |
734 | It is also possible to skip events generated (instructions, branches, transactions) |
735 | at the beginning. This is useful to ignore initialization code. | |
736 | ||
737 | --itrace=i0nss1000000 | |
738 | ||
739 | skips the first million instructions. | |
5efb1d54 AH |
740 | |
741 | dump option | |
742 | ----------- | |
743 | ||
744 | perf script has an option (-D) to "dump" the events i.e. display the binary | |
745 | data. | |
746 | ||
747 | When -D is used, Intel PT packets are displayed. The packet decoder does not | |
748 | pay attention to PSB packets, but just decodes the bytes - so the packets seen | |
749 | by the actual decoder may not be identical in places where the data is corrupt. | |
750 | One example of that would be when the buffer-switching interrupt has been too | |
751 | slow, and the buffer has been filled completely. In that case, the last packet | |
752 | in the buffer might be truncated and immediately followed by a PSB as the trace | |
753 | continues in the next buffer. | |
754 | ||
755 | To disable the display of Intel PT packets, combine the -D option with | |
756 | --no-itrace. | |
757 | ||
758 | ||
759 | perf report | |
760 | =========== | |
761 | ||
762 | By default, perf report will decode trace data found in the perf.data file. | |
763 | This can be further controlled by new option --itrace exactly the same as | |
764 | perf script, with the exception that the default is --itrace=igxe. | |
765 | ||
766 | ||
767 | perf inject | |
768 | =========== | |
769 | ||
770 | perf inject also accepts the --itrace option in which case tracing data is | |
771 | removed and replaced with the synthesized events. e.g. | |
772 | ||
773 | perf inject --itrace -i perf.data -o perf.data.new | |
ba11ba65 AH |
774 | |
775 | Below is an example of using Intel PT with autofdo. It requires autofdo | |
776 | (https://github.com/google/autofdo) and gcc version 5. The bubble | |
777 | sort example is from the AutoFDO tutorial (https://gcc.gnu.org/wiki/AutoFDO/Tutorial) | |
778 | amended to take the number of elements as a parameter. | |
779 | ||
780 | $ gcc-5 -O3 sort.c -o sort_optimized | |
781 | $ ./sort_optimized 30000 | |
782 | Bubble sorting array of 30000 elements | |
783 | 2254 ms | |
784 | ||
785 | $ cat ~/.perfconfig | |
786 | [intel-pt] | |
787 | mispred-all | |
788 | ||
789 | $ perf record -e intel_pt//u ./sort 3000 | |
790 | Bubble sorting array of 3000 elements | |
791 | 58 ms | |
792 | [ perf record: Woken up 2 times to write data ] | |
793 | [ perf record: Captured and wrote 3.939 MB perf.data ] | |
794 | $ perf inject -i perf.data -o inj --itrace=i100usle --strip | |
795 | $ ./create_gcov --binary=./sort --profile=inj --gcov=sort.gcov -gcov_version=1 | |
796 | $ gcc-5 -O3 -fauto-profile=sort.gcov sort.c -o sort_autofdo | |
797 | $ ./sort_autofdo 30000 | |
798 | Bubble sorting array of 30000 elements | |
799 | 2155 ms | |
800 | ||
801 | Note there is currently no advantage to using Intel PT instead of LBR, but | |
802 | that may change in the future if greater use is made of the data. |