Performance Counters for Linux
------------------------------

Performance counters are special hardware registers available on most modern
CPUs. These registers count the number of certain types of hardware events,
such as instructions executed, cache misses suffered, or branches
mispredicted, without slowing down the kernel or applications. These
registers can also trigger interrupts when a threshold number of events have
passed, and can thus be used to profile the code that runs on that CPU.

The Linux Performance Counter subsystem provides an abstraction of these
hardware capabilities. It provides per-task and per-CPU counters and counter
groups, and it provides event capabilities on top of those.

Performance counters are accessed via special file descriptors.
There's one file descriptor per virtual counter used.

The special file descriptor is opened via the perf_counter_open()
system call:

   int sys_perf_counter_open(struct perf_counter_hw_event *hw_event_uptr,
                             pid_t pid, int cpu, int group_fd);

The syscall returns the new fd. The fd can be used via the normal
VFS system calls: read() can be used to read the counter, fcntl()
can be used to set the blocking mode, etc.

Multiple counters can be kept open at a time, and the counters
can be poll()ed.

When creating a new counter fd, 'perf_counter_hw_event' is:

/*
 * Hardware event to monitor via a performance monitoring counter:
 */
struct perf_counter_hw_event {
        s64                     type;

        u64                     irq_period;
        u32                     record_type;

        u32                     disabled     :  1, /* off by default */
                                nmi          :  1, /* NMI sampling   */
                                raw          :  1, /* raw event type */
                                __reserved_1 : 29;

        u64                     __reserved_2;
};

/*
 * Generalized performance counter event types, used by the hw_event.type
 * parameter of the sys_perf_counter_open() syscall:
 */
enum hw_event_types {
        /*
         * Common hardware events, generalized by the kernel:
         */
        PERF_COUNT_CYCLES               =  0,
        PERF_COUNT_INSTRUCTIONS         =  1,
        PERF_COUNT_CACHE_REFERENCES     =  2,
        PERF_COUNT_CACHE_MISSES         =  3,
        PERF_COUNT_BRANCH_INSTRUCTIONS  =  4,
        PERF_COUNT_BRANCH_MISSES        =  5,

        /*
         * Special "software" counters provided by the kernel, even if
         * the hardware does not support performance counters. These
         * counters measure various physical and sw events of the
         * kernel (and allow the profiling of them as well):
         */
        PERF_COUNT_CPU_CLOCK            = -1,
        PERF_COUNT_TASK_CLOCK           = -2,
        /*
         * Future software events:
         */
        /* PERF_COUNT_PAGE_FAULTS       = -3,
           PERF_COUNT_CONTEXT_SWITCHES  = -4, */
};

These are standardized types of events that work uniformly on all CPUs
that implement Performance Counters support under Linux. If a CPU is
not able to count the requested event (for example, branch misses), then
the system call will return -EINVAL.

More hw_event_types are supported as well, but they are CPU-specific
and are enumerated via /sys on a per-CPU basis. Raw hw event
types can be passed in under hw_event.type if hw_event.raw is 1.
For example, to count "External bus cycles while bus lock signal asserted"
events on Intel Core CPUs, pass in a 0x4064 event type value and set
hw_event.raw to 1.

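The two ways of selecting an event described above (a generalized type, or a
raw CPU-specific code with hw_event.raw set) can be sketched roughly as
follows. This is only an illustration: the typedefs, the local copy of the
structure and the helper names init_generic_event() and init_raw_event() are
assumptions made so that the snippet compiles on its own, and are not part of
the interface described in this document.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local stand-ins for the kernel types, so the sketch is self-contained. */
typedef int64_t  s64;
typedef uint64_t u64;
typedef uint32_t u32;

/* Local copy of the structure shown earlier in this document. */
struct perf_counter_hw_event {
	s64 type;
	u64 irq_period;
	u32 record_type;
	u32 disabled     :  1,
	    nmi          :  1,
	    raw          :  1,
	    __reserved_1 : 29;
	u64 __reserved_2;
};

/* Subset of the generalized event types listed earlier. */
enum {
	PERF_COUNT_CYCLES       = 0,
	PERF_COUNT_CACHE_MISSES = 3,
};

/* Hypothetical helper: select one of the generalized event types. */
void init_generic_event(struct perf_counter_hw_event *ev, s64 type)
{
	assert(ev != NULL);
	memset(ev, 0, sizeof(*ev));	/* reserved fields stay zero */
	ev->type = type;
	ev->raw  = 0;			/* 'type' is a generalized event */
}

/* Hypothetical helper: select a raw, CPU-specific event code. */
void init_raw_event(struct perf_counter_hw_event *ev, u64 raw_code)
{
	assert(ev != NULL);
	memset(ev, 0, sizeof(*ev));
	ev->type = (s64)raw_code;	/* e.g. 0x4064 on Intel Core CPUs */
	ev->raw  = 1;			/* 'type' is a raw event code */
}
```

With these helpers, init_raw_event(&ev, 0x4064) would prepare the bus-lock
example from the text; the filled-in structure would then be passed to the
system call as hw_event_uptr.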
'record_type' is the type of data that a read() will provide for the
counter, and it can be one of:

/*
 * IRQ-notification data record type:
 */
enum perf_counter_record_type {
        PERF_RECORD_SIMPLE              =  0,
        PERF_RECORD_IRQ                 =  1,
        PERF_RECORD_GROUP               =  2,
};

A "simple" counter is one that counts hardware events and allows
them to be read out into a u64 count value. (read() returns 8 on
a successful read of a simple counter.)

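Reading a simple counter could look roughly like the sketch below. The helper
name read_simple_counter() is hypothetical; the only behaviour taken from the
text is that a successful read() of a simple counter returns 8 bytes holding
the u64 count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Hypothetical helper: read one u64 count from a counter fd.
 * Returns 0 on success, -1 if read() did not deliver exactly 8 bytes.
 */
int read_simple_counter(int fd, uint64_t *count)
{
	ssize_t n;

	assert(count != NULL);
	n = read(fd, count, sizeof(*count));
	return n == (ssize_t)sizeof(*count) ? 0 : -1;
}
```

The helper treats anything other than a full 8-byte read as a failure, so the
caller never sees a partially filled count.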
An "irq" counter is one that also provides IRQ context information:
the IP of the interrupted context. In this case read() will return
the 8-byte counter value, plus the Instruction Pointer address of the
interrupted context.

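Decoding the data returned for an "irq" counter might be sketched as below.
The record layout used here (the 8-byte count first, followed by the
instruction pointer stored as a second 8-byte value) is an assumption based
on the order in which the text lists the two items; struct irq_record and
decode_irq_record() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout of one "irq" record: count first, then the IP. */
struct irq_record {
	uint64_t count;	/* 8-byte counter value */
	uint64_t ip;	/* instruction pointer of the interrupted context */
};

/*
 * Hypothetical helper: decode one record from a buffer filled by read().
 * Returns 0 on success, -1 if the buffer is too short for a full record.
 */
int decode_irq_record(const void *buf, size_t len, struct irq_record *rec)
{
	assert(buf != NULL && rec != NULL);
	if (len < sizeof(*rec))
		return -1;
	memcpy(rec, buf, sizeof(*rec));	/* memcpy avoids alignment issues */
	return 0;
}
```

A caller would pass the buffer and the byte count returned by read(); a short
read is rejected rather than decoded into a half-valid record.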
The 'irq_period' field is the number of events before waking up
a read() that is blocked on a counter fd. A zero value means a
non-blocking counter.

The 'pid' parameter allows the counter to be specific to a task:

 pid == 0: if the pid parameter is zero, the counter is attached to the
 current task.

 pid > 0: the counter is attached to a specific task (if the current task
 has sufficient privilege to do so)

 pid < 0: all tasks are counted (per cpu counters)

The 'cpu' parameter allows a counter to be made specific to a full
CPU:

 cpu >= 0: the counter is restricted to a specific CPU
 cpu == -1: the counter counts on all CPUs

(Note: the combination of 'pid == -1' and 'cpu == -1' is not valid.)

A 'pid > 0' and 'cpu == -1' counter is a per task counter that counts
events of that task and 'follows' that task to whatever CPU the task
gets scheduled to. Per task counters can be created by any user, for
their own tasks.

A 'pid == -1' and 'cpu == x' counter is a per CPU counter that counts
all events on CPU-x. Per CPU counters need CAP_SYS_ADMIN privilege.

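The 'pid' and 'cpu' rules above can be summarized in a small classifier. This
is only a sketch of the rules as stated in the text: enum counter_scope and
classify_counter() are hypothetical names, and treating any 'cpu' value below
-1 as invalid is an assumption.

```c
#include <sys/types.h>

enum counter_scope {
	SCOPE_INVALID,		/* all tasks on all CPUs: not allowed */
	SCOPE_TASK,		/* pid >= 0, cpu == -1: follows the task */
	SCOPE_TASK_ON_CPU,	/* pid >= 0, cpu >= 0: task, one CPU only */
	SCOPE_CPU,		/* pid < 0, cpu >= 0: needs CAP_SYS_ADMIN */
};

/* Hypothetical helper: map a (pid, cpu) pair to the kind of counter. */
enum counter_scope classify_counter(pid_t pid, int cpu)
{
	if (cpu < -1)
		return SCOPE_INVALID;	/* assumption: only -1 means "all CPUs" */
	if (pid < 0)
		return cpu == -1 ? SCOPE_INVALID : SCOPE_CPU;
	return cpu == -1 ? SCOPE_TASK : SCOPE_TASK_ON_CPU;
}
```

For example, classify_counter(0, -1) describes a counter on the current task
that follows it across CPUs, while classify_counter(-1, 2) describes a per
CPU counter on CPU 2.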
Group counters are created by passing in a group_fd of another counter.
Groups are scheduled at once and can be used with PERF_RECORD_GROUP
to record multi-dimensional timestamps.