move everything out of trunk
[lttv.git] / tests / markers / markers-microbench-0.2.txt
1
2
3 Hi,
4
5 Following the huge discussion thread about tracing/static vs dynamic
6 instrumentation/markers, a consensus seems to emerge about the need for a
7 marker system in the Linux kernel. The main issues this mechanism addresses are:
8
9 - Identify code important to runtime data collection/analysis tools in tree so
10 that it follows the code changes naturally.
11 - Be visually appealing to kernel developers.
12 - Have a very low impact on the system performance.
13 - Integrate in the standard kernel infrastructure : use C and loadable modules.
14
15 The time has come for some performance measurements of the Linux Kernel Markers,
16 which follows.
17
18
19 * Micro-benchmarks
20
21 Use timestamp counter to calculate the time spent, with interrupts disabled.
22 Machine : Pentium 4 3GHz, 1GB ram
23 Fully preemptible kernel
24 Linux Kernel Markers 0.19
25 Kernel : Linux 2.6.17
26
27 marker : MARK(subsys_mark1, "%d %p", 1, NULL);
28
29 This marker, with two elements (integer and pointer) have been chosen because it
30 is representative of high volume events. For instance, a trap entry event logs a
31 trap_id (long) and an address (pointer). The same applies to system calls, where
32 a system call entry event logs both the ID of the system call and the address of
33 the caller.
34
35
36 * Execute an empty loop
37
38 - Without marker
39 NR_LOOPS : 10000000
40 time delta (cycles): 15026497
41 cycles per loop : 1.50
42
43 - i386 "optimized" : immediate value, test and predicted branch
44 (non connected marker)
45 NR_LOOPS : 10000000
46 time delta (cycles): 40031640
47 cycles per loop : 4.00
48 cycles per loop for marker : 4.00-1.50=2.50
49
50 - i386 "generic" : load, test and predicted branch
51 (non connected marker)
52 NR_LOOPS : 10000000
53 time delta (cycles): 26697878
54 cycles per loop : 2.67
55 cycles per loop for marker : 2.67-1.50=1.17
56
57
58 * Execute a loop of memcpy 4096 bytes
59
60 This test has been done to show the impact of markers on a system where the
61 memory is already used, which is more representative of a running kernel.
62
63 - Without marker
64 NR_LOOPS : 10000
65 time delta (cycles): 12981555
66 cycles per loop : 1298.16
67
68 - i386 "optimized" : immediate value, test and predicted branch
69 (non connected marker)
70 NR_LOOPS : 10000
71 time delta (cycles): 12982290
72 cycles per loop : 1298.23
73 cycles per loop for marker : 1298.23-1298.16=0.074
74
75 - i386 "generic" : load, test and predicted branch
76 (non connected marker)
77 NR_LOOPS : 10000
78 time delta (cycles): 13002788
79 cycles per loop : 1300.28
80 cycles per loop for marker : 1300.28-1298.16=2.123
81
82
83 The following tests are done with the "optimized" markers only
84
85 - Execute a loop with marker enabled, with i386 "fastcall" register argument
86 setup, probe empty. With preemption disabling.
87 NR_LOOPS : 100000
88 time delta (cycles): 4407608
89 cycles per loop : 44.08
90 cycles per loop to disable preemption and setup arguments in registers :
91 44.08-4.00=40.08
92
93 - Execute a loop with a marker enabled, with an empty probe. Var args argument
94 setup, probe empty. With preemption disabling.
95 NR_LOOPS : 100000
96 time delta (cycles): 5210587
97 cycles per loop : 52.11
98 additional cycles per loop to setup var args : 52.11-44.08=8.03
99
100 - Execute a loop with a marker enabled, with an empty probe. Var args argument
101 setup, probe empty. No preemption disabling.
102 NR_LOOPS : 100000
103 time delta (cycles): 3363450
104 cycles per loop : 33.63
105 cycles per loop to disable preemption : 44.08-33.63=10.45
106
107 - Execute a loop with marker enabled, with i386 "asmlinkage" arguments expected.
108 Data is copied by the probe. With preemption disabling.
109 NR_LOOPS : 100000
110 time delta (cycles): 5299837
111 cycles per loop : 53.00
112 additional cycles per loop to get arguments in probe (from stack) on x86 :
113 53.00-52.11=0.89
114
115 - Execute a loop with marker enabled, with var args probe expecting arguments.
116 Data is copied by the probe. With preemption disabling.
117 NR_LOOPS : 100000
118 time delta (cycles): 5574300
119 cycles per loop : 55.74
120 additional cycles per loop to get expected variable arguments on x86 :
121 55.74-53.00=2.74
122
123 - Execute a loop with marker enabled, with var args probe, format string
124 Data is copied by the probe. This is a 6 bytes string to decode.
125 NR_LOOPS : 100000
126 time delta (cycles): 9622117
127 cycles per loop : 96.22
128 additional cycles per loop to dynamically parse arguments with a 6 bytes format
129 string : 96.22-55.74=40.48
130
131 - Execute a loop with marker enabled, with var args probe expecting arguments.
132 Data is copied by the probe. With preemption disabling. An empty "kprobe" is
133 connected to the probe.
134 NR_LOOPS : 100000
135 time delta (cycles): 423397455
136 cycles per loop : 4233.97
137 additional cycles per loop to execute the kprobe : 4233.97-55.74=4178.23
138
139
140 * Assembly code
141
142 The disassembly of the following marked function will be shown :
143
144 static int my_open(struct inode *inode, struct file *file)
145 {
146 MARK(subsys_mark1, "%d %p", 1, NULL);
147
148 return -EPERM;
149 }
150
151
152 - Optimized
153
154 static int my_open(struct inode *inode, struct file *file)
155 {
156 0: 55 push %ebp
157 1: 89 e5 mov %esp,%ebp
158 3: 83 ec 0c sub $0xc,%esp
159 MARK(subsys_mark1, "%d %p", 1, NULL);
160 6: b0 00 mov $0x0,%al <-- immediate load 0 in al
161 8: 84 c0 test %al,%al
162 a: 75 07 jne 13 <my_open+0x13>
163
164 return -EPERM;
165 }
166 c: b8 ff ff ff ff mov $0xffffffff,%eax
167 11: c9 leave
168 12: c3 ret
169 13: b8 01 00 00 00 mov $0x1,%eax
170 18: e8 fc ff ff ff call 19 <my_open+0x19> <-- preempt_disable
171 1d: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp)
172 24: 00
173 25: c7 44 24 04 01 00 00 movl $0x1,0x4(%esp)
174 2c: 00
175 2d: c7 04 24 0d 00 00 00 movl $0xd,(%esp)
176 34: ff 15 74 10 00 00 call *0x1074 <-- function pointer
177 3a: b8 01 00 00 00 mov $0x1,%eax
178 3f: e8 fc ff ff ff call 40 <my_open+0x40> <-- preempt_enable
179 44: eb c6 jmp c <my_open+0xc>
180
181
182 - Generic
183
184 static int my_open(struct inode *inode, struct file *file)
185 {
186 0: 55 push %ebp
187 1: 89 e5 mov %esp,%ebp
188 3: 83 ec 0c sub $0xc,%esp
189 MARK(subsys_mark1, "%d %p", 1, NULL);
190 6: 0f b6 05 20 10 00 00 movzbl 0x1020,%eax <-- memory load byte
191 d: 84 c0 test %al,%al
192 f: 75 07 jne 18 <my_open+0x18>
193
194 return -EPERM;
195 }
196 11: b8 ff ff ff ff mov $0xffffffff,%eax
197 16: c9 leave
198 17: c3 ret
199 18: b8 01 00 00 00 mov $0x1,%eax
200 1d: e8 fc ff ff ff call 1e <my_open+0x1e> <-- preempt_disable
201 22: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp)
202 29: 00
203 2a: c7 44 24 04 01 00 00 movl $0x1,0x4(%esp)
204 31: 00
205 32: c7 04 24 0d 00 00 00 movl $0xd,(%esp)
206 39: ff 15 74 10 00 00 call *0x1074 <-- function pointer
207 3f: b8 01 00 00 00 mov $0x1,%eax
208 44: e8 fc ff ff ff call 45 <my_open+0x45> <-- preempt_enable
209 49: eb c6 jmp 11 <my_open+0x11>
210
211
212
213 Here is the typical var arg probe that has been used in those tests. It saves
214 the values expectes as parameters in global variables. The DO_MARK1_FORMAT
215 define is used for probe registration to make sure that it will be connected
216 with a marker that has a matching format string. Note that this checking is
217 optional : the probe can register with a NULL format and afterward check itself
218 the format string received in parameter dynamically.
219
220 int value;
221 void *ptr;
222
223 #define DO_MARK1_FORMAT "%d %p"
224 void do_mark1(const char *format, ...)
225 {
226 va_list ap;
227
228 va_start(ap, format);
229 value = va_arg(ap, int);
230 ptr = va_arg(ap, void*);
231
232 va_end(ap);
233 }
234
235
236 Here is the disassembly of the probe :
237
238 #define DO_MARK1_FORMAT "%d %p"
239 void do_mark1(const char *format, ...)
240 {
241 0: 55 push %ebp
242 1: 89 e5 mov %esp,%ebp
243 3: 83 ec 04 sub $0x4,%esp
244 va_list ap;
245
246 va_start(ap, format);
247 value = va_arg(ap, int);
248 6: 8b 45 0c mov 0xc(%ebp),%eax
249 9: a3 00 00 00 00 mov %eax,0x0
250 ptr = va_arg(ap, void*);
251 e: 8b 45 10 mov 0x10(%ebp),%eax
252 11: a3 00 00 00 00 mov %eax,0x0
253
254 va_end(ap);
255 }
256 16: c9 leave
257 17: c3 ret
258
259
260 * Size (x86)
261
262 This is the size added by each marker to the memory image :
263
264 - Optimized
265
266 .text section : instructions
267 Adds 6 bytes in the "likely" path.
268 Adds 32 bytes in the "unlikely" path.
269 .data section : r/w data
270 0 byte
271 .rodata.str1 : strings
272 Length of the marker name
273 .debug_str : strings (if loaded..)
274 Length of the marker name + 7 bytes (__mark_)
275 .markers
276 8 bytes (2 pointers)
277 .markers.c
278 12 bytes (3 pointers)
279
280 - Generic
281
282 .text section : instructions
283 Adds 11 bytes in the "likely" path.
284 Adds 32 bytes in the "unlikely" path.
285 .data section : r/w data
286 1 byte (the activation flag)
287 .rodata.str1 : strings
288 Length of the marker name
289 .debug_str : strings (if loaded..)
290 Length of the marker name + 7 bytes (__mark_)
291 .markers
292 8 bytes (2 pointers)
293 .markers.c
294 12 bytes (3 pointers)
295
296
297 * Macro-benchmarks
298
299 Compiling a 2.6.17 kernel on a Pentium 4 3GHz, 1GB ram, cold cache.
300 Running a 2.6.17 vanilla kernel :
301 real 8m2.443s
302 user 7m35.124s
303 sys 0m34.950s
304
305 Running a 2.6.17 kernel with lttng-0.6.0pre11 markers (no probe connected) :
306 real 8m1.635s
307 user 7m34.552s
308 sys 0m36.298s
309
310 --> 0.98 % speedup with markers
311
312 Ping flood on loopback interface :
313 Running a 2.6.17 vanilla kernel :
314 136596 packets transmitted, 136596 packets received, 0% packet loss
315 round-trip min/avg/max = 0.0/0.0/0.1 ms
316
317 real 0m10.840s
318 user 0m0.360s
319 sys 0m10.485s
320
321 12601 packets transmitted/s
322
323 Running a 2.6.17 kernel with lttng-0.6.0pre11 markers (no probe connected) :
324 108504 packets transmitted, 108504 packets received, 0% packet loss
325 round-trip min/avg/max = 0.0/0.0/0.1 ms
326
327 real 0m8.614s
328 user 0m0.264s
329 sys 0m8.353s
330
331 12596 packets transmitted/s
332
333 --> 0.03 % slowdown with markers
334
335
336 Conclusion
337
338 In an empty loop, the generic marker is faster than the optimized marker. This
339 may be due to better performances of the movzbl instruction over the movb on the
340 Pentium 4 architecture. However, when we execute a loop of 4kB copy, the impact
341 of the movzbl becomes greater because it uses the memory bandwidth.
342
343 The preemption disabling and call to a probe itself costs 48.11 cycles, almost
344 as much as dynamically parsing the format string to get the variable arguments
345 (40.48 cycles).
346
347 There is almost no difference, on x86, between passing the arguments directly on
348 the stack and using a variable argument list when its layout is known
349 statically (0.89 cycles vs 2.74 cycles).
350
351 The int3 approach for adding instrumentation dynamically saves the 0.074 cycle
352 (typcal use, high memory usage) used by the optimized marker by adding the
353 ability to insert a breakpoint at any location without any impact on the code
354 when inactive. This breakpoint based approach is very useful to instrument core
355 kernel code that has not been previously marked without need to recompile and
356 reboot. We can therefore compare the case "without markers" to the null impact
357 of an inactive int3 breakpoint.
358
359 However, the performance impact for using a kprobe is non negligible when
360 activated. Assuming that kprobes would have a mechanism to get the variables
361 from the caller's stack, it would perform the same task in at least 4178.23
362 cycles vs 55.74 for a marker and a probe (ratio : 75). While kprobes are very
363 useful for the reason explained earlier, the high event rate paths in the kernel
364 would clearly benefit from a marker mechanism when the are probed.
365
366 Code size and memory footprints are smaller with the optimized version : 6
367 bytes of code in the likely path compared to 11 bytes. The memory footprint of
368 the optimized approach saves 4 bytes of data memory that would otherwise have to
369 stay in cache.
370
371 On the macro-benchmark side, no significant difference in performance has been
372 found between the vanilla kernel and a kernel "marked" with the standard LTTng
373 instrumentation.
374
375
This page took 0.038297 seconds and 4 git commands to generate.