convert from svn repository: remove tags directory
[lttv.git] / trunk / tests / markers / markers-microbench-0.2.txt
CommitLineData
bc82195a 1
2
c7b68d19 3Hi,
4
5Following the huge discussion thread about tracing/static vs dynamic
6instrumentation/markers, a consensus seems to emerge about the need for a
7marker system in the Linux kernel. The main issues this mechanism addresses are:
8
9- Identify code important to runtime data collection/analysis tools in tree so
10 that it follows the code changes naturally.
11- Be visually appealing to kernel developers.
12- Have a very low impact on the system performance.
13- Integrate in the standard kernel infrastructure : use C and loadable modules.
14
15The time has come for some performance measurements of the Linux Kernel Markers,
16which follows.
17
18
19* Micro-benchmarks
bc82195a 20
21Use timestamp counter to calculate the time spent, with interrupts disabled.
b9c0c958 22Machine : Pentium 4 3GHz, 1GB ram
bc82195a 23Fully preemptible kernel
bc82195a 24Linux Kernel Markers 0.19
b9c0c958 25Kernel : Linux 2.6.17
26
27marker : MARK(subsys_mark1, "%d %p", 1, NULL);
28
29This marker, with two elements (integer and pointer) have been chosen because it
30is representative of high volume events. For instance, a trap entry event logs a
31trap_id (long) and an address (pointer). The same applies to system calls, where
32a system call entry event logs both the ID of the system call and the address of
33the caller.
34
bc82195a 35
36* Execute an empty loop
b9c0c958 37
38- Without marker
bc82195a 39NR_LOOPS : 10000000
40time delta (cycles): 15026497
41cycles per loop : 1.50
b9c0c958 42
bc82195a 43- i386 "optimized" : immediate value, test and predicted branch
44 (non connected marker)
45NR_LOOPS : 10000000
46time delta (cycles): 40031640
47cycles per loop : 4.00
b9c0c958 48cycles per loop for marker : 4.00-1.50=2.50
49
bc82195a 50- i386 "generic" : load, test and predicted branch
51 (non connected marker)
52NR_LOOPS : 10000000
53time delta (cycles): 26697878
54cycles per loop : 2.67
b9c0c958 55cycles per loop for marker : 2.67-1.50=1.17
56
bc82195a 57
58* Execute a loop of memcpy 4096 bytes
b9c0c958 59
60This test has been done to show the impact of markers on a system where the
61memory is already used, which is more representative of a running kernel.
62
bc82195a 63- Without marker
64NR_LOOPS : 10000
65time delta (cycles): 12981555
66cycles per loop : 1298.16
b9c0c958 67
bc82195a 68- i386 "optimized" : immediate value, test and predicted branch
69 (non connected marker)
70NR_LOOPS : 10000
71time delta (cycles): 12982290
72cycles per loop : 1298.23
b9c0c958 73cycles per loop for marker : 1298.23-1298.16=0.074
74
bc82195a 75- i386 "generic" : load, test and predicted branch
76 (non connected marker)
77NR_LOOPS : 10000
78time delta (cycles): 13002788
79cycles per loop : 1300.28
b9c0c958 80cycles per loop for marker : 1300.28-1298.16=2.123
bc82195a 81
82
83The following tests are done with the "optimized" markers only
84
b9c0c958 85- Execute a loop with marker enabled, with i386 "fastcall" register argument
86 setup, probe empty. With preemption disabling.
bc82195a 87NR_LOOPS : 100000
88time delta (cycles): 4407608
89cycles per loop : 44.08
b9c0c958 90cycles per loop to disable preemption and setup arguments in registers :
9144.08-4.00=40.08
bc82195a 92
b9c0c958 93- Execute a loop with a marker enabled, with an empty probe. Var args argument
94 setup, probe empty. With preemption disabling.
bc82195a 95NR_LOOPS : 100000
96time delta (cycles): 5210587
97cycles per loop : 52.11
98additional cycles per loop to setup var args : 52.11-44.08=8.03
99
b9c0c958 100- Execute a loop with a marker enabled, with an empty probe. Var args argument
101 setup, probe empty. No preemption disabling.
102NR_LOOPS : 100000
103time delta (cycles): 3363450
104cycles per loop : 33.63
105cycles per loop to disable preemption : 44.08-33.63=10.45
106
107- Execute a loop with marker enabled, with i386 "asmlinkage" arguments expected.
108 Data is copied by the probe. With preemption disabling.
bc82195a 109NR_LOOPS : 100000
110time delta (cycles): 5299837
111cycles per loop : 53.00
112additional cycles per loop to get arguments in probe (from stack) on x86 :
b9c0c958 11353.00-52.11=0.89
bc82195a 114
b9c0c958 115- Execute a loop with marker enabled, with var args probe expecting arguments.
116 Data is copied by the probe. With preemption disabling.
bc82195a 117NR_LOOPS : 100000
118time delta (cycles): 5574300
119cycles per loop : 55.74
120additional cycles per loop to get expected variable arguments on x86 :
121 55.74-53.00=2.74
122
b9c0c958 123- Execute a loop with marker enabled, with var args probe, format string
124 Data is copied by the probe. This is a 6 bytes string to decode.
bc82195a 125NR_LOOPS : 100000
126time delta (cycles): 9622117
127cycles per loop : 96.22
b9c0c958 128additional cycles per loop to dynamically parse arguments with a 6 bytes format
c7b68d19 129string : 96.22-55.74=40.48
130
131- Execute a loop with marker enabled, with var args probe expecting arguments.
132 Data is copied by the probe. With preemption disabling. An empty "kprobe" is
133 connected to the probe.
134NR_LOOPS : 100000
135time delta (cycles): 423397455
136cycles per loop : 4233.97
137additional cycles per loop to execute the kprobe : 4233.97-55.74=4178.23
bc82195a 138
139
140* Assembly code
141
b9c0c958 142The disassembly of the following marked function will be shown :
143
144static int my_open(struct inode *inode, struct file *file)
145{
146 MARK(subsys_mark1, "%d %p", 1, NULL);
147
148 return -EPERM;
149}
150
bc82195a 151
152- Optimized
153
154static int my_open(struct inode *inode, struct file *file)
155{
156 0: 55 push %ebp
157 1: 89 e5 mov %esp,%ebp
158 3: 83 ec 0c sub $0xc,%esp
159 MARK(subsys_mark1, "%d %p", 1, NULL);
b9c0c958 160 6: b0 00 mov $0x0,%al <-- immediate load 0 in al
bc82195a 161 8: 84 c0 test %al,%al
162 a: 75 07 jne 13 <my_open+0x13>
163
164 return -EPERM;
165}
166 c: b8 ff ff ff ff mov $0xffffffff,%eax
167 11: c9 leave
168 12: c3 ret
169 13: b8 01 00 00 00 mov $0x1,%eax
b9c0c958 170 18: e8 fc ff ff ff call 19 <my_open+0x19> <-- preempt_disable
bc82195a 171 1d: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp)
172 24: 00
173 25: c7 44 24 04 01 00 00 movl $0x1,0x4(%esp)
174 2c: 00
175 2d: c7 04 24 0d 00 00 00 movl $0xd,(%esp)
b9c0c958 176 34: ff 15 74 10 00 00 call *0x1074 <-- function pointer
bc82195a 177 3a: b8 01 00 00 00 mov $0x1,%eax
b9c0c958 178 3f: e8 fc ff ff ff call 40 <my_open+0x40> <-- preempt_enable
bc82195a 179 44: eb c6 jmp c <my_open+0xc>
180
181
182- Generic
183
184static int my_open(struct inode *inode, struct file *file)
185{
186 0: 55 push %ebp
187 1: 89 e5 mov %esp,%ebp
188 3: 83 ec 0c sub $0xc,%esp
189 MARK(subsys_mark1, "%d %p", 1, NULL);
b9c0c958 190 6: 0f b6 05 20 10 00 00 movzbl 0x1020,%eax <-- memory load byte
bc82195a 191 d: 84 c0 test %al,%al
192 f: 75 07 jne 18 <my_open+0x18>
193
194 return -EPERM;
195}
196 11: b8 ff ff ff ff mov $0xffffffff,%eax
197 16: c9 leave
198 17: c3 ret
199 18: b8 01 00 00 00 mov $0x1,%eax
b9c0c958 200 1d: e8 fc ff ff ff call 1e <my_open+0x1e> <-- preempt_disable
bc82195a 201 22: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp)
202 29: 00
203 2a: c7 44 24 04 01 00 00 movl $0x1,0x4(%esp)
204 31: 00
205 32: c7 04 24 0d 00 00 00 movl $0xd,(%esp)
b9c0c958 206 39: ff 15 74 10 00 00 call *0x1074 <-- function pointer
bc82195a 207 3f: b8 01 00 00 00 mov $0x1,%eax
b9c0c958 208 44: e8 fc ff ff ff call 45 <my_open+0x45> <-- preempt_enable
bc82195a 209 49: eb c6 jmp 11 <my_open+0x11>
210
b9c0c958 211
212
213Here is the typical var arg probe that has been used in those tests. It saves
214the values expectes as parameters in global variables. The DO_MARK1_FORMAT
215define is used for probe registration to make sure that it will be connected
216with a marker that has a matching format string. Note that this checking is
217optional : the probe can register with a NULL format and afterward check itself
218the format string received in parameter dynamically.
219
220int value;
221void *ptr;
222
223#define DO_MARK1_FORMAT "%d %p"
224void do_mark1(const char *format, ...)
225{
226 va_list ap;
227
228 va_start(ap, format);
229 value = va_arg(ap, int);
230 ptr = va_arg(ap, void*);
231
232 va_end(ap);
233}
234
235
236Here is the disassembly of the probe :
237
238#define DO_MARK1_FORMAT "%d %p"
239void do_mark1(const char *format, ...)
240{
241 0: 55 push %ebp
242 1: 89 e5 mov %esp,%ebp
243 3: 83 ec 04 sub $0x4,%esp
244 va_list ap;
245
246 va_start(ap, format);
247 value = va_arg(ap, int);
248 6: 8b 45 0c mov 0xc(%ebp),%eax
249 9: a3 00 00 00 00 mov %eax,0x0
250 ptr = va_arg(ap, void*);
251 e: 8b 45 10 mov 0x10(%ebp),%eax
252 11: a3 00 00 00 00 mov %eax,0x0
253
254 va_end(ap);
255}
256 16: c9 leave
257 17: c3 ret
258
259
bc82195a 260* Size (x86)
261
b9c0c958 262This is the size added by each marker to the memory image :
263
bc82195a 264- Optimized
265
b9c0c958 266.text section : instructions
bc82195a 267Adds 6 bytes in the "likely" path.
268Adds 32 bytes in the "unlikely" path.
b9c0c958 269.data section : r/w data
2700 byte
271.rodata.str1 : strings
272Length of the marker name
273.debug_str : strings (if loaded..)
274Length of the marker name + 7 bytes (__mark_)
275.markers
2768 bytes (2 pointers)
277.markers.c
27812 bytes (3 pointers)
279
280- Generic
281
282.text section : instructions
bc82195a 283Adds 11 bytes in the "likely" path.
284Adds 32 bytes in the "unlikely" path.
b9c0c958 285.data section : r/w data
2861 byte (the activation flag)
287.rodata.str1 : strings
288Length of the marker name
289.debug_str : strings (if loaded..)
290Length of the marker name + 7 bytes (__mark_)
291.markers
2928 bytes (2 pointers)
293.markers.c
29412 bytes (3 pointers)
bc82195a 295
296
c7b68d19 297* Macro-benchmarks
bc82195a 298
299Compiling a 2.6.17 kernel on a Pentium 4 3GHz, 1GB ram, cold cache.
300Running a 2.6.17 vanilla kernel :
301real 8m2.443s
302user 7m35.124s
303sys 0m34.950s
304
305Running a 2.6.17 kernel with lttng-0.6.0pre11 markers (no probe connected) :
306real 8m1.635s
307user 7m34.552s
308sys 0m36.298s
309
c7b68d19 310--> 0.98 % speedup with markers
311
bc82195a 312Ping flood on loopback interface :
313Running a 2.6.17 vanilla kernel :
314136596 packets transmitted, 136596 packets received, 0% packet loss
315round-trip min/avg/max = 0.0/0.0/0.1 ms
316
317real 0m10.840s
318user 0m0.360s
319sys 0m10.485s
320
32112601 packets transmitted/s
322
323Running a 2.6.17 kernel with lttng-0.6.0pre11 markers (no probe connected) :
324108504 packets transmitted, 108504 packets received, 0% packet loss
325round-trip min/avg/max = 0.0/0.0/0.1 ms
326
327real 0m8.614s
328user 0m0.264s
329sys 0m8.353s
330
33112596 packets transmitted/s
332
c7b68d19 333--> 0.03 % slowdown with markers
bc82195a 334
335
336Conclusion
337
338In an empty loop, the generic marker is faster than the optimized marker. This
339may be due to better performances of the movzbl instruction over the movb on the
340Pentium 4 architecture. However, when we execute a loop of 4kB copy, the impact
341of the movzbl becomes greater because it uses the memory bandwidth.
342
343The preemption disabling and call to a probe itself costs 48.11 cycles, almost
344as much as dynamically parsing the format string to get the variable arguments
345(40.48 cycles).
346
347There is almost no difference, on x86, between passing the arguments directly on
348the stack and using a variable argument list when its layout is known
349statically (0.89 cycles vs 2.74 cycles).
350
b9c0c958 351The int3 approach for adding instrumentation dynamically saves the 0.074 cycle
352(typcal use, high memory usage) used by the optimized marker by adding the
353ability to insert a breakpoint at any location without any impact on the code
354when inactive. This breakpoint based approach is very useful to instrument core
355kernel code that has not been previously marked without need to recompile and
356reboot. We can therefore compare the case "without markers" to the null impact
c7b68d19 357of an inactive int3 breakpoint.
358
359However, the performance impact for using a kprobe is non negligible when
360activated. Assuming that kprobes would have a mechanism to get the variables
361from the caller's stack, it would perform the same task in at least 4178.23
362cycles vs 55.74 for a marker and a probe (ratio : 75). While kprobes are very
363useful for the reason explained earlier, the high event rate paths in the kernel
364would clearly benefit from a marker mechanism when the are probed.
365
366Code size and memory footprints are smaller with the optimized version : 6
367bytes of code in the likely path compared to 11 bytes. The memory footprint of
368the optimized approach saves 4 bytes of data memory that would otherwise have to
369stay in cache.
370
371On the macro-benchmark side, no significant difference in performance has been
372found between the vanilla kernel and a kernel "marked" with the standard LTTng
373instrumentation.
bc82195a 374
375
This page took 0.045492 seconds and 4 git commands to generate.