move everything out of trunk
[lttv.git] / tests / markers / markers-microbench-0.2.txt
... / ...
CommitLineData
1
2
3Hi,
4
5Following the huge discussion thread about tracing/static vs dynamic
6instrumentation/markers, a consensus seems to emerge about the need for a
7marker system in the Linux kernel. The main issues this mechanism addresses are:
8
9- Identify code important to runtime data collection/analysis tools in tree so
10 that it follows the code changes naturally.
11- Be visually appealing to kernel developers.
12- Have a very low impact on the system performance.
13- Integrate in the standard kernel infrastructure : use C and loadable modules.
14
15The time has come for some performance measurements of the Linux Kernel Markers,
16which follows.
17
18
19* Micro-benchmarks
20
21Use timestamp counter to calculate the time spent, with interrupts disabled.
22Machine : Pentium 4 3GHz, 1GB ram
23Fully preemptible kernel
24Linux Kernel Markers 0.19
25Kernel : Linux 2.6.17
26
27marker : MARK(subsys_mark1, "%d %p", 1, NULL);
28
29This marker, with two elements (integer and pointer) have been chosen because it
30is representative of high volume events. For instance, a trap entry event logs a
31trap_id (long) and an address (pointer). The same applies to system calls, where
32a system call entry event logs both the ID of the system call and the address of
33the caller.
34
35
36* Execute an empty loop
37
38- Without marker
39NR_LOOPS : 10000000
40time delta (cycles): 15026497
41cycles per loop : 1.50
42
43- i386 "optimized" : immediate value, test and predicted branch
44 (non connected marker)
45NR_LOOPS : 10000000
46time delta (cycles): 40031640
47cycles per loop : 4.00
48cycles per loop for marker : 4.00-1.50=2.50
49
50- i386 "generic" : load, test and predicted branch
51 (non connected marker)
52NR_LOOPS : 10000000
53time delta (cycles): 26697878
54cycles per loop : 2.67
55cycles per loop for marker : 2.67-1.50=1.17
56
57
58* Execute a loop of memcpy 4096 bytes
59
60This test has been done to show the impact of markers on a system where the
61memory is already used, which is more representative of a running kernel.
62
63- Without marker
64NR_LOOPS : 10000
65time delta (cycles): 12981555
66cycles per loop : 1298.16
67
68- i386 "optimized" : immediate value, test and predicted branch
69 (non connected marker)
70NR_LOOPS : 10000
71time delta (cycles): 12982290
72cycles per loop : 1298.23
73cycles per loop for marker : 1298.23-1298.16=0.074
74
75- i386 "generic" : load, test and predicted branch
76 (non connected marker)
77NR_LOOPS : 10000
78time delta (cycles): 13002788
79cycles per loop : 1300.28
80cycles per loop for marker : 1300.28-1298.16=2.123
81
82
83The following tests are done with the "optimized" markers only
84
85- Execute a loop with marker enabled, with i386 "fastcall" register argument
86 setup, probe empty. With preemption disabling.
87NR_LOOPS : 100000
88time delta (cycles): 4407608
89cycles per loop : 44.08
90cycles per loop to disable preemption and setup arguments in registers :
9144.08-4.00=40.08
92
93- Execute a loop with a marker enabled, with an empty probe. Var args argument
94 setup, probe empty. With preemption disabling.
95NR_LOOPS : 100000
96time delta (cycles): 5210587
97cycles per loop : 52.11
98additional cycles per loop to setup var args : 52.11-44.08=8.03
99
100- Execute a loop with a marker enabled, with an empty probe. Var args argument
101 setup, probe empty. No preemption disabling.
102NR_LOOPS : 100000
103time delta (cycles): 3363450
104cycles per loop : 33.63
105cycles per loop to disable preemption : 44.08-33.63=10.45
106
107- Execute a loop with marker enabled, with i386 "asmlinkage" arguments expected.
108 Data is copied by the probe. With preemption disabling.
109NR_LOOPS : 100000
110time delta (cycles): 5299837
111cycles per loop : 53.00
112additional cycles per loop to get arguments in probe (from stack) on x86 :
11353.00-52.11=0.89
114
115- Execute a loop with marker enabled, with var args probe expecting arguments.
116 Data is copied by the probe. With preemption disabling.
117NR_LOOPS : 100000
118time delta (cycles): 5574300
119cycles per loop : 55.74
120additional cycles per loop to get expected variable arguments on x86 :
121 55.74-53.00=2.74
122
123- Execute a loop with marker enabled, with var args probe, format string
124 Data is copied by the probe. This is a 6 bytes string to decode.
125NR_LOOPS : 100000
126time delta (cycles): 9622117
127cycles per loop : 96.22
128additional cycles per loop to dynamically parse arguments with a 6 bytes format
129string : 96.22-55.74=40.48
130
131- Execute a loop with marker enabled, with var args probe expecting arguments.
132 Data is copied by the probe. With preemption disabling. An empty "kprobe" is
133 connected to the probe.
134NR_LOOPS : 100000
135time delta (cycles): 423397455
136cycles per loop : 4233.97
137additional cycles per loop to execute the kprobe : 4233.97-55.74=4178.23
138
139
140* Assembly code
141
142The disassembly of the following marked function will be shown :
143
144static int my_open(struct inode *inode, struct file *file)
145{
146 MARK(subsys_mark1, "%d %p", 1, NULL);
147
148 return -EPERM;
149}
150
151
152- Optimized
153
154static int my_open(struct inode *inode, struct file *file)
155{
156 0: 55 push %ebp
157 1: 89 e5 mov %esp,%ebp
158 3: 83 ec 0c sub $0xc,%esp
159 MARK(subsys_mark1, "%d %p", 1, NULL);
160 6: b0 00 mov $0x0,%al <-- immediate load 0 in al
161 8: 84 c0 test %al,%al
162 a: 75 07 jne 13 <my_open+0x13>
163
164 return -EPERM;
165}
166 c: b8 ff ff ff ff mov $0xffffffff,%eax
167 11: c9 leave
168 12: c3 ret
169 13: b8 01 00 00 00 mov $0x1,%eax
170 18: e8 fc ff ff ff call 19 <my_open+0x19> <-- preempt_disable
171 1d: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp)
172 24: 00
173 25: c7 44 24 04 01 00 00 movl $0x1,0x4(%esp)
174 2c: 00
175 2d: c7 04 24 0d 00 00 00 movl $0xd,(%esp)
176 34: ff 15 74 10 00 00 call *0x1074 <-- function pointer
177 3a: b8 01 00 00 00 mov $0x1,%eax
178 3f: e8 fc ff ff ff call 40 <my_open+0x40> <-- preempt_enable
179 44: eb c6 jmp c <my_open+0xc>
180
181
182- Generic
183
184static int my_open(struct inode *inode, struct file *file)
185{
186 0: 55 push %ebp
187 1: 89 e5 mov %esp,%ebp
188 3: 83 ec 0c sub $0xc,%esp
189 MARK(subsys_mark1, "%d %p", 1, NULL);
190 6: 0f b6 05 20 10 00 00 movzbl 0x1020,%eax <-- memory load byte
191 d: 84 c0 test %al,%al
192 f: 75 07 jne 18 <my_open+0x18>
193
194 return -EPERM;
195}
196 11: b8 ff ff ff ff mov $0xffffffff,%eax
197 16: c9 leave
198 17: c3 ret
199 18: b8 01 00 00 00 mov $0x1,%eax
200 1d: e8 fc ff ff ff call 1e <my_open+0x1e> <-- preempt_disable
201 22: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp)
202 29: 00
203 2a: c7 44 24 04 01 00 00 movl $0x1,0x4(%esp)
204 31: 00
205 32: c7 04 24 0d 00 00 00 movl $0xd,(%esp)
206 39: ff 15 74 10 00 00 call *0x1074 <-- function pointer
207 3f: b8 01 00 00 00 mov $0x1,%eax
208 44: e8 fc ff ff ff call 45 <my_open+0x45> <-- preempt_enable
209 49: eb c6 jmp 11 <my_open+0x11>
210
211
212
213Here is the typical var arg probe that has been used in those tests. It saves
214the values expectes as parameters in global variables. The DO_MARK1_FORMAT
215define is used for probe registration to make sure that it will be connected
216with a marker that has a matching format string. Note that this checking is
217optional : the probe can register with a NULL format and afterward check itself
218the format string received in parameter dynamically.
219
220int value;
221void *ptr;
222
223#define DO_MARK1_FORMAT "%d %p"
224void do_mark1(const char *format, ...)
225{
226 va_list ap;
227
228 va_start(ap, format);
229 value = va_arg(ap, int);
230 ptr = va_arg(ap, void*);
231
232 va_end(ap);
233}
234
235
236Here is the disassembly of the probe :
237
238#define DO_MARK1_FORMAT "%d %p"
239void do_mark1(const char *format, ...)
240{
241 0: 55 push %ebp
242 1: 89 e5 mov %esp,%ebp
243 3: 83 ec 04 sub $0x4,%esp
244 va_list ap;
245
246 va_start(ap, format);
247 value = va_arg(ap, int);
248 6: 8b 45 0c mov 0xc(%ebp),%eax
249 9: a3 00 00 00 00 mov %eax,0x0
250 ptr = va_arg(ap, void*);
251 e: 8b 45 10 mov 0x10(%ebp),%eax
252 11: a3 00 00 00 00 mov %eax,0x0
253
254 va_end(ap);
255}
256 16: c9 leave
257 17: c3 ret
258
259
260* Size (x86)
261
262This is the size added by each marker to the memory image :
263
264- Optimized
265
266.text section : instructions
267Adds 6 bytes in the "likely" path.
268Adds 32 bytes in the "unlikely" path.
269.data section : r/w data
2700 byte
271.rodata.str1 : strings
272Length of the marker name
273.debug_str : strings (if loaded..)
274Length of the marker name + 7 bytes (__mark_)
275.markers
2768 bytes (2 pointers)
277.markers.c
27812 bytes (3 pointers)
279
280- Generic
281
282.text section : instructions
283Adds 11 bytes in the "likely" path.
284Adds 32 bytes in the "unlikely" path.
285.data section : r/w data
2861 byte (the activation flag)
287.rodata.str1 : strings
288Length of the marker name
289.debug_str : strings (if loaded..)
290Length of the marker name + 7 bytes (__mark_)
291.markers
2928 bytes (2 pointers)
293.markers.c
29412 bytes (3 pointers)
295
296
297* Macro-benchmarks
298
299Compiling a 2.6.17 kernel on a Pentium 4 3GHz, 1GB ram, cold cache.
300Running a 2.6.17 vanilla kernel :
301real 8m2.443s
302user 7m35.124s
303sys 0m34.950s
304
305Running a 2.6.17 kernel with lttng-0.6.0pre11 markers (no probe connected) :
306real 8m1.635s
307user 7m34.552s
308sys 0m36.298s
309
310--> 0.98 % speedup with markers
311
312Ping flood on loopback interface :
313Running a 2.6.17 vanilla kernel :
314136596 packets transmitted, 136596 packets received, 0% packet loss
315round-trip min/avg/max = 0.0/0.0/0.1 ms
316
317real 0m10.840s
318user 0m0.360s
319sys 0m10.485s
320
32112601 packets transmitted/s
322
323Running a 2.6.17 kernel with lttng-0.6.0pre11 markers (no probe connected) :
324108504 packets transmitted, 108504 packets received, 0% packet loss
325round-trip min/avg/max = 0.0/0.0/0.1 ms
326
327real 0m8.614s
328user 0m0.264s
329sys 0m8.353s
330
33112596 packets transmitted/s
332
333--> 0.03 % slowdown with markers
334
335
336Conclusion
337
338In an empty loop, the generic marker is faster than the optimized marker. This
339may be due to better performances of the movzbl instruction over the movb on the
340Pentium 4 architecture. However, when we execute a loop of 4kB copy, the impact
341of the movzbl becomes greater because it uses the memory bandwidth.
342
343The preemption disabling and call to a probe itself costs 48.11 cycles, almost
344as much as dynamically parsing the format string to get the variable arguments
345(40.48 cycles).
346
347There is almost no difference, on x86, between passing the arguments directly on
348the stack and using a variable argument list when its layout is known
349statically (0.89 cycles vs 2.74 cycles).
350
351The int3 approach for adding instrumentation dynamically saves the 0.074 cycle
352(typcal use, high memory usage) used by the optimized marker by adding the
353ability to insert a breakpoint at any location without any impact on the code
354when inactive. This breakpoint based approach is very useful to instrument core
355kernel code that has not been previously marked without need to recompile and
356reboot. We can therefore compare the case "without markers" to the null impact
357of an inactive int3 breakpoint.
358
359However, the performance impact for using a kprobe is non negligible when
360activated. Assuming that kprobes would have a mechanism to get the variables
361from the caller's stack, it would perform the same task in at least 4178.23
362cycles vs 55.74 for a marker and a probe (ratio : 75). While kprobes are very
363useful for the reason explained earlier, the high event rate paths in the kernel
364would clearly benefit from a marker mechanism when the are probed.
365
366Code size and memory footprints are smaller with the optimized version : 6
367bytes of code in the likely path compared to 11 bytes. The memory footprint of
368the optimized approach saves 4 bytes of data memory that would otherwise have to
369stay in cache.
370
371On the macro-benchmark side, no significant difference in performance has been
372found between the vanilla kernel and a kernel "marked" with the standard LTTng
373instrumentation.
374
375
This page took 0.024298 seconds and 4 git commands to generate.