5 Use timestamp counter to calculate the time spent, with interrupts disabled.
6 Machine : Pentium 4 3GHz, 1GB ram
7 Fully preemptible kernel
8 Linux Kernel Markers 0.19
11 marker : MARK(subsys_mark1, "%d %p", 1, NULL);
13 This marker, with two elements (integer and pointer) have been chosen because it
14 is representative of high volume events. For instance, a trap entry event logs a
15 trap_id (long) and an address (pointer). The same applies to system calls, where
16 a system call entry event logs both the ID of the system call and the address of
20 * Execute an empty loop
24 time delta (cycles): 15026497
25 cycles per loop : 1.50
27 - i386 "optimized" : immediate value, test and predicted branch
28 (non connected marker)
30 time delta (cycles): 40031640
31 cycles per loop : 4.00
32 cycles per loop for marker : 4.00-1.50=2.50
34 - i386 "generic" : load, test and predicted branch
35 (non connected marker)
37 time delta (cycles): 26697878
38 cycles per loop : 2.67
39 cycles per loop for marker : 2.67-1.50=1.17
42 * Execute a loop of memcpy 4096 bytes
44 This test has been done to show the impact of markers on a system where the
45 memory is already used, which is more representative of a running kernel.
49 time delta (cycles): 12981555
50 cycles per loop : 1298.16
52 - i386 "optimized" : immediate value, test and predicted branch
53 (non connected marker)
55 time delta (cycles): 12982290
56 cycles per loop : 1298.23
57 cycles per loop for marker : 1298.23-1298.16=0.074
59 - i386 "generic" : load, test and predicted branch
60 (non connected marker)
62 time delta (cycles): 13002788
63 cycles per loop : 1300.28
64 cycles per loop for marker : 1300.28-1298.16=2.123
67 The following tests are done with the "optimized" markers only
69 - Execute a loop with marker enabled, with i386 "fastcall" register argument
70 setup, probe empty. With preemption disabling.
72 time delta (cycles): 4407608
73 cycles per loop : 44.08
74 cycles per loop to disable preemption and setup arguments in registers :
77 - Execute a loop with a marker enabled, with an empty probe. Var args argument
78 setup, probe empty. With preemption disabling.
80 time delta (cycles): 5210587
81 cycles per loop : 52.11
82 additional cycles per loop to setup var args : 52.11-44.08=8.03
84 - Execute a loop with a marker enabled, with an empty probe. Var args argument
85 setup, probe empty. No preemption disabling.
87 time delta (cycles): 3363450
88 cycles per loop : 33.63
89 cycles per loop to disable preemption : 44.08-33.63=10.45
91 - Execute a loop with marker enabled, with i386 "asmlinkage" arguments expected.
92 Data is copied by the probe. With preemption disabling.
94 time delta (cycles): 5299837
95 cycles per loop : 53.00
96 additional cycles per loop to get arguments in probe (from stack) on x86 :
99 - Execute a loop with marker enabled, with var args probe expecting arguments.
100 Data is copied by the probe. With preemption disabling.
102 time delta (cycles): 5574300
103 cycles per loop : 55.74
104 additional cycles per loop to get expected variable arguments on x86 :
107 - Execute a loop with marker enabled, with var args probe, format string
108 Data is copied by the probe. This is a 6 bytes string to decode.
111 time delta (cycles): 9622117
112 cycles per loop : 96.22
113 additional cycles per loop to dynamically parse arguments with a 6 bytes format
120 The disassembly of the following marked function will be shown :
122 static int my_open(struct inode *inode, struct file *file)
124 MARK(subsys_mark1, "%d %p", 1, NULL);
132 static int my_open(struct inode *inode, struct file *file)
135 1: 89 e5 mov %esp,%ebp
136 3: 83 ec 0c sub $0xc,%esp
137 MARK(subsys_mark1, "%d %p", 1, NULL);
138 6: b0 00 mov $0x0,%al <-- immediate load 0 in al
139 8: 84 c0 test %al,%al
140 a: 75 07 jne 13 <my_open+0x13>
144 c: b8 ff ff ff ff mov $0xffffffff,%eax
147 13: b8 01 00 00 00 mov $0x1,%eax
148 18: e8 fc ff ff ff call 19 <my_open+0x19> <-- preempt_disable
149 1d: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp)
151 25: c7 44 24 04 01 00 00 movl $0x1,0x4(%esp)
153 2d: c7 04 24 0d 00 00 00 movl $0xd,(%esp)
154 34: ff 15 74 10 00 00 call *0x1074 <-- function pointer
155 3a: b8 01 00 00 00 mov $0x1,%eax
156 3f: e8 fc ff ff ff call 40 <my_open+0x40> <-- preempt_enable
157 44: eb c6 jmp c <my_open+0xc>
162 static int my_open(struct inode *inode, struct file *file)
165 1: 89 e5 mov %esp,%ebp
166 3: 83 ec 0c sub $0xc,%esp
167 MARK(subsys_mark1, "%d %p", 1, NULL);
168 6: 0f b6 05 20 10 00 00 movzbl 0x1020,%eax <-- memory load byte
169 d: 84 c0 test %al,%al
170 f: 75 07 jne 18 <my_open+0x18>
174 11: b8 ff ff ff ff mov $0xffffffff,%eax
177 18: b8 01 00 00 00 mov $0x1,%eax
178 1d: e8 fc ff ff ff call 1e <my_open+0x1e> <-- preempt_disable
179 22: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp)
181 2a: c7 44 24 04 01 00 00 movl $0x1,0x4(%esp)
183 32: c7 04 24 0d 00 00 00 movl $0xd,(%esp)
184 39: ff 15 74 10 00 00 call *0x1074 <-- function pointer
185 3f: b8 01 00 00 00 mov $0x1,%eax
186 44: e8 fc ff ff ff call 45 <my_open+0x45> <-- preempt_enable
187 49: eb c6 jmp 11 <my_open+0x11>
191 Here is the typical var arg probe that has been used in those tests. It saves
192 the values expectes as parameters in global variables. The DO_MARK1_FORMAT
193 define is used for probe registration to make sure that it will be connected
194 with a marker that has a matching format string. Note that this checking is
195 optional : the probe can register with a NULL format and afterward check itself
196 the format string received in parameter dynamically.
201 #define DO_MARK1_FORMAT "%d %p"
202 void do_mark1(const char *format, ...)
206 va_start(ap, format);
207 value = va_arg(ap, int);
208 ptr = va_arg(ap, void*);
214 Here is the disassembly of the probe :
216 #define DO_MARK1_FORMAT "%d %p"
217 void do_mark1(const char *format, ...)
220 1: 89 e5 mov %esp,%ebp
221 3: 83 ec 04 sub $0x4,%esp
224 va_start(ap, format);
225 value = va_arg(ap, int);
226 6: 8b 45 0c mov 0xc(%ebp),%eax
227 9: a3 00 00 00 00 mov %eax,0x0
228 ptr = va_arg(ap, void*);
229 e: 8b 45 10 mov 0x10(%ebp),%eax
230 11: a3 00 00 00 00 mov %eax,0x0
240 This is the size added by each marker to the memory image :
244 .text section : instructions
245 Adds 6 bytes in the "likely" path.
246 Adds 32 bytes in the "unlikely" path.
247 .data section : r/w data
249 .rodata.str1 : strings
250 Length of the marker name
251 .debug_str : strings (if loaded..)
252 Length of the marker name + 7 bytes (__mark_)
256 12 bytes (3 pointers)
260 .text section : instructions
261 Adds 11 bytes in the "likely" path.
262 Adds 32 bytes in the "unlikely" path.
263 .data section : r/w data
264 1 byte (the activation flag)
265 .rodata.str1 : strings
266 Length of the marker name
267 .debug_str : strings (if loaded..)
268 Length of the marker name + 7 bytes (__mark_)
272 12 bytes (3 pointers)
277 Compiling a 2.6.17 kernel on a Pentium 4 3GHz, 1GB ram, cold cache.
278 Running a 2.6.17 vanilla kernel :
283 Running a 2.6.17 kernel with lttng-0.6.0pre11 markers (no probe connected) :
288 Ping flood on loopback interface :
289 Running a 2.6.17 vanilla kernel :
290 136596 packets transmitted, 136596 packets received, 0% packet loss
291 round-trip min/avg/max = 0.0/0.0/0.1 ms
297 12601 packets transmitted/s
299 Running a 2.6.17 kernel with lttng-0.6.0pre11 markers (no probe connected) :
300 108504 packets transmitted, 108504 packets received, 0% packet loss
301 round-trip min/avg/max = 0.0/0.0/0.1 ms
307 12596 packets transmitted/s
313 In an empty loop, the generic marker is faster than the optimized marker. This
314 may be due to better performances of the movzbl instruction over the movb on the
315 Pentium 4 architecture. However, when we execute a loop of 4kB copy, the impact
316 of the movzbl becomes greater because it uses the memory bandwidth.
318 The preemption disabling and call to a probe itself costs 48.11 cycles, almost
319 as much as dynamically parsing the format string to get the variable arguments
322 There is almost no difference, on x86, between passing the arguments directly on
323 the stack and using a variable argument list when its layout is known
324 statically (0.89 cycles vs 2.74 cycles).
326 The int3 approach for adding instrumentation dynamically saves the 0.074 cycle
327 (typcal use, high memory usage) used by the optimized marker by adding the
328 ability to insert a breakpoint at any location without any impact on the code
329 when inactive. This breakpoint based approach is very useful to instrument core
330 kernel code that has not been previously marked without need to recompile and
331 reboot. We can therefore compare the case "without markers" to the null impact
332 of the int3 breakpoint based approach when inactive.