* Microbenchmarks

Use timestamp counter to calculate the time spent, with interrupts disabled.
Machine : Pentium 4 3GHz
Fully preemptible kernel
marker : MARK(subsys_mark1, "%d %p", 1, NULL);
Linux Kernel Markers 0.19

* Execute an empty loop

- Without marker
NR_LOOPS : 10000000
time delta (cycles): 15026497
cycles per loop : 1.50

- i386 "optimized" : immediate value, test and predicted branch
  (non connected marker)
NR_LOOPS : 10000000
time delta (cycles): 40031640
cycles per loop : 4.00
cycles per loop for marker : 2.50

- i386 "generic" : load, test and predicted branch
  (non connected marker)
NR_LOOPS : 10000000
time delta (cycles): 26697878
cycles per loop : 2.67
cycles per loop for marker : 1.17

* Execute a loop of memcpy 4096 bytes

- Without marker
NR_LOOPS : 10000
time delta (cycles): 12981555
cycles per loop : 1298.16

- i386 "optimized" : immediate value, test and predicted branch
  (non connected marker)
NR_LOOPS : 10000
time delta (cycles): 12982290
cycles per loop : 1298.23
cycles per loop for marker : 0.074

- i386 "generic" : load, test and predicted branch
  (non connected marker)
NR_LOOPS : 10000
time delta (cycles): 13002788
cycles per loop : 1300.28
cycles per loop for marker : 2.123

The following tests are done with the "optimized" markers only

Execute a loop with a marker enabled, with an empty probe.
NR_LOOPS : 100000
time delta (cycles): 5210587
cycles per loop : 52.11
cycles per loop for empty probe : 52.11-4.00=48.11

Execute a loop with marker enabled, with i386 direct argument passing.
NR_LOOPS : 100000
time delta (cycles): 5299837
cycles per loop : 53.00
cycles per loop to get arguments in probe (from stack) on x86 : 53.00-52.11=0.89

Execute a loop with marker enabled, with var args probe.
NR_LOOPS : 100000
time delta (cycles): 5574300
cycles per loop : 55.74
cycles per loop to get expected variable arguments on x86 : 55.74-53.00=2.74

Execute a loop with marker enabled, with var args probe, format string processing.
NR_LOOPS : 100000
time delta (cycles): 9622117
cycles per loop : 96.22
cycles per loop to dynamically parse arguments with format string : 96.22-55.74=40.48

* Assembly code

- Optimized

static int my_open(struct inode *inode, struct file *file)
{
   0:   55                      push   %ebp
   1:   89 e5                   mov    %esp,%ebp
   3:   83 ec 0c                sub    $0xc,%esp
        MARK(subsys_mark1, "%d %p", 1, NULL);
   6:   b0 00                   mov    $0x0,%al
   8:   84 c0                   test   %al,%al
   a:   75 07                   jne    13
        return -EPERM;
}
   c:   b8 ff ff ff ff          mov    $0xffffffff,%eax
  11:   c9                      leave
  12:   c3                      ret
  13:   b8 01 00 00 00          mov    $0x1,%eax
  18:   e8 fc ff ff ff          call   19
  1d:   c7 44 24 08 00 00 00    movl   $0x0,0x8(%esp)
  24:   00
  25:   c7 44 24 04 01 00 00    movl   $0x1,0x4(%esp)
  2c:   00
  2d:   c7 04 24 0d 00 00 00    movl   $0xd,(%esp)
  34:   ff 15 74 10 00 00       call   *0x1074
  3a:   b8 01 00 00 00          mov    $0x1,%eax
  3f:   e8 fc ff ff ff          call   40
  44:   eb c6                   jmp    c

- Generic

static int my_open(struct inode *inode, struct file *file)
{
   0:   55                      push   %ebp
   1:   89 e5                   mov    %esp,%ebp
   3:   83 ec 0c                sub    $0xc,%esp
        MARK(subsys_mark1, "%d %p", 1, NULL);
   6:   0f b6 05 20 10 00 00    movzbl 0x1020,%eax
   d:   84 c0                   test   %al,%al
   f:   75 07                   jne    18
        return -EPERM;
}
  11:   b8 ff ff ff ff          mov    $0xffffffff,%eax
  16:   c9                      leave
  17:   c3                      ret
  18:   b8 01 00 00 00          mov    $0x1,%eax
  1d:   e8 fc ff ff ff          call   1e
  22:   c7 44 24 08 00 00 00    movl   $0x0,0x8(%esp)
  29:   00
  2a:   c7 44 24 04 01 00 00    movl   $0x1,0x4(%esp)
  31:   00
  32:   c7 04 24 0d 00 00 00    movl   $0xd,(%esp)
  39:   ff 15 74 10 00 00       call   *0x1074
  3f:   b8 01 00 00 00          mov    $0x1,%eax
  44:   e8 fc ff ff ff          call   45
  49:   eb c6                   jmp    11

* Size (x86)

- Optimized
Adds 6 bytes in the "likely" path.
Adds 32 bytes in the "unlikely" path.

- Generic
Adds 11 bytes in the "likely" path.
Adds 32 bytes in the "unlikely" path.

* Conclusion

In an empty loop, the generic marker is faster than the optimized marker
(1.17 vs 2.50 cycles per loop). This may be due to better performance of the
movzbl instruction over the movb on the Pentium 4 architecture. However, when
we execute a loop of 4kB copy, the impact of the movzbl becomes greater
(2.123 vs 0.074 cycles per loop) because it uses the memory bandwidth.
Disabling preemption and calling an empty probe costs 48.11 cycles, slightly
more than dynamically parsing the format string to get the variable arguments
(40.48 cycles).

There is almost no difference, on x86, between passing the arguments directly
on the stack and using a variable argument list when its layout is known
statically (0.89 cycles vs 2.74 cycles).
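
The "cycles per loop for marker" figures quoted above are differences of two
averages: the time delta divided by NR_LOOPS, with and without the marker. The
following is a minimal user-space sketch of that arithmetic only, not the
benchmark module itself. The struct and function names (sample,
cycles_per_loop, marker_overhead, print_report) are hypothetical and chosen for
illustration; the raw numbers are copied from the measurements above.

```c
#include <assert.h>
#include <stdio.h>

/* One measurement: TSC delta accumulated over nr_loops iterations. */
struct sample {
	unsigned long long delta_cycles;
	unsigned long long nr_loops;
};

/* Raw figures copied from the report above. */
static const struct sample empty_base      = { 15026497ULL, 10000000ULL };
static const struct sample empty_optimized = { 40031640ULL, 10000000ULL };
static const struct sample empty_generic   = { 26697878ULL, 10000000ULL };
static const struct sample memcpy_base      = { 12981555ULL, 10000ULL };
static const struct sample memcpy_optimized = { 12982290ULL, 10000ULL };
static const struct sample memcpy_generic   = { 13002788ULL, 10000ULL };

/* Average cost of one loop iteration, in cycles. */
static double cycles_per_loop(const struct sample *s)
{
	assert(s->nr_loops > 0);
	return (double)s->delta_cycles / (double)s->nr_loops;
}

/* Extra cycles per iteration attributed to the marker. */
static double marker_overhead(const struct sample *with_marker,
			      const struct sample *base)
{
	return cycles_per_loop(with_marker) - cycles_per_loop(base);
}

/* Print the derived per-marker costs for both loop bodies. */
void print_report(void)
{
	printf("empty loop, optimized : %.2f cycles per marker\n",
	       marker_overhead(&empty_optimized, &empty_base));
	printf("empty loop, generic   : %.2f cycles per marker\n",
	       marker_overhead(&empty_generic, &empty_base));
	printf("memcpy 4096, optimized: %.3f cycles per marker\n",
	       marker_overhead(&memcpy_optimized, &memcpy_base));
	printf("memcpy 4096, generic  : %.3f cycles per marker\n",
	       marker_overhead(&memcpy_generic, &memcpy_base));
}
```

Calling print_report() from a main() function prints the four derived costs.
Because the report subtracts already-rounded averages, the last printed digit
may differ slightly from the figures quoted in the text.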