tests/markers/markers-microbench-0.2.txt

   1
   2
   3 Hi,
   4
   5 Following the huge discussion thread about tracing/static vs dynamic
   6 instrumentation/markers, a consensus seems to emerge about the need for a
   7 marker system in the Linux kernel. The main issues this mechanism addresses are:
   8
   9 - Identify code important to runtime data collection/analysis tools in tree so
  10   that it follows the code changes naturally.
  11 - Be visually appealing to kernel developers.
  12 - Have a very low impact on the system performance.
  13 - Integrate in the standard kernel infrastructure : use C and loadable modules.
  14
  15 The time has come for some performance measurements of the Linux Kernel Markers,
  16 which follows.
  17
  18
  19 * Micro-benchmarks
  20
  21 Use timestamp counter to calculate the time spent, with interrupts disabled.
  22 Machine : Pentium 4 3GHz, 1GB ram
  23 Fully preemptible kernel
  24 Linux Kernel Markers 0.19
  25 Kernel : Linux 2.6.17
  26
  27 marker : MARK(subsys_mark1, "%d %p", 1, NULL);
  28
  29 This marker, with two elements (integer and pointer) have been chosen because it
  30 is representative of high volume events. For instance, a trap entry event logs a
  31 trap_id (long) and an address (pointer). The same applies to system calls, where
  32 a system call entry event logs both the ID of the system call and the address of
  33 the caller.
  34
  35
  36 * Execute an empty loop
  37
  38 - Without marker
  39 NR_LOOPS : 10000000
  40 time delta (cycles): 15026497
  41 cycles per loop : 1.50
  42
  43 - i386 "optimized" : immediate value, test and predicted branch
  44   (non connected marker)
  45 NR_LOOPS : 10000000
  46 time delta (cycles): 40031640
  47 cycles per loop : 4.00
  48 cycles per loop for marker : 4.00-1.50=2.50
  49
  50 - i386 "generic" : load, test and predicted branch
  51   (non connected marker)
  52 NR_LOOPS : 10000000
  53 time delta (cycles): 26697878
  54 cycles per loop : 2.67
  55 cycles per loop for marker : 2.67-1.50=1.17
  56
  57
  58 * Execute a loop of memcpy 4096 bytes
  59
  60 This test has been done to show the impact of markers on a system where the
  61 memory is already used, which is more representative of a running kernel.
  62
  63 - Without marker
  64 NR_LOOPS : 10000
  65 time delta (cycles): 12981555
  66 cycles per loop : 1298.16
  67
  68 - i386 "optimized" : immediate value, test and predicted branch
  69   (non connected marker)
  70 NR_LOOPS : 10000
  71 time delta (cycles): 12982290
  72 cycles per loop : 1298.23
  73 cycles per loop for marker : 1298.23-1298.16=0.074
  74
  75 - i386 "generic" : load, test and predicted branch
  76   (non connected marker)
  77 NR_LOOPS : 10000
  78 time delta (cycles): 13002788
  79 cycles per loop : 1300.28
  80 cycles per loop for marker : 1300.28-1298.16=2.123
  81
  82
  83 The following tests are done with the "optimized" markers only
  84
  85 - Execute a loop with marker enabled, with i386 "fastcall" register argument
  86   setup, probe empty. With preemption disabling.
  87 NR_LOOPS : 100000
  88 time delta (cycles): 4407608
  89 cycles per loop : 44.08
  90 cycles per loop to disable preemption and setup arguments in registers :
  91 44.08-4.00=40.08
  92
  93 - Execute a loop with a marker enabled, with an empty probe. Var args argument
  94   setup, probe empty. With preemption disabling.
  95 NR_LOOPS : 100000
  96 time delta (cycles): 5210587
  97 cycles per loop : 52.11
  98 additional cycles per loop to setup var args : 52.11-44.08=8.03
  99
 100 - Execute a loop with a marker enabled, with an empty probe. Var args argument
 101   setup, probe empty. No preemption disabling.
 102 NR_LOOPS : 100000
 103 time delta (cycles): 3363450
 104 cycles per loop : 33.63
 105 cycles per loop to disable preemption : 44.08-33.63=10.45
 106
 107 - Execute a loop with marker enabled, with i386 "asmlinkage" arguments expected.
 108   Data is copied by the probe. With preemption disabling.
 109 NR_LOOPS : 100000
 110 time delta (cycles): 5299837
 111 cycles per loop : 53.00
 112 additional cycles per loop to get arguments in probe (from stack) on x86 :
 113 53.00-52.11=0.89
 114
 115 - Execute a loop with marker enabled, with var args probe expecting arguments.
 116   Data is copied by the probe. With preemption disabling.
 117 NR_LOOPS : 100000
 118 time delta (cycles): 5574300
 119 cycles per loop : 55.74
 120 additional cycles per loop to get expected variable arguments on x86 :
 121   55.74-53.00=2.74
 122
 123 - Execute a loop with marker enabled, with var args probe, format string
 124   Data is copied by the probe. This is a 6 bytes string to decode.
 125 NR_LOOPS : 100000
 126 time delta (cycles): 9622117
 127 cycles per loop : 96.22
 128 additional cycles per loop to dynamically parse arguments with a 6 bytes format
 129 string : 96.22-55.74=40.48
 130
 131 - Execute a loop with marker enabled, with var args probe expecting arguments.
 132   Data is copied by the probe. With preemption disabling. An empty "kprobe" is
 133   connected to the probe.
 134 NR_LOOPS : 100000
 135 time delta (cycles): 423397455
 136 cycles per loop : 4233.97
 137 additional cycles per loop to execute the kprobe : 4233.97-55.74=4178.23
 138
 139
 140 * Assembly code
 141
 142 The disassembly of the following marked function will be shown :
 143
 144 static int my_open(struct inode *inode, struct file *file)
 145 {
 146         MARK(subsys_mark1, "%d %p", 1, NULL);
 147
 148         return -EPERM;
 149 }
 150
 151
 152 - Optimized
 153
 154 static int my_open(struct inode *inode, struct file *file)
 155 {
 156    0:   55                      push   %ebp
 157    1:   89 e5                   mov    %esp,%ebp
 158    3:   83 ec 0c                sub    $0xc,%esp
 159         MARK(subsys_mark1, "%d %p", 1, NULL);
 160    6:   b0 00                   mov    $0x0,%al <-- immediate load 0 in al
 161    8:   84 c0                   test   %al,%al
 162    a:   75 07                   jne    13 <my_open+0x13>
 163
 164         return -EPERM;
 165 }
 166    c:   b8 ff ff ff ff          mov    $0xffffffff,%eax
 167   11:   c9                      leave
 168   12:   c3                      ret
 169   13:   b8 01 00 00 00          mov    $0x1,%eax
 170   18:   e8 fc ff ff ff          call   19 <my_open+0x19> <-- preempt_disable
 171   1d:   c7 44 24 08 00 00 00    movl   $0x0,0x8(%esp)
 172   24:   00
 173   25:   c7 44 24 04 01 00 00    movl   $0x1,0x4(%esp)
 174   2c:   00
 175   2d:   c7 04 24 0d 00 00 00    movl   $0xd,(%esp)
 176   34:   ff 15 74 10 00 00       call   *0x1074 <-- function pointer
 177   3a:   b8 01 00 00 00          mov    $0x1,%eax
 178   3f:   e8 fc ff ff ff          call   40 <my_open+0x40> <-- preempt_enable
 179   44:   eb c6                   jmp    c <my_open+0xc>
 180
 181
 182 - Generic
 183
 184 static int my_open(struct inode *inode, struct file *file)
 185 {
 186    0:   55                      push   %ebp
 187    1:   89 e5                   mov    %esp,%ebp
 188    3:   83 ec 0c                sub    $0xc,%esp
 189         MARK(subsys_mark1, "%d %p", 1, NULL);
 190    6:   0f b6 05 20 10 00 00    movzbl 0x1020,%eax <-- memory load byte
 191    d:   84 c0                   test   %al,%al
 192    f:   75 07                   jne    18 <my_open+0x18>
 193
 194         return -EPERM;
 195 }
 196   11:   b8 ff ff ff ff          mov    $0xffffffff,%eax
 197   16:   c9                      leave
 198   17:   c3                      ret
 199   18:   b8 01 00 00 00          mov    $0x1,%eax
 200   1d:   e8 fc ff ff ff          call   1e <my_open+0x1e> <-- preempt_disable
 201   22:   c7 44 24 08 00 00 00    movl   $0x0,0x8(%esp)
 202   29:   00
 203   2a:   c7 44 24 04 01 00 00    movl   $0x1,0x4(%esp)
 204   31:   00
 205   32:   c7 04 24 0d 00 00 00    movl   $0xd,(%esp)
 206   39:   ff 15 74 10 00 00       call   *0x1074 <-- function pointer
 207   3f:   b8 01 00 00 00          mov    $0x1,%eax
 208   44:   e8 fc ff ff ff          call   45 <my_open+0x45> <-- preempt_enable
 209   49:   eb c6                   jmp    11 <my_open+0x11>
 210
 211
 212
 213 Here is the typical var arg probe that has been used in those tests. It saves
 214 the values expectes as parameters in global variables. The DO_MARK1_FORMAT
 215 define is used for probe registration to make sure that it will be connected
 216 with a marker that has a matching format string. Note that this checking is
 217 optional : the probe can register with a NULL format and afterward check itself
 218 the format string received in parameter dynamically.
 219
 220 int value;
 221 void *ptr;
 222
 223 #define DO_MARK1_FORMAT "%d %p"
 224 void do_mark1(const char *format, ...)
 225 {
 226         va_list ap;
 227
 228         va_start(ap, format);
 229         value = va_arg(ap, int);
 230         ptr = va_arg(ap, void*);
 231
 232         va_end(ap);
 233 }
 234
 235
 236 Here is the disassembly of the probe :
 237
 238 #define DO_MARK1_FORMAT "%d %p"
 239 void do_mark1(const char *format, ...)
 240 {
 241    0:   55                      push   %ebp
 242    1:   89 e5                   mov    %esp,%ebp
 243    3:   83 ec 04                sub    $0x4,%esp
 244         va_list ap;
 245
 246         va_start(ap, format);
 247         value = va_arg(ap, int);
 248    6:   8b 45 0c                mov    0xc(%ebp),%eax
 249    9:   a3 00 00 00 00          mov    %eax,0x0
 250         ptr = va_arg(ap, void*);
 251    e:   8b 45 10                mov    0x10(%ebp),%eax
 252   11:   a3 00 00 00 00          mov    %eax,0x0
 253
 254         va_end(ap);
 255 }
 256   16:   c9                      leave
 257   17:   c3                      ret
 258
 259
 260 * Size (x86)
 261
 262 This is the size added by each marker to the memory image :
 263
 264 - Optimized
 265
 266 .text section : instructions
 267 Adds 6 bytes in the "likely" path.
 268 Adds 32 bytes in the "unlikely" path.
 269 .data section : r/w data
 270 0 byte
 271 .rodata.str1 : strings
 272 Length of the marker name
 273 .debug_str : strings (if loaded..)
 274 Length of the marker name + 7 bytes (__mark_)
 275 .markers
 276 8 bytes (2 pointers)
 277 .markers.c
 278 12 bytes (3 pointers)
 279
 280 - Generic
 281
 282 .text section : instructions
 283 Adds 11 bytes in the "likely" path.
 284 Adds 32 bytes in the "unlikely" path.
 285 .data section : r/w data
 286 1 byte (the activation flag)
 287 .rodata.str1 : strings
 288 Length of the marker name
 289 .debug_str : strings (if loaded..)
 290 Length of the marker name + 7 bytes (__mark_)
 291 .markers
 292 8 bytes (2 pointers)
 293 .markers.c
 294 12 bytes (3 pointers)
 295
 296
 297 * Macro-benchmarks
 298
 299 Compiling a 2.6.17 kernel on a Pentium 4 3GHz, 1GB ram, cold cache.
 300 Running a 2.6.17 vanilla kernel :
 301 real    8m2.443s
 302 user    7m35.124s
 303 sys     0m34.950s
 304
 305 Running a 2.6.17 kernel with lttng-0.6.0pre11 markers (no probe connected) :
 306 real    8m1.635s
 307 user    7m34.552s
 308 sys     0m36.298s
 309
 310 --> 0.98 % speedup with markers
 311
 312 Ping flood on loopback interface :
 313 Running a 2.6.17 vanilla kernel :
 314 136596 packets transmitted, 136596 packets received, 0% packet loss
 315 round-trip min/avg/max = 0.0/0.0/0.1 ms
 316
 317 real    0m10.840s
 318 user    0m0.360s
 319 sys     0m10.485s
 320
 321 12601 packets transmitted/s
 322
 323 Running a 2.6.17 kernel with lttng-0.6.0pre11 markers (no probe connected) :
 324 108504 packets transmitted, 108504 packets received, 0% packet loss
 325 round-trip min/avg/max = 0.0/0.0/0.1 ms
 326
 327 real    0m8.614s
 328 user    0m0.264s
 329 sys     0m8.353s
 330
 331 12596 packets transmitted/s
 332
 333 --> 0.03 % slowdown with markers
 334
 335
 336 Conclusion
 337
 338 In an empty loop, the generic marker is faster than the optimized marker. This
 339 may be due to better performances of the movzbl instruction over the movb on the
 340 Pentium 4 architecture. However, when we execute a loop of 4kB copy, the impact
 341 of the movzbl becomes greater because it uses the memory bandwidth.
 342
 343 The preemption disabling and call to a probe itself costs 48.11 cycles, almost
 344 as much as dynamically parsing the format string to get the variable arguments
 345 (40.48 cycles).
 346
 347 There is almost no difference, on x86, between passing the arguments directly on
 348 the stack and using a variable argument list when its layout is known
 349 statically (0.89 cycles vs 2.74 cycles).
 350
 351 The int3 approach for adding instrumentation dynamically saves the 0.074 cycle
 352 (typcal use, high memory usage) used by the optimized marker by adding the
 353 ability to insert a breakpoint at any location without any impact on the code
 354 when inactive. This breakpoint based approach is very useful to instrument core
 355 kernel code that has not been previously marked without need to recompile and
 356 reboot. We can therefore compare the case "without markers" to the null impact
 357 of an inactive int3 breakpoint.
 358
 359 However, the performance impact for using a kprobe is non negligible when
 360 activated. Assuming that kprobes would have a mechanism to get the variables
 361 from the caller's stack, it would perform the same task in at least 4178.23
 362 cycles vs 55.74 for a marker and a probe (ratio : 75). While kprobes are very
 363 useful for the reason explained earlier, the high event rate paths in the kernel
 364 would clearly benefit from a marker mechanism when the are probed.
 365
 366 Code size and memory footprints are smaller with the optimized version : 6
 367 bytes of code in the likely path compared to 11 bytes. The memory footprint of
 368 the optimized approach saves 4 bytes of data memory that would otherwise have to
 369 stay in cache.
 370
 371 On the macro-benchmark side, no significant difference in performance has been
 372 found between the vanilla kernel and a kernel "marked" with the standard LTTng
 373 instrumentation.
 374
 375