tests/markers/markers-microbench-0.2.txt

   1
   2
   3 * Microbenchmarks
   4
   5 Use timestamp counter to calculate the time spent, with interrupts disabled.
   6 Machine : Pentium 4 3GHz, 1GB ram
   7 Fully preemptible kernel
   8 Linux Kernel Markers 0.19
   9 Kernel : Linux 2.6.17
  10
  11 marker : MARK(subsys_mark1, "%d %p", 1, NULL);
  12
  13 This marker, with two elements (integer and pointer) have been chosen because it
  14 is representative of high volume events. For instance, a trap entry event logs a
  15 trap_id (long) and an address (pointer). The same applies to system calls, where
  16 a system call entry event logs both the ID of the system call and the address of
  17 the caller.
  18
  19
  20 * Execute an empty loop
  21
  22 - Without marker
  23 NR_LOOPS : 10000000
  24 time delta (cycles): 15026497
  25 cycles per loop : 1.50
  26
  27 - i386 "optimized" : immediate value, test and predicted branch
  28   (non connected marker)
  29 NR_LOOPS : 10000000
  30 time delta (cycles): 40031640
  31 cycles per loop : 4.00
  32 cycles per loop for marker : 4.00-1.50=2.50
  33
  34 - i386 "generic" : load, test and predicted branch
  35   (non connected marker)
  36 NR_LOOPS : 10000000
  37 time delta (cycles): 26697878
  38 cycles per loop : 2.67
  39 cycles per loop for marker : 2.67-1.50=1.17
  40
  41
  42 * Execute a loop of memcpy 4096 bytes
  43
  44 This test has been done to show the impact of markers on a system where the
  45 memory is already used, which is more representative of a running kernel.
  46
  47 - Without marker
  48 NR_LOOPS : 10000
  49 time delta (cycles): 12981555
  50 cycles per loop : 1298.16
  51
  52 - i386 "optimized" : immediate value, test and predicted branch
  53   (non connected marker)
  54 NR_LOOPS : 10000
  55 time delta (cycles): 12982290
  56 cycles per loop : 1298.23
  57 cycles per loop for marker : 1298.23-1298.16=0.074
  58
  59 - i386 "generic" : load, test and predicted branch
  60   (non connected marker)
  61 NR_LOOPS : 10000
  62 time delta (cycles): 13002788
  63 cycles per loop : 1300.28
  64 cycles per loop for marker : 1300.28-1298.16=2.123
  65
  66
  67 The following tests are done with the "optimized" markers only
  68
  69 - Execute a loop with marker enabled, with i386 "fastcall" register argument
  70   setup, probe empty. With preemption disabling.
  71 NR_LOOPS : 100000
  72 time delta (cycles): 4407608
  73 cycles per loop : 44.08
  74 cycles per loop to disable preemption and setup arguments in registers :
  75 44.08-4.00=40.08
  76
  77 - Execute a loop with a marker enabled, with an empty probe. Var args argument
  78   setup, probe empty. With preemption disabling.
  79 NR_LOOPS : 100000
  80 time delta (cycles): 5210587
  81 cycles per loop : 52.11
  82 additional cycles per loop to setup var args : 52.11-44.08=8.03
  83
  84 - Execute a loop with a marker enabled, with an empty probe. Var args argument
  85   setup, probe empty. No preemption disabling.
  86 NR_LOOPS : 100000
  87 time delta (cycles): 3363450
  88 cycles per loop : 33.63
  89 cycles per loop to disable preemption : 44.08-33.63=10.45
  90
  91 - Execute a loop with marker enabled, with i386 "asmlinkage" arguments expected.
  92   Data is copied by the probe. With preemption disabling.
  93 NR_LOOPS : 100000
  94 time delta (cycles): 5299837
  95 cycles per loop : 53.00
  96 additional cycles per loop to get arguments in probe (from stack) on x86 :
  97 53.00-52.11=0.89
  98
  99 - Execute a loop with marker enabled, with var args probe expecting arguments.
 100   Data is copied by the probe. With preemption disabling.
 101 NR_LOOPS : 100000
 102 time delta (cycles): 5574300
 103 cycles per loop : 55.74
 104 additional cycles per loop to get expected variable arguments on x86 :
 105   55.74-53.00=2.74
 106
 107 - Execute a loop with marker enabled, with var args probe, format string
 108   Data is copied by the probe. This is a 6 bytes string to decode.
 109 processing.
 110 NR_LOOPS : 100000
 111 time delta (cycles): 9622117
 112 cycles per loop : 96.22
 113 additional cycles per loop to dynamically parse arguments with a 6 bytes format
 114 string :
 115 96.22-55.74=40.48
 116
 117
 118 * Assembly code
 119
 120 The disassembly of the following marked function will be shown :
 121
 122 static int my_open(struct inode *inode, struct file *file)
 123 {
 124         MARK(subsys_mark1, "%d %p", 1, NULL);
 125
 126         return -EPERM;
 127 }
 128
 129
 130 - Optimized
 131
 132 static int my_open(struct inode *inode, struct file *file)
 133 {
 134    0:   55                      push   %ebp
 135    1:   89 e5                   mov    %esp,%ebp
 136    3:   83 ec 0c                sub    $0xc,%esp
 137         MARK(subsys_mark1, "%d %p", 1, NULL);
 138    6:   b0 00                   mov    $0x0,%al <-- immediate load 0 in al
 139    8:   84 c0                   test   %al,%al
 140    a:   75 07                   jne    13 <my_open+0x13>
 141
 142         return -EPERM;
 143 }
 144    c:   b8 ff ff ff ff          mov    $0xffffffff,%eax
 145   11:   c9                      leave
 146   12:   c3                      ret
 147   13:   b8 01 00 00 00          mov    $0x1,%eax
 148   18:   e8 fc ff ff ff          call   19 <my_open+0x19> <-- preempt_disable
 149   1d:   c7 44 24 08 00 00 00    movl   $0x0,0x8(%esp)
 150   24:   00
 151   25:   c7 44 24 04 01 00 00    movl   $0x1,0x4(%esp)
 152   2c:   00
 153   2d:   c7 04 24 0d 00 00 00    movl   $0xd,(%esp)
 154   34:   ff 15 74 10 00 00       call   *0x1074 <-- function pointer
 155   3a:   b8 01 00 00 00          mov    $0x1,%eax
 156   3f:   e8 fc ff ff ff          call   40 <my_open+0x40> <-- preempt_enable
 157   44:   eb c6                   jmp    c <my_open+0xc>
 158
 159
 160 - Generic
 161
 162 static int my_open(struct inode *inode, struct file *file)
 163 {
 164    0:   55                      push   %ebp
 165    1:   89 e5                   mov    %esp,%ebp
 166    3:   83 ec 0c                sub    $0xc,%esp
 167         MARK(subsys_mark1, "%d %p", 1, NULL);
 168    6:   0f b6 05 20 10 00 00    movzbl 0x1020,%eax <-- memory load byte
 169    d:   84 c0                   test   %al,%al
 170    f:   75 07                   jne    18 <my_open+0x18>
 171
 172         return -EPERM;
 173 }
 174   11:   b8 ff ff ff ff          mov    $0xffffffff,%eax
 175   16:   c9                      leave
 176   17:   c3                      ret
 177   18:   b8 01 00 00 00          mov    $0x1,%eax
 178   1d:   e8 fc ff ff ff          call   1e <my_open+0x1e> <-- preempt_disable
 179   22:   c7 44 24 08 00 00 00    movl   $0x0,0x8(%esp)
 180   29:   00
 181   2a:   c7 44 24 04 01 00 00    movl   $0x1,0x4(%esp)
 182   31:   00
 183   32:   c7 04 24 0d 00 00 00    movl   $0xd,(%esp)
 184   39:   ff 15 74 10 00 00       call   *0x1074 <-- function pointer
 185   3f:   b8 01 00 00 00          mov    $0x1,%eax
 186   44:   e8 fc ff ff ff          call   45 <my_open+0x45> <-- preempt_enable
 187   49:   eb c6                   jmp    11 <my_open+0x11>
 188
 189
 190
 191 Here is the typical var arg probe that has been used in those tests. It saves
 192 the values expectes as parameters in global variables. The DO_MARK1_FORMAT
 193 define is used for probe registration to make sure that it will be connected
 194 with a marker that has a matching format string. Note that this checking is
 195 optional : the probe can register with a NULL format and afterward check itself
 196 the format string received in parameter dynamically.
 197
 198 int value;
 199 void *ptr;
 200
 201 #define DO_MARK1_FORMAT "%d %p"
 202 void do_mark1(const char *format, ...)
 203 {
 204         va_list ap;
 205
 206         va_start(ap, format);
 207         value = va_arg(ap, int);
 208         ptr = va_arg(ap, void*);
 209
 210         va_end(ap);
 211 }
 212
 213
 214 Here is the disassembly of the probe :
 215
 216 #define DO_MARK1_FORMAT "%d %p"
 217 void do_mark1(const char *format, ...)
 218 {
 219    0:   55                      push   %ebp
 220    1:   89 e5                   mov    %esp,%ebp
 221    3:   83 ec 04                sub    $0x4,%esp
 222         va_list ap;
 223
 224         va_start(ap, format);
 225         value = va_arg(ap, int);
 226    6:   8b 45 0c                mov    0xc(%ebp),%eax
 227    9:   a3 00 00 00 00          mov    %eax,0x0
 228         ptr = va_arg(ap, void*);
 229    e:   8b 45 10                mov    0x10(%ebp),%eax
 230   11:   a3 00 00 00 00          mov    %eax,0x0
 231
 232         va_end(ap);
 233 }
 234   16:   c9                      leave
 235   17:   c3                      ret
 236
 237
 238 * Size (x86)
 239
 240 This is the size added by each marker to the memory image :
 241
 242 - Optimized
 243
 244 .text section : instructions
 245 Adds 6 bytes in the "likely" path.
 246 Adds 32 bytes in the "unlikely" path.
 247 .data section : r/w data
 248 0 byte
 249 .rodata.str1 : strings
 250 Length of the marker name
 251 .debug_str : strings (if loaded..)
 252 Length of the marker name + 7 bytes (__mark_)
 253 .markers
 254 8 bytes (2 pointers)
 255 .markers.c
 256 12 bytes (3 pointers)
 257
 258 - Generic
 259
 260 .text section : instructions
 261 Adds 11 bytes in the "likely" path.
 262 Adds 32 bytes in the "unlikely" path.
 263 .data section : r/w data
 264 1 byte (the activation flag)
 265 .rodata.str1 : strings
 266 Length of the marker name
 267 .debug_str : strings (if loaded..)
 268 Length of the marker name + 7 bytes (__mark_)
 269 .markers
 270 8 bytes (2 pointers)
 271 .markers.c
 272 12 bytes (3 pointers)
 273
 274
 275 * Macrobenchmarks
 276
 277 Compiling a 2.6.17 kernel on a Pentium 4 3GHz, 1GB ram, cold cache.
 278 Running a 2.6.17 vanilla kernel :
 279 real    8m2.443s
 280 user    7m35.124s
 281 sys     0m34.950s
 282
 283 Running a 2.6.17 kernel with lttng-0.6.0pre11 markers (no probe connected) :
 284 real    8m1.635s
 285 user    7m34.552s
 286 sys     0m36.298s
 287
 288 Ping flood on loopback interface :
 289 Running a 2.6.17 vanilla kernel :
 290 136596 packets transmitted, 136596 packets received, 0% packet loss
 291 round-trip min/avg/max = 0.0/0.0/0.1 ms
 292
 293 real    0m10.840s
 294 user    0m0.360s
 295 sys     0m10.485s
 296
 297 12601 packets transmitted/s
 298
 299 Running a 2.6.17 kernel with lttng-0.6.0pre11 markers (no probe connected) :
 300 108504 packets transmitted, 108504 packets received, 0% packet loss
 301 round-trip min/avg/max = 0.0/0.0/0.1 ms
 302
 303 real    0m8.614s
 304 user    0m0.264s
 305 sys     0m8.353s
 306
 307 12596 packets transmitted/s
 308
 309
 310
 311 Conclusion
 312
 313 In an empty loop, the generic marker is faster than the optimized marker. This
 314 may be due to better performances of the movzbl instruction over the movb on the
 315 Pentium 4 architecture. However, when we execute a loop of 4kB copy, the impact
 316 of the movzbl becomes greater because it uses the memory bandwidth.
 317
 318 The preemption disabling and call to a probe itself costs 48.11 cycles, almost
 319 as much as dynamically parsing the format string to get the variable arguments
 320 (40.48 cycles).
 321
 322 There is almost no difference, on x86, between passing the arguments directly on
 323 the stack and using a variable argument list when its layout is known
 324 statically (0.89 cycles vs 2.74 cycles).
 325
 326 The int3 approach for adding instrumentation dynamically saves the 0.074 cycle
 327 (typcal use, high memory usage) used by the optimized marker by adding the
 328 ability to insert a breakpoint at any location without any impact on the code
 329 when inactive. This breakpoint based approach is very useful to instrument core
 330 kernel code that has not been previously marked without need to recompile and
 331 reboot. We can therefore compare the case "without markers" to the null impact
 332 of the int3 breakpoint based approach when inactive.
 333
 334
 335
 336
 337
 338
 339
 340
 341
 342
 343