lttv/doc/developer/obsolete/lttng-userspace-tracing.txt

Some thoughts about userspace tracing

Mathieu Desnoyers January 2006


* Goals

Fast and secure user space tracing.

Fast :

- 5000ns for a system call is too long. Writing an event directly to memory
        takes 220ns.
- Still, we can afford a system call for buffer switch, which occurs less often.
- No locking, no signal disabling. Disabling signals requires 2 system calls.
        Mutexes are implemented with a short spin lock, followed by a yield, which
        is yet another system call. In addition, we have no way to know on which
        CPU we are running when in user mode. We can be preempted anywhere.
- No contention.
- No interrupt disabling : it doesn't exist in user mode.

Secure :

- A process shouldn't be able to corrupt the system's trace or another
        process's trace. It should be limited to its own memory space.



* Solution

- Signal handler concurrency

Using atomic space reservation in the buffer(s) will remove the requirement for
locking. This is the fast and safe way to deal with concurrency coming from
signal handlers.
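
A minimal sketch of what atomic space reservation could look like, using C11
atomics. The names (trace_buf, trace_reserve, trace_write) and the 4k buffer
size are assumptions made for illustration, not part of the design. A writer
interrupted by a signal handler between the reservation and the copy keeps its
slot, and the handler simply reserves the next one.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define BUF_SIZE 4096   /* assumption: one 4k buffer per thread */

/* Hypothetical per-thread buffer; the layout is only illustrative. */
struct trace_buf {
    _Atomic size_t offset;   /* next free byte in data[] */
    char data[BUF_SIZE];
};

/*
 * Reserve len bytes without taking a lock. Returns the start of the
 * reserved slot, or NULL when the event does not fit (the caller would
 * then ask for a buffer switch).
 */
static void *trace_reserve(struct trace_buf *b, size_t len)
{
    size_t old = atomic_load(&b->offset);

    do {
        if (len > BUF_SIZE - old)
            return NULL;
        /* On failure, old is reloaded with the current offset. */
    } while (!atomic_compare_exchange_weak(&b->offset, &old, old + len));

    return b->data + old;
}

/* Reserve a slot, then copy the event into it. */
static int trace_write(struct trace_buf *b, const void *event, size_t len)
{
    void *slot = trace_reserve(b, len);

    if (slot == NULL)
        return -1;
    memcpy(slot, event, len);
    return 0;
}
```

The compare-and-exchange loop is the only synchronisation point; no signal
masking and no system call is needed on the write path.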

- Start/stop tracing

Two possible solutions :

Either we export a read-only memory page from kernel to user space. That would
be seen as somewhat of a hack, as I have never seen such an interface anywhere
else. It may lead to problems related to exported types. The proper, but slow,
way to do it would be to have a system call that would return the tracing
status.

My suggestion is to go for a system call, but only call it :

- when the thread starts
- when receiving a SIGRTMIN+3 (multithread ?)

Note : save the thread ID (process ID) in the logging function and the update
handler. Use it as a comparison to check if we are a forked child thread.
Start a brand new buffer list in that case.
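
The fork check in the note above could be sketched as follows. The structure
and function names are hypothetical, and the buffer list is reduced to a
counter to keep the example short.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical per-thread tracing state. */
struct ltt_thread_state {
    pid_t owner;        /* ID saved when the buffers were set up */
    int   nr_buffers;   /* stands in for the list of mapped buffers */
};

/*
 * Compare the saved ID with the caller's ID. A mismatch means we are
 * running in a forked child: forget the inherited buffers and start a
 * brand new list. Returns 1 when the state was reset, 0 otherwise.
 */
static int ltt_check_owner(struct ltt_thread_state *s, pid_t self)
{
    if (s->owner == self)
        return 0;
    s->owner = self;
    s->nr_buffers = 0;
    return 1;
}

/* Convenience wrapper for the logging function and the update handler. */
static int ltt_check_owner_now(struct ltt_thread_state *s)
{
    return ltt_check_owner(s, getpid());
}
```

Both the logging function and the update handler would call
ltt_check_owner_now() before touching any buffer.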


Two possibilities :

- one system call per piece of information to get/one system call to get all
        information.
- one signal per piece of information to get/one signal for "update" tracing
        info.

I would tend to adopt :

- One signal for "general tracing update"
        One signal handler would clearly be enough, more would be unnecessary
        overhead/pollution.
- One system call for all updates.
        We will need to have multiple parameters though. We have up to 6 parameters.

syscall get_tracing_info

parameter 1 : trace buffer map address. (id)

parameter 2 : active ? (int)


Concurrency

We must have per thread buffers. Then, no memory can be written by two threads
at once. It removes the need for locks (ok, atomic reservation was already doing
that) and removes false sharing.


Multiple traces

By having the number of active traces, we can allocate as many buffers as we
need. Allocation is done in the kernel with relay_open. User space mapping is
done when receiving the signal/starting the process and getting the number of
active traces.

It means that we must make sure to only update the data structures used by
tracing functions once the buffers are created.

We could have a syscall "get_next_buffer" that would basically mmap the next
unmapped buffer, or return NULL if all buffers are mapped.

If we remove a trace, the kernel should stop the tracing, and then get the last
buffer for this trace. What is important is to make sure no writers are still
trying to write in a memory region that gets deallocated.

For that, we will keep an atomic variable "tracing_level", which tells how many
times we are nested in tracing code (program code/signal handlers) for a
specific trace.

We could do that trace removal in two operations :

- Send an update tracing signal to the process
        - the sig handler gets the new tracing status, which tells that tracing is
                disabled for the specific trace. It writes this status in the tracing
                control structure of the process.
        - If tracing_level is 0, well, it's fine : there are no potential writers in
                the removed trace. It's up to us to buffer switch the removed trace, and,
                after the control returns to us, set_tracing_info this page to NULL and
                delete this memory area.
        - Else (tracing_level > 0), flag the removed trace for later switch/delete.

        It then returns control to the process.

- If the tracing_level was > 0, there was one or more writers potentially
        accessing this memory area. When the control comes back to the writer, at the
        end of the write in a trace, if the trace is marked for switch/delete and the
        tracing_level is 0 (after the decrement of the writer itself), then the
        writer must buffer switch, and then delete the memory area.
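
The two-step removal above can be sketched with C11 atomics. Everything except
the name tracing_level is an assumption: the structure, the function names, and
the "released" counter, which stands in for the real buffer switch followed by
the deletion of the memory area.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-trace control structure of one thread. */
struct trace_ctl {
    atomic_int  tracing_level;    /* nesting depth inside tracing code */
    atomic_bool active;           /* tracing status from the last update */
    atomic_bool pending_delete;   /* flagged for later switch/delete */
    int released;                 /* counts switch+delete operations */
};

static void trace_ctl_init(struct trace_ctl *t)
{
    atomic_init(&t->tracing_level, 0);
    atomic_init(&t->active, true);
    atomic_init(&t->pending_delete, false);
    t->released = 0;
}

/* Placeholder for "buffer switch, then delete the memory area". */
static void trace_release(struct trace_ctl *t)
{
    t->released++;
}

/* End of a write: the last writer out performs the deferred removal. */
static void trace_exit(struct trace_ctl *t)
{
    if (atomic_fetch_sub(&t->tracing_level, 1) == 1 &&
        atomic_exchange(&t->pending_delete, false))
        trace_release(t);
}

/* Start of a write: returns false when the trace must not be written. */
static bool trace_enter(struct trace_ctl *t)
{
    atomic_fetch_add(&t->tracing_level, 1);
    if (!atomic_load(&t->active)) {
        trace_exit(t);
        return false;
    }
    return true;
}

/* Called from the update signal handler when the trace is removed. */
static void trace_remove(struct trace_ctl *t)
{
    atomic_store(&t->active, false);
    if (atomic_load(&t->tracing_level) == 0)
        trace_release(t);                        /* no potential writers */
    else
        atomic_store(&t->pending_delete, true);  /* a writer will do it */
}
```

Because the handler runs on the thread that owns the buffers, it always
completes before the interrupted writer resumes, so the writer is guaranteed to
see the flag when it reaches trace_exit().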


Filter

The update tracing info signal will make the thread get the new filter
information. Getting this information will also happen upon process creation.

parameter 3 for the get tracing info : an integer containing the 32-bit mask.
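
The meaning of the individual mask bits is not specified here. Assuming, for
illustration only, that each bit enables one event category, the check on the
logging path could be as cheap as the following hypothetical helper:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumption: bit N of the mask enables event category N. Returns 1 when
 * the category passes the filter, 0 otherwise (including for N >= 32).
 */
static inline int ltt_filter_pass(uint32_t mask, unsigned int category)
{
    return category < 32 && ((mask >> category) & 1u);
}
```

With a mask of 0x5, categories 0 and 2 would be logged and every other
category dropped before any buffer space is reserved.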


Buffer switch

There could be a tracing_buffer_switch system call, that would give the page
start address as parameter. The job of the kernel is to steal this page,
possibly replacing it with a zeroed page (we don't care about the content of the
page after the syscall).

Process dying

The kernel should be aware of the current pages used for tracing in each thread.
If a thread dies unexpectedly, we want the kernel to get the last bits of
information before the thread crashes.

Memory protection

If a process corrupts its own mmapped buffers, the rest of the trace will still
be readable, because each process has its own memory space.

Two possibilities :

Either we create one channel per process, or we have per cpu tracefiles for all
the processes, with the specification that data is written in a monotonically
increasing time order and that no process shares a 4k page with another process.

The problem with having only one tracefile per cpu is that we cannot safely
steal a process's buffer upon a schedule change because it may be currently
writing to it.

It leaves the one tracefile per thread as the only solution.

Another argument in favor of this solution is the possibility to have mixed
32-64 bits processes on the same machine. Dealing with types will be easier.


Corrupted trace

A corrupted tracefile will only affect one thread. The rest of the trace will
still be readable.


Facilities

Upon process creation or when receiving the signal of trace info update, when a
new trace appears, the thread should write the facility information into it. It
must then have a list of registered facilities, all done at the thread level.

We must decide if we allow a facility channel for each thread. The advantage is
that we have a readable channel in flight recorder mode, while the disadvantage
is to double the number of channels, which may become quite high. To follow
the general design of a high throughput channel and a low throughput channel for
vital information, I suggest having a separate channel for facilities, per
trace, per process.



API :

syscall 1 :

in :
buffer : NULL means get new traces
         non NULL means to get the information for the specified buffer
out :
buffer : returns the address of the trace buffer
active : is the trace active ?
filter : 32 bits filter mask

return : 0 on success, 1 on error.

int ltt_update(void **buffer, int *active, int *filter);

syscall 2 :

in :
buffer : Switch the specified buffer.
return : 0 on success, 1 on error.

int ltt_switch(void *buffer);
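
A sketch of how a thread might drive ltt_update. Since the syscall is only a
proposal, the ltt_update below is a mock that pretends the kernel holds two
traces. The helper ltt_refresh, the trace_info structure, and the reading of
"1 on error" as "no more new traces" are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* --- Mock standing in for the proposed syscall (not a real API). --- */
#define MOCK_TRACES 2
static char mock_bufs[MOCK_TRACES][16];
static int mock_handed_out;

static int ltt_update(void **buffer, int *active, int *filter)
{
    if (*buffer == NULL) {                 /* caller asks for a new trace */
        if (mock_handed_out == MOCK_TRACES)
            return 1;                      /* nothing left to hand out */
        *buffer = mock_bufs[mock_handed_out++];
    }
    *active = 1;
    *filter = 0xff;
    return 0;
}

/* --- Caller side: what the thread keeps for each trace. --- */
struct trace_info {
    void *buffer;
    int active;
    int filter;
};

/*
 * Refresh the nr traces already known, then pick up new ones until the
 * call fails or the table is full. Returns the new number of traces.
 */
static int ltt_refresh(struct trace_info *t, int nr, int max)
{
    for (int i = 0; i < nr; i++)
        ltt_update(&t[i].buffer, &t[i].active, &t[i].filter);

    while (nr < max) {
        t[nr].buffer = NULL;
        if (ltt_update(&t[nr].buffer, &t[nr].active, &t[nr].filter) != 0)
            break;
        nr++;
    }
    return nr;
}
```

ltt_refresh would run once at thread start and again from the update signal
handler, so one system call per trace covers buffer address, status and
filter together.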


Signal :

SIGRTMIN+3
(like hardware fault and expiring timer : to the thread, see p. 413 of Advanced
prog. in the UNIX env.)

Signal is sent on tracing create/destroy, start/stop and filter change.

Each thread will update for itself only : it will remove unnecessary concurrency.
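
Installing the handler uses only standard POSIX calls. In this sketch the
handler just counts deliveries; a real one would fetch the new tracing
information at that point. The names ltt_update_handler, ltt_update_count and
ltt_install_update_handler are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

#define LTT_UPDATE_SIGNAL (SIGRTMIN + 3)

/* Number of update signals handled so far. */
static volatile sig_atomic_t ltt_update_count;

/* Runs on tracing create/destroy, start/stop and filter change. */
static void ltt_update_handler(int signo)
{
    (void)signo;
    /* A real handler would query the new tracing status here. */
    ltt_update_count++;
}

/* Returns 0 on success, -1 on error (as sigaction does). */
static int ltt_install_update_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = ltt_update_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;   /* do not make traced syscalls fail with EINTR */
    return sigaction(LTT_UPDATE_SIGNAL, &sa, NULL);
}
```

SA_RESTART is a choice made here so that tracing stays invisible to the
application's own system calls; the design above does not mandate it.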



Notes :

It doesn't matter "when" the process receives the update signal after a trace
start : it will receive it in priority, before executing anything else, when it
is scheduled in.



Major enhancement :

* Buffer pool *

The problem with the design, up to now, is if a heavily threaded application
launches many threads that have a short lifetime : it will allocate memory for
each traced thread, consuming time, and it will create an incredibly high
number of files in the trace (or per thread).

(thanks to Matthew Khouzam)
The solution to this sits in the use of a buffer pool : We typically create a
buffer pool of a specified size (say, 10 buffers by default, alterable by the
user), each 8k in size (4k for normal trace, 4k for facility channel), for a
total of 80kB of memory. It has to be tweaked to the maximum number of
expected threads running at once, or it will have to grow dynamically (thus
impacting on the trace).

A typical approach to dynamic growth is to double the number of allocated
buffers each time a threshold near the limit is reached.

Each channel would be found as :

trace_name/user/facilities_0
trace_name/user/cpu_0
trace_name/user/facilities_1
trace_name/user/cpu_1
...

When a thread asks to be traced, it gets a buffer from the free buffer pool. If
the number of available buffers falls under a threshold, the pool is marked for
expansion and the thread gets its buffer quickly. The expansion will be executed
a little bit later by a worker thread. If, however, the number of available
buffers is 0, then an "emergency" reservation will be done, allocating only one
buffer. The goal of this is to modify the thread fork time as little as possible.
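
The pool policy can be sketched with plain counters. No memory is actually
allocated here, and the names and the low watermark value of 2 are
assumptions; only the default of 10 buffers and the doubling growth come from
the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL_DEFAULT_BUFFERS 10   /* default pool size from the text */
#define POOL_LOW_WATERMARK    2   /* assumption: expansion threshold */

/* Hypothetical bookkeeping for one trace's buffer pool. */
struct buf_pool {
    size_t total;          /* buffers allocated so far */
    size_t free;           /* buffers not owned by any thread */
    bool   grow_pending;   /* expansion left for the worker thread */
};

static void pool_init(struct buf_pool *p)
{
    p->total = POOL_DEFAULT_BUFFERS;
    p->free = POOL_DEFAULT_BUFFERS;
    p->grow_pending = false;
}

/*
 * A thread asks to be traced. Returns true when an "emergency"
 * single-buffer allocation was needed because the pool was empty.
 */
static bool pool_get(struct buf_pool *p)
{
    bool emergency = false;

    if (p->free == 0) {
        p->total++;        /* allocate exactly one buffer, right now */
        p->free++;
        emergency = true;
    }
    p->free--;
    if (p->free < POOL_LOW_WATERMARK)
        p->grow_pending = true;   /* the fast path only sets a flag */
    return emergency;
}

/* A thread terminates and its buffer has been read and zeroed. */
static void pool_put(struct buf_pool *p)
{
    p->free++;
}

/* Worker thread: double the pool if an expansion was requested. */
static void pool_grow(struct buf_pool *p)
{
    if (!p->grow_pending)
        return;
    p->free += p->total;
    p->total *= 2;
    p->grow_pending = false;
}
```

The costly part (pool_grow) stays off the fork path; pool_get does constant
work except in the emergency case.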

When a thread releases a buffer (the thread terminates), a buffer switch is
performed, so the data can be flushed to disk and no other thread will mess
with it or render the buffer unreadable.

Upon trace creation, the pre-allocated pool is allocated. Upon trace
destruction, the threads are first informed of the trace destruction, any
pending worker thread (for pool allocation) is cancelled and then the pool is
released. Buffers used by threads at this moment but not mapped for reading
will be simply destroyed (as their refcount will fall to 0). It means that
between the "trace stop" and "trace destroy", there should be enough time to let
the lttd daemon open the newly created channels or they will be lost.

Upon buffer switch, the reader can read directly from the buffer. Note that when
the reader finishes reading a buffer, if the associated thread writer has
exited, it must fill the buffer with zeroes and put it back into the free pool.
In the case where the trace is destroyed, it must just decrement its refcount (as
it would do otherwise) and the buffer will be destroyed.

This pool will reduce the number of trace files created to the order of the
number of threads present in the system at a given time.

A worst case scenario is 32768 processes traced at the same time, for a total
amount of 256MB of buffers. If a machine has so many threads, it probably has
enough memory to handle this.

In flight recorder mode, it would be interesting to use an LRU algorithm to
choose which buffer from the pool we must take for a newly forked thread. A
simple queue would do it.

SMP : per cpu pools ? -> no, L1 and L2 caches are typically too small to be
impacted by the fact that a reused buffer is on a different or the same CPU.