Benjamin Poirier
benjamin.poirier@polymtl.ca
2009, 2010

+ About time synchronization
This framework performs offline time synchronization. This means that the
synchronization is done after tracing is over. It is not the same as online
synchronization like what is done by NTP, nor is it directly influenced by it.

Event timestamps are adjusted according to a clock correction function that
compensates for initial offset and rate offset (i.e. clocks that don't start
out at the same value and clocks that don't run at the same speed). It can
work on two or more traces.
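
As an illustration, the correction applied to each timestamp is a linear
function of the raw time. The sketch below is not code from the framework;
"offset" and "drift" correspond to the factors reported in the statistics
output shown further below.

/* Illustrative sketch: apply a linear clock correction to a raw timestamp. */
static inline double correct_time(double rawTime, double offset, double drift)
{
        return offset + drift * rawTime;
}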

The synchronization is based on relations identified in network traffic
between nodes. So, for it to work, there must be traffic exchanged between the
nodes. At the moment, this must be TCP traffic. Any kind will do (ssh, http,
...).

For scientific information about the algorithms used, see:
* Duda, A., Harrus, G., Haddad, Y., and Bernard, G.: Estimating global time in
  distributed systems, Proc. 7th Int. Conf. on Distributed Computing Systems,
  Berlin, volume 18, 1987
* Ashton, P.: Algorithms for Off-line Clock Synchronisation, University of
  Canterbury, December 1995
  http://www.cosc.canterbury.ac.nz/research/reports/TechReps/1995/tr_9512.pdf

+ Using time synchronization
++ Recording traces
To use time synchronization you have to record traces on multiple nodes
simultaneously with lttng (the tracer). While recording the traces, you have
to make sure the following markers are enabled:
* dev_receive
* dev_xmit_extended
* tcpv4_rcv_extended
* udpv4_rcv_extended
You can use 'ltt-armall -n' for this.

You also have to make sure there is some TCP traffic between the traced nodes.

++ Viewing traces
Afterwards, you have to make sure all the traces are accessible from a single
machine, where lttv (the viewer) is run.

Time synchronization is enabled and controlled via the following lttv options,
as seen with "-h":
--sync
        synchronize the time between the traces
--sync-stats
        print statistics about the time synchronization
        See the section "Statistics" for more information.
--sync-null
        read the events but do not perform any processing, this
        is mostly for performance evaluation
--sync-analysis - argument: chull, linreg, eval
        specify the algorithm to use for event analysis. See the
        section "Synchronization Algorithms".
--sync-reduction - argument: accuracy
        specify the algorithm to use for factor reduction. See
        the section "Reduction Algorithms".
--sync-graphs
        output gnuplot graph showing synchronization points
--sync-graphs-dir - argument: DIRECTORY
        specify the directory where to store the graphs, by
        default in "graphs-<lttv-pid>"

To enable synchronization, start lttv with the "--sync" option. It can be
used in text mode or in GUI mode. You can add the traces one by one in the GUI
but this will recompute the synchronization after every trace that is added.
Instead, you can save some time by specifying all your traces on the command
line (using -t).

Example:
lttv-gui -t traces/node1 -t traces/node2 --sync

++ Statistics
The --sync-stats option is useful to know how well the synchronization
algorithms worked. Here is an example output (with added comments) from a
successful chull (one of the synchronization algorithms) run of two traces:
LTTV processing stats:
        received frames: 452
        received frames that are IP: 452
        received and processed packets that are TCP: 268
        sent packets that are TCP: 275
TCP matching stats:
        total input and output events matched together to form a packet: 240
Message traffic:
        0 - 1 : sent 60 received 60
# Note that 60 + 60 < 240, this is because there was loopback traffic, which is
# discarded.
Convex hull analysis stats:
        out of order packets dropped from analysis: 0
        Number of points in convex hulls:
                0 - 1 : lower half-hull 7 upper half-hull 9
        Individual synchronization factors:
                0 - 1 : Middle a0= -1.33641e+08 a1= 1 - 4.5276e-08 accuracy 1.35355e-05
                        a0: -1.34095e+08 to -1.33187e+08 (delta= 907388)
                        a1: 1 -6.81298e-06 to +6.72248e-06 (delta= 1.35355e-05)
# "Middle" is the best type of synchronization for chull. See the section
# "Convex Hull" below.
Resulting synchronization factors:
        trace 0 drift= 1 offset= 0 (0.000000) start time= 18.799023588
        trace 1 drift= 1 offset= 1.33641e+08 (0.066818) start time= 19.090688494
Synchronization time:
        real time: 0.113308
        user time: 0.112007
        system time: 0.000000

++ Synchronization Algorithms
The synchronization framework is extensible and already includes two
algorithms: chull and linreg. (There is also a special "eval" module
available.) You can choose which analysis algorithm to use with the
--sync-analysis option.

+++ Convex Hull
chull, the default analysis module, can provide a guarantee that there are no
message inversions after synchronization. When printing the statistics, it
will print, for each trace pair, the type of factors found:
* "Middle": all went according to assumptions and there will be no message
  inversions
* "Fallback": it was not possible to guarantee the absence of message
  inversions, so approximate factors were given instead. This may happen during
  long running traces where the non-linearity of the clocks was notable. If you
  can, try to reduce the duration of the trace. (Sometimes this may happen with
  a trace as short as 120 s, but sometimes traces of 30 minutes or longer are
  fine; your mileage may vary.) It would also be possible to improve the
  algorithms to avoid this, see the "Todo" section. In any case, you may get
  better results (but still no guarantee) by choosing the linreg algorithm
  instead.
* "Absent": the trace pair does not contain common communication events. Are
  you sure the nodes exchanged TCP traffic during the trace?

There are also other, less common, types. See the enum ApproxType in
event_analysis_chull.h.

+++ Linear Regression
linreg sometimes gives more precise results than chull but it provides no
guarantee against message inversions.

+++ Synchronization evaluation
eval is a special module: it doesn't really perform synchronization; instead,
it calculates and prints different metrics about how well traces are
synchronized. Although it can be run like other analysis modules, it is most
useful when run in a postprocessing step, after another synchronization module
has been run. Eval is most commonly run in text mode. To do this, run:
lttv -m sync_chain_batch [usual options, ex: -t traces/node1 -t traces/node2
--sync ...]
It can also be run from the lttv source tree via runlttv:
./runlttv -m eval [usual runlttv options, ex: traces/node1 traces/node2]

eval provides a few more options:
--eval-rtt-file - argument: FILE
        specify the file containing RTT information
--eval-graphs - argument: none
        output gnuplot graph showing synchronization points
--eval-graphs-dir - argument: DIRECTORY
        specify the directory where to store the graphs, by
        default in "eval-graphs-<lttv pid>"

The RTT file should contain information on the minimum round-trip time between
nodes involved in the trace. This information is used (optionally) in the
evaluation displayed and in the histogram graphs produced. The file should
contain a series of lines of the form:
192.168.112.56 192.168.112.57 0.100
The first two fields are the IP addresses of the source and destination hosts
(hostnames are not supported). The last field is the minimum RTT in ms. The
fields are separated by whitespace. A '#' comments out a line.

Many commands can be used to measure the RTT, for example:
ping -s 8 -A -c 8000 -w 10 192.168.112.57

Note that this must be repeated in both directions in the file, that is:
192.168.112.49 192.168.112.50 0.057
192.168.112.50 192.168.112.49 0.050

++++ Linear Programming and GLPK
The synchronization evaluation can optionally perform an analysis similar to
chull but using a linear program in one of the steps. This can be used to
validate a part of the chull algorithm, but it can also be used to provide a
measure of the accuracy of the synchronization at any point (this is seen in
the graph output).

This is enabled by default at configure time (--with-glpk) if the GNU Linear
Programming Kit is available (libglpk). On Debian-like systems (ex. Ubuntu),
install the package "libglpk-dev".

To see the output of this mode, run:
lttv -m sync_chain_batch --eval-graphs [usual options, ex: -t traces/node1 -t
traces/node2 --sync ...]

++ Reduction Algorithms
Event analysis yields time correction factors between trace pairs. For groups
of more than two traces, an extra step is necessary to identify a reference
trace and calculate correction factors for each trace relative to this
reference. There are usually many possibilities and so this step is called
"factor reduction".

+++ Accuracy
At the moment, only one algorithm is available to do this, the "accuracy"
algorithm. This algorithm tries to choose the reference and the factors that
yield the best accuracy. See the function header comments in
factor_reduction_accuracy.c for more details.

+ Design
This part describes the design of the synchronization framework. It is meant
to help programmers interested in:
* adding new synchronization algorithms (analysis part)
  There are already two analysis algorithms available: chull and linreg
* using new types of events (processing and matching parts)
  There are already two types of events supported: TCP messages and UDP
  broadcasts
* using time synchronization with another data source/tracer (processing part)
  There are already two data sources available: lttng and unittest

++ Sync chain
This part is specific to the framework in use: the program doing the
synchronization, that is, the executable linking to the event_*.o files
(e.g. LTTV, unittest).

This code reads the parameters, creates a SyncState and calls the init
functions of the modules to be used. The "sync chain" is this set of modules.
At the moment there is only one module at each stage. However, as more modules
are added, it will become relevant to have many modules at the same stage
simultaneously. This will require some modifications. It is already partly
supported at the matching stage through encapsulation of other matching
modules.

sync_chain_unitest:main() provides a fairly simple example of a sync chain
implementation.
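
The general shape of a sync chain, reduced to a hedged sketch (every type and
function name below is invented for illustration; see sync_chain_unitest:main()
for the real thing):

/* Hypothetical sketch, not the real API: a sync chain is an ordered set of
 * modules whose init functions are called before processing starts. */
struct SketchSyncState;

struct SketchModule {
        const char *name;
        void (*init)(struct SketchSyncState *state);
        void (*destroy)(struct SketchSyncState *state);
};

struct SketchSyncState {
        /* processing, matching, analysis, reduction */
        const struct SketchModule *modules[4];
        unsigned int moduleCount;
};

/* Call the init function of every module chosen for this run. */
static void init_sync_chain_sketch(struct SketchSyncState *state)
{
        unsigned int i;

        for (i = 0; i < state->moduleCount; i++)
                state->modules[i]->init(state);
}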

++ Stage 1: Event processing
Specific to the tracing data source.
e.g. LTTng, LTT userspace, libpcap

Read the events from the trace and stuff them into appropriate Event objects.

++ Communication between stages 1 and 2: events
Communication is done via objects specialized from Event. At the moment, all
*Event structures are in data_structures.h. Specific event structures and
functions could be in separate files. This way, adding a new set of modules
would require shipping extra data_structures* files instead of modifying the
existing one. For this to work, Event.type could not be an enum; it would have
to be an int set from #defines or constants defined in the specialized
data_structures* files, and Event.event could be a void*.
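
As a sketch of that alternative (this is not the current layout in
data_structures.h; the field types are assumptions):

#include <stdint.h>

/* Hypothetical extensible event: the type tag is a plain int defined by a
 * specialized module and the payload is an opaque pointer that only the
 * defining module knows how to interpret. */
struct SketchEvent {
        unsigned int traceNb;   /* trace this event was read from */
        uint64_t time;          /* raw timestamp, before correction */
        int type;               /* constant from a specialized data_structures* file */
        void *event;            /* specialized event data */
};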

++ Stage 2: Event matching
This stage and its modules are specific to the type of event. Event processing
feeds the events one at a time but event analysis works on groups of events.
Event matching is responsible for forming these groups. Generally speaking,
these can have different types of relation ("one to one", "one to many", or a
mix) and this influences the overall behavior of the module.
e.g. TCP, UDP, MPI

matchEvent() takes an Event pointer. An actual matching module doesn't have to
be able to process every type of event. It will only be passed events of a
type it can process (according to the .canMatch field of its MatchingModule
struct).
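
For illustration, the dispatch rule described above amounts to something like
the following sketch (type and field names are assumptions, not the actual
LTTV definitions):

/* Hypothetical sketch: the sync chain only passes an event to a matching
 * module whose canMatch table covers that event type. */
struct SketchMatchingModule {
        const int *canMatch;    /* indexed by event type */
        void (*matchEvent)(void *syncState, void *event);
};

static void dispatch_event_sketch(const struct SketchMatchingModule *module,
        void *syncState, void *event, int eventType)
{
        if (module->canMatch[eventType])
                module->matchEvent(syncState, event);
}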

++ Communication between stages 2 and 3: event groups
Communication consists of events grouped in Message, Exchange or Broadcast
structs.

About exchanges:
If one event pair is a packet (more generally, something representable as a
Message), an exchange is composed of at least two packets, one in each
direction. There should be a non-negative minimum "round trip time" (RTT)
between the first and last event of the exchange. This RTT should be as small
as possible, so these packets should be closely related in time, like a data
packet and an acknowledgement packet. If the events analyzed are such that the
minimum RTT can be zero, there's nothing gained in analyzing exchanges beyond
what can already be figured out by analyzing packets.

An exchange can also consist of more than two packets, in case one packet
single-handedly acknowledges many data packets. In this case, it is best to
use the last data packet. Assuming a linear clock, an acknowledged packet is
as good as any other. However, since the linear clock assumption gets further
from reality as the interval grows longer, it is best to keep the interval
between the two packets as short as possible.

++ Stage 3: Event analysis
This stage and its modules are specific to the algorithm that analyzes events
to deduce synchronization factors.
e.g. convex hull, linear regression, broadcast Maximum Likelihood Estimator

This module should return a set of synchronization factors for each trace
pair. Some trace pairs may have no factors; in that case, their approxType
should be set to ABSENT.

Instead of having one analyzeEvents() function that can receive any sort of
grouping of events, there are three prototypes: analyzeMessage(),
analyzeExchange() and analyzeBroadcast(). A module implements only the
relevant one(s) and the other function pointers are NULL.
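
For example, an analysis module that only handles Message groupings might fill
in its descriptor as in the following sketch (invented type and function names,
not the actual LTTV structure):

/* Hypothetical analysis module descriptor: only the relevant callback is
 * provided, the others stay NULL. */
struct SketchAnalysisModule {
        const char *name;
        void (*analyzeMessage)(void *syncState, void *message);
        void (*analyzeExchange)(void *syncState, void *exchange);
        void (*analyzeBroadcast)(void *syncState, void *broadcast);
};

static void analyzeMessageSketch(void *syncState, void *message)
{
        /* accumulate data about this message for the factor calculation */
}

static const struct SketchAnalysisModule sketchAnalysisModule = {
        .name = "sketch",
        .analyzeMessage = analyzeMessageSketch,
        .analyzeExchange = NULL,
        .analyzeBroadcast = NULL,
};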

The approach is different from matchEvent(), where there is one point of entry
no matter the type of event. The analyze*() approach has the advantage that
there is no casting or type detection to do. It is also possible to deduce
from the function pointers which groupings of events a module can analyze.
However, it means each analysis module will have to be modified if there is
ever a new type of event grouping.

I chose this approach because:
1) I thought it likely that there will be new types of events but not so
   likely that there will be new types of event groups.
2) all events share some members (time, traceNb, ...) but not event groups
3) we'll see which one of the two approaches works best and we can adapt
   later.

++ Stage 4: Factor reduction
This stage reduces the pairwise synchronization factors obtained in stage 3 to
time correction factors for each trace. It is most useful when synchronizing
more than two traces.

++ Evolution and adaptation
It is possible to change or add sync chains and to add other modules; this has
been done. New types of events may need to be added to data_structures.h. This
is only needed to link between Event-* modules. If the data does not have to
be shared, data_structures.h does not have to be modified.