1 Benjamin Poirier
2 benjamin.poirier@polymtl.ca
3 2009
4
5 + About time synchronization
6 This framework performs offline time synchronization. This means that the
7 synchronization is done after tracing is over. It differs from online
8 synchronization, such as that done by NTP, and is not directly influenced by it.
9
10 Event timestamps are adjusted according to a clock correction function that
11 compensates for initial offset and rate offset (ie. clocks that don't start out
12 at the same value and clocks that don't run at the same speed). It can work on
13 two or more traces.
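
As a minimal sketch of what such a correction function looks like, assuming a
purely linear model (the struct and function names below are hypothetical and
are not taken from the framework's sources):

```c
/* Hypothetical container for the correction factors of one trace. */
typedef struct {
    double drift;   /* rate correction; 1.0 means both clocks run at the same speed */
    double offset;  /* initial offset, in the same unit as the timestamps */
} ClockFactors;

/* Map a raw timestamp onto the reference time base: t' = drift * t + offset. */
static double correct_timestamp(const ClockFactors *f, double raw)
{
    return f->drift * raw + f->offset;
}
```

With drift = 1 and offset = 0 the timestamps are left untouched, which is what
one would expect for the trace chosen as the reference.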
14
15 The synchronization is based on relations identified in network traffic
16 between nodes. So, for it to work, there must be traffic exchanged between the
17 nodes. At the moment, this must be TCP traffic. Any kind will do (ssh, http,
18 ...)
19
20 For scientific information about the algorithms used, see:
21 * Duda, A., Harrus, G., Haddad, Y., and Bernard, G.: Estimating global time in
22 distributed systems, Proc. 7th Int. Conf. on Distributed Computing Systems,
23 Berlin, volume 18, 1987
24 * Ashton, P.: Algorithms for Off-line Clock Synchronisation, University of
25 Canterbury, December 1995
26 http://www.cosc.canterbury.ac.nz/research/reports/TechReps/1995/tr_9512.pdf
27
28 + Using time synchronization
29 ++ Recording traces
30 To use time synchronization you have to record traces on multiple nodes
31 simultaneously with lttng (the tracer). While recording the traces, you have
32 to make sure the following markers are enabled:
33 * dev_receive
34 * dev_xmit_extended
35 * tcpv4_rcv_extended
36 * udpv4_rcv_extended
37 You also have to make sure there is some TCP traffic between the traced nodes.
38
39 ++ Viewing traces
40 Afterwards, you have to make sure all the traces are accessible from a single
41 machine, where lttv (the viewer) is run.
42
43 Time synchronization is enabled and controlled via the following lttv options,
44 as seen with "-h":
45 --sync
46 synchronize the time between the traces
47 --sync-stats
48 print statistics about the time synchronization
49 --sync-null
50 read the events but do not perform any processing, this
51 is mostly for performance evaluation
52 --sync-analysis - argument: chull, linreg
53 specify the algorithm to use for event analysis
54 --sync-graphs
55 output gnuplot graph showing synchronization points
56 --sync-graphs-dir - argument: DIRECTORY
57 specify the directory where to store the graphs, by
58 default in "graphs-<lttv-pid>"
59
60 To enable synchronization, start lttv with the "--sync" option. It can be
61 used in text mode or in GUI mode. You can add the traces one by one in the GUI
62 but this will recompute the synchronization after every trace that is added.
63 Instead, you can save some time by specifying all your traces on the command
64 line (using -t).
65
66 Example:
67 lttv-gui -t traces/node1 -t traces/node2 --sync
68
69 ++ Statistics
70 The --sync-stats option is useful to make sure the synchronization algorithms
71 worked. Here is an example output (with added comments) from a successful
72 chull (one of the synchronization algorithms) run of two traces:
73 LTTV processing stats:
74 received frames: 452
75 received frames that are IP: 452
76 received and processed packets that are TCP: 268
77 sent packets that are TCP: 275
78 TCP matching stats:
79 total input and output events matched together to form a packet: 240
80 Message traffic:
81 0 - 1 : sent 60 received 60
82 # Note that 60 + 60 < 240, this is because there was loopback traffic, which is
83 # discarded.
84 Convex hull analysis stats:
85 out of order packets dropped from analysis: 0
86 Number of points in convex hulls:
87 0 - 1 : lower half-hull 7 upper half-hull 9
88 Individual synchronization factors:
89 0 - 1 : Middle a0= -1.33641e+08 a1= 1 - 4.5276e-08 accuracy 1.35355e-05
90 a0: -1.34095e+08 to -1.33187e+08 (delta= 907388)
91 a1: 1 -6.81298e-06 to +6.72248e-06 (delta= 1.35355e-05)
92 Resulting synchronization factors:
93 trace 0 drift= 1 offset= 0 (0.000000) start time= 18.799023588
94 trace 1 drift= 1 offset= 1.33641e+08 (0.066818) start time= 19.090688494
95 Synchronization time:
96 real time: 0.113308
97 user time: 0.112007
98 system time: 0.000000
99
100 ++ Algorithms
101 The synchronization framework is extensible and already includes two
102 algorithms: chull and linreg. You can choose which analysis algorithm to use
103 with the --sync-analysis option.
104
105 + Design
106 This part describes the design of the synchronization framework. This is to
107 help programmers interested in:
108 * adding new synchronization algorithms (analysis part)
109 There are already two analysis algorithms available: chull and linreg
110 * using new types of events (processing and matching parts)
111 * using time synchronization with another data source/tracer (processing part)
112 There are already two data sources available: lttng and unittest
113
114 ++ Sync chain
115 This part is specific to the framework in use: the program doing
116 synchronization, the executable linking to the event_*.o
117 eg. LTTV, unittest
118
119 This reads parameters, creates SyncState and calls the processing init
120 function. The "sync chain" is the set of event-* modules. At the moment there
121 is only one module at each stage. However, as more modules are added, it will
122 become relevant to have many modules at the same stage simultaneously. This
123 will require some modifications. I've kept this possibility at the back of my
124 mind while designing.
125
126 ++ Stage 1: Event processing
127 Specific to the tracing data source.
128 eg. LTTng, LTT userspace, libpcap
129
130 Read the events from the trace and stuff them in an appropriate Event object.
131
132 ++ Communication between stages 1 and 2: events
133 Communication is done via objects specialized from Event. At the moment, all
134 *Event are in data_structures.h. Specific event structures and functions could
135 be in separate files. This way, adding a new set of modules would require
136 shipping extra data_structures* files instead of modifying the existing one.
137 For this to work, Event.type couldn't be an enum; it could be an int and use
138 #defines or constants defined in the specialized data_structures* files.
139 Event.event could be a void*.
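
The alternative layout discussed above could be sketched as follows. This is
an illustration only: the EVENT_TYPE_* constants, the TcpPayload struct and the
event_as_tcp() helper are hypothetical, and the member types are assumptions.

```c
#include <stddef.h>

/* Hypothetical type identifiers; a new set of modules would ship its own
 * data_structures* file defining additional values. */
#define EVENT_TYPE_TCP 0
#define EVENT_TYPE_UDP 1

typedef struct {
    unsigned long long time; /* timestamp of the event (unit is an assumption) */
    unsigned int traceNb;    /* index of the trace the event came from */
    int type;                /* one of the EVENT_TYPE_* values */
    void *event;             /* type-specific payload, interpreted per type */
} Event;

/* Hypothetical TCP payload, defined by the modules that use it. */
typedef struct {
    unsigned int seq;
} TcpPayload;

/* Return the TCP payload, or NULL if the event is of another type. */
static TcpPayload *event_as_tcp(const Event *e)
{
    return e->type == EVENT_TYPE_TCP ? (TcpPayload *) e->event : NULL;
}
```

The cost of this flexibility is that the compiler can no longer check the
payload type, so every consumer has to test Event.type before casting.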
140
141 ++ Stage 2: Event matching
142 This stage and its modules are specific to the type of event. Event processing
143 feeds the events one at a time but event analysis works on groups of events.
144 Event matching is responsible for forming these groups. Generally speaking,
145 these can have different types of relation ("one to one", "one to many", or a
146 mix) and it will influence the overall behavior of the module.
147 eg. TCP, UDP, MPI
148
149 matchEvent() takes an Event pointer. An actual matching module doesn't have
150 to be able to process every type of event. It has to check that the passed
151 event is of a type it can process.
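
The type check can be as simple as the sketch below. The real matchEvent()
signature is not reproduced here; match_event_tcp(), MatchingStats and the
reduced Event struct are hypothetical stand-ins.

```c
#include <stdbool.h>

typedef enum { TCP, UDP } EventType; /* hypothetical subset of event types */

typedef struct {
    EventType type;
    /* other members omitted for brevity */
} Event;

/* Hypothetical per-module state: counts what was accepted or skipped. */
typedef struct {
    unsigned int accepted;
    unsigned int ignored;
} MatchingStats;

/* Sketch of a TCP-only entry point. Returns true if the event was consumed. */
static bool match_event_tcp(MatchingStats *stats, const Event *event)
{
    if (event->type != TCP) {
        /* not a type this module can process: skip it */
        stats->ignored++;
        return false;
    }
    /* a real module would look for the corresponding send or receive here */
    stats->accepted++;
    return true;
}
```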
152
153 ++ Communication between stages 2 and 3: event groups
154 Communication consists of events grouped in Message, Exchange or Broadcast
155 structs.
156
157 About exchanges:
158 If one event pair is a packet (more generally, something representable as a
159 Message), an exchange is composed of at least two packets, one in each
160 direction. There should be a non-negative minimum "round trip time" (RTT)
161 between the first and last event of the exchange. This RTT should be as small
162 as possible so these packets should be closely related in time like a data
163 packet and an acknowledgement packet. If the events analyzed are such that the
164 minimum RTT can be zero, there's nothing gained in analyzing exchanges beyond
165 what can already be figured out by analyzing packets.
166
167 An exchange can also consist of more than two packets, in case one packet
168 single-handedly acknowledges many data packets. In this case, it is best to
169 use the last acknowledged packet. Assuming a linear clock, an acknowledged
170 packet is as good as any other. However, since the linear clock assumption is
171 further from reality as the interval grows longer, it is best to keep the
172 interval between the two packets as short as possible.
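
Picking the last acknowledged packet could be sketched as below. This assumes
cumulative acknowledgements and ignores sequence number wrap-around; DataPacket
and last_acked() are hypothetical names, not part of the framework.

```c
#include <stddef.h>

/* Hypothetical record of a data packet: sequence number of its last byte
 * and the time at which it was sent. */
typedef struct {
    unsigned int seq_end;
    double send_time;
} DataPacket;

/* Among the packets covered by the cumulative acknowledgement ack_seq,
 * return the one that was sent last, or NULL if none is covered. */
static const DataPacket *last_acked(const DataPacket *pkts, size_t n,
                                    unsigned int ack_seq)
{
    const DataPacket *best = NULL;
    for (size_t i = 0; i < n; i++) {
        if (pkts[i].seq_end <= ack_seq &&
            (best == NULL || pkts[i].send_time > best->send_time)) {
            best = &pkts[i];
        }
    }
    return best;
}
```

Choosing the packet with the latest send time keeps the interval between the
data packet and its acknowledgement as short as possible.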
173
174 ++ Stage 3: Event analysis
175 This stage and its modules are specific to the algorithm that analyzes events
176 to deduce synchronization factors.
177 eg. convex hull, linear regression, broadcast Maximum Likelihood Estimator
178
179 Instead of having one analyzeEvents() function that can receive any sort of
180 grouping of events, there are three prototypes: analyzeMessage(),
181 analyzeExchange() and analyzeBroadcast(). A module implements only the
182 relevant one(s) and sets the other function pointers to NULL in its
183 AnalysisModule struct.
184
185 The approach is different from matchEvent() where there is one point of entry
186 no matter the type of event. The analyze*() approach has the advantage that
187 there is no casting or type detection to do. It is also possible to deduce
188 from the function pointers which groupings of events a module can analyze.
189 However, it means each analysis module will have to be modified if there is
190 ever a new type of event grouping.
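
The function pointer scheme can be illustrated with the sketch below. The
field layout of AnalysisModule shown here is an assumption (only the three
analyze*() names come from the text above), and count_message() and
dispatch_message() are hypothetical helpers.

```c
#include <stddef.h>

/* Opaque placeholders for the three event groupings. */
typedef struct Message Message;
typedef struct Exchange Exchange;
typedef struct Broadcast Broadcast;

/* Assumed layout: one function pointer per grouping. */
typedef struct {
    const char *name;
    void (*analyzeMessage)(const Message *);
    void (*analyzeExchange)(const Exchange *);
    void (*analyzeBroadcast)(const Broadcast *);
} AnalysisModule;

static int messages_seen = 0;

static void count_message(const Message *m)
{
    (void) m; /* a real module would update its analysis state here */
    messages_seen++;
}

/* A module that only analyzes messages leaves the other entries NULL. */
static const AnalysisModule message_only_module = {
    .name = "example",
    .analyzeMessage = count_message,
    .analyzeExchange = NULL,
    .analyzeBroadcast = NULL,
};

/* The caller checks the pointer before dispatching. Returns 1 if handled. */
static int dispatch_message(const AnalysisModule *mod, const Message *m)
{
    if (mod->analyzeMessage == NULL)
        return 0;
    mod->analyzeMessage(m);
    return 1;
}
```

A matching module can inspect the same pointers up front to decide which
groupings are worth building for the selected analysis module.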
191
192 I chose this approach because:
193 1) I thought it likely that there will be new types of events but not so
194 likely that there will be new types of event groups.
195 2) all events share some members (time, traceNb, ...) but not event groups
196 3) we'll see which one of the two approaches works best and we can adapt
197 later.
198
199 ++ Data flow
200 Data from traces flows "down" from processing to matching to analysis. Factors
201 come back up.
202
203 ++ Evolution and adaptation
204 It is possible to change/add another sync chain and to add other event_*
205 modules. It has been done. New types of events may need to be added to
206 data_structures.h. This is only to link between event_* modules. If the data
207 does not have to be shared, data_structures.h does not have to be modified.
208
209 At the moment there is some code duplication in the last steps of linreg and
210 chull analysis: the code to propagate the factors when there are more than two
211 nodes. Maybe there could be a Stage 4 that does that?
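
For reference, propagating linear factors along a chain of nodes amounts to
composing linear functions. The sketch below shows that composition only; it
is not the code used by linreg or chull, and Factors and compose() are
hypothetical names.

```c
typedef struct {
    double drift;
    double offset;
} Factors;

/* Compose two linear corrections.
 * outer maps clock B to clock A, inner maps clock C to clock B.
 * The result maps clock C directly to clock A:
 *   a = outer.drift * (inner.drift * c + inner.offset) + outer.offset */
static Factors compose(Factors outer, Factors inner)
{
    Factors r;
    r.drift = outer.drift * inner.drift;
    r.offset = outer.drift * inner.offset + outer.offset;
    return r;
}
```

A shared stage could apply such a composition along the path from each node to
the reference node, whichever analysis produced the pairwise factors.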