Add sync_chain_unittest
[lttv.git] / lttv / lttv / sync / README
CommitLineData
add19043
BP
1Benjamin Poirier
2benjamin.poirier@polymtl.ca
32009
4
5+ About time synchronization
6This framework performs offline time synchronization. This means that the
7synchronization is done after tracing is over. It is not the same as online
8synchronization like what is done by NTP. Nor is it directly influenced by it.
9
10Event timestamps are adjusted according to a clock correction function that
11palliates for initial offset and rate offset (ie. clocks that don't start out
12at the same value and clocks that don't run at the same speed). It can work on
13two or more traces.
14
15The synchronization is based on relations identified in network traffic
16between nodes. So, for it to work, there must be traffic exchanged between the
17nodes. At the moment, this must be TCP traffic. Any kind will do (ssh, http,
18...)
19
20For scientific information about the algorithms used, see:
21* Duda, A., Harrus, G., Haddad, Y., and Bernard, G.: Estimating global time in
22distributed systems, Proc. 7th Int. Conf. on Distributed Computing Systems,
23Berlin, volume 18, 1987
24* Ashton, P.: Algorithms for Off-line Clock Synchronisation, University of
25Canterbury, December 1995
26http://www.cosc.canterbury.ac.nz/research/reports/TechReps/1995/tr_9512.pdf
27
28+ Using time synchronization
29++ Recording traces
30To use time synchronization you have to record traces on multiple nodes
31simultaneously with lttng (the tracer). While recording the traces, you have
32to make sure the following markers are enabled:
33* dev_receive
34* dev_xmit_extended
35* tcpv4_rcv_extended
36* udpv4_rcv_extended
bc7c054d 37You can use 'ltt-armall -n' for this.
9a9ca632 38
add19043
BP
39You also have to make sure there is some TCP traffic between the traced nodes.
40
41++ Viewing traces
42Afterwards, you have to make sure all the traces are accessible from a single
43machine, where lttv (the viewer) is run.
44
45Time synchronization is enabled and controlled via the following lttv options,
46as seen with "-h":
47--sync
48 synchronize the time between the traces
49--sync-stats
50 print statistics about the time synchronization
bc7c054d 51 See the section "Statistics" for more information.
add19043
BP
52--sync-null
53 read the events but do not perform any processing, this
54 is mostly for performance evaluation
bc7c054d
BP
55--sync-analysis - argument: chull, linreg, eval
56 specify the algorithm to use for event analysis. See the
57 section "Alogrithms".
add19043
BP
58--sync-graphs
59 output gnuplot graph showing synchronization points
60--sync-graphs-dir - argument: DIRECTORY
61 specify the directory where to store the graphs, by
62 default in "graphs-<lttv-pid>"
63
64To enable synchronization, start lttv with the "--sync" option. It can be
65used in text mode or in GUI mode. You can add the traces one by one in the GUI
66but this will recompute the synchronization after every trace that is added.
67Instead, you can save some time by specifying all your traces on the command
68line (using -t).
69
70Example:
71lttv-gui -t traces/node1 -t traces/node2 --sync
72
73++ Statistics
bc7c054d
BP
74The --sync-stats option is useful to know how well the synchronization
75algorithms worked. Here is an example output (with added comments) from a
76successful chull (one of the synchronization algorithms) run of two traces:
add19043
BP
77 LTTV processing stats:
78 received frames: 452
79 received frames that are IP: 452
80 received and processed packets that are TCP: 268
81 sent packets that are TCP: 275
82 TCP matching stats:
83 total input and output events matched together to form a packet: 240
84 Message traffic:
85 0 - 1 : sent 60 received 60
86# Note that 60 + 60 < 240, this is because there was loopback traffic, which is
87# discarded.
88 Convex hull analysis stats:
89 out of order packets dropped from analysis: 0
90 Number of points in convex hulls:
91 0 - 1 : lower half-hull 7 upper half-hull 9
92 Individual synchronization factors:
93 0 - 1 : Middle a0= -1.33641e+08 a1= 1 - 4.5276e-08 accuracy 1.35355e-05
94 a0: -1.34095e+08 to -1.33187e+08 (delta= 907388)
95 a1: 1 -6.81298e-06 to +6.72248e-06 (delta= 1.35355e-05)
bc7c054d
BP
96# "Middle" is the best type of synchronization for chull. See the section
97# "Convex Hull" below.
add19043
BP
98 Resulting synchronization factors:
99 trace 0 drift= 1 offset= 0 (0.000000) start time= 18.799023588
100 trace 1 drift= 1 offset= 1.33641e+08 (0.066818) start time= 19.090688494
101 Synchronization time:
102 real time: 0.113308
103 user time: 0.112007
104 system time: 0.000000
105
106++ Algorithms
107The synchronization framework is extensible and already includes two
108algorithms: chull and linreg. You can choose which analysis algorithm to use
109with the --sync-analysis option.
110
bc7c054d
BP
111+++ Convex Hull
112chull, the default analysis module, can provide a garantee that there are no
113message inversions after synchronization. When printing the statistics, it
114will print, for each trace, the type of factors found:
115* "Middle", all went according to assumptions and there will be no message
116 inversions
117* "Fallback", it was not possible to garantee no message inversion so
118 approximate factors were given instead. This may happen during long running
119 traces where the non-linearity of the clocks was notable. If you can, try to
120 reduce the duration of the trace. (Sometimes this may happen during a trace
121 as short as 120s. but sometimes traces 30 mins. or longer are ok, your
122 milleage may vary). It would also be to improve the algorithms to avoid
123 this, see the "Todo" section. In any case, you may get better results (but
124 still no garantee) by choosing the linreg algorithm instead.
125* "Absent", the trace pair does not contain common communication events. Are
126 you sure the nodes exchanged TCP traffic during the trace?
127
128There are also other, less common, types. See the enum ApproxType in
129event_analysis_chull.h.
130
131+++ Linear Regression
132linreg sometimes gives more accurate results than chull but it provides no
133garantee
134
135+++ Synchronization evaluation
136eval is a special module, it doesn't really perform synchronization, instead
137it calculates and prints different metrics about how well traces are
138synchronized. Although it can be run like other analysis modules, it is most
139useful when run in a postprocessing step, after another synchronization module
140has been run. Eval is most common run in text mode. To do this, run
141lttv -m eval [usual options, ex: -t traces/node1 -t traces/node2 --sync ...]
142
143eval provides a few more options:
144--eval-rtt-file - argument: FILE
145 specify the file containing RTT information
146--eval-graphs - argument: none
147 output gnuplot graph showing synchronization points
148--eval-graphs-dir - argument: eval-graphs-<lttv pid>
149 specify the directory where to store the graphs
150
151The RTT file should contain information on the minimum round-trip time between
152nodes involved in the trace. This information is used (optionally) in the
153evaluation displayed and in the histogram graphs produced. The file should
154contain a series of lines of the form:
155192.168.112.56 192.168.112.57 0.100
156The first two fields are the IP addresses of the source and destination hosts.
157(hostnames are not supported). The last field is the minimum rtt in ms. The
158fields are separated by whitespace. '#' comments a line.
159
160Many commands can be used to measure the RTT, for example:
161ping -s 8 -A -c 8000 -w 10 192.168.112.57
162
163Note that this must be repeated in both directions in the file.
164
165++++ Linear Programming and GLPK
166The synchronization evaluation can optionally perform an analysis similar to
167chull but by using a linear program in one of the steps. This can be used to
168validate a part of the chull algorithm but it can also be used to provide a
169measure of the accuracy of the synchronization in any point (this is seen in
170the graph output).
171
172This is enabled by default at configure time (--with-glpk) if the GNU Linear
173Programming Kit is available (libglpk).
174
add19043
BP
175+ Design
176This part describes the design of the synchronization framework. This is to
177help programmers interested in:
178* adding new synchronization algorithms (analysis part)
179 There are already two analysis algorithms available: chull and linreg
180* using new types of events (processing and matching parts)
bc7c054d
BP
181 There are already two types of events supported: tcp messages and udp
182 broadcasts
add19043
BP
183* using time synchronization with another data source/tracer (processing part)
184 There are already two data sources available: lttng and unittest
185
186++ Sync chain
187This part is specific to the framework in use: the program doing
188synchronization, the executable linking to the event_*.o
189eg. LTTV, unittest
190
191This reads parameters, creates SyncState and calls the processing init
192function. The "sync chain" is the set of event-* modules. At the moment there
193is only one module at each stage. However, as more module are added, it will
194become relevant to have many modules at the same stage simultaneously. This
b670bb7c
BP
195will require some modifications. It is already partly supported at the
196matching stage through encapsulation of other matching modules.
197
198sync_chain_unitest provides a fairly simple example of sync chain
199implementation.
add19043
BP
200
201++ Stage 1: Event processing
202Specific to the tracing data source.
203eg. LTTng, LTT userspace, libpcap
204
205Read the events from the trace and stuff them in an appropriate Event object.
206
207++ Communication between stages 1 and 2: events
208Communication is done via objects specialized from Event. At the moment, all
209*Event are in data_structures.h. Specific event structures and functions could
210be in separate files. This way, adding a new set of modules would require
211shipping extra data_structures* files instead of modifying the existing one.
212For this to work, Event.type couldn't be an enum, it could be an int and use
f6691532 213#defines or constants defined in the specialized data_structures* files.
add19043
BP
214Event.event could be a void*.
215
216++ Stage 2: Event matching
217This stage and its modules are specific to the type of event. Event processing
218feeds the events one at a time but event analysis works on groups of events.
219Event matching is responsible for forming these groups. Generally speaking,
220these can have different types of relation ("one to one", "one to many", or a
221mix) and it will influence the overall behavior of the module.
222eg. TCP, UDP, MPI
223
f6691532
BP
224matchEvent() takes an Event pointer. An actual matching module doesn't have to
225be able to process every type of event. It will only be passed events of a
226type it can process (according to the .canMatch field of its MatchingModule
227struct).
add19043
BP
228
229++ Communication between stages 2 and 3: event groups
230Communication consists of events grouped in Message, Exchange or Broadcast
231structs.
232
233About exchanges:
234If one event pair is a packet (more generally, something representable as a
235Message), an exchange is composed of at least two packets, one in each
236direction. There should be a non-negative minimum "round trip time" (RTT)
237between the first and last event of the exchange. This RTT should be as small
238as possible so these packets should be closely related in time like a data
239packet and an acknowledgement packet. If the events analyzed are such that the
240minimum RTT can be zero, there's nothing gained in analyzing exchanges beyond
241what can already be figured out by analyzing packets.
242
243An exchange can also consist of more than two packets, in case one packet
244single handedly acknowledges many data packets. In this case, it is best to
bc7c054d 245use the last data packet. Assuming a linear clock, an acknowledged
add19043
BP
246packet is as good as any other. However, since the linear clock assumption is
247further from reality as the interval grows longer, it is best to keep the
248interval between the two packets as short as possible.
249
250++ Stage 3: Event analysis
251This stage and its modules are specific to the algorithm that analyzes events
252to deduce synchronization factors.
253eg. convex hull, linear regression, broadcast Maximum Likelihood Estimator
254
255Instead of having one analyzeEvents() function that can receive any sort of
256grouping of events, there are three prototypes: analyzeMessage(),
257analyzeExchange() and analyzeBroadcast(). A module implements only the
bc7c054d 258relevant one(s) and the other function pointers are NULL.
add19043
BP
259
260The approach is different from matchEvent() where there is one point of entry
261no mather the type of event. The analyze*() approach has the advantage that
262there is no casting or type detection to do. It is also possible to deduce
263from the functions pointers which groupings of events a module can analyze.
264However, it means each analysis module will have to be modified if there is
265ever a new type of event grouping.
266
267I chose this approach because:
2681) I thought it likely that there will be new types of events but not so
269 likely that there will be new types of event groups.
2702) all events share some members (time, traceNb, ...) but not event groups
2713) we'll see which one of the two approaches works best and we can adapt
272 later.
273
274++ Data flow
275Data from traces flows "down" from processing to matching to analysis. Factors
276come back up.
277
278++ Evolution and adaptation
279It is possible to change/add another sync chain and to add other event_*
280modules. It has been done. New types of events may need to be added to
281data_structures.h. This is only to link between Event-* modules. If the data
282does not have to be shared, data_structures.h does not have to be modified.
283
284At the moment there is some code duplication in the last steps of linreg and
285chull analysis: the code to propagate the factors when there are more than two
286nodes. Maybe there could be a Stage 4 that does that?
This page took 0.033021 seconds and 4 git commands to generate.