From: Jérémie Galarneau
Date: Wed, 12 Jan 2022 20:48:00 +0000 (-0500)
Subject: Fix: relayd: rotation failure for multi-domain session
X-Git-Url: https://git.lttng.org/?a=commitdiff_plain;h=c5c79321eb1937f3d208210365c512f4a186ec2a;hp=c5c79321eb1937f3d208210365c512f4a186ec2a;p=lttng-tools.git

Fix: relayd: rotation failure for multi-domain session

Observed issue
==============

Rotating a multi-domain streaming session results in the following
error:

  $ lttng rotate
  Waiting for rotation to complete...
  Error: Failed to retrieve rotation state.

Meanwhile, the relay daemon logs indicate the following:

  DBG1 - 14:56:04.213163667 [265774/265778]: lttng_trace_chunk_rename_path from .tmp_new_chunk to (null) (in lttng_trace_chunk_rename_path_no_lock() at trace-chunk.cpp:759)
  PERROR - 14:56:04.213242941 [265774/265778]: Failed to move trace chunk directory ".tmp_new_chunk" to "20220112T145604-0500-1": No such file or directory (in lttng_trace_chunk_rename_path_no_lock() at trace-chunk.cpp:799)
  DBG1 - 14:56:04.213396931 [265774/265778]: aborting session 2 (in session_abort() at session.cpp:588)
  DBG1 - 14:56:04.213512198 [265774/265778]: Control connection closed with 22 (in relay_thread_close_connection() at main.cpp:3874)

The 'abort' of session 2 here causes the kernel consumer to fail to
consume subbuffers:

  Error: Relayd send index failed. Cleaning up relayd 3.
  Error: Error consuming subbuffer: (0)
  [...]

Cause
=====

Following the flow of execution in the relay daemon shows that
different trace chunks are used by the two relay sessions that result
from the streaming of a single multi-domain session.

Both trace chunks "own" the same output directory. When a rotation is
performed, the first trace chunk to be closed will move the directory.
Then, the second trace chunk to be closed will attempt to do the same,
failing to do so as seen in the relay daemon log.
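
The failure mode described above can be reproduced outside of the relay
daemon. The following is a minimal sketch, not the relay daemon's code:
`toy_chunk`, `close_chunk()` and `double_rename_fails()` are hypothetical
names, and the scratch directory is an assumption of the example. Two
independent objects own the same directory; the first close renames it,
the second finds nothing left to rename.

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical stand-in for a trace chunk that "owns" an output directory.
struct toy_chunk {
    fs::path parent;      // session output directory
    std::string tmp_name; // directory name while the chunk is active
};

// Closing a chunk renames its directory to the final chunk name.
// Returns the error code of the rename (zero on success).
std::error_code close_chunk(const toy_chunk &chunk, const std::string &final_name)
{
    std::error_code ec;
    fs::rename(chunk.parent / chunk.tmp_name, chunk.parent / final_name, ec);
    return ec;
}

// Two distinct chunks own the same directory, as in the multi-domain case.
// Returns true when the first close succeeds and the second fails because
// the source directory no longer exists.
bool double_rename_fails(const fs::path &scratch)
{
    assert(!scratch.empty());
    fs::remove_all(scratch);
    fs::create_directories(scratch / ".tmp_new_chunk");

    const toy_chunk ust_chunk{scratch, ".tmp_new_chunk"};
    const toy_chunk kernel_chunk{scratch, ".tmp_new_chunk"};

    const std::error_code first = close_chunk(ust_chunk, "20220112T145604-0500-1");
    const std::error_code second = close_chunk(kernel_chunk, "20220112T145604-0500-1");

    fs::remove_all(scratch);
    return !first && second == std::errc::no_such_file_or_directory;
}
```

Calling `double_rename_fails()` with a disposable path under the system
temporary directory is enough to observe the second rename failing.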

Solution
========

Using different trace chunk instances for relay sessions belonging to
a single sessiond session goes against the intended use of the sessiond
trace chunk registry.

A sessiond trace chunk registry allows the relay daemon to share trace
chunks used by different "relay sessions" when they were created for
the same user-visible session daemon session. Tracing multiple domains
(e.g. ust and kernel) results in per-domain relay sessions being
created.

Sharing trace chunks, and their output directory more specifically, is
essential to properly implement session rotations. The sharing of
output directory handles allows directory renames to be performed once
and without the races that would stem from multiple renames.

The reason why the sessiond trace chunk registry returns different
trace chunk instances for two relay sessions is that the wrong session
`id` is used to publish trace chunks.

The `id` that must be used to share trace chunks across the relay
sessions that belong to the same sessiond session is `id_sessiond`.

`id_sessiond` is optional as it is only provided by consumers v2.11+.
Otherwise, it is fine to use the relay session `id`: it is a unique id
for a given session daemon instance and those consumers will not issue
a session rotation (or clear) as the feature didn't exist.

A reference counting bug revealed by this change is also fixed in the
implementation of the sessiond trace chunk registry.

When the trace chunk is first published, two references to the
published chunk exist. One is taken by the registry while the other is
being returned to the caller. In the use case of the relay daemon, the
reference held by the registry itself is undesirable. We want the trace
chunk to be removed from the registry as soon as it is not being used
by the relay daemon (through a session or a stream). This differs from
the behaviour of the consumer daemon which relies on an explicit
command from the session daemon to release the registry's reference.
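
The id selection and the registry behaviour described above can be
sketched as follows. This is an illustration under stated assumptions,
not the patch itself: `toy_relay_session`, `registry_session_id()`,
`toy_chunk` and `toy_registry` are hypothetical names, and
`std::weak_ptr` is only an analogy for a registry that holds no
reference of its own.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <optional>
#include <utility>

// Hypothetical, simplified relay session: only the fields the text mentions.
struct toy_relay_session {
    std::uint64_t id;                         // relay session id
    std::optional<std::uint64_t> id_sessiond; // only set by consumers v2.11+
};

// Prefer the sessiond session id; fall back to the relay session id.
std::uint64_t registry_session_id(const toy_relay_session &session)
{
    return session.id_sessiond.value_or(session.id);
}

struct toy_chunk {
    std::uint64_t chunk_id;
};

// Toy registry keyed by (session id, chunk id). It only stores weak
// references, so a chunk expires once its last user drops it.
class toy_registry {
public:
    std::shared_ptr<toy_chunk> publish(std::uint64_t session_id, std::uint64_t chunk_id)
    {
        auto &slot = chunks_[{session_id, chunk_id}];
        if (auto existing = slot.lock()) {
            // Already published: hand out a reference to the same chunk.
            return existing;
        }
        auto fresh = std::make_shared<toy_chunk>(toy_chunk{chunk_id});
        slot = fresh;
        return fresh;
    }

private:
    std::map<std::pair<std::uint64_t, std::uint64_t>, std::weak_ptr<toy_chunk>> chunks_;
};
```

With this keying, the ust and kernel relay sessions of one sessiond
session publish under the same key and receive the same chunk, while
publishing under their distinct relay session ids yields two chunks,
which is the situation that led to the double rename.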

In cases where the trace chunk had already been published, the
reference belonging to the sessiond trace chunk registry instance has
already been 'put' by the first publication. We must simply return the
published trace chunk with a reference taken on behalf of the caller.

Known drawbacks
===============

None.

Change-Id: Ic33443b114a87574a1b26ac5ccb022e47f886ddd
Signed-off-by: Jérémie Galarneau
---