From: Jérémie Galarneau Date: Tue, 28 Jul 2020 15:29:18 +0000 (-0400) Subject: Fix: relayd: live connection fails to open file during clear X-Git-Tag: v2.13.0-rc1~556 X-Git-Url: https://git.lttng.org/?p=lttng-tools.git;a=commitdiff_plain;h=9edaf114d28249f4740de16bc9f58c43cfe8042e;hp=9edaf114d28249f4740de16bc9f58c43cfe8042e Fix: relayd: live connection fails to open file during clear Issue observed ============== A `session clear` test occasionaly fails on the CI (very rarely, and more often on PowerPC executors for some reason) with babeltrace reporting that the live connection was closed by the remote side: PASS: tools/clear/test_ust 276 - Waiting for live viewers on url: net://localhost 07-21 16:17:07.058 23855 23855 E PLUGIN/SRC.CTF.LTTNG-LIVE/VIEWER lttng_live_recv@viewer-connection.c:198 [lttng-live] Remote side has closed connection 07-21 16:17:07.058 23855 23855 E PLUGIN/SRC.CTF.LTTNG-LIVE/VIEWER lttng_live_session_get_new_streams@viewer-connection.c:1701 [lttng-live] Error receiving get new streams reply 07-21 16:17:07.058 23855 23855 E PLUGIN/SRC.CTF.LTTNG-LIVE lttng_live_msg_iter_next@lttng-live.c:1665 [lttng-live] Error preparing the next batch of messages: live-iter-status=LTTNG_LIVE_ITERATOR_STATUS_ERROR 07-21 16:17:07.058 23855 23855 W LIB/MSG-ITER bt_message_iterator_next@iterator.c:864 Component input port message iterator's "next" method failed: iter-addr=0x1014d6a8, iter-upstream-comp-name="lttng-live", iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=SOURCE, iter-upstream-comp-class-name="lttng-live", iter-upstream-comp-class-partial-descr="Connect to an LTTng relay daemon", iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR 07-21 16:17:07.059 23855 23855 E PLUGIN/FLT.UTILS.MUXER muxer_upstream_msg_iter_next@muxer.c:454 [muxer] Upstream iterator's next method returned an error: status=ERROR 07-21 16:17:07.059 23855 23855 E PLUGIN/FLT.UTILS.MUXER validate_muxer_upstream_msg_iters@muxer.c:991 [muxer] Cannot validate muxer's upstream message iterator wrapper: muxer-msg-iter-addr=0x1014d668, muxer-upstream-msg-iter-wrap-addr=0x1014e210 07-21 16:17:07.059 23855 23855 E PLUGIN/FLT.UTILS.MUXER muxer_msg_iter_next@muxer.c:1415 [muxer] Cannot get next message: comp-addr=0x1014cca8, muxer-comp-addr=0x1015afd8, muxer-msg-iter-addr=0x1014d668, msg-iter-addr=0x1014d598, status=ERROR 07-21 16:17:07.059 23855 23855 W LIB/MSG-ITER bt_message_iterator_next@iterator.c:864 Component input port message iterator's "next" method failed: iter-addr=0x1014d598, iter-upstream-comp-name="muxer", iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=FILTER, iter-upstream-comp-class-name="muxer", iter-upstream-comp-class-partial-descr="Sort messages from multiple inpu", iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR 07-21 16:17:07.059 23855 23855 W LIB/GRAPH consume_graph_sink@graph.c:473 Component's "consume" method failed: status=ERROR, comp-addr=0x1014d128, comp-name="pretty", comp-log-level=WARNING, comp-class-type=SINK, comp-class-name="pretty", comp-class-partial-descr="Pretty-print messages (`text` fo", comp-class-is-frozen=1, comp-class-so-handle-addr=0x10159dd8, comp-class-so-handle-path="/home/jenkins/workspace/lttng-tools_stable-2.12_portbuild/arch/powerpc/babeltrace_version/stable-2.0/build/std/conf/agents/liburcu_version/stable-0.12/test_type/base/deps/build/lib/babeltrace2/plugins/babeltrace-plugin-text.so", comp-input-port-count=1, comp-output-port-count=0 07-21 16:17:07.059 23855 23855 E CLI cmd_run@babeltrace2.c:2548 Graph failed to complete successfully ERROR: [Babeltrace CLI] (babeltrace2.c:2548) Graph failed to complete successfully CAUSED BY [libbabeltrace2] (graph.c:473) Component's "consume" method failed: status=ERROR, comp-addr=0x1014d128, comp-name="pretty", comp-log-level=WARNING, comp-class-type=SINK, comp-class-name="pretty", comp-class-partial-descr="Pretty-print messages (`text` fo", comp-class-is-frozen=1, comp-class-so-handle-addr=0x10159dd8, comp-class-so-handle-path="/home/jenkins/workspace/lttng-tools_stable-2.12_portbuild/arch/powerpc/babeltrace_version/stable-2.0/build/std/conf/agents/liburcu_version/stable-0.12/test_type/base/deps/build/lib/babeltrace2/plugins/babeltrace-plugin-text.so", comp-input-port-count=1, comp-output-port-count=0 CAUSED BY [libbabeltrace2] (iterator.c:864) Component input port message iterator's "next" method failed: iter-addr=0x1014d598, iter-upstream-comp-name="muxer", iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=FILTER, iter-upstream-comp-class-name="muxer", iter-upstream-comp-class-partial-descr="Sort messages from multiple inpu", iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR CAUSED BY [muxer: 'filter.utils.muxer'] (muxer.c:991) Cannot validate muxer's upstream message iterator wrapper: muxer-msg-iter-addr=0x1014d668, muxer-upstream-msg-iter-wrap-addr=0x1014e210 CAUSED BY [muxer: 'filter.utils.muxer'] (muxer.c:454) Upstream iterator's next method returned an error: status=ERROR CAUSED BY [libbabeltrace2] (iterator.c:864) Component input port message iterator's "next" method failed: iter-addr=0x1014d6a8, iter-upstream-comp-name="lttng-live", iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=SOURCE, iter-upstream-comp-class-name="lttng-live", iter-upstream-comp-class-partial-descr="Connect to an LTTng relay daemon", iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR CAUSED BY [lttng-live: 'source.ctf.lttng-live'] (lttng-live.c:1665) Error preparing the next batch of messages: live-iter-status=LTTNG_LIVE_ITERATOR_STATUS_ERROR CAUSED BY [lttng-live: 'source.ctf.lttng-live'] (viewer-connection.c:1701) Error receiving get new streams reply CAUSED BY [lttng-live: 'source.ctf.lttng-live'] (viewer-connection.c:198) Remote side has closed connection ok 277 - Clear session J7WXjh7fmMleTCfE PASS: tools/clear/test_ust 277 - Clear session J7WXjh7fmMleTCfE ok 278 - Clear session J7WXjh7fmMleTCfE PASS: tools/clear/test_ust 278 - Clear session J7WXjh7fmMleTCfE ok 279 - Stop lttng tracing for session J7WXjh7fmMleTCfE PASS: tools/clear/test_ust 279 - Stop lttng tracing for session J7WXjh7fmMleTCfE ok 280 - Destroy session J7WXjh7fmMleTCfE PASS: tools/clear/test_ust 280 - Destroy session J7WXjh7fmMleTCfE # Wait for viewer to exit not ok 281 - Babeltrace succeeds Cause ===== Looking at the relay daemon logs, it appears that the live client requests an enumeration of the available streams while a rotation is ongoing (clear). Ultimately, this results in the relay daemon attempting to open a non-existing file: PERROR - 16:33:59.242388809 [734380/734387]: Failed to open fs handle to ust/uid/1000/64-bit/chan_0, open() returned: No such file or directory (in fd_tracker_open_fs_handle() at fd-tracker.c:550) The logs indicate that this file existed at some point. However, it was unlinked and its newest instance was created in a trace chunk named "20200727T163359-0400-1". This chunk name is a temporary name used until the original trace chunk can be unlinked (cleared) and the newest can be moved in its place. The file is being opened as part of the creation of a viewer stream when make_viewer_stream() fails to find it. This implies that, somehow, an outdated trace chunk is being used to open the viewer stream's file. The reason why is that make_viewer_stream is called with the viewer session's current trace chunk. During a rotation, the use of the viewer session's current trace chunk is touchy due to the way the switch-over to a new chunk is handled. How viewer session/stream trace chunks are rotated -------------------------------------------------- The viewer polls the relay daemon for new data/metadata to consume using the `GET_NEXT_INDEX` and `GET_METADATA` commands. Both commands will indicate to the viewer that it should try again later if a rotation is ongoing on the "side" of the relay session. When a rotation is not ongoing, the relay compares the `id` of the target viewer stream's trace chunk with the relay session's current trace chunk. If those `id`s don't match, the viewer session's current trace chunk is then updated to a copy of the relay session's trace chunk. The viewer stream's files are then closed and re-opened in the context of the viewer session's now-updated trace chunk. While the live protocol allows `GET_NEXT_INDEX` and `GET_METADATA` to communicate to the viewer that it should "retry" later, there is no such provisions made for the paths that lead to the creation of the viewer streams. This means viewer streams must be created even if a rotation is ongoing. Solution ======== If a rotation is ongoing at the moment of the creation of the viewer streams, we wish to use a copy of the trace chunk being used by the relay stream. This way, we can be sure that the streams' files exist. It is okay for viewer streams to hold references to different copies of the trace chunks since no user-visible actions are performed when the reference to those chunks is released. This is different from relay streams where the detection of the completion of a rotation is done when the last relay stream releases its reference to a specific trace chunk instance. Known drawbacks =============== None beyond a slight increase (temporary until the next rotation) in the number of FDs used when a client connects during a session rotation. Note ==== Since make_viewer_streams now acts on relay and viewer counterparts of the session and stream objects, the various variables are prefixed with `relay` and `viewer` to make the code easier to understand. The locking period of the relay stream is extended to most of the iteration in make_viewer_stream() rather than only in viewer_stream_create(). As a result, callers of viewer_stream_create() must hold the relay stream's lock. Signed-off-by: Jérémie Galarneau Change-Id: I3b9c52a57835134712eadba0f68f6a21e922356b ---