From e8b269fa71e5bccc0fd89bc7526bbcdf9f83222e Mon Sep 17 00:00:00 2001 From: Mathieu Desnoyers Date: Wed, 17 Nov 2021 11:55:36 -0500 Subject: [PATCH] Fix: relayd: live: metadata stream reference count < 0 assert MIME-Version: 1.0 Content-Type: text/plain; charset=utf8 Content-Transfer-Encoding: 8bit Observed issue ============== While running tests/regression/tools/clear/test_ust test in a loop we eventually witness the following error: The symptom on the Babeltrace side is a Connection reset by peer. This is caused by a relayd abort after an assertion failure due to a reference count being lower than 0. Test case: # Test ust streaming live clear with viewer # Parameters: tracing_active=0, clear_twice=0, buffer_type=pid Tools commit: https://review.lttng.org/c/lttng-tools/+/6754/1 Relay daemon assertion failure stack trace: (gdb) bt #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 #1 0x00007ffa2e850859 in __GI_abort () at abort.c:79 #2 0x00007ffa2e850729 in __assert_fail_base (fmt=0x7ffa2e9e6588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x56221bdb96f4 "res >= 0", file=0x56221bdb97a0 "/root/virtenv/usr/include/urcu/ref.h", line=66, function=) at assert.c:92 #3 0x00007ffa2e861f36 in __GI___assert_fail (assertion=assertion@entry=0x56221bdb96f4 "res >= 0", file=file@entry=0x56221bdb97a0 "/root/virtenv/usr/include/urcu/ref.h", line=line@entry=66, function=function@entry=0x56221bdbd638 <__PRETTY_FUNCTION__.7554> "urcu_ref_put") at assert.c:101 #4 0x000056221bd6a1cc in urcu_ref_put (release=, ref=0x7ffa24008cb0) at /root/virtenv/usr/include/urcu/ref.h:66 #5 viewer_stream_put (vstream=vstream@entry=0x7ffa24008cb0) at viewer-stream.c:279 #6 0x000056221bd5e4c5 in viewer_get_metadata (conn=conn@entry=0x7ffa0c000fc0) at live.c:2211 #7 0x000056221bd63778 in process_control (conn=0x7ffa0c000fc0, recv_hdr=0x7ffa297c5af0) at live.c:2376 #8 thread_worker (data=) at live.c:2541 #9 0x00007ffa2ea28609 in start_thread (arg=) at pthread_create.c:477 #10 0x00007ffa2e94d293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 Babeltrace crash: 11-17 05:13:05.197 47007 47007 D PLUGIN/SRC.CTF.LTTNG-LIVE/VIEWER lttng_live_get_next_index@viewer-connection.c:1494 [live] Requesting next index for stream: cmd=GET_NEXT_INDEX, viewer-stream-id=3352 11-17 05:13:05.215 47007 47007 E PLUGIN/SRC.CTF.LTTNG-LIVE/VIEWER lttng_live_recv@viewer-connection.c:245 [live] Error receiving from Relay: Connection reset by peer. 11-17 05:13:05.215 47007 47007 E PLUGIN/SRC.CTF.LTTNG-LIVE/VIEWER lttng_live_get_next_index@viewer-connection.c:1522 [live] Error receiving get next index reply 11-17 05:13:05.215 47007 47007 D PLUGIN/SRC.CTF.LTTNG-LIVE lttng_live_iterator_next_msg_on_stream@lttng-live.c:1069 [live] Returning from advancing live stream iterator: status=LTTNG_LIVE_ITERATOR_STATUS_ERROR, stream-name="stream-3352", viewer-stream-id=3352 11-17 05:13:05.215 47007 47007 E PLUGIN/SRC.CTF.LTTNG-LIVE lttng_live_msg_iter_next@lttng-live.c:1802 [live] Error preparing the next batch of messages: live-iter-status=LTTNG_LIVE_ITERATOR_STATUS_ERROR 11-17 05:13:05.216 47007 47007 W LIB/MSG-ITER bt_message_iterator_next@iterator.c:866 Component input port message iterator's "next" method failed: iter-addr=0x563b7dd07700, iter-upstream-comp-name="live", iter-upstream-comp-log-level=TRACE, iter-upstream-comp-class-type=SOURCE, iter-upstream-comp-class-name="lttng-live", iter-upstream-comp-class-partial-descr="Connect to an LTTng relay daemon", iter-upstream-port-type=OUT PUT, iter-upstream-port-name="out", status=ERROR 11-17 05:13:05.216 47007 47007 E PLUGIN/FLT.UTILS.MUXER muxer_upstream_msg_iter_next@muxer.c:446 [mux] Upstream iterator's next method returned an error: status=ERROR 11-17 05:13:05.216 47007 47007 E PLUGIN/FLT.UTILS.MUXER validate_muxer_upstream_msg_iters@muxer.c:989 [mux] Cannot validate muxer's upstream message iterator wrapper: muxer-msg-iter-addr=0x563b7dd04740, muxer-upstream-msg-iter-wrap-addr=0x563b7dd0db30 11-17 05:13:05.216 47007 47007 E PLUGIN/FLT.UTILS.MUXER muxer_msg_iter_next@muxer.c:1417 [mux] Cannot get next message: comp-addr=0x563b7dd1a760, muxer-comp-addr=0x563b7dd1b930, muxer-msg-iter-addr=0x563b7dd04740, msg-iter-addr=0x563b7dd07590, status=ERROR 11-17 05:13:05.216 47007 47007 W LIB/MSG-ITER bt_message_iterator_next@iterator.c:866 Component input port message iterator's "next" method failed: iter-addr=0x563b7dd07590, iter-upstream-comp-name="mux", iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=FILTER, iter-upstream-comp-class-name="muxer", iter-upstream-comp-class-partial-descr="Sort messages from multiple inpu", iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR 11-17 05:13:05.216 47007 47007 W LIB/GRAPH consume_graph_sink@graph.c:462 Component's "consume" method failed: status=ERROR, comp-addr=0x563b7dd06d20, comp-name="pretty", comp-log-level=WARNING, comp-class-type=SINK, comp-class-name="pretty", comp-class-partial-descr="Pretty-print messages (`text` fo", comp-class-is-frozen=1, comp-class-so-handle-addr=0x563b7dd051b0, comp-class-so-handle-path="/root/virtenv/usr/lib/babeltr ace2/plugins/babeltrace-plugin-text.so", comp-input-port-count=1, comp-output-port-count=0 11-17 05:13:05.217 47007 47007 E CLI cmd_run@babeltrace2.c:2537 Graph failed to complete successfully Cause ===== Both relay and live threads can put the ownership reference on the metadata viewer stream concurrently without synchronization, thus leading to a reference count going lower than 0. The viewer stream ownership design initially planned for being owned by the live thread, thus allowing the live thread to put the ownership reference as soon as the associated relay stream is observed as closed, and the viewer stream is considered as hung up. However, in the specific case of the metadata viewer stream, the responsibility of closing the metadata viewer stream is shared between the relay and live threads, because the viewers expect to observe a LTTNG_VIEWER_NO_NEW_METADATA message before the metadata stream hangs up (see comment in viewer_get_metadata()). Therefore, if viewer_get_metadata() is done before the metadata stream is closed, the viewer will receive the LTTNG_VIEWER_NO_NEW_METADATA message, and set the no_new_metadata_notified state to true. It's then the relay thread's relay_close_stream() which will invoke the ownership put. However, the live thread may concurrently try to put the viewer stream ownership as well from a subsequent viewer_get_metadata(), thus leading to a reference count < 0. Solution ======== Fix this by putting the ownership reference from the live viewer thread rather than the relay thread. This can be done by tracking the state of no_new_metadata_notified within the live viewer thread. Known drawbacks =============== This will postpone reclaim of the metadata viewer stream from the relay stream close to the following viewer_get_metadata (after a LTTNG_VIEWER_NO_NEW_METADATA message has been replied), which I don't think is an issue. Signed-off-by: Mathieu Desnoyers Signed-off-by: Jérémie Galarneau Change-Id: I9783d3a8402dddaa364a7a69c0764eaf30c16fde --- src/bin/lttng-relayd/live.cpp | 4 ++-- src/bin/lttng-relayd/main.cpp | 18 ------------------ 2 files changed, 2 insertions(+), 20 deletions(-) diff --git a/src/bin/lttng-relayd/live.cpp b/src/bin/lttng-relayd/live.cpp index bfef8b9f9..c169be918 100644 --- a/src/bin/lttng-relayd/live.cpp +++ b/src/bin/lttng-relayd/live.cpp @@ -2006,11 +2006,11 @@ int viewer_get_metadata(struct relay_connection *conn) * an error. */ if (vstream->metadata_sent > 0) { - vstream->stream->no_new_metadata_notified = true; - if (vstream->stream->closed) { + if (vstream->stream->closed && vstream->stream->no_new_metadata_notified) { /* Release ownership for the viewer metadata stream. */ viewer_stream_put(vstream); } + vstream->stream->no_new_metadata_notified = true; } goto send_reply; } diff --git a/src/bin/lttng-relayd/main.cpp b/src/bin/lttng-relayd/main.cpp index adbf3f3b4..73dfec2c8 100644 --- a/src/bin/lttng-relayd/main.cpp +++ b/src/bin/lttng-relayd/main.cpp @@ -1758,24 +1758,6 @@ static int relay_close_stream(const struct lttcomm_relayd_hdr *recv_hdr, * request. */ try_stream_close(stream); - if (stream->is_metadata) { - struct relay_viewer_stream *vstream; - - vstream = viewer_stream_get_by_id(stream->stream_handle); - if (vstream) { - if (stream->no_new_metadata_notified) { - /* - * Since all the metadata has been sent to the - * viewer and that we have a request to close - * its stream, we can safely teardown the - * corresponding metadata viewer stream. - */ - viewer_stream_put(vstream); - } - /* Put local reference. */ - viewer_stream_put(vstream); - } - } stream_put(stream); ret = 0; -- 2.34.1