From b6ae2a9587bfc0f38e3256c869fc822a27e1100c Mon Sep 17 00:00:00 2001 From: Mathieu Desnoyers Date: Fri, 19 Nov 2021 11:12:25 -0500 Subject: [PATCH] Fix: relayd: live: erroneous message timestamp observed from live viewer MIME-Version: 1.0 Content-Type: text/plain; charset=utf8 Content-Transfer-Encoding: 8bit Observed issue ============== Another situation where erroneous message timestamp is observed by the live viewer. Happens rarely (only two occurrences while running ust_clear in stress-tests overnight on a 16-core machine). Triggered with the following test: # Test ust streaming live clear with viewer with new metadata after clear # Parameters: tracing_active=0, clear_twice=1, buffer_type=pid Babeltrace 2 error: 11-19 06:21:02.571 488497 488497 E PLUGIN/SRC.CTF.LTTNG-LIVE handle_late_message@lttng-live.c:1206 [lttng-live] Invalid live stream state: have a late message that is not a packet discarded or event discarded message: late-msg-type=PACKET_BEGINNING 11-19 06:21:02.578 488497 488497 E PLUGIN/SRC.CTF.LTTNG-LIVE next_stream_iterator_for_trace@lttng-live.c:1360 [lttng-live] Late message could not be handled correctly: lttng-live-msg-iter-addr=0x55b4cdee1590, stream-name="stream-1024", curr-msg-ts=1637302861380968303, last-msg-ts=1637302861472798701 11-19 06:21:02.578 488497 488497 E PLUGIN/SRC.CTF.LTTNG-LIVE lttng_live_msg_iter_next@lttng-live.c:1802 [lttng-live] Error preparing the next batch of messages: live-iter-status=LTTNG_LIVE_ITERATOR_STATUS_ERROR 11-19 06:21:02.579 488497 488497 W LIB/MSG-ITER bt_message_iterator_next@iterator.c:866 Component input port message iterator's "next" method failed: iter-addr=0x55b4cdecfe30, iter-upstream-comp-name="lttng-live", iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=SOURCE, iter-upstream-comp-class-name="lttng-live", iter-upstream-comp-class-partial-descr="Connect to an LTTng relay daemon", iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR Cause ===== viewer_get_next_index() does not protect its use of the session ongoing_rotation state by any synchronization mechanism. It only takes the stream lock, which is not used to protect changes to the ongoing rotation state with respect to the chunk rename operation performed by chunk creation at the beginning of the clear command. Solution ======== Protect the use of the session ongoing rotation state and file open operations by the session lock in viewer_get_next_index(). Known drawbacks =============== I don't expect this to cause any real scalability concern considering the fact that the relay daemon has only two threads, one to handle session daemon commands, and the other to handle viewer commands. Signed-off-by: Mathieu Desnoyers Signed-off-by: Jérémie Galarneau Change-Id: I242942ac6bc7a9199a17cd9800fc461a397b8955 --- src/bin/lttng-relayd/live.cpp | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/src/bin/lttng-relayd/live.cpp b/src/bin/lttng-relayd/live.cpp index 96ec11efc..46faea031 100644 --- a/src/bin/lttng-relayd/live.cpp +++ b/src/bin/lttng-relayd/live.cpp @@ -1636,6 +1636,13 @@ int viewer_get_next_index(struct relay_connection *conn) metadata_viewer_stream = ctf_trace_get_viewer_metadata_stream(ctf_trace); + /* + * Hold the session lock to protect against concurrent changes + * to the chunk files (e.g. rename done by clear), which are + * protected by the session ongoing rotation state. Those are + * synchronized with the session lock. + */ + pthread_mutex_lock(&rstream->trace->session->lock); pthread_mutex_lock(&rstream->lock); /* @@ -1839,6 +1846,7 @@ int viewer_get_next_index(struct relay_connection *conn) send_reply: if (rstream) { pthread_mutex_unlock(&rstream->lock); + pthread_mutex_unlock(&rstream->trace->session->lock); } if (metadata_viewer_stream) { @@ -1880,6 +1888,7 @@ end: error_put: pthread_mutex_unlock(&rstream->lock); + pthread_mutex_unlock(&rstream->trace->session->lock); if (metadata_viewer_stream) { viewer_stream_put(metadata_viewer_stream); } -- 2.34.1