From: Jérémie Galarneau Date: Thu, 14 May 2020 18:24:17 +0000 (-0400) Subject: Fix: consumerd: live client receives incomplete metadata X-Git-Tag: v2.13.0-rc1~631 X-Git-Url: https://git.lttng.org/?p=lttng-tools.git;a=commitdiff_plain;h=f5ba75b4f0c0b44092c76bc931b25b24a2e62718;hp=f5ba75b4f0c0b44092c76bc931b25b24a2e62718 Fix: consumerd: live client receives incomplete metadata Observed issue ============== Babeltrace 1.5.x and Babeltrace 2.x can both report errors (albeit differently) when using the "lttng-live" protocol that imply that the metadata they received is incomplete. For instance, babeltrace 1.5.3 reports the following error: ``` [error] Error creating AST [error] [Context] Cannot open_mmap_trace of format ctf. [error] Error adding trace [warning] [Context] Cannot open_trace of format lttng-live at path net://localhost:xxxx/host/session/live_session. [warning] [Context] cannot open trace "net://localhost:xxxx/host/session/live_session" for reading. [error] opening trace "net://localhost:xxxx/host/session/live_session" for reading. [error] none of the specified trace paths could be opened. ``` While debugging both viewers, I noticed that both were attempting to receive the available metadata before consuming the "data" streams' content. Typically, the following exchange between the relay daemon and the lttng-live client occurs when the problem is observed: bt lttng-live: emits LTTNG_VIEWER_GET_METADATA command relayd: returns LTTNG_VIEWER_METADATA_OK, len = 4096 (default packet size) bt lttng-live: consume 4096 bytes of metadata emits LTTNG_VIEWER_GET_METADATA command relayd: returns LTTNG_VIEWER_NO_NEW_METADATA When the lttng-live client receives the LTTNG_VIEWER_NO_NEW_METADATA status code, it attempts to parse all the metadata it has received since the last LTTNG_VIEWER_NO_NEW_METADATA reply. In effect, it is expected that this forms a logical unit of metadata that is parseable on its own. If this is the first time metadata is received for that trace, the metadata is expected to contain a trace declaration, packet header declaration, etc. If metadata was already received, it is expected that the newly parsed declarations can be "appended" to the existing trace schema. It appears that the relay daemon sends the LTTNG_VIEWER_NO_NEW_METADATA while the metadata it has sent up to that point is not parseable on its own. The live protocol description does not require or imply that a viewer should attempt to parse metadata packets until it hopefully succeeds at some point. Anyhow: 1) This would make it impossible for a live viewer to correctly handle a corrupted metadata stream beyond retrying forever, 2) This behaviour is not implemented by the two reference implementations of the protocol. Cause ===== The relay daemon provides a guarantee that it will send any available metadata before allowing a data stream packet to be served to the client. In other words, a client requesting a data packet will receive the LTTNG_VIEWER_FLAG_NEW_METADATA status code (and no data) if it attempts to get a data stream packet while the relay daemon has metadata already available. This guarantee is properly enforced as far as I can tell. However, looking at the consumer daemon implementation, it appears that metadata packets are sent as soon as they are available. A metadata packet is not guaranteed to be parseable on its own. For instance, it can end in the middle the an event declaration. Hence, this hints at a race involving the tracer, the consumer daemon, the relay daemon, and the lttng-live client. Consider the following scenario: - Metadata packets (sub-buffers) are configured to be 4kB in size, - a large number of kernel events are enabled (e.g. --kernel --all), - the network connection between the consumer and relay daemons is slow 1) The kernel tracer will produce enough TSDL metadata to fill the first sub-buffer of the "metadata" ring-buffer and signal the consumer daemon that a buffer is ready. The tracer then starts writing the remaining data in the following available sub-buffers. 2) The consumer daemon metadata thread is woken up and consumes the first metadata sub-buffer and sends it to the relay daemon. 3) A live client establishes an lttng-live connection to the relay daemon and attempts to consume the available metadata. It receives the first packet and, since the relay daemon doesn't know about any follow-up metadata, receives LTTNG_VIEWER_NO_NEW_METADATA on the next attempt. 4) Having received LTTNG_VIEWER_NO_NEW_METADATA, the lttng-live client attempts to parse the metadata it has received and fails. This scenario is easy to reproduce by inserting a "sleep(1)" at src/bin/lttng-relayd/main.c:1978 (as of this revision). This simulates a relay daemon that would be slow to receive/process metadata packets from the consumer daemon. This problem similarly applies to the user space tracer. Solution ======== Having no knowledge of TSDL, the relay daemon can't "bundle" packets of metadata until they form a "parseable unit" to send to the consumer daemon. To provide the parseability guarantee expected by the viewers, and by the relay daemon implicitly, we need to ensure that the consumer daemons only send "parseable units" of metadata to the relay daemon. Unfortunately, the consumer daemons do not know how to parse TSDL either. In fact, only the metadata producers are able to provide the boundaries of the "parseable units" of metadata. The general idea of the fix is to accumulate metadata up to a point where a "parseable unit" boundary has been identified and send that content in one request to the relay daemon. Note that the solution described here only concerns the live mode. In other cases, the mechanisms described are simply bypassed. A "metadata bucket" is added to lttng_consumer_stream when it is created from a live channel. This bucket is filled until the consumption position reaches the "parseable unit" end position. A refresher about the handling of metadata in live mode ------------------------------------------------------- Three "events" are of interest here and can cause metadata to be consumed more or less indirectly: 1) A metadata packet is closed, causing the metadata thread to wake up 2) The live timer expires 3) A data sub-buffer is closed, causing the data thread to wake-up 1) The first case is simple and happens regardless of whether or not the tracing session is in live mode or not. Metadata is always consumed by the metadata thread in the same way. However, this scenario can be "caused" by (2) and (3). See [1]. A sub-buffer is "acquired" from the metadata ring-buffer and sent to the relayd daemon as the payload of a "RELAYD_SEND_METADATA" command. 2) When the live timer expires [2], the 'check_stream' function is called on all data streams of the session. As its name clearly implies, this function is responsible for flushing all streams or sending a "live beacon" (called an "empty index" in the code) if there is no data to flush. Any flushed data will result in (3). 3) When a data sub-buffer is ready to be consumed, [1] is invoked by the data thread. This function acquires a sub-buffer and sends it to the relay daemon through the data connection. Then, an important synchronization step takes place. The index of the newly-sent packet will be sent through the control connection. The relay daemon waits for both the data packet and its matching index before making the new packet visible to live viewers. Since a data packet could contain data that requires "newer" metadata to be decoded, the data thread flushes the metadata stream and enters a "waiting" phase to pause until all metadata present in the metadata ring buffer has been consumed [3]. At the end of this waiting phase, the data thread sends the data packet's index to the relay daemon, allowing the relayd to make it visible to its live clients. How to identify a "parseable unit" boundary? -------------------------------------------- In the case of the kernel domain, the kernel tracer produces the actual TSDL descriptions directly. The TSDL metadata is serialized to a metadata cache and is flushed "just in time" to the metadata ring-buffer when a "get next" operation is performed. There is no way, from user space, to query whether or not the metadata cache of the kernel tracer is empty. Hence, a new RING_RING_BUFFER_GET_NEXT_SUBBUF_METADATA_CHECK command was added to query whether or not the kernel tracer's metadata cache is empty when acquiring a sub-buffer. This allows the consumer daemon to identify a "coherent" position in the metadata stream that is safe to use as a "parseable unit" boundary. As for the user space domain, since the session daemon is responsible for generating the TSDL representation of the metadata, there is no need to change LTTng-ust APIs. The session daemon generates coherent units of metadata and adds them to its "registry" at once (protected by the registry's lock). It then flushes the contents to the consumer daemon and waits for that data to be consumed before proceeding further. On the consumer daemon side, the metadata cache is filled with the newly-produced contents. This is done atomically with respect to accesses to the metadata cache as all accesses happen through a dedicated metadata cache lock. When the consumer's metadata polling thread is woken-up, it will attempt to acquire (`get_next`) a sub-buffer from the metadata stream ring-buffer. If it fails, it will flush a sub-buffer's worth of metadata to the ring-buffer and attempt to acquire a sub-buffer again. At this point, it is possible to determine if that sub-buffer is the last one of a parseable metadata unit: the cache must be empty and the ring-buffer must be empty following the consumption of this sub-buffer. When those conditions are met, the resulting metadata `stream_subbuffer` is tagged as being `coherent`. Metadata bucket --------------- A helper interface, metadata_bucket, is introduced as part of this fix. A metadata_bucket is `fill`ed with `stream_subbuffer`s, and is eventually `flushed` when it is filled by a `coherent` sub-buffer. As older versions of LTTng-modules must remain supported, this new helper is not used when the RING_RING_BUFFER_GET_NEXT_SUBBUF_METADATA_CHECK operation is not available. When the operation is available, the metadata stream's bucketization is enabled, causing a bucket to be created and the `consume` callback to be swapped. The `consume` callback of the metadata streams is replaced by a new implementation when the metadata bucketization is activated on the stream. This implementation returns the padded size of the consumed sub-buffer when they could be added to the bucket. When the bucket is flushed, the regular `mmap`-based consumption function is called with the bucket's contents. Known drawbacks =============== This implementation causes the consumer daemon to buffer the whole initial unit of metadata before sending it. In practice, this is not expected to be a problem since the largest metadata files we have seen in real use are a couple of megabytes wide. Beyond the (temporary) memory use, this causes the metadata thread to block while this potentially large chunk of metadata is sent (rather than blocking while sending 4kb at a time). The second point is just a consequence of existing shortcomings of the consumerd; slow IO should not affect other unrelated streams. The fundamental problem is that blocking IO is used and we should switch to non-blocking communication if this is a problem (as is done in the relay daemon). The first point is more problematic given the existing tracer APIs. If the tracer could provide the boundary of a "parseable unit" of metadata, we could send the header of the RELAYD_SEND_METADATA command with that size and send the various metadata packets as they are made available. This would make no difference to the relay daemon as it is not blocking on that socket and will not make the metadata size change visible to the "live server" until it has all been received. This size can't be determined right now since it could exceed the total size of the "metadata" ring buffer. In other words, we can't wait for the production of metadata to complete before starting to consume. Finally, while implementing this fix, I also realized that the computation of the rotation position of the metadata streams is erroneous. The rotation code makes use of the ring-buffer's positions to determine the rotation position. However, since both user space and kernel domains make use of a "cache" behind the ring-buffer, that cached content must be taken into account when computing the metadata stream's rotation position. References ========== [1] https://github.com/lttng/lttng-tools/blob/d5ccf8fe0/src/common/consumer/consumer.c#L3433 [2] https://github.com/lttng/lttng-tools/blob/d5ccf8fe0/src/common/consumer/consumer-timer.c#L312 [3] https://github.com/lttng/lttng-tools/blob/d5ccf8fe0/src/common/consumer/consumer-stream.c#L492 Signed-off-by: Jérémie Galarneau Change-Id: I40ee07e5c344c72d9aae2b9b15dc36c00b21e5fa ---