From: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Date: Thu, 14 May 2020 18:24:17 +0000 (-0400)
Subject: Fix: consumerd: live client receives incomplete metadata
X-Git-Tag: v2.13.0-rc1~631
X-Git-Url: https://git.lttng.org/?p=lttng-tools.git;a=commitdiff_plain;h=f5ba75b4f0c0b44092c76bc931b25b24a2e62718;hp=f5ba75b4f0c0b44092c76bc931b25b24a2e62718

Fix: consumerd: live client receives incomplete metadata

Observed issue
==============

Babeltrace 1.5.x and Babeltrace 2.x can both report errors (albeit
differently) when using the "lttng-live" protocol that imply that the
metadata they received is incomplete.

For instance, babeltrace 1.5.3 reports the following error:

```
[error] Error creating AST
[error] [Context] Cannot open_mmap_trace of format ctf.
[error] Error adding trace
[warning] [Context] Cannot open_trace of format lttng-live at path net://localhost:xxxx/host/session/live_session.
[warning] [Context] cannot open trace "net://localhost:xxxx/host/session/live_session" for reading.
[error] opening trace "net://localhost:xxxx/host/session/live_session" for reading.
[error] none of the specified trace paths could be opened.
```

While debugging both viewers, I noticed that both were attempting to
receive the available metadata before consuming the "data" streams'
content.

Typically, the following exchange between the relay daemon and the
lttng-live client occurs when the problem is observed:

bt lttng-live:
    emits LTTNG_VIEWER_GET_METADATA command
relayd:
    returns LTTNG_VIEWER_METADATA_OK, len = 4096 (default packet size)
bt lttng-live:
    consume 4096 bytes of metadata
    emits LTTNG_VIEWER_GET_METADATA command
relayd:
    returns LTTNG_VIEWER_NO_NEW_METADATA

When the lttng-live client receives the LTTNG_VIEWER_NO_NEW_METADATA
status code, it attempts to parse all the metadata it has received
since the last LTTNG_VIEWER_NO_NEW_METADATA reply. In effect, it is
expected that this forms a logical unit of metadata that is parseable
on its own.

If this is the first time metadata is received for that trace, the
metadata is expected to contain a trace declaration, packet header
declaration, etc.

If metadata was already received, it is expected that the newly parsed
declarations can be "appended" to the existing trace schema.

It appears that the relay daemon sends the
LTTNG_VIEWER_NO_NEW_METADATA while the metadata it has sent up to that
point is not parseable on its own.

The live protocol description does not require or imply that a viewer
should attempt to parse metadata packets until it hopefully succeeds
at some point. Anyhow:
  1) This would make it impossible for a live viewer to correctly
  handle a corrupted metadata stream beyond retrying forever,
  2) This behaviour is not implemented by the two reference
  implementations of the protocol.

Cause
=====

The relay daemon provides a guarantee that it will send any available
metadata before allowing a data stream packet to be served to the
client.

In other words, a client requesting a data packet will receive the
LTTNG_VIEWER_FLAG_NEW_METADATA status code (and no data) if it
attempts to get a data stream packet while the relay daemon has
metadata already available.

This guarantee is properly enforced as far as I can tell. However,
looking at the consumer daemon implementation, it appears that
metadata packets are sent as soon as they are available.

A metadata packet is not guaranteed to be parseable on its own. For
instance, it can end in the middle the an event declaration.

Hence, this hints at a race involving the tracer, the consumer daemon,
the relay daemon, and the lttng-live client.

Consider the following scenario:
  - Metadata packets (sub-buffers) are configured to be 4kB in size,
  - a large number of kernel events are enabled (e.g. --kernel --all),
  - the network connection between the consumer and relay daemons is
    slow

1) The kernel tracer will produce enough TSDL metadata to fill the
   first sub-buffer of the "metadata" ring-buffer and signal the
   consumer daemon that a buffer is ready. The tracer then starts
   writing the remaining data in the following available sub-buffers.

2) The consumer daemon metadata thread is woken up and consumes the
   first metadata sub-buffer and sends it to the relay daemon.

3) A live client establishes an lttng-live connection to the relay
   daemon and attempts to consume the available metadata. It receives
   the first packet and, since the relay daemon doesn't know about any
   follow-up metadata, receives LTTNG_VIEWER_NO_NEW_METADATA on the
   next attempt.

4) Having received LTTNG_VIEWER_NO_NEW_METADATA, the lttng-live client
   attempts to parse the metadata it has received and fails.

This scenario is easy to reproduce by inserting a "sleep(1)" at
src/bin/lttng-relayd/main.c:1978 (as of this revision). This simulates
a relay daemon that would be slow to receive/process metadata packets
from the consumer daemon.

This problem similarly applies to the user space tracer.

Solution
========

Having no knowledge of TSDL, the relay daemon can't "bundle" packets
of metadata until they form a "parseable unit" to send to the consumer
daemon.

To provide the parseability guarantee expected by the viewers, and by
the relay daemon implicitly, we need to ensure that the consumer
daemons only send "parseable units" of metadata to the relay daemon.

Unfortunately, the consumer daemons do not know how to parse TSDL
either. In fact, only the metadata producers are able to provide the
boundaries of the "parseable units" of metadata.

The general idea of the fix is to accumulate metadata up to a point
where a "parseable unit" boundary has been identified and send that
content in one request to the relay daemon. Note that the solution
described here only concerns the live mode. In other cases, the
mechanisms described are simply bypassed.

A "metadata bucket" is added to lttng_consumer_stream when it is
created from a live channel. This bucket is filled until the
consumption position reaches the "parseable unit" end position.

A refresher about the handling of metadata in live mode
-------------------------------------------------------

Three "events" are of interest here and can cause metadata to be
consumed more or less indirectly:
  1) A metadata packet is closed, causing the metadata thread to wake
     up
  2) The live timer expires
  3) A data sub-buffer is closed, causing the data thread to wake-up

1) The first case is simple and happens regardless of whether or not
   the tracing session is in live mode or not. Metadata is always
   consumed by the metadata thread in the same way. However, this
   scenario can be "caused" by (2) and (3). See [1]. A sub-buffer is
   "acquired" from the metadata ring-buffer and sent to the relayd
   daemon as the payload of a "RELAYD_SEND_METADATA" command.

2) When the live timer expires [2], the 'check_stream' function is
   called on all data streams of the session. As its name clearly
   implies, this function is responsible for flushing all streams or
   sending a "live beacon" (called an "empty index" in the code) if
   there is no data to flush. Any flushed data will result in (3).

3) When a data sub-buffer is ready to be consumed, [1] is invoked
   by the data thread. This function acquires a sub-buffer and sends
   it to the relay daemon through the data connection.

   Then, an important synchronization step takes place. The index of
   the newly-sent packet will be sent through the control
   connection. The relay daemon waits for both the data packet and its
   matching index before making the new packet visible to live
   viewers.

   Since a data packet could contain data that requires "newer"
   metadata to be decoded, the data thread flushes the metadata stream
   and enters a "waiting" phase to pause until all metadata present in
   the metadata ring buffer has been consumed [3].

   At the end of this waiting phase, the data thread sends the data
   packet's index to the relay daemon, allowing the relayd to make it
   visible to its live clients.

How to identify a "parseable unit" boundary?
--------------------------------------------

In the case of the kernel domain, the kernel tracer produces the
actual TSDL descriptions directly. The TSDL metadata is serialized to
a metadata cache and is flushed "just in time" to the metadata
ring-buffer when a "get next" operation is performed.

There is no way, from user space, to query whether or not the metadata
cache of the kernel tracer is empty. Hence, a new
RING_RING_BUFFER_GET_NEXT_SUBBUF_METADATA_CHECK command was added to
query whether or not the kernel tracer's metadata cache is empty when
acquiring a sub-buffer.

This allows the consumer daemon to identify a "coherent" position in
the metadata stream that is safe to use as a "parseable unit"
boundary.

As for the user space domain, since the session daemon is responsible
for generating the TSDL representation of the metadata, there is no
need to change LTTng-ust APIs.

The session daemon generates coherent units of metadata and adds them
to its "registry" at once (protected by the registry's lock). It then
flushes the contents to the consumer daemon and waits for that data to
be consumed before proceeding further.

On the consumer daemon side, the metadata cache is filled with the
newly-produced contents. This is done atomically with respect to
accesses to the metadata cache as all accesses happen through a
dedicated metadata cache lock.

When the consumer's metadata polling thread is woken-up, it will
attempt to acquire (`get_next`) a sub-buffer from the metadata stream
ring-buffer. If it fails, it will flush a sub-buffer's worth of
metadata to the ring-buffer and attempt to acquire a sub-buffer again.

At this point, it is possible to determine if that sub-buffer is the
last one of a parseable metadata unit: the cache must be empty and the
ring-buffer must be empty following the consumption of this
sub-buffer. When those conditions are met, the resulting metadata
`stream_subbuffer` is tagged as being `coherent`.

Metadata bucket
---------------

A helper interface, metadata_bucket, is introduced as part of this
fix. A metadata_bucket is `fill`ed with `stream_subbuffer`s, and is
eventually `flushed` when it is filled by a `coherent` sub-buffer.

As older versions of LTTng-modules must remain supported, this new
helper is not used when the
RING_RING_BUFFER_GET_NEXT_SUBBUF_METADATA_CHECK operation is not
available. When the operation is available, the metadata stream's
bucketization is enabled, causing a bucket to be created and the
`consume` callback to be swapped.

The `consume` callback of the metadata streams is replaced by a new
implementation when the metadata bucketization is activated on the
stream. This implementation returns the padded size of the consumed
sub-buffer when they could be added to the bucket. When the bucket is
flushed, the regular `mmap`-based consumption function is called with
the bucket's contents.

Known drawbacks
===============

This implementation causes the consumer daemon to buffer the whole
initial unit of metadata before sending it. In practice, this is not
expected to be a problem since the largest metadata files we have seen
in real use are a couple of megabytes wide.

Beyond the (temporary) memory use, this causes the metadata thread to
block while this potentially large chunk of metadata is sent (rather
than blocking while sending 4kb at a time).

The second point is just a consequence of existing shortcomings of the
consumerd; slow IO should not affect other unrelated streams. The
fundamental problem is that blocking IO is used and we should switch
to non-blocking communication if this is a problem (as is done in the
relay daemon).

The first point is more problematic given the existing tracer APIs.
If the tracer could provide the boundary of a "parseable unit" of
metadata, we could send the header of the RELAYD_SEND_METADATA command
with that size and send the various metadata packets as they are made
available. This would make no difference to the relay daemon as it is
not blocking on that socket and will not make the metadata size change
visible to the "live server" until it has all been received.

This size can't be determined right now since it could exceed the
total size of the "metadata" ring buffer. In other words, we can't wait
for the production of metadata to complete before starting to consume.

Finally, while implementing this fix, I also realized that the
computation of the rotation position of the metadata streams is
erroneous. The rotation code makes use of the ring-buffer's positions
to determine the rotation position. However, since both user space and
kernel domains make use of a "cache" behind the ring-buffer, that
cached content must be taken into account when computing the metadata
stream's rotation position.

References
==========

[1] https://github.com/lttng/lttng-tools/blob/d5ccf8fe0/src/common/consumer/consumer.c#L3433
[2] https://github.com/lttng/lttng-tools/blob/d5ccf8fe0/src/common/consumer/consumer-timer.c#L312
[3] https://github.com/lttng/lttng-tools/blob/d5ccf8fe0/src/common/consumer/consumer-stream.c#L492

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I40ee07e5c344c72d9aae2b9b15dc36c00b21e5fa
---