Fix: consumerd: fd still open after `lttng snapshot record` returns

author     Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
           Wed, 9 Feb 2022 19:23:18 +0000 (14:23 -0500)
committer  Jérémie Galarneau <jeremie.galarneau@efficios.com>
           Thu, 3 Mar 2022 16:47:18 +0000 (11:47 -0500)

Observed issue
==============

Using a snapshot output located on a pramfs mount:

  lttng snapshot record
  rm -rf /my_mount/my_trace_output

`rm` fails with ENOTEMPTY when it calls rmdir() on /my_mount/my_trace_output.

At that point, the lttng-consumerd daemon has an open fd on:
  /my_mount/my_trace_output/ust

Note that adding a sleep between the two commands "fixes" the issue.

Cause
=====

The reclamation of in-registry trace chunks can happen after the LTTng
CLI returns, since it is deferred using `call_rcu`:

```
static
void lttng_trace_chunk_release(struct urcu_ref *ref)

....

  if (chunk->in_registry_element) {
    struct lttng_trace_chunk_registry_element *element;

    element = container_of(chunk, typeof(*element), chunk);
    if (element->registry) {
      rcu_read_lock();
      cds_lfht_del(element->registry->ht, &element->trace_chunk_registry_ht_node);
      rcu_read_unlock();
      /* Deferred: both fini and free run after a grace period. */
      call_rcu(&element->rcu_node, free_lttng_trace_chunk_registry_element);
    } else {

```

The delayed reclamation of the `lttng_trace_chunk_registry_element` can
result in lttng-consumerd holding an open fd on the chunk's "chunk
directory", since close() is only performed during the chunk's "fini"
phase (`lttng_trace_chunk_fini`).
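The timing difference can be illustrated with a minimal, single-threaded C sketch. All names are hypothetical: `defer()` stands in for `call_rcu()`, `run_deferred()` for the end of a grace period, and `struct model_chunk` for the trace chunk; this is not the LTTng implementation. With the "before" release path the directory fd stays open until the deferred callbacks run, while with the "after" path it is closed as soon as the chunk is released.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdbool.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical stand-in for call_rcu(): callbacks are queued, not run. */
struct deferred_cb {
	void (*fn)(void *arg);
	void *arg;
};

#define MAX_DEFERRED 16
static struct deferred_cb deferred[MAX_DEFERRED];
static size_t nr_deferred;

static void defer(void (*fn)(void *), void *arg)
{
	if (nr_deferred == MAX_DEFERRED)
		abort();	/* Model limitation: fixed-size queue. */
	deferred[nr_deferred].fn = fn;
	deferred[nr_deferred].arg = arg;
	nr_deferred++;
}

/* Hypothetical stand-in for the end of a grace period. */
static void run_deferred(void)
{
	for (size_t i = 0; i < nr_deferred; i++)
		deferred[i].fn(deferred[i].arg);
	nr_deferred = 0;
}

/* Hypothetical chunk holding an open fd on its chunk directory. */
struct model_chunk {
	int dir_fd;
};

static struct model_chunk *model_chunk_create(const char *dir)
{
	struct model_chunk *chunk = malloc(sizeof(*chunk));

	if (!chunk)
		return NULL;
	chunk->dir_fd = open(dir, O_RDONLY | O_DIRECTORY);
	if (chunk->dir_fd < 0) {
		free(chunk);
		return NULL;
	}
	return chunk;
}

/* Plays the role of lttng_trace_chunk_fini(): closes the fd. */
static void model_chunk_fini(struct model_chunk *chunk)
{
	if (chunk->dir_fd >= 0) {
		close(chunk->dir_fd);
		chunk->dir_fd = -1;
	}
}

/* Before the fix: the deferred callback runs fini, then frees. */
static void free_element_before(void *arg)
{
	model_chunk_fini(arg);
	free(arg);
}

/* After the fix: the deferred callback only reclaims storage. */
static void free_element_after(void *arg)
{
	free(arg);
}

static void release_before(struct model_chunk *chunk)
{
	defer(free_element_before, chunk);
}

static void release_after(struct model_chunk *chunk)
{
	model_chunk_fini(chunk);	/* fd closed right away */
	defer(free_element_after, chunk);
}

/* True if fd currently refers to an open file description. */
static bool fd_is_open(int fd)
{
	return fcntl(fd, F_GETFD) != -1;
}
```

For example, after `release_before()`, `fd_is_open()` still reports the chunk's fd as open until `run_deferred()` is called; after `release_after()`, it reports the fd as closed immediately.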

Solution
========

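Since the trace chunk object is accessed through an RCU lookup +
reference count scheme, and its reference count is effectively zero at
that point, concurrent readers that still find the element during the
grace period cannot obtain a new reference to it and therefore never
touch its internal state. Only the element's storage must outlive the
grace period, so `lttng_trace_chunk_fini` can safely be moved out of
the `free_lttng_trace_chunk_registry_element` call_rcu callback.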

Known drawbacks
===============

Even if this solves the current situation, it is important to note that
the object that actually holds the fd (the directory handle) is itself
reference-counted and only closes the fd when it is released. This means
that we are still exposed to this problem if, at some point in the
future, the directory handle is shared and outlives the trace chunk.
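The remaining exposure can be sketched the same way, using a hypothetical reference-counted directory handle (`struct model_dir_handle`, `dir_handle_get()`, `dir_handle_put()`) as a stand-in for the real one. This single-threaded illustration assumes only that finalizing the chunk drops the chunk's own reference to the handle: if another holder took a reference, the fd stays open after the chunk is finalized.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdbool.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical refcounted directory handle; closes its fd on release. */
struct model_dir_handle {
	int refcount;	/* Single-threaded model: plain int. */
	int fd;
};

static struct model_dir_handle *dir_handle_open(const char *path)
{
	struct model_dir_handle *h = malloc(sizeof(*h));

	if (!h)
		return NULL;
	h->fd = open(path, O_RDONLY | O_DIRECTORY);
	if (h->fd < 0) {
		free(h);
		return NULL;
	}
	h->refcount = 1;
	return h;
}

static void dir_handle_get(struct model_dir_handle *h)
{
	h->refcount++;
}

/* The fd is only closed when the last reference is dropped. */
static void dir_handle_put(struct model_dir_handle *h)
{
	if (--h->refcount == 0) {
		close(h->fd);
		free(h);
	}
}

/* Hypothetical chunk: its fini only drops the chunk's own reference. */
struct model_chunk {
	struct model_dir_handle *chunk_directory;
};

static void model_chunk_fini(struct model_chunk *chunk)
{
	dir_handle_put(chunk->chunk_directory);
	chunk->chunk_directory = NULL;
}

/* True if fd currently refers to an open file description. */
static bool fd_is_open(int fd)
{
	return fcntl(fd, F_GETFD) != -1;
}
```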

Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I6da3948824bf8b092fc8248b1bb0263fdd5887be

src/common/trace-chunk.cpp

index ad7a983fbc78e50ca78db0c038aeae2f4244627e..9e804ce0197b25c74a66a0b211cce3fa8f77768a 100644
@@ -1857,7 +1857,6 @@ void free_lttng_trace_chunk_registry_element(struct rcu_head *node)
        struct lttng_trace_chunk_registry_element *element =
                        container_of(node, typeof(*element), rcu_node);
 
-       lttng_trace_chunk_fini(&element->chunk);
        free(element);
 }
 
@@ -1879,6 +1878,24 @@ void lttng_trace_chunk_release(struct urcu_ref *ref)
        if (chunk->in_registry_element) {
                struct lttng_trace_chunk_registry_element *element;
 
+               /*
+                * Release internal chunk attributes immediately and
+                * only use the deferred `call_rcu` work to reclaim the
+                * storage.
+                *
+                * This ensures that file handles are released as soon as
+                * possible which works around a problem we encounter with PRAM fs
+                * mounts (and possibly other non-POSIX compliant file systems):
+                * directories that contain files which are open can't be
+                * rmdir().
+                *
+                * This means that the recording of a snapshot could be
+                * completed, but that it would be impossible for the user to
+                * delete it until the deferred clean-up released the file
+                * handles to its contents.
+                */
+               lttng_trace_chunk_fini(chunk);
+
                element = container_of(chunk, typeof(*element), chunk);
                if (element->registry) {
                        rcu_read_lock();