From: Jonathan Rajotte Date: Mon, 6 Dec 2021 14:20:58 +0000 (-0500) Subject: Fix: abort on decrement_sem_count during concurrent tracing start and teardown X-Git-Url: http://git.lttng.org/?p=lttng-ust.git;a=commitdiff_plain;h=4d03f5699ca5129319b8ceb7baef85bbf05f3f3e Fix: abort on decrement_sem_count during concurrent tracing start and teardown Observed issue ============== The following backtrace has been reported: #0 __GI_raise (sig=sig@entry=6) at /usr/src/debug/glibc/2.31/git/sysdeps/unix/sysv/linux/raise.c:50 #1 0x0000007f90b3fdd4 in __GI_abort () at /usr/src/debug/glibc/2.31/git/stdlib/abort.c:79 #2 0x0000007f90b4bf50 in __assert_fail_base (fmt=0x7f90c3da98 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x7f9112cb90 "uatomic_read(&sem_count) >= count", file=file@entry=0x7f9112cb30 "/usr/src/debug/lttng-ust/2_2.13.0-r0/lttng-ust-2.13.0/src/lib/lttng-ust/lttng-ust-comm.c", line=line@entry=664, function=function@entry=0x7f911317e8 <__PRETTY_FUNCTION__.10404> "decrement_sem_count") at /usr/src/debug/glibc/2.31/git/assert/assert.c:92 #3 0x0000007f90b4bfb4 in __GI___assert_fail (assertion=assertion@entry=0x7f9112cb90 "uatomic_read(&sem_count) >= count", file=file@entry=0x7f9112cb30 "/usr/src/debug/lttng-ust/2_2.13.0-r0/lttng-ust-2.13.0/src/lib/lttng-ust/lttng-ust-comm.c", line=line@entry=664, function=function@entry=0x7f911317e8 <__PRETTY_FUNCTION__.10404> "decrement_sem_count") at /usr/src/debug/glibc/2.31/git/assert/assert.c:101 #4 0x0000007f910e3830 in decrement_sem_count (count=) at /usr/src/debug/lttng-ust/2_2.13.0-r0/lttng-ust-2.13.0/src/lib/lttng-ust/lttng-ust-comm.c:664 #5 0x0000007f910e5d28 in handle_pending_statedump (sock_info=0x7f9115c608 ) at /usr/src/debug/lttng-ust/2_2.13.0-r0/lttng-ust-2.13.0/src/lib/lttng-ust/lttng-ust-comm.c:737 #6 handle_message (lum=0x7f8dde46d8, sock=3, sock_info=0x7f9115c608 ) at /usr/src/debug/lttng-ust/2_2.13.0-r0/lttng-ust-2.13.0/src/lib/lttng-ust/lttng-ust-comm.c:1410 #7 ust_listener_thread (arg=0x7f9115c608 ) at /usr/src/debug/lttng-ust/2_2.13.0-r0/lttng-ust-2.13.0/src/lib/lttng-ust/lttng-ust-comm.c:2055 #8 0x0000007f90af73e0 in start_thread (arg=0x7fc27a82f6) at /usr/src/debug/glibc/2.31/git/nptl/pthread_create.c:477 #9 0x0000007f90bead5c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78 It turns out that the main thread is at that point iterating over the libraries destructors: Thread 3 (LWP 1983): #0 0x0000007f92a68a0c in _dl_fixup (l=0x7f9054e510, reloc_arg=432) at /usr/src/debug/glibc/2.31/git/elf/dl-runtime.c:69 #1 0x0000007f92a6ea3c in _dl_runtime_resolve () at ../sysdeps/aarch64/dl-trampoline.S:100 #2 0x0000007f905170f8 in __do_global_dtors_aux () from <....>/crash/work/rootfs/usr/lib/libbsd.so.0 #3 0x0000007f92a697f8 in _dl_fini () at /usr/src/debug/glibc/2.31/git/elf/dl-fini.c:138 #4 0x0000007f90b54864 in __run_exit_handlers (status=0, listp=0x7f90c65648 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at /usr/src/debug/glibc/2.31/git/stdlib/exit.c:108 #5 0x0000007f90b549f4 in __GI_exit (status=) at /usr/src/debug/glibc/2.31/git/stdlib/exit.c:139 #6 0x0000000000404c98 in a_function_name (....) at main.c:152 #7 0x0000000000404a98 in main (argc=3, argv=0x7fc27a8858, env=0x7fc27a8878) at main.c:97 Cause ===== An enable command is processed at the same time that the lttng-ust destructor is run. At the end of the command handling, `handle_pending_statedump` is called. Multiple variables from the `sock_info` struct are checked outside the UST lock at that point. lttng-ust-comm.c +1406: /* * Performed delayed statedump operations outside of the UST * lock. We need to take the dynamic loader lock before we take * the UST lock internally within handle_pending_statedump(). */ handle_pending_statedump(sock_info); Namely: registration_done statedump_pending initial_statedump_done `statedump_pending` is set during the enable command (`lttng_session_statedump`, lttng-events.c +631) in the same thread. As for `registration_done` and `initial_statedump_done` they are invariant from the registration of the app until `lttng_ust_cleanup` is called. `cleanup_sock_info` called by `lttng_ust_cleanup`, itself called by `lttng_ust_exit` resets the `registration_done` and `initial_statedump_done` fields. Note that these operations are done outside of the listener thread. Note that by that point `lttng_ust_exit` expects all "getters" on `sock_info` to fail while trying to acquire the UST lock due to `lttng_ust_comm_should_quit` set to 1. Note that the listener threads can still exist because we do not join them, we only execute pthread_cancel which is async. Clearly we are missing mutual exclusion provided by locking when accessing `registration_done` and `initial_statedump_done`. Solution ======== Here we can do better and simply not require any mutual exclusion based on locking. `registration_done` and `initial_statedump_done` only need to be reset to zero when we are not actually exiting (`lttng_ust_after_fork_child`). In this case, no concurrent listener thread exists at that point that could access those fields during the reset. Hence we can move the reset to only the non-exiting code path and alleviate the current situation. Known drawbacks =============== None. Signed-off-by: Jonathan Rajotte Signed-off-by: Mathieu Desnoyers Change-Id: I45ba3eaee20c49a3988837a87fa680ce0a6ed953 --- diff --git a/src/lib/lttng-ust/lttng-ust-comm.c b/src/lib/lttng-ust/lttng-ust-comm.c index b1c35480..25d79d9f 100644 --- a/src/lib/lttng-ust/lttng-ust-comm.c +++ b/src/lib/lttng-ust/lttng-ust-comm.c @@ -1459,8 +1459,7 @@ void cleanup_sock_info(struct sock_info *sock_info, int exiting) } sock_info->root_handle = -1; } - sock_info->registration_done = 0; - sock_info->initial_statedump_done = 0; + /* * wait_shm_mmap, socket and notify socket are used by listener @@ -1472,6 +1471,9 @@ void cleanup_sock_info(struct sock_info *sock_info, int exiting) if (exiting) return; + sock_info->registration_done = 0; + sock_info->initial_statedump_done = 0; + if (sock_info->socket != -1) { ret = ustcomm_close_unix_sock(sock_info->socket); if (ret) {