From: Jonathan Rajotte Date: Thu, 8 Jul 2021 18:17:51 +0000 (-0400) Subject: Fix: ust: app stuck on recv message during UST comm timeout scenario X-Git-Url: https://git.lttng.org/?p=lttng-tools.git;a=commitdiff_plain;h=48c77bf7ccc9a88c229432a6db4871940914bedb Fix: ust: app stuck on recv message during UST comm timeout scenario Observed issue ============== The following scenario lead to the UST thread to be "stuck" on recvmsg on the notify socket. The problem manifest itself when an application is unresponsive during the ustctl_start_session call. Note that the default timeout for ust communication is 5 seconds. # Start an instrumented app ./app gdb lttng-sessiond # put a breakpoint on ustctl_start_session lttng create my_session lttng enable-event -u -a lttng start # The tracepoint should hit. Do not continue. kill -s SIGSTOP $(pgrep app) # Continue lttng-sessiond. sleep 5 # This make sure lttng-sessiond unregister the app from its point of view kill -s SIGCONT $(pgrep app) gdb -p $(pgrep app) thread apply all bt App stack trace: Thread 3 (Thread 0x7fe2c6f58700 (LWP 48172)): #0 __libc_recvmsg (flags=0, msg=0x7fe2c6f56ac0, fd=4) at ../sysdeps/unix/sysv/linux/recvmsg.c:28 #1 __libc_recvmsg (fd=fd@entry=4, msg=msg@entry=0x7fe2c6f56ac0, flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/recvmsg.c:25 #2 0x00007fe2c7a010ba in ustcomm_recv_unix_sock (sock=sock@entry=4, buf=buf@entry=0x7fe2c6f56ea0, len=len@entry=48) at lttng-ust-comm.c:308 #3 0x00007fe2c7a037c3 in ustcomm_register_channel (sock=4, session=session@entry=0x7fe2c0000ba0, session_objd=, channel_objd=, nr_ctx_fields=nr_ctx_fields@entry=0, ctx_fields=, chan_id=0x7fe2 c6f5716c, header_type=0x7fe2c0012b18) at lttng-ust-comm.c:1544 #4 0x00007fe2c7a10787 in lttng_session_enable (session=0x7fe2c0000ba0) at lttng-events.c:444 #5 0x00007fe2c7a0b785 in lttng_session_cmd (objd=1, cmd=128, arg=140611977311672, uargs=0x7fe2c6f57800, owner=0x7fe2c7a5da00 ) at lttng-ust-abi.c:576 #6 0x00007fe2c7a07d6d in handle_message (lum=0x7fe2c6f57590, sock=3, sock_info=0x7fe2c7a5da00 ) at lttng-ust-comm.c:1003 #7 ust_listener_thread (arg=0x7fe2c7a5da00 ) at lttng-ust-comm.c:1712 #8 0x00007fe2c7993609 in start_thread (arg=) at pthread_create.c:477 #9 0x00007fe2c78ba293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 ... Cause ===== When the app continues after the timeout from lttng-sessiond side, the actual start_session message is received on the application side then UST, app side, send commands on the notify socket. On lttng-sessiond side, the command is received but no reply is sent. This is due to the fact that the lookup against the ust_app_ht_by_notify_sock hash table (find_app_by_notify_sock) return nothing since the app is unregistered at this point and the hash table node was removed on unregistration. Solution ======== When the app lookup fails, return an error that will trigger the cleanup of the notify socket. Known drawbacks ========= None Note ========= Subsequent error path in reply_ust_register_channel, add_event_ust_registry, and add_enum_ust_registry might lead to the same type of problem since no reply is sent to the app. Still, for those cases the complete application/notify socket should not be destroyed since the error path relate to either a session or a sub object of a session. Signed-off-by: Jonathan Rajotte Change-Id: Iea0dc027ca1ee772e84c7e545114f1be69fd1f63 Signed-off-by: Jérémie Galarneau --- diff --git a/src/bin/lttng-sessiond/ust-app.c b/src/bin/lttng-sessiond/ust-app.c index 2dc13da09..b18988560 100644 --- a/src/bin/lttng-sessiond/ust-app.c +++ b/src/bin/lttng-sessiond/ust-app.c @@ -6349,7 +6349,7 @@ static int reply_ust_register_channel(int sock, int cobjd, if (!app) { DBG("Application socket %d is being torn down. Abort event notify", sock); - ret = 0; + ret = -1; goto error_rcu_unlock; } @@ -6470,7 +6470,7 @@ static int add_event_ust_registry(int sock, int sobjd, int cobjd, char *name, if (!app) { DBG("Application socket %d is being torn down. Abort event notify", sock); - ret = 0; + ret = -1; goto error_rcu_unlock; } @@ -6576,6 +6576,7 @@ static int add_enum_ust_registry(int sock, int sobjd, char *name, DBG("Application socket %d is being torn down. Aborting enum registration", sock); free(entries); + ret = -1; goto error_rcu_unlock; }