From: Jérémie Galarneau Date: Fri, 28 Feb 2020 23:02:17 +0000 (-0500) Subject: Fix: remove broken health monitoring test `test_thread_exit` X-Git-Tag: v2.13.0-rc1~748 X-Git-Url: https://git.lttng.org/?p=lttng-tools.git;a=commitdiff_plain;h=89c453960c909e2e793e246715026c5e81a83c3a Fix: remove broken health monitoring test `test_thread_exit` The `test_thread_exit` test uses the testpoint() infrastructure to simulate daemon threads dying at "random" times by calling pthread_exit() and checks that dead threads are properly reported by the health check API. The health check system is implemented as a list of structures that live in the TLS of the various monitored threads. When the health thread receives a health-check request, it iterates on this list and reports the current health status of the threads to the client. When a thread is stalled, this works fine; the health TLS has not been updated in some time and that can be observed by the client. However, for a 'spurious' thread exit, this is a bit more fragile. Essentially, the test assumes that the thread will not have unregistered its health TLS, but that it will remain valid until the thread is joined. Unfortunately, this behaviour relies on the fact that threads were not joined until late in the tear down of the session daemon in the past. This is no longer the case for all threads. To provide an example of a test sequence that results in a crash: - the test kills the client thread, - the session daemon received SIGTERM, - the client thread is joined immediately, - the next thread to shutdown (gracefully) unregisters itself from the health monitoring sybsystem and, in doing so, accesses an invalid element while removing itself from the health_app list, causing a crash. As a side note, I could not find a definitive answer on the lifetime of TLS variables. Are they guaranteed to be accessible until a pthread_join() or is it undefined to access them after a thread has returned? I'm guessing this is a very internal implementation detail of the pthread implementation being used and that we should not rely on that behaviour. There are multiple ways we could fix this problem, such as using heap-allocated structures and ref-counting them to share ownership with the health thread, using pthread_atexit() to clean-up, etc. However, the LTTng daemons never pthread_exit() their threads to handle errors (or even call it directly); handling this behaviour is of dubious interest at the moment. Signed-off-by: Jérémie Galarneau Change-Id: I16ec70eb683832fe5f2810885c995988b6194690 --- diff --git a/tests/long_regression b/tests/long_regression index 105dc510b..162117640 100644 --- a/tests/long_regression +++ b/tests/long_regression @@ -1,7 +1,6 @@ regression/tools/filtering/test_invalid_filter regression/tools/filtering/test_unsupported_op regression/tools/filtering/test_valid_filter -regression/tools/health/test_thread_exit regression/tools/health/test_thread_stall regression/tools/health/test_tp_fail regression/tools/streaming/test_ust diff --git a/tests/regression/tools/health/Makefile.am b/tests/regression/tools/health/Makefile.am index 55fea4331..e023f3a54 100644 --- a/tests/regression/tools/health/Makefile.am +++ b/tests/regression/tools/health/Makefile.am @@ -2,7 +2,7 @@ AM_CPPFLAGS += -I. -I$(top_srcdir)/include -COPYSCRIPTS = test_thread_exit test_thread_stall test_tp_fail \ +COPYSCRIPTS = test_thread_stall test_tp_fail \ test_health.sh test_thread_ok dist_noinst_SCRIPTS = $(COPYSCRIPTS) diff --git a/tests/regression/tools/health/test_thread_exit b/tests/regression/tools/health/test_thread_exit deleted file mode 100755 index 774d3a0db..000000000 --- a/tests/regression/tools/health/test_thread_exit +++ /dev/null @@ -1,25 +0,0 @@ -#!/bin/bash -# -# Copyright (C) 2012 Christian Babeux -# Copyright (C) 2014 Mathieu Desnoyers -# -# SPDX-License-Identifier: GPL-2.0-only - -TEST_DESC="Health check - Thread exit" - -SESSION_NAME="health_thread_exit" -SESSIOND_PRELOAD=".libs/libhealthexit.so" -TEST_SUFFIX="EXIT" -CURDIR=$(dirname $(readlink -f $0)) - -# A current design limitation of the lttng-consumerd will cause it to -# hang on shutdown if the timer management thread exits as the teardown -# of channels switches off the channel's timers. The timer thread is -# then expected to purge timer signals and signal when it is done. -# Obviously this state will never be reached as signals are no longer -# being processed. This is not dramatic as this is not what this test -# is meant to test; we only want to make sure the health check signals that -# something went wrong. -KILL_SIGNAL="SIGKILL" - -source ${CURDIR}/test_health.sh