tests: Replace babelstats.pl with bt2 plugins Observed Issue ============== `tests/regression/tools/filtering/test_valid_filters` is a long running test, especially when running as root and exercising the tests across the kernel domain. I observed that a sizable amount of time was being spent in the analysis of the results using `babelstats.pl`. Solution ======== Instead of using a script to parse the pretty output of babeltrace2, I decided to write two C++ plugins to replicate the behaviour of the `babelstats.pl` script. I measured the time using `sudo -E time ./path/to/test` | Test | Time with `babelstats.pl` | Time with bt2 plugins | | test_tracefile_count | 13.04s | 11.73s | | test_exclusion | 22.75s | 22.07s | | test_valid_filter | 301.04s | 144.41s | The switch to using babeltrace2 plugins reduces the runtime of the `test_valid_filter` test (when running with kernel tests) by half. The runtime changes to the other tests that were modified are not significant. Known drawbacks =============== The field_stats plugin behaviour differs from `babelstats.pl` with regards to enumeration fields ("container" in `babelstats.pl`). However, no tests depend on that behaviour to pass. The field_stats sink plugin doesn't perform a lot of run-time error-checking of functions it invokes, and doesn't fully clean up all the references it allocates though the babeltrace2 API. As the intended usage is for short lived invocations with relatively small traces, the principal drawback of this approach is that errors in the plugin may be harder to debug. Building tests of lttng-tools will now depend on having the babeltrace2 development headers and libraries available. Change-Id: Ie8ebdd255b6901a7d0d7c4cd584a02096cccd4fb Signed-off-by: Kienan Stewart <kstewart@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
configure.ac: introduce --{disable,enable}-lib-lttng-ctl The goal is to be able to only build liblttng-ctl, for example without needing to build bin/lttng. Since liblttng-ctl is required when building some of the binaries, ./configure will fail if --disabled (explicitly) unless those binaries are --disabled too. Previously, the following would result in liblttng-ctl not getting built, but it now gets built by default: ./configure \ --disable-bin-lttng \ --disable-bin-lttng-relayd \ --disable-bin-lttng-sessiond Change-Id: I9338c46e64c031360aa762a3ce891511a3dbba39 Signed-off-by: Christophe Bedard <christophe.bedard@apex.ai> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Fix: sessiond: freeze on channel creation on restart Issue observed -------------- When using lttng via a script, the session and consumer daemons appear to completely lock up when we request that a channel be created. The conditions for this lockup seem to be created by destroying a sessiond and then creating a sessiond in quick sequence. This can be reproduced, on some systems, by launching a session daemon and running the following commands: $ sudo killall lttng-sessiond $ sudo lttng-sessiond --daemonize $ lttng create my_session --snapshot --output /tmp/demo-output $ lttng enable-channel --kernel my_channel Note that 'killall' above is racy as it does not wait for the session daemon to be killed. Hence, it is not unexpected for the incoming session daemon to see the smoldering ashes of the "outgoing" session daemon. However, it would be helpful if the second session daemon instance warned the user of the existing session daemon instance. From the logs captured from both instances of the lttng-sessiond (the outgoing and incoming instances), there appears to be a time period during which both session daemons are active at once. This behaviour is unexpected as the session daemon guards itself (in theory) from running multiple conflicting instances. The guarding mechanism works in two steps (see the implementation of `check_existing_daemon` @ src/bin/lttng-sessiond/main.cpp:926) When a session daemon is launched, it attempts to connect to any active session daemon's 'client' endpoint (a UNIX socket, the same used by liblttng-ctl to communicate with the session daemon). If the daemon receives a reply, it can assume that another session daemon instance is already active and abort its launch. Conversely, when no reply is received, it uses a "lock file" mechanism to check for other running instances. The lock file-based check creates a file (typically /var/run/lttng/lttng-sessiond.lck in the case of a root session daemon) and acquires an exclusive (write) POSIX lock on it [1]. The assumption is that any other instance would own the lock and cause the operation to fail. On a reproducer system, we could notice that the client thread of the outgoing sessiond daemon was torn down before the launch of the initialization of the incoming session daemon. This caused the incoming session daemon to not receive a reply to its connection attempt and fall-back to the lock file-based mechanism. Surprinsingly, it appears that the lock file checks succeeds even though the outgoing session daemon still holds the lock file according to its log. See the original bug report for more information about the investigation and how to reproduce the problem. Cause ----- The POSIX file locking API has a number of surprising behaviours[2] that have seen it being superseded by platform-specific APIs. In our case, the one that bit us is that any file lock held by a process is automatically released when any of the file descriptors that reference the file's description is released. In practical terms, if a process forks and its child dies, it loses its file lock since the child's file descriptors are closed on exit. The LWN article linked below describes a similar scenario: It's common to have a library routine that opens a file, reads or writes to it, and then closes it again, without the calling application ever being aware that has occurred. If the application happens to be holding a lock on the file when that occurs, it can lose that lock without ever being aware of it. The problem affects any use of the --background/--daemonize options since, as part of the daemonization process (which occurs after the lock file acquisition), the session daemon forks and its parent process exits. This causes one of the descriptors pointing to the lock file to be closed and the lock to be released. After that point, any other instance of the session daemon process would succeed in acquiring the lock file and assume it is the sole instance on the system. Solution -------- The lock file code is modified to use the non-POSIX `flock`[3] interface which is available on Linux and some BSDs[4]. `flock` provides us with the guarantee we thought we had: that the file lock is only released when _all_ file descriptors pointing to a given file description are closed. Drawbacks --------- As a fallback, platforms that don't support `flock` will use the original locking mechanism. Since this is a "hint" to warn users when erroneously launch a second privileged session daemon, it seems acceptable for it to not be completely reliable on secondary platforms. References ---------- [1] https://man7.org/linux/man-pages/man2/fcntl.2.html (see F_SETLK) [2] https://lwn.net/Articles/586904/ [3] https://linux.die.net/man/2/flock [4] https://man.freebsd.org/cgi/man.cgi?query=flock&sektion=2 Fixes #1405 Reported-by: Erica Bugden <ebugden@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com> Change-Id: Ic505ff0671c321f808050831ef2b7152cdbf4b8a
Fix: sessiond: crash when sending data_pending to an active session Observed Issue ============== When a data_pending command is sent to an active session, the sessiond crashes with the following assert ``` lttng-sessiond: client.cpp:2647: void* thread_manage_clients(void*): Assertion `cmd_ctx.reply_payload.buffer.size >= sizeof(*llm)' failed. Error: 1 trace chunks are leaked by lttng-consumerd. This can be caused by an internal error of the session daemon. ``` Cause ===== When a session is active, cmd.cpp:cmd_data_pending() returns LTTNG_ERR_SESSION_STARTED. In client.cpp:process_client_msg(), this return value causes the execution to go the the setup_error label. In the setup_error label, no default LLM header is added to the reply, meaning the reply has a zero size and triggering the assert above. Solution ======== When cmd_data_pending() returns a value that is neither 0 nor 1, the return code is set appropriately as follows: * when the return value is outside the range of lttng error codes, LTTNG_ERR_UNK is used * otherwise, the return value is used The execution then jumps to the error label so that the default LLM message header can be added. Known Drawbacks =============== None. Signed-off-by: Kienan Stewart <kstewart@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com> Change-Id: Iff46f87c7725d25c131a86ac3dbaed5c99b4d16b
Tests: Introduce test_ust_constructor Test instrumentation coverage of C/C++ constructors and destructors by LTTng-UST tracepoints. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com> Change-Id: Ia9e5a5a57bfa7fd4316f8a914ef97effd020262e
Tests: Introduce gen-ust-events-constructor test application This test application tests C/C++ constructor/destructor instrumentation coverage. * How to use: lttng create lttng enable-event -u 'tp*' lttng start ./gen-ust-events-constructor lttng stop lttng view * Before UST fixes: [11:57:09.949917277] (+?.?????????) compudjdev tp_so:constructor_c_provider_shared_library: { cpu_id = 6 }, { } [11:57:09.949962573] (+0.000045296) compudjdev tp_so:constructor_cplusplus_provider_shared_library: { cpu_id = 6 }, { msg = "global - shared library define and provider" } [11:57:09.952145202] (+0.002182629) compudjdev tp:constructor_cplusplus: { cpu_id = 6 }, { msg = "global - same unit after provider" } [11:57:09.952146517] (+0.000001315) compudjdev tp:constructor_c_across_units_after_provider: { cpu_id = 6 }, { } [11:57:09.952146887] (+0.000000370) compudjdev tp:constructor_cplusplus: { cpu_id = 6 }, { msg = "global - across units after provider" } [11:57:09.952634622] (+0.000487735) compudjdev tp:constructor_cplusplus: { cpu_id = 6 }, { msg = "main() local" } [11:57:09.952635522] (+0.000000900) compudjdev tp_so:constructor_cplusplus_provider_shared_library: { cpu_id = 6 }, { msg = "main() local - shared library define and provider" } [11:57:09.952636176] (+0.000000654) compudjdev tp_a:constructor_cplusplus_provider_static_archive: { cpu_id = 6 }, { msg = "main() local - static archive define and provider" } [11:57:09.952636906] (+0.000000730) compudjdev tp:main: { cpu_id = 6 }, { } [11:57:09.952637469] (+0.000000563) compudjdev tp_a:destructor_cplusplus_provider_static_archive: { cpu_id = 6 }, { msg = "main() local - static archive define and provider" } [11:57:09.952638106] (+0.000000637) compudjdev tp_so:destructor_cplusplus_provider_shared_library: { cpu_id = 6 }, { msg = "main() local - shared library define and provider" } [11:57:09.952638516] (+0.000000410) compudjdev tp:destructor_cplusplus: { cpu_id = 6 }, { msg = "main() local" } [11:57:09.952681576] (+0.000043060) compudjdev tp:destructor_cplusplus: { cpu_id = 6 }, { msg = "global - across units after provider" } [11:57:09.952682066] (+0.000000490) compudjdev tp:destructor_cplusplus: { cpu_id = 6 }, { msg = "global - same unit after provider" } [11:57:09.952729603] (+0.000047537) compudjdev tp_so:destructor_cplusplus_provider_shared_library: { cpu_id = 6 }, { msg = "global - shared library define and provider" } * After UST fixes: [11:49:37.921028048] (+?.?????????) compudjdev tp_so:constructor_c_provider_shared_library: { cpu_id = 22 }, { } [11:49:37.921033701] (+0.000005653) compudjdev tp_a:constructor_c_provider_static_archive: { cpu_id = 22 }, { } [11:49:37.921036278] (+0.000002577) compudjdev tp_so:constructor_cplusplus_provider_shared_library: { cpu_id = 22 }, { msg = "global - shared library define and provider" } [11:49:37.921037961] (+0.000001683) compudjdev tp_a:constructor_cplusplus_provider_static_archive: { cpu_id = 22 }, { msg = "global - static archive define and provider" } [11:49:37.921039431] (+0.000001470) compudjdev tp:constructor_c_across_units_before_define: { cpu_id = 22 }, { } [11:49:37.921040288] (+0.000000857) compudjdev tp:constructor_cplusplus: { cpu_id = 22 }, { msg = "global - across units before define" } [11:49:37.921041208] (+0.000000920) compudjdev tp:constructor_c_same_unit_before_define: { cpu_id = 22 }, { } [11:49:37.921042021] (+0.000000813) compudjdev tp:constructor_c_same_unit_after_define: { cpu_id = 22 }, { } [11:49:37.921042568] (+0.000000547) compudjdev tp:constructor_cplusplus: { cpu_id = 22 }, { msg = "global - same unit before define" } [11:49:37.921043161] (+0.000000593) compudjdev tp:constructor_cplusplus: { cpu_id = 22 }, { msg = "global - same unit after define" } [11:49:37.921044058] (+0.000000897) compudjdev tp:constructor_c_across_units_after_define: { cpu_id = 22 }, { } [11:49:37.921044585] (+0.000000527) compudjdev tp:constructor_cplusplus: { cpu_id = 22 }, { msg = "global - across units after define" } [11:49:37.921045585] (+0.000001000) compudjdev tp:constructor_c_same_unit_before_provider: { cpu_id = 22 }, { } [11:49:37.921046385] (+0.000000800) compudjdev tp:constructor_c_same_unit_after_provider: { cpu_id = 22 }, { } [11:49:37.921046938] (+0.000000553) compudjdev tp:constructor_cplusplus: { cpu_id = 22 }, { msg = "global - same unit before provider" } [11:49:37.921047548] (+0.000000610) compudjdev tp:constructor_cplusplus: { cpu_id = 22 }, { msg = "global - same unit after provider" } [11:49:37.921048428] (+0.000000880) compudjdev tp:constructor_c_across_units_after_provider: { cpu_id = 22 }, { } [11:49:37.921048918] (+0.000000490) compudjdev tp:constructor_cplusplus: { cpu_id = 22 }, { msg = "global - across units after provider" } [11:49:37.921050001] (+0.000001083) compudjdev tp:constructor_cplusplus: { cpu_id = 22 }, { msg = "main() local" } [11:49:37.921050628] (+0.000000627) compudjdev tp_so:constructor_cplusplus_provider_shared_library: { cpu_id = 22 }, { msg = "main() local - shared library define and provider" } [11:49:37.921051368] (+0.000000740) compudjdev tp_a:constructor_cplusplus_provider_static_archive: { cpu_id = 22 }, { msg = "main() local - static archive define and provider" } [11:49:37.921052098] (+0.000000730) compudjdev tp:main: { cpu_id = 22 }, { } [11:49:37.921052758] (+0.000000660) compudjdev tp_a:destructor_cplusplus_provider_static_archive: { cpu_id = 22 }, { msg = "main() local - static archive define and provider" } [11:49:37.921053758] (+0.000001000) compudjdev tp_so:destructor_cplusplus_provider_shared_library: { cpu_id = 22 }, { msg = "main() local - shared library define and provider" } [11:49:37.921054595] (+0.000000837) compudjdev tp:destructor_cplusplus: { cpu_id = 22 }, { msg = "main() local" } [11:49:37.921055698] (+0.000001103) compudjdev tp:destructor_cplusplus: { cpu_id = 22 }, { msg = "global - across units after provider" } [11:49:37.921056455] (+0.000000757) compudjdev tp:destructor_cplusplus: { cpu_id = 22 }, { msg = "global - same unit after provider" } [11:49:37.921057011] (+0.000000556) compudjdev tp:destructor_cplusplus: { cpu_id = 22 }, { msg = "global - same unit before provider" } [11:49:37.921057558] (+0.000000547) compudjdev tp:destructor_cplusplus: { cpu_id = 22 }, { msg = "global - across units after define" } [11:49:37.921058188] (+0.000000630) compudjdev tp:destructor_cplusplus: { cpu_id = 22 }, { msg = "global - same unit after define" } [11:49:37.921058658] (+0.000000470) compudjdev tp:destructor_cplusplus: { cpu_id = 22 }, { msg = "global - same unit before define" } [11:49:37.921059168] (+0.000000510) compudjdev tp:destructor_cplusplus: { cpu_id = 22 }, { msg = "global - across units before define" } [11:49:37.921059768] (+0.000000600) compudjdev tp_a:destructor_cplusplus_provider_static_archive: { cpu_id = 22 }, { msg = "global - static archive define and provider" } [11:49:37.921060445] (+0.000000677) compudjdev tp_so:destructor_cplusplus_provider_shared_library: { cpu_id = 22 }, { msg = "global - shared library define and provider" } [11:49:37.921067265] (+0.000006820) compudjdev tp:destructor_c_across_units_after_provider: { cpu_id = 22 }, { } [11:49:37.921067901] (+0.000000636) compudjdev tp:destructor_c_same_unit_after_provider: { cpu_id = 22 }, { } [11:49:37.921068515] (+0.000000614) compudjdev tp:destructor_c_same_unit_before_provider: { cpu_id = 22 }, { } [11:49:37.921069128] (+0.000000613) compudjdev tp:destructor_c_across_units_after_define: { cpu_id = 22 }, { } [11:49:37.921069831] (+0.000000703) compudjdev tp:destructor_c_same_unit_after_define: { cpu_id = 22 }, { } [11:49:37.921070445] (+0.000000614) compudjdev tp:destructor_c_same_unit_before_define: { cpu_id = 22 }, { } [11:49:37.921071075] (+0.000000630) compudjdev tp:destructor_c_across_units_before_define: { cpu_id = 22 }, { } [11:49:37.921071721] (+0.000000646) compudjdev tp_a:destructor_c_provider_static_archive: { cpu_id = 22 }, { } [11:49:37.921072605] (+0.000000884) compudjdev tp_so:destructor_c_provider_shared_library: { cpu_id = 22 }, { } Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com> Change-Id: I4572c2548acf5e295f70e88137ab12b3b86d17c9
Test: client: start, stop, destroy: add tests for --glob/--all Test the CLI client's start, stop, and destroy commands along with their --all and --glob options. The tests validate that only the targeted sessions are affected by the various commands and that the commands don't error-out when multiple sessions are targetted. Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com> Change-Id: Ie39e999608a063cc4573d790120fbe0896917d6f
Add getrandom compat for MacOS, FreeBSD and Cygwin Use the BSD arc4random_buf() function which should be non-blocking on all of these platforms. Change-Id: Ib43373cad82373dc83995fdb3d01c2a2d43ab683 Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Build fix: add a noop compat wrapper for posix_fadvise This allows building on platforms without posix_fadvise. Change-Id: I13fd6404cd02eb8038a97693db27d9619246401d Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
tests: make test_shellcheck depend on shellcheck Since this test is only useful for development, don't fail the build on platforms that lack shellcheck. Another approach could be to add a dedicated make target to run the meta tests but this was easier as a quick fix. Change-Id: If9d30c7c4e3ee526f028ad82545e631ace272354 Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Tests: Add meta shellcheck test Pass shellcheck on a list of known correct shell scripts. The idea is to add scripts to the list incrementally and avoid regressions. Change-Id: If9b15879cfd4d3c3bdbc6b5fc3efb5638593ca74 Signed-off-by: Olivier Dion <odion@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Build feature: Add pre-inst-env script This script setup environment variables for running compiled binaries from the project without having to install them. Usage example: $ ./pre-inst-env lttng create Change-Id: Ie7839f32cfdc32103db035e2189b6f8a4c5cabe4 Signed-off-by: Olivier Dion <odion@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Fix: sessiond: instance uuid is not sufficiently unique Observed issue ============== Tracing a cluster of machines -- all launched simultaneously -- to the same relay daemon occasionally produces corrupted traces. The size of packets received (as seen from the relay daemon logs) and that of those present in the on-disk stream occasionally didn't match. The traces were observed to all relate to the same trace UUID, but present packet begin timestamps that were not monotonic for a given stream. This causes both Babeltrace 1.x and 2.x to fail to open the traces with different error messages related to clocks. Cause ===== On start, the session daemon generates a UUID to uniquely identify the sessiond instance. Since the UUID generation utils use time() to seed the random number generator, two session daemons launched within the same second can end up with the same instance UUID. Since the relay daemon relies on this UUID to uniquely identify a session daemon accross its various connections, identifier clashes can cause streams from the same `uid` or `pid` to become scrambled resulting in corrupted traces. Solution ======== The UUID utils now initializes its random seed using the getrandom() API in non-blocking mode. If that fails -- most likely because the random pool is depleted or the syscall is not available on the platform -- it falls back to using a hash of two time readings (with nanosecond precision), of the hostname, and the PID. Known drawbacks =============== This fix implements many fallbacks, each with their own caveats and we don't have full test coverage for all of those for the moment. This article presents the different drawbacks of using /dev/urandom vs getrandom(). https://lwn.net/Articles/884875/ As for the pseudo-random time and configuration based fallback, it is meant as a last resort for platforms or configurations where both getrandom() (old kernels or non-Linux platforms) and /dev/urandom (e.g. locked-down container) are not be available. I haven't done a formal analysis of the entropy of this home-grown method. The practical use-case we want to enable is launching multiple virtual machines (or containers) at roughly the same time and ensure that they don't end up using the same sessiond UUID. In that respect, having a different host name and minute timing changes seem enough to prevent a UUID clash. Using the PID as part of the hash also helps when launching multiple session daemons simultaneously for different users. Change-Id: I064753b9ff0f5bf2279be0bd0cfbfd2b0dd19bfc Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Tests: add basic ust context tests for $app, vpid, vuid, vgid Context tracing has very little testing coverage beyond the namespace tests. This test is initially introduced to help troubleshoot an issue with application contexts. However, with the scaffolding now in place, it's trivial to exercise some other context types. This also adds a basic framework to write tests in Python. Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com> Change-Id: I85be842fab252d8b853392d3f742b1461a69f0fe
Fix: never use 'no' in a command variable Command variable may be used in the Makefiles and tests, when not found, set them to an empty string instead of trying to execute the command 'no'. Change-Id: I5429bb96599e3b7f166e52545269cd99eed758ae Reported-by: Heng Guo <heng.guo@windriver.com> Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Bump URCU dependency to 0.14 Complete C++ support was introduced in Userspace-RCU 0.14, using earlier versions results in a build failure, this should be reflected in the configure check. Change-Id: I1b708bd9b04784deb9f2c8768a331911c3ebb891 Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>