git.lttng.org Git - lttng-tools.git/log

Fix: consumerd: wrong timer mentioned in error logging

As its name indicates, consumer_timer_monitor_stop() stops the _monitor_
timer, not the live timer. This is most likely a copy-paste error.

The error logging is fixed to mention the appropriate timer.

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I9768408581fc6a06f47892850a3a91669df35188

Fix: consumerd: type confusion in lttng_consumer_send_error

lttng_consumer_send_error sends an lttcomm_return_code to the session
daemon. However, the size of lttcomm_sessiond_command was used.

This was probably missed since the function accepts an integer instead
of a proper enum type.

The function's parameter type is changed to lttcomm_return_code and the
size of a fixed-size type is used to send the error code to the session
daemon.

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ib6a04969dd82857e3b8ac2ca2545cfb098b2d04f

Update version to v2.12.15

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

Fix: baddr-statedump: use $(LIBTOOL) --mode=execute

GNU libtool inconsistently places the compiled executable either in the
source directory, or in the .libs directory with a libtool wrapper
script placed in the source directory.

slibtool, on the other hand, always places the compiled executable in
the .libs directory and a wrapper script in the source directory.

This results in a build error when using slibtool since objcopy needs
the executable and not the shell wrapper script. This can be solved for
both implementations by using $(LIBTOOL) --mode=execute on all commands
that operate on the libtool compiled executables.

Gentoo issue: https://bugs.gentoo.org/858095

The GNU libtool --mode=execute is documented upstream.

https://www.gnu.org/software/libtool/manual/html_node/Execute-mode.html
https://www.gnu.org/software/libtool/manual/html_node/Debugging-executables.html

And the GNU libtool behavior of when to create a wrapper script is
documented in the 'Linking Executables' section.

  "Notice that the executable, hell, was actually created in the .libs
  subdirectory. Then, a wrapper script (or, on certain platforms, a
  wrapper executable see Wrapper executables) was created in the current
  directory.

  Since libtool created a wrapper script, you should use libtool to
  install it and debug it too. However, since the program does not depend
  on any uninstalled libtool library, it is probably usable even without
  the wrapper script."

https://www.gnu.org/software/libtool/manual/html_node/Linking-executables.html

And the inconsistency between GNU libtool and slibtool is documented at
the Gentoo wiki.

  "One difference between GNU libtool and slibtool is that the former will
  conditionally place the compiled executable or a shell wrapper script in
  the build directory, depending on whether or not the executable depends
  on a build-local libtool library (e.g. libfoo.la). Where slibtool will
  always place a compatible wrapper script in the build directory where
  GNU libtool would have conditionally placed the executable. When the
  wrapper script is created both GNU libtool and slibtool will place the
  executable in the .libs directory within the build directory.
  Consequently build systems, ebuilds, and other users should take care to
  avoid scenarios like installing the wrapper script to the system instead
  of the executable. In these cases ideally the executable would be
  installed by the same libtool implementation that compiled it."

https://wiki.gentoo.org/wiki/Slibtool#Installing_or_using_binaries_created_by_libtool_manually

Signed-off-by: orbea <orbea@riseup.net>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I03102ed78af835daa9b9a5836c2979a5f5d4bd8c

Fix: relayd: live client not notified of inactive streams

Observed issue
--------------

Some LTTng-tools live test failures appear to show babeltrace2
hanging (failing to print expected events). The problem is intermittent,
but Kienan was able to develop a test case that's reproducible for him.

The test case performs the following steps:
  - Start a ust application and leave it running
  - Configure and then start an lttng live session
  - Connect a live viewer (babeltrace)
  - Run a second ust application
  - Wait for the expected number of events
    - In the failing case, no events are seen by babeltrace

Using per-uid buffers, the test typically completes normally. With
per-pid buffers the test fails, hanging indefinitely if waiting for the
specified number of events. While "hanging", babeltrace2 is polling the
relayd.

This affects babeltrace2 stable-2.0 and master while using
lttng-tools master.

For more information, see the description of bug #1406[1].

Cause
-----

When consuming a live trace captured in per-PID mode, Babeltrace
periodically requests the index of the next packet it should consume.

As part of the reply, it gets a 'flags' field which is used to announce
that new streams, or new metadata, are available to the viewer.
Unfortunately, these 'flags' are only set when the relay daemon has new
tracing data to deliver. It is not set when the relay daemon indicates
that the stream is inactive (see LTTNG_VIEWER_INDEX_INACTIVE).

In the average case where an application is spawned while others are
actively emitting events, a request for new data will result in a reply
that returns an index entry (code LTTNG_VIEWER_INDEX_OK) for an
available packet accompanied by the LTTNG_VIEWER_FLAG_NEW_STREAM flag.

This flag indicates to the viewer that it should request new
streams (using the LTTNG_VIEWER_GET_NEW_STREAMS live protocol command)
before consuming the new data.

In the cases where we observe a hang, an application is running but not
emitting new events. As such, the relay daemon periodically emits "live
beacons" to indicate that the session's streams are inactive up to a
given time 'T'.

Since the existing application remains inactive and the viewer is never
notified that new streams are available, the viewer effectively remains
"stuck" and never notices the new application being traced.

The LTTNG_VIEWER_FLAG_NEW_METADATA communicates a similar semantic with
regards to the metadata. However, ignoring it for inactive streams isn't
as deleterious: the same information is made available to the viewer the
next time it successfully requests a new index from the relay daemon.

This would only become a problem if the tracers start to express
non-layout data (like supplemental environment information, but I don't
see a real use-case) as part of the metadata stream that should be made
available downstream even during periods of inactivity.

Note that the same problem most likely affects the per-UID buffer
allocation mode when multiple users are being traced.

Solution
--------

On the producer end, LTTNG_VIEWER_FLAG_NEW_STREAM is set even when
returning an inactivity index.

Note that to preserve compatibility with older live consumers that don't
expect this flag in a non-OK response, the LTTNG_VIEWER_FLAG_NEW_STREAM
notification is repeated until the next LTTNG_VIEWER_GET_NEW_STREAMS
command that returns LTTNG_VIEWER_INDEX_OK.

The 'new_streams' state is no longer cleared from relay sessions during
the processing of the LTTNG_VIEWER_GET_NEXT_INDEX commands. Instead, it
is cleared when the viewer requests new streams.
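
The state handling described above can be pictured with a small Python
model. This is only a sketch of the idea, not the relay daemon's C code:
the RelaySession class, its method names and the flag value are
hypothetical stand-ins.

```python
# Hypothetical stand-in for the protocol flag; the value is illustrative.
FLAG_NEW_STREAM = 1 << 1


class RelaySession:
    """Toy model of a relay session's 'new_streams' state."""

    def __init__(self):
        self.new_streams = False

    def add_stream(self):
        # A newly traced application published a stream.
        self.new_streams = True

    def get_next_index_flags(self):
        # The flag accompanies every index reply, including inactivity
        # replies, and is no longer cleared here.
        return FLAG_NEW_STREAM if self.new_streams else 0

    def get_new_streams(self):
        # The state is only cleared once the viewer asks for the streams.
        had_new_streams = self.new_streams
        self.new_streams = False
        return had_new_streams
```

With this model, repeated index requests keep announcing the new stream
until the viewer actually requests it.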

On Babeltrace's end, the handler of the LTTNG_VIEWER_GET_NEXT_INDEX
command (lttng_live_get_next_index) is modified to expect
LTTNG_VIEWER_FLAG_NEW_STREAM in the cases where the command returns:
  - LTTNG_VIEWER_INDEX_OK (as done previously),
  - LTTNG_VIEWER_INDEX_HUP (new),
  - LTTNG_VIEWER_INDEX_INACTIVE (new).
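
The viewer-side rule listed above can be sketched as follows. This is a
Python illustration rather than Babeltrace's C implementation; the
numeric values and the must_request_new_streams() helper are assumptions
made for the example.

```python
# Illustrative values; treat them as assumptions, not protocol constants.
INDEX_OK, INDEX_RETRY, INDEX_HUP, INDEX_ERR, INDEX_INACTIVE = range(1, 6)
FLAG_NEW_METADATA = 1 << 0
FLAG_NEW_STREAM = 1 << 1

# Statuses for which the viewer now honours the NEW_STREAM flag.
STATUSES_CARRYING_FLAGS = {INDEX_OK, INDEX_HUP, INDEX_INACTIVE}


def must_request_new_streams(status, flags):
    """Return True when the viewer should issue GET_NEW_STREAMS."""
    if status not in STATUSES_CARRYING_FLAGS:
        return False
    return bool(flags & FLAG_NEW_STREAM)
```

Before the change, only the INDEX_OK case would have been present in the
set of accepted statuses.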

Drawbacks
---------

This is arguably a protocol change as none of the producers ever set the
NEW_METADATA/NEW_STREAM flags when indicating an inactive stream.

References
----------

[1] https://bugs.lttng.org/issues/1406

Fixes #1406

Change-Id: I84f53f089597ac7b22ce8bd0962d4b28112b7ab6
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

Fix: relayd: live: dispose of zombie viewer metadata stream

Issue observed
==============

In the CI, builds on SLES15SP5 frequently experience timeouts. From
prior inspections, there are hangs during
tests/regression/tools/clear/test_ust while waiting for babeltrace to
exit.

It is possible to reproduce the problem fairly easily:

  $ lttng create --live
  $ lttng enable-event --userspace --all
  $ lttng start

  # Launch an application that emits a couple of events
  $ ./my_app
  $ lttng stop

  # Clear the data, this eventually results in the deletion of all
  # trace files on the relay daemon's end.
  $ lttng clear

  # Attach to the live session from another terminal
  $ babeltrace -i lttng-live net://...

  # The 'destroy' command completes, but the viewer never exits.
  $ lttng destroy

Cause
=====

After the clear command completes, the relay daemon no longer has any
data to serve. We notice that the live client loops endlessly,
repeatedly sending GET_METADATA requests. In response, the relay daemon
replies with the NO_NEW_METADATA status.

In concrete terms, the viewer_get_metadata() function short-circuits to
send that reply when it sees that the metadata stream has no active
trace chunk (i.e., there are no backing files from which to read the
data at the moment).

This situation is not abnormal in itself: it is legitimate for a client
to wait for the metadata to become available again. For example, in the
reproducer above, it would be possible for the user to restart the
tracing (lttng start), which would create a new trace chunk and make the
metadata stream available. New events could also be emitted following
this restart.

However, when a session's connection is closed, there is no hope that
the metadata stream will ever transition back to an active trace chunk.

Solution
========

When the metadata stream has no active chunk and the corresponding
consumerd-side connection has been closed, there is no way the relay
daemon will be able to serve the metadata contents to the client.

As such, the viewer stream can be disposed of since it will no longer be
of any use to the client. Since some client implementations expect at
least one GET_METADATA command to result in NO_NEW_METADATA, that status
code is initially returned.

Later, when the client emits a follow-up GET_METADATA request for that
same stream, it will receive an "error" status indicating that the
stream no longer exists. This situation is not treated as an error by
the clients. For instance, babeltrace2 will simply close the
corresponding trace and indicate it ended.

The 'no_new_metadata_notified' flag doesn't appear to be necessary to
implement the behaviour expected by the clients (seeing at least one
NO_NEW_METADATA status reply for every metadata stream). The
viewer_get_metadata() function is refactored a bit to drop the global
reference to the viewer metadata stream as it exits, while still
returning the NO_NEW_METADATA status code.
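
The resulting request sequence can be modelled in a few lines of Python.
This is a simplified sketch, not the actual viewer_get_metadata() code:
the status strings, the MetadataStream fields and the RelaySketch class
are hypothetical names used only for illustration.

```python
from dataclasses import dataclass

# Stand-ins for the live protocol status codes.
OK = "OK"
NO_NEW_METADATA = "NO_NEW_METADATA"
ERR = "ERR"


@dataclass
class MetadataStream:
    has_active_chunk: bool
    connection_closed: bool


class RelaySketch:
    def __init__(self):
        self.viewer_metadata_streams = {}

    def get_metadata(self, stream_id):
        stream = self.viewer_metadata_streams.get(stream_id)
        if stream is None:
            # The stream was disposed of earlier: report it as unknown.
            return ERR
        if not stream.has_active_chunk:
            if stream.connection_closed:
                # No chunk and no connection: the data can never come
                # back, so drop the stream while still answering
                # NO_NEW_METADATA once.
                del self.viewer_metadata_streams[stream_id]
            return NO_NEW_METADATA
        return OK
```

A client polling such a stream sees NO_NEW_METADATA once, then the
"error" status that lets it close the trace.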

Known drawbacks
===============

None.

Note
====

The commit message of e8b269fa provides more details behind the
intention of the 'no_new_metadata_notified' flag.

Change-Id: Ib1b80148d7f214f7aed221d3559e479b69aedd82
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

Fix: sessiond: freeze on channel creation on restart

Issue observed
--------------

When using lttng via a script, the session and consumer daemons appear
to completely lock up when we request that a channel be created. The
conditions for this lockup seem to be created by destroying a sessiond
and then creating a sessiond in quick succession.

This can be reproduced, on some systems, by launching a session daemon
and running the following commands:
  $ sudo killall lttng-sessiond
  $ sudo lttng-sessiond --daemonize
  $ lttng create my_session --snapshot --output /tmp/demo-output
  $ lttng enable-channel --kernel my_channel

Note that 'killall' above is racy as it does not wait for the session
daemon to be killed. Hence, it is not unexpected for the incoming
session daemon to see the smoldering ashes of the "outgoing" session
daemon. However, it would be helpful if the second session daemon
instance warned the user of the existing session daemon instance.

From the logs captured from both instances of the lttng-sessiond (the
outgoing and incoming instances), there appears to be a time period
during which both session daemons are active at once.

This behaviour is unexpected as the session daemon guards itself (in
theory) from running multiple conflicting instances.

The guarding mechanism works in two steps (see the implementation of
`check_existing_daemon` @ src/bin/lttng-sessiond/main.cpp:926)

When a session daemon is launched, it attempts to connect to any active
session daemon's 'client' endpoint (a UNIX socket, the same used by
liblttng-ctl to communicate with the session daemon).

If the daemon receives a reply, it can assume that another session
daemon instance is already active and abort its launch. Conversely, when
no reply is received, it uses a "lock file" mechanism to check for other
running instances.

The lock file-based check creates a file (typically
/var/run/lttng/lttng-sessiond.lck in the case of a root session daemon)
and acquires an exclusive (write) POSIX lock on it [1]. The assumption
is that any other instance would own the lock and cause the operation to
fail.

On a reproducer system, we could notice that the client thread of the
outgoing session daemon was torn down before the initialization of the
incoming session daemon. This caused the incoming session daemon to not
receive a reply to its connection attempt and fall back to the lock
file-based mechanism.

Surprisingly, it appears that the lock file check succeeds even though
the outgoing session daemon still holds the lock file according to its
log.

See the original bug report for more information about the investigation
and how to reproduce the problem.

Cause
-----

The POSIX file locking API has a number of surprising behaviours[2] that
have seen it being superseded by platform-specific APIs. In our case,
the one that bit us is that any file lock held by a process is
automatically released when any of the file descriptors that reference
the file's description is released.

In practical terms, if a process forks and its child dies, it loses its
file lock since the child's file descriptors are closed on exit.

The LWN article linked below describes a similar scenario:

  It's common to have a library routine that opens a file, reads or
  writes to it, and then closes it again, without the calling
  application ever being aware that has occurred. If the application
  happens to be holding a lock on the file when that occurs, it can lose
  that lock without ever being aware of it.
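
The scenario quoted above can be reproduced with a short Python script.
It is a sketch that assumes a Linux system and uses fcntl.lockf() (which
takes POSIX record locks) as a stand-in for the daemon's C code; the
helper names are made up for the example. A separate process probes
whether the lock is still held.

```python
import fcntl
import os
import subprocess
import sys
import tempfile

# Child program: exit 0 if it can take the POSIX lock, 1 otherwise.
PROBE = """
import fcntl, sys
f = open(sys.argv[1], "r+")
try:
    fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
except OSError:
    sys.exit(1)
"""


def posix_lock_is_free(path):
    return subprocess.run([sys.executable, "-c", PROBE, path]).returncode == 0


def demo_posix_lock_loss():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        holder = open(path, "r+")
        fcntl.lockf(holder, fcntl.LOCK_EX | fcntl.LOCK_NB)
        held_before = not posix_lock_is_free(path)
        # A "library routine" opens and closes the same file.
        open(path).close()
        held_after = not posix_lock_is_free(path)
        holder.close()
        return held_before, held_after
    finally:
        os.unlink(path)
```

Even though `holder` is never closed in between, the second probe is
expected to find the lock free.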

The problem affects any use of the --background/--daemonize options
since, as part of the daemonization process (which occurs after the lock
file acquisition), the session daemon forks and its parent process
exits. This causes one of the descriptors pointing to the lock file to
be closed and the lock to be released.

After that point, any other instance of the session daemon process would
succeed in acquiring the lock file and assume it is the sole instance on
the system.

Solution
--------

The lock file code is modified to use the non-POSIX `flock`[3]
interface which is available on Linux and some BSDs[4]. `flock` provides
us with the guarantee we thought we had: that the file lock is only
released when _all_ file descriptors pointing to a given file
description are closed.
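
For comparison, here is a sketch of the `flock` behaviour using Python's
fcntl.flock() on Linux. It is an illustration of the semantics, not the
session daemon's implementation; the helper names are hypothetical. The
duplicated descriptor plays the role of the descriptor that survives in
the daemonized process.

```python
import fcntl
import os
import subprocess
import sys
import tempfile

# Child program: exit 0 if it can take the flock, 1 otherwise.
PROBE = """
import fcntl, sys
f = open(sys.argv[1], "r+")
try:
    fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
except OSError:
    sys.exit(1)
"""


def flock_is_free(path):
    return subprocess.run([sys.executable, "-c", PROBE, path]).returncode == 0


def demo_flock_persistence():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        holder = open(path, "r+")
        fcntl.flock(holder, fcntl.LOCK_EX | fcntl.LOCK_NB)
        # Second descriptor referring to the same open file description.
        dup_fd = os.dup(holder.fileno())
        # An unrelated open/close of the file does not release the lock.
        open(path).close()
        after_unrelated_close = not flock_is_free(path)
        # Closing one of the two descriptors keeps the lock held.
        holder.close()
        after_first_close = not flock_is_free(path)
        # Only closing the last descriptor releases it.
        os.close(dup_fd)
        after_last_close = not flock_is_free(path)
        return after_unrelated_close, after_first_close, after_last_close
    finally:
        os.unlink(path)
```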

Drawbacks
---------

As a fallback, platforms that don't support `flock` will use the original
locking mechanism. Since this is a "hint" to warn users when they
erroneously launch a second privileged session daemon, it seems
acceptable for it to not be completely reliable on secondary platforms.

References
----------

[1] https://man7.org/linux/man-pages/man2/fcntl.2.html (see F_SETLK)
[2] https://lwn.net/Articles/586904/
[3] https://linux.die.net/man/2/flock
[4] https://man.freebsd.org/cgi/man.cgi?query=flock&sektion=2

Fixes #1405

Reported-by: Erica Bugden <ebugden@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ic505ff0671c321f808050831ef2b7152cdbf4b8a

common: move utils_create_lock_file to its own file

A follow-up change introduces platform-specific implementations of this
function. Moving the function to a separate file makes it possible to
add other implementations without polluting utils.c with more
platform-specific code.

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ibd566d8710380fe378a8f3df9454e21e83655b62

tests: tools/clear/test_ust wait for specific test app pid

Observed issue
==============

When debugging failing tests manually, one step that is sometimes done
is to quickly swap the commands that start the relay or sessiond in
`tests/utils/utils.sh` (eg. in `start_lttng_relayd_opt`) for the version
which uses a verbose output to a logfile.

When doing this, the `relayd` wasn't using the background
`process_mode`, and was a child of the running test.

This caused `test_ust_local_snapshot_per_pid` in
`tests/regression/tools/clear/test_ust` to hang as it waited for all
child processes to terminate.

Solution
========

The test has been updated to wait for only the specific test application
pid.

Known drawbacks
===============

None.

Change-Id: I8761649a52fceda92a5545c71818dc2eb027bfcf
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

Tests: use CPU ids from online ranges

test_tracefile_count could fail randomly on systems where there are CPUs
present but not online. For example:

  $ cat /sys/devices/system/cpu/online
  0-7
  $ cat /sys/devices/system/cpu/present
  0-39

When a CPU is present, it will have an entry in
/sys/devices/system/cpu/cpuX for its ID, and thus the test may pick
that CPU's ID. However, a present CPU which is not online is not a valid
target for taskset.

In cases where `get_any_available_cpu` is used with taskset, the tests
could fail for a similar reason. This case can be somewhat less common
because it would return the numerically lowest CPU first; however, with
the online CPUs listed below, CPU 0 isn't available and taskset fails.

  $ cat /sys/devices/system/cpu/online
  18-19,135,142
  $ cat /sys/devices/system/cpu/present
  0-167
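
As an illustration of what picking CPU ids from such a range list
involves, here is a minimal Python sketch. The test suite's helpers are
shell functions; parse_cpu_list() is a hypothetical name used only for
this example.

```python
def parse_cpu_list(text):
    """Expand a sysfs CPU list such as "18-19,135,142" into ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


print(parse_cpu_list("18-19,135,142\n"))  # [18, 19, 135, 142]
```

Any id taken from the expanded `online` list is a valid target for
taskset, which is not true of ids derived from `present`.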

Change-Id: I06ac2e67495552f54765794b154acc9e8e9990ec
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

Tests: Preemptively fail infinite blocking tests when low on disk space

In the system tests run by LAVA, the infinite blocking tests were
hanging when the system under test ran out of disk space. This is the
expected behaviour of the failing test, but the condition can be
detected and the tests preemptively failed with a clear error of what
needs to be addressed in the system being tested.

Change-Id: I9e6126408b57c2cd5aa64c2e360e0672f9eb2314
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

Fix: lttng-add-context: leak of application context parameters

This commit backports 37d11aa60 and ca400a290, see their respective
messages.

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Iaf1c09acda09f592d5b5189ade4b3181e931cd7d

Update version to v2.12.14

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

Tests fix: test_callstack: output of addr2line incorrectly parsed

Issue observed
--------------

The test_callstack test fails with GCC 13.1 with the following output:

  Traceback (most recent call last):
  File "/usr/lib/lttng-tools/ptest/tests/regression/././kernel//../../utils/parse-callstack.py", line 160, in <module>
  main()
  File "/usr/lib/lttng-tools/ptest/tests/regression/././kernel//../../utils/parse-callstack.py", line 155, in main
  raise Exception('Expected function name not found in recorded callstack')
  Exception: Expected function name not found in recorded callstack
  ok 10 - Destroy session callstack
  PASS: kernel/test_callstack 10 - Destroy session callstack
  not ok 11 - Validate userspace callstack
  FAIL: kernel/test_callstack 11 - Validate userspace callstack

Cause
-----

parse-callstack.py uses 'split()' to split the lines of addr2line's
output. By default, 'split()' splits a string on any whitespace.
Typically this was fine as addr2line's output doesn't contain spaces and
the function then splits on new lines.

Typical output of addr2line:

  $ addr2line -e ./tests/regression/kernel//../../utils/testapp/gen-syscall-events-callstack/gen-syscall-events-callstack --functions --addresses 0x40124B
  0x000000000040124b
  my_gettid
  /tmp/test-callstack-master/src/lttng-tools/tests/utils/testapp/gen-syscall-events-callstack/gen-syscall-events-callstack.c:40

However, with the test app compiled using gcc 13.1, a "discriminator"
annotation is present:

  0x0000000000401279
  fct_b
  /tmp/test-callstack-master/src/lttng-tools/tests/utils/testapp/gen-syscall-events-callstack/gen-syscall-events-callstack.c:58 (discriminator 1)

Hence, by selecting the second to last element (-2, with negative
indexing), the addr2line function returns '(discriminator' as the
function name.

Solution
--------

The parsing code is changed to simply iterate on groups of 3 lines,
following addr2line's output format.
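
The difference between the two approaches can be sketched as follows.
This is not the code of parse-callstack.py: parse_addr2line() is a
hypothetical helper and the source path in the sample is shortened for
readability.

```python
SAMPLE = (
    "0x0000000000401279\n"
    "fct_b\n"
    "/src/gen-syscall-events-callstack.c:58 (discriminator 1)\n"
)


def parse_addr2line(output):
    """Group addr2line output into (address, function, location)."""
    lines = output.splitlines()
    records = []
    # Each resolved address produces exactly three lines of output.
    for start in range(0, len(lines) - 2, 3):
        address, function, location = lines[start:start + 3]
        records.append((address, function, location))
    return records


# Whitespace splitting picks up part of the discriminator annotation.
print(SAMPLE.split()[-2])             # (discriminator
# Line-based grouping keeps the function name intact.
print(parse_addr2line(SAMPLE)[0][1])  # fct_b
```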

Fixes #1377

Change-Id: I8c1eab97e84ca7cad171904bed6660540061cf08
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

Fix: adding a user space probe fails on thumb functions

On some architectures, calling convention details are embedded in the
symbol addresses. Uprobe requires a "clean" symbol offset (or at least,
an address where an instruction boundary would be legal) to add
instrumentation. sanitize_uprobe_offset implements that sanitization
logic on a per-architecture basis.

The least significant bit is used when branching to switch to thumb ISA.
However, it's an invalid address for us; mask the least significant bit.

We were not masking the thumb bit, thus using the wrong address offset
by one.
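
The masking itself is a one-liner. The Python sketch below borrows the
name sanitize_uprobe_offset for readability only; the real function is
C code with per-architecture logic, which is not reproduced here.

```python
def sanitize_uprobe_offset(offset):
    """Clear the least significant (thumb) bit of a symbol offset."""
    return offset & ~1


# A thumb symbol at 0x1235 is instrumented at 0x1234.
print(hex(sanitize_uprobe_offset(0x1235)))
```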

Change-Id: Iaff8ccea3a319f9d9ad80501f1beccd74d1ef56d
Signed-off-by: Olivier Dion <odion@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

Fix: Tests that assume CPU 0 is present

Add util_event_generator.sh to shellcheck test while at it.

Change-Id: I261452496827a5b9fc08b39a1132b13a39d0a7f5
Signed-off-by: Olivier Dion <odion@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

Fix: Wrong assumption about possible CPUs

A cpuset is not necessarily contiguous. Thus, we need to parse
/sys/devices/system/cpu/possible correctly.

Also, the `get_exposed_cpus_list' utility function is required for
taskset-ting on the correct CPUs available to the test environment.

Change-Id: I062ce8d311ff0e8c4b757fe6f36387e3007cfa27
Signed-off-by: Olivier Dion <odion@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

Tests: fix: parse-callback reports missing addr2line

addr2line from binutils is required for this script to work correctly.
However, the script fails silently when addr2line is missing. Fix this
by using `subprocess.run' with `check=True' instead of
`subprocess.getoutput'. That way, an exception is raised if an error
occurs.
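
A minimal sketch of the pattern, assuming a hypothetical run_checked()
wrapper (the script's actual function names are not shown here):

```python
import subprocess
import sys


def run_checked(argv):
    """Run a command and return its stdout, raising on any failure."""
    completed = subprocess.run(
        argv, check=True, capture_output=True, text=True
    )
    return completed.stdout


# A missing executable raises FileNotFoundError, and a non-zero exit
# status raises subprocess.CalledProcessError.
try:
    run_checked([sys.executable, "-c", "import sys; sys.exit(3)"])
except subprocess.CalledProcessError as error:
    print("command failed with status", error.returncode)
```

By contrast, `subprocess.getoutput' returns whatever the shell printed,
including its "command not found" message, without raising.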

Fix the shebang by not assuming where python is installed while at it.

Change-Id: I5157b3dbccf6bfbe08a6b6840b38f5db9010fe96
Signed-off-by: Olivier Dion <odion@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

Tests: snapshot tests complain that nothing is output

The user space snapshot tests invoke a common script that runs the
actual tests. In doing so, the parent script's tap generator isn't
involved in the production of the tests' tap output and outputs an error
message as diagnostic.

ok 134 - Wait after kill session daemon
# Looks like your test died before it could output anything.

`exec`-ing the common script sidesteps the problem by replacing the
shell entirely.

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I53783fa783eb5581bf41fc1a5504dbd4e14fa595

Tests: Test snapshot maximum size correctly

The minimum size for a channel is determined by:

sub-buffer-size * sub-buffer-count * possible-cpus

where sub-buffer-size is the system page size and sub-buffer-count is 2.

We set a snapshot with a maximum size of the minimum size. From there,
we need to spam that amount of events, assuming each event to be one
byte, on every online CPU. We can then ensure that the total snapshot's
size is equal to the minimum size for a channel. However, there's a
little bias if the number of possible cores is greater than the number
of online cores. In that case, the bias is one sub-buffer for each extra
ring buffer.

Change-Id: I4718e134684463789b4f7be9b12c9bf3d6cfec20
Signed-off-by: Olivier Dion <odion@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

Tests: Add get_possible_cpus_count utility

lttng-ust uses the possible number of CPUs to allocate its ring buffers.
Certain tests have to take that into consideration in their calculation
instead of relying on online processors.

Thus, add the same logic for calculating the possible CPUs on the
system.

Change-Id: I9f14afba3e4adad9547cbd9386f8e1b1b55a3253
Signed-off-by: Olivier Dion <odion@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

Fix: ini parser: truncation of value name

clang 14 reports the following:

  ini-config/ini.cpp:88:16: warning: 'char* strncpy(char*, const char*, size_t)' output may be truncated copying 49 bytes from a string of length 199 [-Wstringop-truncation]
     88 |         strncpy(dest, src, size - 1);
        |         ~~~~~~~^~~~~~~~~~~~~~~~~~~~~

Indeed, a silent truncation of `name` occurs whenever it is longer than
prev_name (49 characters, excluding the terminator).

Report an error when this condition occurs.

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I973bd27185e0130d8e4a452525d9277de45ba200

Update version to v2.12.13

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

Fix: lttng: poptGetArg doesn't provide string ownership

The string returned by poptGetArg() is 'const' because it is owned by
the popt library and is free'd by it when poptFreeContext() is called.
Copy those strings when we need to alter them to maintain proper
ownership.

The latest release of popt (v1.19) introduced a breaking
change (changing the ownership of left-over command line arguments) that
can cause double free()-s.

This is ultimately due to this upstream commit in popt 1.19:
https://github.com/rpm-software-management/popt/commit/7182e4618ad5a0186145fc2aa4a98c2229afdfa8

which is derived from a package patch:
https://src.fedoraproject.org/rpms/babeltrace/c/d48452beff87b145c038f070e7182358db04336c?branch=rawhide

Change-Id: Id2535d1534c0e47cc0747968d6dd60a587f0b810
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

Fix: relayd: missing space in trace creation logging statement

A missing space results in hard to read logging statements when a
ctf_trace is created:

[...] Created ctf_trace 15of session "hello" [...]

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I3881ddfc17556f2300f2140939c45d49e3b18d2b

Fix: never use 'no' in a command variable

Command variables may be used in the Makefiles and tests. When a command
is not found, set its variable to an empty string instead of trying to
execute the command 'no'.

Change-Id: I5429bb96599e3b7f166e52545269cd99eed758ae
Reported-by: Heng Guo <heng.guo@windriver.com>
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

Fix: consumer: snapshot: assertion on subsequent snapshot

Observed issue
==============

While a snapshot is being taken, the containing folder can disappear
unexpectedly. This can lead to the following errors, which are expected
and mostly handled fine:

PERROR - 14:47:32.002564464 [2922498/2922507]: Failed to open file relative to trace chunk file_path = "channel0_0", flags = 577, mode = 432: No such file or directory (in _lttng_trace_chunk_open_fs_handle_locked() at trace-chunk.cpp:1411)
Error: Failed to open stream file "channel0_0"
Error: Snapshot channel failed

The problem happens on the subsequent snapshot for the session:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007fbbdadb3859 in __GI_abort () at abort.c:79
#2  0x00007fbbdadb3729 in __assert_fail_base (fmt=0x7fbbdaf49588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x55c4212cfbb5 "!stream->trace_chunk", file=0x55c4212cf820 "kernel-co
#3  0x00007fbbdadc5006 in __GI___assert_fail (assertion=0x55c4212cfbb5 "!stream->trace_chunk", file=0x55c4212cf820 "kernel-consumer/kernel-consumer.cpp", line=188, function=0x55c4212cfb00 "
#4  0x000055c421268cc6 in lttng_kconsumer_snapshot_channel (channel=0x7fbbc4000b60, key=1, path=0x7fbbd37f8fd4 "", relayd_id=18446744073709551615, nb_packets_per_stream=0) at kernel-consume
#5  0x000055c42126b39d in lttng_kconsumer_recv_cmd (ctx=0x55c421b80a90, sock=31, consumer_sockpoll=0x7fbbd37fd280) at kernel-consumer/kernel-consumer.cpp:986
#6  0x000055c4212546d1 in lttng_consumer_recv_cmd (ctx=0x55c421b80a90, sock=31, consumer_sockpoll=0x7fbbd37fd280) at consumer/consumer.cpp:2090
#7  0x000055c421259963 in consumer_thread_sessiond_poll (data=0x55c421b80a90) at consumer/consumer.cpp:3281
#8  0x00007fbbdaf8b609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#9  0x00007fbbdaeb0163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

How to reproduce:

1. Set a breakpoint on snapshot_channel() inside
    src/common/ust-consumer/ust-consumer.cpp
2. When the breakpoint hits, remove the complete lttng directory
    containing the session data.
3. Continue the lttng_consumerd process from gdb.
4. In that case you see a negative return value -1 from
    consumer_stream_create_output_files() inside snapshot_channel().
5. Take another snapshot and lttng_consumerd crashes because
    of the `assert(!stream->trace_chunk)` in snapshot_channel().

    This last action does not require any breakpoint intervention.

Cause
=====

During the snapshot, the stream is assigned the channel current chunk.
It is expected that the stream does not have a chunk at this point.

The error handling is faulty here, the stream chunk must be
invalidated/reset on error to allow its reuse later on.

The problem exists for both consumer domains (user/kernel).

Solution
========

For the ust consumer, we can directly use the `error_close_stream`
label.

For the kernel consumer, the code path is slightly different since it
does not use `consumer_stream_close`. Note that `consumer_stream_close`
cannot be used as is for the kernel consumer. The current implementation
partially resembles `consumer_stream_close` at the end of the iteration.
It is extracted to its own function for easier reuse from the new
`error_finalize_stream` label.
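
The intent of the fix, resetting the stream's chunk on every exit path,
can be sketched in Python. This is an analogy for the C goto-label
cleanup, not the consumer's code: the Stream class and the
snapshot_stream() signature are hypothetical.

```python
class Stream:
    def __init__(self):
        self.trace_chunk = None


def snapshot_stream(stream, channel_chunk, create_output_files):
    # Mirrors the assertion that fired on the subsequent snapshot.
    assert stream.trace_chunk is None
    stream.trace_chunk = channel_chunk
    try:
        create_output_files(stream)
    finally:
        # Reset the chunk on success and on error so that the stream
        # can be reused by the next snapshot.
        stream.trace_chunk = None
```

Without the cleanup on the error path, a failed snapshot would leave
`trace_chunk` set and the next call would trip the assertion.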

Known drawbacks
===============

None.

Fixes: #1352
Signed-off-by: Marcel Hamer <marcel.hamer@windriver.com>
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I9fc81917b19aa436ed8e8679672648f2d5baf41a

Fix: sessiond: instance uuid is not sufficiently unique

Observed issue
==============

Tracing a cluster of machines -- all launched simultaneously -- to the
same relay daemon occasionally produces corrupted traces.

The size of packets received (as seen from the relay daemon logs) and
that of those present in the on-disk stream occasionally didn't match.

The traces were observed to all relate to the same trace UUID, but
present packet begin timestamps that were not monotonic for a given
stream.

This causes both Babeltrace 1.x and 2.x to fail to open the traces with
different error messages related to clocks.

Cause
=====

On start, the session daemon generates a UUID to uniquely identify the
sessiond instance. Since the UUID generation utils use time() to seed
the random number generator, two session daemons launched within the
same second can end up with the same instance UUID.

Since the relay daemon relies on this UUID to uniquely identify a
session daemon across its various connections, identifier clashes can
cause streams from the same `uid` or `pid` to become scrambled resulting
in corrupted traces.

Solution
========

The UUID utils now initialize their random seed using the getrandom() API
in non-blocking mode. If that fails -- most likely because the random
pool is depleted or the syscall is not available on the platform -- it
falls back to using a hash of two time readings (with nanosecond
precision), of the hostname, and the PID.
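
A minimal sketch of this seeding strategy, under stated assumptions: the
function names are hypothetical, the two time readings are assumed to
come from CLOCK_REALTIME and CLOCK_MONOTONIC, and FNV-1a is used as the
hash purely for illustration (this message does not say which hash the
fix uses).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/random.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* FNV-1a over a byte range, chained through `hash` (illustrative choice). */
static uint64_t fnv1a(uint64_t hash, const void *data, size_t len)
{
	const unsigned char *bytes = data;
	size_t i;

	assert(data || len == 0);
	for (i = 0; i < len; i++) {
		hash ^= bytes[i];
		hash *= 0x100000001b3ULL; /* 64-bit FNV prime */
	}
	return hash;
}

/* Mix one nanosecond-precision clock reading into the hash. */
static uint64_t hash_clock(uint64_t hash, clockid_t clock)
{
	struct timespec ts = { 0 };

	if (clock_gettime(clock, &ts) == 0) {
		hash = fnv1a(hash, &ts.tv_sec, sizeof(ts.tv_sec));
		hash = fnv1a(hash, &ts.tv_nsec, sizeof(ts.tv_nsec));
	}
	return hash;
}

/* Last-resort seed: two time readings, the hostname and the PID. */
static uint64_t fallback_seed(void)
{
	uint64_t hash = 0xcbf29ce484222325ULL; /* FNV offset basis */
	char hostname[256] = { 0 };
	pid_t pid = getpid();

	hash = hash_clock(hash, CLOCK_REALTIME);
	hash = hash_clock(hash, CLOCK_MONOTONIC);
	if (gethostname(hostname, sizeof(hostname) - 1) == 0) {
		hash = fnv1a(hash, hostname, strlen(hostname));
	}
	return fnv1a(hash, &pid, sizeof(pid));
}

/* Prefer getrandom() in non-blocking mode; fall back if it fails. */
static uint64_t get_seed(void)
{
	uint64_t seed;
	ssize_t ret = getrandom(&seed, sizeof(seed), GRND_NONBLOCK);

	if (ret == (ssize_t) sizeof(seed)) {
		return seed;
	}
	return fallback_seed();
}
```

The sketch omits the intermediate /dev/urandom fallback discussed under
"Known drawbacks".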

Known drawbacks
===============

This fix implements many fallbacks, each with their own caveats and we
don't have full test coverage for all of those for the moment.

This article presents the different drawbacks of using /dev/urandom vs
getrandom().

https://lwn.net/Articles/884875/

As for the pseudo-random time and configuration based fallback, it is
meant as a last resort for platforms or configurations where both
getrandom() (old kernels or non-Linux platforms) and /dev/urandom (e.g.
locked-down container) are not available. I haven't done a formal
analysis of the entropy of this home-grown method. The practical
use-case we want to enable is launching multiple virtual machines (or
containers) at roughly the same time and ensure that they don't end up
using the same sessiond UUID.

In that respect, having a different host name and minute timing changes
seems enough to prevent a UUID clash.

Using the PID as part of the hash also helps when launching multiple
session daemons simultaneously for different users.

Change-Id: I320fff7bc52752ff504643569e49fa3c02472ec2
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

Update version to v2.12.12

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

Fix: waiter: futex wait: handle spurious futex wakeups

Observed issue
==============

The waiter lttng_waiter_wait() implements a futex wait/wakeup
scheme similar to the liburcu workqueue code, which has an issue with
spurious wakeups.

A spurious wakeup on lttng_waiter_wait can cause
lttng_waiter_wait to reach label skip_futex_wait with a
waiter->state state of WAITER_WAITING, which is unexpected. It would
cause busy-waiting on WAITER_TEARDOWN state to start early. The
wait-teardown stage is done with WAIT_ATTEMPTS active attempts,
followed by attempts spaced by 10ms sleeps. I do not expect that these
spurious wakeups will cause user-observable effects other than being
slightly less efficient than it should be.

This issue will cause spurious unexpected high CPU use, but will not
lead to data corruption.

Cause
=====

From futex(2):

       FUTEX_WAIT
              Returns 0 if the caller was woken up.  Note that a  wake-up  can
              also  be caused by common futex usage patterns in unrelated code
              that happened to have previously used the  futex  word's  memory
              location  (e.g., typical futex-based implementations of Pthreads
              mutexes can cause this under some conditions).  Therefore, call‐
              ers should always conservatively assume that a return value of 0
              can mean a spurious wake-up, and  use  the  futex  word's  value
              (i.e.,  the user-space synchronization scheme) to decide whether
              to continue to block or not.

Solution
========

We therefore need to validate whether the value differs from
WAITER_WAITING in user-space after the call to FUTEX_WAIT returns 0.
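
A minimal sketch of such a check, assuming Linux and the GCC/Clang
`__atomic` builtins. The state values, `futex_wait()`, `wake_waiter()`
and `wait_until_woken()` are hypothetical and are not the lttng-tools
implementation; only the FUTEX_WAIT and FUTEX_WAKE operations are real.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical state values for this sketch. */
enum { WAITER_WAITING = 0, WAITER_WOKEN_UP = 1 };

/* Thin wrapper: sleep only while *uaddr still equals `expected`. */
static long futex_wait(int32_t *uaddr, int32_t expected)
{
	return syscall(SYS_futex, uaddr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

/* Publish the new state, then wake one sleeper. */
static void wake_waiter(int32_t *state)
{
	__atomic_store_n(state, WAITER_WOKEN_UP, __ATOMIC_RELEASE);
	syscall(SYS_futex, state, FUTEX_WAKE, 1, NULL, NULL, 0);
}

/*
 * Block until the state leaves WAITER_WAITING. A return value of 0
 * from FUTEX_WAIT is never trusted on its own: the loop condition
 * re-reads the futex word, so a spurious wakeup goes back to sleep.
 */
static void wait_until_woken(int32_t *state)
{
	while (__atomic_load_n(state, __ATOMIC_ACQUIRE) == WAITER_WAITING) {
		if (futex_wait(state, WAITER_WAITING) == 0) {
			continue; /* woken up, possibly spuriously */
		}
		/* EAGAIN: the state changed before sleeping; EINTR: signal. */
		assert(errno == EAGAIN || errno == EINTR);
	}
}
```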

Known drawbacks
===============

None.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ida9905d1f0b5d9543c8b85ecbd7d748a6f7c1c97

Fix: futex wait: handle spurious futex wakeups

Observed issue
==============

The futex futex_nto1_wait() implements a futex wait/wakeup scheme
identical to the liburcu workqueue code, which has an issue with
spurious wakeups.

A spurious wakeup on futex_nto1_wait can cause futex_nto1_wait to return
with a futex state of -1, which is unexpected.

futex_nto1_wait is used by the relayd live dispatcher thread, by the
relayd main dispatcher thread, as well as by the sessiond dispatcher
thread.

Given that following a futex_nto1_wait returning due to a spurious
wakeup futex_nto1_prepare will set the futex value to -1, things go
back to normal for the following futex_nto1_wait calls.

Therefore, the only impact of this issue is to spuriously use slightly
more CPU time than strictly required.

The effect is even shorter-lasting than in the liburcu counterparts
because futex_nto1_prepare explicitly sets the futex state to -1 rather
than using an atomic decrement, which immediately sets the state back to
a consistent state.

Cause
=====

From futex(2):

       FUTEX_WAIT
              Returns 0 if the caller was woken up.  Note that a  wake-up  can
              also  be caused by common futex usage patterns in unrelated code
              that happened to have previously used the  futex  word's  memory
              location  (e.g., typical futex-based implementations of Pthreads
              mutexes can cause this under some conditions).  Therefore, call‐
              ers should always conservatively assume that a return value of 0
              can mean a spurious wake-up, and  use  the  futex  word's  value
              (i.e.,  the user-space synchronization scheme) to decide whether
              to continue to block or not.

Solution
========

We therefore need to validate whether the value differs from -1 in
user-space after the call to FUTEX_WAIT returns 0.

Known drawbacks
===============

None.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I8a1b6aaf77b6a2f941fd4f89b61bed71cf17906b

Fix: utils: unhandled close return value

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I3f932ac53e6b39c86babe726d5ec8e9ba999e561

Fix: agent port file is o+w when launching as root

Observed issue
==============

When starting as root, the following permissions are observed:

[-rw-rw-rw-]  agent.port
[-rw-r--r--]  lttng-sessiond.pid

When starting as user:

[-rw-rw----]  agent.port
[-rw-rw-r--]  lttng-sessiond.pid

Note that despite being created by the same function,
`utils_create_pid_file`, the permissions are not the same.

Cause
=====

`get_wait_shm` manipulates the umask and does not restore it, thus
influencing the outcome of following file creations that don't enforce
specific permissions (using chmod).

Also `fopen` defaults to mode `0666 & ~umask`, thus resulting in
unnecessarily lax permissions when the session daemon is started as a
non-privileged user (umask = 0002, most of the time).

Solution
========

Mimic other call sites of umask(): modify, then revert the umask.

Open the pid and agent port files as 0644, letting the umask do its
job as necessary for those files.

Remove unnecessary umask() usage when chmod is directly used.
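
The two approaches can be sketched as follows. The helper names are
hypothetical and plain open() is used for illustration; the actual fix
touches `get_wait_shm` and `utils_create_pid_file`, whose code is not
reproduced here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a file whose mode must be exactly `mode`, whatever the
 * process umask is, then restore the umask that was in effect.
 */
static int create_with_exact_mode(const char *path, mode_t mode)
{
	mode_t old_umask;
	int fd;

	assert(path);
	old_umask = umask(0);
	fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, mode);
	umask(old_umask); /* revert, so later file creations are unaffected */
	return fd;
}

/* Request 0644 and let the umask restrict the result further. */
static int create_with_default_mode(const char *path)
{
	assert(path);
	return open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
}
```

Note that the requested mode only applies when open() actually creates
the file; an existing file keeps its permissions.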

Known drawbacks
===============

Use of umask in a multi-threaded process is not recommended. Still our
current usage is limited and mostly happens during the initialization
phase. The usage of umask() is required for the `wait_shm` since on
FreeBSD it is not possible to chmod an shm file descriptor. The default
umask would interfere here.

Discussion
==========

The usage in run-as is valid even when in no-clone mode (valgrind) since
it is the sole user of umask() following the initialization phase. When
spawned as a separate process the clearing of umask is totally valid
even if it is not ideal since we are ignoring any umask set by the user.

It seems like the current usage is the lesser evil here.

Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ie224d254714fff05f4bced471ebfa8f19eede26a

Fix: tests: don't assume sequential cpuids

On Linux, CPU ids aren't sequential if a CPU is offlined or unplugged.
Get the list of currently available CPU ids from sysfs and pick a random
one; if sysfs is not available, use the previous behavior.
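
For illustration, sysfs CPU lists use a range format such as
"0-2,4,6-7". The sketch below parses such a string; `parse_cpu_list()`
is a hypothetical helper, not the code used by the test, and reading
the sysfs file itself is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Parse a sysfs-style CPU list such as "0-2,4,6-7" into `ids`.
 * Returns the number of ids stored, or -1 if the input is malformed
 * or `ids` is too small.
 */
static int parse_cpu_list(const char *list, int *ids, size_t max_ids)
{
	const char *cursor = list;
	size_t count = 0;

	assert(list && ids);
	while (*cursor != '\0' && *cursor != '\n') {
		char *end;
		long first, last, id;

		first = last = strtol(cursor, &end, 10);
		if (end == cursor || first < 0) {
			return -1;
		}
		if (*end == '-') {
			/* A range: parse its upper bound. */
			cursor = end + 1;
			last = strtol(cursor, &end, 10);
			if (end == cursor || last < first) {
				return -1;
			}
		}
		for (id = first; id <= last; id++) {
			if (count == max_ids) {
				return -1;
			}
			ids[count++] = (int) id;
		}
		if (*end == ',') {
			end++; /* skip the separator */
		}
		cursor = end;
	}
	return (int) count;
}
```

A random available CPU can then be picked with an index in
`[0, count)` instead of assuming that ids run from 0 to N - 1.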

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ibdb63c7d036389104ac2f629827a6dce59e06983

Fix: ust-consumerd: set `hangup_flush_done` in a locked context

hangup_flush_done is updated after releasing the stream lock. This
doesn't appear to be a problem right now since this attribute is
apparently always accessed by the same thread, but it is conceptually
suspicious.

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I191cb01b02c3d96e19034e0d0e80cb7f8dff2140

Revert "Fix: sessiond: rotation trigger leak"

This reverts commit 69d5d2ae78d1835f4547a2946d226c2a7a2e513c.

The original leak does not exist in this version.

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I73f4dff67e9d98f898c2143a4562d8e38cc1e042

Fix: lttng-snapshot: use after free of max size argument

gcc 12.1.0 reports:

commands/snapshot.cpp: In function ‘int cmd_snapshot(int, const char**)’:
../../../src/common/error.hpp:139:32: error: pointer ‘max_size_arg’ may be used after ‘void free(void*)’ [-Werror=use-after-free]

Free max_size_arg on both paths.

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I3775e835e10b364f32f4797afb9c090ac4dc133c

Fix: lttng: snapshot: add-output: leak of max size parameter

==1920281==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 6 byte(s) in 1 object(s) allocated from:
#0 0x7fa95633add9 in __interceptor_malloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:145
#1 0x7fa955e90c09 (/usr/lib/libpopt.so.0+0x3c09)

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I87ce90a77d9624add0cab5d3090a7e83734da7f4

Tests: fix: lttng-create: leaked command parameter

==1853705==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 3 byte(s) in 1 object(s) allocated from:
#0 0x7fb67ee0edd9 in __interceptor_malloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:145
#1 0x7fb67e964c09 (/usr/lib/libpopt.so.0+0x3c09)

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I2f3346157cb26de6712c6e6ebd5fafa6b51fac08

Fix: sessiond: rotation trigger leak

==1801304==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 224 byte(s) in 2 object(s) allocated from:
    #0 0x7fe0f4e73fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x559fbeb64175 in zmalloc_internal ../../src/common/macros.hpp:60
    #2 0x559fbeb6a291 in lttng_trigger* zmalloc<lttng_trigger>() ../../src/common/macros.hpp:89
    #3 0x559fbeb64aa6 in lttng_trigger_create /home/jgalar/EfficiOS/src/lttng-tools/src/common/trigger.cpp:58
    #4 0x559fbe9dc417 in subscribe_session_consumed_size_rotation(ltt_session*, unsigned long, notification_thread_handle*) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/rotate.cpp:87
    #5 0x559fbe995d6f in cmd_rotation_set_schedule(ltt_session*, bool, lttng_rotation_schedule_type, unsigned long, notification_thread_handle*) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/cmd.cpp:5993
    #6 0x559fbe9fe559 in process_client_msg /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:2246
    #7 0x559fbea01378 in thread_manage_clients /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:2624
    #8 0x559fbe9ea642 in launch_thread /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/thread.cpp:68
    #9 0x7fe0f44935c1 in start_thread (/usr/lib/libc.so.6+0x8d5c1)

Indirect leak of 208 byte(s) in 2 object(s) allocated from:
    #0 0x7fe0f4e73fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x559fbeb16e21 in zmalloc_internal ../../src/common/macros.hpp:60
    #2 0x559fbeb16e31 in lttng_action_notify* zmalloc<lttng_action_notify>() ../../src/common/macros.hpp:89
    #3 0x559fbeb168a0 in lttng_action_notify_create actions/notify.cpp:135
    #4 0x559fbe9dc34b in subscribe_session_consumed_size_rotation(ltt_session*, unsigned long, notification_thread_handle*) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/rotate.cpp:80
    #5 0x559fbe995d6f in cmd_rotation_set_schedule(ltt_session*, bool, lttng_rotation_schedule_type, unsigned long, notification_thread_handle*) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/cmd.cpp:5993
    #6 0x559fbe9fe559 in process_client_msg /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:2246
    #7 0x559fbea01378 in thread_manage_clients /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:2624
    #8 0x559fbe9ea642 in launch_thread /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/thread.cpp:68
    #9 0x7fe0f44935c1 in start_thread (/usr/lib/libc.so.6+0x8d5c1)

Indirect leak of 160 byte(s) in 2 object(s) allocated from:
    #0 0x7fe0f4e73fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x559fbeb3d7a1 in zmalloc_internal ../../src/common/macros.hpp:60
    #2 0x559fbeb3fa35 in lttng_condition_session_consumed_size* zmalloc<lttng_condition_session_consumed_size>() ../../src/common/macros.hpp:89
    #3 0x559fbeb3e6fd in lttng_condition_session_consumed_size_create conditions/session-consumed-size.cpp:206
    #4 0x559fbe9dc0f1 in subscribe_session_consumed_size_rotation(ltt_session*, unsigned long, notification_thread_handle*) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/rotate.cpp:54
    #5 0x559fbe995d6f in cmd_rotation_set_schedule(ltt_session*, bool, lttng_rotation_schedule_type, unsigned long, notification_thread_handle*) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/cmd.cpp:5993
    #6 0x559fbe9fe559 in process_client_msg /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:2246
    #7 0x559fbea01378 in thread_manage_clients /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:2624
    #8 0x559fbe9ea642 in launch_thread /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/thread.cpp:68
    #9 0x7fe0f44935c1 in start_thread (/usr/lib/libc.so.6+0x8d5c1)

Indirect leak of 112 byte(s) in 2 object(s) allocated from:
    #0 0x7fe0f4e73fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x559fbeb242ad in zmalloc_internal ../../src/common/macros.hpp:60
    #2 0x559fbeb27062 in zmalloc<(anonymous namespace)::lttng_rate_policy_every_n> ../../src/common/macros.hpp:89
    #3 0x559fbeb25e9f in lttng_rate_policy_every_n_create actions/rate-policy.cpp:492
    #4 0x559fbeb168b9 in lttng_action_notify_create actions/notify.cpp:141
    #5 0x559fbe9dc34b in subscribe_session_consumed_size_rotation(ltt_session*, unsigned long, notification_thread_handle*) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/rotate.cpp:80
    #6 0x559fbe995d6f in cmd_rotation_set_schedule(ltt_session*, bool, lttng_rotation_schedule_type, unsigned long, notification_thread_handle*) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/cmd.cpp:5993
    #7 0x559fbe9fe559 in process_client_msg /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:2246
    #8 0x559fbea01378 in thread_manage_clients /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:2624
    #9 0x559fbe9ea642 in launch_thread /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/thread.cpp:68
    #10 0x7fe0f44935c1 in start_thread (/usr/lib/libc.so.6+0x8d5c1)

Indirect leak of 34 byte(s) in 2 object(s) allocated from:
    #0 0x7fe0f4e19319 in __interceptor_strdup /usr/src/debug/gcc/libsanitizer/asan/asan_interceptors.cpp:454
    #1 0x559fbeb3f603 in lttng_condition_session_consumed_size_set_session_name conditions/session-consumed-size.cpp:442
    #2 0x559fbe9dc2c4 in subscribe_session_consumed_size_rotation(ltt_session*, unsigned long, notification_thread_handle*) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/rotate.cpp:71
    #3 0x559fbe995d6f in cmd_rotation_set_schedule(ltt_session*, bool, lttng_rotation_schedule_type, unsigned long, notification_thread_handle*) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/cmd.cpp:5993
    #4 0x559fbe9fe559 in process_client_msg /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:2246
    #5 0x559fbea01378 in thread_manage_clients /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:2624
    #6 0x559fbe9ea642 in launch_thread /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/thread.cpp:68
    #7 0x7fe0f44935c1 in start_thread (/usr/lib/libc.so.6+0x8d5c1)

The rotation trigger of a session (used for size-based rotations) is
never cleaned-up. It is now cleaned up every time its condition is
hit and whenever the session is destroyed.

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I5a89341535f87b7851b548ded9838c18bd1ccb95

Tests: fix: schedule api: leak of rotation schedule list

==1769573==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 24 byte(s) in 1 object(s) allocated from:
    #0 0x7fef37a29fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x7fef37792f2f in zmalloc_internal ../../../src/common/macros.hpp:60
    #2 0x7fef3779573a in lttng_rotation_schedules* zmalloc<lttng_rotation_schedules>() ../../../src/common/macros.hpp:89
    #3 0x7fef377947cc in lttng_rotation_schedules_create /home/jgalar/EfficiOS/src/lttng-tools/src/lib/lttng-ctl/rotate.cpp:353
    #4 0x7fef37794aa0 in get_schedules /home/jgalar/EfficiOS/src/lttng-tools/src/lib/lttng-ctl/rotate.cpp:392
    #5 0x7fef377956dc in lttng_session_list_rotation_schedules /home/jgalar/EfficiOS/src/lttng-tools/src/lib/lttng-ctl/rotate.cpp:665
    #6 0x5646131713f2 in test_add_list_remove_schedule /home/jgalar/EfficiOS/src/lttng-tools/tests/regression/tools/rotation/schedule_api.c:252
    #7 0x56461317157b in test_add_list_remove_size_schedule /home/jgalar/EfficiOS/src/lttng-tools/tests/regression/tools/rotation/schedule_api.c:270
    #8 0x564613171680 in main /home/jgalar/EfficiOS/src/lttng-tools/tests/regression/tools/rotation/schedule_api.c:307
    #9 0x7fef373ae30f in __libc_start_call_main (/usr/lib/libc.so.6+0x2d30f)

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I9b7eb537d158791db76f9a7676ffeb5d4a1f2203

Fix: lttng: enable-rotation: leak of command parameter

==1759491==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 6 byte(s) in 1 object(s) allocated from:
#0 0x7fdbdc94add9 in __interceptor_malloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:145
#1 0x7fdbdc4a0c09 (/usr/lib/libpopt.so.0+0x3c09)

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I29cc6ec4390e71829107f309f162247b9be2868c

Fix: lttng: track: leaked command parameter

==1676099==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 5 byte(s) in 1 object(s) allocated from:
#0 0x7f19429d9dd9 in __interceptor_malloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:145
#1 0x7f19425342ad in poptGetNextOpt (/usr/lib/libpopt.so.0+0x82ad)

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ibafcaf42ad4f842b3fa74cf91dc5ecc8acb3487d

Fix: lttng: enable-channel: leak of popt arguments

==1245463==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 5 byte(s) in 1 object(s) allocated from:
#0 0x7fe7c494fdd9 in __interceptor_malloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:145
#1 0x7fe7c44a5c09 (/usr/lib/libpopt.so.0+0x3c09)

Arguments obtained with poptGetOptArg() must be free'd.

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I5a65ca6fbaa18f7717ea918a5bc7f42daeb1009a

Tests: fix: leak of some attributes of ltt_ust_session

==1175545==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 8696 byte(s) in 1 object(s) allocated from:
    #0 0x7efed0f39fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x55707ddc6004 in zmalloc_internal ../../../src/common/macros.hpp:60
    #2 0x55707ddceb17 in ltt_ust_session* zmalloc<ltt_ust_session>() ../../../src/common/macros.hpp:89
    #3 0x55707ddc81e7 in trace_ust_create_session(unsigned long) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/trace-ust.cpp:274
    #4 0x55707ddc2bea in test_create_one_ust_session /home/jgalar/EfficiOS/src/lttng-tools/tests/unit/test_ust_data.cpp:63
    #5 0x55707ddc4941 in main /home/jgalar/EfficiOS/src/lttng-tools/tests/unit/test_ust_data.cpp:283
    #6 0x7efed04f930f in __libc_start_call_main (/usr/lib/libc.so.6+0x2d30f)

Indirect leak of 24672 byte(s) in 1 object(s) allocated from:
    #0 0x7efed0f39fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x55707dee4ec1 in zmalloc_internal ../../../src/common/macros.hpp:60
    #2 0x55707def774e in consumer_output* zmalloc<consumer_output>() ../../../src/common/macros.hpp:89
    #3 0x55707dee90df in consumer_create_output(consumer_dst_type) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/consumer.cpp:523
    #4 0x55707ddc8821 in trace_ust_create_session(unsigned long) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/trace-ust.cpp:321
    #5 0x55707ddc2bea in test_create_one_ust_session /home/jgalar/EfficiOS/src/lttng-tools/tests/unit/test_ust_data.cpp:63
    #6 0x55707ddc4941 in main /home/jgalar/EfficiOS/src/lttng-tools/tests/unit/test_ust_data.cpp:283
    #7 0x7efed04f930f in __libc_start_call_main (/usr/lib/libc.so.6+0x2d30f)

Indirect leak of 1024 byte(s) in 1 object(s) allocated from:
    #0 0x7efed0f39fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x7efed0bf985f in alloc_split_items_count /home/jgalar/EfficiOS/src/userspace-rcu/src/rculfhash.c:688
    #2 0x7efed0bf985f in _cds_lfht_new /home/jgalar/EfficiOS/src/userspace-rcu/src/rculfhash.c:1642

Indirect leak of 656 byte(s) in 1 object(s) allocated from:
    #0 0x7efed0f39fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x7efed0bfac68 in __default_alloc_cds_lfht ../src/rculfhash-internal.h:172
    #2 0x7efed0bfac68 in alloc_cds_lfht /home/jgalar/EfficiOS/src/userspace-rcu/src/rculfhash-mm-order.c:81

Indirect leak of 48 byte(s) in 2 object(s) allocated from:
    #0 0x7efed0f39fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x7efed0bfabd4 in cds_lfht_alloc_bucket_table /home/jgalar/EfficiOS/src/userspace-rcu/src/rculfhash-mm-order.c:35
    #2 0x7efed0bfabd4 in cds_lfht_alloc_bucket_table /home/jgalar/EfficiOS/src/userspace-rcu/src/rculfhash-mm-order.c:28

Indirect leak of 24 byte(s) in 1 object(s) allocated from:
    #0 0x7efed0f39fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x55707de3a9af in zmalloc_internal ../../src/common/macros.hpp:60
    #2 0x55707de3a9bf in lttng_ht* zmalloc<lttng_ht>() ../../src/common/macros.hpp:89
    #3 0x55707de38461 in lttng_ht_new(unsigned long, lttng_ht_type) hashtable/hashtable.cpp:113
    #4 0x55707dee9340 in consumer_create_output(consumer_dst_type) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/consumer.cpp:535
    #5 0x55707ddc8821 in trace_ust_create_session(unsigned long) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/trace-ust.cpp:321
    #6 0x55707ddc2bea in test_create_one_ust_session /home/jgalar/EfficiOS/src/lttng-tools/tests/unit/test_ust_data.cpp:63
    #7 0x55707ddc4941 in main /home/jgalar/EfficiOS/src/lttng-tools/tests/unit/test_ust_data.cpp:283
    #8 0x7efed04f930f in __libc_start_call_main (/usr/lib/libc.so.6+0x2d30f)

Indirect leak of 16 byte(s) in 1 object(s) allocated from:
    #0 0x7efed0f39fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x7efed0bfac15 in cds_lfht_alloc_bucket_table /home/jgalar/EfficiOS/src/userspace-rcu/src/rculfhash-mm-order.c:31

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ib2ad82a197f2a4ccb86ae5799c1d93ff059888e3

Fix: liblttng-ctl: non-packed structure used for tracker serialization

Using unpacked structures in liblttng-ctl's protocol can cause issues
when mixing sessiond and client of different bitness. In this specific
case I doubt it causes a problem, but it could rightfully do so on some
architectures.
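
To illustrate the concern, here is a sketch with hypothetical structures
(not the ones used by the tracker serialization), using the GCC/Clang
`packed` attribute. The unpacked layout depends on how the ABI aligns
`uint64_t`: it is typically 16 bytes on x86-64 and 12 bytes on i386,
while the packed layout is 9 bytes everywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Natural layout: the padding after `type` is ABI-dependent. */
struct tracker_id_unpacked {
	uint8_t type;
	uint64_t value;
};

/* Packed layout: no padding, identical for 32-bit and 64-bit builds. */
struct tracker_id_packed {
	uint8_t type;
	uint64_t value;
} __attribute__((packed));

/* The wire format can be pinned down at compile time. */
static_assert(sizeof(struct tracker_id_packed) == 9,
		"packed layout must not contain padding");
static_assert(offsetof(struct tracker_id_packed, value) == 1,
		"value must directly follow type");
```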

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ie17096a55a4b7508c604e184cae877b83df6e451

Fix: test: lttng kernel modules still loaded after running test_clock_override

Observed issue
==============

After running test_clock_override, some lttng modules are still loaded.

$ lsmod | ag lttng
  lttng_test             32768  0
  lttng_tracer         2326528  1 lttng_test
  lttng_statedump       749568  1 lttng_tracer
  lttng_wrapper          16384  2 lttng_statedump,lttng_tracer
  lttng_uprobes          16384  1 lttng_tracer
  lttng_kprobes          16384  1 lttng_tracer
  lttng_lib_ring_buffer    61440  1 lttng_tracer
  lttng_kretprobes       16384  1 lttng_tracer
  lttng_clock_plugin_test    16384  1
  lttng_clock            16384  2 lttng_tracer,lttng_clock_plugin_test

Cause
=====

The order in which the modules are removed is important.

In `test_clock_override_timestamp` the last `modprobe --remove` order is

  modprobe --remove lttng-clock-plugin-test lttng-clock lttng-test

While the order at other call sites is:

  modprobe --remove lttng-test lttng-clock-plugin-test lttng-clock

Solution
========

Use

  modprobe --remove lttng-test lttng-clock-plugin-test lttng-clock

Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I528df2e3e90664433337a547a74cdbe476d4ee62

Update version to v2.12.11

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

Fix: notification: assert on len > 0 for dropped notification message

Observed issue
==============

Using the notification client from
doc/examples/trigger-condition-event-matches/notification-client.cpp, an
assert is hit when the notification subsystem is under load.

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f69eab58859 in __GI_abort () at abort.c:79
#2  0x00007f69eab58729 in __assert_fail_base (fmt=0x7f69eacee588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x7f69eae1d5dd "len > 0", file=0x7f69eae1d5cb "unix.cpp", line=179, function=<optimized out>) at assert.c:92
#3  0x00007f69eab6a006 in __GI___assert_fail (assertion=0x7f69eae1d5dd "len > 0", file=0x7f69eae1d5cb "unix.cpp", line=179, function=0x7f69eae1d598 "ssize_t lttcomm_recv_unix_sock(int, void*, size_t)") at assert.c:101
#4  0x00007f69eadd5fe6 in lttcomm_recv_unix_sock (sock=3, buf=0x55da9ecd5f89, len=0) at unix.cpp:179
#5  0x00007f69ead7df3f in receive_message (channel=0x55da9ecd6ee0) at channel.cpp:64
#6  0x00007f69ead7e478 in lttng_notification_channel_get_next_notification (channel=0x55da9ecd6ee0, _notification=0x7ffdefed2570) at channel.cpp:279
#7  0x000055da9e0e742f in main (argc=2, argv=0x7ffdefed2698) at notification-client.cpp:506

(gdb) frame
#5  0x00007f69ead7df3f in receive_message (channel=0x55da9ecd6ee0) at channel.cpp:64
64  ret = lttcomm_recv_unix_sock(channel->socket,

(gdb) print msg
$2 = {type = 5 '\005', size = 0, fds = 0, payload = 0x7ffdefed24a8 ""}

The msg type 5 is
`LTTNG_NOTIFICATION_CHANNEL_MESSAGE_TYPE_NOTIFICATION_DROPPED`

Cause
=====

The msg portion of a
`LTTNG_NOTIFICATION_CHANNEL_MESSAGE_TYPE_NOTIFICATION_DROPPED` is indeed
zero. There is no extra payload.

Solution
========

When the msg size is zero, skip the 'payload' reception phase.
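
A minimal sketch of this logic, with a hypothetical header layout and
hypothetical helpers (`msg_header`, `recv_all()`, `receive_message()`);
the real message structure and `lttcomm_recv_unix_sock()` are not
reproduced here. `recv_all()` keeps the `len > 0` precondition seen in
the backtrace, so the caller is the one that must skip empty payloads.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical fixed-size header preceding an optional payload. */
struct msg_header {
	uint32_t type;
	uint32_t size; /* payload size in bytes; may be zero */
};

/* Receive exactly `len` bytes; `len` must be greater than zero. */
static ssize_t recv_all(int sock, void *buf, size_t len)
{
	char *out = buf;
	size_t done = 0;

	assert(len > 0);
	while (done < len) {
		ssize_t ret = recv(sock, out + done, len - done, 0);

		if (ret <= 0) {
			return -1; /* error or peer hang-up */
		}
		done += (size_t) ret;
	}
	return (ssize_t) done;
}

/*
 * Receive a header and its payload. `*payload` is left NULL when the
 * message carries no payload. Returns 0 on success, -1 on error.
 */
static int receive_message(int sock, struct msg_header *header,
		char **payload)
{
	*payload = NULL;
	if (recv_all(sock, header, sizeof(*header)) < 0) {
		return -1;
	}
	if (header->size == 0) {
		/* Nothing follows the header: skip the payload phase. */
		return 0;
	}
	*payload = malloc(header->size);
	if (!*payload) {
		return -1;
	}
	if (recv_all(sock, *payload, header->size) < 0) {
		free(*payload);
		*payload = NULL;
		return -1;
	}
	return 0;
}
```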

Known drawbacks
===============

None.

Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ibabb922d0e410c9902414a5eabbe04738861d772

Fix: sessiond: cmd_enable_channel: negative error code used

A negative `lttng_error_code` value is returned (as an integer)
when a channel copy fails. Return a positive error code.

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I340f739cc33858a06832bb75a7a6d5e18459551f

Update version to v2.12.10

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

Fix: Remove liblttng-ctl.so dependency on liburcu-cds.so and liburcu-common.so

Observed Issue
==============

After splitting libcommon into lgpl/gpl convenience libraries,
liblttng-ctl.so still depends on liburcu-cds.so and liburcu-common.so.

Cause
=====

The default behavior, for AC_CHECK_LIB when the `action-if-found` is NOT
defined, is to prepend the library to LIBS. [1]

"
If action-if-found is not specified, the default action prepends
-llibrary to LIBS and defines ‘HAVE_LIBlibrary’ (in all capitals).
"

It is important to note that the LIBS variable is used for ALL linking.

This is normally not a problem for most distributions since they force
the use of `--as-needed` at the toolchain level (gcc specs) (for example
Debian [2]). One could also pass the `--as-needed` flag manually, but
libtool reorganizes flags in the case of shared object creation [3].

In our case, we always explicitly state the dependencies via the *_LIBADD
automake clause. We do not rely on the LIBS variable.

The current configure.ac does define what seems to be an `empty but
defined` clause for the `action-if-found` as "[]". This is not a valid
"empty but defined" `action-if-found` clause and ends up generating the
default behavior as defined in [1].

This leads to unnecessary dependencies, at link time, for most of the
shared objects generated when using a distro that does not enforce the
`--as-needed` flag on linking.

Solution
========

Define an actual no-op shell operation `:` for the `action-if-found`
parameter.

Known drawbacks
===============

None.

References
==========
[1] https://www.gnu.org/software/autoconf/manual/autoconf-2.69/html_node/Libraries.html
[2] https://salsa.debian.org/toolchain-team/gcc/-/blob/master/debian/patches/gcc-as-needed.diff
[3] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=347650

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I4ce4289f4022bd6f27f656639a451cb39b7c9a9e

Update version to v2.12.9

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

Fix: lttng-elf: wrong error label used by error path

1486805 Resource leak
The system resource will not be reclaimed and reused, reducing the future availability of the resource.

In lttng_elf_get_symbol_offset: Leak of memory or pointers to system resources (CWE-404)

Reported-by: Coverity Scan
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I78b868773b389c2eaf3f1d45151fe9416b1fe447

Fix: use the correct endian compat macros

Document which variant of the endian macros our compat header guarantees
across all platforms and fix incorrect uses.

This was discovered with -Wundef on macOS.

Change-Id: Iaf442fe5887063661273ac2a00c9fa4015e83d5c
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

Fix: lttng-elf: untrusted entry size divisor

1405557 Untrusted divisor
The divisor could be controlled by an attacker, who could cause a division by zero.

In lttng_elf_get_symbol_offset: An unscrutinized value from an untrusted source used as a divisor (CWE-369)
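
A minimal sketch of the kind of check this class of defect calls for,
assuming a section size and an entry size that both come from an untrusted
file. The function and parameter names are hypothetical and are not taken
from lttng-elf:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compute how many `entry_size`-byte entries fit in a section of
 * `section_size` bytes. Both values are assumed to be read from an
 * untrusted file.
 *
 * Returns 0 and sets *count on success, -1 if `entry_size` is zero.
 */
static int example_entry_count(uint64_t section_size, uint64_t entry_size,
		uint64_t *count)
{
	assert(count != NULL);

	/* Reject a zero divisor before it is ever used. */
	if (entry_size == 0) {
		return -1;
	}

	*count = section_size / entry_size;
	return 0;
}
```

The caller is expected to treat the -1 return as a malformed input and bail
out through its regular error path.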

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I029708a0df4f62fe0031e374d50839c26f4f3f4b

Fix: event: erroneous bound check on perf counter name size

The wrong size is used when initializing a perf counter name from a
payload. The destination size must be used to prevent out of bound
writes.
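
As an illustration only, a bounded copy driven by the destination size could
look like the sketch below. The structure, its field size, and the function
name are hypothetical; they do not reproduce the actual event code:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical destination; the field size is illustrative. */
#define EXAMPLE_NAME_LEN 16

struct example_perf_counter {
	char name[EXAMPLE_NAME_LEN];
};

/*
 * Copy a name of `src_len` bytes (terminator not included) taken from
 * an untrusted payload into `dst`.
 *
 * Returns 0 on success, -1 if the name and its terminator would not
 * fit in the destination.
 */
static int example_set_counter_name(struct example_perf_counter *dst,
		const char *src, size_t src_len)
{
	assert(dst != NULL && src != NULL);

	/* Bound the copy by the destination size, not a payload-derived one. */
	if (src_len >= sizeof(dst->name)) {
		return -1;
	}

	memcpy(dst->name, src, src_len);
	dst->name[src_len] = '\0';
	return 0;
}
```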

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I8ea41d30815bd2c02bb2ad8b01e8cecd2d6549a8

Fix: relayd: missing session unlock on error path

1475890 Missing unlock May result in deadlock if there is another
attempt to acquire the lock.

In viewer_get_new_streams(relay_connection *): Missing a release of a
lock on a path (CWE-667)

Reported-by: Coverity Scan
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I66de344f5f39ac85bf8db93cf39a07d0c6cf7694

fix: warning '-Wstringop-truncation' with GCC 11.2

Building with GCC 11.2 results in the following warning :

  In file included from ../../src/common/tracker.h:18,
                 from ../../src/bin/lttng-sessiond/trace-ust.h:17,
                 from test_ust_data.cpp:19:
../../src/common/sessiond-comm/sessiond-comm.h:569:14: note: while referencing ‘lttng_event_exclusion::names’
  569 |         char names[0][LTTNG_SYMBOL_NAME_LEN];
      |              ^~~~~
test_ust_data.cpp:209:16: warning: ‘char* strncpy(char*, const char*, size_t)’ specified bound 256 equals destination size [-Wstringop-truncation]
  209 |         strncpy(LTTNG_EVENT_EXCLUSION_NAME_AT(exclusion, 0),
      |         ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  210 |                 get_random_string(), LTTNG_SYMBOL_NAME_LEN);
      |                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
test_ust_data.cpp:211:16: warning: ‘char* strncpy(char*, const char*, size_t)’ specified bound 256 equals destination size [-Wstringop-truncation]
  211 |         strncpy(LTTNG_EVENT_EXCLUSION_NAME_AT(exclusion, 1),
      |         ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  212 |                 get_random_string(), LTTNG_SYMBOL_NAME_LEN);
      |                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Change-Id: I78eea760b4684227ee457c3368c6397d0a767af5
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

Add Log4j 2.x agent tests for the 'log4j' domain

Add integration tests for the new Log4j 2.x agent in Log4j 1.x compat
mode using the current 'log4j' domain, use the new configure switch
'--enable-test-java-agent-log4j2' to enable it.

To run only this new test, use this command :

cd tests/regression && make check TESTS="ust/java-log4j2/test_agent_log4j2_domain_log4j"

Change-Id: Id780c9ee13913c91c18548f58b14cc600e77e9fa
Depends-on: lttng-ust: I7fd5f79ad58c77175714bd4198d8ff5db2e6b846
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

Fix: relayd: session id is ignored by 2.11+ create session command

The id of the session used by the sessiond is not returned by
cmd_create_session_2_11 and its caller sets the value in the
relay_session to an uninitialized value.

Up until recently this didn't have much effect as this uninitialized
value was stored and used to perform look-ups in the trace chunk
registry, which would work.

However, the recent multi-consumer rotation fixes make this problem more
significant as this 'id' is used as a key to join relay sessions
originating from the same session daemon.

This was discovered by enabling the '-Wunused-parameter' warning.

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I7e33f5f93dc46bb630c431408c9472be3a75c030

Fix: consumerd: use-after-free of metadata bucket

Observed issue
==============

When consumer_stream_destroy() is called from, for example, the error
path in setup_metadata(), consumer_stream_free() can end up being called
twice on the same stream.  Since the stream->metadata_bucket is not set
to NULL after being destroyed, it leads to a use-after-free:

ERROR: AddressSanitizer: heap-use-after-free on address 0x604000000318
READ of size 8 at 0x604000000318 thread T7
     #0 in metadata_bucket_destroy
     #1 in consumer_stream_free
     #2 in consumer_stream_destroy
     #3 in setup_metadata
     #4 in lttng_ustconsumer_recv_cmd
     #5 in lttng_consumer_recv_cmd
     #6 in consumer_thread_sessiond_poll
     #7 in start_thread nptl/pthread_create.c:481
     #8 in clone (/lib/x86_64-linux-gnu/libc.so.6+0xfcbde)

0x604000000318 is located 8 bytes inside of 48-byte region [0x604000000310,0x604000000340)
freed by thread T7 here:
     #0 in __interceptor_free
     #1 in metadata_bucket_destroy
     #2 in consumer_stream_free
     #3 in consumer_stream_destroy
     #4 in clean_channel_stream_list
     #5 in consumer_del_channel
     #6 in consumer_stream_destroy
     #7 in setup_metadata
     #8 in lttng_ustconsumer_recv_cmd
     #9 in lttng_consumer_recv_cmd
     #10 in consumer_thread_sessiond_poll
     #11 in start_thread nptl/pthread_create.c:481

previously allocated by thread T7 here:
     #0 in __interceptor_calloc
     #1 in zmalloc
     #2 in metadata_bucket_create
     #3 in consumer_stream_enable_metadata_bucketization
     #4 in lttng_ustconsumer_set_stream_ops
     #5 in lttng_ustconsumer_on_recv_stream
     #6 in lttng_consumer_on_recv_stream
     #7 in create_ust_streams
     #8 in ask_channel
     #9 in lttng_ustconsumer_recv_cmd
     #10 in lttng_consumer_recv_cmd
     #11 in consumer_thread_sessiond_poll
     #12 in start_thread nptl/pthread_create.c:481

Thread T7 created by T0 here:
     #0 in __interceptor_pthread_create
     #1 in main
     #2 in __libc_start_main ../csu/libc-start.c:332

SUMMARY: AddressSanitizer: heap-use-after-free in metadata_bucket_destroy

This can be easily reproduced by forcing a failure during the setup
of the metadata, using the following change:

  diff --git a/src/common/ust-consumer/ust-consumer.c b/src/common/ust-consumer/ust-consumer.c
  index fa1c71299..97ed59632 100644

  --- a/src/common/ust-consumer/ust-consumer.c
  +++ b/src/common/ust-consumer/ust-consumer.c
  @@ -908,8 +908,7 @@ static int setup_metadata(struct lttng_consumer_local_data *ctx, uint64_t key)

           /* Send metadata stream to relayd if needed. */
           if (metadata->metadata_stream->net_seq_idx != (uint64_t) -1ULL) {
  -                ret = consumer_send_relayd_stream(metadata->metadata_stream,
  -                                metadata->pathname);
  +                ret = -1;
                   if (ret < 0) {
                           ret = LTTCOMM_CONSUMERD_ERROR_METADATA;
                           goto error;

Cause
=====

Channels have a list of streams that are being "setup" and are not
yet monitored for consumption. During this setup phase, the streams are
owned by the channel. On destruction of the channel, any stream in that
list will thus be cleaned-up.

When destroying a consumer stream, a reference to its channel is 'put'.
This can result in the destruction of the channel.

In the situation described above, the release of the channel's reference
is done before the stream is removed from the channel's stream list.
This causes the channel's clean-up to invoke (again) the current
stream's clean-up, resulting in the double-free of the metadata bucket.

This problem is present in a number of error paths.

Solution
========

Some error paths already manually removed the consumer stream from its
channel's stream list before invoking consumer_stream_destroy(). The
various error paths that have to deal with this possible situation are
changed to simply invoke consumer_stream_destroy().

consumer_stream_destroy() is modified to always remove the stream from
its channel's list before performing the rest of the clean-up. This
ensures that those double clean-ups can't occur.
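
The ordering can be sketched with a simplified model, assuming a channel
that owns a singly-linked list of streams being set up and a stream that
holds one reference on its channel. All names are hypothetical, and the
`destroy_count` field only exists to make the sketch observable:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct example_channel;

struct example_stream {
	struct example_channel *chan;
	struct example_stream *next;	/* Link in the channel's setup list. */
	int *destroy_count;		/* Illustration only: counts clean-ups. */
};

struct example_channel {
	int refcount;
	struct example_stream *setup_list;
};

static void example_stream_destroy(struct example_stream *stream);

/* Drop a reference; the last 'put' cleans up streams still in setup. */
static void example_channel_put(struct example_channel *chan)
{
	if (--chan->refcount > 0) {
		return;
	}

	while (chan->setup_list) {
		struct example_stream *stream = chan->setup_list;

		chan->setup_list = stream->next;
		/* The channel is going away: nothing left to 'put'. */
		stream->chan = NULL;
		example_stream_destroy(stream);
	}

	free(chan);
}

static void example_stream_destroy(struct example_stream *stream)
{
	struct example_channel *chan;

	assert(stream != NULL);
	chan = stream->chan;
	if (chan) {
		struct example_stream **link = &chan->setup_list;

		/* 1. Always unlink the stream from its channel's list first. */
		while (*link && *link != stream) {
			link = &(*link)->next;
		}
		if (*link) {
			*link = stream->next;
		}
		stream->chan = NULL;

		/* 2. Only then release the reference on the channel. */
		example_channel_put(chan);
	}

	(*stream->destroy_count)++;
	free(stream);
}
```

With steps 1 and 2 swapped, the channel's clean-up would find the stream
still in its list and destroy it a second time.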

Drawbacks
=========

None.

Reported-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Tested-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ibeca9b675b86fc46be3f57826f7158de4da43df8

Fix: ust-consumerd: leak of stream control structure

The following leak is reported by LeakSanitizer when
setup_metadata() fails to send the metadata stream to the relay
daemon:

  ==3050181==ERROR: LeakSanitizer: detected memory leaks

  Direct leak of 240 byte(s) in 5 object(s) allocated from:
      #0 0x7f5fce02cfb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
      #1 0x7f5fcdd95a7a in zmalloc ../../../src/common/macros.h:23
      #2 0x7f5fcdd95a7a in lttng_ust_ctl_create_stream /home/jgalar/EfficiOS/src/lttng-ust/src/lib/lttng-ust-ctl/ustctl.c:1649

A consumer stream can have an allocated
`struct lttng_ust_ctl_consumer_stream *` (ustream) even if it is
not globally visible at the time of its teardown.

In the case of the user space consumer, the only site that creates
consumer stream instances ensures that the allocation of the
lttng_ust_ctl_consumer_stream succeeded, ensuring that the
consumer stream's 'ustream' is always set.

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ia4be7969e85bd8812ae13b042e1e100812a63c1d

Fix: liblttng-ctl: erroneous flat size computation

compute_flattened_size() erroneously computes (over-estimates) the size
of the allocation required to hold the flat array of struct lttng_event
returned to the user by lttng_list_{events, syscalls, tracepoints}.

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I0a80ef0fa66428b7df34303804a024e80b635c69

Build fix: relayd: mismatching function signatures

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I4ba2e7cacbe4357ad0cdbe1dc2c85894f8bedc27

Build fix: use of mutable keyword in C file

4045de280 is a backport of a fix against a C++ file in which `mutable`
is used to allow an ASSERT_LOCKED check. Remove the use of mutable and
make session_has_ongoing_rotation() non-const.

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Id6317ad5c271f08701c9545827f75010f5e2e55d

Fix: relayd: connection abruptly closed on viewer stream creation failure

Commit fe88e5175 explains (and fixes) an issue that could cause the
creation of viewer streams to fail. Currently, the error path causes the
relay daemon to abruptly close the connection to its live viewer peer.
This behaviour makes it impossible for the viewer to determine if an
error occurred or if the network connection simply failed.

Returning an `LTTNG_VIEWER_NEW_STREAMS_ERR` status code allows the
viewer to report a precise error. The viewer connection is closed since
the internal error is unlikely to be recoverable.

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I557a8b66c1bd1c0bf361cfbabe962d8a6808f4f4

Fix: relayd: live client fails on clear of multi-domain session

Observed issue
==============

Two test cases of the clear/test_ust test suite occasionally fail in the
integration jobs testing cross-bitness (32/64) LTTng deployments.

Babeltrace fails with the following error when a clear occurs while a
live client consumes a trace:

  02-28 16:55:03.262 32362 32362 E PLUGIN/SRC.CTF.LTTNG-LIVE/VIEWER lttng_live_recv@viewer-connection.c:198 [lttng-live] Remote side has closed connection
  02-28 16:55:03.262 32362 32362 E PLUGIN/SRC.CTF.LTTNG-LIVE/VIEWER lttng_live_session_get_new_streams@viewer-connection.c:1706 [lttng-live] Error receiving get new streams reply
  02-28 16:55:03.262 32362 32362 E PLUGIN/SRC.CTF.LTTNG-LIVE lttng_live_msg_iter_next@lttng-live.c:1665 [lttng-live] Error preparing the next batch of messages: live-iter-status=LTTNG_LIVE_ITERATOR_STATUS_ERROR
  02-28 16:55:03.262 32362 32362 W LIB/MSG-ITER bt_message_iterator_next@iterator.c:864 Component input port message iterator's "next" method failed: iter-addr=0x55eab7eb1170, iter-upstream-comp-name="lttng-live", iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=SOURCE, iter-upstream-comp-class-name="lttng-live", iter-upstream-comp-class-partial-descr="Connect to an LTTng relay daemon", iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR
  02-28 16:55:03.262 32362 32362 E PLUGIN/FLT.UTILS.MUXER muxer_upstream_msg_iter_next@muxer.c:454 [muxer] Upstream iterator's next method returned an error: status=ERROR
  02-28 16:55:03.262 32362 32362 E PLUGIN/FLT.UTILS.MUXER validate_muxer_upstream_msg_iters@muxer.c:991 [muxer] Cannot validate muxer's upstream message iterator wrapper: muxer-msg-iter-addr=0x55eab7eb1120, muxer-upstream-msg-iter-wrap-addr=0x55eab7eb3a70
  02-28 16:55:03.262 32362 32362 E PLUGIN/FLT.UTILS.MUXER muxer_msg_iter_next@muxer.c:1415 [muxer] Cannot get next message: comp-addr=0x55eab7eb0470, muxer-comp-addr=0x55eab7eb0510, muxer-msg-iter-addr=0x55eab7eb1120, msg-iter-addr=0x55eab7eb0fb0, status=ERROR
  02-28 16:55:03.262 32362 32362 W LIB/MSG-ITER bt_message_iterator_next@iterator.c:864 Component input port message iterator's "next" method failed: iter-addr=0x55eab7eb0fb0, iter-upstream-comp-name="muxer", iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=FILTER, iter-upstream-comp-class-name="muxer", iter-upstream-comp-class-partial-descr="Sort messages from multiple inpu", iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR
  02-28 16:55:03.262 32362 32362 W LIB/GRAPH consume_graph_sink@graph.c:473 Component's "consume" method failed: status=ERROR, comp-addr=0x55eab7eb0760, comp-name="pretty", comp-log-level=WARNING, comp-class-type=SINK, comp-class-name="pretty", comp-class-partial-descr="Pretty-print messages (`text` fo", comp-class-is-frozen=1, comp-class-so-handle-addr=0x55eab7ebd910, comp-class-so-handle-path="/root/workspace/joraj_integration_base_job/deps-64/build/lib/babeltrace2/plugins/babeltrace-plugin-text.so", comp-input-port-count=1, comp-output-port-count=0
  02-28 16:55:03.262 32362 32362 E CLI cmd_run@babeltrace2.c:2548 Graph failed to complete successfully

  ERROR:    [Babeltrace CLI] (babeltrace2.c:2548)
    Graph failed to complete successfully
  CAUSED BY [libbabeltrace2] (graph.c:473)
    Component's "consume" method failed: status=ERROR, comp-addr=0x55eab7eb0760,
    comp-name="pretty", comp-log-level=WARNING, comp-class-type=SINK,
    comp-class-name="pretty", comp-class-partial-descr="Pretty-print messages
    (`text` fo", comp-class-is-frozen=1, comp-class-so-handle-addr=0x55eab7ebd910,
    comp-class-so-handle-path="/root/workspace/joraj_integration_base_job/deps-64/build/lib/babeltrace2/plugins/babeltrace-plugin-text.so",
    comp-input-port-count=1, comp-output-port-count=0
  CAUSED BY [libbabeltrace2] (iterator.c:864)
    Component input port message iterator's "next" method failed:
    iter-addr=0x55eab7eb0fb0, iter-upstream-comp-name="muxer",
    iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=FILTER,
    iter-upstream-comp-class-name="muxer",
    iter-upstream-comp-class-partial-descr="Sort messages from multiple inpu",
    iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR
  CAUSED BY [muxer: 'filter.utils.muxer'] (muxer.c:991)
    Cannot validate muxer's upstream message iterator wrapper:
    muxer-msg-iter-addr=0x55eab7eb1120,
    muxer-upstream-msg-iter-wrap-addr=0x55eab7eb3a70
  CAUSED BY [muxer: 'filter.utils.muxer'] (muxer.c:454)
    Upstream iterator's next method returned an error: status=ERROR
  CAUSED BY [libbabeltrace2] (iterator.c:864)
    Component input port message iterator's "next" method failed:
    iter-addr=0x55eab7eb1170, iter-upstream-comp-name="lttng-live",
    iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=SOURCE,
    iter-upstream-comp-class-name="lttng-live",
    iter-upstream-comp-class-partial-descr="Connect to an LTTng relay daemon",
    iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR
  CAUSED BY [lttng-live: 'source.ctf.lttng-live'] (lttng-live.c:1665)
    Error preparing the next batch of messages:
    live-iter-status=LTTNG_LIVE_ITERATOR_STATUS_ERROR
  CAUSED BY [lttng-live: 'source.ctf.lttng-live'] (viewer-connection.c:1706)
    Error receiving get new streams reply
  CAUSED BY [lttng-live: 'source.ctf.lttng-live'] (viewer-connection.c:198)
    Remote side has closed connection

Looking at the relay daemon logs, we see the following error:
  DBG1 - 16:55:03.262106718 [32139/32146]: Adding new file "ust/pid/gen-ust-events-32373-20220228-165503/chan_0" to trace chunk "(unnamed)" (in lttng_trace_chunk_add_file() at trace-chunk.cpp:1310)
  PERROR - 16:55:03.262133333 [32139/32146]: Failed to open fs handle to ust/pid/gen-ust-events-32373-20220228-165503/chan_0, open() returned: No such file or directory (in fd_tracker_open_fs_handle() at fd-tracker/fd-tracker.cpp:548)

Cause
=====

Adding more debugging logging allows us to see the following situation
takes place:

- relay thread: Create trace chunk on session 1.
- live thread: get new streams against session 1, returns NO_NEW_STREAMS
  since the session has an 'ongoing_rotation'.
- live thread: get new streams against session 2, sees no rotation
  ongoing and attempts to open `chan_0` when creating a viewer stream

The "ongoing rotation" check was introduced in a7ceb342d and, in a
nutshell, prevents live viewers from creating new viewer streams during
a rotation.

The "ongoing rotation" state is entered when a CREATE_NEW_TRACE_CHUNK
command is issued against a session.

However, this presumes that a relay_session maps 1:1 to a session on the
session daemon's end. This isn't the case as, in multi-domain
scenarios (tracing 32-bit, 64-bit, and kernel events), a single session
daemon session can map to multiple relay_session objects. This is
because the consumer daemons maintain independent connections to the
relay daemon.

To synchronize rotations across related relay_session instances, the
relay daemon uses the same trace chunk instances across relay_session
instances. This means that while a trace chunk is created against a
specific relay session, it can be used by other relay_session instances.

To manage shared trace chunks between relay_sessions, the relay daemon
makes use of the trace_chunk_registry. This registry allows
relay_sessions to share trace chunk instances using a unique key tuple:
  - session daemon instance uuid,
  - session daemon session id,
  - trace chunk id.

There is no equivalent mechanism to track the "ongoing_rotation" state
across relay_sessions originating from the same sessiond session.

In the current scenario, this causes the live client to correctly see
that no new streams are available for session 1 (say, the 32-bit user
space session). Unfortunately, this state is not entered for other
sessions (64-bit and kernel relay sessions). Hence, the viewer succeeds
in acquiring new streams from session 2, exposing the race the 'ongoing
rotation' state aims to protect against.

Solution
========

Like the trace chunk instances, the "ongoing rotation" state must be
shared across relay sessions that originate from the same session
daemon session.

To "emulate" this shared state, session_has_ongoing_rotation() checks
if any relay session originating from the same sessiond session
has an ongoing rotation. If it is the case, we temporarily prevent
live viewers from acquiring new streams.
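
A minimal sketch of that check, assuming the relay sessions are available
as a flat array and that a sessiond session is identified by the sessiond
instance uuid and the sessiond session id. The structure and function names
are hypothetical, and the per-session locking discussed under "Known
drawbacks" is left out:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, reduced view of a relay session. */
struct example_relay_session {
	uint8_t sessiond_uuid[16];
	uint64_t sessiond_session_id;
	bool ongoing_rotation;
};

/* Two relay sessions are related if they share the same sessiond session. */
static bool example_same_origin(const struct example_relay_session *a,
		const struct example_relay_session *b)
{
	return a->sessiond_session_id == b->sessiond_session_id &&
		memcmp(a->sessiond_uuid, b->sessiond_uuid,
			sizeof(a->sessiond_uuid)) == 0;
}

/*
 * Returns true if `session`, or any relay session originating from the
 * same sessiond session, has an ongoing rotation.
 */
static bool example_has_ongoing_rotation(
		const struct example_relay_session *session,
		const struct example_relay_session *all, size_t count)
{
	size_t i;

	assert(session != NULL);
	if (session->ongoing_rotation) {
		return true;
	}

	for (i = 0; i < count; i++) {
		if (all[i].ongoing_rotation &&
				example_same_origin(session, &all[i])) {
			return true;
		}
	}

	return false;
}
```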

Known drawbacks
===============

session_has_ongoing_rotation() iterates over all sessions, acquiring
their lock in the process, which is certainly undesirable from a
performance standpoint.

Optimizing this is not a great challenge, but is beyond the scope
of this immediate fix.

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I457a32fa497d151ca933c25433c80665268a7c1c

Tests: fix: select_poll_epoll: test assumes epoll fd value

The test currently assumes that epoll fds are always == 3, which
is not always the case depending on the execution environment.

This change causes `select_poll_epoll` to produce a JSON file
containing the application's pid and epoll fd values that is
then used by the validation script.

Note that the test is converted to C++ to allow the use of
internal utils (common/error.h/cpp) without changing their linkage.

However, the code is still regular C to ease the backport of this
fix.

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ie373c63f6e6b9267ae2d785c9f0a532a5de37905

Fix: rotation: hang on destroy when using scheduled rotation based on timer

Observed issue
==============

The following scenario results in a hang for `lttng destroy`:

lttng create test
lttng enable-event -u -a
lttng enable-rotation --timer 100000
lttng start
lttng stop
lttng start
lttng destroy

Cause
=====

There is an imbalance in how many times we start the rotation timer.

The rotation timer is only removed on `lttng destroy` or when disabling
a time-based-rotation. On the other hand, the timer is "started"
on `lttng start` and when enabling a time based rotation.

The imbalance emerging from a start/stop/start sequence would prevent the
teardown of the session object since each time the timer is started a
reference to the session is held.

Solution
========

Do not start the rotation schedule timer if it was already launched.
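
The balance between starts and stops can be sketched as follows, assuming a
session that tracks whether its rotation timer is running and that holds
one reference for as long as the timer is running. The names are
hypothetical and the actual timer handling is omitted:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced view of a session. */
struct example_session {
	int refcount;
	bool rotate_timer_enabled;
};

/* Start the timer; a reference is taken only if it was not running. */
static void example_timer_start(struct example_session *session)
{
	assert(session != NULL);
	if (session->rotate_timer_enabled) {
		/* Already launched: do not take a second reference. */
		return;
	}

	session->refcount++;
	session->rotate_timer_enabled = true;
}

/* Stop the timer and drop the reference taken when it was started. */
static void example_timer_stop(struct example_session *session)
{
	assert(session != NULL);
	if (!session->rotate_timer_enabled) {
		return;
	}

	session->rotate_timer_enabled = false;
	session->refcount--;
}
```

A start/stop/start sequence at the CLI level then maps to at most one
outstanding reference, which a single timer stop on destroy releases.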

Known drawbacks
===============

None.

Change-Id: Ic5b8938166358fe7629187bebdf02a09e90846c0
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

Fix: consumerd: fd still open after `lttng snapshot record` returns

Observed issue
==============

Using a snapshot output located on a pramfs mount:

  lttng snapshot record
  rm -rf /my_mount/my_trace_output

`rm` fails with ENOTEMPTY on rmdir for /my_mount/my_trace_output.

At that point, the lttng-consumerd daemon has an open fd on:
  /my_mount/my_trace_output/ust

Note that a sleep between both commands "fixes" the issue.

Cause
=====

The reclaim for the in-registry trace chunks can happen after the LTTng
CLI returns since we use `call_rcu`.

```
static
void lttng_trace_chunk_release(struct urcu_ref *ref)

....

  if (chunk->in_registry_element) {
    struct lttng_trace_chunk_registry_element *element;

    element = container_of(chunk, typeof(*element), chunk);
    if (element->registry) {
      rcu_read_lock();
      cds_lfht_del(element->registry->ht, &element->trace_chunk_registry_ht_node);
      rcu_read_unlock();
->    call_rcu(&element->rcu_node, free_lttng_trace_chunk_registry_element);
   } else {

```

The delayed reclaim of the `lttng_trace_chunk_registry_element` can
result in lttng-consumerd holding an open fd for the "chunk directory"
of the chunk since the close() is only done during the "*fini" phase of
the chunk (`lttng_trace_chunk_fini`).

Solution
========

Considering that the rcu lookup+refcount access scheme is used for the
trace chunk object and that at that point the refcount for the trace
chunk object is effectively zero, we can move the
`lttng_trace_chunk_fini` safely outside of the
`free_lttng_trace_chunk_registry_element` call_rcu call.

Known drawbacks
===============

Even if this solves the current situation, it is important to note that
the actual object holding the reference is itself refcounted and only
closes the fd on release. This means that we are still exposed to this
problem if at some point the directory handle is shared and outlives the
trace chunk for some reason in the future.

Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I6da3948824bf8b092fc8248b1bb0263fdd5887be

Fix: event: unchecked return value for lttng_strncpy

Reported by Coverity:

CID 1475806: Unchecked return value (CHECKED_RETURN)
Calling lttng_strncpy without checking return value.

Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ib2b16bef61bcc24fea381c18fd8f914e9a84a6e7

Fix: event: format specifier for ssize_t

Observed issue
==============

  In file included from event.cpp:15:
  event.cpp: In function ‘ssize_t lttng_event_create_from_payload(lttng_payload_view*, lttng_event**, lttng_event_exclusion**, char**, lttng_bytecode**)’:
  ../../src/common/error.h:191:28: warning: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 4 has type ‘ssize_t’ {aka ‘int’} [-Wformat=]
    191 |  __lttng_print(PRINT_WARN, "Warning: " fmt "\n", ## args)
        |                            ^~~~~~~~~~~~~~~~~~~~
  ../../src/common/error.h:139:51: note: in definition of macro ‘__lttng_print’
    139 |    fprintf((type) == PRINT_MSG ? stdout : stderr, fmt, ## args); \
        |                                                   ^~~
  event.cpp:624:4: note: in expansion of macro ‘WARN’
    624 |    WARN("Userspace probe location from the received buffer is not the advertised length: header length = %" PRIu32 ", payload length = %lu", event_comm->userspace_probe_location_len, ret);
        |    ^~~~

Solution
========

Albeit there is no "canonical" way of printing ssize_t, use '%zd' since
we already make use of it elsewhere.
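
For illustration, a reduced example of the specifier in use. The helper
name and the message are hypothetical and do not reproduce the WARN() call
above:

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * Format a ssize_t with '%zd'. Returns what snprintf() returns: the
 * number of characters that the full message requires, terminator
 * excluded.
 */
static int example_format_len(char *buf, size_t buf_len, ssize_t payload_len)
{
	assert(buf != NULL);
	return snprintf(buf, buf_len, "payload length = %zd", payload_len);
}
```

Unlike '%lu', this does not depend on `long` and `ssize_t` having the same
width, which is what triggered the warning on the 32-bit build shown above.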

Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Id41e6ccf07bd580813f169b65d281a4fa305fb48

Fix: sessiond: lttng_channel object is not reclaimed

The reclaim is also missing on the error path.

Reported by Coverity:

CID 1475807: Resource leak (RESOURCE_LEAK)
Variable channel going out of scope leaks the storage it points to.

Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I9ae59900f6bc33282b8b8a65b834a718428a7f50

Fix: common: local_attr might leak

Reported by Coverity:

    CID 1475808: Resource leak (RESOURCE_LEAK)
    Variable local_attr going out of scope leaks the storage it points to.

    CID 1475810: Resource leak (RESOURCE_LEAK)
    Variable local_attr going out of scope leaks the storage it points to.

Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ibe2ee6516618827b6f52676d9a9f7d510cadf1bc

Fix: session: end goto label used for error path

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I1e5137c2de6c7b821485c9571ba24ee1036442e9

Fix: ser/des: missing null terminator on payload append

Reported by the lttng-ust-java-tests_master_build CI job.

Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I47b048c65a9855347c9b02fe44d9fdbae7e06dbf

Fix: test: test plan/skip must come after TAP initialization

TAP is initialized during the sourcing of `utils.sh`. Calling the plan or
skip functions before that yields nothing.

Move the platform validation after the `PLAN_TEST` call.

Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ib7b4434440ed7590b39b0580ea294045ace40495

Fix: test: missing test plan for non-x86/arm platforms (i.e powerpc)

Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I9a34a0b45fd011943436ce6e45c8f0d1666df26f

Fix: liblttng-ctl comm: lttng_event_field is not packed

Observed issue
==============

For MI testing where the lttng-sessiond is 64 bit and the lttng CLI is
32 bit, the tracepoint field listing fails with partial garbage output.

The size of the struct differs between bitness for x86-64 and x86
leading to serialization/deserialization problem across client
(liblttng-ctl) and lttng-sessiond.

sizeof(struct lttng_event_field):

x86: 1136
x86-64: 1144

The struct cannot be marked as LTTNG_PACKED since it is part of the API.

Solution
========

Adopt a similar pattern to the new APIs with a "serialize" &
"create_from_buffer" approach. The only particularity is that we need to
flatten the event_field on listing.

Most of the complexity is moved to `src/common/event.c`

Known drawbacks
===============

None.

Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I280d9809d110237574e2606ee93a7aeba41e704e

Fix: liblttng-ctl comm: lttng_event_context is not packed

Observed issue
==============

The size of the struct differs between bitness for x86-64 and x86
leading to serialization/deserialization problem across client
(liblttng-ctl) and lttng-sessiond.

sizeof(struct lttng_event_context):

x86: 308
x86-64: 312

The struct cannot be marked as LTTNG_PACKED since it is part of the API.

Solution
========

Adopt a similar pattern to the new API with a "serialize" &
"create_from_buffer" approach.

Most of the complexity is moved to `src/common/event.c`

Known drawbacks
===============

None.

Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ieb400eab2a2df4070ff51cb2b44929d3ea945ce4

Fix: liblttng-ctl comm: lttng_event is not packed

Observed issue
==============

In `lttcomm_session_msg` the lttng_event struct is marked as
LTTNG_PACKED. This statement has no effect as explained in commit [1].

Solution
========

Adopt a similar pattern to the new API with a "serialize" &
"create_from_buffer" approach.

Most of the complexity is moved to `src/common/event.c`

Known drawbacks
===============

None.

Reference
=========
[1] https://review.lttng.org/gitweb?p=lttng-tools.git;a=commit;h=7bd95aee4660c6419a4a65429fc27754481e7e90

Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I35d848519dacb2b119324e88f262aa95951e4ac6

libcommon: move event.c to libcommon-lgpl

The `event.c` license is already LGPL. There is no technical reason why
it was not part of the lgpl side of libcommon, simply that nothing that
is LGPL needed it. This will change in upcoming commits with the
addition of ser/des functions of `struct lttng_event` and other structs
related to `lttng_event` for liblttng-ctl.

Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I574e9053645b768c37505f1f27e18e3da69c772a

Fix: liblttng-ctl comm: lttng_channel is not packed

Observed issue
==============

The size of the struct differs between bitness for x86-64 and x86
leading to serialization/deserialization problem across client
(liblttng-ctl) and lttng-sessiond.

sizeof(struct lttng_channel):

x86: 608
x86-64: 624

The struct cannot be marked as LTTNG_PACKED since it is part of the API.

Solution
========

Adopt a similar pattern to the new API with a "serialize" &
"create_from_buffer" approach. The only particularity is that we need to
flatten the channels on listing.

Most of the complexity is moved to `src/common/channel.c`

Known drawbacks
===============

None.

Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I59b0e92286e36c4f183d2950417cb180d4efc200

Fix: lttng-ctl: erroneous uses of LTTNG_PACKED

The LTTNG_PACKED macro uses gcc attributes to indicate that a structure
should be packed. Hence, this macro obeys the same rules as the gcc
attribute.

Various misuses of the LTTNG_PACKED macro may result in structures not
being packed:
  - The LTTNG_PACKED macro should always be placed _before_ an identifier
    when a structure is declared in-place.
  - Adding LTTNG_PACKED at the definition site has no effect if the
    structure was declared elsewhere.

Those misuses cause issues when mixing the bitness (32/64) of the
session daemon and liblttng-ctl.
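
The first misuse can be sketched as follows. `DEMO_PACKED` is a
stand-in that is assumed to expand to the same gcc attribute as
LTTNG_PACKED, and `struct demo_msg` is a hypothetical structure. The
sizes in the comments assume a common ABI where `uint32_t` is 4-byte
aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for LTTNG_PACKED (assumed to be the gcc packed attribute). */
#define DEMO_PACKED __attribute__((__packed__))

struct demo_msg {
	/* Attribute before the identifier: applies to the struct type. */
	struct {
		char a;
		uint32_t b;
	} DEMO_PACKED packed_member;

	/*
	 * Attribute after the identifier: applies to the member only, the
	 * in-place struct keeps its internal padding.
	 */
	struct {
		char a;
		uint32_t b;
	} unpacked_member DEMO_PACKED;
};

/* 5 bytes: no padding between `a` and `b`. */
static size_t demo_packed_member_size(void)
{
	return sizeof(((struct demo_msg *) 0)->packed_member);
}

/* 8 bytes: 3 bytes of padding remain between `a` and `b`. */
static size_t demo_unpacked_member_size(void)
{
	return sizeof(((struct demo_msg *) 0)->unpacked_member);
}
```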

Outstanding issues include the following structures that are not
tagged as LTTNG_PACKED:
  - struct lttng_event
  - struct lttng_channel
  - struct lttng_event_context

Unfortunately, those structures are exposed by the public API and
can't be tagged as being "packed". Doing so would break the ABI
of liblttng-ctl.

These structures should be packed/unpacked explicitly.

JORAJ CHERRY PICK: Change-Id: I095dc0dffc6bf9e15dc7a7ec797958a5780ef150

Change-Id: I33f026a613813af312de65adee23ac829a071cf8
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

Backport: lttng_buffer_view_is_valid, lttng_buffer_view_contains_string

Change-Id: I23892e6f20e2847037ba16ad1bba8b2d5a9e0a7c
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

Fix: conversion from KB to bytes overflow on arm32

Observed issue
==============

When enabling a channel, the available memory check fails on arm32 when
the available memory, in bytes, is larger than 2^32.

Cause
=====

`read_proc_meminfo_field` converts the read value (in KB) to bytes and
stores it into a size_t variable.

On the system running the reproducer, the value of the `value_kb`
variable is 4839692, yielding an overflow when multiplied by 1024 since
`size_t` is 32 bits long. `size_t` can be larger in certain situations
(i.e. LARGEFILE) but this is irrelevant to the problem at hand.

Solution
========

Convert all the checks to use uint64_t.
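
To make the arithmetic concrete: 4839692 * 1024 = 4955844608, which is
larger than 2^32 = 4294967296, so a 32-bit result wraps around to
660877312. The sketch below shows a 64-bit conversion with a
division-based overflow check. The helper names are hypothetical and
the check is a common idiom, not necessarily the exact scheme used in
the tree.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_BYTES_PER_KB UINT64_C(1024)

/*
 * Hypothetical helper: convert KB to bytes using 64-bit arithmetic.
 * Returns 0 and sets *bytes on success, -1 if the product overflows.
 */
static int demo_kb_to_bytes(uint64_t value_kb, uint64_t *bytes)
{
	if (value_kb > UINT64_MAX / DEMO_BYTES_PER_KB) {
		return -1;
	}

	*bytes = value_kb * DEMO_BYTES_PER_KB;
	return 0;
}

/* What a 32-bit unsigned type ends up holding: the product modulo 2^32. */
static uint32_t demo_kb_to_bytes_truncated(uint32_t value_kb)
{
	return value_kb * UINT32_C(1024);
}
```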

Known drawbacks
===============

None.

References
==========

The multiplication overflow check scheme is borrowed from
`src/common/time.c`

Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I76fe7e57da149c1b4c230a411e0346ba6f9ef7b1

Fix: Unexpected payload size in cmd_recv_stream_2_11

Observed issue
==============

For the following scenario:

lttng-relayd: 64 bit
lttng-sessiond: 64 bit
lttng-consumerd: 32 bit
application: 32 bit

Commands
   lttng create --set-url=net://127.0.0.1
   lttng enable-event -u -a
   lttng start
   ./application

On application start the lttng-relayd reports this error:

  DEBUG1 - 14:16:38.216442600 [2004731/2004735]: Done receiving control command payload: fd = 19, payload size = 4376 bytes (in relay_process_control_receive_payload() at main.c:3456)
  DEBUG3 - 14:16:38.216469462 [2004731/2004735]: Processing "RELAYD_ADD_STREAM" command for socket 19 (in relay_process_control_command() at main.c:3327)
  Error: Unexpected payload size in "cmd_recv_stream_2_11": expected >= 3519925694 bytes, got 4376 bytes

Cause
=====

In `relayd_add_stream`, instead of taking the > 2.11 protocol path, the
`relayd_add_stream_2_2` function is called.

The values of the rsock version numbers are:

  major: 21845
  minor: 2

Which is simply invalid since we know that the version should be 2.12.

The relayd sock version numbers are set during the
LTTNG_CONSUMER_ADD_RELAYD_SOCKET command between the lttng-sessiond and
the lttng-consumerd process. It is important to note here that both
processes do NOT have the same bitness.

The serialization and deserialization of `struct lttcomm_relayd_sock` is
the culprit.

`struct lttcomm_relayd_sock` contains a `struct lttcomm_sock`:

struct lttcomm_sock {
    int32_t fd;
    enum lttcomm_sock_proto proto;
    struct lttcomm_sockaddr sockaddr;
    const struct lttcomm_proto_ops *ops;
} LTTNG_PACKED;

Note that `ops` is a pointer and its size varies based on the bitness of
the application. Hence the size of the `struct lttcomm_sock` differs
across bitness. Since it is the first member of `struct
lttcomm_relayd_sock`, the memory layout is simply invalid across
bitness (amd64/x86).

This results in invalid parsing of the overall `struct
lttcomm_relayd_sock` when dealing with an lttng-consumerd that has a
different bitness than the lttng-sessiond. As far as I know, local
tracing scenarios are not affected since this is only relevant when
dealing with an lttng-relayd.
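
The layout problem can be reproduced with a simplified sketch. The
`demo_*` structs below are hypothetical, reduced versions of the real
ones (the `sockaddr` member is left out and the enum is replaced by a
fixed-width integer).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Packed, yet its size still depends on the pointer width. */
struct demo_sock {
	int32_t fd;
	int32_t proto;
	const void *ops; /* 4 bytes on x86, 8 bytes on amd64 */
} __attribute__((__packed__));

/* The version fields follow the variable-size first member. */
struct demo_relayd_sock {
	struct demo_sock sock;
	uint32_t major;
	uint32_t minor;
} __attribute__((__packed__));

/* Offset of `major`: 12 in a 32-bit build, 16 in a 64-bit build. */
static size_t demo_major_offset(void)
{
	return offsetof(struct demo_relayd_sock, major);
}
```

A receiver built with the other pointer width reads `major` and `minor`
from the wrong bytes, which is consistent with the bogus version
numbers shown above.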

Solution
========

Pass the socket protocol type, relayd major, relayd minor in
`lttcomm_consumer_msg`. On the receiver side, query the network stack to
get the peer information to populate a basic `lttcomm_sock`. Leaving
this work to the OS saves us from having to serialize the `sockaddr_in*`
structs.

Known drawbacks
===============

We rely on `getpeername` for the first time. Compatibility might be a
problem.

This code path assumes a lot of things that cannot be asserted, such as
the fact that the socket from which we fetch the info must be
`connected`. Still, at this point, the socket is completely set up and
the rest of the code depends on it already.

From GETPEERNAME(2):

```
       For stream sockets, once a connect(2) has been performed, either
       socket can call getpeername() to obtain the address of the peer
       socket.  On the other hand, datagram sockets  are connectionless.
       Calling connect(2) on a datagram socket merely sets the peer
       address for outgoing datagrams sent with write(2) or recv(2).
       The caller of connect(2) can use getpeername() to obtain the
       peer address that it earlier set for the socket.  However, the
       peer socket is unaware of this information, and calling
       getpeername() on the  peer  socket will  return  no useful
       information (unless a connect(2) call was also executed on the
       peer).  Note also that the receiver of a datagram can obtain the
       address of the sender when using recvfrom(2).
```

But here we are always "the caller of connect".
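
For reference, querying the peer address of a connected socket can be
sketched as below. `demo_get_peer` is a hypothetical helper and not the
function added by this change; only `getpeername()` itself is the real
call.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: fetch the peer address of a connected socket.
 * `sockaddr_storage` is large enough for any address family.
 * Returns 0 on success, -1 on failure (errno is set by getpeername()).
 */
static int demo_get_peer(int fd, struct sockaddr_storage *peer,
		socklen_t *peer_len)
{
	memset(peer, 0, sizeof(*peer));
	*peer_len = sizeof(*peer);
	return getpeername(fd, (struct sockaddr *) peer, peer_len);
}
```

On a socket that is not connected, the call fails, which is why the
"connected" assumption above matters.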

Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I930ef9bbbf18fa881222850ba0fbbba026dc0220

Tests: add kernel test into the `make check` test suite.

The Jenkins CI mostly runs the `make check` suite. Only the Lava-based CI
runs the root_regression test suite. Most of those tests can be run as
part of `make check` without incurring any major extra time.

Only `regression/tools/streaming/test_high_throughput_limits` is left in
root_regression since it is currently "unreasonable" in terms of the
time it takes to run. This could be tackled another time.

Change-Id: I29d40fa8bec872bf2e22a8bd933f58fa6376ee22
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

Update version to v2.12.8

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>

Relicence all source and header files included in LGPL code

All code included in libcommon-lgpl.a should be LGPL. Some were licensed
as GPLv2 by mistake. We need to relicense those.

EfficiOS owns the copyright to all of the affected source files and
agrees to this relicensing from GPLv2 to LGPLv2.1.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ib4aa1a7cc8a6f8e2a9891f1bd05c7ea2f8530e9b

Fix: sessiond: use of null session when session is not found

A bad backport introduced this use of `session` even when it is
NULL.

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: If7d540229ef3a2a50e1e58d4f408cb61252ed45d

Move utils_expand_path and utils_expand_path_keep_symlink to libpath.la

Move the GPLv2 helper functions utils_expand_path and
utils_expand_path_keep_symlink to libpath.la. This will allow utils.cpp
to be relicensed to LGPLv2.1 by making sure EfficiOS owns the copyright
for the entire source file.

Statically include libpath.la into libcommon-gpl.la.

The "lttng" executable is GPLv2 and only depends on libcommon-lgpl.la,
so it needs to explicitly list libpath.la as its dependency.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I2108e0bd35ce75797d4a20e65891c820a1dd79f8

Link lttng executable on libcommon-lgpl.a

Allows testing whether liblttng-ctl.so has any dependency on
libcommon-gpl.a. Will trigger a link-time failure if it is the case.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ie6f864dccdd61f896a7e238e202dc6188ec4f9c5

Introduce libcommon-lgpl for liblttng-ctl

liblttng-ctl is an LGPLv2.1 library and should not use GPLv2 code.
Introduce libcommon-lgpl as a static archive containing only LGPLv2.1
compatible code.

This also removes the dependency from liblttng-ctl to liburcu.

Include some source files in libcommon-lgpl.a which are indirectly needed
by source files required in libcommon-lgpl.a:

- endpoint.cpp,
- lttng-elf.cpp,
- lttng-elf.h.

Include some source files in libcommon-lgpl.a which are only needed to
link the lttng executable:

- domain.cpp,
- spawn-viewer.cpp, spawn-viewer.h.

Introduce the new source file hashtable/seed.cpp to move the
lttng_ht_seed symbol in a source file which does not require
liburcu-cds, so it can be present in libcommon-lgpl. This allows
building compilation units which are needed in the lgpl common library
and which also contain functions that directly refer to lttng_ht_seed.

Programs and libraries which use libhashtable.la are changed to use
libcommon-gpl.la instead. libhashtable becomes internal to libcommon.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I27d2acb823a6d951692a5da88ce32bbe6bafb072

Rename libcommon.so to libcommon-gpl.so

libcommon is a static library that is currently used by both liblttng-ctl
(LGPLv2.1) and all lttng-tools executables (GPLv2).

Given that some code in libcommon depends on liburcu, this introduces an
indirect dependency from liblttng-ctl to liburcu, which is unwanted.

This first step renames libcommon.so to libcommon-gpl.so. Following
steps will introduce a more lightweight libcommon-lgpl.so which only
contains LGPLv2.1 code, and remove the dependency on liburcu.

Backport Notes
--------------

ini_config has to link against liblttng-ctl since the internal
configuration library is not split into ini-config and libconfig
(see 3299fd310).

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ia8f37ea229f68550200cbb1528216a505bbbd45f

Copyright ownership transfer

Apply copyright ownership transfer from Jonathan Rajotte-Julien and
Olivier Cotte to EfficiOS Inc.

Link: https://lists.lttng.org/pipermail/lttng-dev/2022-January/030111.html
Link: https://lists.lttng.org/pipermail/lttng-dev/2022-January/030113.html
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Olivier Moussavou Cotte <olivier.moussavoucotte@mail.mcgill.ca>
Signed-off-by: Jonathan Rajotte-Julien <jonathan.r.julien@gmail.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I0c2800ec17f5d435c8a18f2dae096809b8037b5d

Copyright ownership transfer

Apply copyright ownership transfer from David Goulet, Julien Desfossez,
and Simon Marchi to EfficiOS Inc.

Link: https://lists.lttng.org/pipermail/lttng-dev/2022-January/030087.html
Link: https://lists.lttng.org/pipermail/lttng-dev/2022-January/030092.html
Link: https://lists.lttng.org/pipermail/lttng-dev/2022-January/030091.html
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Signed-off-by: David Goulet <dgoulet@ev0ke.net>
Signed-off-by: Julien Desfossez <ju@klipix.org>
Signed-off-by: Simon Marchi <simon.marchi@efficios.com>
Change-Id: Id13575afd4a2a09bb91a8d2b7a12dc3db8dc329f

Fix: relayd: erroneous rundir permission logging message

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I2eb0901896feab684d397aa43f8c9036018d9c95

Fix: sessiond: rotation thread: fatal error when not finding a session

The rotation thread implements scheduled rotations (by size) by
registering a trigger that monitors the session's consumed size and
notifies when the next rotation's size threshold is exceeded.

The notification is delivered asynchronously which doesn't prevent
the session from being destroyed before the rotation thread has
had the time to process the notification (and perform a rotation).

Since it is possible for a session to be destroyed by the time the
notification is processed, the rotation thread shouldn't handle
this eventuality as a fatal error (shutting down the thread).
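
The intended behaviour can be sketched as below. All names are
hypothetical and the session list is reduced to a plain array; the
actual rotation thread code is more involved.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, minimal session record. */
struct demo_session {
	const char *name;
	unsigned int rotation_count;
};

/* Look a session up by name. Returns NULL when it no longer exists. */
static struct demo_session *demo_find_session(struct demo_session *sessions,
		size_t count, const char *name)
{
	for (size_t i = 0; i < count; i++) {
		if (strcmp(sessions[i].name, name) == 0) {
			return &sessions[i];
		}
	}

	return NULL;
}

/*
 * Handle a "size threshold exceeded" notification.
 * Returns 0 when the thread can keep running, -1 on a fatal error.
 */
static int demo_handle_rotation_notification(struct demo_session *sessions,
		size_t count, const char *name)
{
	struct demo_session *session;

	if (!name) {
		/* Malformed notification: treated as fatal in this sketch. */
		return -1;
	}

	session = demo_find_session(sessions, count, name);
	if (!session) {
		/* Session destroyed in the meantime: nothing to rotate. */
		return 0;
	}

	session->rotation_count++;
	return 0;
}
```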

Note that nobody reported this issue nor did I attempt to reproduce it.

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I588054bad3542854851f28d34f2c758bdf420a34