From: Jérémie Galarneau
Date: Tue, 3 Jan 2023 23:41:23 +0000 (-0500)
Subject: Fix: sessiond: instance uuid is not sufficiently unique
X-Git-Url: https://git.lttng.org/?a=commitdiff_plain;h=57b90af7b1977684094706818e387433f50b7d48;hp=57b90af7b1977684094706818e387433f50b7d48;p=lttng-tools.git

Fix: sessiond: instance uuid is not sufficiently unique

Observed issue
==============

Tracing a cluster of machines -- all launched simultaneously -- to the
same relay daemon occasionally produces corrupted traces.

The size of the packets received (as seen in the relay daemon logs)
occasionally didn't match the size of those present in the on-disk
stream.

The affected traces all shared the same trace UUID, but their packet
begin timestamps were not monotonic for a given stream. This causes
both Babeltrace 1.x and 2.x to fail to open the traces, with different
error messages related to clocks.

Cause
=====

On start, the session daemon generates a UUID to uniquely identify the
sessiond instance. Since the UUID generation utils use time() to seed
the random number generator, two session daemons launched within the
same second can end up with the same instance UUID.

Since the relay daemon relies on this UUID to uniquely identify a
session daemon across its various connections, identifier clashes can
cause streams from the same `uid` or `pid` to become scrambled,
resulting in corrupted traces.

Solution
========

The UUID utils now initialize their random seed using the getrandom()
API in non-blocking mode. If that fails -- most likely because the
random pool is depleted or the syscall is not available on the
platform -- they fall back to a hash of two time readings (with
nanosecond precision), the hostname, and the PID.

Known drawbacks
===============

This fix implements many fallbacks, each with its own caveats, and we
don't have full test coverage for all of them at the moment.

This article presents the respective drawbacks of using /dev/urandom
vs getrandom():
https://lwn.net/Articles/884875/

As for the pseudo-random time and configuration based fallback, it is
meant as a last resort for platforms or configurations where both
getrandom() (old kernels or non-Linux platforms) and /dev/urandom
(e.g. locked-down container) are unavailable. I haven't done a formal
analysis of the entropy of this home-grown method.

The practical use-case we want to enable is launching multiple virtual
machines (or containers) at roughly the same time while ensuring that
they don't end up using the same sessiond UUID. In that respect, a
different host name and minute timing differences seem sufficient to
prevent a UUID clash.

Using the PID as part of the hash also helps when launching multiple
session daemons simultaneously for different users.

Change-Id: I064753b9ff0f5bf2279be0bd0cfbfd2b0dd19bfc
Signed-off-by: Jérémie Galarneau
---
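
The seeding order described in "Solution" and "Known drawbacks" can be
sketched roughly as follows. This is an illustration only, not the
code from this commit: the function names are hypothetical, FNV-1a is
a stand-in for whatever hash the patch actually uses, and the choice
of CLOCK_REALTIME and CLOCK_MONOTONIC for the "two time readings" is
an assumption.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/random.h> /* getrandom(); needs glibc >= 2.25 */
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* 64-bit FNV-1a, used here only as a stand-in hash (assumption). */
#define FNV_OFFSET_BASIS 0xcbf29ce484222325ULL
#define FNV_PRIME 0x100000001b3ULL

static uint64_t fnv1a(uint64_t h, const void *data, size_t len)
{
	const unsigned char *p = data;
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= p[i];
		h *= FNV_PRIME;
	}
	return h;
}

/* Preferred source: getrandom() without blocking on an empty pool. */
static int seed_from_getrandom(unsigned int *seed)
{
	ssize_t ret;

	assert(seed);
	do {
		ret = getrandom(seed, sizeof(*seed), GRND_NONBLOCK);
	} while (ret < 0 && errno == EINTR);
	return ret == (ssize_t) sizeof(*seed) ? 0 : -1;
}

/* Second choice: read the seed from /dev/urandom. */
static int seed_from_urandom(unsigned int *seed)
{
	FILE *f;
	size_t n;

	assert(seed);
	f = fopen("/dev/urandom", "rb");
	if (!f)
		return -1;
	n = fread(seed, sizeof(*seed), 1, f);
	fclose(f);
	return n == 1 ? 0 : -1;
}

/* Last resort: hash two clock readings, the host name and the PID. */
static unsigned int seed_from_time_and_config(void)
{
	uint64_t h = FNV_OFFSET_BASIS;
	struct timespec ts;
	char hostname[256] = { 0 };
	pid_t pid = getpid();

	if (clock_gettime(CLOCK_REALTIME, &ts) == 0) {
		h = fnv1a(h, &ts.tv_sec, sizeof(ts.tv_sec));
		h = fnv1a(h, &ts.tv_nsec, sizeof(ts.tv_nsec));
	}
	/* Leave room for the terminating NUL in case of truncation. */
	if (gethostname(hostname, sizeof(hostname) - 1) == 0)
		h = fnv1a(h, hostname, strlen(hostname));
	h = fnv1a(h, &pid, sizeof(pid));
	if (clock_gettime(CLOCK_MONOTONIC, &ts) == 0) {
		h = fnv1a(h, &ts.tv_sec, sizeof(ts.tv_sec));
		h = fnv1a(h, &ts.tv_nsec, sizeof(ts.tv_nsec));
	}
	/* Fold the 64-bit hash down to the width of the seed. */
	return (unsigned int) (h ^ (h >> 32));
}

/* Try each source in turn, from strongest to weakest. */
static unsigned int pick_seed(void)
{
	unsigned int seed;

	if (seed_from_getrandom(&seed) == 0)
		return seed;
	if (seed_from_urandom(&seed) == 0)
		return seed;
	return seed_from_time_and_config();
}
```

In such a scheme, pick_seed() would be called once at start-up and its
result handed to the random number generator in place of time(). Two
daemons started within the same second then only collide if every
source in the chain yields the same value.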