In cds_list_add_rcu, use rcu_assign_pointer to update head->next
atomically and to provide the memory barrier before publishing head->next.
Note that no wmb() is needed prior to the store to prev, because RCU
traversals only go forward, and thus only use "next".
In cds_list_del_rcu, use CMM_STORE_SHARED() to store to elem->prev->next
atomically.
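The ordering described above can be sketched with C11 atomics. This is only an illustration under stated assumptions: the node type and the list_*_publish helper names are hypothetical, and a release store stands in for rcu_assign_pointer while a single relaxed atomic store stands in for CMM_STORE_SHARED(). It is not the liburcu implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical node type, not liburcu's struct cds_list_head. */
struct node {
	_Atomic(struct node *) next;	/* followed by concurrent readers */
	struct node *prev;		/* only touched by updaters */
	int value;
};

static void list_init(struct node *head)
{
	atomic_init(&head->next, head);
	head->prev = head;
}

/* Insert newp right after head. The caller holds the update-side lock. */
static void list_add_publish(struct node *newp, struct node *head)
{
	struct node *first = atomic_load_explicit(&head->next,
			memory_order_relaxed);

	/* Fully initialize the new node before it becomes reachable. */
	atomic_store_explicit(&newp->next, first, memory_order_relaxed);
	newp->prev = head;
	/* Readers never follow prev, so this store needs no barrier. */
	first->prev = newp;
	/* Release store: stands in for rcu_assign_pointer(head->next, newp). */
	atomic_store_explicit(&head->next, newp, memory_order_release);
}

/* Unlink elem. Readers currently on elem can still follow elem->next. */
static void list_del_publish(struct node *elem)
{
	struct node *next = atomic_load_explicit(&elem->next,
			memory_order_relaxed);

	next->prev = elem->prev;
	/* One atomic store: stands in for CMM_STORE_SHARED(). */
	atomic_store_explicit(&elem->prev->next, next, memory_order_relaxed);
}

/* Forward-only traversal, as a reader would perform it. */
static int list_sum(struct node *head)
{
	int sum = 0;
	struct node *pos;

	for (pos = atomic_load_explicit(&head->next, memory_order_acquire);
			pos != head;
			pos = atomic_load_explicit(&pos->next, memory_order_acquire))
		sum += pos->value;
	return sum;
}
```

A removed node must still not be freed until a grace period has elapsed; the sketch leaves that part out.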
* Lai Jiangshan (laijs@cn.fujitsu.com) wrote:
> Hi, Mathieu,
>
> There is a big compatible problem in URCU which should be fix in next round.
>
> LB: liburcu built on the system which has sys_membarrier().
> LU: liburcu built on the system which does NOT have sys_membarrier().
>
> LBM: liburcu-mb ....
> LUM: liburcu-mb ...
>
> AB: application(-lliburcu) built on the system which has sys_membarrier().
> AU: application(-lliburcu) built on the system which does NOT have
> sys_membarrier().
>
> ABM application(-lliburcu-mb) ...
> AUM application(-lliburcu-mb) ...
>
> AB/AU + LB/LU: 4 combinations
> ABM/AUM + LBM/LUM: 4 combinations
>
> I remember some of the 8 combinations can't works due to symbols are
> miss match. only LU+AB and LB+AU ?
>
> could you check it?
>
> How to fix it: In LU and AU, keep all the symbol name/ABI as LA and
> AB, but only the behaviors falls back to URCU_MB.
Define membarrier() as -ENOSYS when SYS_membarrier is not found in the
system headers. Check dynamically for membarrier availability to ensure
ABI compatibility between applications and libraries.
Reported-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
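A minimal sketch of the approach described in this commit, assuming a Linux-style syscall() interface. The names membarrier_call, membarrier_probe, has_sys_membarrier and barrier_strategy are hypothetical, and command 0 is assumed to be the query command, as in the mainline Linux membarrier interface (the interface available when this commit was written may have differed).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifdef SYS_membarrier
/* System headers know the syscall number: try the real syscall. */
static int membarrier_call(int cmd, int flags)
{
	return (int) syscall(SYS_membarrier, cmd, flags);
}
#else
/* Headers lack SYS_membarrier: behave as an unimplemented syscall. */
static int membarrier_call(int cmd, int flags)
{
	(void) cmd;
	(void) flags;
	errno = ENOSYS;
	return -1;
}
#endif

/* Hypothetical flag, set once at library initialization. */
static int has_sys_membarrier;

static void membarrier_probe(void)
{
	/* Assumption: command 0 queries support and returns >= 0 if present. */
	has_sys_membarrier = (membarrier_call(0, 0) >= 0);
}

/* Same symbol either way; only the behavior changes at run time. */
static const char *barrier_strategy(void)
{
	return has_sys_membarrier ? "sys_membarrier" : "mb fallback";
}
```

Because the decision is taken at run time, the exported symbols do not depend on whether the build system had the syscall, which is the ABI property the report asks for.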
Fix: Use a filled signal mask to disable all signals
Changelog from David Pelton's original patch:
While using lttng-ust with an application that was calling fork()
with pending signals, I found that all signals were getting unmasked
shortly before the underlying call to fork(). After some
investigation, I found that the rcu_bp_before_fork() function was
unmasking all signals. Based on the comments for this function, it
should be masking all signals. Inspection of the rest of the code
in urcu-bp.c revealed the same pattern in two other functions.
This patch changes the code to use a filled signal mask to disable
all signals. The change to rcu_bp_before_fork() addressed the
problem I was seeing while using lttng-ust. The changes to the
other two functions appear to fix other instances of the same
problem.
Updates by Mathieu Desnoyers:
- Use SIG_BLOCK instead of SIG_SETMASK when setting a filled mask. This
has the same behavior in this case (since we're blocking all signals),
but is semantically neater: if we ever remove some signals from that mask,
we'd like to do a union with the signal mask already blocked by the
application.
- Also fix incorrect signal masking in compat_arch_x86.c.
Reported-by: David Pelton <dpelton@ciena.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
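The fix above can be sketched as follows. The helper names are hypothetical, and sigprocmask() is used so the sketch links without a thread library; threaded code such as urcu-bp would pass the same arguments to pthread_sigmask().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Block every signal and save the previous mask. Returns 0 on success. */
static int block_all_signals(sigset_t *oldmask)
{
	sigset_t newmask;

	/* A filled set blocks everything; an empty set would block nothing. */
	if (sigfillset(&newmask))
		return -1;
	/* SIG_BLOCK takes the union with whatever is already blocked. */
	return sigprocmask(SIG_BLOCK, &newmask, oldmask);
}

/* Restore the mask saved by block_all_signals(). */
static int restore_signals(const sigset_t *oldmask)
{
	return sigprocmask(SIG_SETMASK, oldmask, NULL);
}

/* Return 1 if signo is currently blocked, 0 if not, -1 on error. */
static int is_blocked(int signo)
{
	sigset_t cur;

	/* A NULL set leaves the mask unchanged and only reads it back. */
	if (sigprocmask(SIG_BLOCK, NULL, &cur))
		return -1;
	return sigismember(&cur, signo);
}
```

A caller would bracket the sensitive region, for example the code around fork(), with block_all_signals() and restore_signals().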
Lai Jiangshan [Mon, 6 May 2013 12:42:27 +0000 (08:42 -0400)]
urcu: avoid false sharing for rcu_gp_ctr
@rcu_gp_ctr and @registry share the same cache line, which causes
false sharing and slows down both the read side and the update side.
Fix: Use a different cache line for each of them.
Although rcu_gp_futex is updated less often than rcu_gp_ctr, they are
always accessed at almost the same time, so we also move rcu_gp_futex
to the cache line of rcu_gp_ctr to reduce the cache-line usage and cache
misses of the read side.
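The layout change can be sketched as below. The struct names and the 64-byte line size are assumptions made for illustration; the real cache-line size is architecture specific and the actual liburcu declarations differ.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE	64	/* assumption; architecture specific */

/* Hypothetical stand-in for the reader registry list head. */
struct registry_list {
	struct registry_list *next, *prev;
};

/* Grace-period state: counter and futex share one cache line. */
struct gp_state {
	alignas(CACHE_LINE_SIZE) unsigned long ctr;
	int32_t futex;
};

/* The registry starts on a cache line of its own. */
struct globals {
	struct gp_state gp;
	alignas(CACHE_LINE_SIZE) struct registry_list registry;
};

static_assert(offsetof(struct globals, registry) % CACHE_LINE_SIZE == 0,
	"registry must start on a cache-line boundary");
static_assert(sizeof(struct gp_state) % CACHE_LINE_SIZE == 0,
	"gp_state must fill whole cache lines");
```

With this layout, registering a reader thread no longer invalidates the line that other readers load the counter from.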
Lai Jiangshan [Mon, 6 May 2013 12:32:02 +0000 (08:32 -0400)]
urcu: make the code of urcu-qsbr as normal urcu
urcu-qsbr's read-side quiescence takes much longer than normal urcu's ==>
synchronize_rcu() is much slower ==>
rcu_gp_ctr is updated much less often ==>
urcu-qsbr as a whole will not be slowed down by false sharing of rcu_gp_ctr.
Still, this patch makes sense because it keeps the urcu-qsbr code like
normal urcu, for better readability and maintainability.
- thread is online (QSBR),
- thread is nested within read-side critical section (other flavors),
This is useful for libraries that need to know if QSBR is online in
order to save the original state temporarily so it can be restored
before returning to the caller.
Eventually, this API can be called by a "debugging" implementation of
rcu_dereference() and other urcu-pointer.h API members to check that no
RCU pointer is read outside of RCU read-side critical sections.
Some sparc Debian setups advertise a "sparc" host cpu (rather than
sparc64).
In all cases, I think it should be safe to add a "sparc" entry to
userspace RCU configure.ac upstream, e.g.
[sparc], [ARCHTYPE="sparc64"],
In the event someone launches the build in an environment that does not
support sparc v9, the build would fail because the 32-bit compiler
would not be able to generate sparc v9 instructions (unless
explicitly instructed to do so by the -m32 -Wa,-Av9a flags).
Fix hurd-i386: move cpuset tests outside of sched_setaffinity conditional
Comment about introduction of cpuset.h within urcu tests:
> Unfortunately it doesn't work, because sched_setaffinity is for now
> just a fail-stub on hurd-i386, and thus configure considers it as
> missing, and thus the CPU_SET test is disabled completely.
>
> I however guess you could just disable defining your own cpu_set_t
> when !HAVE_SCHED_SETAFFINITY, since it is probably used only for using
> sched_setaffinity.
Fix by moving cpu_set_t, CPU_SET and CPU_ZERO tests outside of the
sched_setaffinity conditional.
Reported-by: Samuel Thibault <sthibault@debian.org>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fix tests: finer-grained use of CPU_SET, CPU_ZERO and cpu_set_t
Noticed build failure at
https://buildd.debian.org/status/package.php?p=liburcu :
Tail of log for liburcu on hurd-i386:
test_urcu.c:110:0: warning: "CPU_SET" redefined [enabled by default]
In file included from /usr/include/pthread/pthread.h:50:0,
from /usr/include/pthread.h:2,
from test_urcu.c:26:
/usr/include/sched.h:80:0: note: this is the location of the previous definition
make[3]: *** [test_urcu.o] Error 1
make[2]: *** [all-recursive] Error 1
make[1]: *** [all] Error 2
dh_auto_build: make -j1 returned exit code 2
make: *** [build-arch] Error 2
dpkg-buildpackage: error: debian/rules build-arch gave error exit status 2
make[3]: Entering directory `/build/buildd-liburcu_0.7.6-1-hurd-i386-wGBAtt/liburcu-0.7.6/tests'
CC test_urcu.o
make[3]: Leaving directory `/build/buildd-liburcu_0.7.6-1-hurd-i386-wGBAtt/liburcu-0.7.6/tests'
make[2]: Leaving directory `/build/buildd-liburcu_0.7.6-1-hurd-i386-wGBAtt/liburcu-0.7.6'
Fix build on architectures with HAVE_SCHED_GETCPU but without HAVE_SYSCONF
Noticed on: https://buildd.debian.org/status/package.php?p=liburcu
Tail of log for liburcu on kfreebsd-amd64:
CC urcu.lo
In file included from urcu.c:450:0:
urcu-call-rcu-impl.h:145:12: error: static declaration of 'sched_getcpu' follows non-static declaration
In file included from /usr/include/sched.h:43:0,
from /usr/include/pthread.h:20,
from urcu.c:30:
/usr/include/x86_64-kfreebsd-gnu/bits/sched.h:65:12: note: previous declaration of 'sched_getcpu' was here
make[3]: *** [urcu.lo] Error 1
make[3]: Leaving directory `/build/buildd-liburcu_0.7.6-1-kfreebsd-amd64-nnkICd/liburcu-0.7.6'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/build/buildd-liburcu_0.7.6-1-kfreebsd-amd64-nnkICd/liburcu-0.7.6'
make[1]: *** [all] Error 2
make[1]: Leaving directory `/build/buildd-liburcu_0.7.6-1-kfreebsd-amd64-nnkICd/liburcu-0.7.6'
dh_auto_build: make -j1 returned exit code 2
make: *** [build-arch] Error 2
Tail of log for liburcu on kfreebsd-i386:
CC urcu.lo
In file included from urcu.c:450:0:
urcu-call-rcu-impl.h:145:12: error: static declaration of 'sched_getcpu' follows non-static declaration
In file included from /usr/include/sched.h:43:0,
from /usr/include/pthread.h:20,
from urcu.c:30:
/usr/include/i386-kfreebsd-gnu/bits/sched.h:65:12: note: previous declaration of 'sched_getcpu' was here
make[3]: *** [urcu.lo] Error 1
make[3]: Leaving directory `/build/buildd-liburcu_0.7.6-1-kfreebsd-i386-sWzNKU/liburcu-0.7.6'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/build/buildd-liburcu_0.7.6-1-kfreebsd-i386-sWzNKU/liburcu-0.7.6'
make[1]: *** [all] Error 2
make[1]: Leaving directory `/build/buildd-liburcu_0.7.6-1-kfreebsd-i386-sWzNKU/liburcu-0.7.6'
dh_auto_build: make -j1 returned exit code 2
make: *** [build-arch] Error 2
CMM_STORE_SHARED(x, v) is a macro that really acts like an assignment
expression, e.g.:
x = v;
but internally also has "mc" barriers (useful for cache-incoherent
architectures).
The issue here is that (x = v) can evaluate to "v", but very often we're
not interested in using the assignment expression result. When we have an
explicit assignment, the compiler won't complain that the result of this
expression is unused, but given that the added barrier requires that we
make this macro evaluate explicitly to a value, clang complains.
Fix this by adding "_v = _v" at the last line of the macro, thus
performing what would appear like an effect-less assignment, but
actually tricks clang into thinking we are evaluating to an assignment
expression, thus suppressing the warning.
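A minimal sketch of the trick, assuming GNU C statement expressions. SKETCH_STORE_SHARED and sketch_barrier are hypothetical names, and the compiler-only barrier merely stands in for the "mc" barrier; this is not the actual CMM_STORE_SHARED definition. Whether a given clang version stays silent on this form is not verified here.

```c
#include <assert.h>

/* Compiler-only barrier standing in for the "mc" barrier (assumption). */
#define sketch_barrier()	__asm__ __volatile__ ("" : : : "memory")

/*
 * Store v into x through a volatile access, issue the barrier, then
 * evaluate to the stored value. The final "_v = _v" makes the macro
 * end with an assignment expression, like a plain "x = v" would.
 */
#define SKETCH_STORE_SHARED(x, v)				\
	__extension__						\
	({							\
		__typeof__(x) _v = (v);				\
		*(volatile __typeof__(x) *) &(x) = _v;		\
		sketch_barrier();				\
		_v = _v;					\
	})

static int shared_flag;

static int set_flag_and_report(int v)
{
	SKETCH_STORE_SHARED(shared_flag, 1);		/* result ignored */
	return SKETCH_STORE_SHARED(shared_flag, v);	/* result used */
}
```

Both uses compile from the same macro: one discards the value, the other returns it.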
I've had a report of someone running into issues with the RCU lock-free
hash table by embedding the struct cds_lfht_node into a packed structure
by mistake, thus not respecting alignment requirements stated in
urcu/rculfhash.h. Assertions on "replace" and "add" operations should
catch this, but I notice that we should add assertions on the
REMOVAL_OWNER_FLAG to cover all possible misalignments.
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
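The kind of assertion discussed above can be sketched as follows. The node type, the helper names and the exact flag layout are assumptions made for illustration (three flag bits stored in the low bits of a node pointer); see urcu/rculfhash.h for the real alignment requirements.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: three flag bits kept in the low bits of the pointer. */
#define REMOVED_FLAG		(1UL << 0)
#define BUCKET_FLAG		(1UL << 1)
#define REMOVAL_OWNER_FLAG	(1UL << 2)
#define FLAGS_MASK		((1UL << 3) - 1)

/* Hypothetical stand-in for struct cds_lfht_node. */
struct sketch_node {
	struct sketch_node *next;
	unsigned long reverse_hash;
} __attribute__((aligned(8)));

/* A node address is usable only if all flag bits are clear. */
static int node_is_aligned(const void *node)
{
	return ((uintptr_t) node & FLAGS_MASK) == 0;
}

/* Tag a node pointer, asserting first that no flag bit is already set. */
static uintptr_t set_removal_owner(const struct sketch_node *node)
{
	assert(node_is_aligned(node));
	return (uintptr_t) node | REMOVAL_OWNER_FLAG;
}

/* Strip the flag bits to recover the node pointer. */
static struct sketch_node *clear_flags(uintptr_t tagged)
{
	return (struct sketch_node *) (tagged & ~(uintptr_t) FLAGS_MASK);
}
```

A node embedded in a packed structure can land on an odd address, in which case the assertion fires instead of the flag bits silently corrupting the pointer.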
Simon Marchi [Tue, 12 Feb 2013 00:10:44 +0000 (19:10 -0500)]
Fix configure checks for Tile
The previous method of checking whether the architecture is TileGx or
not was buggy. urcu/arch/tile.h included urcu/arch/gcc.h, which was not
installed on the system, causing a configure error. I am not sure why it
worked when I tested commit 1000f1f4204e5fbb337f4ea911f1e29f67df79aa,
maybe some previous partial install or something.
The check is now done earlier, during the configure step and should not
cause any trouble.
Signed-off-by: Simon Marchi <simon.marchi@polymtl.ca>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Expand explanations, reorder items to have all wait-free descriptions
first, so that the rculfqueue API comes last, since it is less
featureful and is the only API of the queues/stacks to actually rely on
RCU.
Simon Marchi [Thu, 24 Jan 2013 20:40:54 +0000 (15:40 -0500)]
Add compilation support for the TileGX architecture
This patch adds compilation support for the TileGx architecture. Since
the tests were not run on other architectures of the Tile family
(Tile64, TilePro), errors are triggered during compilation if the
architecture is another Tile arch.
Signed-off-by: Simon Marchi <simon.marchi@polymtl.ca>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Discourage use of pthread_atfork() for call_rcu handlers
Discourage use of glibc pthread_atfork() for call_rcu handlers due to
its inappropriate assumptions about single-threadedness while pthread
atfork handlers are executing. This results in hangs within the glibc
memory allocator.
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fix call_rcu fork handling by putting all call_rcu threads in a
quiescent state before fork (paused state), and unpausing them when the
parent returns from fork.
On the child, everything will run fine as long as we don't issue fork()
from a call_rcu callback.
Side-note: pthread_atfork is not appropriate for multithreaded programs
that use malloc/free. The glibc malloc implementation sadly expects that
all malloc/free calls are executed from the context of a single thread
while pthread atfork handlers are running, which leads to interesting
hangs in glibc.
wfstack API: rename cds_wfs_first_blocking to cds_wfs_first
cds_wfs_first never needs to block. This operation can be used to check
if the stack returned by pop_all is empty or not, so it is quite
interesting to have a fully non-blocking semantic for all of
enqueue/pop_all/first operations. Only cds_wfs_next may block.
Here are benchmarks on batching of synchronize_rcu(), which leads to
very interesting scalability improvements and speedups, e.g., on a
24-core AMD, with a write-heavy scenario (4 reader threads, 20 updater
threads, each updater using synchronize_rcu()):
Of course, we can see that readers have slowed down, probably due to
increased update traffic, given there is no change to the read-side code
whatsoever.
Now let's see the penalty of managing the stack for a single updater.
With 4 readers, single updater:
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Lai Jiangshan <laijs@cn.fujitsu.com>
CC: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>