IBM allowed its RCU contributions to be used in LGPLv2-or-later code through this commit, kindly contributed to the Userspace RCU project:
    commit 54843abcc17c8e8b7600ed635e966c6970d8d20f
    Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Date:   Sat May 9 01:11:49 2009 -0400

        LGPL relicensing of IBM's contributions

        Add comments noting IBM's permission to relicense its contributions to the
        urcu.h and urcu.c files under the LGPLv2 license, or any later version.

        Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
        Reviewed-by: Steven L. Bennett <steven.bennett@us.ibm.com>
GPLv2 provides an implicit patent grant in sections 6 and 7. LGPLv2.1 has essentially the same language in sections 10 and 11. IANAL, but AFAIU LGPLv2.1 therefore provides an implicit patent grant in the same way GPLv2 does.
IBM has allowed use of RCU in GPL code through its contributions to the Linux kernel for many years now. Their contributions to Userspace RCU provide an implicit patent grant in a similar fashion.
Since Userspace RCU is LGPLv2.1, proprietary applications can link to it, and therefore use RCU through the library APIs, as long as they satisfy the LGPLv2.1 requirements.
Indeed, the restartable sequence critical section needs to be written in assembly. The idea is to keep this complexity within public headers that implement the common operations as inline assembly for all supported architectures. You can see such operations already implemented for x86 as part of the rseq selftests here: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...
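To give a rough idea of the shape such a section takes, here is a hedged pseudo-assembly outline (labels and names are illustrative, not the selftests' macros; the real per-architecture implementations live in the rseq selftests linked above):

```
  store &my_rseq_cs -> __rseq_abi.rseq_cs  ; arm the section: a single TLS store
start_ip:
  load  per-cpu value                      ; preparatory steps, restartable anywhere
  compute new value
  store new value -> per-cpu slot          ; the single commit instruction
post_commit_ip:
  store 0 -> __rseq_abi.rseq_cs            ; disarm
  ...
  .long SIGNATURE                          ; signature word preceding the abort label
abort_ip:
  jump  slow_path                          ; retry, or fall back (e.g. to cpu_opv)
```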
It all depends on how much control you have over the system you target. The strategy you refer to may well work for a dedicated deployment, but if you are developing a general-purpose memory allocator targeting a wide range of applications, you might not want to impose those constraints on your users.
(disclaimer: I am the patch author, Mathieu Desnoyers) Just as a clarification, the idea originates from Google (I give full credit to Paul Turner and Andrew Hunter for it). However, the extra 3 years of work required to get it upstream was done by myself at EfficiOS.
I'm currently discussing with the glibc maintainers the best approach to integrate this into the Linux userspace ecosystem. So far, discussions point in a direction where glibc would own the __rseq_abi TLS symbol and register it for every thread. I can then maintain a rseq library consisting of helper header files that contain the common rseq operations for all supported architectures.
I am concerned about providing a librseq that handles rseq registration for early adopters though, because I don't want projects to eventually end up conflicting with future glibc versions. Once we have settled how glibc will expose the symbol and register it, I will try to provide a helper library which exposes this symbol and allows performing explicit rseq registration in a way that won't conflict with future glibc versions.
> I am concerned about providing a librseq that handles rseq registration for early adopters though
Sounds very reasonable.
So at this point, as far as I understand it, FB and Google carry in-house rseq kernel and user space patches. Right? Are they on board with the mainline rseq? Will FB support rseq in jemalloc any time soon?
I've been in touch with FB. They are interested in using rseq for jemalloc. They have provided prototypes of jemalloc based on rseq, along with benchmarks helping me make the case for rseq mainlining.
I don't know whether Google will ever want to swap from their in-house rseq implementation to the upstream Linux rseq, use both ABIs for a transition period, or simply keep using their own in-house rseq.
- Gather a list of desiderata, ensuring we take into account a complete list of use-cases targeted by everyone active in the rseq discussions. This is crucially important to ensure discussions don't spin in circles going back and forth between different requirements,
- Redesign the uapi/linux/rseq.h ABI, making sure a single TLS store is needed to enter a rseq critical section, without requiring any extra registers as ABI. I have introduced the "rseq_cs" structure as critical section descriptor to do this,
- Optimize arm32 and x86 rseq critical sections for speed, by creating my own benchmark programs,
- Rewrite the kernel rseq implementation a few times so it follows the kernel coding style, ensuring it pleases everyone caring about it,
- Present 2 talks about rseq at Linux Plumbers Conference,
- Go through various rounds of in person, email, and IRC discussions with Paul Turner, Peter Zijlstra, Andy Lutomirski, Boqun Feng, Paul E. McKenney, Thomas Gleixner, Ben Maurer, Linus Torvalds, and many others. Those were very constructive discussions bringing up everyone's concerns with respect to this new system call,
- Extend the rseq selftests, adding new testing strategies such as delay loops between "steps" of the critical section, thus increasing the likelihood of generating preemption races,
- Figure out nasty races only happening on NUMA systems after about a full day of stress-testing,
- Provide solutions for debugger single-stepping "lack of progress" problem if rseq is used when retrying on abort. It's basically the cpu_opv system call I plan to propose for 4.19. Meanwhile, without cpu_opv, rseq can still be used in ways to guarantee forward progress, but the abort code needs to use a partitioning strategy rather than a simple retry (e.g. going to a different memory pool in case of abort for a memory allocator),
- Harden the rseq mechanism for security, by adding a "signature" word before the abort label,
- Implement prototypes of lttng-ust and liburcu which use rseq, gathering benchmarks to validate the approach,
- Write rseq and cpu_opv man pages.
And this is just the items that were "forward progress" in the rseq adventure. I'm leaving out all the attempts at making things more generic that had to be thrown away.
Thanks for this detailed and a bit overwhelmingly long list. When I saw the first Paul Turner tech talk on rseq (it was called something else back then; at LPC, if I remember correctly), it seemed so simple, so obvious: "just read this memory address and if there was an interruption, we have to retry".
But then of course real life is a lot more complex than slideware.
cpu_opv is new to me (no time for LWN these days), but it looks simple, elegant and sort of obvious (again). Which makes me wonder why no one had thought of it before. (But of course this is probably my ignorance speaking.)
Some use-cases likely to be enhanced by rseq: statistics counters, memory allocators (jemalloc, glibc malloc, and others), user-space tracing (LTTng), user-space Read-Copy Update (liburcu), reading performance monitoring unit counters from user-space on ARM64, and possibly user-space task scheduling.
Also, just reading the current CPU number can now be done faster by reading the __rseq_abi.cpu_id field rather than calling sched_getcpu through the vDSO.
Yes, there are indeed pieces missing for this use-case. I intend to push another system call for the next merge window (4.19): "cpu_opv" [1]. It stands for "CPU operation vector", which is needed to take care of moving user-level tasks around between per-cpu work-queues touched by rseq fast-paths in a way that is safe against CPU hotplug. It's also needed to migrate free memory between per-cpu memory pools modified by rseq fast-paths safely against CPU hotplug. Some of it can be approximated by setting cpu affinity, but it's racy against CPU hotplug.
cpu_opv can be used as a slow-path fallback in pretty much all scenarios where the rseq fast-path aborts.
rseq user-level APIs are pretty much limited to working on the current CPU, whereas cpu_opv allows creating operations on per-cpu data structures [2] which take the CPU number as argument. If it happens to be the current CPU, rseq can be used and it is fast; but if the CPU number is not the current CPU, or is an offline CPU, then cpu_opv takes care of performing the operation safely with respect to rseq critical sections and other cpu_opv operations.
(Full disclosure: I'm Mathieu Desnoyers, part of the LTTng maintainer team.)
I would like to introduce a slightly less extreme point of view when considering "on-the-fly" aggregation of traces vs tracing to a buffer followed by post-processing. I see from the current discussion thread that it's very much either one or the other, but I think that combining the two approaches helps create much more powerful tools. On-the-fly aggregation based on trace instrumentation helps pinpoint latency outliers. Tracing to buffers, on the other hand, provides very detailed information about the system behavior that leads to those outliers. By using on-the-fly aggregation as a "trigger" to collect the tracer's in-memory ring buffers, one can investigate latency outliers with very small I/O overhead.
Currently, LTTng-UST has slightly higher overhead than 100 cycles per event (roughly 250-300 ns/event on recent 2.4GHz Intel), which I expect is partly caused by use of per-CPU buffers rather than per-thread buffers. I have contributed the membarrier system call, and I am currently working on adding restartable sequences and a cpu_id cache to the Linux kernel, so the speed of LTTng-UST can be brought closer to the performance of a tracer using per-thread buffers. Keeping per-CPU buffers ensures that the tracer efficiently uses memory resources on workloads that have many more threads than CPU cores.
I also notice that filtering out all function entries/exits that take less than 5 microseconds probably helps reach those performance numbers. This kind of approach, although very specific to function tracing, seems worthwhile, and could eventually be introduced in LTTng-UST.
Another interesting aspect is that X-Ray seems to use the CPU cycle counter directly. That is fine when the architecture has a reliable TSC source, but LTTng-UST uses the CLOCK_MONOTONIC vDSO to ensure that we properly fall back on other clock sources (e.g. HPET) whenever the system does not have a reliable TSC. The extra function calls and seqlock may account for a few cycles of difference between X-Ray and LTTng-UST.
I don't see any mention of Intel's errata on cross-modifying code on SMP in the paper, and I wonder how the authors handle this. See "Unsynchronized Cross-Modifying Code Operations Can Cause Unexpected Instruction Execution Results" (erratum AX72): http://www.intel.com.tr/content/dam/www/public/us/en/documen... This is one of the main challenges of cross-modifying code, and one key reason why LTTng-UST does not use a nop-slide today. One possible approach is to SIGSTOP the entire process while doing the code modification, which is unwanted in real-time systems. Another approach would be to integrate with uprobes and do a temporary breakpoint bypass, similar to what the Linux kernel does today for jump labels.