> [...] I'm not even sure that it's possible to write a correct lock-free queue (of which this is a variant) in a non-garbage collected language without some pretty substantial trade-offs regarding memory management.
There's no dynamic memory allocation going on in this particular lock-free queue, so I don't see how your concern is relevant here?
After reading the full article, they solve a vastly more restricted (although perhaps still useful) problem than the title implies. Their design assumes a single producer and single consumer and is trivially broken if that's not the case. I don't have time to write a convincing set of benchmarks to prove it, but I very much doubt that this is faster than a locking implementation of the same structure.
I can attest to the basic lock-free spsc queues being far faster than a basic spinlocking version of the same queue, on xeon-class X86 at least.
LOCK'ed instructions (one might use lock xchg or lock cmpxchg) perform a full memory barrier, which has quite a few performance implications if there are any outbound loads/stores. Further, if the cache line containing the spinlock is in anything except an exclusively-held state on the writing core (it will usually be on a different core or in a shared state), simply acquiring the lock will stall for ~25-30ns+.
On the other hand, a simply spsc lock-free queue can hit <10ns per message with a latency about equal to the core-core communication latency.
There's no dynamic memory allocation going on in this particular lock-free queue, so I don't see how your concern is relevant here?