> ... but that wouldn't be lock-free would it? Why do people do lock free? 1. Lo...

jcelerier · on June 4, 2019

> 2. Lockfree can be faster -- This is the case most people hope for.

it's not about being faster, it's about never returning control to the OS if you want to achieve soft-real time in one of your threads (a common application being audio processing where you have to consistently fit as much as you can do every 1.33 milliseconds for the more demanding users).

dragontamer · on June 4, 2019

Spinlocks don't return to the OS.

Mutexes are spinlocks for roughly 1000ish cycles on both Linux and Windows. If you're in-and-out before the 1000 cycles are up, the typical mutex (be it a pthread_mutex or Windows CriticalSection) will never hit kernel code.

If you take more than 1000ish cycles, then yeah OS takes over as a heuristic. But most spinlock code won't take more than 50 cycles.

chrisseaton · on June 4, 2019

How many cycles do you think it will take if the other thread has locked the spinlock and then died or was scheduled out...

dragontamer · on June 4, 2019

Both Linux and Windows detects these cases with pthread_mutex (Linux spinlock+mutex) and CriticalSection objects (Windows spinlock+mutex).

Can you tell me an OS where what you say is a legitimate concern?

To answer your question: about 4000 cycles on Windows if I'm remembering off the top of my head. I think that's the default. Roughly 1-microsecond.

chrisseaton · on June 4, 2019

Detects these cases... and does what?

What can the OS do for you if your consuming thread has for example taken a lock and is now stalled on a page fault triggering IO from spinning disk that could take seconds to finish? Nothing that I'm aware of - you just have to wait until it runs again and releases the lock.

That's why people write lock-free - they don't want a stall or death of the one thread to stall other threads. Deadlock, lovelock, priority inversion.

Read the Wikipedia page for a starting point of motivations for lock-free.

dragontamer · on June 4, 2019

Then schedules out the other thread?

I mean, what happens if the writer thread stalls out in the lock free case? The reader thread runs out of things to do and is also going to be scheduled out (eventually).

Like, that's the OS's job. To remove tasks from the "runnable" state when they stall out.

> That's why people write lock-free - they don't want a stall or death of the one thread to stall other threads.

Erm, if you're saying that the reader thread will be highly utilized... that's not likely the case if the writer thread in this "lock free" implementation stalls out.

Lets say the writer-thread hits a blocking call and is waiting for the hard drive for ~10 milliseconds. What do you think will happen to the reader-thread? In the "lock-free" implementation talked in this blog.

chrisseaton · on June 4, 2019

> I mean, what happens if the writer thread stalls out in the lock free case? The reader thread runs out of things to do and is also going to be scheduled out (eventually).

Yeah it keeps running, exactly as you say! It can keep working on jobs already on the queue independently.

Are you arguing that you you personally don't think it's worth doing lock-free for various practical reasons? Or arguing that's never why anyone does it?

Because the latter is simply false - I've done it for the latter reasons professionally both in academia and industry so there's one data point for you, and the literature talks about this use-case extensively so I know other people do as well.

dragontamer · on June 4, 2019

> Are you arguing that you you personally don't think it's worth doing lock-free for various practical reasons? Or arguing that's never why anyone does it?

No. Lock-free is great. When it is done correctly.

The code in this blog post was NOT done correctly. It only works in the case of 1-reader + 1-writer, so the "advantages" you're talking about are extremely limited.

You have to keep in mind the overall big picture. Lock-free code is incredibly difficult to write and usually comes with great cost to complexity. However... sometimes Lock-free makes life easier (okay, use them in that case). Sometimes, lock-free can be faster (okay... use them in that case... if speed is important to you).

But in many, many cases, locked data-structures are just easier to write, test, and debug.

jcelerier · on June 4, 2019

> It only works in the case of 1-reader + 1-writer

this is what is needed 99.9% of the time. e.g. look at the number of users of this C++ SPSC queue : https://github.com/cameron314/readerwriterqueue

or of the Boost one : https://github.com/search?q=boost%3A%3Alockfree%3A%3Aspsc_qu...

dragontamer · on June 4, 2019

Yeah, but the "queue-of-queues" is a more traditional structure familiar to people with lock-free algorithms.

I don't have time to do a serious code review of everything... but immediately upon looking at that code, its already clearly written with better quality: with acquire/release fences being used in (seemingly) the correct spot (just as an example).

Like, the real issue I have here is the quality of the code that is being demonstrated in the blog post. A lock-free of queues-of-queues implementation makes sense in my brain. What's going on in the blog-post ultimately doesn't make sense.

jcelerier · on June 4, 2019

> Erm, if you're saying that the reader thread will be highly utilized... that's not likely the case if the writer thread in this "lock free" implementation stalls out.

That does not mean that the reader thread has nothing to do. e.g. the most common case is that your reader thread will produce an output, and your writer thread sends messages to change how this output is generated - but even if there's no message, the reader thread has to keep going at a steady rate.

zaarn · on June 5, 2019

There is a third reason to go lock free: You cannot take locks or wait.

I've had some cases in embedded devices and OS development where taking a lock is not possible.

One example might be read buffer in an embedded device that is on one end being fed from an interrupt routine and on the other being consumed by a normal processor routine.

The interrupt routine can happen at any time the normal processor routine happens and it cannot wait or yield to it since there is only one core and no task scheduling. It needs to write to the buffer now, regardless of what the consumer is doing.

Lock-free has some desirable properties, it can run concurrently on single-core machines in presence of interrupts and it's safely reentrant.

chrisseaton · on June 4, 2019

3. Lock-free cannot deadlock when one of the threads dies in the middle of an operation.

4. Lock-free will not stall all other threads if one of the threads stalls in the middle of an operation.

dragontamer · on June 4, 2019

3. If your writer thread dies randomly, then your reader thread has nothing to do.

3 part 2: If your reader thread dies randomly, then your writer thread eventually fills the buffer and has nothing to do.

Both cases turn into a deadlock with this "lock-free" buffer. You can't afford to have either thread die, regardless of the "lock free" or "locked" implementations.

> 4. Lock-free will not stall all other threads if one of the threads stalls in the middle of an operation.

Same thing here. The queue will eventually fill up if there's a stall, causing the reader (and / or the writer) threads to stall as well.

So in practice, neither #3 nor #4 actually help you out.

cesarb · on June 4, 2019

> Same thing here. The queue will eventually fill up if there's a stall, causing the reader (and / or the writer) threads to stall as well.

There's an important difference between "queue full" and "queue not full, but locked". In the former case, you can either block (waiting for the queue to drain) or discard the data; in the later case, you must wait for the lock to be released before deciding. That is, if you allow data to be discarded when the queue is full, a lock-free queue can also be stall-free.

(It might appear that a queue using a lock can also be stall-free, since the amount of work it does while locked is bounded; however that fails once the thread holding the lock is preempted by either a context switch or a hardware interrupt. To really avoid stalls in that case, you have to either block preemption and interrupts, or perhaps you might be able to make it work with some priority inheritance scheme. And yeah, I know that the hardware can "stall" while reading or updating a shared variable, but the amount of work the hardware does in that case is bounded.)

dragontamer · on June 4, 2019

> (It might appear that a queue using a lock can also be stall-free, since the amount of work it does while locked is bounded; however that fails once the thread holding the lock is preempted by either a context switch or a hardware interrupt. To really avoid stalls in that case, you have to either block preemption and interrupts, or perhaps you might be able to make it work with some priority inheritance scheme. And yeah, I know that the hardware can "stall" while reading or updating a shared variable, but the amount of work the hardware does in that case is bounded.)

If you absolutely need non-blocking locks, then Linux has pthread_mutex_trylock() and Windows has TryEnterCriticalSection(). (Remember: Linux pthread_mutexes are spinlocks in most cases. And Windows CriticalSections are also spinlocks in most cases).

> That is, if you allow data to be discarded when the queue is full, a lock-free queue can also be stall-free.

And in practice, how is that different from a locked queue that simply discards data when pthread_mutex_trylock() fails? In both locked and non-locking implementations, you are assuming that the programmer can live with discarded data (or at very least: nonblocking is more important than the discarded data).

Its trivial to write an implementation with pthread_mutex_trylock() to be stall-free when you can simply discard data.

cesarb · on June 4, 2019

> And in practice, how is that different from a locked queue that simply discards data when pthread_mutex_trylock() fails?

The difference is that it only discards data when the queue is full. If you discard data when a trylock fails, you'll randomly discard data just because the other side happened to be looking at the queue at that moment. In the common situation where the consumer is being fast enough to always drain the queue before it fills, it's the difference between not discarding any data versus discarding an unpredictable subset of the data.

dragontamer · on June 4, 2019

> In the common situation where the consumer is being fast enough to always drain the queue before it fills

There is literally no application I'm aware of where you can hold this assumption, if only because of the existence of swap memory and the way OSes page tasks in and out.

If you accept data-loss in the case of "poor performance on the consumer", you are tacitly accepting data-loss whenever that thread hits the "wrong" memory address that happened to have been swapped to disk, and now has to wait 10ms+ before that task can continue.

Its effectively random data loss from the perspective of the user.

------------

EDIT: I had a tangent here, but I probably shouldn't try to write a lock-free queue in one try in a stream of consciousness post. I just deleted it all, I probably had a subtle bug somewhere anyway.

------------

Seriously: look at the code in the blogpost. I'm not 100% convinced its bug free. The consumer-thread is still interacting with the "write" and "watermark" variables, and those two variables are NOT being updated atomically by the producer queue.

I think they manage to squeak by due to SeqCst, but its a very difficult "proof". I'm not entirely sure I can "prove" that the blogpost's code is correct.

But hey, if you think the code is fine, convince me why the following is NOT a bug.

    === In the "producer" thread ====
    
    buffer.write.store(buffer.write.load() + write_len)

Mind you, I don't think its a bug. But its really difficult for me to prove it. So I'm curious, why are you confident in their methodology when the above is such a difficult line to grok? And there's not a single compare-and-swap talked about in the entire blogpost (so they're definitely not using the "standard" methodology... to lock-free code)

bfrog · on June 4, 2019

To be fair the watermark is there to deal with the DMA problem, where you need a sequential chunk of memory.

Why not just create a ping-pong buffer setup though would be another question if the goal is DMA