Note that S3-FIFO has no loop resistance. Nevertheless, the result showing this library performing well on a looping workload suggests that some significant changes have been made. Likewise, the result showing Ristretto, which does have loop resistance, failing on a loop is clearly an anomaly and likely not measured correctly. Ristretto's benchmark results on S3, DS1, and OLTP differ markedly from the official ones (https://github.com/dgraph-io/ristretto). Otter's benchmark results are quite suspicious.
I think you may be right. The library author said he made many changes because implementing the eviction policy following the paper's design was pretty awful.
> First, I tried to implement the cache on a hash table and lock-free queues, as the authors wrote. It was annoying: s3-fifo was indeed many times faster than an LRU cache implemented on a hash table with a global mutex, but it still lost to ristretto and theine, which use BP-Wrapper techniques. The profiler also showed that it was the eviction policy that was spending the bulk of the time. I was only able to beat ristretto and theine after rewriting the implementation using BP-Wrapper.
> In summary, I completely disagree about the scalability advantage of S3-FIFO over the other policies.
> Though to be honest, I still don't understand what was causing the hit ratio to go down on DS1.
1. Unfortunately, ristretto has been showing a hit ratio of around 0 on almost all traces for a very long time now, and the authors don't respond to this in any way. Vitess, for example, has already switched to another cache. Here are two issues about it: https://github.com/dgraph-io/ristretto/issues/346 and https://github.com/dgraph-io/ristretto/issues/336. That is, ristretto shows such results even on its own benchmarks. You can see it just by running the hit ratio benchmarks on a very simple zipf distribution from the ristretto repository: https://github.com/dgraph-io/ristretto/blob/main/stress_test.... On this test I got the following: stress_test.go:75: actual: 0.07, optimal: 0.68 (a minimal sketch of this kind of check appears after this list).
2. Yes, otter contains a number of changes to the algorithm that in my opinion improve it. For example, S3-FIFO in otter does not have a worst-case time complexity of O(n); there it is O(1) (the second sketch after this list shows where the O(n) comes from).
3. But the claim that S3-FIFO performs worse than LRU on DS1 sounds very doubtful. It sounds more like a bug in the implementation, since S3-FIFO wins even on the caffeine simulator.
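On point 1, a minimal sketch in the spirit of ristretto's stress test: the configuration values and loop are illustrative, and ristretto applies Sets asynchronously, so the measured ratio is approximate.

    package main

    import (
        "fmt"
        "math/rand"

        "github.com/dgraph-io/ristretto"
    )

    func main() {
        cache, err := ristretto.NewCache(&ristretto.Config{
            NumCounters: 1e5, // ~10x the number of entries we expect to hold
            MaxCost:     1e4, // room for ~10k unit-cost entries
            BufferItems: 64,
        })
        if err != nil {
            panic(err)
        }

        // Zipf-distributed keys over a large key space.
        z := rand.NewZipf(rand.New(rand.NewSource(1)), 1.1, 1, 1<<20)

        var hits, total float64
        for i := 0; i < 1_000_000; i++ {
            k := z.Uint64()
            total++
            if _, ok := cache.Get(k); ok {
                hits++
            } else {
                cache.Set(k, k, 1)
            }
        }
        fmt.Printf("hit ratio: %.2f\n", hits/total)
    }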
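And on point 2, a sketch of where the O(n) worst case in the textbook design comes from (illustrative types, not otter's code):

    type entry struct {
        freq int // bumped on access, capped at a small maximum
    }

    // mainQueue is a stand-in for the main FIFO queue.
    type mainQueue struct{ items []*entry }

    func (q *mainQueue) popHead() *entry {
        if len(q.items) == 0 {
            return nil
        }
        e := q.items[0]
        q.items = q.items[1:]
        return e
    }

    func (q *mainQueue) pushTail(e *entry) { q.items = append(q.items, e) }

    // evictOne pops the head; an entry accessed while resident (freq > 0)
    // gets its counter decremented and is reinserted at the tail. If every
    // resident entry has freq > 0, one eviction walks O(n) entries before
    // finding a victim; that is the worst case mentioned in point 2.
    func (q *mainQueue) evictOne() *entry {
        for {
            e := q.popHead()
            if e == nil {
                return nil
            }
            if e.freq > 0 {
                e.freq--
                q.pushTail(e) // second chance: may repeat many times
                continue
            }
            return e // victim with freq == 0
        }
    }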
The usual phrase is "scan resistance", which is especially important for databases. A pure LRU policy does poorly on scans; LFU does better but performs worse than LRU on other workloads. The original ARC paper[0] has a good discussion of the tradeoffs.
Scan resistance and loop resistance are completely different. They are also distinguished in the papers. Note that ARC has scan resistance but no loop resistance.
They are not completely equivalent. A scan is a one-time access pattern, so it never produces hits during the scan itself. A loop accesses the same items repeatedly, so a policy with loop resistance can get many hits while looping; scan resistance alone yields no hits on a loop.
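A concrete pair of traces makes the difference visible. With a cache of 3 slots:

    scan:  A B C D E F G H ...       each key seen once; no policy can hit
    loop:  A B C D A B C D A B C D   4 keys repeat; a loop-resistant policy
                                     that pins A, B, C hits 3 of every 4
                                     accesses, while pure LRU evicts each
                                     key just before its reuse and hits 0%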
I am disgusted by people who bristle at even the correction of an outright error. They must have had bad grades in college. From now on, there is no choice but to leave it alone, even if a false rumor spreads.
There is an established benchmark for testing loop resistance (GLI, Loop), and S3-FIFO does not pass it. I have already confirmed this myself.
Those LIRS traces, along with many others, are available at this page [1]. I did a cursory review using their traces with Caffeine's and the author's simulators to avoid bias or a mistaken implementation. In their target workloads Caffeine was on par or better [2]. I have not seen anything novel in this or their previous works and find their claims to be easily disproven, so I have not implemented this policy in Caffeine's simulator yet.
Hi Ben, "I have not seen anything novel in this or their previous works and find their claims to be easily disproven" is a big statement.
We evaluated over 6000 workloads from 14 sources (Meta, Twitter, Microsoft, Wikipedia, VMWare, multiple CDNs, Tencent, Alibaba...), collected from 2007 to 2023. Most of the workloads we used are open source and available to be verified [1]. If you want to claim that TinyLFU is better, you cannot just show that it is better on one trace from 20 years ago. I hope this is not how you disprove a better algorithm.
Disclaimer: this is the author of S3-FIFO. We have never claimed that S3-FIFO is the best on every trace.
We observed that quick demotion[2] is important for achieving a low miss ratio on modern cache workloads, and existing algorithms such as TinyLFU and LIRS have lower miss ratios because of the small 1% window they use. This motivated us to design S3-FIFO, which uses simple FIFO queues to achieve low miss ratios. It is true that, compared to the state of the art, S3-FIFO does not use any fancy techniques, but this does not mean it has bad performance.
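For readers who have not seen the paper, the whole design fits in a few lines. This is a compressed sketch with illustrative names, not any particular implementation:

    type Key = uint64

    type Entry interface {
        Key() Key
        Freq() int // re-accesses since insertion, capped at a small maximum
    }

    type FIFO interface {
        Push(Key)
        Contains(Key) bool
    }

    // S3-FIFO keeps three plain FIFO queues: a small probation queue
    // (~10% of capacity), a main queue (~90%), and a ghost queue that
    // stores only the keys of objects recently evicted from small.
    type s3fifo struct {
        small FIFO
        main  FIFO
        ghost FIFO
    }

    func (c *s3fifo) insert(k Key) {
        if c.ghost.Contains(k) {
            c.main.Push(k) // seen recently: bypass probation
        } else {
            c.small.Push(k)
        }
    }

    // Called when small is full and must give up its head entry.
    func (c *s3fifo) demoteFromSmall(e Entry) {
        if e.Freq() > 0 {
            c.main.Push(e.Key()) // re-accessed while in small: promote
        } else {
            c.ghost.Push(e.Key()) // quick demotion: one-hit wonders leave fast
        }
    }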
In our large-scale evaluations, we found that the fancy techniques in LIRS, ARC, and TinyLFU can sometimes increase the miss ratio. But simple FIFO queues are more robust. However, *it is not true that S3-FIFO is better on every trace*.
* Note that some of the S3-FIFO results in Otter's repo are outdated and affected by an implementation bug; we are working with the owner to update them.
Hmm, I'd like to know more about the bugs and the outdated results. It seems that all the bugs from long ago have been fixed (and even with some improvements), and the results reflect the latest version's performance. Especially since S3-FIFO shows very good results: in fact, it is inferior only to the adaptive version of W-TinyLFU on the S3 and DS1 traces.
And about lock-free queues I disagree: what they do to an implementation can be seen from the theine Reads 100% graph in the otter repository; if it weren't for that, it would be equal in speed to ristretto. And the BP-Wrapper technique actually has a huge engineering advantage: it lets you focus on optimizing different components of the system individually and easily make changes to the eviction policy, which lock-free queues can't offer the developer.
Hi Alexey, I was referring to the bug you fixed two weeks ago. This comment was about the weird results that Ben cited.
I apologize here because I did not check the commit carefully, and I thought you had not updated the results (because the results on Zipf are worse than I expected). Thank you for clarifying this!
Can you clarify what you disagree about on lock-free queues? Do you mean that S3-FIFO cannot be implemented without locks?
In fact, lock-free queues have several problems at once, which prompted me to give up on them almost immediately.
1. Yes, S3-FIFO can be implemented using lock-free queues, but the problem is that with this design each write to a full cache causes a large number of additional atomic operations that are unfriendly to the processor's cache, while BP-Wrapper, by contrast, amortizes this load. And reads that update the frequency of hot entries can hurt performance. In many ways this is exactly what the last posts in my discussion with Ben are about (not quite this, but the current problem with otter's read speed is caused by a similar issue). https://github.com/Yiling-J/theine-go/issues/29#issuecomment...
2. But the main problem for me is not even that. Lock-free queues work fine as long as you only need to support Get and Set operations, but as soon as you want to add features to your cache, the implementation complexity starts to grow, and some features are very hard to add to such a structure. Improving the eviction policy also becomes a big question mark, because you have to think not only about how to improve the policy, but also about how to avoid locks while doing so, or how to keep your improvements from slowing the implementation down. BP-Wrapper has no such problems at all: it lets you use any eviction policy and improve different parts of your cache independently of each other. A minimal sketch of the idea follows below.
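Roughly like this: reads record the touched entry in small striped buffers instead of taking the policy lock, and a full stripe is drained in one batch under the lock. The names are made up for illustration; this is not otter's or Caffeine's actual code.

    package cache

    import "sync"

    type entry struct{ key uint64 }

    type stripe struct {
        mu    sync.Mutex
        items []*entry
    }

    type policy struct {
        evictMu sync.Mutex // protects the eviction policy's structures
        stripes [16]stripe // readers are spread across stripes
    }

    // recordRead buffers a read instead of touching the policy directly,
    // amortizing one policy-lock acquisition over a whole batch.
    func (p *policy) recordRead(e *entry, stripeID int) {
        s := &p.stripes[stripeID%len(p.stripes)]

        s.mu.Lock()
        s.items = append(s.items, e)
        var batch []*entry
        if len(s.items) >= 64 {
            batch, s.items = s.items, nil
        }
        s.mu.Unlock()

        if batch != nil {
            p.evictMu.Lock()
            for _, e := range batch {
                p.promote(e) // apply all buffered promotions at once
            }
            p.evictMu.Unlock()
        }
    }

    func (p *policy) promote(e *entry) { /* move e within the policy */ }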
I like the discussions with you. I have a few comments and questions.
1. Why does it require multiple atomic operations on each insert? Our implementation in cachelib used only one CAS operation. I will get back to you with an illustration. Maybe you can show me something I missed.
2. I like the bp-wrapper idea. It is intuitive and can be used with most/all algorithms. Batching does reduce/amortize the overhead of locking for LRU-based algorithms. But IIRC, all the promotions are still serialized: they are protected by the lock, and only one core can promote objects at a time. Caffeine achieves amazing scalability because under high contention it drops all the elements that need to be updated/promoted, which means it effectively becomes a FIFO. I don't think this is the correct way to claim good scalability.
3. "reading with frequency update on hot entries can have a bad effect on performance" I don't understand this. The frequency of a hot entry should quickly reach 1 (with a one-bit counter) or 3 (with a two-bit counter) and will not be updated anymore (see the counter sketch after this list). Why would reads and updates conflict?
4. Yes, implementing get and set with locking is trivial; adding support for delete requires a bit of work (tens of lines) but should be manageable. But why would lock-free queues make other features hard to implement? Can you give an example?
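For concreteness, here is roughly what such a saturating counter looks like (a sketch using Go's sync/atomic; the cap of 3 corresponds to the two-bit counter in point 3). Once the counter saturates, a read does a single atomic load and no write; before that, every read on a popular new entry issues a CAS on the same cache line:

    import "sync/atomic"

    // bumpFreq increments an entry's frequency, saturating at 3.
    func bumpFreq(freq *atomic.Int32) {
        for {
            f := freq.Load()
            if f >= 3 {
                return // saturated: hot entries stop generating writes
            }
            if freq.CompareAndSwap(f, f+1) {
                return
            }
        }
    }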
1. Here I don't understand how you can manage with a single CAS on a full cache. You need to evict items on every insert, reinsert items into the main queue, and so on, and if you add support for item weights, the number of XADD and CAS operations grows even further.
2. It's a bit less obvious here. Caffeine does not lose a single insert, update, or delete operation, and moreover it tracks their correct execution with fantastic scrupulousness. Losing some reads, on the other hand, is inherent in many eviction policy implementations. And caffeine's lossy buffers have another great property: the number of these buffers is automatically adjusted at runtime based on contention, which minimizes losses (a sketch of such a lossy buffer follows after this list).
3. I specifically mentioned that it's not quite the same thing, but it is possible to design a workload that forces as many XADD operations as possible and as few load operations as possible. In principle you can ignore this point, though, as it is rather artificial.
4. Here I rather mean that every feature has to live in the lock-free model or be integrated into the architecture separately (which only complicates things). Plus, suppose someone wants to modify the eviction policy a bit, for example by adding an LFU component to it; it is then unclear how to do that.
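To illustrate point 2, a sketch of a lossy read buffer (illustrative, not Caffeine's actual ring buffers): offer never blocks, and under contention it simply drops the read, which is safe because a dropped read only means a missed promotion, never a lost write.

    import "sync/atomic"

    type entry struct{ key uint64 }

    type readBuffer struct {
        head  atomic.Int64 // advanced by the consumer that drains the ring
        tail  atomic.Int64
        slots [128]atomic.Pointer[entry]
    }

    func (b *readBuffer) offer(e *entry) bool {
        h := b.head.Load()
        t := b.tail.Load()
        if t-h >= int64(len(b.slots)) {
            return false // ring is full: drop the read
        }
        if !b.tail.CompareAndSwap(t, t+1) {
            return false // lost the race: drop instead of spinning
        }
        b.slots[t%int64(len(b.slots))].Store(e)
        return true
    }

    // The drain side (taking the policy lock, applying the buffered
    // promotions, advancing head) is omitted here.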
1. I wrote a bit on how to implement lock-free FIFO queues here: https://blog.jasony.me/system/cache/2023/12/28/fifo. Let me know if any part is unclear or incorrect. Moving an object from the small queue to the main queue can be viewed as two operations: evicting the item from the small queue and inserting it into the main queue (see the sketch after this list).
2. I am not criticizing the design decision in Caffeine. It is good, because most people just don't need crazy scalability (>16 threads) and do not run it at such high throughput. I understand that critical operations are not lost. But the promotions are lost under contention, which makes the comparison unfair: S3-FIFO is still the same algorithm under contention, but the W-TinyLFU in Caffeine becomes a FIFO under high contention. I do not think it can claim a low miss ratio and good scalability at the same time; the relationship is rather a low miss ratio OR good scalability. But as I mentioned, it does not matter for most use cases, and the design is elegant.
3. skipped
4. It is true that porting a new lock-based algorithm would require extra effort. But I guess this is a decision you have to make: support only lock-free eviction algorithms (FIFO-based), or support more algorithms with locks. I do want to mention, though, that S3-FIFO is not the only lock-free algorithm, and there are and will be more.
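To sketch point 1 (stub types stand in for the queues from the blog post; dequeue and enqueue are each assumed to be a single CAS on the queue's head or tail):

    import "sync/atomic"

    type entry struct {
        key  uint64
        freq atomic.Int32
    }

    // lockFreeFIFO stands in for a CAS-based FIFO; bodies are stubbed.
    type lockFreeFIFO struct{}

    func (q *lockFreeFIFO) dequeue() *entry  { return nil /* CAS on head */ }
    func (q *lockFreeFIFO) enqueue(e *entry) { /* CAS on tail */ }

    // migrate shows the two-step move: one dequeue CAS on the small queue,
    // then one enqueue CAS on either the main or the ghost queue, with no
    // global lock held across the steps.
    func migrate(small, main, ghost *lockFreeFIFO) {
        e := small.dequeue()
        if e == nil {
            return
        }
        if e.freq.Load() > 0 {
            main.enqueue(e) // re-accessed: promote to main
        } else {
            ghost.enqueue(e) // not reused: remember the key, evict the data
        }
    }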
I did a comparison on your data sets, such as DS1, and S3-FIFO had much lower hit ratios than LRU in most cases. I think there are very limited situations where S3-FIFO is best.
You evaluated a single trace (from twenty years ago) and claimed that S3-FIFO is worse; why not show more traces? Most of the traces we used are open-source and can be found here.
Hi falsandru, I don't know where you got the feeling that S3-FIFO is not loop-resistant; can you elaborate?
There is an adversarial pattern in which each object is requested only twice, and the second request comes very late (after the object has left the small queue). But such a pattern is very rare (if it ever exists at all), and other algorithms, such as TinyLFU, also suffer from this access pattern.
You claimed that you verified it, but so far no evidence has been given except for the single trace on which S3-FIFO is indeed worse. Frankly speaking, I don't understand why you are getting irritated at other users in this thread.