What I disliked about Riak is that, although at first glance it appears you can namespace key/value pairs into multiple buckets, you can't. The whole database is really a single bucket, and if you want to run a map/reduce it always runs over every key in the database!
Now you can simply accept extra complexity and create multiple Riak clusters for different data schemas, treating Riak as a replication tool. However, their in-memory bitcask map/reduce was actually slower than a hard-drive map/reduce over text JSON files in a folder, where each file was individually opened, read from disk, decoded, and closed in a loop (with nothing held in memory). That was rather scary!
Their replication scheme does appear to be a fairly faithful copy of Amazon Dynamo (consistent hashing ring, quorums, Merkle trees), which is nice.
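For anyone unfamiliar with the Dynamo approach, here's a toy sketch of how a consistent hashing ring picks the replica set for a key. The node names, SHA-1 choice, and lack of virtual nodes are all simplifications for illustration, not Riak's actual implementation:

```python
import hashlib
from bisect import bisect_right

class HashRing:
    """Toy consistent-hashing ring: a key is owned by the first N distinct
    nodes found walking clockwise from the key's position on the ring."""
    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        self.ring = sorted((self._hash(n), n) for n in nodes)

    def _hash(self, s):
        return int(hashlib.sha1(s.encode()).hexdigest(), 16)

    def preference_list(self, key):
        """Return the `replicas` distinct nodes responsible for `key`."""
        hashes = [h for h, _ in self.ring]
        idx = bisect_right(hashes, self._hash(key))
        nodes = []
        for i in range(len(self.ring)):
            node = self.ring[(idx + i) % len(self.ring)][1]
            if node not in nodes:
                nodes.append(node)
            if len(nodes) == self.replicas:
                break
        return nodes

ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
print(ring.preference_list("some/key"))  # three distinct nodes
```

Adding or removing a node only remaps the keys on the arcs adjacent to it, which is the whole point of the scheme.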
To my knowledge, there is no such thing as an in-memory bitcask backend for Riak. Could you have been using regular Bitcask (keeps values on disk in log-structured files) or the ETS memory backend?
Bitcask is an in-memory datastore because it keeps a copy of the entire dataset in memory: 4 GB of data means 4 GB of memory, which is far more than the trivial amount used by a file descriptor.
>keeps values on disk in log-structured files
Of course it does. Log files are the easiest way to make an in-memory data structure persistent: you periodically append fast writes to the log as a backup, in case you need to rebuild the in-memory structure in the future. Read requests are served from the in-memory data structure and do not require a disk access.
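The pattern I'm describing can be sketched in a few lines: an append-only log whose only job is to rebuild the in-memory dict after a restart. The file name and JSON-lines format here are arbitrary choices for illustration:

```python
import json, os, tempfile

class LoggedDict:
    """In-memory dict whose writes are appended to a log for crash recovery.
    Reads never touch disk; the log exists only to rebuild the dict."""
    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):            # replay the log on startup
            with open(path) as f:
                for line in f:
                    key, value = json.loads(line)
                    self.data[key] = value  # later entries win

    def put(self, key, value):
        with open(self.path, "a") as f:     # append-only: fast sequential write
            f.write(json.dumps([key, value]) + "\n")
        self.data[key] = value

    def get(self, key):
        return self.data[key]               # pure memory lookup, zero disk I/O

path = os.path.join(tempfile.mkdtemp(), "store.log")
store = LoggedDict(path)
store.put("user:1", {"name": "alice"})
store.put("user:1", {"name": "bob"})        # later write wins on replay
print(LoggedDict(path).get("user:1"))       # {'name': 'bob'}
```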
The fact that something using all your RAM, which should require zero disk accesses for reads, was NOT substantially beating something performing thousands of disk accesses is a FUBAR situation.
> Bitcask is an in memory datastore because it keeps a copy of the entire dataset in memory.
Begging your pardon, but I think you may be misunderstanding bitcask. The bitcask keydir is stored in memory, but the values are stored on disk. The keydir is a hash table mapping each key to a file ID and the offset/size within that file at which the value is stored. The only time values live in memory is when the kernel's fs cache or readahead buffer provides them.
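A minimal sketch of that layout, assuming a single data file and ignoring the CRCs, timestamps, tombstones, and file rotation real bitcask uses:

```python
import os, tempfile

class TinyBitcask:
    """Sketch of the bitcask split: keys and offsets live in a memory-resident
    keydir, while values live only in an append-only data file on disk."""
    def __init__(self, path):
        self.path = path
        self.keydir = {}                  # key -> (offset, size)
        open(path, "ab").close()          # ensure the data file exists

    def put(self, key, value):
        with open(self.path, "ab") as f:  # append value to the data file
            offset = f.tell()
            f.write(value)
        self.keydir[key] = (offset, len(value))

    def get(self, key):
        offset, size = self.keydir[key]   # in-memory keydir lookup...
        with open(self.path, "rb") as f:
            f.seek(offset)                # ...then one disk seek + read
            return f.read(size)

db = TinyBitcask(os.path.join(tempfile.mkdtemp(), "data"))
db.put("greeting", b"hello")
print(db.get("greeting"))  # b'hello'
```

Note that `get` always touches the disk for the value; only the key-to-offset mapping is guaranteed to be in RAM.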
Since a filesystem directory listing is likely held by your OS cache, you should see similar performance between bitcask and files on disk: an in-memory lookup to obtain the inode/offset, and a disk seek+read.
Riak will be substantially slower than using bitcask directly, however, because it may need to talk over the network to as many as N nodes, wait for their responses, compute the resultant vclock/metadata, and then serialize it for HTTP (substantial overhead) or protocol buffers (relatively fast). If you're running on a single machine, you may also incur extra time for that one machine to do work that would normally be distributed over three nodes. Without knowing more about your benchmark, however, it's difficult to say.
Yes I was, I apologize. It must have been MongoDB that required the total dataset to be no larger than the amount of RAM.
The test was run using protocol buffers and the Python client on the same computer, comparing reads against a naive map/reduce. The naive map/reduce stored 4,000 files in a folder, treated the filename as the key, and parsed each text file's contents from JSON into a Python dictionary to see whether an attribute matched. Basically, I figured that a loop of thousands of blocking disk accesses on a laptop hard drive over a standard filesystem, buffering nothing in memory, should always be much slower than any database.
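For concreteness, the naive folder scan I'm describing looked roughly like this (function and parameter names here are hypothetical, not my exact benchmark code):

```python
import json, os, tempfile

def folder_map_reduce(folder, attr, wanted):
    """Naive 'map' over a folder of JSON files: open, read, parse, and close
    each one in a loop, holding nothing in memory between iterations."""
    matches = []
    for name in os.listdir(folder):
        with open(os.path.join(folder, name)) as f:  # one blocking read per key
            record = json.load(f)
        if record.get(attr) == wanted:
            matches.append(name)                     # filename doubles as key
    return matches

folder = tempfile.mkdtemp()
for name, record in [("k1", {"color": "red"}), ("k2", {"color": "blue"})]:
    with open(os.path.join(folder, name), "w") as f:
        json.dump(record, f)
print(folder_map_reduce(folder, "color", "red"))  # ['k1']
```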
This was last year so maybe Riak's performance has increased since then. I'd be interested if TokyoCabinet was added as a backend.
Riak, by default, uses a replication factor of three. Your single test machine has to do roughly three times the work, so you should expect slower performance here. (I'm oversimplifying somewhat.)
You'll see significantly improved performance on a linear test (in my informal testing, 3-4x speedups) by adding an extra two nodes. Parallelized tests pretty much scale linearly with nodes.
In practice, I've found Riak to be slightly slower than MySQL. Direct reads/writes tend to be fast, but JSON parsing can bite you and denormalization requires more writes. The major advantage is that the Riak system can scale linearly with nodes, and that it can fail in predictable and resolvable ways.
As an example, the feed system I'm currently building on Riak will survive a total network partition and allow full reads and writes from every node with no data lost. Everything is automatically merged when the partition ends. The vclock-tagged multi-value functionality of Riak is exceptionally powerful when you want to design these types of systems, and is, in my mind, worth the performance hit and additional design complexity for certain classes of problems.
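For a grow-only feed like this, the client-side merge of divergent siblings can be as simple as a set union. A hypothetical sketch of just the merge logic a client would run (this is not Riak's API):

```python
def merge_feed_siblings(siblings):
    """Merge divergent feed copies after a partition heals: each sibling is a
    set of entry IDs, and a union preserves every write from every side."""
    merged = set()
    for sibling in siblings:
        merged |= sibling
    return merged

# writes accepted on both sides of a network partition
side_a = {"post-1", "post-2"}
side_b = {"post-1", "post-3"}
print(sorted(merge_feed_siblings([side_a, side_b])))
# ['post-1', 'post-2', 'post-3']
```

Union works here because feed entries are only ever added; supporting deletes would need tombstones or a more careful merge, which is where the design complexity I mentioned comes in.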
> This was last year so maybe Riak's performance has increased since then. I'd be interested if TokyoCabinet was added as a backend.
There are also InnoDB and multiple in-memory backends, which may provide performance characteristics more in line with what you are looking for.