It looks like a competent and thoughtful implementation but, as best I can determine and not to take anything away from it, one based on an old design. Performance and scalability are throttled by the use of secondary indexing structures. You would have to use some pretty expensive hardware for the performance cliffs not to be immediately evident.
I don’t do a lot of work on graph databases these days, but I’ve seen state-of-the-art implementations do 10x this many inserts/sec/server on EC2 VMs where the local data model size was 100x the available RAM. And in principle these architectures could easily scale out. Indexing structure and storage engine design figure prominently; both usually need to be built for the purpose.
Excellent points. To add, and I’m not taking anything away from the technical effort, the use of “native graph” is rather misleading. Existing computer architectures cannot represent a graph with purely sequential (one-dimensional) memory access, so any representation has to make trade-offs and rely on random memory access patterns. A longer discussion here (obvious bias aside, and the other no longer works at DSE): https://www.datastax.com/blog/2013/11/letter-regarding-nativ...
Thanks for the reply. "Native graph" here means the system (including the storage and query engines) is designed around the graph data structure. The opposite of "native graph" is usually called a "multi-model" database. In those systems, the storage is designed as tables or some other data structure, and they only provide a graph query interface on top to simulate a graph query engine. Behind the scenes, they are still running SQL (or whatever) queries.
In Nebula, data are stored in such a way that getting all of a vertex's neighbors is actually a sequential read.
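As a rough illustration of how that can work (a minimal sketch, not Nebula's actual on-disk format; the key layout and vertex ids below are hypothetical): if edges are kept in a sorted key-value store with keys prefixed by the source vertex id, then "all neighbors of v" becomes one contiguous range scan rather than a random lookup per edge.

```python
import bisect

# Hypothetical sorted edge keys: (partition, src_vertex, edge_type, dst_vertex).
# A real system would keep these sorted on disk (e.g. in an LSM-tree);
# a sorted Python list stands in for that here.
edges = sorted([
    (0, "v1", "follows", "v2"),
    (0, "v1", "follows", "v3"),
    (0, "v1", "likes",   "v9"),
    (0, "v2", "follows", "v1"),
])

def neighbors(partition, src, edge_type):
    """All destinations of (src, edge_type): one contiguous range scan."""
    lo = bisect.bisect_left(edges, (partition, src, edge_type, ""))
    out = []
    for key in edges[lo:]:                       # sequential walk forward
        if key[:3] != (partition, src, edge_type):
            break                                # left the shared key prefix
        out.append(key[3])
    return out

print(neighbors(0, "v1", "follows"))  # ['v2', 'v3']
```

Under that layout, distribution is compatible with the idea: if partitioning is done by source vertex id, all of a vertex's out-edges land in the same partition, so the scan stays local to one machine.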
I see, thanks for the clarification. Can you expand on that a bit more? Is this some sort of index-free adjacency then? I still don't understand how the neighbours can be stored sequentially in memory, especially if this is a distributed system.
Does the OP have links to any benchmarks? Specifically, what kind of ingestion rates can one expect with a modest number of machines? Does it support a single-machine (shared-memory parallel) environment? What kind of algorithms are supported?
It would be good to add some information about the features/capabilities on the homepage. Right now the blurbs make vague statements like "high throughput", which could be 1000 edge updates/sec or 10M.
Thanks so much for your suggestion regarding the website! I have been thinking about the same thing. Will keep improving the site along the way. Really appreciate it.
As for throughput data, there are some PoC projects going on. According to data from production, one of our clients inserted 300 billion records into 6 servers within 20 hours, which works out to about 690k inserts/sec/server.
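For reference, a quick sanity check of that figure from the numbers as stated:

```python
records = 300e9          # 300 billion records
hours   = 20
servers = 6

per_server_per_sec = records / (hours * 3600) / servers
print(f"{per_server_per_sec:,.0f}")  # ~694,444, consistent with the ~690k claim
```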
We want the benchmark data to be verified by reputable clients in their production environments, and we will reveal more data in the future.
Just curious, didn’t you have to do some basic benchmarking using your own data to get these clients to sign up in the first place? Or is this part of a larger engagement/partnership where these clients trust you enough to embark on this?
I can't consider this until the folks at Jepsen have put it through its paces, or it has matured and been battle-tested first. A database is so important to anything these days that it has to have a seal of approval from the likes of Jepsen before I trust my data to it, which is why I bias towards existing solutions before jumping on a new DB.
Great digging! Thanks so much for paying attention to the benchmark report data. We apologize that you have had to wait so long!
Yes, we have been working on the benchmark data for quite some time, because we have been working with our clients to verify our capability. For example, one of our clients inserted 300 billion records into 6 servers within 20 hours, so we are confident in saying that Nebula Graph can manage about 690k inserts/sec/server.
We will keep working and provide a trustworthy benchmark report for you as soon as we can.
Graph databases are efficient at exploring multi-hop relationships, which are common in many business scenarios. So basically, if your application needs to query n-hop relationships all the time, a graph database is the better choice. Main use cases include real-time recommendation (products/content/shops), risk management such as fraud detection in financial services, knowledge graphs, and machine learning.
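To make the n-hop point concrete, here is a minimal sketch (plain Python, not any particular database's engine; the `follows` data is made up) of what an n-hop query boils down to: one neighbor expansion per hop. In a relational store each hop typically becomes another self-join over an edge table, while a graph store optimizes exactly this per-hop neighbor lookup.

```python
# Hypothetical adjacency data: who follows whom.
follows = {
    "alice": ["bob", "carol"],
    "bob":   ["dave"],
    "carol": ["dave", "erin"],
    "dave":  [],
    "erin":  [],
}

def n_hop(start, n):
    """Vertices first reached at exactly n hops: one neighbor expansion per hop."""
    frontier, seen = {start}, {start}
    for _ in range(n):
        nxt = set()
        for v in frontier:
            for w in follows.get(v, []):   # the lookup a graph store optimizes
                if w not in seen:
                    seen.add(w)
                    nxt.add(w)
        frontier = nxt
    return frontier

print(n_hop("alice", 2))  # {'dave', 'erin'}
```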