It looks like a competent and thoughtful implementation but, as best I can determine and not to take anything away from it, one based on an old design. Performance and scalability are throttled by the use of secondary indexing structures. You would have to use some pretty expensive hardware for the performance cliffs not to be immediately evident.
I don’t do a lot of work on graph databases these days, but I’ve seen state-of-the-art implementations do 10x this many inserts/sec/server on EC2 VMs where the local data model size was 100x the available RAM. And in principle these architectures could easily scale out. Indexing structure and storage engine design figure prominently; both usually need to be built for the purpose.
Excellent points. To add, and I’m not taking anything away from the technical effort, the use of “native graph” is rather misleading. Existing computer architectures cannot represent a graph with purely sequential (one-dimensional) memory access, so any representation has to make trade-offs and rely on random memory access patterns. A longer discussion here (obvious bias aside, and the other no longer works at DSE): https://www.datastax.com/blog/2013/11/letter-regarding-nativ...
Thanks for the reply. "Native graph" here means the system (including the storage and query engines) is designed around the graph data structure. The opposite of "native graph" is usually called a "multi-model" database. In those systems, the storage is designed as tables or some other data structure, and they only provide a graph query interface on top to simulate a graph query engine. Behind the scenes, they are still running SQL (or whatever) queries.
In Nebula, data are stored in such a way that getting all of a vertex's neighbors is actually a sequential read.
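As a rough illustration of how that can work (a minimal sketch, not Nebula's actual on-disk format; the key layout and vertex ids below are hypothetical): if edges are kept in a sorted key-value store with keys prefixed by the source vertex id, then "all neighbors of v" becomes one contiguous range scan rather than a random lookup per edge.

```python
import bisect

# Hypothetical sorted edge keys: (partition, src_vertex, edge_type, dst_vertex).
# A real system would keep these sorted on disk (e.g. in an LSM-tree);
# a sorted Python list stands in for that here.
edges = sorted([
    (0, "v1", "follows", "v2"),
    (0, "v1", "follows", "v3"),
    (0, "v1", "likes",   "v9"),
    (0, "v2", "follows", "v1"),
])

def neighbors(partition, src, edge_type):
    """All destinations of (src, edge_type): one contiguous range scan."""
    lo = bisect.bisect_left(edges, (partition, src, edge_type, ""))
    out = []
    for key in edges[lo:]:                       # sequential walk forward
        if key[:3] != (partition, src, edge_type):
            break                                # left the shared key prefix
        out.append(key[3])
    return out

print(neighbors(0, "v1", "follows"))  # ['v2', 'v3']
```

Under that layout, distribution is compatible with the idea: if partitioning is done by source vertex id, all of a vertex's out-edges land in the same partition, so the scan stays local to one machine.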
I see, thanks for the clarification. Can you expand on that a bit more? Is this some sort of index-free adjacency then? I still don't understand how the neighbours can be stored sequentially in memory, especially if this is a distributed system.
Does the OP have links to any benchmarks? Specifically, what kind of ingestion rates can one expect with a modest number of machines? Does it support a single-machine (shared-memory parallel) environment? What kind of algorithms are supported?
It would be good to add some information about the features/capabilities on the homepage. Right now the blurbs make vague statements like "high throughput", which could be 1000 edge updates/sec or 10M.
Thanks so much for your suggestion regarding the website! I have been thinking about the same thing. Will keep improving the site along the way. Really appreciate it.
As for throughput data, there are some PoC projects going on. According to data from production, one of our clients inserted 300 billion records into 6 servers within 20 hours, which works out to about 690k inserts/sec/server.
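For reference, a quick sanity check of that figure from the numbers as stated:

```python
records = 300e9          # 300 billion records
hours   = 20
servers = 6

per_server_per_sec = records / (hours * 3600) / servers
print(f"{per_server_per_sec:,.0f}")  # ~694,444, consistent with the ~690k claim
```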
We want the benchmark data to be verified by reputable clients in their production environments, and we will reveal more data in the future.
Just curious, didn’t you have to do some basic benchmarking using your own data to get these clients to sign up in the first place? Or is this part of a larger engagement/partnership where these clients trust you enough to embark on this?
I can't consider this until the folks at Jepsen have put it through its paces, or it has matured and been battle-tested first. A database is so important to anything these days that it has to have a seal of approval from the likes of Jepsen before I trust my data to it, which is why I bias towards existing solutions before jumping on a new DB.
Great digging! Thanks so much for paying attention to the benchmark report data. We apologize that you have had to wait so long!
Yes, we have been working on the benchmark data for quite some time, because we have been working with our clients to verify our capability. For example, one of our clients inserted 300 billion records into 6 servers within 20 hours, so we are confident in saying that Nebula Graph can manage about 690k inserts/sec/server.
We will keep working and provide a trustworthy benchmark report for you as soon as we can.
Graph databases are efficient at exploring multi-hop relationships, which are common in many business scenarios. So basically, if your application needs to query n-hop relationships all the time, a graph database is the better choice. Main use cases include real-time recommendation (products/content/shops), risk management such as fraud detection in financial services, knowledge graphs, and machine learning.
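To make the n-hop point concrete, here is a minimal sketch (plain Python, not any particular database's engine; the `follows` data is made up) of what an n-hop query boils down to: one neighbor expansion per hop. In a relational store each hop typically becomes another self-join over an edge table, while a graph store optimizes exactly this per-hop neighbor lookup.

```python
# Hypothetical adjacency data: who follows whom.
follows = {
    "alice": ["bob", "carol"],
    "bob":   ["dave"],
    "carol": ["dave", "erin"],
    "dave":  [],
    "erin":  [],
}

def n_hop(start, n):
    """Vertices first reached at exactly n hops: one neighbor expansion per hop."""
    frontier, seen = {start}, {start}
    for _ in range(n):
        nxt = set()
        for v in frontier:
            for w in follows.get(v, []):   # the lookup a graph store optimizes
                if w not in seen:
                    seen.add(w)
                    nxt.add(w)
        frontier = nxt
    return frontier

print(n_hop("alice", 2))  # {'dave', 'erin'}
```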