
I was looking into the scalability of Neo4j, and read a document written by David Montag in January 2013.

Concerning the sharding aspect, he said the first release of 2014 would come with a first solution.

Does anyone know if it was done or its status if not?

Thanks!

2 Answers


Disclosure: I'm working as VP Product for Neo Technology, the sponsor of the Neo4j open source graph database.

Now that we've just released Neo4j 2.0 (actually 2.0.1 today!) we are embarking on a 2.1 release that is mostly oriented around (even more) performance & scalability. This will increase the upper limits of the graph to an effectively unlimited number of entities, and improve various other things.

Let me set some context first, and then answer your question.

As you probably saw from the paper, Neo4j's current horizontal-scaling architecture allows read scaling, with writes all going to master and fanning out. This gets you effectively unlimited read scaling, and into the tens of thousands of writes per second.
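The read-scaling architecture described above can be sketched in a few lines. This is a minimal illustration, not Neo4j's actual driver code; the endpoint strings are hypothetical placeholders:

```python
import itertools

class ClusterRouter:
    """Sketch of master/replica routing: all writes go to the single
    master, while reads fan out across the replicas (round-robin)."""

    def __init__(self, master, replicas):
        self.master = master                     # hypothetical endpoint
        self._reads = itertools.cycle(replicas)  # round-robin over replicas

    def route(self, query_is_write):
        # Writes must hit the master; reads can go to any instance.
        return self.master if query_is_write else next(self._reads)

router = ClusterRouter("master:7474", ["replica1:7474", "replica2:7474"])
print(router.route(query_is_write=True))   # always the master
print(router.route(query_is_write=False))  # some replica
```

Adding replicas multiplies read capacity, which is why reads scale effectively without bound, while write throughput stays bounded by the single master.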

Practically speaking, there are production Neo4j customers (including Snap Interactive and Glassdoor) with around a billion people in their social graph... in all cases behind an active and heavily-hit web site, being handled by comparatively quite modest Neo4j clusters (no more than 5 instances). So that's one key feature: the Neo4j of today has an incredible computational density, and so we regularly see fairly small clusters handling a substantial production workload... with very fast response times.

More on the current architecture can be found here: www.neotechnology.com/neo4j-scales-for-the-enterprise/ And a list of customers (which includes companies like Wal-Mart and eBay) can be found here: neotechnology.com/customers/ One of the world's largest parcel delivery carriers uses Neo4j to route all of their packages, in real time, with peaks of 3000 routing operations per second, and zero downtime. (This arguably is the world's largest and most mission-critical use of a graph database and of a NOSQL database; though unfortunately I can't say who it is.)

So in one sense the tl;dr is that if you're not yet as big as Wal-Mart or eBay, then you're probably ok. That oversimplifies it only a bit. There is the 1% of cases where you have sustained transactional write workloads into the 100s of thousands per second. However even in those cases it's often not the right thing to load all of that data into the real-time graph. We usually advise people to do some aggregation or filtering, and bring only the more important things into the graph. Intuit gave a good talk about this. They filter a billion B2B transactions into a much smaller number of aggregate monthly transaction relationships with aggregated counts and currency amounts by direction.
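The aggregation idea from the Intuit example can be sketched like this. The data and field names here are made up for illustration; the point is collapsing a huge raw transaction stream into one aggregate relationship per (payer, payee, month) before anything enters the graph:

```python
from collections import defaultdict

# Hypothetical raw B2B events: (payer, payee, month, amount).
raw = [
    ("acme", "globex", "2014-01", 120.0),
    ("acme", "globex", "2014-01", 80.0),
    ("acme", "initech", "2014-02", 50.0),
]

# Collapse raw events into one aggregate per (payer, payee, month).
agg = defaultdict(lambda: {"count": 0, "total": 0.0})
for payer, payee, month, amount in raw:
    rel = agg[(payer, payee, month)]
    rel["count"] += 1
    rel["total"] += amount

# Each aggregate becomes a single relationship in the graph instead of
# one node/relationship per raw transaction.
print(agg[("acme", "globex", "2014-01")])  # {'count': 2, 'total': 200.0}
```

A billion raw transactions might reduce to a few million monthly aggregates, which comfortably fits the single-master write path described above.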

Enter sharding... Sharding has gained a lot of popularity these days. This is largely thanks to the other three categories of NOSQL, where joins are an anti-pattern. Most queries involve reading or writing just a single piece of discrete data. Just as joining is an anti-pattern for key-value stores and document databases, sharding is an anti-pattern for graph databases. What I mean by that is... the very best performance will occur when all of your data is available in memory on a single instance, because hopping back and forth all over the network whenever you're reading and writing will slow things down significantly, unless you've been really really smart about how you distribute your data... and even then. Our approach has been twofold:

  1. Do as many smart things as possible in order to support extremely high read & write volumes without having to resort to sharding. This gets you the best and most predictable latency and efficiency. In other words: if we can be good enough to support your requirement without sharding, that will always be the best approach. The link above describes some of these tricks, including the deployment pattern that lets you shard your data in memory without having to shard it on disk (a trick we call cache-sharding). There are other tricks along similar lines, and more coming down the pike...

  2. Add a secondary architecture pattern into Neo4j that does support sharding. Why do this if sharding is best avoided? As more people find more uses for graphs, and data volumes continue to increase, we think eventually it will be an important and inevitable thing. This would allow you to run all of Facebook for example, in one Neo4j cluster (a pretty huge one)... not just the social part of the graph, which we can handle today. We've already done a lot of work on this, and have an architecture developed that we believe balances the many considerations. This is a multi-year effort, and while we could very easily release a version of Neo4j that shards naively (that would no doubt be really popular), we probably won't do that. We want to do it right, which amounts to rocket science.
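The cache-sharding trick mentioned in point 1 can be sketched as consistent request routing. This is an illustrative sketch only (the instance names are hypothetical): every instance still holds the full graph on disk, but because requests for the same user always land on the same instance, each instance's cache ends up warm for a distinct slice of the graph:

```python
import hashlib

# Every instance stores the FULL graph on disk; only the in-memory
# caches end up "sharded" by routing consistently.
INSTANCES = ["neo4j-1:7474", "neo4j-2:7474", "neo4j-3:7474"]  # hypothetical

def instance_for(user_id: str) -> str:
    # Stable hash: the same user always lands on the same instance,
    # so that instance's cache stays hot for that user's subgraph.
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return INSTANCES[int(digest, 16) % len(INSTANCES)]

# Repeated requests for one user hit one warm cache; different users
# spread across the cluster.
assert instance_for("alice") == instance_for("alice")
```

This gives much of the memory benefit of sharding without ever splitting the data on disk, so any instance can still answer any query.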

  • Great post. In our use case we have multiple mostly distinct graphs. They only touch on a few edges, so for that kind of use case it wouldn't be that strange to give each distinct graph part its own shard. So we would steer which node goes into what shard.
    – Wouter
    Jun 19, 2014 at 6:55
  • Is it possible to do neo4j sharding in embedded mode ?
    – Abhimanyu
    Feb 23, 2015 at 11:51
  • So someone needs to implement an indexing layer a la Facebook's Unicorn ;-)
    – zazi
    Jun 27, 2015 at 9:25
  • Sooo... Multi-years later (technically, just) how's the sharding coming?
    – Basic
    Feb 20, 2016 at 2:11
  • and now! how to shard neo4j?
    – afruzan
    Jul 16, 2017 at 10:58

TL;DR With 2018 just days away, Neo4j still does not support sharding as it is typically understood.

Details: Neo4j still requires all data to fit on a single node. The node contents can be replicated within a cluster, but actual sharding is not part of the picture.

When Neo4j talks of sharding, they are referring to caching portions of the database in memory: different slices are cached on different replicated nodes. That differs from, say, MySQL sharding, in which each node contains only a portion of the total data.
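For contrast, here is a minimal sketch of what "actual" sharding looks like, in the MySQL sense the answer refers to: each store holds only its own slice of the data, and no other store has a copy. The shard stores here are plain dicts standing in for separate servers:

```python
# Three hypothetical shard stores, each standing in for a separate
# server that holds ONLY its own slice of the data.
SHARDS = [dict(), dict(), dict()]

def shard_put(key, value):
    # The record lives on exactly one shard, chosen by the key.
    SHARDS[hash(key) % len(SHARDS)][key] = value

def shard_get(key):
    # A lookup must be routed to the one shard that owns the key.
    return SHARDS[hash(key) % len(SHARDS)].get(key)

shard_put("user:42", {"name": "Ada"})
# Exactly one shard holds the record; the others have no copy of it.
assert sum("user:42" in s for s in SHARDS) == 1
```

In Neo4j's HA model, by contrast, every instance would hold the full dataset, and only the in-memory caches differ between instances.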

Here is a summary of their take on scalability (their product term is "High Availability"): https://neo4j.com/blog/neo4j-scalability-infographic/


Note that High Availability is not the same as Scalability: they do not actually support the latter in the traditional understanding of the term.
