Neo4j Data modelling 101

Published in Neo4j Developer Blog, Aug 14, 2020 (8 min read)

What is Neo4j?

“The world’s most flexible, reliable and developer friendly graph database as a service.” It is an online database management system that stores data as a graph and supports Create, Read, Update and Delete (CRUD) operations.

What is a graph database?

If you are coming from my previous blog about setting up Neo4j, skip this and the next section. Go to Data modelling directly.

A graph database, also called a graph-oriented database, is a type of NoSQL database that uses graph theory to store, map and query relationships.
A graph database is essentially a collection of nodes and edges.

A graph is composed of two elements: a node (or vertex) and a relationship (or edge). Each node represents an entity (a person, place, thing, category or other piece of data), and each relationship represents how two nodes are associated. This general-purpose structure allows you to model all kinds of scenarios — from a system of roads, to a network of devices, to a population’s medical history or anything else defined by relationships.

Why would you ever require a graph database anyway?

A graph database, unlike a relational database management system (RDBMS), treats relationships as first-class citizens. Related data can be retrieved without complex join queries or foreign-key lookups: entities are connected as soon as we know they are related, so these mapping methods are unnecessary. Since graph databases employ object-oriented thinking at their core, the data model you draw on your whiteboard is the model you can store in your database.

Modern data has, implicitly, lots of relationships. In order to leverage these data connections, organizations need a database technology that stores relationship information as a first-class entity. That technology is a graph database. Unfortunately, legacy RDBMS are poor at handling data relationships. Also, their rigid schemas make it difficult to add different connections or adapt to new business requirements.

Not only do graph databases effectively store data relationships; they’re also flexible when expanding a data model or conforming to changing business needs. You can read more about advantages of using Graph databases.

Data Modelling

Neo4j databases store data as nodes, relationships and properties. Roughly, a node label corresponds to a table and a node's properties correspond to its columns (as in an RDBMS). Nodes are connected to other nodes via relationships that signify a connection in the real world (for example, a Company ‘is located in’ a City, or a Contact ‘is lead for’ a Company). In Neo4j, relationships can have properties as well.

Data modelling is a significant part of any project: how you design your database will drive your data ingestion process as well as the consumption layer. Making big changes down the line is never easy and may not be possible at all, which makes this process all the more important.

The data modelling process in graph databases (and especially Neo4j here) involves deciding whether to represent a data point as a node, a relationship or a property, and often calls for clever designs that make querying easier later in the consumption layer.

RDBMS to Graph — how it translates

From Neo4j's guide “Model: Relational to Graph” (neo4j.com):

Table to Node Label — each entity table in the relational model becomes a label on nodes in the graph model.
Row to Node — each row in a relational entity table becomes a node in the graph.
Column to Node Property — columns (fields) on the relational tables become node properties in the graph.
Business primary keys only — remove technical primary keys, keep business primary keys.
Add Constraints/Indexes — add unique constraints for business primary keys, add indexes for frequent lookup attributes.
Foreign keys to Relationships — replace foreign keys to the other table with relationships, remove them afterwards.
No defaults — remove data with default values, no need to store those.
Clean up data — duplicate data in denormalized tables might have to be pulled out into separate nodes to get a cleaner model.
Index Columns to Array — indexed column names (like email1, email2, email3) might indicate an array property.
Join tables to Relationships — join tables are transformed into relationships, columns on those tables become relationship properties.
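
The first few rules above can be sketched in plain Python, without a database. This is only an illustration under assumptions: the `company` and `city` tables, their columns and the `LOCATED_IN` relationship type are hypothetical names, not taken from the Neo4j guide.

```python
# Hypothetical relational rows: a `company` table with a foreign key to `city`.
cities = [
    {"id": 1, "name": "London"},
    {"id": 2, "name": "Pune"},
]
companies = [
    {"id": 10, "name": "Acme", "city_id": 1},
    {"id": 11, "name": "Globex", "city_id": 1},
    {"id": 12, "name": "Initech", "city_id": 2},
]


def rows_to_nodes(rows, label, drop=("id",)):
    """Row -> node: the table name becomes the label, columns become properties.

    Columns listed in `drop` (technical keys, foreign keys) are not stored as
    properties. The row id is kept only as a dictionary key so that this
    in-memory sketch can wire up relationships.
    """
    return {
        row["id"]: {
            "label": label,
            "props": {k: v for k, v in row.items() if k not in drop},
        }
        for row in rows
    }


def fks_to_relationships(rows, fk_column, rel_type):
    """Foreign key -> relationship: one (start, type, end) triple per row."""
    return [(row["id"], rel_type, row[fk_column]) for row in rows]


city_nodes = rows_to_nodes(cities, "City")
company_nodes = rows_to_nodes(companies, "Company", drop=("id", "city_id"))
located_in = fks_to_relationships(companies, "city_id", "LOCATED_IN")
```

After the conversion, `company_nodes` no longer carries `city_id`; that information lives only in the `located_in` relationships.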

Designing data model as a graph

Neo4j describes the data modelling process as creating a whiteboard friendly design, which means that the nodes and relationships which we can build on the whiteboard, or as English sentences, often drive how we model our data.

Often the same data can be represented as a property, as a relationship or as a node in its own right. For example, in the case of email transaction data, we could depict the data using:

“Email sent” as a relationship between the sender and receiver

However, it becomes clear that this model, while whiteboard friendly, cannot be extended to store other information such as CC, BCC, replied_to, forwarded_to, etc. Hence we modify our model to depict Email as a separate node label:

Thus, we can see that our data drives the design process.
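
To make the difference between the two email models concrete, here is a minimal in-memory sketch. The people, the email id and the relationship types (`EMAIL_SENT`, `SENT`, `TO`, `CC`, `BCC`) are assumptions chosen for illustration.

```python
# Model 1: "email sent" is a single relationship between two people.
# Only one sender and one receiver fit, so CC or BCC cannot be expressed.
sent_rel = ("alice", "EMAIL_SENT", "bob")

# Model 2: the email is a node of its own and every role is a relationship.
email = {"label": "Email", "props": {"id": "e1", "subject": "Q3 report"}}
email_rels = [
    ("alice", "SENT", "e1"),
    ("e1", "TO", "bob"),
    ("e1", "CC", "carol"),
    ("e1", "BCC", "dave"),
]


def recipients(rels, email_id, roles=("TO", "CC", "BCC")):
    """Return everyone who received the given email, grouped by role."""
    grouped = {role: [] for role in roles}
    for start, rel_type, end in rels:
        if start == email_id and rel_type in grouped:
            grouped[rel_type].append(end)
    return grouped


by_role = recipients(email_rels, "e1")
```

Adding a new role later (say, a hypothetical `REPLY_TO` between two Email nodes) only means adding another relationship type; the existing nodes stay untouched.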

Another example involving City and Company.

We could include the City value for the Company label in the Company node itself.

To filter such data by location, a query would have to traverse all the Company nodes in the graph and then filter them by their City value. Since the number of Company nodes can potentially be huge, querying for all Companies in a particular City will take a lot of resources in terms of time and CPU.

We could identify such downstream applications and use that knowledge to drive our data modelling process. In our case, the graph would look like:

Company nodes connected to the City nodes

Now the query can be modified to filter the City first, and only look for Companies that are connected to the City node. This improves our query time significantly.

Our queries (consumption) drive the data design process.
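
The effect of the two models on the amount of work can be imitated in plain Python. This is a rough sketch, not a benchmark of Neo4j itself: the company names are made up, and it ignores indexes, which could also speed up the property-based lookup.

```python
from collections import defaultdict

# Hypothetical data: company name -> city.
company_city = {
    "Acme": "London",
    "Globex": "London",
    "Initech": "Pune",
    "Umbrella": "Paris",
    "Hooli": "Pune",
}


def companies_by_property(city):
    """Model A: city is a property, so every company has to be checked."""
    checked = 0
    hits = []
    for name, company_city_value in company_city.items():
        checked += 1
        if company_city_value == city:
            hits.append(name)
    return hits, checked


# Model B: City nodes with incoming LOCATED_IN relationships,
# imitated here with an adjacency list per city.
city_to_companies = defaultdict(list)
for name, city_name in company_city.items():
    city_to_companies[city_name].append(name)


def companies_by_traversal(city):
    """Model B: start at the City node and follow only its relationships."""
    hits = list(city_to_companies.get(city, []))
    return hits, len(hits)
```

Both functions return the matching companies plus the number of companies they had to touch. For "Pune" the property scan touches all five companies, while the traversal touches only the two that are connected to that city.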

The EXPLAIN keyword

The Cypher keyword EXPLAIN is used for query profiling. Prefixing a query with it displays the plan for how the query would be run, along with the estimated number of records at each step, without executing the query.

The plans for a query that fetches Companies located in a particular State, for example, will differ vastly depending on how the data is modelled and how it is queried. Below, the plan on the left shows what happens when we filter for a State first and then look for Companies attached to that State node. The number of records at each step is small compared to the plan on the right, which reads all the Companies and then filters them on the State value stored as a property.

Comparison between queries when filtering using an intermediate State node versus filtering by node property

EXPLAIN is a powerful tool: in the initial stages of data modelling and ingestion, it gives a fair idea of how the queries will look in the consumption step, or at least how they should look. That should drive the data modelling.

PROFILE is a similar keyword that executes the query as well as explaining its plan, so it reports the actual number of records at each step.
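
As a small sketch, the two query shapes compared above might look like the Cypher strings below, wrapped in a Python helper that adds the keyword. The labels, the `LOCATED_IN` relationship type, the property names and the `$state` parameter are assumptions, not the queries from the original screenshots.

```python
# Hypothetical labels, relationship type and property names.
VIA_STATE_NODE = (
    "MATCH (s:State {name: $state})<-[:LOCATED_IN]-(c:Company) "
    "RETURN c.name"
)
VIA_PROPERTY = (
    "MATCH (c:Company) WHERE c.state = $state "
    "RETURN c.name"
)


def with_plan(query, execute=False):
    """Prefix a Cypher query with EXPLAIN (plan only) or PROFILE (plan and run)."""
    keyword = "PROFILE" if execute else "EXPLAIN"
    return f"{keyword} {query}"


for query in (VIA_STATE_NODE, VIA_PROPERTY):
    print(with_plan(query))
```

The printed statements can be pasted into Neo4j Browser (after setting the `$state` parameter) to compare the two plans side by side.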

Index and Constraints

As with an RDBMS, Neo4j supports creating indexes on properties of labels. Simply put, a database index is a redundant copy of some of the data in the database that makes searches of related data more efficient. However, an index uses more storage and slows down writes. Deciding which properties to index is therefore part of designing the data model, and treating indexes as a magic bullet to speed up queries is not recommended.

Judicious use of indexes can improve query performance manyfold, and it should be a significant part of any data modelling discussion.

Constraints, on the other hand, are business rules that prevent duplicate values of key properties for specific labels. For example, if our sources contain multiple entries for a City named London, we don’t want to create them as separate nodes. Constraints help us maintain data integrity.
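
A sketch of the statements involved, generated from Python so the names stay in one place. The index and constraint names, labels and properties are hypothetical, and the exact Cypher syntax depends on the Neo4j version, so treat the strings as an assumption to check against the manual for your release.

```python
def index_statement(name, label, prop):
    """Cypher for a single-property index (syntax introduced in Neo4j 4.0)."""
    return f"CREATE INDEX {name} FOR (n:{label}) ON (n.{prop})"


def unique_constraint_statement(name, label, prop):
    """Cypher for a uniqueness constraint (Neo4j 4.4+ syntax).

    Earlier 4.x releases use `ON (n:Label) ASSERT n.prop IS UNIQUE` instead.
    """
    return (
        f"CREATE CONSTRAINT {name} FOR (n:{label}) "
        f"REQUIRE n.{prop} IS UNIQUE"
    )


statements = [
    unique_constraint_statement("city_name_unique", "City", "name"),
    index_statement("company_name_index", "Company", "name"),
    # MERGE matches an existing London node or creates one; the constraint
    # rejects duplicates that would be created any other way.
    "MERGE (c:City {name: 'London'})",
]
```

Running the `MERGE` statement twice leaves a single London node, which is the behaviour the London example calls for.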

Handling datetime values

More often than not, the data you ingest will contain date values. As of Neo4j 3.4, the system handles durations and dates natively. Neo4j also allows creating indexes on numeric properties and running range queries that use the index. We can take advantage of this for dates by storing them as millisecond timestamps, which lets us perform date range queries.
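
The millisecond-timestamp approach can be sketched as follows. The `created_ms` property name and the sample records are assumptions, and the range filter runs in Python here, whereas in Neo4j the same comparison would be done by an indexed range query.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(dt):
    """Convert a timezone-aware datetime to integer milliseconds since the epoch."""
    return (dt - EPOCH) // timedelta(milliseconds=1)


def in_range(records, start, end, key="created_ms"):
    """Keep records whose timestamp falls in the half-open range [start, end)."""
    low, high = to_epoch_millis(start), to_epoch_millis(end)
    return [record for record in records if low <= record[key] < high]


def utc(*args):
    """Shorthand for a UTC datetime."""
    return datetime(*args, tzinfo=timezone.utc)


# Hypothetical records carrying a numeric `created_ms` property.
records = [
    {"name": "a", "created_ms": to_epoch_millis(utc(2019, 12, 31, 23, 59))},
    {"name": "b", "created_ms": to_epoch_millis(utc(2020, 1, 15))},
    {"name": "c", "created_ms": to_epoch_millis(utc(2020, 2, 1))},
]
january = in_range(records, utc(2020, 1, 1), utc(2020, 2, 1))
```

Using a half-open range means a record stamped exactly at midnight on 1 February belongs to February, not January.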

Another creative solution is time trees. The idea is to create hierarchical year, month and day nodes that are connected from top to bottom via a child relationship and to neighbouring nodes on the same level via a next relationship. As the time tree diagram shows, date filters can then be structured to filter by year, then month, then day and so on. This significantly reduces the number of records at each step and speeds up filter queries.
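
The structure of a time tree can be sketched in memory as below. This is an illustration of the idea only, not the GraphAware plugin or the original author's script; the `CHILD` and `NEXT` relationship names and the tuple representation of nodes are assumptions.

```python
import calendar
from datetime import date


def build_time_tree(start_year, end_year):
    """Build Year -> Month -> Day nodes with CHILD and NEXT relationships.

    Nodes are tuples such as (2020,), (2020, 8) and (2020, 8, 14).
    Returns (levels, rels), where each relationship is a
    (start_node, type, end_node) triple.
    """
    levels = {"Year": [], "Month": [], "Day": []}
    rels = []
    for year in range(start_year, end_year + 1):
        levels["Year"].append((year,))
        for month in range(1, 13):
            levels["Month"].append((year, month))
            rels.append(((year,), "CHILD", (year, month)))
            days_in_month = calendar.monthrange(year, month)[1]
            for day in range(1, days_in_month + 1):
                levels["Day"].append((year, month, day))
                rels.append(((year, month), "CHILD", (year, month, day)))
    # NEXT links neighbours on the same level, across parent boundaries too.
    for nodes in levels.values():
        for current, following in zip(nodes, nodes[1:]):
            rels.append((current, "NEXT", following))
    return levels, rels


def path_to(day):
    """The root-to-leaf path a date filter walks: year, then month, then day."""
    return [(day.year,), (day.year, day.month), (day.year, day.month, day.day)]


levels, rels = build_time_tree(2020, 2021)
```

A filter for a single date follows `path_to(date(2020, 8, 14))`, which touches three nodes instead of scanning every day in the tree.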

To create the time tree graph, there is the GraphAware TimeTree product, which provides jar files that can be installed as a plugin in the Neo4j environment.

To get started quickly with a general-purpose time tree graph, find the code below. Change the range of years at the top of the code and it will create the time tree graph as depicted above.

This is the forked version that I keep on my own GitHub; the original author is Laurie Boyes, who describes it in his blog post here:

Adventures with Neo4j and Timetrees

medium.com

These are a few of the considerations to keep in mind while designing your database as a graph database in Neo4j.