A call for Lightning Talks was sent out to the Neo4j London User Group, so I put in a few ideas and my Open Gov Data proposal was selected above Authentication (and adding extra users) for the March 2017 meetup.
The angle of my talk was around the suitability of Neo4j for rapidly exploring data sets.
If you’re just after the slides, they are available here: Exploring Open Government Data using Neo4j
Query profiling & tuning
In the interests of time I decided to update some of my previous work that used a GraphGist with a cut-down dataset. This time I decided to use the full version of the latest (2015) UK road safety dataset for makes & models, which is licensed under the Open Government Licence v3.0. Out of curiosity I wanted to see how the LOAD CSV query from my GraphGist performed with ~142k rows. The answer wasn’t pretty, as it took around an hour on my laptop.
Profiling revealed the absence of indices on the ‘lookup’ nodes was causing a variety of ‘NodeByLabelScan’ operations as can be seen in figure 1.
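As a rough sketch of what that looks like in practice (not the exact statements from my GraphGist), the snippet below adds an index on a lookup label and then profiles a small slice of the load. The label Make, the property name, the CSV column make and the file name makes_models.csv are all placeholders I have made up for illustration, and the index syntax is the Neo4j 3.x form, which changed in later versions.

```cypher
// Hypothetical label/property: index the lookup nodes so MERGE can seek instead of scan.
// Neo4j 3.x syntax; newer versions use CREATE INDEX FOR (m:Make) ON (m.name).
CREATE INDEX ON :Make(name);

// Profile a small sample of the load before running it in full.
// The file name and the 'make' column are assumptions, not the real dataset headers.
PROFILE
LOAD CSV WITH HEADERS FROM 'file:///makes_models.csv' AS row
WITH row LIMIT 1000
MERGE (m:Make {name: row.make});
```

With the index in place, I would expect the profiled plan to show an index seek on the lookup label rather than a NodeByLabelScan, though the exact operator names depend on the Neo4j version.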
The developer guide section on importing CSV files has some useful pointers (as well as links to great resources by Michael Hunger & Mark Needham). The most relevant one in this case was to “Avoid merging nodes and relationships in the same query”, and figure 1 contains both “MergeCreateNode” and “MergeCreateRelationship”. Consequently I made a slight simplification to the graph model by retaining the car model as a property on the Vehicle node instead of using a separate Model node. This removed a relationship merge and meant I could run the data load during the live demo. Plus, the Model nodes could always be reconstituted by a separate query.
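To illustrate the idea of separating node and relationship creation, here is a minimal sketch under assumed names. Only the Vehicle and Model labels come from this post; the Make label, the MADE_BY relationship type, the CSV columns make and model, and the file name are hypothetical. USING PERIODIC COMMIT is the Neo4j 3.x way of batching a large load and has since been replaced in newer versions.

```cypher
// Pass 1: create the lookup nodes only (hypothetical Make label and 'make' column).
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///makes_models.csv' AS row
MERGE (:Make {name: row.make});

// Pass 2: create vehicles, keeping the model as a property rather than a node,
// and link each one to a make that already exists (MATCH, so no node MERGE here).
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///makes_models.csv' AS row
MATCH (m:Make {name: row.make})
CREATE (v:Vehicle {model: row.model})
CREATE (v)-[:MADE_BY]->(m);

// Optional later step: rebuild Model nodes from the property on Vehicle.
MATCH (v:Vehicle)
WHERE v.model IS NOT NULL
WITH DISTINCT v.model AS name
MERGE (:Model {name: name});
```

Each statement does one kind of work, so the planner never has to merge a node and a relationship in the same query, and the lookups in the second pass can use an index on the lookup label if one exists.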
Rather than leaving dead time, I talked through what the LOAD CSV query was doing and let it run in the background… On the night it took 38 seconds, which is almost 2 orders of magnitude faster than the unoptimised query. As I put down on slide 15 – profile your queries!