XML – second class citizenship

When integrating open data, XML is a highly prevalent format, yet it seems that it has fallen out of favour and big data tools only support it as an afterthought.

For instance, a number of Apache data handling tools can process XML data, provided you use their choice of schema definition. Whilst this is OK for trivial examples, real XML will typically have attributes, deeply nested elements, repeating groups, and often namespaces too. The schema may also define optional elements.

Let’s take a quick look at the XML support available within Apache NiFi and for Apache Spark, bearing in mind the impedance mismatch between the tree and tabular data models.
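
To make the tree versus tabular mismatch concrete, here is a minimal sketch in plain Python (not NiFi or Spark). The sample document, its namespace URI and the column names are all made up for illustration. It shows how a repeating group has to be exploded into one row per item, with parent attributes copied down and an optional element becoming a null.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: attributes, a namespace, a repeating group (<item>)
# and an optional element (<customer> is missing from the second order).
SAMPLE = """\
<orders xmlns="http://example.org/orders">
  <order id="1">
    <customer>Ann</customer>
    <item sku="A1">2</item>
    <item sku="B2">1</item>
  </order>
  <order id="2">
    <item sku="C3">5</item>
  </order>
</orders>
"""

# Prefix-to-URI map used to resolve the namespaced element names.
NS = {"o": "http://example.org/orders"}


def flatten_orders(xml_text):
    """Return one row per repeating <item>, copying parent fields onto each row."""
    root = ET.fromstring(xml_text)
    rows = []
    for order in root.findall("o:order", NS):
        # Optional element: findtext returns None when it is absent.
        customer = order.findtext("o:customer", default=None, namespaces=NS)
        for item in order.findall("o:item", NS):
            rows.append({
                "order_id": order.get("id"),
                "customer": customer,
                "sku": item.get("sku"),
                "quantity": int(item.text),
            })
    return rows


rows = flatten_orders(SAMPLE)
for row in rows:
    print(row)
```

Even in this small case the parent data is duplicated on every row, which is the price of forcing a tree into a table.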

Quick tip: Using Git with NiFi Registry in Docker

Apache NiFi is a great tool for handling data flows; however, the flow development lifecycle has been slightly challenging.

The recent release of NiFi Registry, a sub-project to provide shared resources across instances of NiFi, initially provides the capability to manage versioned flows. As of version 0.2.0, NiFi Registry added support for persisting flow snapshots to Git, making it very compelling!

In this post, we’ll see how to set this up for use when developing NiFi flows in a dockerised environment.

Open Gov Data talk at Neo4j London User Group

A call for Lightning Talks was sent out to the Neo4j London User Group, so I put in a few ideas and my Open Gov Data proposal was selected above Authentication (and adding extra users) for the March 2017 meetup.

Emerging trends for 2017

In this short post I present my thoughts on trends that are likely to be important considerations for enterprises in 2017. A few years ago everyone was talking about ‘SMAC’, which stood for Social, Mobile, Analytics and Cloud. So not to be outdone, I’ve punnily organised the trends as VICTIM:

Neo4j 2.2 Authentication and adding extra users

Token-based authentication is new in Neo4j 2.2, but how does it work?
The first thing to know is that it is enabled by default in conf/neo4j-server.properties by the following setting:

# Require (or disable the requirement of) auth to access Neo4j
dbms.security.auth_enabled=true

Attribute-Based Access Control with a graph database

Traditional access control relies on the identity of a user, their role or their group memberships. This can become awkward to manage, particularly when other factors such as time of day or network location come into play. These additional factors, or attributes, require a different approach. The US National Institute of Standards and Technology (NIST) has published a draft Special Publication (NIST SP 800-162) on Attribute-Based Access Control (ABAC).

This post, and the accompanying Graph Gist, explore the suitability of using a graph database to support policy decisions.
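
As a rough illustration of what an attribute-based decision looks like, here is a minimal sketch in plain Python. It is not the graph-based approach from the Graph Gist, and the policy itself (role, network and working hours) as well as the names AccessRequest and is_permitted are assumptions made up for this example.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class AccessRequest:
    """Attributes evaluated for a single request (all hypothetical)."""
    role: str       # subject attribute
    network: str    # environment attribute: where the request comes from
    at: time        # environment attribute: time of day


def is_permitted(request):
    """Permit only analysts on the office network during working hours."""
    within_hours = time(9, 0) <= request.at <= time(17, 30)
    return (
        request.role == "analyst"
        and request.network == "office"
        and within_hours
    )


print(is_permitted(AccessRequest("analyst", "office", time(10, 0))))
print(is_permitted(AccessRequest("analyst", "office", time(22, 0))))
```

The decision depends on the combination of attributes rather than on identity alone, which is what makes hard-coded rules like this awkward to manage as policies grow.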

Exploring a UK Open Government Dataset with Neo4j

In my first job I was working for a company that developed a management information system for UK Police Forces; this system produced the statutory HMIC (Her Majesty’s Inspectorate of Constabulary) reports and allowed OLAP exploration of the datasets loaded into cubes from the data warehouse tables.

One of the areas that I implemented was the key performance indicators for Road Traffic Collisions, so I was intrigued to discover that the fuller, anonymised STATS19 dataset was now available on data.gov.uk. If you’re interested in the STATS19 form you can see it here.

The path to Alfresco enlightenment

I started working with Alfresco back in 2005 and the code base was a lot smaller back then! More recently I’ve seen people try to dive into WebScript development without a concrete understanding of the foundational elements of the API. When I was set the task of organising an internal ‘hackathon’ as part of a ‘company day’ I decided that the goal should be to create a hands-on code-based tutorial.

Android Email Extraction to .eml

Sometimes the Android ecosystem is a little lacking in tool support; for instance, I needed to extract a set of sent items from a POP3 mailbox, but the stock mail client only allows you to perform three actions: delete, mark as unread or favourite.

Armed with the Android SDK, some SQL queries and a Groovy script we’ll see how it’s possible to recover email to RFC822 .eml files.
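
The final step, turning recovered fields into an RFC822 message, can be sketched as follows. This uses Python's standard email package rather than the Groovy script from the post, and the field names (sender, recipient, subject, body, sent_at) are hypothetical stand-ins for whatever the SQL queries return.

```python
from datetime import datetime, timezone
from email.message import EmailMessage
from email.utils import format_datetime


def to_eml(sender, recipient, subject, body, sent_at):
    """Build an RFC822-style message string from already-extracted fields."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg["Date"] = format_datetime(sent_at)
    msg.set_content(body)
    return msg.as_string()


# Hypothetical row, standing in for one record recovered from the mail database.
eml_text = to_eml(
    "me@example.org",
    "you@example.org",
    "Test message",
    "Hello from a recovered sent item.",
    datetime(2015, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
)
print(eml_text)
```

Writing eml_text to a file with a .eml extension should give something most desktop mail clients can open.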

Breaking the monolith

I’m lifting the lid on my latest pet project which is set to revolutionise the ECM world. The codename is mu-fresco as it puts Alfresco into a Hadron collider with microservices.

This came about as I didn’t have access to 448 cores of JVM Azul goodness. The pretotype used 20 over-clocked Raspberry Pi units and an old HP Superdome picked up from eBay (I needed something beefy for the database and it was cheaper than RDS). After a bit of light surgery with a sharp scalpel, aka Spring Remoting with Hessian, it was time to awaken Frankenstein’s monster. The results are very promising though there are still a few kinks to be ironed out, with the Solr 1.4 index being one and the shared database schema between microservices being a glaring architectural impurity.

As for the next iteration, well “I’ve decided to take my work back underground to stop it falling into the wrong hands”.