Lean Java Engineering

XML – second class citizenship

Posted on November 20, 2019

When integrating open data, XML is a highly prevalent format, yet it seems that it has fallen out of favour and big data tools only support it as an afterthought.

For instance, a number of Apache data handling tools can process XML data, provided you use their choice of schema definition. Whilst this is OK for trivial examples, real XML will typically have attributes, deeply nested elements, repeating groups and often namespaces too, and the schema may also define optional elements.

Let’s take a quick look at the XML support available within Apache NiFi and for Apache Spark, bearing in mind the impedance mismatch between the tree and tabular data models.
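
To make the tree-versus-table mismatch concrete, here is a minimal sketch using only Python's standard library (not NiFi or Spark). The sample document, its namespace URI and the element names are all made up for illustration; it simply shows how attributes, a repeating group, a namespace and an optional element end up as flat rows.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a namespace, attributes, a repeating group (<item>)
# and an optional element (<note>) that only some items carry.
SAMPLE = """\
<o:order xmlns:o="urn:example:orders" id="42">
  <o:customer>Ada</o:customer>
  <o:item sku="A1"><o:qty>2</o:qty></o:item>
  <o:item sku="B7"><o:qty>1</o:qty><o:note>gift</o:note></o:item>
</o:order>
"""

# Prefix-to-URI mapping used when searching namespaced elements.
NS = {"o": "urn:example:orders"}


def flatten_order(xml_text):
    """Return one flat row (a dict) per repeating <item> element."""
    order = ET.fromstring(xml_text)
    rows = []
    for item in order.findall("o:item", NS):
        rows.append({
            # Parent-level values are repeated on every row.
            "order_id": order.get("id"),
            "customer": order.findtext("o:customer", namespaces=NS),
            # Attribute on the repeating element.
            "sku": item.get("sku"),
            "qty": int(item.findtext("o:qty", namespaces=NS)),
            # Optional element: None when it is absent.
            "note": item.findtext("o:note", namespaces=NS),
        })
    return rows


rows = flatten_order(SAMPLE)
for row in rows:
    print(row)
```

Note how the single order becomes two rows, with the order-level values duplicated and the optional element surfacing as a null; that duplication and nullability is the price of forcing a tree into a table.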

Posted in Analytics, NoSQL

Tagged big data, NiFi, spark, xml

Quick tip: Using Git with NiFi Registry in Docker

Posted on July 26, 2018 | 2 comments

Apache NiFi is a great tool for handling data flows; however, the flow development lifecycle has been slightly challenging.

The recent release of NiFi Registry, a sub-project to provide shared resources across instances of NiFi, initially provides the capability to manage versioned flows. As of version 0.2.0, NiFi Registry added support for persisting flow snapshots to Git, making it very compelling!

In this post, we’ll see how to set this up for use when developing NiFi flows in a dockerised environment.

Posted in How-to

Tagged Docker, Git, NiFi

Open Gov Data talk at Neo4j London User Group

Posted on March 31, 2017

A call for Lightning Talks was sent out to the Neo4j London User Group, so I put in a few ideas and my Open Gov Data proposal was selected above Authentication (and adding extra users) for the March 2017 meetup.

Posted in How-to, Talk, Uncategorized

Tagged graph database, neo4j, open data

Emerging trends for 2017

Posted on January 3, 2017

In this short post I present my thoughts on trends that are likely to be important considerations for enterprises in 2017. A few years ago everyone was talking about ‘SMAC’, which stood for Social, Mobile, Analytics and Cloud. So, not to be outdone, I’ve punnily organised the trends as VICTIM:

Posted in Notes

Tagged IoT, mahout, neo4j, spark, VR

Neo4j 2.2 Authentication and adding extra users

Posted on April 16, 2015 | 9 comments

Token-based authentication is new in Neo4j 2.2, but how does it work?
The first thing to know is that it is enabled by default in conf/neo4j-server.properties by:

# Require (or disable the requirement of) auth to access Neo4j
dbms.security.auth_enabled=true

Posted in How-to

Tagged neo4j, security

Attribute-Based Access Control with a graph database

Posted on April 13, 2015 | 2 comments

Traditional access control relies on the identity of a user, their role or their group memberships. This can become awkward to manage, particularly when other factors such as time of day or network location come into play. These additional factors, or attributes, require a different approach: the US National Institute of Standards and Technology (NIST) have published a Special Publication (NIST SP 800-162) on Attribute-Based Access Control (ABAC).

This post, and the accompanying Graph Gist, explore the suitability of using a graph database to support policy decisions.
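
As a rough illustration of what an attribute-based decision looks like, here is a minimal plain-Python sketch. It is not the graph model explored in the post or the Graph Gist; the role name, the office network range and the working hours are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from datetime import time
from ipaddress import ip_address, ip_network


@dataclass
class AccessRequest:
    """Attributes presented with a request (all illustrative)."""
    role: str
    request_time: time
    source_ip: str


# Hypothetical policy values, assumed purely for this sketch.
REQUIRED_ROLE = "analyst"
OFFICE_NETWORK = ip_network("10.0.0.0/8")
WORK_START, WORK_END = time(8, 0), time(18, 0)


def is_permitted(req):
    """Permit only when the role, time of day and network location all match."""
    has_role = req.role == REQUIRED_ROLE
    in_hours = WORK_START <= req.request_time <= WORK_END
    on_network = ip_address(req.source_ip) in OFFICE_NETWORK
    return has_role and in_hours and on_network


print(is_permitted(AccessRequest("analyst", time(9, 30), "10.1.2.3")))
print(is_permitted(AccessRequest("analyst", time(23, 0), "10.1.2.3")))
```

Hard-coding rules like this gets awkward quickly as attributes multiply, which is the motivation for asking whether a graph of subjects, resources and attributes can answer the same question with a query.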

Posted in NoSQL

Tagged graph database, neo4j, security

Exploring a UK Open Government Dataset with Neo4j

Posted on April 10, 2015 | 1 comment

In my first job I was working for a company that developed a management information system for UK Police Forces; this system produced the statutory HMIC (Her Majesty’s Inspectorate of Constabulary) reports and allowed OLAP exploration of the datasets loaded into cubes from the data warehouse tables.

One of the areas that I implemented was the key performance indicators for Road Traffic Collisions, so I was intrigued to discover that the fuller, anonymised STATS19 dataset was now available on data.gov.uk. If you’re interested in the STATS19 form you can see it here.

Posted in Analytics, How-to, NoSQL

Tagged big data, graph database, neo4j, open data

The path to Alfresco enlightenment

Posted on March 13, 2015

I started working with Alfresco back in 2005 and the code base was a lot smaller back then! More recently I’ve seen people try to dive into WebScript development without a concrete understanding of the foundational elements of the API. When I was set the task of organising an internal ‘hackathon’ as part of a ‘company day’ I decided that the goal should be to create a hands-on code-based tutorial.

Posted in Alfresco, Testing

Android Email Extraction to .eml

Posted on June 10, 2014 | 3 comments

Sometimes the Android ecosystem is a little lacking with tool support; for instance I needed to extract a set of sent items from a POP3 mailbox – the stock mail client only allows you to perform 3 actions: delete, mark as unread or favourite.

Armed with the Android SDK, some SQL queries and a Groovy script we’ll see how it’s possible to recover email to RFC822 .eml files.
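
The final step, turning a database row into an RFC822 message, can be sketched as below. This is Python rather than the Groovy script the post uses, and the row layout (the `sender`, `recipients`, `subject`, `body` and `sent_epoch_ms` keys) is a hypothetical stand-in, not the actual schema of the Android mail database.

```python
from email.message import EmailMessage
from email.utils import formatdate


def row_to_eml(row):
    """Build the bytes of an .eml file from one (hypothetical) message row."""
    msg = EmailMessage()
    msg["From"] = row["sender"]
    msg["To"] = row["recipients"]
    msg["Subject"] = row["subject"]
    # Assumes the timestamp is stored as milliseconds since the epoch.
    msg["Date"] = formatdate(row["sent_epoch_ms"] / 1000)
    msg.set_content(row["body"])
    return msg.as_bytes()


# Illustrative row; real values would come from the SQL query results.
EXAMPLE_ROW = {
    "sender": "me@example.com",
    "recipients": "you@example.com",
    "subject": "Sent item",
    "body": "Hello",
    "sent_epoch_ms": 1402387200000,
}

eml_bytes = row_to_eml(EXAMPLE_ROW)
print(eml_bytes.decode("ascii"))
```

Writing `eml_bytes` to a file with an `.eml` extension should give something a desktop mail client can open.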

Posted in Scripts

Tagged Android, Email, groovy

Breaking the monolith

Posted on April 1, 2014

I’m lifting the lid on my latest pet project which is set to revolutionise the ECM world. The codename is mu-fresco as it puts Alfresco into a Hadron collider with microservices.

This came about as I didn’t have access to 448 cores of JVM Azul goodness. The pretotype used 20 over-clocked Raspberry Pi units and an old HP Superdome picked up from eBay (I needed something beefy for the database and it was cheaper than RDS). After a bit of light surgery with a sharp scalpel, aka Spring Remoting with Hessian, it was time to awaken Frankenstein’s monster. The results are very promising though there are still a few kinks to be ironed out, with the Solr 1.4 index being one and the shared database schema between microservices being a glaring architectural impurity.

As for the next iteration, well “I’ve decided to take my work back underground to stop it falling into the wrong hands”.

Posted in Alfresco, Deployment, Lean