XML is a highly prevalent format in open data integration, yet it seems to have fallen out of favour, and big data tools tend to support it only as an afterthought.
For instance, a number of Apache data handling tools can process XML data, provided you use their choice of schema definition. Whilst this is OK for trivial examples, real XML will typically have attributes, deeply nested elements, repeating groups, and often namespaces too, and the schema may also define optional elements.
Let’s take a quick look at the XML support available within Apache NiFi and for Apache Spark, bearing in mind the impedance mismatch between the tree and tabular data models.
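
To make that mismatch concrete, here is a minimal sketch in plain Python (standard library only, not NiFi or Spark) that flattens a small, made-up XML document into rows. The `orders` document, its namespace URI, and the `flatten` helper are all hypothetical and exist only for illustration. It shows what a tabular model has to do with attributes, nesting, a repeating group, a namespace, and an optional element.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: namespace, attributes, nesting,
# a repeating <item> group, and an optional <email> element.
XML = """\
<orders xmlns="http://example.com/orders">
  <order id="1001">
    <customer>
      <name>Ada</name>
    </customer>
    <item sku="A1"><qty>2</qty></item>
    <item sku="B7"><qty>1</qty></item>
  </order>
  <order id="1002">
    <customer>
      <name>Grace</name>
      <email>grace@example.com</email>
    </customer>
    <item sku="C3"><qty>5</qty></item>
  </order>
</orders>
"""

# Prefix-to-URI mapping used in the path expressions below.
NS = {"o": "http://example.com/orders"}


def flatten(xml_text):
    """Return one dict per <item>, repeating the parent order's fields."""
    root = ET.fromstring(xml_text)
    rows = []
    for order in root.findall("o:order", NS):
        # Nested elements become flat columns.
        name = order.findtext("o:customer/o:name", namespaces=NS)
        # Optional element: findtext returns None when it is absent.
        email = order.findtext("o:customer/o:email", namespaces=NS)
        # Repeating group: each <item> yields its own row.
        for item in order.findall("o:item", NS):
            rows.append({
                "order_id": order.get("id"),   # attribute on <order>
                "customer_name": name,
                "customer_email": email,
                "sku": item.get("sku"),        # attribute on <item>
                "qty": int(item.findtext("o:qty", namespaces=NS)),
            })
    return rows


rows = flatten(XML)
for row in rows:
    print(row)
```

Two orders in the tree become three rows in the table: the order and customer fields are duplicated for every item, and the missing email has to be represented as a null. Those are exactly the decisions a tool has to make for you when it maps XML onto records or a DataFrame.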