Grant Ingersoll started the conference with a keynote on search + big data: “it’s still all about the user”. It was a refreshing talk on keeping the end users’ needs, both internal and external, in mind rather than fixating on specific technologies or technical bandwagons.
Up front there were some interesting quotes on data being digital air – a source of sustenance and pollution at the same time. There was also the point that Bitly aim to reduce data so that they can analyse it, which means they have to capture a lot of data in the first place.
Evolution of search
Grant then dived into the evolution of search:
- content relationships
- user interaction – clicks, social graph etc.
These were presented as overlapping sets, with the intersection of the sets being the current cutting edge.
Search – Discovery – Analytics cycle
Some people get that by analysing and understanding the behaviour of their users they can create a more compelling offering. However, whilst this can be achieved by combining different open source projects, it needs to be more integrated and easier to consume at the open source level.
When it’s done right it brings great benefit to users and the business.
- scalable search with near real time
- large scale cost effective storage
- distributed processing
- machine learning
Of course, a lot of this can be done with the usual suspects in the Lucene/Hadoop ecosystem; however, there are some other projects (see below) that you may want to check out.
Grant likened this to the piece missing from most puzzles, and noted that some use cases are blazing a trail into new territory.
The goals and tooling were presented in two slices: online & offline analytics.
For end users
Offline (batch) analytics (Mahout/Hadoop)
- link analysis / search trails
What to analyse: trends/stats; social/personal; location
For internal users
Offline (Hadoop, Pig, Hive, Lucid Works Enterprise)
What to analyse: Trends; Operational alerts (e.g. queries per second)
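As a rough illustration of the operational alert mentioned above, queries per second can be derived by bucketing query timestamps into one-second windows and flagging any window that exceeds a limit. This is only a minimal sketch: the function names, the sample timestamps and the threshold are hypothetical and were not part of the talk.

```python
from collections import Counter
from datetime import datetime


def queries_per_second(timestamps):
    """Count how many queries fall into each one-second bucket."""
    return Counter(ts.replace(microsecond=0) for ts in timestamps)


def qps_alerts(timestamps, threshold):
    """Return (second, count) pairs whose count exceeds the threshold, in time order."""
    counts = queries_per_second(timestamps)
    return sorted((sec, n) for sec, n in counts.items() if n > threshold)


# Hypothetical sample data: three queries in one second, one in the next.
sample = [
    datetime(2000, 1, 1, 12, 0, 0, 100000),
    datetime(2000, 1, 1, 12, 0, 0, 500000),
    datetime(2000, 1, 1, 12, 0, 0, 900000),
    datetime(2000, 1, 1, 12, 0, 1, 200000),
]
alerts = qps_alerts(sample, threshold=2)
```

In practice the timestamps would come from the search engine's query logs, and the threshold would be tuned against normal traffic.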
Making it accessible
“It’s not all sunshine and rainbows” as there is some glue missing in the ecosystem.
This is where you/we (the community) can help out – Grant gave a call to action for contributions of glue code for:
- Lucene index -> Pig/Others
- Mahout -> Pig/Others
- Mahout -> Lucene/Solr
- Logs -> Pig/Others
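To give a feel for what the simplest of these glue pieces might look like, here is a small sketch for the logs-to-Pig case. It rewrites a query log line as tab-separated fields, which is the layout Pig's default loader reads. The log format (timestamp, user, hit count, then the free-text query, separated by spaces) and the function name are assumptions made purely for illustration; real logs will differ.

```python
def log_line_to_tsv(line):
    """Convert one log line in the assumed format to a tab-separated record.

    Assumed (hypothetical) input format:
        <timestamp> <user> <hits> <query text, may contain spaces>
    """
    timestamp, user, hits, query = line.rstrip("\n").split(" ", 3)
    # Tabs inside the query would break the column layout, so flatten them.
    query = query.replace("\t", " ")
    return "\t".join([timestamp, user, hits, query])


# Hypothetical example line.
record = log_line_to_tsv("2000-01-01T12:00:00 u1 17 lucene near real time\n")
```

Keeping the free-text query as the last field means it can contain spaces without any quoting rules.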
It was also said that it would be nice to have more in-index functionality of the kind found in relational databases: aggregation, arbitrary stats and complex joins.
The core message was summarised by a sports quote from a basketball coach:
You can have the data, but you need to communicate it to people so that they can utilise it.
The key takeaway was a reminder to know your data, “get the data in your mind” – so you can instinctively aid the end users.