Lucene EuroCon Day 1 – Opening keynote speech

Grant Ingersoll started the conference with a keynote on search + big data: “it’s still all about the user”. A refreshing talk on keeping the needs of end users, both internal and external, in mind rather than fixating on specific technologies or technical bandwagons.

Up front there were some interesting quotes on data being digital air: a source of sustenance and pollution at the same time. Another was that Bitly aim to reduce data so they can analyse it, which means they have to capture a lot of data in the first place.

Evolution of search

Grant then dived into the evolution of search:

  1. documents
  2. queries
  3. content relationships
  4. user interaction – clicks, social graph etc.

These were presented as overlapping sets, with the intersection of the sets being the current cutting edge.

Search – Discovery – Analytics cycle

Some people get that by analysing and understanding the behaviour of their users they can create a more compelling offering. However, whilst this can be achieved by combining different open source projects, the pieces need to be more integrated and easier to consume at the open source level.

When it’s done right it brings great benefit to users and the business.

Needs

  • scalable search with near real time
  • large scale cost effective storage
  • distributed processing
  • machine learning

Of course a lot of this can be done with the usual suspects in the Lucene/Hadoop ecosystem; however, there are some other projects (see below) that you may want to check out.

Analytics

Grant likened this to the missing piece in most puzzles, and noted that some use cases are blazing a trail into new territory.
The goals and tooling were presented in two slices: online & offline analytics.

For end users

Offline (batch) analytics (Mahout/Hadoop)

  • link analysis / search trails
  • recommendations

Online

Projects to look at here include Storm (from Twitter) and S4.

What to analyse: trends/stats; social/personal; location

For internal users

Offline (Hadoop, Pig, Hive, LucidWorks Enterprise)

What to analyse: Top X results; zero results; MRR, MAP; user segmentation; location, conversions; ad hoc analysis
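
MRR (mean reciprocal rank) and MAP (mean average precision) are standard relevance metrics, and both are simple to compute offline once you have ranked results and relevance judgements per query. The following is a minimal Python sketch, not something shown in the talk: the function names and the in-memory dictionaries are assumptions for illustration, and in practice the inputs would come from query logs and click or editorial judgements.

```python
from typing import Callable, Dict, List, Set


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """Return 1/rank of the first relevant document, or 0.0 if none is found."""
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average the precision at each rank where a relevant document appears."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / position
    # Dividing by the number of relevant documents penalises missed ones.
    return precision_sum / len(relevant)


def mean_over_queries(
    metric: Callable[[List[str], Set[str]], float],
    results: Dict[str, List[str]],
    judgements: Dict[str, Set[str]],
) -> float:
    """Average a per-query metric over all queries (gives MRR or MAP)."""
    if not results:
        return 0.0
    total = sum(
        metric(ranked, judgements.get(query, set()))
        for query, ranked in results.items()
    )
    return total / len(results)


# Hypothetical data: ranked document ids per query, and the relevant ids.
example_results = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}
example_judgements = {"q1": {"b"}, "q2": {"x", "z"}}

mrr = mean_over_queries(reciprocal_rank, example_results, example_judgements)
mean_ap = mean_over_queries(average_precision, example_results, example_judgements)
```

A query with no relevant document in its result list contributes zero to both metrics, so the same pass over the logs also surfaces the worst-performing queries.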

Online

What to analyse: Trends; Operational alerts (e.g. queries per second)
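
As a rough illustration of an operational alert on queries per second, here is a minimal single-process Python sketch using a sliding time window. It is an assumption-laden stand-in, not how Storm or S4 work: the `QpsMonitor` class, its parameters and the threshold values are all hypothetical, and a real deployment would run this kind of logic inside a stream processing system.

```python
from collections import deque
from typing import Deque


class QpsMonitor:
    """Track query timestamps in a sliding window and flag high query rates."""

    def __init__(self, window_seconds: float, max_qps: float) -> None:
        self.window_seconds = window_seconds
        self.max_qps = max_qps
        self._timestamps: Deque[float] = deque()

    def record(self, timestamp: float) -> bool:
        """Record one query; return True if the windowed rate exceeds max_qps."""
        self._timestamps.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop queries that have fallen out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        rate = len(self._timestamps) / self.window_seconds
        return rate > self.max_qps


# Hypothetical usage: alert when more than 3 queries arrive within 1 second.
monitor = QpsMonitor(window_seconds=1.0, max_qps=3.0)
alerts = [monitor.record(t) for t in (0.0, 0.1, 0.2, 0.3)]
```

Timestamps are passed in explicitly rather than read from the clock, which keeps the sketch deterministic and lets it be replayed over historical logs.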

Tools:

Making it accessible

“It’s not all sunshine and rainbows”, as there is some glue missing in the ecosystem.

This is where you/we (the community) can help out. Grant gave a call to action for contributions of glue code for:

  • Lucene index -> Pig/Others
  • Mahout -> Pig/Others
  • Mahout -> Lucene/Solr
  • Logs -> Pig/Others

It was also said that it would be nice to have more in-index functionality of the kind found in relational databases: aggregation, arbitrary stats and complex joins.

Summary

The core message was summarised by a sports quote from a basketball coach:
You can have the data, but you need to communicate it to people so that they can utilise it.

The key takeaway was a reminder to know your data, “get the data in your mind” – so you can instinctively aid the end users.
