This post introduces a few basic elements of the Solr Search Platform and highlights how faceting can ‘bucket’ results into counts.

What are we doing?

At Findmypast we use Solr as a search index for our historical records data. Censuses, births, marriages, deaths and a whole host of other data are searchable via a single Solr cloud.

Our 2.7 billion historical records (that’s all the records excluding the newspapers from the British Newspaper Archive which are indexed in Solr separately) are easily queryable in Solr and we can aggregate result counts in a highly flexible way using Solr’s faceting features.

Getting started

If you’d like to play along at home with this blog, there’s a GitHub repo for those who have access to Findmypast’s account and a zip download for those who don’t.

You’ll need Docker to run the examples, but I’ve included all the relevant output in the blog to save you the trouble.

If you’re running the examples on Unix or Mac, after cloning or downloading you should be able to just launch the ./run.sh file and everything will take care of itself. On Windows, you’ll need to use run.cmd in place of run.sh.

Two minute intro to Solr

I can’t do justice to everything Solr can offer, but here’s a taster of the major features:

  • Full-text search:
    • We’re not just talking about doing searches that start with or contain some text; searches can make use of fuzziness, synonyms and proximity for example.
  • NoSQL features:
    • You don’t need to make all your data conform to the same structure in Solr.
  • Highly scalable:
    • You can spread your search index out over multiple nodes and machines.
  • Search faceting:
    • This is what the bulk of this post is about!

Bootstrapping Solr

Solr comes with a bunch of built-in defaults that I’ve made use of in this blog. Because Solr has NoSQL features, each document can store whatever combination of fields it actually has, and only those fields. Each document in Solr can look very different to any other document.

For example, DocA might have a Forename field and a YearOfBirth field, and DocB might have a Surname field and a YearOfDeath field. Both can be stored in the same index without the annoyance of null columns that a relational database would need.

Even better, if you post data to Solr using a prescribed format for field naming, it’ll work out what type of field you mean and how best to store it.

For example, send a field called gender_s and the _s suffix tells Solr this field is a string that it should index and store.

_i says the field is an integer, _f is a float and so on.

_ss means the field is a string, but that it can contain multiple values. (It’s the plural of _s)

_is - multiple integers, _fs - multiple floats and so on.
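
To make the naming convention concrete, here is a small Python sketch that reads the suffix of a field name and reports what it implies. It only covers the suffixes mentioned above, and the describe_field helper is a hypothetical name for illustration; it is not part of Solr, which does this matching itself through its dynamic field rules.

```python
# Suffixes described in this post, mapped to (type, multi-valued).
# Illustrative only: Solr resolves these itself via dynamic field rules.
SUFFIX_TYPES = {
    "s": ("string", False),
    "i": ("integer", False),
    "f": ("float", False),
    "ss": ("string", True),
    "is": ("integer", True),
    "fs": ("float", True),
}


def describe_field(name):
    """Return (type, multi_valued) for a field name, or None if no known suffix."""
    base, sep, suffix = name.rpartition("_")
    if not sep or not base:
        # No underscore, or nothing before it: no suffix to interpret.
        return None
    return SUFFIX_TYPES.get(suffix)


for field in ["gender_s", "birth_year_i", "county_ss", "Forename"]:
    print(field, "->", describe_field(field))
```

A name without a recognised suffix, such as Forename, returns None here, which mirrors the point that the convention only helps when you follow the prescribed naming format.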

In production, this isn’t advisable as it’s best to tell Solr what sort of fields to expect and how to store and index them. But for getting started, it’s ideal.

In this demo, I’ve used a few million rows from the 1881 census of England, Wales and Scotland and imported them into Solr using the default schema Solr ships with.

If you run the accompanying demo file, it bootstraps Solr using the out-of-the-box config, spends a few minutes formatting the 1881 data dump for import and then posts the data to Solr. Just over 4 million records import in around 4 minutes on my laptop. That highlights just how central speed is to Solr: a million rows a minute in a Docker VM is pretty quick!

Querying Solr

At Findmypast, querying Solr is done over HTTP and the output is JSON, making it a breeze to integrate.

At its most basic, a query to Solr using just a URL might pass just a few parameters:

…q=country_s:scotland&rows=10

The ‘q’ parameter is our query that asks for all data where the field country_s is scotland. The ‘rows’ parameter limits our results to the first 10 matches.

The ‘q’ parameter can make use of boolean logic (amongst other features) so a query might look like this for records marked with the country ‘Scotland’ and the gender ‘Male’:

http://0.0.0.0:8983/solr/recordsets/select?q=country_s:scotland+AND+gender_s:m&rows=10
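
If you would rather build that URL from code than type it by hand, a minimal Python sketch might look like this. It assumes the demo's host, port and recordsets core from the URL above; build_query_url is a hypothetical helper name. The standard library takes care of encoding the colons and spaces.

```python
from urllib.parse import urlencode

# Base URL of the demo core used in this post.
SOLR_SELECT = "http://0.0.0.0:8983/solr/recordsets/select"


def build_query_url(query, rows=10):
    """Build a select URL with the 'q' and 'rows' parameters URL-encoded."""
    params = {"q": query, "rows": rows}
    return SOLR_SELECT + "?" + urlencode(params)


url = build_query_url("country_s:scotland AND gender_s:m")
print(url)
```

The encoded form (%3A for the colons, + for the spaces) is equivalent to the hand-written URL above.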

Results come back in JSON format. In a Chrome browser, the JSON Formatter plugin makes the responses easy to read.

There is a MASSIVE amount more you can do with querying, but I want to talk a bit about faceting the results.

None of what I have to say here is revolutionary or new, but it’s a quick taster of Solr’s faceting all in one place.

Solr Facets

Facets aggregate or ‘bucket’ your results into counts. You get a bunch of results and instead of doing work in an application to group them by fields or formulas, you get Solr to do it.

There are a few different types of facets that all provide a different shape of result. For each type, I’ve included an example that shapes the included 1881 census dataset, but I’ve stopped it delivering the base results by supplying the parameter rows=0.

What we’re interested in here is the faceting of results rather than how we might construct a complicated query for results.

Simple Terms Facet Example

Our first example simply groups the results by the birth year of the records.

http://0.0.0.0:8983/solr/recordsets/select?
  q=*:*&
  rows=0&
  facet=on&
  facet.mincount=1&
  facet.field=birth_year_i

The first two parameters say we want to query all data, but not return any documents from the actual data.

facet=on is required to switch on faceting

facet.mincount=1 says we only want facet values that match at least one record in our query results.

facet.field=birth_year_i is the main bit where we say we want counts from the result split by the ‘birth_year_i’ field.

So this query is asking: for the people in the sample data from the 1881 Census, which years were they born in?

The results look like this:

{
  "responseHeader":{...},
  "response":{...},
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "birth_year_i":[
        "1881",120771,
        "1880",108354,
        "1879",108150,
        "1878",104087,
        "1877",101469,
        ...
      ]
    }
    ...
  }
}

The facet_counts : facet_fields section in the results shows us that 1881 is the most common year followed by 1880, 1879, 1878 and so on. The most common age for people in the sample 1881 Census was 0!
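
Note that the birth_year_i list in the response is flat: values and counts alternate. A short Python sketch, using the five values shown above as sample input, pairs them back up (pair_facet_counts is a hypothetical helper name):

```python
def pair_facet_counts(flat):
    """Turn ["1881", 120771, "1880", 108354, ...] into [("1881", 120771), ...]."""
    # Even positions hold the facet values, odd positions hold their counts.
    return list(zip(flat[0::2], flat[1::2]))


# The first five entries from the response shown above.
sample = [
    "1881", 120771,
    "1880", 108354,
    "1879", 108150,
    "1878", 104087,
    "1877", 101469,
]

pairs = pair_facet_counts(sample)
top_year, top_count = pairs[0]
print(top_year, top_count)
```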

OK, that’s done, but now let’s arrange that data by decade. Give us counts of people in the Census grouped by their birth year decade.

For that, we can use a Range Facet:

Range Facet Example:

http://0.0.0.0:8983/solr/recordsets/select?json=
{
  query : "*:*",
  limit: 0,
  facet: {
    ages: {
      type: range,
      field: birth_year_i,
      start: 1770,
      end: 1890,
      gap: 10,
      other: all
    }
  }
}

I’ve switched querying style here a little to use the newer JSON request API that recent versions of Solr support, passing the whole request in the json parameter.

As well as being a bit easier to read, it keeps the parameter types explicit and stays clearer as queries get more complex.
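
As a sketch of how the same request could be produced from code, the Python below builds the range facet as a dictionary and passes it in the json parameter, just like the URL above. The host and core are the demo's; the variable names are my own, and strict JSON (with quoted keys and strings) is used instead of the relaxed style shown above.

```python
import json
from urllib.parse import urlencode

# Base URL of the demo core used in this post.
SOLR_SELECT = "http://0.0.0.0:8983/solr/recordsets/select"

# The range facet request from above, written as a Python dictionary.
range_request = {
    "query": "*:*",
    "limit": 0,
    "facet": {
        "ages": {
            "type": "range",
            "field": "birth_year_i",
            "start": 1770,
            "end": 1890,
            "gap": 10,
            "other": "all",
        }
    },
}

# Serialise the request and pass it in the 'json' parameter.
range_url = SOLR_SELECT + "?" + urlencode({"json": json.dumps(range_request)})
print(range_url)
```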

Back to Range Facets…

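Range facets take start, end and gap values to bucket the data, plus a selection of other parameters described in the Solr documentation. One of them is the other: all used above, which adds the before, after and between counts to the results.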

The results look like this:

{
  "responseHeader":{...},
  "response":{...},
  "facets":{
    "count":4120178,
    "ages":{
      "buckets":[{
          "val":1770,
          "count":6},
        {
          "val":1780,
          "count":479},
        {
          "val":1790,
          "count":8789},
        {
          "val":1800,
          "count":53921},
        {
          "val":1810,
          "count":145564},
          ...],
      "before":{
        "count":0},
      "after":{
        "count":0},
      "between":{
        "count":4120178}}}}

Each decade is listed with how many people who appeared in the sample data for the 1881 UK Census were born in that decade.
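
Each bucket only reports the value where it starts, so if you want readable labels you have to add the gap yourself. A minimal Python sketch, using the first five buckets from the response above and assuming the gap of 10 from the request (label_buckets is a hypothetical helper name):

```python
# The first five buckets from the response shown above.
buckets = [
    {"val": 1770, "count": 6},
    {"val": 1780, "count": 479},
    {"val": 1790, "count": 8789},
    {"val": 1800, "count": 53921},
    {"val": 1810, "count": 145564},
]


def label_buckets(buckets, gap):
    """Map each bucket to a 'start-end' label, e.g. 1770 with gap 10 -> '1770-1779'."""
    return {
        "{}-{}".format(b["val"], b["val"] + gap - 1): b["count"]
        for b in buckets
    }


decades = label_buckets(buckets, gap=10)
for label, count in decades.items():
    print(label, count)
```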

Let’s see how many different birth years there are in the data and what the average age of the people in the Census sample is:

Metrics Facet Example:

http://0.0.0.0:8983/solr/recordsets/select?json=
{
  query : "*:*",
  limit: 0,
  facet: {
    birth_years: "unique(birth_year_i)",
    avg_age: "avg(age_i)"
  }
}

Here, I’ve applied some formulas to the facets: unique() tells us how many different values there are and avg() tells us the mean average value:

{
  "responseHeader":{...},
  "response":{...},
  "facets":{
    "count":4120178,
    "birth_years":106,
    "avg_age":25.62985434124448
  }
}

So there are 106 different birth years and the average age was 25.6.
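
To show what those two functions are calculating, here is the same arithmetic in plain Python on five made-up records. The records are hypothetical and are not taken from the census data, and this says nothing about how Solr computes the values internally.

```python
# Five made-up records, purely for illustration.
records = [
    {"birth_year_i": 1881, "age_i": 0},
    {"birth_year_i": 1880, "age_i": 1},
    {"birth_year_i": 1880, "age_i": 1},
    {"birth_year_i": 1851, "age_i": 30},
    {"birth_year_i": 1833, "age_i": 48},
]

# unique(birth_year_i): how many distinct values the field takes.
birth_years = len({r["birth_year_i"] for r in records})

# avg(age_i): the arithmetic mean of the field.
avg_age = sum(r["age_i"] for r in records) / len(records)

print(birth_years, avg_age)
```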

There are lots of other metrics functions available as you might guess!

Facets of Facets…

Facets can get really powerful when you group one set of facet results by another field. Imagine a hierarchy of fields such as Country=>County=>Town. You can produce a result-set that shows aggregation by each level in the hierarchy.

http://0.0.0.0:8983/solr/recordsets/select?json=
{
  query: "country_s:scotland",
  limit: 0,
  facet: {
    counties: {
      limit: 1000,
      type: terms,
      field: county_ss,
      facet: {
        towns: {
          limit: 1000,
          type: terms,
          field: town_s,
          missing: true
        }
      }
    }
  }
}

Let’s break this down a little: First, facet the domain of results by aggregating by the county_ss field. Then for each of those values, facet the results by each town_s in that county_ss.

Also include a count for any records that have that county_ss but a missing town_s field (remember, Solr is flexible so you don’t need to have a value for every field for every document).

The results look like this:

{
  "responseHeader":{...},
  "response":{...},
  "facets":{
    "count":115556,
    "counties":{
      "buckets":[{
          "val":"midlothian",
          "count":45818,
          "towns":{
            "missing":{"count":0},
            "buckets":[{
                "val":"leith",
                "count":38588},
              {
                "val":"edinburgh",
                "count":7230}]}},
        {
          "val":"ayrshire",
          "count":13990,
          "towns":{
            "missing":{"count":0},
            "buckets":[{
                "val":"saltcoats",
                "count":5090},
              {
                "val":"stevenston",
                "count":3552},
              {
                "val":"largs",
                "count":3178},
              {
                "val":"stewarton",
                "count":1355},
              {
                "val":"landward",
                "count":815}]}},
        ...
        ]
    }
  }
}

For each county we can see not only the count of records at this level, but also the breakdown of towns and the counts at that level too.
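
Nested results like these are easy to walk in code. The Python sketch below flattens the Midlothian part of the response above into (county, town, count) rows; flatten_counties is a hypothetical helper name, and records with a missing town are reported with None as the town.

```python
# The Midlothian bucket from the response shown above.
facets = {
    "count": 115556,
    "counties": {
        "buckets": [
            {
                "val": "midlothian",
                "count": 45818,
                "towns": {
                    "missing": {"count": 0},
                    "buckets": [
                        {"val": "leith", "count": 38588},
                        {"val": "edinburgh", "count": 7230},
                    ],
                },
            }
        ]
    },
}


def flatten_counties(facets):
    """Return (county, town, count) rows from a counties/towns nested facet."""
    rows = []
    for county in facets["counties"]["buckets"]:
        towns = county["towns"]
        for town in towns["buckets"]:
            rows.append((county["val"], town["val"], town["count"]))
        # Records in this county that have no town_s value at all.
        missing = towns.get("missing", {}).get("count", 0)
        if missing:
            rows.append((county["val"], None, missing))
    return rows


rows = flatten_counties(facets)
for row in rows:
    print(row)
```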

Although I’ve restricted the query to just 2 levels here (County and Town), there’s no real limit. You could easily produce results for Continent => Country => County / State => District => Town => Street.

Conclusion

The last example, where we nested one set of facets within another, hints at some real power in Solr: any type of facet can be nested within another.

A Terms Facet with a nested Range Facet could show you the counts of people in each age range within each county, for example.

Or a Terms Facet with a nested Metrics Facet using max(age_i) could show you the age of the oldest person in each town, county and country.
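
As a rough sketch of that second idea, the Python below builds a request with a Terms Facet on county_ss and a nested max() metric. It assumes the age field is age_i, as in the metrics example earlier; the terms_with_metric helper and the "oldest" label are hypothetical names. I haven't run this against the demo data, so treat it as a starting point.

```python
import json


def terms_with_metric(field, metric_name, metric, limit=1000):
    """Build a terms facet on 'field' with one nested metric per bucket."""
    return {
        "type": "terms",
        "field": field,
        "limit": limit,
        "facet": {metric_name: metric},
    }


# Highest age per county for records in Scotland (field names as used in this post).
oldest_request = {
    "query": "country_s:scotland",
    "limit": 0,
    "facet": {
        "counties": terms_with_metric("county_ss", "oldest", "max(age_i)"),
    },
}

print(json.dumps(oldest_request, indent=2))
```

The printed JSON can be passed in the json parameter in the same way as the other requests in this post.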

There are lots of different ways to facet data, most of which you can find somewhere on Findmypast. We use Terms faceting in the Category lists on a Country or World Search. We use Range faceting when displaying the publication years of our newspaper records. We use Nested faceting when displaying the list of datasets with record counts in the categories on our Dataset A-Z.

Solr’s powerful faceting features drive some of the most flexible searching on Findmypast.

Alex Clark
Search Developer
Findmypast
aclark@findmypast.com
www.findmypast.co.uk