Building elasticsearch queries and the case for abstraction

This post discusses why writing elasticsearch queries can be hard, and why we
wrote bodybuilder to abstract away this concern.

Elasticsearch is a versatile tool that enables very powerful searches and
aggregations over large datasets. It lets us ask questions like:

  • How many users complained about ads on SpanishDict in the past 30 days?
  • Which complaints were categorized as inappropriate, with the user's message
    including the phrase “victoria’s secret”?
  • How many resources does a particular ad load, and how does this compare to historical data?

But with great power can come great complexity, and we have found that writing
queries like these in code can quickly become unwieldy. Take this filtered query
as an example:

// Return documents from the last 24 hours, from projects '12' or '24', where the advertiser was 'adx'.
var query = {
  filter: {
    bool: {
      must: [
        {
          range: {
            timestamp: {
              gt: 'now-24h'
            }
          }
        },
        {
          nested: {
            path: 'ad_impression_ids',
            filter: {
              term: {
                'ad_impression_ids.advertiser_name': 'adx'
              }
            }
          }
        }
      ],
      should: [
        {
          term: {
            project_id: '12'
          }
        },
        {
          term: {
            project_id: '24'
          }
        }
      ]
    }
  }
};

If this were a one-off query, we could handle writing it by hand. But because
we are writing an API on top of our elasticsearch dataset, we construct these
queries in code. This quickly becomes complex because:

  • We can't always construct an entire query in one function, because different pieces of our application hold the information needed for different parts of the query.
  • It is not always obvious how to combine multiple queries, filters, and aggregations together to get the expected result.
  • Writing large blocks of JSON causes clutter in code.
  • We almost always have to crack open the elasticsearch docs to relearn the syntax for different queries and filters.
  • We have to write unit tests for these queries, because we need to be confident that the JSON is constructed properly.
  • Building queries is not a core concern of our application, and it distracts us from getting things done.

This is a good candidate for abstraction. We should delegate the task of writing
complex queries to someone else so we can focus on answering important questions
about our data.

Building upon Patterns

The key to building out this abstraction is to identify and exploit patterns:

  • The elasticsearch query DSL consists of clauses that are combined in a predictable way, following a well-defined pattern.
  • Our abstraction should provide a simple, predictable interface on top of this pattern.

The concept of abstracting away the business of building complex queries is not
new, so we can follow an established pattern for this task. The Builder pattern
is a design pattern for "constructing complex objects by separating
construction and representation," and fits our use case well.

Moreover, query builders are common, especially for SQL: sequelize and knex.js,
for example. There are also a few libraries out there for building elasticsearch
queries, such as elastic.js and esq. Unfortunately, these have received little
maintenance recently. However, we can still look to these libraries for
examples of how to write our own query builder.

Bodybuilder

Following the Builder pattern, the problem of building a complex elasticsearch
query body boils down to:

  1. Defining functions that return an object representing a single query clause.
  2. Combining multiple query clauses together.
  3. Defining a builder class that provides methods for calling these functions, and stores the changing state of our elasticsearch query body.

For example, a query clause such as a match query is just a plain object with a
predictable shape:

{
  "match" : {
    "message" : "this is a test"
  }
}

This object is a template that we can create a function for, e.g.

matchQuery("message", "this is a test")
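
A template function like this can be very small. The sketch below shows one possible shape for it. The name matchQuery comes from the example above, but the body is an assumption for illustration and not necessarily how bodybuilder implements it.

```javascript
// Hypothetical template function for a match query clause.
// `field` is the document field to search; `value` is the text to match.
function matchQuery(field, value) {
  var clause = { match: {} };
  clause.match[field] = value;
  return clause;
}

console.log(JSON.stringify(matchQuery('message', 'this is a test')));
// -> {"match":{"message":"this is a test"}}
```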

Next, we should be able to call these template functions on our builder class.
We can specify the query type we want by passing its name as the first
argument, e.g.

var body = new Builder()
body.query("match", "message", "this is a test")
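
One way to support passing the query type by name is a lookup table from type names to template functions. The following is a minimal sketch under that assumption. The names queryTemplates, fieldClause, and makeClause are hypothetical and are not part of bodybuilder's API.

```javascript
// Returns a template function for clauses shaped like { type: { field: value } }.
function fieldClause(type) {
  return function (field, value) {
    var inner = {};
    inner[field] = value;
    var clause = {};
    clause[type] = inner;
    return clause;
  };
}

// Hypothetical registry mapping a query type name to its template function.
var queryTemplates = {
  match: fieldClause('match'),
  term: fieldClause('term'),
  range: fieldClause('range')
};

// Looks up the template by name and applies it to the remaining arguments.
function makeClause(type, field, value) {
  var template = queryTemplates[type];
  if (!template) {
    throw new Error('Unknown query type: ' + type);
  }
  return template(field, value);
}

console.log(JSON.stringify(makeClause('match', 'message', 'this is a test')));
// -> {"match":{"message":"this is a test"}}
```

With a table like this, supporting a new query type means adding one entry rather than touching the builder itself.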

This class should also hold on to the current state of our query body in a private variable:

body.query("match", "message", "this is a test")
// -> { "match" : { "message" : "this is a test" } }

It should also know how to combine multiple queries together without exposing
the details of how this is done:

// Combines two query clauses using a Boolean query clause.
body.query("term", "user", "herald") // -> 'must' clause
    .orQuery("term", "user", "johnny") // -> 'should' clause

And finally, the builder must provide a method for spitting out the constructed
object:

body.build() // -> returns constructed query body
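
To make the pieces above concrete, here is a small illustrative builder that keeps its state private, combines clauses into a bool query, and exposes a build method. It is a sketch only: SketchBuilder is a hypothetical name, the output shape is an assumption, and bodybuilder's real implementation and output may differ.

```javascript
// Illustrative builder; NOT bodybuilder's actual implementation.
function SketchBuilder() {
  // Private state, visible only to the methods defined in this closure.
  var must = [];
  var should = [];

  // Builds a clause shaped like { type: { field: value } }.
  function clause(type, field, value) {
    var inner = {};
    inner[field] = value;
    var outer = {};
    outer[type] = inner;
    return outer;
  }

  this.query = function (type, field, value) {
    must.push(clause(type, field, value));
    return this; // returning `this` allows chaining
  };

  this.orQuery = function (type, field, value) {
    should.push(clause(type, field, value));
    return this;
  };

  this.build = function () {
    if (must.length === 0 && should.length === 0) {
      return {};
    }
    // A single 'must' clause with no 'should' clauses needs no bool wrapper.
    if (must.length === 1 && should.length === 0) {
      return { query: must[0] };
    }
    var bool = {};
    if (must.length) bool.must = must.slice();
    if (should.length) bool.should = should.slice();
    return { query: { bool: bool } };
  };
}

var sketchBody = new SketchBuilder()
  .query('term', 'user', 'herald')
  .orQuery('term', 'user', 'johnny')
  .build();

console.log(JSON.stringify(sketchBody));
// -> {"query":{"bool":{"must":[{"term":{"user":"herald"}}],"should":[{"term":{"user":"johnny"}}]}}}
```

Because the must and should arrays live inside the constructor's closure, callers can only change the query through the builder's methods, which is what lets the combining logic stay hidden.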

Putting all of this together, the adx query from the beginning of this post
can be rewritten much more simply as:

var body = new Builder()
  .filter('range', 'timestamp', {gt: 'now-24h'})
  .filter('nested', 'ad_impression_ids', 'term', 'advertiser_name', 'adx')
  .orFilter('term', 'project_id', '12')
  .orFilter('term', 'project_id', '24')
  .build()

Bodybuilder is a new project, and any feedback or contributions in the way of
pull requests or issues are welcome.


Abstracting out the business of building complex elasticsearch query bodies was a clear win for us. It helped isolate the challenge of writing clean, testable code around elasticsearch queries, and cleared the path for building out an awesome product. As a team we continually look for opportunities to separate concerns by pulling out modules from our code base and iterating on them separately. And by releasing this as an open-source project we hope to bring together other contributors who can help make the code even better.

Do you have a solution for building complex queries programmatically in code? Has this been a sticking point in any of your projects? Share your thoughts in the comments below!