ElasticSearch: Pagination by document creation date

It’s been a while. Kept you waiting, huh?

Pagination is not an easy task. There are a lot of implementations that are simply wrong, returning pages missing entries when there’s insertion/updates between queries or sufferring from serious performance issues.

With ElasticSearch, we have multiple ways of implementing pagination. The Scroll API is a good feature when you need to fetch multiple documents inside a task, but it won’t save you when you need to work with pages in a stateless manner (maybe there’s a front-end client consuming your api).

Real case: I needed to search for documents using a task that would run every N minutes. At the end of every search, I would save a “checkpoint”, something indicating where I last stopped so that the next iteration could resume from it.

Pagination with a “creation date” field

Supposing we have the following documents stored in an index:

[
  {
    "creationDate": "2018-01-21T19:51:42.840Z",
    "name": "Marcelo"
  },
  {
    "creationDate": "2018-01-21T19:51:42.840Z",
    "name": "Sarah"
  },
  {
    "creationDate": "2018-01-21T19:51:42.840Z",
    "name": "Yasmin"
  },
  {
    "creationDate": "2018-01-21T20:51:42.840Z",
    "name": "Julia"
  },
  {
    "creationDate": "2018-01-21T20:51:42.840Z",
    "name": "Nicolly"
  }
]

In our example, we will search through these documents with a page size of 2. We want to find all those 5 documents, ordered by their creation date, meaning that when new documents are inserted we can fetch only the new ones.

Episode 1: The naïve approach

We can structure our query in a way that documents are ordered by their creationDate and for every iteration we would store the last document creationDate. This is what our query will look like (with the GET /my_index/_search ommitted):

{
  "size": 2,
  "query": {
    "range": {
      "_name": "This block won't be present in the first page request (since we don't have the last creationDate reference!)",
      "creationDate": {
        "gt": "2018-01-21T19:51:42.840Z"
      }
    }
  },
  "sort": [
    {
      "creationDate": {
        "order": "asc"
      }
    }
  ]
}

And this is the result after 3 iterations/pages (only _source included, the other metadata is not needed for now):

[{
  "_source": {
    "creationDate": "2018-01-21T19:51:42.840Z",
    "name": "Sarah"
  }
}, {
  "_source": {
    "creationDate": "2018-01-21T19:51:42.840Z",
    "name": "Marcelo"
  }
}, {
  "_source": {
    "creationDate": "2018-01-21T20:51:42.840Z",
    "name": "Julia"
  }
}, {
  "_source": {
    "creationDate": "2018-01-21T20:51:42.840Z",
    "name": "Nicolly"
  }
}]

We only got 4 results!

Where is Yasmin?

Turns out that after we got Sarah and Marcelo, our next query asked for a page with 2 results returning documents created after the first query’s last document (Marcelo) with a creationDate greater than “2018-01-21T19:51:42.840Z”.

The problem: Yasmin was also created in that same instant.

Documents created in the same instant makes this approach unpredictable and not always accurate.

Worse yet: The problem can persist for a long time before someone notice it, if they ever do.

Episode 2: The right approach

Starting with version 5.0, ElasticSearch provides us a nice and realiable way to address this issue: the Search After parameter. Quoting the docs:

The search_after parameter circumvents this problem by providing a live cursor. The idea is to use the results from the previous page to help the retrieval of the next page.

and

search_after is not a solution to jump freely to a random page but rather to scroll many queries in parallel. It is very similar to the scroll API but unlike it, the search_after parameter is stateless, it is always resolved against the latest version of the searcher.

We can change the sort in out last query to include the _uid field, unique to every document in that specific index.

Then, starting from the first query, we must include a search_after parameter that will include the creationDate and _uid from the last document in the previous query. Those values are returned with every match in the sort block:

{
  "_source": {
    "creationDate": "2018-01-21T20:51:42.840Z",
    "name": "Julia"
  },
  "sort": [
    1516564302840,
    "type#AWEagzSzwHnysW-7Ulq9"
  ]
}

So, when finding all documents created after Julia, we would use the following query:

{
  "size": 2,
  "search_after": [
    1516564302840,
    "type#AWEagzSzwHnysW-7Ulq9"
  ],
  "sort": [
    {
      "creationDate": {
        "order": "asc"
      }
    },
    {
      "_uid": {
        "order": "asc"
      }
    }
  ]
}

That way we can assure that we can find all the current 5 documents and all future documents that will be inserted.

There’s just one caveat: Our application must not insert documents with a creationDate lower than the last one.

Other DBMS

This solution can also be “ported” to other DBMS easily.

For example, I found that solution when looking for a similar way that we used to do with CouchDB keys and views.

In this case, it is almost identical.

It would look like this with SQL databases when you are not using an auto-incremented primary key:

First page query:

select
	uuid,
	creation_date
from
	your_table
order by
	creation_date asc,
	uuid asc
limit 10

Second page query:

select
	uuid,
	creation_date
from
	your_table
where
  -- the document is indeed created later than the last one
	creation_date > :last_page_row_creation_date
  -- OR it was created at the same time, but wasn't included in the last page because of it's limit clause
		or creation_date = :last_page_row_creation_date and uuid > :last_page_row_uuid
order by
	creation_date asc,
	uuid asc
limit 10

Conclusion

This is a safe way of implementing pagination using ElasticSearch when we plan to iterate through the results in a stateless manner. It is much valuable when using ElasticSearch with endlessy running tasks or when it is used as your primary database.

Good luck!