Sort multi-bucket aggregation by source fields inside inner multi-bucket aggregation

TL;DR: Using an inner multi-bucket aggregation (top_hits with size: 1) inside an outer multi-bucket aggregation, is it possible to sort the buckets of the outer aggregation by the data in the inner buckets?


I have the following index mappings:

{
  "parent": {
    "properties": {
      "children": {
        "type": "nested",
        "properties": {
          "child_id": { "type": "keyword" }
        }
      }
    }
  }
}

Each child (in the data) also has the properties last_modified: Date and other_property: String.

I need to fetch a list of children (across all parents, but without the parents themselves), keeping only the one with the latest last_modified for each child_id. Then I need to sort and paginate those results to return manageable amounts of data.
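To make the requirement concrete, here is a plain-Python sketch of the result I am after, with no Elasticsearch involved. The sample data and the function name latest_children are made up for illustration.

```python
from datetime import date

# Hypothetical sample data: two parents with nested children.
parents = [
    {"children": [
        {"child_id": "a", "last_modified": date(2020, 1, 1), "other_property": "zeta"},
        {"child_id": "b", "last_modified": date(2020, 3, 1), "other_property": "beta"},
    ]},
    {"children": [
        {"child_id": "a", "last_modified": date(2020, 2, 1), "other_property": "alpha"},
        {"child_id": "c", "last_modified": date(2020, 1, 15), "other_property": "gamma"},
    ]},
]


def latest_children(parents, sort_key, offset, limit):
    """Keep the newest child per child_id, sort by sort_key, return one page."""
    latest = {}
    for parent in parents:
        for child in parent["children"]:
            current = latest.get(child["child_id"])
            if current is None or child["last_modified"] > current["last_modified"]:
                latest[child["child_id"]] = child
    ordered = sorted(latest.values(), key=lambda c: c[sort_key])
    # Total number of distinct child_ids, plus the requested page.
    return len(latest), ordered[offset:offset + limit]


total, page = latest_children(parents, "other_property", offset=0, limit=2)
print(total, [c["child_id"] for c in page])
```

The sort runs on a property of the winning child, not on the child_id itself. That is the part I cannot express in the aggregation.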

I'm able to get the data and paginate over it with a combination of nested, terms, top_hits, and bucket_sort aggregations (and also get the total count with cardinality):

{
  "query": {
    "match_all": {}
  },
  "size": 0,
  "aggs": {
    "children": {
        "nested": {
            "path": "children"
        },
        "aggs": {
            "totalCount": {
                "cardinality": {
                    "field": "children.child_id"
                }
            },
            "oneChildPerId": {
                "terms": {
                    "field": "children.child_id",
                    "order": { "_term": "asc" },
                    "size": 1000000
                },
                "aggs": {
                    "lastModified": {
                        "top_hits": {
                            "_source": [
                                "children.other_property"
                            ],
                            "sort": {
                                "children.last_modified": {
                                    "order": "desc"
                                }
                            },
                            "size": 1
                        }
                    },
                    "paginate": {
                        "bucket_sort": {
                            "from": 36,
                            "size": 3
                        }
                    }
                }
            }
        }
    }
  }
}

However, after more than a solid day of going through the docs and experimenting, I seem to be no closer to figuring out how to sort the buckets of my oneChildPerId aggregation by the other_property of the single child retrieved by the lastModified aggregation.

Is there a way to sort a multi-bucket aggregation by results in a nested multi-bucket aggregation?


What I've tried:

  • I thought I could use bucket_sort for that too, but apparently its sort can only be used with paths containing other single-bucket aggregations and ending in a metric one.
  • I've tried to find a way to somehow transform the 1-result multi-bucket of lastModified into a single-bucket, but haven't found any.
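For reference, this is the shape that bucket_sort does accept as far as I can tell: a sort entry that names a sibling single-value metric. The sketch below only builds the request fragment as a Python dict. The aggregation name latestModified and the helper paginate_sorted_by_metric are hypothetical, and sorting by last_modified is not what I ultimately need.

```python
import json


def paginate_sorted_by_metric(metric_name, offset, limit, order="desc"):
    """Build a bucket_sort that orders buckets by a sibling single-value metric."""
    return {
        "bucket_sort": {
            "sort": [{metric_name: {"order": order}}],
            "from": offset,
            "size": limit,
        }
    }


# Sub-aggregations that would sit inside the oneChildPerId terms aggregation.
sub_aggs = {
    # Hypothetical single-value metric computed once per child_id bucket.
    "latestModified": {"max": {"field": "children.last_modified"}},
    "paginate": paginate_sorted_by_metric("latestModified", offset=36, limit=3),
}

print(json.dumps(sub_aggs, indent=2))
```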

I'm using Elasticsearch 6.8.6 (bucket_sort and similar tools weren't available in ES 5.x and older).



Solution 1:[1]

I had the same problem. I needed a terms aggregation with a nested top_hits, and wanted to sort by a specific field inside the nested aggregation.

Not sure how performant my solution is, but the desired behaviour can be achieved with a single-value metric aggregation on the same level as the top_hits. Then you can sort by this new aggregation in the terms aggregation with the order field.

Here is an example:

POST books/_doc 
{ "genre": "action", "title": "bookA", "pages": 200 }
POST books/_doc 
{ "genre": "action", "title": "bookB", "pages": 35 }
POST books/_doc 
{ "genre": "action", "title": "bookC", "pages": 170 }
POST books/_doc 
{ "genre": "comedy", "title": "bookD", "pages": 80 }
POST books/_doc 
{ "genre": "comedy", "title": "bookE", "pages": 90 }

GET books/_search
{
  "size": 0,
  "aggs": {
    "by_genre": {
      "terms": {
        "field": "genre.keyword",
        "order": {"max_pages": "asc"}
      },
      "aggs": {
        "top_book": {
          "top_hits": {
            "size": 1,
            "sort": [{"pages": {"order": "desc"}}]
          }
        },
        "max_pages": {"max": {"field": "pages"}}
      }
    }
  }
}

by_genre uses the order field to sort by a sub-aggregation called max_pages. max_pages was added only for this purpose: it produces a single-value metric per bucket that order can sort by.

The query above returns (I've shortened the output for clarity):

{ "genre" : "comedy", "title" : "bookE", "pages" : 90 }
{ "genre" : "action", "title" : "bookA", "pages" : 200 }

If you change "order": {"max_pages": "asc"} to "order": {"max_pages": "desc"}, the output becomes:

{ "genre" : "action", "title" : "bookA", "pages" : 200 }
{ "genre" : "comedy", "title" : "bookE", "pages" : 90 }
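If it helps to see why the two orderings come out this way, here is a small plain-Python emulation of what the query computes per bucket. It is only a sketch of the semantics, not how Elasticsearch implements it, and the function name by_genre_emulated is made up.

```python
books = [
    {"genre": "action", "title": "bookA", "pages": 200},
    {"genre": "action", "title": "bookB", "pages": 35},
    {"genre": "action", "title": "bookC", "pages": 170},
    {"genre": "comedy", "title": "bookD", "pages": 80},
    {"genre": "comedy", "title": "bookE", "pages": 90},
]


def by_genre_emulated(books, order="asc"):
    """Group by genre, keep the top book per genre, order genres by max pages."""
    buckets = {}
    for book in books:
        buckets.setdefault(book["genre"], []).append(book)

    rows = []
    for hits in buckets.values():
        top_book = max(hits, key=lambda b: b["pages"])  # top_hits: size 1, pages desc
        max_pages = max(b["pages"] for b in hits)       # the max_pages metric
        rows.append((max_pages, top_book))

    # The terms "order" sorts buckets by the metric, not by the top hit itself.
    rows.sort(key=lambda row: row[0], reverse=(order == "desc"))
    return [top_book for _, top_book in rows]


print([b["title"] for b in by_genre_emulated(books, "asc")])
print([b["title"] for b in by_genre_emulated(books, "desc")])
```

Note that the metric and the top hit agree here only because both are derived from pages. If the top hit is picked by one field and the metric is computed over another, the sort value no longer belongs to the document that top_hits returns.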

The type of the max_pages aggregation can be changed as needed, as long as it is a single-value metric aggregation (e.g. sum, avg, etc.).
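As a rough illustration of swapping the metric, the helper below builds the same kind of request fragment as a Python dict, with the metric type as a parameter. The helper terms_sorted_by_metric, the name sort_value and the whitelist are my own assumptions for this sketch. It only builds the body and does not talk to a cluster.

```python
import json

# Single-value metrics that the terms "order" can reference by plain name.
# Multi-value metrics such as stats need a "name.metric" path instead and
# are left out of this sketch.
SINGLE_VALUE_METRICS = {"min", "max", "sum", "avg", "value_count", "cardinality"}


def terms_sorted_by_metric(terms_field, metric_type, metric_field,
                           order="asc", extra_aggs=None):
    """Build a terms aggregation ordered by a single-value metric sub-aggregation."""
    if metric_type not in SINGLE_VALUE_METRICS:
        raise ValueError(f"{metric_type!r} is not a single-value metric in this sketch")
    aggs = dict(extra_aggs or {})
    # Hypothetical name for the metric that exists only to drive the ordering.
    aggs["sort_value"] = {metric_type: {"field": metric_field}}
    return {
        "terms": {"field": terms_field, "order": {"sort_value": order}},
        "aggs": aggs,
    }


by_genre = terms_sorted_by_metric(
    "genre.keyword", "avg", "pages", order="desc",
    extra_aggs={
        "top_book": {"top_hits": {"size": 1, "sort": [{"pages": {"order": "desc"}}]}}
    },
)
print(json.dumps({"size": 0, "aggs": {"by_genre": by_genre}}, indent=2))
```

With avg instead of max, the buckets are ordered by average page count while top_book still returns the longest book per genre, so choose the metric to match what the ordering is supposed to mean.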

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
