Elasticsearch filter results by count of property whose value is less than a number

I have an index whose search results look like this:

{
  "took": 301,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4270,
      "relation": "eq"
    },
    "max_score": 2.0,
    "hits": [
      {
        "_index": "asset_revision_structured_data",
        "_type": "_doc",
        "_id": "2931293",
        "_score": 2.0,
        "_source": {
          "doc": {
            "prediction": {
              "drugs": {
                "document_metadata": {},
                "predictions": {
                  "relevant_drugs": [
                    {
                      "confidence_score": 0.9946682341655051
                    }
                  ]
                }
              }
            }
          }
        }
      }
    ]
  }
}

I would like to filter the results to return all hits where 50% or more relevant_drugs have a confidence_score < 0.6.

I know that this returns all hits that contain at least one relevant_drugs entry with confidence_score < 0.6:

{
  "query": {
    "bool": {
      "must": [
        {
          "exists": {
            "field": "doc.prediction.drugs"
          }
        },
        {
          "range": {
            "doc.prediction.drugs.predictions.relevant_drugs.confidence_score": {
              "lt": 0.6
            }
          }
        }
      ]
    }
  },
  "_source": ["doc.prediction.drugs"]
}

but I would like to return only the hits where that clause applies to at least half of the relevant_drugs. How would I do this?

Thanks

opensearch

Solution 1:^[1]

TL;DR

I don't believe Elasticsearch has a dedicated query for this, but you can use Painless, which allows scripted behaviour in your queries. I also use a runtime field to create, on the fly, a field that I can filter on.

To Reproduce

Here is the data I used to run my tests:

POST /71916396/_doc
{
  "relevant_drugs": [
    {
      "confidence_score": 0.9946682341655051
    },
    {
      "confidence_score": 0.8946682341655051
    }
  ]
}

POST /71916396/_doc
{
  "relevant_drugs": [
    {
      "confidence_score": 0.9946682341655051
    },
    {
      "confidence_score": 0.02
    },
    {
      "confidence_score": 0.1
    }
  ]
}

POST /71916396/_doc
{
  "relevant_drugs": [
    {
      "confidence_score": 0.1
    }
  ]
}

To Solve

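The query below defines a runtime field that computes the median of the confidence_score values in each document, then filters on that median. A median of 0.6 or less means that at least half of the scores in the document are 0.6 or less. Use "lt" instead of "lte" if the bound must be strict, as in the question.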

GET /71916396/_search
{
  "runtime_mappings": {
    "confidence_median": {
      "type": "double",
      "script": {
        "source": """
        // Read the array of drug objects from the original _source
        def drugs = params['_source']['relevant_drugs'];

        // Sort the entries by confidence_score, ascending
        def sorted_drugs = drugs.stream().sorted((d1, d2) -> d1.get('confidence_score').compareTo(d2.get('confidence_score'))).collect(Collectors.toList());

        def median = -1.0;
        if (sorted_drugs.length % 2 == 0)
        {
          // Even count: average the two middle scores
          median = ((double)sorted_drugs[sorted_drugs.length/2]['confidence_score'] + (double)sorted_drugs[sorted_drugs.length/2 - 1]['confidence_score'])/2;
        }
        else
        {
          // Odd count: take the middle score
          median = (double) sorted_drugs[sorted_drugs.length/2]['confidence_score'];
        }

        // Expose the median as the value of the runtime field
        emit(median);
        
        """
      }
    }
  },
  "query": {
    "range": {
      "confidence_median": {
        "lte": 0.6
      }
    }
  }, 
  "size": 10
}
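
To sanity-check the logic outside Elasticsearch, here is a minimal Python sketch (not part of the query itself) that applies the same median test to the three test documents above and compares it with a literal "at least half of the scores are below 0.6" count. The function names and the `edge_case` list are hypothetical, introduced only for illustration.

```python
from statistics import median

THRESHOLD = 0.6

# The three test documents from "To Reproduce", reduced to their scores.
docs = [
    [0.9946682341655051, 0.8946682341655051],
    [0.9946682341655051, 0.02, 0.1],
    [0.1],
]


def median_matches(scores, threshold=THRESHOLD):
    """Mirror the runtime field plus the range filter: median <= threshold."""
    return median(scores) <= threshold


def low_fraction_matches(scores, threshold=THRESHOLD):
    """Literal reading of the question: at least half the scores are < threshold."""
    low = sum(1 for s in scores if s < threshold)
    return 2 * low >= len(scores)


# Hypothetical extra document: exactly half the scores are low,
# but averaging the two middle values pushes the median above 0.6.
edge_case = [0.5, 0.99]

for scores in docs + [edge_case]:
    print(scores, median_matches(scores), low_fraction_matches(scores))
```

On the three test documents both checks agree: the first document is rejected and the other two match. They differ on the hypothetical `edge_case`, where exactly half the scores are low but the averaged median is 0.745. If that case matters to you, the runtime field could emit the fraction of low scores instead of the median, and the range filter could then test that fraction against 0.5.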

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	Paulo