'Elasticsearch filter results by count of property whose value is less than a number
I have an index that is structured like
{
"took": 301,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 4270,
"relation": "eq"
},
"max_score": 2.0,
"hits": [
{
"_index": "asset_revision_structured_data",
"_type": "_doc",
"_id": "2931293",
"_score": 2.0,
"_source": {
"doc": {
"prediction": {
"drugs": {
"document_metadata": {},
"predictions": {
"relevant_drugs": [
{
"confidence_score": 0.9946682341655051
}
]
}
}
}
}
}
}
]
}
}
I would like to filter the results to return all hits
where 50% or more relevant_drugs
have a confidence_score
< 0.6.
I know that this would give me all hits where there contains a relevant_drugs
entry with confidence_score
< 0.6:
{
"query": {
"bool": {
"must": [
{
"exists": {
"field": "doc.prediction.drugs"
}
},
{
"range": {
"doc.prediction.drugs.predictions.relevant_drugs.confidence_score": {
"lt": 0.6
}
}
}
]
}
},
"_source": ["doc.prediction.drugs"]
}
but I would like to only return back hits where that clause applies to greater than half the relevant_drugs
. How would I do this?
Thanks
Solution 1:[1]
Tldr;
I don't believe Elasticsearch
has a specific query to do so.
But you can use Painless. It allow for scripted behaviour in your queries. I also leverage the RuntimeFields to create on the fly a field I can apply a filter to.
To Reproduce
Here is the data I used to run my tests
POST /71916396/_doc
{
"relevant_drugs": [
{
"confidence_score": 0.9946682341655051
},
{
"confidence_score": 0.8946682341655051
}
]
}
POST /71916396/_doc
{
"relevant_drugs": [
{
"confidence_score": 0.9946682341655051
},
{
"confidence_score": 0.02
},
{
"confidence_score": 0.1
}
]
}
POST /71916396/_doc
{
"relevant_drugs": [
{
"confidence_score": 0.1
}
]
}
To Solve
Below the query, with a runtime field getting the median of all the confidence_score in your documents. And then filtering for low confidence score.
GET /71916396/_search
{
"runtime_mappings": {
"confidence_median": {
"type": "double",
"script": {
"source": """
def drugs = params['_source']['relevant_drugs'];
def sorted_drugs = drugs.stream().sorted((d1, d2) -> d1.get('confidence_score').compareTo(d2.get('confidence_score'))).collect(Collectors.toList());
def median = -1.0;
if (sorted_drugs.length % 2 == 0)
{
median = ((double)sorted_drugs[sorted_drugs.length/2]['confidence_score'] + (double)sorted_drugs[sorted_drugs.length/2 - 1]['confidence_score'])/2;
}
else
{
median = (double) sorted_drugs[sorted_drugs.length/2]['confidence_score'];
}
emit(median)
"""
}
}
},
"query": {
"range": {
"confidence_median": {
"lte": 0.6
}
}
},
"size": 10
}
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Paulo |