ElasticSearch random score combined with boost?
I am building an iOS app with Firebase and using ElasticSearch as a search engine to support more advanced queries.
I am trying to achieve a system where I can get a random record from the index, based on a query. I have already got this working using the "random_score" function with a seed.
So right now all documents should have an equal chance of being selected. Is it possible to add a boost or something (sorry, I am new to ES)?
Let's say a document has the field "boost_enabled" set to true. That document should then be 3 times more likely to be selected, "increasing" its chance of being picked as the random record.
So in theory it should look like this:
Documents that match the query:
"document1"
"document2"
"document3"
They all have an equal chance of being selected (33%)
What I wish to achieve is if "document1" has the field "boost_enabled" = true
It should look like this:
"document1"
"document1"
"document1"
"document2"
"document3"
So now "document1" is 3 times more likely to be selected as the random record.
Would really appreciate some help.
EDIT:
I've come up with something like this. Is this correct or not? I am pretty sure it's not, though...
"query": {
  "function_score": {
    "query": {
      "bool": {
        "must": {
          "match_all": {}
        },
        "should": [
          {
            "exists": {
              "field": "boost_enabled",
              "boost": 3
            }
          }
        ],
        "filter": filterArray
      }
    },
    "functions": [
      {
        "random_score": {"seed": seed}
      }
    ]
  }
}
/ Mads
Solution 1:[1]
Yes, Elasticsearch has something like that - refer to Elasticsearch: Query-Time Boosting.
In your case, you would have a portion of your query that notes the presence of the flag you described, and this "subquery" would have a boost. bool with its should clause will probably be useful.
NB: This is not EXACTLY like being able to say a matching document is n times as likely to be a result.
EDITS:
--
EDIT 1:
Elasticsearch will tell you how it comes up with the score via the Explain API which might be helpful in tweaking parameters.
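Besides the dedicated Explain API, a search request can also carry an explain flag in its body, which asks Elasticsearch to attach a score breakdown to every hit. The sketch below only builds such a request body in Python; the helper name with_explain and the seed value 42 are made up for illustration, and the query is modelled on the one in the question.

```python
import json


def with_explain(search_body):
    """Return a copy of a search body that asks for per-hit score explanations."""
    # "explain": true makes each hit include an _explanation section.
    return {**search_body, "explain": True}


if __name__ == "__main__":
    # Hypothetical body modelled on the question's random_score query.
    body = {
        "size": 1,
        "query": {
            "function_score": {
                "query": {"match_all": {}},
                "functions": [{"random_score": {"seed": 42}}],
            }
        },
    }
    print(json.dumps(with_explain(body), indent=2))
```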
--
EDIT 2:
I apologize for what I had posted above. Upon further thought and exploration, I think the boost parameter is not quite what is required here. function_score already has the notion of weight, but even that falls short. I have found other users with requirements similar to yours, but it looks like there haven't been any good solutions proposed for this.
References:
- Elasticsearch Github Issue on Weighted Random Sampling
- Stackoverflow Post with a Request Identical to Github Issue
I do not think the solutions proposed in those posts are quite right. I put together a quick shell script hitting the Elasticsearch REST API and relying on jq (a popular CLI for processing JSON) to demonstrate: Github Gist: Flawed Attempt At Weighed Random Sampling with Elasticsearch
In the script, featured_flag is equivalent to your boost_enabled, and undesired_flag is there to demonstrate how to only consider a subset of documents in the index. You can copy the script and tweak the global variables at the top (Elasticsearch server, index, etc.) to try it out.
Some notes on the script:
- script creates one document with featured_flag enabled and one document with undesired_flag enabled that should not be ever chosen
- TOTAL_DOCUMENTS can be used to adjust how many total documents are created (including the first two created)
- FEATURED_FLAG_WEIGHT is the weight applied at query time via function_score
- script reruns the same query 1000 times and outputs stats on how many times each of the created documents was returned as the first result
I would imagine your index has many "featured" or "boosted" samples among many that are not. With the described requirements, the probability of choosing a sample depends on the weight of the document (let's say 3 for boosted documents, 1 for the rest) and the sum of weights across all valid documents that you want taken into consideration. Therefore, it seems like simple weights, boosts, and randoms are just insufficient.
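To make the gap concrete, here is a small Python simulation. It is only a sketch under one assumption: that the weight simply multiplies a uniform random score and the highest score is returned first, which is my reading of how weight combines with random_score. The function names are made up for illustration. For the question's three documents the requirement is 60% / 20% / 20%, but the multiplied score puts the boosted document first far more often (analytically 7/9 in this three-document case).

```python
import random


def desired_probability(weight, all_weights):
    """Probability the requirements ask for: weight / sum of all weights."""
    return weight / sum(all_weights)


def simulate_weighted_random_score(all_weights, trials=100_000, seed=0):
    """Estimate how often each document ranks first when score = weight * uniform random."""
    rng = random.Random(seed)
    wins = [0] * len(all_weights)
    for _ in range(trials):
        scores = [w * rng.random() for w in all_weights]
        wins[scores.index(max(scores))] += 1
    return [count / trials for count in wins]


if __name__ == "__main__":
    weights = [3, 1, 1]  # document1 boosted, document2 and document3 not
    print("desired:  ", [desired_probability(w, weights) for w in weights])
    print("simulated:", simulate_weighted_random_score(weights))
```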
A lot of people have considered and posted solutions for the task of weighted random sampling without Elasticsearch. This appears to be a good stab at explaining a few approaches: electric monk: Weighted Random Distribution. A lot of algorithmic details may not be quite relevant here but I thought they were interesting.
I think the ideal solution would require work to be done outside of Elasticsearch (without delving into creating Elasticsearch plugins, scorers, etc). Here is the best that I can come up with at the moment:
- A numeric weight field stored in the documents (can continue with boolean fields but this seems more flexible)
- Hit Elasticsearch with an initial query leveraging aggregations for some stats we need
  - Possibly a sum aggregation for the sum of weights required for document probabilities
  - A terms aggregation to get counts of documents by weights (ex: m documents with weight 1, n documents with weight 3)
- Outside of Elasticsearch (in the app), choose the sample
  - generate a random number within the range of 0 to sum_of_weights - 1
  - use the aggregation results and the generated random to select an index (see the algorithmic solutions for weighted random sampling outside of Elasticsearch) that is in the range of 0 to total_valid_documents - 1 (call this selected_index)
- Hit Elasticsearch a second time with appropriate filters for considering only valid documents, a sort parameter that guarantees the document set is ordered the same way each time we run this process (perhaps sorted by weight and by document id), and a from parameter set to the selected_index
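The steps above could be sketched roughly as follows in Python. This is an illustration under several assumptions, not a tested integration: weights are positive integers stored in a field I call weight, the tiebreaker field doc_id is a hypothetical keyword field, the aggregation names are made up, and the second query sorts by weight ascending so that documents of equal weight sit next to each other. The code only builds request bodies and does the client-side selection; sending the requests is left out.

```python
import random


def build_stats_query(filters):
    """First request: no hits, only the weight statistics we need."""
    return {
        "size": 0,
        "query": {"bool": {"filter": filters}},
        "aggs": {
            "sum_of_weights": {"sum": {"field": "weight"}},
            "docs_by_weight": {"terms": {"field": "weight"}},
        },
    }


def index_for_draw(buckets, draw):
    """Map a draw in [0, sum_of_weights) to a position in the sorted result set.

    buckets: list of (weight, doc_count) pairs sorted by weight ascending.
    Each document owns `weight` consecutive draw values.
    """
    offset = 0  # number of documents in the buckets already skipped
    for weight, count in buckets:
        span = weight * count
        if draw < span:
            return offset + draw // weight
        draw -= span
        offset += count
    raise ValueError("draw is outside the range 0 .. sum_of_weights - 1")


def pick_selected_index(weight_counts, rng=random):
    """Choose selected_index from a {weight: doc_count} mapping."""
    buckets = sorted(weight_counts.items())
    sum_of_weights = sum(weight * count for weight, count in buckets)
    if sum_of_weights <= 0:
        raise ValueError("no documents with positive weight")
    return index_for_draw(buckets, rng.randrange(sum_of_weights))


def build_pick_query(filters, selected_index):
    """Second request: same filters, stable sort, skip to the chosen document."""
    return {
        "from": selected_index,
        "size": 1,
        "query": {"bool": {"filter": filters}},
        "sort": [{"weight": "asc"}, {"doc_id": "asc"}],
    }


if __name__ == "__main__":
    # Pretend the terms aggregation reported 2 documents with weight 1 and 1 with weight 3.
    selected_index = pick_selected_index({1: 2, 3: 1})
    print(build_pick_query([{"term": {"undesired_flag": False}}], selected_index))
```

One caveat with this design: from plus size is capped by the index's max_result_window setting (10,000 by default), so very large candidate sets would need a different way to jump to the selected document.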
Slightly related to all this, I posted a slightly different write up.
Solution 2:[2]
In ES 7.15 I use the script_score key; you can see my example below.
The code "source": "_score + Math.random()" adds a random number between 0.0 and 1.0 to my native boosted score. For more information you can see this
{
  "size": { YOUR_SIZE_LIMIT },
  "query": {
    "script_score": {
      "query": {
        "query_string": {
          "fields": [
            "{ YOUR_FIELD_1 }^6",
            "{ YOUR_FIELD_2 }^3"
          ],
          "query": "{ YOUR_SEARCH_QUERY }"
        }
      },
      "script": {
        "source": "_score + Math.random()"
      }
    }
  }
}
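One thing worth keeping in mind with this approach: since the random term is always below 1.0, it can only reorder documents whose base scores are less than 1.0 apart. The Python sketch below simulates that effect outside of Elasticsearch; the base scores are invented for illustration and the function name is hypothetical.

```python
import random


def top_hit_frequencies(base_scores, trials=20_000, seed=0):
    """Estimate how often each document ranks first when score = base + uniform [0, 1)."""
    rng = random.Random(seed)
    wins = [0] * len(base_scores)
    for _ in range(trials):
        noisy = [score + rng.random() for score in base_scores]
        wins[noisy.index(max(noisy))] += 1
    return [count / trials for count in wins]


if __name__ == "__main__":
    # The third document is more than 1.0 behind the leader, so it can never come first.
    print(top_hit_frequencies([5.0, 4.5, 3.0]))
```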
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |