Ignore filtered words from the query string when using phrase match in Elasticsearch
I'm using a custom index analyzer to remove a certain set of stop words. I'm then making phrase match queries with text that includes some of the stop words. I would expect the stop words to be filtered out of the query; however, they are not (and any documents that do not include them are being excluded from the results).
Here's a simplified example of what I'm trying to do:
#!/bin/bash
export ELASTICSEARCH_ENDPOINT="http://localhost:9200"
# Create index, with a custom analyzer to filter out the word 'foo'
curl -XPUT "$ELASTICSEARCH_ENDPOINT/play" -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "fooAnalyzer": {
          "type": "custom",
          "tokenizer": "letter",
          "filter": [
            "fooFilter"
          ]
        }
      },
      "filter": {
        "fooFilter": {
          "type": "stop",
          "stopwords": [
            "foo"
          ]
        }
      }
    }
  },
  "mappings": {
    "myDocument": {
      "properties": {
        "myMessage": {
          "analyzer": "fooAnalyzer",
          "type": "string"
        }
      }
    }
  }
}'
# Add sample document
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_bulk?refresh=true" -d '
{"index":{"_index":"play","_type":"myDocument"}}
{"myMessage":"bar baz"}
'
If I perform a phrase match search against this index with a filtered stop word in the middle of the query, I would expect it to match (since 'foo' should be filtered away by our analyzer).
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" -d '
{
  "query": {
    "match": {
      "myMessage": {
        "type": "phrase",
        "query": "bar foo baz"
      }
    }
  }
}
'
However, I get no results.
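To double-check that the analyzer itself is doing its job, you can run the query text through the _analyze API (JSON-body form shown here; older versions also accept the analyzer as a query-string parameter with the raw text as the request body):

curl -XGET "$ELASTICSEARCH_ENDPOINT/play/_analyze?pretty" -d '
{
  "analyzer": "fooAnalyzer",
  "text": "bar foo baz"
}'

The response lists only the tokens 'bar' and 'baz', so the filter is removing 'foo' as intended; the problem must lie elsewhere.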
Is there a way to instruct Elasticsearch to tokenize and filter the query string before performing the search?
Edit 1: now I'm even more confused. Before, I was seeing that phrase matching didn't work if my query contained stop words in the middle of the query text. Now, in addition, I'm seeing that the phrase query does not work if the document contains stop words in the middle of its text. Here's a minimal example, still using the mapping from above.
POST play/myDocument
{
  "myMessage": "fib foo bar" <---- remember that 'foo' is a stopword and is filtered out of analysis
}

GET play/_search
{
  "query": {
    "match": {
      "myMessage": {
        "type": "phrase",
        "query": "fib bar"
      }
    }
  }
}
This query does not match. I'm very surprised by this! I would expect the foo stop word to be filtered out and ignored.
For an example of why I'd expect this, see this query:
POST play/myDocument
{
  "myMessage": "fib 123 bar"
}

GET play/_search
{
  "query": {
    "match": {
      "myMessage": {
        "type": "phrase",
        "query": "fib bar"
      }
    }
  }
}
This matches, because the '123' is filtered out by my 'letter' tokenizer. It seems like phrase matching is ignoring the stop word filtering completely, and acting as if those tokens were in the analyzed field all along (even though they don't show up in the list of tokens from _analyze).
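The positions reported by _analyze hint at what is going on. Running the document text through the analyzer:

GET play/_analyze
{
  "analyzer": "fooAnalyzer",
  "text": "fib foo bar"
}

returns 'fib' at position 0 and 'bar' at position 2: the stop filter drops the token 'foo' but leaves its position slot empty, so a phrase query for "fib bar", which wants the two terms in adjacent positions, cannot match.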
My current best idea for a workaround (a sketch follows below):
- call the _analyze endpoint against my document's text string using my custom analyzer; this will return the tokens from the original text string but remove the pesky stop words for me
- save a version of my text using only those tokens into a "filtered" field in the document

Later, at query time:
- call the _analyze endpoint against my query string using my custom analyzer to get just the tokens
- make my phrase match query using the filtered token string against the document's new "filtered" field
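Here is a minimal sketch of the index-time half of that idea, assuming the jq utility is available and using a hypothetical myMessageFiltered field:

#!/bin/bash
export ELASTICSEARCH_ENDPOINT="http://localhost:9200"

# 1. Run the raw text through our analyzer and keep only the surviving tokens
TOKENS=$(curl -s -XGET "$ELASTICSEARCH_ENDPOINT/play/_analyze" -d '
{
  "analyzer": "fooAnalyzer",
  "text": "fib foo bar"
}' | jq -r '[.tokens[].token] | join(" ")')
# TOKENS is now "fib bar"; the stop word is gone

# 2. Index the document with both the raw and the filtered text
curl -XPOST "$ELASTICSEARCH_ENDPOINT/play/myDocument" -d "
{
  \"myMessage\": \"fib foo bar\",
  \"myMessageFiltered\": \"$TOKENS\"
}"

The query-time half is symmetric: analyze the query string the same way and run the phrase match against myMessageFiltered.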
Solution 1:[1]
It turns out that if you want to use phrase matching, the token filter stage is too late to remove unwanted words. By that point, the position field of your significant tokens is already polluted by the filtered tokens, and phrase matching refuses to work.
The answer: filter before we ever reach the token filter stage. I created a char_filter that removes our unwanted term, and phrase matching started working correctly!
PUT play
{
  "settings": {
    "analysis": {
      "analyzer": {
        "fooAnalyzer": {
          "type": "custom",
          "tokenizer": "letter",
          "char_filter": [
            "fooFilter"
          ]
        }
      },
      "char_filter": {
        "fooFilter": {
          "type": "pattern_replace",
          "pattern": "(foo)",
          "replacement": ""
        }
      }
    }
  },
  "mappings": {
    "myDocument": {
      "properties": {
        "myMessage": {
          "analyzer": "fooAnalyzer",
          "type": "string"
        }
      }
    }
  }
}
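You can verify the fix with _analyze; because 'foo' is now erased before tokenization ever happens, the remaining tokens should come back with contiguous positions:

GET play/_analyze
{
  "analyzer": "fooAnalyzer",
  "text": "fib foo bar"
}

This should now report 'fib' at position 0 and 'bar' at position 1, which is exactly what the phrase query needs.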
Queries:
POST play/myDocument
{
  "myMessage": "fib bar"
}

GET play/_search
{
  "query": {
    "match": {
      "myMessage": {
        "type": "phrase",
        "query": "fib foo bar"
      }
    }
  }
}
and
POST play/myDocument
{
  "myMessage": "fib foo bar"
}

GET play/_search
{
  "query": {
    "match": {
      "myMessage": {
        "type": "phrase",
        "query": "fib bar"
      }
    }
  }
}
both now work!
Solution 2:[2]
A workaround that should work (sketched below):
- call the _analyze endpoint against my query string using my custom analyzer; this will return the tokens from the original query string but remove the pesky stop words for me
- make my phrase match query using the filtered tokens
However, this would obviously require two calls to Elasticsearch for every one of my queries. I'd like to find a better solution if possible.
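For reference, a minimal sketch of those two calls in bash (jq assumed available, reusing the setup from the question):

# 1. Analyze the query string to strip the stop words
QUERY=$(curl -s -XGET "$ELASTICSEARCH_ENDPOINT/play/_analyze" -d '
{
  "analyzer": "fooAnalyzer",
  "text": "bar foo baz"
}' | jq -r '[.tokens[].token] | join(" ")')

# 2. Use the filtered string ("bar baz") in the phrase query
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" -d "
{
  \"query\": {
    \"match\": {
      \"myMessage\": { \"type\": \"phrase\", \"query\": \"$QUERY\" }
    }
  }
}"

Note that this only cures stop words in the query string; gaps left by stop words in the indexed document (the Edit 1 case) still need something like Solution 1.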
Solution 3:[3]
Solution
Here is an alternative solution to a similar problem, removing English stop words AND dealing with multi-value fields; tested on v7.10. It doesn't require an explicit char_filter: it uses a standard analyzer with English stop words and maps the field as text, so it should handle match_phrase properly:
PUT play
{
  "settings": {
    "analysis": {
      "analyzer": {
        "phrase_analyzer": {
          "type": "standard",
          "stopwords": "_english_" //for my use case
        }
      }
    }
  },
  "mappings": {
    // "myDocument" is not used in v7.x
    "properties": {
      "myMessage": {
        "analyzer": "phrase_analyzer",
        "type": "text" //changed to handle match_phrase
      }
    }
  }
}
For this demo data:
POST _bulk
{ "index": { "_index": "play", "_id": "1" } }
{ "myMessage": ["Guardian of the Galaxy"]}
{ "index": { "_index": "play", "_id": "2" } }
{ "myMessage": ["Ambassador of Peace", "Guardian of the Galaxy"]}
{ "index": { "_index": "play", "_id": "3" } }
{ "myMessage": ["Guardian of the Galaxy and Ambassador of Peace"]}
{ "index": { "_index": "play", "_id": "4" } }
{ "myMessage": ["Ambassador of Peace and Guardian of the Galaxy"]}
{ "index": { "_index": "play", "_id": "5" } }
{ "myMessage": ["Supreme Galaxy and All Living Beings Guardian"]}
{ "index": { "_index": "play", "_id": "6" } }
{ "myMessage": ["Guardian of the Sun", "Worker of the Galaxy"]}
Query 1:
GET play/_search
{
  "query": {
    "match_phrase": {
      "myMessage": {
        "query": "guardian of the galaxy",
        "slop": 99 //useful on multi-values text fields
        //https://www.elastic.co/guide/en/elasticsearch/reference/7.10/position-increment-gap.html
      }
    }
  }
}
Should return docs 1 to 5, because each one has at least one value containing both "guardian" and "galaxy"; doc 6 will not be a match because those two words appear in different values rather than in the same one (that's why we used slop=99).
Query 2:
GET play/_search
{
  "query": {
    "match_phrase": {
      "myMessage": {
        "query": "\"guardian of the galaxy\"",
        "slop": 99
      }
    }
  }
}
Should return only docs 1 to 4, because the (escaped) double quotes enforce an exact per-value match, and doc 5 has the two words in different positions.
Explanation
The problem is that you used a stop token filter, and as the analyzer documentation (see "Anatomy of an analyzer" below) notes:
"Token filters are not allowed to change the position or character offsets of each token."
together with a match_phrase query, for which the documentation says:
"The match_phrase query analyzes the text and creates a phrase query out of the analyzed text."
So the positions were already calculated before the stop token filter was applied, and match_phrase relies on them to compute a match. The '123' example worked properly because it is the letter tokenizer itself that assigns positions, so match_phrase is happy:
"The tokenizer is also responsible for recording the order or position of each term."
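To see those positions concretely, you can analyze one of the demo values:

GET play/_analyze
{
  "analyzer": "phrase_analyzer",
  "text": "Guardian of the Galaxy"
}

This should return 'guardian' at position 0 and 'galaxy' at position 3; 'of' and 'the' are gone, but their position slots remain. Because the query string goes through the same analyzer and acquires the same gaps, an exact value like doc 1 can still match, and the slop of 99 stays just under the default position_increment_gap of 100, so a phrase is not allowed to span two different values.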
Exceptional Case: 0.3% Were False Positives
After testing this solution with a bigger variety of data, I found some exceptional false positives: about 0.3% of the 4k search results. In my particular case, I'm using match_phrase in the filter context. To reproduce a false positive, we can just switch the order of the values from doc 6, so that the words "Galaxy" and "Guardian" end up close to each other:
POST _bulk
{ "index": { "_index": "play", "_id": "7" } }
{ "myMessage": ["Worker of the Galaxy", "Guardian of the Sun"]}
The previous Query 1 would return it too, when it clearly shouldn't. I could not solve this using Elasticsearch APIs alone, but did it by programmatically removing the stop words from Query 1 (see next).
Query 3:
GET play/_search
{
  "query": {
    "match_phrase": {
      "myMessage": {
        "query": "guardian galaxy", //manually removed "of" and "the" stop words
        "slop": 99 //useful on multi-values text fields
        //https://www.elastic.co/guide/en/elasticsearch/reference/7.10/position-increment-gap.html
      }
    }
  }
}
See more at:
- Tokenizer vs. Token Filter: Anatomy of an analyzer
- Match phrase query
- Tokenizer and the position calculation: Token graphs
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 | Ricardo
Solution 2 | Topher
Solution 3 |