'Searching numbers as keywords or strings with Mongo Atlas Search (as possible in Elastic Search)

Sometimes it's useful to allow numbers to be treated as keywords or strings when using a search index. For example, suppose I have transaction data something like this:

[
  { "amount": 715,  "description": "paypal payment" },
  { "amount": 7500, "description": "second visa payment" },
  { "amount": 7500, "description": "third visa payment" }
]

I might want to allow a search box entry such as "7500 second" to produce the last two rows, with the "second visa payment" row scoring highest.

How can I achieve this with Mongo DB Atlas, using its search index facility?

In Elastic Search, it's possible by adding a keyword field on the numeric field, as per this example:

INDEX=localhost:9200/test
curl -X DELETE "$INDEX?pretty"
curl -X PUT "$INDEX?pretty" -H 'Content-Type: application/json' -d'
{
  "mappings" : {
    "properties" : {
      "amount" : {
        "type" : "long",
        "fields" : {
          "keyword" : {
            "type" : "keyword",
            "ignore_above" : 256
          }
        }
      },
      "description" : {
        "type" : "text",
        "fields" : {
          "keyword" : {
            "type" : "keyword",
            "ignore_above" : 256
          }
        }
      }
    }
  }
}'
curl -X POST "$INDEX/_bulk?pretty" -H 'Content-Type: application/x-ndjson' -d '
{ "index": {"_id":"61d244595c590a67157d5f82"}}
{ "amount": 512,"description": "paypal payment" }
{ "index": {"_id":"61d244785c590a67157d62b3"}}
{ "amount": 7500, "description": "second visa payment" }
{ "index": {"_id":"61d244785c590a67157d62b4"}}
{ "amount": 7500, "description": "third visa payment" }
'
sleep 1

curl -s -X GET "$INDEX/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "query_string": {
        "query": "75* second"

    }
  }
}
' # | jq '.hits.hits[] | {_source,_score}'

Here the search on "75* second" gives the desired result:

{
  "_source": {
    "amount": 7500,
    "description": "second visa payment"
  },
  "_score": 1.9331132
}
{
  "_source": {
    "amount": 7500,
    "description": "third visa payment"
  },
  "_score": 1
}

With eqivalent data in Mongo Atlas (v5.0), I've tried setting up an index with a lucene.keyword on the "amount" field as a string, but it has no effect on the results (which only pay attention to the description field). Similarly, added a string field type on the amount field doesn't produce any rows: it seems Mongo Atlas Search insists on using number-type queries on numeric fields.

I'm aware that I can use a more complex compound query, combining numeric and string fields, to get the result (example below), but this isn't necessarily convenient for a user, who just wants to chuck terms in a box without worrying about field names. I may wish to search over ALL number fields in a row, rather than just one, and include results where only some of the terms match, potentially fuzzily. (A possible use case here is searching over transaction data, with a question like "when was my last payment for about 200 dollars to Steven?" in mind).

One possibility might be to create an "all text" field in the mongo DB, allowing the numbers to be stored as strings, and similar to what happens (or used to happen) in Elastic Search. This might require a materialized view on the data, or else an additional, duplicative field, which then would be indexed.... is there an easier solution, or one that involves less data duplication? (the table in question is large, so storage costs matter).

The data in mongo look something like this. amount could be a float or an integer (or likely both, in different fields).

{"_id":{"$oid":"61d244595c590a67157d5f82"},"amount":{"$numberInt":"512"},"description":"paypal payment"}
{"_id":{"$oid":"61d244785c590a67157d62b3"},"amount":{"$numberInt":"7500"},"description":"second visa payment"}
{"_id":{"$oid":"61d244785c590a67157d62b4"},"amount":{"$numberInt":"7500"},"description":"third visa payment"}

An example of a search index definition I've tried (among many!) is:

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "amount": {
        "multi": {
          "test": {
            "analyzer": "lucene.keyword",
            "ignoreAbove": null,
            "searchAnalyzer": "lucene.keyword",
            "type": "string"
          }
        },
        "type": "string"
      },
      "description": {
        "type": "string"
      }
    }
  },
  "storedSource": true
}

...and a sample search pipeline is:

[
  {
    "$search": {
      "index": "test",
      "text": {
        "path": {
          "wildcard": "*"
        },
        "query": "7500 second"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "description": 1,
      "amount": 1,
      "score": {
        "$meta": "searchScore"
      }
    }
  }
]

This gives only the second row (i.e. the "7500" in the query is effectively ignored, and only the description field matches):

[
  {
    "_id": "61d244785c590a67157d62b3",
    "amount": 7500,
    "description": "second visa payment",
    "score": 0.42414236068725586
  }
]

The following compound query does work, but it's overly complex to produce, especially with many numeric and string fields:

{
  "index": "test",
  "compound": {
    "should": [
      {
        "text": {
          "query": "second",
          "path": "description"
        }
      },
      {
        "near": {
          "path": "amount",
          "origin": 7500,
          "pivot": 1
        }
      }
    ]
  }
}

Documentation on field types and mappings is at https://www.mongodb.com/docs/atlas/atlas-search/define-field-mappings/, operators and collectors at https://www.mongodb.com/docs/atlas/atlas-search/operators-and-collectors/ .

See https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html for Elastic's guidance on why and when it can be useful to index numeric fields as keywords.



Solution 1:[1]

In Atlas Search, the data type defined in your index definition determines what operators you can use to query the values. In this case, I used the default and the following query options to target the values you are looking for above.

For only the numeric value:

{
  compound: {
    should: [
      {
        range: {
          gt: 7499,
          lte: 7500,
          path: 'amount'
        }
      }
    ]
  }
}

If I want query for both the text and the number, it's also simply a compound query, though an edgeGram autocomplete field type would be desired in an optimal state. It's really important for me to simplify before optimizing:

{
  compound: {
    must: [
      {
        range: {
          gt: 7499,
          lte: 7500,
          path: 'amount'
        }
      },
      {
        wildcard: {
          query: "*",
          path: 'description',
          allowAnalyzedField: true
        }
      }
    ]
  }
}

I hope this is helpful. keyword is only a good analyzer for this field in the case of the description field if you want to do the wildcard. Standard or the language that description is written in would both be better.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Nice-Guy