MongoDB query optimizer keeps choosing the least efficient index for the query

I have a large collection (~20M records) of moderately sized documents with ~20 indexed fields. All of those indexes are single-field. The collection also has quite a lot of read and write traffic.

MongoDB version is 4.0.9.

I am seeing at peak times that the query optimizer keeps selecting a very inefficient index for the winning plan.

In the example query:

{
  name: 'Alfred Mason',
  created_at: { $gt: ... },
  active: true
}

All of the fields are indexed:

{ name: 1 }
{ created_at: 1 }
{ active: 1 }

When I run explain(), the winning plan uses the created_at index, which scans ~200k documents before returning the 4 that match the query. Query execution time is ~6000 ms.

If I use $hint to force the name index, it will scan 6 documents before returning 4 that match the query. Execution time is ~2 ms.
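
For reference, this is roughly how I compare the two plans in the mongo shell (the collection name and date value below are placeholders):

var query = {
  name: 'Alfred Mason',
  created_at: { $gt: ISODate('2022-01-01') },  // placeholder date
  active: true
};

// Plan chosen by the optimizer (~200k documents examined, ~6000 ms):
db.my_collection.find(query).explain('executionStats');

// Forcing the name index via the hint() cursor method (~6 documents examined, ~2 ms):
db.my_collection.find(query).hint({ name: 1 }).explain('executionStats');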

Why does the query optimizer keep selecting the slowest index? It does seem suspicious that it only happens during peak hours, when there is more write activity on the collection, but what is the exact reasoning? What can I do about it?

Is it safe to use $hint in a production environment?

Is it reasonable to remove the index on the date field completely, as the $gt query doesn't seem any faster than a COLLSCAN? That could force the query optimizer to use an indexed field. But then again, it could also select another inefficient index (the boolean field).
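
For completeness, dropping it would just be the following (placeholder collection name):

db.my_collection.dropIndex({ created_at: 1 });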

I can't use compound indexes as there are a lot of use cases that use different combinations of all 20 indexes available.



Solution 1:[1]

There could be a number of reasons why Mongo appears to not be using the best execution plan, including:

  • The running time and execution plan estimate using the single-field index on the name field is not accurate. This could be due to bad statistics, i.e. Mongo is making an estimate using stale or out-of-date information; one way to inspect how the candidate plans were ranked is sketched after this list.
  • While the created_at index is not optimal for your particular query, in general it may be the optimal index for most of the possible queries against that field.
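
One way to check how the optimizer ranked the candidate plans, and to discard a possibly stale cached plan, is sketched below (collection name and date value are placeholders):

// Show every candidate plan and the work each one did during plan ranking:
db.your_collection.find({
    name: 'Alfred Mason',
    created_at: { $gt: ISODate('2022-01-01') },  // placeholder date
    active: true
}).explain('allPlansExecution');

// If a stale cached plan is suspected, clear the plan cache for the collection
// (mongo shell helper available in MongoDB 4.0):
db.your_collection.getPlanCache().clear();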

My answer here is actually that you should probably be using a compound (multiple field) index, given that you are filtering on multiple fields. For the example filter you gave in the question:

{
  name: 'Alfred Mason',
  created_at: { $gt: ... },
  active: true
}

I would suggest trying both of the following indices:

db.getCollection('your_collection').createIndex(
    { "name": 1, "created_at": 1, "active": 1 } );

and

db.getCollection('your_collection').createIndex(
    { "created_at": 1, "name": 1, "active": 1 } );

Whether you would want created_at to be first in the index, or rather name, depends on which field has the higher cardinality. Cardinality basically means how unique the values in a given field are. If every name in the collection is distinct, then you would probably want name to be first. On the other hand, if every created_at timestamp is expected to be unique, then it might make sense to put that field first. As for active, it appears to be a boolean field, and as such it can only take on two values (true/false). It should be last in the index (and you might even be able to omit it entirely).
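
If you are not sure which field has the higher cardinality, one rough way to check is to count the distinct values of each candidate field, for example (collection name is a placeholder, and this full scan can be slow on ~20M documents):

// Approximate cardinality of the name field:
db.your_collection.aggregate(
    [ { $group: { _id: "$name" } }, { $count: "distinct_names" } ],
    { allowDiskUse: true } );

// Approximate cardinality of the created_at field:
db.your_collection.aggregate(
    [ { $group: { _id: "$created_at" } }, { $count: "distinct_created_at" } ],
    { allowDiskUse: true } );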

Solution 2:[2]

I do not think it is necessary to index all of the fields; it is better to choose the appropriate ones.

Prefixes in Compound Indexes may be useful for you
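
For example (collection and field names are taken from the question, so treat this as a sketch), a single compound index also serves queries that filter on its prefixes, which may let you drop some of the single-field indexes:

db.your_collection.createIndex(
    { "name": 1, "created_at": 1, "active": 1 } );

// The same index supports queries on its prefixes, e.g.:
db.your_collection.find( { name: 'Alfred Mason' } );  // uses prefix { name: 1 }
db.your_collection.find(
    { name: 'Alfred Mason',
      created_at: { $gt: ISODate('2022-01-01') } } );  // placeholder date; uses prefix { name: 1, created_at: 1 }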

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Tim Biegeleisen
Solution 2 Hamed Lohi