'How does sorting with an index work in MongoDB?

I'm wondering how sorting with an index actually works in MongoDB. There are a couple articles in the MongoDB documentation, but they don't actually describe how the sort proceeds or the time complexity. Searches of SO and the interweb in general so far haven't turned up anything relevant.

Let's assume there are a documents in a collection, the find() clause matches b documents, there's a limit of c documents returned, a >> b >> c, and c is some suitably large number such that the returned set cannot fit in memory - let's say 1M documents, for example.

At the start of the operation, there exist b documents that needs to be sorted and a sorted tree index of size a for the feature the documents will be sorted by.

I can imagine:

A) traverse the index in order, and for each ObjectID traverse the list of b documents. Return matches until c is reached. This would be O(ab).

B) as A), but build a hashset of the ObjectIDs in the b documents first. This is O(a), but takes O(b) memory.

I've tried to consider sorts based on traversing the set of b documents, but can't seem to come up with anything faster than O(b log b), which is no better than sorting without an index.

I assume (but maybe I'm wrong) that every sort doesn't require an index scan, so how does the sort actually work?

Update:

Kevin's answer and provided link narrow down the question a lot, but I'd like to confirm/clarify a few points:

  1. As I understand it, you cannot use different indexes for the query and the sort if you want to avoid an in-memory sort. When I read this page it appeared as though you could (or at least, it didn't specify one way or the other), but that seems to be incorrect. Essentially, the documents are sorted because they're looked up in the order of the index during the query and therefore returned in the order of the index. Right?
  2. When querying on a compound index, the the sorting index must be the first index in the compound index, except for indexes where the query is an equality. If not, the sort is performed in memory. Right?
  3. How does sorting work with $in or $or queries? For example, assume the query is

    {a: {$in: [4, 6, 2, 1, 3, 10]}, b: {$gt: 1, $lt: 6}}

... and there's a compound index on a and b in that order. How would the sort work in the cases the sort is on a or b? $or is even more complicated since, as I understand it, $or queries are essentially split into multiple separate queries. Are $or queries always an in-memory sort, at least for merging the results of the separate queries?



Solution 1:[1]

Indexes in MongoDB are stored in a B-tree structure, where each index entry points to a specific location on-disk. Using a B-tree structure also means that a MongoDB index is stored in a sorted order, always traversed in-order, and is cheap for MongoDB to fetch a series of documents in a sorted order via indexes.

Update: The B-tree structure is true for the MMAPv1 storage engine, but is implemented slightly differently by the WiredTiger storage engine (default since MongoDB 3.2). The basic idea remains the same, where it's cheap to traverse the index in a sorted order.

A SORT stage (i.e. in-memory sort) in a query is limited to 32MB of memory use. A query will fail if the SORT stage exceeds this limit. This limit can be sidestepped by utilizing the sorted nature of indexes, so that MongoDB can return a query with a sort() parameter without performing an in-memory sort.

Let us assume that the query is of the shape:

    db.a.find({b:{$gt:100}, c:{$gt:200}}).sort(...)

with collection a having an index of:

    db.a.createIndex({b:1,c:1})

There are two possible scenarios when a sort() stage is specified in the query:

1. MongoDB cannot use the sorted nature of the index and must perform an in-memory SORT stage.

This is the outcome if the query cannot use the "index prefix". For example:

    db.a.find({b:{$gt:100}, c:{$gt:200}}).sort({c:1})

In the query above, the index {b:1,c:1} can be used to:

  • Match documents having b greater than 100 for the {b:{$gt:100}} portion of the query.
  • However, there is no guarantee that the returned documents are sorted in terms of c.

Therefore, MongoDB has no choice but to perform an in-memory sort. The explain() output of this query will have a SORT stage. This SORT stage would be limited to 32MB of memory use.

2. MongoDB can use the sorted nature of the index.

This is the outcome if the query uses:

  • Sort keys that matches the order of the index, and
  • Specifies the same ordering as the index (i.e. the index {b:1,c:1} can be used for sort({b:1,c:1}) or sort({b:-1,c:-1}) but not sort({b:1,c:-1}))

For example:

    db.a.find({b:{$gt:100}, c:{$gt:200}}).sort({b:1})

In the query above, the index {b:1,c:1} can be used to:

  • Match documents having b greater than 100 for the {b:{$gt:100}} portion of the query.
  • In this case, MongoDB can guarantee that the returned documents are sorted in terms of b.

The explain() output of the query above will not have a SORT stage. Also, the explain() output of the query with and without sort() are identical. In essence, we are getting the sort() for free.

A worthwhile resource to understand this subject is Optimizing MongoDB Compound Indexes. Please note that this blog post was written way back in 2012. Although some of the terminology may be outdated, the technicality of the post is still relevant.

Update on follow-up questions

  1. MongoDB uses only one index for most queries. So for example, to avoid an in-memory SORT stage in the query

    db.a.find({a:1}).sort({b:1})
    

    the index must cover both a and b fields at the same time; e.g. a compound index such as {a:1,b:1} is required. You cannot have two separate indexes {a:1} and {b:1}, and expect the {a:1} index to be used for the equality part, and the {b:1} index to be used for the sort part. In this case, MongoDB will choose one of the two indexes.

    Therefore, it is correct that the results are sorted because they are looked up and returned in the order of the index.

  2. To avoid having an in-memory sort using a compound index, the first part of the index must cater to the equality part of the query, and the second part must cater to the sort part of the query (as shown in the explanation for (1) above).

    If you have a query like this:

    db.a.find({}).sort({a:1})
    

    the index {a:1,b:1} can be used for the sort part (since you're basically returning the whole collection). And if your query looks like this:

    db.a.find({a:1}).sort({b:1})
    

    the same index {a:1,b:1} can also be used for both parts of the query. Also:

    db.a.find({a:1,b:1})
    

    can also use the same index {a:1,b:1}

    Notice the pattern here: the find() followed by sort() parameters follow the index order {a:1,b:1}. Therefore a compound index must be ordered by equality -> sort.

Update regarding sorting of different types

If a field has different types between documents (e.g. if a is string in one document, number in others, boolean in yet another), how do the sort proceed?

The answer is MongoDB BSON type comparison order. To paraphrase the manual page, the order is:

  1. MinKey (internal type)
  2. Null
  3. Numbers (ints, longs, doubles, decimals)
  4. Symbol, String
  5. Object
  6. Array
  7. BinData
  8. ObjectId
  9. Boolean
  10. Date
  11. Timestamp
  12. Regular Expression
  13. MaxKey (internal type)

So from the example above using ascending order, documents containing numbers will appear first, then strings, then boolean.

Solution 2:[2]

Although @kevinadi gives one awesome answer. I add one more thing about compound indexes with sorting.

Per doc mongo index best performance , Use Compound Indexes

Follow the ESR rule

For compound indexes, this rule of thumb is helpful in deciding the order of fields in the index:

  • First, add those fields against which Equality queries are run.
  • The next fields to be indexed should reflect the Sort order of the query.
  • The last fields represent the Range of data to be accessed.

And Alex gives more details on the ESR rule

  • Equality predicates should be placed first

An equality predicate is any filter condition that is attempting to match a value exactly. For example:

find({ x: 123 })
find({ x: { $eq: 123 } })
aggregate([ { $match:{ "x.y": 123 } } ])

These filters will be tightly bound when seen in the indexBounds of an Explain Plan:

"indexBounds" : {
    "x" : [
        "[123.0, 123.0]"
    ]
}

Note that multiple equality predicates do not have to be ordered from most selective to least selective. This guidance has been provided in the past however it is erroneous due to the nature of B-Tree indexes and how in leaf pages, a B-Tree will store combinations of all field’s values. As such, there is exactly the same number of combinations regardless of key order.

  • Sort predicates follow Equality predicates Sort predicates represent the entire requested sort for the operation and determine the ordering of results. For example:
find().sort({ a: 1 })
find().sort({ b: -1, a: 1 })
aggregate([ { $sort: { b: 1 } } ])

A sort predicate will be unbounded as it requires the entire key range to be scanned to satisfy the sort requirements:

"indexBounds" : {
    "b" : [
        "[MaxKey, MinKey]"
    ],
    "a" : [
        "[MinKey, MaxKey]"
    ]
}
  • Range predicates follow Equality and Sort predicates

Range predicates are filters that may scan multiple keys as they are not testing for an exact match. For example:

find({ z: { $gte: 5} })
find({ z: { $lt: 10 } })
find({ z: { $ne: null } })

The range predicates will be loosely bounded as a subset of the key range will need to be scanned to satisfy the filter requirements:

"indexBounds" : {
    "z" : [
        "[5.0, inf.0]"
    ]
}
"indexBounds" : {
    "z" : [
        "[-inf.0, 10.0)"
    ]
}
"indexBounds" : {
    "z" : [
        "[MinKey, undefined)",
        "(null, MaxKey]"
    ]
}

  • (E) Equality First

    When creating queries that ensure selectivity, we learn that “selectivity” is the ability of a query to narrow results using the index. Effective indexes are more selective and allow MongoDB to use the index for a larger portion of the work associated with fulfilling the query.

    Equality fields should always form the prefix for the index to ensure selectivity.

  • (E ? S) Equality before Sort

    Placing Sort predicates after sequential Equality keys allow for the index to:

    • Provide a non-blocking sort.
    • Minimize the amount of scanning required.
  • (E ? R) Equality before Range

    Though Range predicates scan a subset of keys (unlike Sort predicates), they should still be placed after Equality predicates to ensure the key ordering is optimized for selectivity.

  • (S ? R) Sort before Range

    Having a Range predicate before the Sort can result in a Blocking (In Memory) Sort being performed as the index cannot be used to satisfy the sort criteria.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 zangw