'MongoDB: find duplicates across multiple keys

The documents are of the type

{
   "id": 1,
   "sig1": "12345",
   "sig2": "67890",
},
{
   "id": 2,
   "sig1": "67890",
   "sig2": "ABCDE",
},

I would like to find all values that have one or more matching document, no matter if the value is in sig1 or sig2. So not only matching all the documents that have the same sig1 or sig2, but also the two documents above, because the first's sig2 is the same as the second's sig1.

I would also like to have an index on this query if it's possible, because otherwise for millions of documents it's probably going to be a very expensive search.

I know an aggregation to find duplicates in a single field is like this:

[
    {"$group" : { "_id": "$sig1", "count": { "$sum": 1 }, "docs": {"$push": "$id"} } },
    {"$match": {"_id" :{ "$ne" : "" } , "count" : {"$gt": 1} } },
]

However I'm struggling to combine this with a cross-key/OR search.



Solution 1:[1]

I don't know how performant this will be with your collection, but here's one way to get a list of "id"s with common document values.

db.collection.aggregate([
  {
    "$set": {
      "valSet": {
        "$map": {
          "input": {
            "$objectToArray": "$$ROOT"
          },
          "in": {
            "$getField": {
              "field": "v",
              "input": "$$this"
            }
          }
        }
      }
    }
  },
  {
    "$unwind": "$valSet"
  },
  {
    "$group": {
      "_id": "$valSet",
      "ids": { "$push": "$id" }
    }
  },
  {
    "$match": {
      "$expr": { "$gt": [ { "$size": "$ids" }, 1 ] }
    }
  }
])

Example query output:

[
  {
    "_id": "67890",
    "ids": [ 1, 2 ]
  }
]

Try it on mongoplayground.net.

If you want to restrict the fields to just "sig1" and "sig2", you could use:

db.collection.aggregate([
  {
    "$set": {
      "valSet": [
        "$sig1",
        "$sig2"
      ]
    }
  },
  {
    "$unwind": "$valSet"
  },
  {
    "$group": {
      "_id": "$valSet",
      "ids": {
        "$push": "$id"
      }
    }
  },
  {
    "$match": {
      "$expr": {
        "$gt": [
          {
            "$size": "$ids"
          },
          1
        ]
      }
    }
  }
])

Try it on mongoplayground.net.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1