MongoDB ReplicaSet issues

We are running a MongoDB replica set on Kubernetes. One of the MongoDB pods is in a crash loop and shows OOMKilled as true. The pod has crashed 234 times so far.

We have one primary and two secondaries.

Here are the latest logs. The container lives for about a minute and then crashes again. I am trying to understand what the logs mean.

What does OplogStartMissing mean?

{"log":"2022-03-08T09:24:44.127+0000 I REPL     [rsBackgroundSync] Starting rollback due to OplogStartMissing: Our last op time fetched: { ts: Timestamp(1646656464, 1), t: 58 }. source's GTE: { ts: Timestamp(1646656801, 1), t: 60 } hashes: (2206456552855381608/8108672600344202316)\n","stream":"stdout","time":"2022-03-08T09:24:44.12744806Z"}
{"log":"2022-03-08T09:24:44.127+0000 I REPL     [rsBackgroundSync] Rollback using the 'rollbackViaRefetch' method because UUID support is feature compatible with featureCompatibilityVersion 3.6.\n","stream":"stdout","time":"2022-03-08T09:24:44.12747365Z"}
{"log":"2022-03-08T09:24:44.127+0000 I REPL     [rsBackgroundSync] transition to ROLLBACK from SECONDARY\n","stream":"stdout","time":"2022-03-08T09:24:44.127477084Z"}
{"log":"2022-03-08T09:24:44.127+0000 I ROLLBACK [rsBackgroundSync] Starting rollback. Sync source: mongodb-2.mongodb.maglev-system.svc.cluster.local:27017\n","stream":"stdout","time":"2022-03-08T09:24:44.127480067Z"}
{"log":"2022-03-08T09:24:44.133+0000 I ROLLBACK [rsBackgroundSync] Finding the Common Point\n","stream":"stdout","time":"2022-03-08T09:24:44.133319869Z"}
{"log":"2022-03-08T09:24:44.136+0000 I ROLLBACK [rsBackgroundSync] our last optime:   Timestamp(1646656464, 1)\n","stream":"stdout","time":"2022-03-08T09:24:44.136901468Z"}
{"log":"2022-03-08T09:24:44.136+0000 I ROLLBACK [rsBackgroundSync] their last optime: Timestamp(1646731479, 1)\n","stream":"stdout","time":"2022-03-08T09:24:44.136912166Z"}
{"log":"2022-03-08T09:24:44.136+0000 I ROLLBACK [rsBackgroundSync] diff in end of log times: **-75015** seconds\n","stream":"stdout","time":"2022-03-08T09:24:44.136916265Z"}
{"log":"2022-03-08T09:24:44.320+0000 I NETWORK  [listener] connection accepted from 127.0.0.1:41476 #2 (1 connection now open)\n","stream":"stdout","time":"2022-03-08T09:24:44.320702224Z"}

In particular, the diff in end of log times is negative. What does a negative value signify? And what does rollbackViaRefetch mean?



Solution 1:[1]

OOMKilled - The container was killed because it tried to use more memory than the limit set for it in your resources.limits section.
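To confirm this from the pod status rather than from the logs, you can inspect the JSON that `kubectl get pod <name> -o json` prints. The sketch below assumes the usual `status.containerStatuses[].lastState.terminated.reason` layout. The helper name and the trimmed sample pod (including the container name `mongodb`) are made up for illustration; the restart count is the one from the question.

```python
from typing import List, Tuple


def oom_killed_containers(pod: dict) -> List[Tuple[str, int]]:
    """Return (container name, restart count) for every container
    whose last termination reason was OOMKilled."""
    hits = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        # lastState is empty if the container has never been restarted.
        terminated = status.get("lastState", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            hits.append((status["name"], status.get("restartCount", 0)))
    return hits


# Hypothetical, heavily trimmed pod object for illustration only.
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "mongodb",
                "restartCount": 234,
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            }
        ]
    }
}

print(oom_killed_containers(sample_pod))
```

If the list is non-empty, compare the container's memory limit with what mongod actually needs before looking for replication problems.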

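OplogStartMissing - The member asked its sync source for entries starting at its own last fetched op, and the first entry that came back did not match it. Most of the time this seems to point to your oplog being too small. Try increasing it.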
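A rough way to reason about "too small": a lagging member can only pick up where it left off if its last optime still falls inside the time range covered by the sync source's oplog. The sketch below is a simplified model of that check, not MongoDB's actual logic. The function name and the two "first entry" timestamps are hypothetical; the other two timestamps are the optimes from the question's logs.

```python
def last_optime_in_source_window(
    source_first_ts: int, source_last_ts: int, member_last_ts: int
) -> bool:
    """True if the member's last optime (seconds since the epoch) lies
    within the span of the sync source's oplog."""
    return source_first_ts <= member_last_ts <= source_last_ts


member_last_ts = 1646656464   # "our last optime" from the logs
source_last_ts = 1646731479   # "their last optime" from the logs

# Hypothetical oldest entries in the source's oplog:
small_oplog_first_ts = 1646700000   # oplog has already rolled past the member
large_oplog_first_ts = 1646600000   # oplog still reaches back far enough

print(last_optime_in_source_window(small_oplog_first_ts, source_last_ts, member_last_ts))
print(last_optime_in_source_window(large_oplog_first_ts, source_last_ts, member_last_ts))
```

A larger oplog pushes the oldest retained entry further into the past, which is why increasing it helps members that fall behind for long periods.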

RollbackViaRefetch - From the documentation:

Nodes go into rollback if after they receive the first batch of writes from their sync source, they realize that the greater than or equal to predicate did not return the last op in their oplog. When rolling back, nodes are in the ROLLBACK state and reads are prohibited. When a node goes into rollback it drops all snapshots. The rolling-back node first finds the common point between its oplog and its sync source's oplog. It then goes through all of the operations in its oplog back to the common point and figures out how to undo them.
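As for the negative "diff in end of log times": the logged value equals the member's own last optime minus the sync source's last optime, so a negative number suggests the member's oplog simply ends earlier than the source's. A quick check using the two timestamps from the question's logs (the function name is illustrative, not a MongoDB API):

```python
from datetime import timedelta


def end_of_log_diff(our_last_optime: int, their_last_optime: int) -> int:
    """Difference in seconds between the last entry of our oplog and the
    last entry of the sync source's oplog (ours minus theirs)."""
    return our_last_optime - their_last_optime


our_last = 1646656464     # "our last optime" from the logs
their_last = 1646731479   # "their last optime" from the logs

diff = end_of_log_diff(our_last, their_last)
print(diff)                            # -75015, matching the log line
print(timedelta(seconds=abs(diff)))    # 20:50:15
```

In other words, this member's last oplog entry is about 20 hours and 50 minutes older than the sync source's, which fits a pod that has been crash-looping and unable to replicate.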

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 testfile