MongoDB: solving a strange sh.stopBalancer() hang

  • 2020-10-31 22:02:22
  • OfStack

Background

Part1: Overview

In a MongoDB sharded cluster, the following commands stop and start the balancer:


> sh.stopBalancer()    // stop the balancer
> sh.startBalancer()   // start the balancer

Part2: Background

After the balancer was started, the customer reported that the front-end application was writing slowly and that queries were timing out. We therefore tried to turn the balancer off so that chunk migrations would no longer affect cluster performance.

However, the call to sh.stopBalancer() hung and eventually failed:


mongos>sh.stopBalancer()
Waiting for active hosts...
Waiting for the balancer lock...
assert.soon failed,msg:Waited too long for lock balancer to unlock
doassert@src/mongo/shell/assert.js:18:14
assert.soon@src/mongo/shell/assert.js:202:13
sh.waitForDLock@src/mongo/shell/utils_sh.js:198:1
sh.waitForBalancerOff@src/mongo/shell/utils_sh.js:264:9
sh.waitForBalancer@src/mongo/shell/utils_sh.js:294:9
sh.stopBalancer@src/mongo/shell/utils_sh.js:161:5
@(shell):1:1
Balancer still may be active, you must manually verify this is not the case using the
config.changelog collection.
2018-02-11T16:28:29.753+0800
E QUERY [thread1] Error: Error:
assert.soon failed, msg:Waited too long for lock balancer to unlock :
sh.waitForBalancerOff@src/mongo/shell/utils_sh.js:268:15
sh.waitForBalancer@src/mongo/shell/utils_sh.js:294:9
sh.stopBalancer@src/mongo/shell/utils_sh.js:161:5
@(shell):1:1

As the error above shows, the shell timed out waiting for the balancer lock to be released, which suggests that the balancer is still running.

Note: As of version 3.4, the balancer runs on the primary of the config server replica set; in earlier versions it ran on mongos. When the balancer process is active, the primary of the config server replica set acquires the "balancer lock" by modifying a document in the locks collection of the config database. In version 3.4 this "balancer lock" is never released.

Part3: Troubleshooting

Running sh.status() shows that the balancer is currently disabled, yet "Currently running" is still yes, which suggests that a migration is in progress.


 balancer:
Currently enabled: no
Currently running: yes

However, the migrations collection in the config database is empty, which indicates that no chunk is being migrated:


mongos> db.migrations.find()

The balancer document in the locks collection has state 2, which means that the lock is currently held:


mongos> db.locks.find()
{ "_id" : "balancer", "state" : 2, "ts" : ObjectId("5a324c42329457086086da07"), "who" : "ConfigServer:Balancer", "process" : "ConfigServer", "when" : ISODate("2018-01-31T08:33:43.346Z"), "why" : "CSRS Balancer" }

Note:

The why field in the locks collection gives the reason the lock is held. If a chunk is being migrated, the state of the corresponding lock is 2 and why reads Migrating chunk(s) in collection db.collectionname.

As of version 3.4, the state field of the balancer lock document is always 2, which prevents older mongos instances from performing balancing operations. The when field records the time at which a config server member became primary.
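The checks in this part can be condensed into a small helper. What follows is only a sketch in plain JavaScript, runnable in Node.js; the function name explainBalancerLock and the messages it returns are made up for illustration, and it relies only on the fields visible in the lock document shown earlier.

```javascript
// Hypothetical helper (not part of the mongo shell): interprets the document
// with _id "balancer" from config.locks, following the notes in this article.
function explainBalancerLock(lockDoc) {
    if (!lockDoc) {
        return "no balancer lock document found";
    }
    // Per the article, state 2 means the lock is currently held.
    if (lockDoc.state !== 2) {
        return "balancer lock is not held";
    }
    var why = String(lockDoc.why || "");
    if (why.indexOf("Migrating chunk(s) in collection") === 0) {
        return "lock held for a chunk migration: " + why;
    }
    if (lockDoc.who === "ConfigServer:Balancer") {
        // On 3.4 the config server primary keeps state 2 all the time,
        // so this alone does not prove that a migration is running.
        return "lock held by the config server balancer (expected on 3.4)";
    }
    return "lock held: " + why;
}

// Example using the field values from the lock document in this article.
var sample = {
    _id: "balancer",
    state: 2,
    who: "ConfigServer:Balancer",
    process: "ConfigServer",
    why: "CSRS Balancer"
};
console.log(explainBalancerLock(sample));
```

In the mongo shell the same function could be fed with db.getSiblingDB("config").locks.findOne({ _id: "balancer" }), using print instead of console.log.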

The solution

Part1: Overview

Common reasons why the balancer cannot be stopped:

  • A chunk migration is in progress; the balancer can only be stopped normally once the migration has completed.
  • The clocks of the backend servers are out of sync.
  • The mongo client (shell) version is lower than the server version.

This article deals with the third case: the mongo client is version 3.2, while the config server and mongod are both version 3.4.

Part2: Solutions

Replace the old mongo client with the 3.4 client. With the newer client, sh.stopBalancer() returns immediately:


mongos> sh.stopBalancer()
{ "ok" : 1 }
 
config:PRIMARY> db.version()
3.4.9-2.9
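A version mismatch like this one can be detected before any balancer helper is called. The sketch below compares only the major.minor parts of two version strings; the helper names majorMinor and shellIsOlderThanServer are hypothetical, and "3.2.0" is an assumed example of a 3.2 shell version string.

```javascript
// Hypothetical helper: extracts [major, minor] from a version string such as
// "3.4.9-2.9" (the db.version() output shown above).
function majorMinor(versionString) {
    var match = /^(\d+)\.(\d+)/.exec(String(versionString));
    if (!match) {
        throw new Error("unrecognized version string: " + versionString);
    }
    return [parseInt(match[1], 10), parseInt(match[2], 10)];
}

// Returns true when the shell's major.minor release is older than the server's.
function shellIsOlderThanServer(shellVersion, serverVersion) {
    var shell = majorMinor(shellVersion);
    var server = majorMinor(serverVersion);
    if (shell[0] !== server[0]) {
        return shell[0] < server[0];
    }
    return shell[1] < server[1];
}

console.log(shellIsOlderThanServer("3.2.0", "3.4.9-2.9")); // true
console.log(shellIsOlderThanServer("3.4.9", "3.4.9-2.9")); // false
```

In the mongo shell, version() returns the shell version and db.version() the server version, so the check would be shellIsOlderThanServer(version(), db.version()).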

Part3: Cause analysis

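The hang occurs because the mongo client is version 3.2 while the config servers are version 3.4. Starting from version 3.4, the balancer process moved from mongos to the primary of the config server replica set, and the balancer is stopped with the balancerStop command. When that command is not available, sh.stopBalancer() uses the older logic: it disables the balancer and then waits for the balancer lock to be released. The 3.2 shell always takes this older path, as the stack trace above shows (sh.waitForBalancerOff, sh.waitForDLock). Because the 3.4 config server keeps the lock document in state 2, the wait can never succeed and ends with the "Waited too long for lock balancer to unlock" error.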
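Why the older logic can never finish against a 3.4 config server is easy to see in a toy model. The code below is a simplified simulation, not the real shell source: legacyWaitForUnlock, readLockState and maxPolls are invented names, and the value 0 for a released lock is an assumption (the article only states that 2 means held).

```javascript
// Simplified model, NOT the real shell source. `readLockState` stands in for
// a query on config.locks; `maxPolls` stands in for the shell's timeout.
function legacyWaitForUnlock(readLockState, maxPolls) {
    for (var i = 0; i < maxPolls; i++) {
        if (readLockState() !== 2) {
            return true; // lock released, the balancer counts as stopped
        }
    }
    return false; // gave up, like "Waited too long for lock balancer to unlock"
}

// Pre-3.4 behaviour (modelled): the lock is released after a few polls.
var polls = 0;
var releasedAfterThreePolls = function () {
    polls += 1;
    return polls > 3 ? 0 : 2; // 0 is assumed to mean "unlocked"
};

// 3.4 config server: the state stays at 2, as in the lock document above.
var neverReleased = function () {
    return 2;
};

console.log(legacyWaitForUnlock(releasedAfterThreePolls, 10)); // true
console.log(legacyWaitForUnlock(neverReleased, 10));           // false
```

The second call returns false however large maxPolls is, which mirrors the timeout seen with the 3.2 client.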

Conclusion

This example shows the problems that an outdated mongo client can cause, and lists the common reasons why sh.stopBalancer() fails to stop the balancer. The author's knowledge is limited and the article was written in a hurry, so it will inevitably contain some mistakes or inaccuracies; readers are welcome to point them out and correct them.

