MongoDB oplog details

  • 2020-12-21 18:14:02
  • OfStack

The Oplog is the key data structure with which MongoDB implements replica sets. After the Primary in a replica set performs an operation on the database, an Oplog document is generated and saved in the local.oplog.rs collection. Secondary members pull the Primary's Oplog and replay the same operations, so that they stay consistent with the Primary. In fact, every member in the replica set keeps an Oplog, and each member chooses the nearest member to pull Oplog data from, based on factors such as connection latency.

The Oplog lives in the collection local.oplog.rs, a built-in system collection. It is a capped collection, i.e. a collection of fixed size: once it is full, new data overwrites the oldest entries from the beginning, like a circular queue. The collection size is set when the cluster is initialized, defaulting to 5% of free disk space; it can also be set with the oplogSizeMB option in the configuration file, or changed dynamically with the replSetResizeOplog command after MongoDB has started.
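For reference, here is a minimal mongod.conf excerpt setting the option described above; the replica set name and the 2 GB size are purely illustrative:

```yaml
# mongod.conf excerpt: fix the oplog size at 2 GB
# instead of the default 5% of free disk space
replication:
  replSetName: rs0      # illustrative name, an assumption
  oplogSizeMB: 2048
```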

An Oplog entry is an ordinary MongoDB document with a fixed set of fields:

ts: a special timestamp type built into MongoDB, e.g. Timestamp(1503110518, 1), made up of a Unix timestamp in seconds plus a monotonically increasing integer. It is 64 bits long: the high 32 bits hold the Unix timestamp, and the low 32 bits count the operations within the same second.
h: a hash value serving as the unique identifier of each Oplog entry.
v: the Oplog version.
ns: the namespace, database plus collection, written as db.collection. For collection-level commands it becomes db.$cmd.
op: the operation type, one of the following:
  i: insert, inserting a document
  u: update, updating a document
  d: delete, deleting a document
  c: command, running a command such as createIndex
  n: a no-op, used to keep Oplog time information synchronized between Primary and Secondary when idle
o: the operation body. For an i operation, o is the inserted document; for a u operation, only the changed part is recorded, and o takes the form {$set: {...}}.
o2: for update operations, holds the _id of the updated document.
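The 64-bit ts layout described above can be sketched in plain Python; make_ts and split_ts are hypothetical helper names for illustration, not part of any driver:

```python
# A BSON Timestamp packs a 32-bit Unix time (high bits) and a
# 32-bit increment (low bits) into a single 64-bit value.
def make_ts(time_sec, inc):
    return (time_sec << 32) | inc

def split_ts(raw):
    return raw >> 32, raw & 0xFFFFFFFF

# Two operations in the same second stay ordered by the increment.
assert make_ts(1503110518, 2) > make_ts(1503110518, 1)
assert split_ts(make_ts(1503110518, 1)) == (1503110518, 1)
```

This ordering is what allows the Oplog to be sorted and range-filtered by ts.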

Replaying the Oplog is idempotent: replaying the same Oplog entries multiple times produces the same result. To achieve this, MongoDB rewrites a number of operations so that the generated Oplog stays idempotent. For example, consider the following $inc operation:


db.test.update({_id: ObjectId("533022d70d7e2c31d4490d22")}, {$inc: {count: 1}})

The generated Oplog entry is:


{
 "ts" : Timestamp(1503110518, 1),
 "t" : NumberLong(8),
 "h" : NumberLong(-3967772133090765679),
 "v" : NumberInt(2),
 "op" : "u",
 "ns" : "mongo.test",
 "o2" : {
  "_id" : ObjectId("533022d70d7e2c31d4490d22")
 },
 "o" : {
  "$set" : {
   "count" : 2.0
  }
 }
}

As shown above, MongoDB guarantees that Oplog entries for data operations (DML statements) are idempotent, but it does not guarantee this for collection-level commands (DDL statements), such as running the same createIndex command twice.
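A toy sketch of why the recorded $set form is idempotent while the original $inc would not be; apply_set and apply_inc are hypothetical helpers for illustration only, not MongoDB code:

```python
# Replaying a $set is idempotent; replaying a $inc is not.
def apply_set(doc, spec):
    doc.update(spec['$set'])
    return doc

def apply_inc(doc, spec):
    for key, amount in spec['$inc'].items():
        doc[key] = doc.get(key, 0) + amount
    return doc

doc = {'count': 1}
apply_set(doc, {'$set': {'count': 2.0}})
apply_set(doc, {'$set': {'count': 2.0}})   # replaying changes nothing
assert doc['count'] == 2.0

doc = {'count': 1}
apply_inc(doc, {'$inc': {'count': 1}})
apply_inc(doc, {'$inc': {'count': 1}})     # replaying double-counts
assert doc['count'] == 3
```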

Oplog query

Documents in a capped collection are kept in insertion order and have no other indexes, but local.oplog.rs is a special capped collection: in the WiredTiger engine, the Oplog timestamp is stored as special metadata, so the Oplog can be sorted and filtered by the ts field when queried.

Generally, a Secondary synchronizes through an initial sync followed by incremental sync; once the initial sync completes, the Oplog must be pulled from the sync point onward and replayed continuously. So a typical Oplog query looks like this:


db.oplog.rs.find({ts: {$gte: Timestamp(1503110518, 1)}})

A Secondary needs to continuously receive the Oplog generated by the Primary, so replication uses a tailable cursor to keep fetching Oplog data, much like a publish-subscribe system. This improves efficiency: an ordinary cursor is closed once its results are consumed, while a tailable cursor remembers the position of the last document returned and keeps fetching new data.

With the pymongo driver, tailing the Oplog from a given point in time can be written as follows:


import time

import bson
import pymongo

client = pymongo.MongoClient()

coll = client['local'].get_collection(
 'oplog.rs',
 codec_options=bson.codec_options.CodecOptions(document_class=bson.son.SON))

cursor = coll.find({'ts': {'$gte': start_optime}},
 cursor_type=pymongo.cursor.CursorType.TAILABLE,
 oplog_replay=True,
 no_cursor_timeout=True)

while cursor.alive:
 try:
  oplog = cursor.next()
  process(oplog)
 except StopIteration:
  # No more Oplog data for now; wait before polling again
  time.sleep(1)

cursor_type can be either TAILABLE or TAILABLE_AWAIT. With the latter, when no more Oplog data is available, the server blocks the request waiting for new Oplog data until the wait times out, instead of returning immediately.

Setting the oplog_replay flag tells the server that the query targets the capped collection storing the Oplog and that a ts filter is provided, which allows the query to be optimized.

Once the Oplog is obtained, it can be used to synchronize data, or distributed to interested consumers for custom analysis, which is what tools such as MongoShake do.

References:

Replica Set Oplog: https://docs.mongodb.com/manual/core/replica-set-oplog/
MongoDB oplog ramble: http://caosiyang.github.io/2016/12/24/mongodb-oplog/
MongoDB replica set principle: https://www.ofstack.com/article/166148.htm
