An In-Depth Understanding of MongoDB's Compound Index

  • 2020-10-23 21:15:35
  • OfStack

Why do you need indexes?

When you complain about the inefficiency of MongoDB collection queries, it may be time to consider using indexes. To lay the groundwork for the rest of this article, let's first explain MongoDB's indexing mechanism (which also applies to other databases such as MySQL).


mongo-9552:PRIMARY> db.person.find()
{ "_id" : ObjectId("571b5da31b0d530a03b3ce82"), "name" : "jack", "age" : 19 }
{ "_id" : ObjectId("571b5dae1b0d530a03b3ce83"), "name" : "rose", "age" : 20 }
{ "_id" : ObjectId("571b5db81b0d530a03b3ce84"), "name" : "jack", "age" : 18 }
{ "_id" : ObjectId("571b5dc21b0d530a03b3ce85"), "name" : "tony", "age" : 21 }
{ "_id" : ObjectId("571b5dc21b0d530a03b3ce86"), "name" : "adam", "age" : 18 }

When you insert multiple documents into a collection, each document is persisted by the underlying storage engine and is assigned a location from which it can later be read. In the mmapv1 engine, for example, the location information is "file id + file offset"; in the wiredtiger storage engine (a KV storage engine), the location information is a key generated by wiredtiger when the document is stored, through which the corresponding document can be accessed. For convenience, this article uniformly uses pos (short for position) to denote this location information.
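The relationship between an index, its keys, and pos can be sketched as follows. This is a toy model, not actual storage-engine code; the data and names are illustrative only:

```javascript
// Toy model: documents live at storage-engine locations (pos), and an index
// is a key-sorted list of [key, pos] entries pointing into that storage.
const docs = {
  "pos1": { name: "jack", age: 19 },
  "pos2": { name: "rose", age: 20 },
  "pos3": { name: "jack", age: 18 },
};

// Without an index, a query must examine every document (a COLLSCAN).
function collscan(field, value) {
  return Object.entries(docs)
    .filter(([, d]) => d[field] === value)
    .map(([pos]) => pos);
}

// An index on "name": entries kept in key order, each holding a pos.
// Because entries are sorted, all matches for a key are adjacent.
const nameIndex = [
  ["jack", "pos1"],
  ["jack", "pos3"],
  ["rose", "pos2"],
];

function ixscan(index, value) {
  return index.filter(([k]) => k === value).map(([, pos]) => pos);
}

console.log(collscan("name", "jack")); // examines every document
console.log(ixscan(nameIndex, "jack")); // touches only matching entries
```

Both calls return the same pos list, but the index scan never looks at non-matching documents, which is the whole point of an index.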

What is a compound index?

A compound index (Compound Index) combines multiple keys into a single index, which speeds up queries that match on multiple keys. A simple example helps to understand compound indexes.
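Conceptually, a compound index such as {name:1, age:1} stores one entry per document, keyed by the (name, age) pair and sorted by name first, then age. A minimal sketch (illustrative data and names, not engine internals):

```javascript
// Compound-index entries: sorted by name, then age; each entry holds a pos.
const entries = [
  { key: ["wang", "15"], pos: "p2" },
  { key: ["zhang", "14"], pos: "p3" },
  { key: ["zhang", "15"], pos: "p1" },
];

// An equality query on both keys narrows to one contiguous run of entries.
function lookup(name, age) {
  return entries
    .filter(e => e.key[0] === name && e.key[1] === age)
    .map(e => e.pos);
}

// Because name is the leading key, the same index also serves name-only
// queries (a prefix of the compound key)...
function lookupByName(name) {
  return entries.filter(e => e.key[0] === name).map(e => e.pos);
}

console.log(lookup("zhang", "14")); // ["p3"]
console.log(lookupByName("zhang")); // ["p3", "p1"]
```

Note the asymmetry: an age-only query cannot use this index efficiently, because entries with the same age are scattered throughout the name ordering.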

The students set is as follows:


db.students.find().pretty()
{
 "_id" : ObjectId("5aa7390ca5be7272a99b042a"),
 "name" : "zhang",
 "age" : "15"
}
{
 "_id" : ObjectId("5aa7393ba5be7272a99b042b"),
 "name" : "wang",
 "age" : "15"
}
{
 "_id" : ObjectId("5aa7393ba5be7272a99b042c"),
 "name" : "zhang",
 "age" : "14"
}

Single-key indexes were created on name and age respectively (_id has a built-in index):


db.students.getIndexes()
[
 {
 "v" : 1,
 "key" : {
 "name" : 1
 },
 "name" : "name_1",
 "ns" : "test.students"
 },
 {
 "v" : 1,
 "key" : {
 "age" : 1
 },
 "name" : "age_1",
 "ns" : "test.students"
 }
]

When querying on multiple keys, the execution can be analyzed with explain() (only winningPlan is shown):


db.students.find({name:"zhang",age:"14"}).explain()
"winningPlan":
{
 "stage": "FETCH",
 "filter":
 {
  "name":
  {
   "$eq": "zhang"
  }
 },
 "inputStage":
 {
  "stage": "IXSCAN",
  "keyPattern":
  {
   "age": 1
  },
  "indexName": "age_1",
  "isMultiKey": false,
  "isUnique": false,
  "isSparse": false,
  "isPartial": false,
  "indexVersion": 1,
  "direction": "forward",
  "indexBounds":
  {
   "age": [
    "[\"14\", \"14\"]"
   ]
  }
 }
}

As winningPlan shows, this query runs in two stages: IXSCAN, the index scan, which uses the age index; and FETCH, which loads documents by pos and then has to filter them on name.

Create a compound index on name and age:


db.students.createIndex({name:1,age:1})
db.students.getIndexes()
[
 {
 "v" : 1,
 "key" : {
 "name" : 1,
 "age" : 1
 },
 "name" : "name_1_age_1",
 "ns" : "test.students"
 }
]

With the compound index in place, the same query executes differently:


db.students.find({name:"zhang",age:"14"}).explain()
"winningPlan":
{
 "stage": "FETCH",
 "inputStage":
 {
  "stage": "IXSCAN",
  "keyPattern":
  {
   "name": 1,
   "age": 1
  },
  "indexName": "name_1_age_1",
  "isMultiKey": false,
  "isUnique": false,
  "isSparse": false,
  "isPartial": false,
  "indexVersion": 1,
  "direction": "forward",
  "indexBounds":
  {
   "name": [
    "[\"zhang\", \"zhang\"]"
   ],
   "age": [
    "[\"14\", \"14\"]"
   ]
  }
 }
}

As winningPlan shows, the stages are unchanged: IXSCAN followed by FETCH. However, IXSCAN now uses the compound index on name and age, and FETCH loads documents by pos without any additional filtering.
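The difference between the two plans can be sketched by counting how many documents each one fetches. This is a simplified simulation of the two explain plans above, not MongoDB code; the query {name:"zhang", age:"15"} is used because on this tiny data set it makes the filtering visible:

```javascript
// The students data, keyed by pos (illustrative locations).
const docs = [
  { pos: "p1", name: "zhang", age: "15" },
  { pos: "p2", name: "wang",  age: "15" },
  { pos: "p3", name: "zhang", age: "14" },
];
const byPos = Object.fromEntries(docs.map(d => [d.pos, d]));

// age_1 single-key index and name_1_age_1 compound index (sorted entries).
const ageIndex = docs.map(d => [d.age, d.pos]).sort();
const compound = docs.map(d => [[d.name, d.age], d.pos]).sort();

// Plan 1: IXSCAN on age_1, then FETCH every candidate and filter on name.
function plan1(name, age) {
  const fetched = ageIndex
    .filter(([k]) => k === age)
    .map(([, pos]) => byPos[pos]);
  return {
    docsExamined: fetched.length,
    nReturned: fetched.filter(d => d.name === name).length,
  };
}

// Plan 2: IXSCAN on the compound index; FETCH only exact matches, no filter.
function plan2(name, age) {
  const fetched = compound
    .filter(([k]) => k[0] === name && k[1] === age)
    .map(([, pos]) => byPos[pos]);
  return { docsExamined: fetched.length, nReturned: fetched.length };
}

console.log(plan1("zhang", "15")); // { docsExamined: 2, nReturned: 1 }
console.log(plan2("zhang", "15")); // { docsExamined: 1, nReturned: 1 }
```

With three documents the gap is trivial, but the ratio docsExamined/nReturned is exactly what blows up at scale, as the case below shows.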

The data set in this example is too small to show the problem. In practice, however, when the data volume is large and IXSCAN returns many index entries, the filtering done in FETCH can be very time-consuming. Here is a real case.

Locate MongoDB performance issues

As the volume of incoming error data has grown, Fundebug has now processed a cumulative 350 million error events, which keeps putting performance pressure on our services, especially the MongoDB cluster.

For the production database, configure the profiler to record MongoDB performance data. After executing the following command, every database read or write that takes longer than 1s will be recorded.


db.setProfilingLevel(1,1000)

Query the data recorded by the profiler, and you will find a very slow query on the events collection:


db.system.profile.find().pretty()
{
 "op" : "command",
 "ns" : "fundebug.events",
 "command" : {
 "count" : "events",
 "query" : {
 "createAt" : {
 "$lt" : ISODate("2018-02-05T20:30:00.073Z")
 },
 "projectId" : ObjectId("58211791ea2640000c7a3fe6")
 }
 },
 "keyUpdates" : 0,
 "writeConflicts" : 0,
 "numYield" : 1414,
 "locks" : {
 "Global" : {
 "acquireCount" : {
 "r" : NumberLong(2830)
 }
 },
 "Database" : {
 "acquireCount" : {
 "r" : NumberLong(1415)
 }
 },
 "Collection" : {
 "acquireCount" : {
 "r" : NumberLong(1415)
 }
 }
 },
 "responseLength" : 62,
 "protocol" : "op_query",
 "millis" : 28521,
 "execStats" : {
 },
 "ts" : ISODate("2018-03-07T20:30:59.440Z"),
 "client" : "192.168.59.226",
 "allUsers" : [ ],
 "user" : ""
}

There are hundreds of millions of documents in the events collection, so a slow count is not surprising. According to the profile data, this query took 28.5s, which is absurdly long. In addition, numYield reached 1414, which should be the direct cause of the slowness. The MongoDB documentation explains numYield as follows:

The number of times the operation yielded to allow other operations to complete. Typically, operations yield when they need access to data that MongoDB has not yet fully read into memory. This allows other operations that have data in memory to complete while MongoDB reads in data for the yielding operation.

In other words, this query spent a great deal of time reading from disk, many times over. Presumably the index was the problem.

Use explain() to analyze this query (only executionStats is shown):


db.events.explain("executionStats").count({"projectId" : ObjectId("58211791ea2640000c7a3fe6"),createAt:{"$lt" : ISODate("2018-02-05T20:30:00.073Z")}})
"executionStats":
{
 "executionSuccess": true,
 "nReturned": 20853,
 "executionTimeMillis": 28055,
 "totalKeysExamined": 28338,
 "totalDocsExamined": 28338,
 "executionStages":
 {
  "stage": "FETCH",
  "filter":
  {
   "createAt":
   {
    "$lt": ISODate("2018-02-05T20:30:00.073Z")
   }
  },
  "nReturned": 20853,
  "executionTimeMillisEstimate": 27815,
  "works": 28339,
  "advanced": 20853,
  "needTime": 7485,
  "needYield": 0,
  "saveState": 1387,
  "restoreState": 1387,
  "isEOF": 1,
  "invalidates": 0,
  "docsExamined": 28338,
  "alreadyHasObj": 0,
  "inputStage":
  {
   "stage": "IXSCAN",
   "nReturned": 28338,
   "executionTimeMillisEstimate": 30,
   "works": 28339,
   "advanced": 28338,
   "needTime": 0,
   "needYield": 0,
   "saveState": 1387,
   "restoreState": 1387,
   "isEOF": 1,
   "invalidates": 0,
   "keyPattern":
   {
    "projectId": 1
   },
   "indexName": "projectId_1",
   "isMultiKey": false,
   "isUnique": false,
   "isSparse": false,
   "isPartial": false,
   "indexVersion": 1,
   "direction": "forward",
   "indexBounds":
   {
    "projectId": [
     "[ObjectId('58211791ea2640000c7a3fe6'), ObjectId('58211791ea2640000c7a3fe6')]"
    ]
   },
   "keysExamined": 28338,
   "dupsTested": 0,
   "dupsDropped": 0,
   "seenInvalidated": 0
  }
 }
}

As can be seen, the events collection had no compound index on projectId and createAt, so the IXSCAN stage used the projectId index, with an nReturned of 28338. The FETCH stage then had to filter on createAt; its nReturned is 20853, meaning 7485 documents were filtered out. Moreover, the executionTimeMillisEstimate values of IXSCAN and FETCH are 30ms and 27815ms respectively: essentially all the time was spent in FETCH, which should be caused by reading documents from disk.

Create a compound index

Not having created a compound index on projectId and createAt was an embarrassing mistake. Time to fix it.


db.events.createIndex({projectId:1,createAt:-1},{background: true})

Building an index in a production environment is best done at night. This command took about 7 hours in total! background is set to true so that the database remains available and other operations on it are not blocked. However, the command occupies the terminal for the whole duration, and CTRL+C must not be pressed, otherwise the index build will be terminated.
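While such a build is running, its progress can be checked from a second mongo shell session. A sketch; the exact message text in the "msg" field varies across MongoDB versions, so treat the string matched here as indicative:

```javascript
// In another mongo shell session: list in-progress operations and keep
// those whose status message mentions an index build.
db.currentOp(true).inprog.filter(function (op) {
  return op.msg && op.msg.indexOf("Index Build") === 0;
});
```

Each matching operation carries progress information in its "msg" field, which makes a 7-hour build much less nerve-wracking.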

With the compound index created, the same query is much faster (only executionStats is shown):


db.events.explain("executionStats").count({"projectId" : ObjectId("58211791ea2640000c7a3fe6"),createAt:{"$lt" : ISODate("2018-02-05T20:30:00.073Z")}})
"executionStats":
{
 "executionSuccess": true,
 "executionTimeMillis": 46,
 "totalDocsExamined": 0,
 ...
}

The count operation uses the compound index on projectId and createAt, so it is very fast: it takes only 46ms, a performance improvement of nearly 600x! Comparing the results before and after the compound index, totalDocsExamined dropped from 28338 to 0: with the compound index, the count no longer needs to fetch documents at all; scanning the index is enough, so there are no disk reads for documents, which is naturally much faster.
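Why the count can skip documents entirely can be sketched as follows. With a compound index on (projectId, createAt), all matching entries form one contiguous range of the index, and counting entries answers the query by itself. Again a toy model with illustrative data, not engine code:

```javascript
// Compound-index entries: [projectId, createAt, pos], sorted by projectId
// first. Timestamps are ISO strings, so string comparison matches time order.
const index = [
  ["A", "2018-02-04T10:00", "p2"],
  ["A", "2018-02-05T19:00", "p1"],
  ["B", "2018-02-03T08:00", "p3"],
];

// COUNT_SCAN-style counting: tally matching index entries directly.
// There is no FETCH stage, so no document is ever loaded (docsExamined: 0).
function countScan(projectId, before) {
  return index.filter(([p, t]) => p === projectId && t < before).length;
}

console.log(countScan("A", "2018-02-05T20:30")); // 2
```

This is exactly what totalDocsExamined: 0 in the explain output means: the query was answered from the index alone.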

Reference

MongoDB documentation: Compound Indexes
