Detailed analysis of parameter limits and thresholds in MongoDB

  • 2020-11-25 07:40:02
  • OfStack

Preface

Today, while searching for Spark Mongo material, I stumbled across some MongoDB limits by accident. These were all new to me, so I made a special record of them.

Without further ado, let's take a look at the details.

1. BSON document

• BSON document size: a single document can be at most 16 MB; documents larger than 16 MB must be stored in GridFS (a quick size check follows this list).
• Document nesting depth: the structure (tree) of a BSON document can be nested at most 100 levels deep.
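As a quick sanity check, the legacy mongo shell can report a document's BSON size with Object.bsonsize(); a minimal sketch, where the collection name and filter are hypothetical:


// Fetch one document and print its BSON size in bytes.
var doc = db.books.findOne({ id: "9472" });
print(Object.bsonsize(doc) + " bytes"); // must stay below 16 * 1024 * 1024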

2. Namespaces

• Collection namespace: the namespace is <database>.<collection>, with a maximum length of 120 bytes. This dictates that database and collection names should not be too long (a quick check follows this list).
• Number of namespaces: for the MMAPv1 engine, the maximum is about 24,000, with one namespace per collection and one per index; there is no such limit for the WiredTiger engine.
• Namespace file size: for the MMAPv1 engine, the default size is 16 MB, which can be changed in the configuration file; WiredTiger is not subject to this restriction.
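Since the namespace is just the two names joined by a dot, its length is easy to check in the shell (the collection name here is hypothetical):


// "<database>.<collection>" must not exceed 120 bytes.
var ns = db.getName() + "." + "press_releases";
print(ns + " -> " + ns.length + " characters");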

3. Indexes

• Index key: each index entry must not exceed 1024 bytes; if the length of an index key exceeds this value, the write operation will fail.
• Number of indexes: each collection may have at most 64 indexes.
• Index name: we can set a name for an index; the final full name is <database>.<collection>.$<index name>, at most 128 bytes. The default is a combination of the field names and index types, but we can explicitly specify the index name when creating the index; see the createIndex() method and the sketch after this list.
• A compound index can contain at most 31 fields.
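For example, here is a minimal sketch of naming an index explicitly so that the full <database>.<collection>.$<index name> stays well under 128 bytes (collection and field names are hypothetical):


// The default name would be "publish_date_1_name_1"; we pick a shorter one.
db.books.createIndex(
 { publish_date: 1, name: 1 },
 { name: "pub_name" }
);
db.books.getIndexes(); // verify the name and key pattern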

4. Data

• Capped collections: if you specify the maximum number of documents when creating a capped collection, that maximum cannot exceed 2^32; if you do not specify one, there is no limit (a creation sketch follows this list).
• Database size: for the MMAPv1 engine, each database can hold no more than 16,000 data files, i.e. the maximum amount of data for a single database is 32 TB; this can be limited to 8 TB by setting "smallFiles".
• Data size: for the MMAPv1 engine, a single mongod cannot manage a data set that exceeds the maximum virtual memory address space, so each mongod instance on Linux (64-bit) can maintain at most 64 TB of data. The WiredTiger engine does not have this limitation.
• Number of collections per database: for the MMAPv1 engine, the number of collections per database depends on the namespace file size (used to hold the namespaces) and the number of indexes per collection; the final total size must not exceed the namespace file size (16 MB by default). The WiredTiger engine is not subject to this restriction.
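Creating a capped collection with an explicit document cap looks like this; a minimal sketch with hypothetical names and sizes:


// "max" is the optional document-count cap; it must stay below 2^32.
db.createCollection("eventlog", {
 capped: true,
 size: 64 * 1024 * 1024, // byte size of the capped collection (required)
 max: 100000 // optional maximum number of documents
});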

5. Replica Sets

A maximum of 50 members is supported per replica set, of which at most 7 can be voting members. If the oplog size is not explicitly specified, it will be at most 50 GB.
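To check the oplog size actually in effect on a replica-set member, the shell helper below is handy:


// Prints the configured oplog size and the time window it currently covers.
rs.printReplicationInfo();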

6. Sharded Clusters

• The group aggregation function is not available in sharding mode; use mapReduce or the aggregate method instead.
• Covered queries: the fields in the query condition must be part of an index, and the returned result must contain only fields in that index. For a sharded cluster, if the query does not include the shard key, the index cannot cover the query. The exception is _id: although _id is not the shard key, if the query condition contains only _id and the returned result asks only for the _id field, a covered query can still be used; this does not seem very useful (unless you are checking whether a document with that _id exists).
• If sharding is turned on for a collection that already holds data, that data must not exceed 256 GB. Once a collection is sharded, it can store any amount of data.
• For a sharded collection, an update or remove of a single document (operation options multi:false or justOne) must specify the shard key or the _id field; otherwise an error is thrown.
• Unique indexes: unique indexes are not supported across shards, unless the shard key is the leftmost prefix of the unique index. For example, if the shard key of a collection is {"zipcode": 1, "name": 1}, then a unique index on that collection must have zipcode and name as its leftmost prefix, e.g. an index on {"zipcode": 1, "name": 1, "company": 1} created with {unique: true} (see the sketch after this list).
• Maximum number of documents in a chunk eligible for migration: if a chunk holds more than 250,000 documents (with the default chunk size of 64 MB), or more than 1.3 * (maximum chunk size (a configuration parameter) / average document size), the chunk cannot be moved (whether by the balancer or by human intervention); it must wait to be split before it can be moved.
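A minimal sketch of the unique-index rule, using the zipcode/name example above (database and collection names are hypothetical):


sh.enableSharding("mydb"); // sharding must be enabled on the database first
sh.shardCollection("mydb.people", { zipcode: 1, name: 1 });
// A unique index is allowed only because the shard key is its leftmost prefix.
db.getSiblingDB("mydb").people.createIndex(
 { zipcode: 1, name: 1, company: 1 },
 { unique: true }
);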

7. Shard key

• The length of a shard key must not exceed 512 bytes.
• The shard key index can be an ascending index on the shard key, or a compound index that starts with the shard key. It cannot be a multikey index (an index on an array field), a text index, or a geo index.
• The shard key is immutable, and the shard key value in a document cannot be modified at any time. If you need to change the shard key, you need to migrate the data manually: dump the full raw data, then modify it and save it into a new collection.
• A monotonically increasing (or decreasing) shard key limits insert throughput; if _id is the shard key, be aware that _id values generated by ObjectId() are themselves monotonically increasing. With a monotonically increasing shard key, all inserts into the collection land on a single shard node, so that one shard carries the entire cluster's insert load; since a single shard node's resources are limited, the insert throughput of the whole cluster is limited. If the cluster mostly serves reads and updates, this restriction does not matter. To avoid the problem, consider using a hashed shard key (see the sketch after this list) or choose a key that is not monotonically increasing. (Ranged and hashed shard keys each have advantages and disadvantages, which need to be weighed against the query patterns.)
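A sketch of the hashed-shard-key workaround for monotonically increasing keys (database and collection names are hypothetical):


sh.enableSharding("mydb");
// Hashing _id spreads inserts across shards even though ObjectId() values are monotonic.
sh.shardCollection("mydb.events", { _id: "hashed" });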

8. Operations

If MongoDB cannot use an index to sort the documents, then the data participating in the sort must be smaller than 32 MB. Aggregation pipeline: each pipeline stage is limited to 100 MB of memory, and an error occurs if a stage exceeds this limit. To handle larger data sets, turn on the "allowDiskUse" option, which allows pipeline stages to write excess data to temporary files.
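Turning the option on looks like this; the collection and pipeline here are hypothetical:


// Without allowDiskUse, a large $group or $sort stage fails at the ~100 MB limit.
db.books.aggregate(
 [
  { $group: { _id: "$data.publish_date", total: { $sum: 1 } } },
  { $sort: { total: -1 } }
 ],
 { allowDiskUse: true } // lets stages spill excess data to temporary files
);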

9. Naming conventions

• Database names are case-sensitive and cannot exceed 64 characters in length. They must not contain the characters / \ . " $ * < > : | ? or the space character.
• Collection names can begin with an underscore or a letter, but cannot contain the "$" symbol, cannot be an empty string or null, and cannot begin with "system." because that prefix is reserved by the system.
• Document field names cannot contain "." or null, and cannot begin with "$", because "$" is reserved as a reference/operator symbol. (A few examples follow this list.)
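A couple of quick illustrations of these rules in the shell (all names are made up):


db.createCollection("press_releases_2009"); // fine: letters, digits, underscore
// db.createCollection("system.stats"); // rejected: the "system." prefix is reserved
// db.books.insertOne({ "$total": 1 }); // rejected: field name begins with "$"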

Finally, let me record how to query a list nested inside a JSON document. Sample data:


{
 "_id" : ObjectId("5c6cc376a589c200018f7312"),
 "id" : "9472",
 "data" : {
  "name" : "test",
  "publish_date" : "2009-05-15",
  "authors" : [
   {
    "author_id" : 3053,
    "author_name" : "test data"
   }
  ]
 }
}

I want to query by author_id inside authors. The query can be written like this:


// 'books' is a placeholder; substitute your collection's actual name
db.getCollection('books').find({'data.authors.0.author_id': 3053})

The 0 refers to the first element of the array, and the dots descend through the nested structure. However, Spark Mongo cannot import data this way; other methods are needed.
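For reference, dropping the positional 0 makes MongoDB match the condition against any element of the array, which is often what you actually want (the collection name is again a placeholder):


// Matches documents where ANY element of data.authors has author_id 3053.
db.getCollection('books').find({ 'data.authors.author_id': 3053 });
// Equivalent with $elemMatch; useful when several fields must match the same element.
db.getCollection('books').find({ 'data.authors': { $elemMatch: { author_id: 3053 } } });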

Conclusion

