Query duplicate data records in MongoDB using aggregate

  • 2020-06-01 11:15:21
  • OfStack

Aggregation (aggregate) in MongoDB is mainly used to process data (for example computing averages and sums) and return the computed result, somewhat similar to count(*) in a SQL statement.
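For instance, an average can be computed with a $group stage and the $avg accumulator (the orders collection and its cust_id and amount fields here are hypothetical, for illustration only):

```javascript
// group orders by customer and compute the average amount per customer
db.orders.aggregate([
  { $group: { _id: "$cust_id", avgAmount: { $avg: "$amount" } } }
])
```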

The aggregate() method

Aggregation in MongoDB uses the aggregate() method.


The basic syntax for the aggregate() method is as follows:

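In the mongo shell, the general form is (COLLECTION_NAME and AGGREGATE_OPERATION are placeholders):

```javascript
db.COLLECTION_NAME.aggregate(AGGREGATE_OPERATION)
```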

As we know, MongoDB is a document database that stores JSON-like documents. Because of this, we often use MongoDB for data access in Node.js. However, because Node.js executes asynchronously, we cannot guarantee that every database save operation is atomic. In other words, if the client fires the same event twice in a row to store data, the data may well be saved twice. In high-concurrency situations, even with very strict validation in your code, such as checking whether the data to be saved already exists before inserting it, there is still a risk of duplicate saves: under asynchronous execution there is no way to guarantee which request is handled first, so requests initiated by the client will not necessarily execute in the order we imagine.

A better solution is to create a unique index on the collection in the Mongo database. In fact, by default MongoDB creates a unique index on the _id field of every collection (this can be disabled). If you want the index to be created automatically from a mongoose Schema in Node.js, you can refer to the following code:

var mongoose = require('mongoose');
var Schema = mongoose.Schema;
var customerSchema = new Schema({
  cname: String,
  cellPhone: String,
  sender: String,
  tag: String,
  behaviour: Number,
  createTime: {
    type: Date,
    default: Date.now
  },
  // the name of this Boolean field was lost in the original; "enabled" is a placeholder
  enabled: {
    type: Boolean,
    default: true
  }
}, {
  versionKey: false
});

customerSchema.index({cname: 1, cellPhone: 1, sender: 1, tag: 1, behaviour: 1}, {unique: true});
module.exports = mongoose.model('customer', customerSchema);

In the model above we defined the structure of the customer collection and created a unique index on the fields cname, cellPhone, sender, tag and behaviour via the index() method, so that when a document duplicating these fields is inserted, the database throws an exception. With mongoose, if the collection already exists and the program is running, then when we modify the model to add the index and restart the app, mongoose will automatically detect and create the index as soon as the model is accessed. Of course, index creation will fail if duplicate data already exists. In that case, we can have the database delete the duplicates automatically by adding the dropDups option when creating the index, for example:

customerSchema.index({cname:1,cellPhone:1,sender:1,tag:1,behaviour:1}, {unique: true, dropDups: true});

However, according to MongoDB's official documentation, this option was removed in version 3.0 with no replacement, so the server no longer deletes duplicate records automatically when creating an index. How, then, do we quickly and efficiently find duplicate records and delete them? First we find the records, and then we delete them using the remove() method. The following aggregation finds records that duplicate the given fields:

db.collection.aggregate([
  { $group: {
      _id: { firstField: "$firstField", secondField: "$secondField" },
      uniqueIds: { $addToSet: "$_id" },
      count: { $sum: 1 }
  } },
  { $match: {
      count: { $gt: 1 }
  } }
])

Replace the value of the _id property to specify the fields you want to check for duplicates. Accordingly, the code in Node.js is as follows:

var deferred = Q.defer();
var group = { firstField: "$firstField", secondField: "$secondField" };
// customer is the mongoose model defined above
customer.aggregate().group({
  _id: group,
  uniqueIds: { $addToSet: '$_id' },
  count: { $sum: 1 }
}).match({ count: { $gt: 1 } }).exec(deferred.makeNodeResolver());
return deferred.promise;

The code above uses Q in place of a plain callback. In Node.js asynchronous programming, Q is a good choice for handling callbacks.

Here is the result:

/* 1 */
{
    "result" : [
        {
            "_id" : {
                "cellPhone" : "15827577345",
                "actId" : ObjectId("5694565fa50fea7705f01789")
            },
            "uniqueIds" : [
                ...
            ],
            "count" : 2.0
        },
        {
            "_id" : {
                "cellPhone" : "18171282716",
                "actId" : ObjectId("566b0d8dc02f61ae18e68e48")
            },
            "uniqueIds" : [
                ...
            ],
            "count" : 2.0
        }
    ],
    "ok" : 1.0
}

As you can see from the results, two groups of data each contain duplicate records, so the returned result array has length 2. The uniqueIds property is an array holding the _id values of the duplicated records, which we can pass to the remove() method to delete the corresponding data.
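A minimal sketch of that deletion step (not the article's own code): for each duplicate group, keep the first _id and collect the rest for removal. The customer model is the one defined earlier; results stands for the aggregation output.

```javascript
// Given the aggregation results, collect the _ids to delete,
// keeping the first record of each duplicate group.
function idsToDelete(results) {
  var ids = [];
  results.forEach(function (doc) {
    // doc.uniqueIds holds all _ids of one duplicate group; keep uniqueIds[0]
    ids = ids.concat(doc.uniqueIds.slice(1));
  });
  return ids;
}

// Then remove the rest in one query (customer is the mongoose model above):
// customer.remove({ _id: { $in: idsToDelete(results) } }, function (err) { /* ... */ });
```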
