Introduction to MapReduce in MongoDB

  • 2020-05-24 06:25:17
  • OfStack

MongoDB MapReduce

MapReduce is a computing model that, put simply, takes a large amount of work (data), breaks it into pieces and processes each piece (MAP), and then merges the partial results into a final result (REDUCE). The advantage is that once the task has been broken down, the pieces can be computed in parallel on a large number of machines, reducing the overall running time.

That covers the theory of MapReduce. The rest of this article covers its practical use in MongoDB, starting with an example.

Here is an example from the official MongoDB documentation:


> db.things.insert( { _id : 1, tags : ['dog', 'cat'] } );
> db.things.insert( { _id : 2, tags : ['cat'] } );
> db.things.insert( { _id : 3, tags : ['mouse', 'cat', 'dog'] } );
> db.things.insert( { _id : 4, tags : []  } );
> // map function
> map = function(){
...    this.tags.forEach(
...        function(z){
...            emit( z , { count : 1 } );
...        }
...    );
...};
> // reduce function
> reduce = function( key , values ){
...    var total = 0;
...    for ( var i=0; i<values.length; i++ )
...        total += values[i].count;
...    return { count : total };
...};
> db.things.mapReduce(map,reduce,{out:'tmp'})
{
    "result" : "tmp",
    "timeMillis" : 316,
    "counts" : {
        "input" : 4,
        "emit" : 6,
        "output" : 3
    },
    "ok" : 1
}
> db.tmp.find()
{ "_id" : "cat", "value" : { "count" : 3 } }
{ "_id" : "dog", "value" : { "count" : 2 } }
{ "_id" : "mouse", "value" : { "count" : 1 } }

The example is simple: count the number of times each tag appears in a tag system.

Apart from the emit function, everything here is standard JavaScript syntax. The emit function is very important. It can be understood this way: once the map function has run on every document that needs to be processed (mapReduce can filter the documents first, as discussed later), the map phase yields key-values pairs. The key is the first argument passed to emit, and values is an array holding the second arguments of the n emit calls that share that key. The key and the values array are then passed to reduce as its first and second parameters.
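
To make the grouping step concrete, here is a minimal in-memory sketch in plain JavaScript that needs no MongoDB server. The `emit` function and the `groups` map below are stand-ins written for this illustration; they only imitate what happens between map and reduce and are not MongoDB internals.

```javascript
// In-memory stand-in for the grouping done between map and reduce.
// `groups` and this `emit` are illustrative only, not MongoDB internals.
const groups = new Map();

function emit(key, value) {
  if (!groups.has(key)) groups.set(key, []);
  groups.get(key).push(value);
}

// Same documents and map function as in the example above.
const things = [
  { _id: 1, tags: ["dog", "cat"] },
  { _id: 2, tags: ["cat"] },
  { _id: 3, tags: ["mouse", "cat", "dog"] },
  { _id: 4, tags: [] },
];

const map = function () {
  this.tags.forEach(function (z) {
    emit(z, { count: 1 });
  });
};

// map is called with `this` bound to each document in turn.
things.forEach((doc) => map.call(doc));

// Each key now carries the array of values that reduce would receive.
for (const [key, values] of groups) {
  console.log(key, JSON.stringify(values));
}
```

In this sketch the key "cat" ends up with three `{ count: 1 }` values, "dog" with two, and "mouse" with one, which is the input reduce sums up.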

The task of the reduce function is to turn key-values into key-value, that is, to collapse the values array into a single value. When the values array for a key is too large, it is split into many smaller key-values blocks. The reduce function is executed on each block, and the results of the blocks are combined into a new array, which is passed as the second parameter to a further reduce call. Predictably, if the initial values array is very large, the set formed after the first round of block computation may itself be reduced again. This is somewhat like a multi-level merge sort. How many levels are needed depends on the amount of data.

reduce must be able to be called repeatedly, whether its input comes from the map phase or from a previous reduce phase. So the document returned by reduce must be usable as an element of the second parameter of reduce.

(When writing the map function, remember that the second argument of emit becomes an element of the second parameter of the reduce function, and the return value of the reduce function must have the same form as the second argument of emit. The return values of several reduce calls may be gathered into an array and passed to reduce again as its new second parameter.)
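
The re-reduce requirement can be checked in plain JavaScript without a server. The sketch below is only an illustration: the block size of 2 and the `badReduce` function are made up for the demonstration, and MongoDB chooses its own block boundaries.

```javascript
// reduce from the example above: sums the count fields.
function reduce(key, values) {
  let total = 0;
  for (let i = 0; i < values.length; i++) {
    total += values[i].count;
  }
  return { count: total };
}

// Five emitted values for one key, as map would produce them.
const emitted = [1, 2, 3, 4, 5].map(() => ({ count: 1 }));

// One pass over everything.
const direct = reduce("cat", emitted);

// Two passes: reduce each block, then reduce the partial results.
// The block size of 2 is arbitrary, chosen only for illustration.
const blocks = [];
for (let i = 0; i < emitted.length; i += 2) {
  blocks.push(emitted.slice(i, i + 2));
}
const partials = blocks.map((block) => reduce("cat", block));
const rereduced = reduce("cat", partials);

console.log(direct.count, rereduced.count); // prints "5 5"

// A reduce that returns a bare number breaks on the second pass,
// because the partial results no longer have a count field.
function badReduce(key, values) {
  let total = 0;
  for (const v of values) total += v.count;
  return total;
}
const badPartials = blocks.map((block) => badReduce("cat", block));
const broken = badReduce("cat", badPartials);
console.log(Number.isNaN(broken)); // prints "true"
```

Because `reduce` returns the same `{ count: ... }` shape that map emits, reducing in blocks gives the same total as reducing in one pass.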

The parameter list of the mapreduce command is as follows:


db.runCommand(
 { mapreduce : <collection>,
   map : <mapfunction>,
   reduce : <reducefunction>
   [, query : <query filter object>]
   [, sort : <sort the query.  useful for optimization>]
   [, limit : <number of objects to return from collection>]
   [, out : <output-collection name>]
   [, keeptemp: <true|false>]
   [, finalize : <finalizefunction>]
   [, scope : <object where fields go into javascript global scope >]
   [, verbose : true]
 }
);

Or write it like this:

db.collection.mapReduce(
                         <map>,
                         <reduce>,
                         {
                           <out>,
                           <query>,
                           <sort>,
                           <limit>,
                           <keeptemp>,
                           <finalize>,
                           <scope>,
                           <jsMode>,
                           <verbose>
                         }
                       )

1.mapreduce: the collection to be processed by mapreduce
2.map: the map function
3.reduce: the reduce function
4.out: the name of the output collection. If it is not specified, a collection with a random name is created by default (if the out option is used, you do not need to specify keeptemp: true, because it is already implied)
5.query: a filter; only documents that match it are passed to the map function. (query, limit, and sort can be combined freely.)
6.sort: sorts the documents before they are sent to the map function; combined with limit, it can optimize the grouping mechanism
7.limit: the maximum number of documents sent to the map function (without limit, sort alone is not very useful)
8.keeptemp: true or false, indicating whether the output collection is temporary. To keep the collection after the connection is closed, set keeptemp to true. If a script is being executed, the result collection is automatically deleted when the script exits or calls close
9.finalize: a function that is called once with each key and value after map and reduce have finished, and returns the final result. It is the last step of the process, so finalize is a good place to calculate averages, trim arrays, and remove redundant information
10.scope: variables to be used in the JavaScript code; variables defined here are visible in the map, reduce, and finalize functions
11.verbose: a detailed-output option for debugging. Set it to true to see MapReduce in action. You can also call print inside the map, reduce, and finalize functions to write information to the server log.
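
The effect of query, sort, limit, and finalize can be imitated in plain JavaScript. The sketch below is an in-memory illustration under stated assumptions: `runMapReduce` is a hypothetical helper written for this example, the sample documents with `name`, `score`, and `active` fields are made up, and only simple equality queries and single-field numeric sorts are handled. It is not the shell or driver API.

```javascript
// Hypothetical in-memory imitation of a few mapReduce options.
// A teaching sketch only; the server works differently internally.
let groups; // key -> array of emitted values, filled during the map phase

function emit(key, value) {
  if (!groups.has(key)) groups.set(key, []);
  groups.get(key).push(value);
}

function runMapReduce(docs, map, reduce, options = {}) {
  const { query = {}, sort, limit = Infinity, finalize } = options;

  // query: only simple equality on top-level fields is imitated here.
  let input = docs.filter((doc) =>
    Object.entries(query).every(([field, value]) => doc[field] === value)
  );
  // sort: one numeric field, 1 for ascending and -1 for descending.
  if (sort) {
    const [field, direction] = Object.entries(sort)[0];
    input = [...input].sort((a, b) => (a[field] - b[field]) * direction);
  }
  // limit: cap the number of documents sent to map.
  input = input.slice(0, limit);

  groups = new Map();
  input.forEach((doc) => map.call(doc));

  const output = [];
  for (const [key, values] of groups) {
    // As in MongoDB, reduce is skipped for a key with a single value.
    let value = values.length === 1 ? values[0] : reduce(key, values);
    // finalize runs once per key, after all reducing is done.
    if (finalize) value = finalize(key, value);
    output.push({ _id: key, value });
  }
  return output;
}

// Assumed sample data: the name, score, and active fields are made up.
const scores = [
  { name: "a", score: 80, active: true },
  { name: "a", score: 90, active: true },
  { name: "b", score: 70, active: true },
  { name: "b", score: 50, active: false },
];

const map = function () {
  emit(this.name, { sum: this.score, count: 1 });
};
const reduce = function (key, values) {
  let sum = 0;
  let count = 0;
  for (const v of values) {
    sum += v.sum;
    count += v.count;
  }
  return { sum, count };
};
// The average is computed in finalize, not in reduce, so that reduce
// stays safe to call again on its own output.
const finalize = function (key, value) {
  return { sum: value.sum, count: value.count, avg: value.sum / value.count };
};

const activeOnly = runMapReduce(scores, map, reduce, {
  query: { active: true },
  finalize,
});
const topTwo = runMapReduce(scores, map, reduce, {
  sort: { score: -1 },
  limit: 2,
  finalize,
});
console.log(JSON.stringify(activeOnly));
console.log(JSON.stringify(topTwo));
```

With the query, "a" averages 85 over two documents and "b" averages 70 over one. With the descending sort and a limit of 2, only the two highest scores reach map, so the output contains "a" alone.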

The document structure returned by executing the MapReduce function is as follows:


{ result : <collection_name>,
  timeMillis : <job_time>,
  counts : {
       input : <number of objects scanned>,
       emit : <number of times emit was called>,
       output : <number of items in output collection>
  },
  ok : <1_if_ok>
  [, err : <errmsg_if_error>]
}

1.result: the name of the collection that stores the results. It is a temporary collection that is automatically deleted when the connection that ran MapReduce is closed.
2.timeMillis: execution time in milliseconds
3.input: the number of documents that matched and were sent to the map function
4.emit: the number of times emit was called in the map function, that is, the total number of key-value pairs emitted
5.output: the number of documents in the result collection (the counts are very helpful for debugging)
6.ok: whether the operation succeeded; 1 means success
7.err: if the operation failed, the reason may be given here, but in practice the messages are vague and of little help
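
As a sanity check on the counts fields, the numbers reported for the tag example earlier (input 4, emit 6, output 3) can be reproduced with a small in-memory loop. This is only a sketch in plain JavaScript; the `emit` counter and the `grouped` object are stand-ins written for this illustration.

```javascript
// Recompute the counts fields for the tag example with a plain loop.
const things = [
  { _id: 1, tags: ["dog", "cat"] },
  { _id: 2, tags: ["cat"] },
  { _id: 3, tags: ["mouse", "cat", "dog"] },
  { _id: 4, tags: [] },
];

let emitCount = 0;  // incremented on every emit call
const grouped = {}; // key -> array of emitted values

function emit(key, value) {
  emitCount += 1;
  if (!grouped[key]) grouped[key] = [];
  grouped[key].push(value);
}

function map() {
  this.tags.forEach(function (z) {
    emit(z, { count: 1 });
  });
}

things.forEach((doc) => map.call(doc));

const counts = {
  input: things.length,                // documents sent to map
  emit: emitCount,                     // calls to emit
  output: Object.keys(grouped).length, // one output document per distinct key
};
console.log(counts);
```

The document with an empty tags array counts toward input but never calls emit, which is why input is 4 while only 6 values are emitted.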

Java code that runs MapReduce:


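import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.MapReduceOutput;
import com.mongodb.Mongo;

// Uses the legacy MongoDB Java driver classes (Mongo, DB, DBCollection).
public void MapReduce() {
    Mongo mongo = new Mongo("localhost", 27017);
    DB db = mongo.getDB("qimiguangdb");
    DBCollection coll = db.getCollection("collection1");

    // map: emit one count per document, keyed by its name field
    String map = "function() { emit(this.name, {count:1}); }";

    // reduce: sum the counts; the result has the same shape as the emitted value
    String reduce = "function(key, values) {"
            + "var total = 0;"
            + "for (var i = 0; i < values.length; i++) { total += values[i].count; }"
            + "return {count: total}; }";

    // name of the output collection
    String result = "resultCollection";

    // the last argument is the query filter; null means all documents
    MapReduceOutput mapReduceOutput = coll.mapReduce(map, reduce, result, null);
    DBCollection resultColl = mapReduceOutput.getOutputCollection();
    DBCursor cursor = resultColl.find();
    while (cursor.hasNext()) {
        System.out.println(cursor.next());
    }
}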

