Examples of using the MapReduce programming model in MongoDB

  • 2020-05-10 23:06:35
  • OfStack

Note: the MongoDB used by the author is version 2.4.7.

Example of a word count:

Insert data for word count:


db.data.insert({sentence:'Consider the following map-reduce operations on a collection orders that contains documents of the following prototype'})
db.data.insert({sentence:'I get the following error when I follow the code found in this link'})

To keep the example simple, the data contains no punctuation marks. Enter the following in the mongo shell:


var map = function() {
    var split_result = this.sentence.split(" "); // declared with var so it does not become a global
    for (var i = 0; i < split_result.length; i++) {
        // Remove possible spaces on either side of the word and convert it to lowercase
        var word = split_result[i].replace(/(^\s*)|(\s*$)/g, "").toLowerCase();
        if (word.length != 0) {
            emit(word, 1);
        }
    }
};
var reduce = function(key, values) {
    return Array.sum(values); // add up all the 1s emitted for this word
};
db.data.mapReduce(
    map,
    reduce,
    {out:{inline:1}}
)


The first and second parameters of db.data.mapReduce specify the map and reduce functions, respectively. map is called once for each document in the collection (available as this inside the function), and emit() generates key-value pairs. reduce then combines all the values emitted for the same key into a single value.
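To make this data flow concrete, here is a minimal plain-JavaScript sketch that imitates it outside MongoDB (for example in Node.js). The helper simulateMapReduce and its grouping logic are assumptions made for illustration, not MongoDB's actual implementation. Unlike in the mongo shell, emit is passed to map as an argument instead of being a global, and Array.sum is replaced by an ordinary sum because it is not available outside MongoDB.

```javascript
// Hypothetical helper: a simplified, in-memory imitation of mapReduce.
function simulateMapReduce(docs, map, reduce) {
    var groups = Object.create(null); // key -> array of emitted values

    // Collect every emitted value under its key.
    function emit(key, value) {
        if (!(key in groups)) {
            groups[key] = [];
        }
        groups[key].push(value);
    }

    // Call map once per document, with the document bound to `this`.
    docs.forEach(function (doc) {
        map.call(doc, emit);
    });

    // Call reduce only for keys that received more than one value.
    return Object.keys(groups).sort().map(function (key) {
        var values = groups[key];
        var value = values.length > 1 ? reduce(key, values) : values[0];
        return { _id: key, value: value };
    });
}

// Same idea as the shell version, but emit arrives as a parameter.
function map(emit) {
    this.sentence.split(" ").forEach(function (raw) {
        var word = raw.trim().toLowerCase();
        if (word.length !== 0) {
            emit(word, 1);
        }
    });
}

function reduce(key, values) {
    return values.reduce(function (a, b) { return a + b; }, 0);
}

// Made-up sample documents, shorter than the ones inserted above.
var docs = [
    { sentence: "I get the following error" },
    { sentence: "the following code" }
];
var results = simulateMapReduce(docs, map, reduce);
console.log(results);
```

With these two sample sentences, "following" and "the" each end up with the value 2 and every other word with 1.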

The third parameter, {out:{inline:1}}, tells mapReduce to perform the operation in memory and return the result directly instead of writing it to a collection. The result is as follows:


{
        "results" : [
                {
                        "_id" : "a",
                        "value" : 1
                },
                {
                        "_id" : "code",
                        "value" : 1
                },
                {
                        "_id" : "collection",
                        "value" : 1
                },
                {
                        "_id" : "consider",
                        "value" : 1
                },
                {
                        "_id" : "contains",
                        "value" : 1
                },
                {
                        "_id" : "documents",
                        "value" : 1
                },
                {
                        "_id" : "error",
                        "value" : 1
                },
                {
                        "_id" : "follow",
                        "value" : 1
                },
                {
                        "_id" : "following",
                        "value" : 3
                },
                {
                        "_id" : "found",
                        "value" : 1
                },
                {
                        "_id" : "get",
                        "value" : 1
                },
                {
                        "_id" : "i",
                        "value" : 2
                },
                {
                        "_id" : "in",
                        "value" : 1
                },
                {
                        "_id" : "link",
                        "value" : 1
                },
                {
                        "_id" : "map-reduce",
                        "value" : 1
                },
                {
                        "_id" : "of",
                        "value" : 1
                },
                {
                        "_id" : "on",
                        "value" : 1
                },
                {
                        "_id" : "operations",
                        "value" : 1
                },
                {
                        "_id" : "orders",
                        "value" : 1
                },
                {
                        "_id" : "prototype",
                        "value" : 1
                },
                {
                        "_id" : "that",
                        "value" : 1
                },
                {
                        "_id" : "the",
                        "value" : 4
                },
                {
                        "_id" : "this",
                        "value" : 1
                },
                {
                        "_id" : "when",
                        "value" : 1
                }
        ],
        "timeMillis" : 1,
        "counts" : {
                "input" : 2,
                "emit" : 30,
                "reduce" : 3,
                "output" : 24
        },
        "ok" : 1,
}


The value of results is the output of the MapReduce job, and timeMillis is the time spent in milliseconds. In counts, input is the number of input documents, emit is the number of times emit was called in map, reduce is the number of times reduce was called (reduce is not called for a key that has only one value, so here it ran only for "following", "i", and "the"), and output is the number of output documents.
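The numbers in counts can be checked by hand. The following plain-JavaScript sketch, meant to run outside MongoDB (for example in Node.js), tallies the same two sentences. It assumes that reduce is invoked exactly once for each key with more than one value, which matches this small data set but is a simplification of what the server does in general.

```javascript
var sentences = [
    "Consider the following map-reduce operations on a collection orders that contains documents of the following prototype",
    "I get the following error when I follow the code found in this link"
];

var tally = Object.create(null); // word -> number of times it would be emitted
var emitCount = 0;

sentences.forEach(function (sentence) {
    sentence.split(" ").forEach(function (raw) {
        var word = raw.trim().toLowerCase();
        if (word.length === 0) {
            return;
        }
        tally[word] = (tally[word] || 0) + 1;
        emitCount += 1;
    });
});

var keys = Object.keys(tally);
var counts = {
    input: sentences.length,   // one map call per document
    emit: emitCount,           // one emit per word
    reduce: keys.filter(function (k) { return tally[k] > 1; }).length, // keys with several values
    output: keys.length        // one output document per distinct word
};
console.log(counts); // { input: 2, emit: 30, reduce: 3, output: 24 }
```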

As you can see, _id is no longer generated automatically; it holds the key that was passed to reduce. You can also write the result to a new collection, for example:

db.data.mapReduce( map, reduce, {out:"mr_result"} )

You can then view the contents of the mr_result collection:
db.mr_result.find()

You can also run the MapReduce task with db.runCommand, which gives developers more options, as detailed in [1]. References [2][3][4] provide more comprehensive information about MapReduce. Reference [6] is a Chinese translation of reference [5].
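As a rough sketch of the db.runCommand form, the command document below runs the same word count on the data collection and adds two of the optional fields. The query filter and the no-op finalize function are made up for illustration and are not part of the original example. The block only calls db.runCommand when a db object exists, so it can also be loaded outside the mongo shell without error.

```javascript
var map = function () {
    var words = this.sentence.split(" ");
    for (var i = 0; i < words.length; i++) {
        var word = words[i].replace(/(^\s*)|(\s*$)/g, "").toLowerCase();
        if (word.length != 0) {
            emit(word, 1); // emit is provided by MongoDB when map runs on the server
        }
    }
};

var reduce = function (key, values) {
    return Array.sum(values); // Array.sum is provided by MongoDB's JavaScript environment
};

var command = {
    mapReduce: "data",          // name of the input collection
    map: map,
    reduce: reduce,
    out: { inline: 1 },         // return the result instead of storing it
    // Illustrative assumption: only process documents that have a sentence field.
    query: { sentence: { $exists: true } },
    // Illustrative assumption: a finalize step that leaves each value unchanged.
    finalize: function (key, reducedValue) {
        return reducedValue;
    }
};

// Only run against a server when executed inside the mongo shell.
if (typeof db !== "undefined") {
    printjson(db.runCommand(command));
}
```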

Note that [5] mentions creating threads with ScopedThread(). When I ran new ScopedThread() in the GUI tool Robomongo, I got an error: ReferenceError: ScopedThread is not defined (shell):1

In the mongo shell, however, ScopedThread is available; the error below only complains about a missing argument:


> new ScopedThread()
Sat Mar 22 21:32:36.062 Error: need at least one argument at src/mongo/shell/utils.js:101

If you access MongoDB from another programming language, use that language's own threading facilities when you need threads.

As for MongoDB's MapReduce implementation, I think it would be better if it supported running multiple MR tasks with a smooth transition from one to the next.

