The powerful statistics framework in MongoDB Aggregation USES instance resolution

  • 2020-05-15 02:28:49
  • OfStack

I heard that Aggregation is used a lot in the project, so I will do more exercises for it.

Basic operations include:

The & # 8226; $project - fields can be extracted from subdocuments, and fields can be renamed

The & # 8226; $match - enables lookup

The & # 8226; $limit - accepts one number, n, and returns the first n documents in the result set.

The & # 8226; $skip - accepts one number, n, and discards the first n documents in the result set. It is inefficient and will still traverse the previous n documents.

The & # 8226; $unwind - you can cut a document containing an array into several parts. For example, if there is an array field A in your document and 10 elements in A, then 10 documents will be generated after $unwind processing. Only the field A is different from these documents

The & # 8226; $group - statistics operation, also provides 1 series of subcommands

� $avg, $sum...

The & # 8226; $sort - sorting

Python article
Experiment 1. Student data statistics
1. Generate student data:


#!/usr/bin/env python
# coding=utf-8
from pymongo import MongoClient
from random import randint
name1 = ["yang ", "li ", "zhou "]
name2 = [
  "chao",
  "hao",
  "gao",
  "qi gao",
  "hao hao",
  "gao gao",
  "chao hao",
  "ji gao",
  "ji hao",
  "li gao",
  "li hao",
]
provinces = [
  "guang dong",
  "guang xi",
  "shan dong",
  "shan xi",
  "he nan"
]
client = MongoClient('localhost', 27017)
db = client.student
sm = db.smessage
sm.remove()
for i in range(1, 100):
  name = name1[randint(0, 2)] + name2[randint(0, 10)]
  province = provinces[randint(0, 4)]
  new_student = {
    "name": name,
    "age": randint(1, 30),
    "province": province,
    "subject": [
      {"name": "chinese", "score": randint(0, 100)},
      {"name": "math", "score": randint(0, 100)},
      {"name": "english", "score": randint(0, 100)},
      {"name": "chemic", "score": randint(0, 100)},
    ]}
  print new_student
  sm.insert_one(new_student)

print sm.count()

Ok, so now we have 100 pieces of student data in the database.

Now I want to get the average age of guangdong students, enter in mongo console:

It's even easier when you think about the average age of all the provinces:


db.smessage.aggregate(
{$match: {province: "guang dong"}}
)

{ "_id" : "guang xi", "age" : 15.19047619047619 }
{ "_id" : "guang dong", "age" : 16.05263157894737 }
{ "_id" : "shan dong", "age" : 17.44 }
{ "_id" : "he nan", "age" : 20 }
{ "_id" : "shan xi", "age" : 16.41176470588235 }

If you want to get the average scores of all subjects in guangdong province:


db.smessage.aggregate(
{$match: {province: "guang dong"}},
{$unwind: "$subject"},
{$group: { _id: {province:"$province",sujname:"$subject.name"}, per:{$avg:"$subject.score"}}}
)

Plus sort:


db.smessage.aggregate(
{$match: {province: "guang dong"}},
{$unwind: "$subject"},
{$group: { _id: {province:"$province",sujname:"$subject.name"}, per:{$avg:"$subject.score"}}},
{$sort:{per:1}}
)

Experiment 2. Search for the water king
There is a collection of magazine articles, and you may want to find out which author has published the most articles. Assume that each article is saved as a document in MongoDB.

1. Insert data


#!/usr/bin/env python
# coding=utf-8
from pymongo import MongoClient
from random import randint


name = [
  'yangx',
  'yxxx',
  'laok',
  'kkk',
  'ji',
  'gaoxiao',
  'laoj',
  'meimei',
  'jj',
  'manwang',
]

title = [
  '123',
  '321',
  '12',
  '21',
  'aaa',
  'bbb',
  'ccc',
  'sss',
  'aaaa',
  'cccc',
]

client = MongoClient('localhost', 30999)
db = client.test
bbs = db.bbs
bbs.remove()
for i in range(1, 10000):
  na = name[randint(0, 9)]
  ti = title[randint(0, 9)]
  newcard = {
    'author': na,
    'title': ti,
  }
  bbs.insert_one(newcard)

print bbs.count()

Now we have 10,000 articles.

2. Project the author field with $project


{"$project": {"author":1}}

This syntax is similar to a field selector in a query: you can select the field you want to project by specifying "fieldname" : 1, or you can exclude unwanted fields by specifying "fieldname":0.

After performing this "$project" operation, each document in the result set is represented as {"_id" : id, "author" : "authorName"}. These results will only exist in memory and will not be written to disk.

3. Group author names with group


{"group":{"_id":"$author","count":{"$sum":1}}}

This will sort the authors by name, and for every time an author's name appears, the author's "count" is added by 1.

Here you first specify the field "author" that you want to group. This is specified by "_id" : "$author". You can think of this operation as: after this operation is completed, there is only one result document for each author, so "author" becomes the unique 1 identifier of the document ("_id").

The second field means adding 1 to the "count" field for each document in the group. Note that there is no "count" field in the newly added document; This "$group" creates a new field.

At the end of this step, each document in the result set will look like this: {"_id" : "authorName", "count" : articleCount}.

4. Sort by sort


{"$sort" : {"count" : -1}}

This will sort the documents in the result set in descending order according to the "count" field.

5. Limit the results to the first five documents


{"$limit" : 5}

This operation limits the final return result to the first five documents in the current result.
When you actually run MongoDB, you pass each of these operations to the aggregate() function:


> db.articles.aggregate({"$project" : {"author" : 1}},
... {"$group" : {"_id" : "$author", "count" : {"$sum" : 1}}},
... {"$sort" : {"count" : -1}},
... {"$limit" : 5}
... )

aggregate() returns an array of documents containing the top five published authors.


db.smessage.aggregate(
{$match: {province: "guang dong"}}
)

{ "_id" : "guang xi", "age" : 15.19047619047619 }
{ "_id" : "guang dong", "age" : 16.05263157894737 }
{ "_id" : "shan dong", "age" : 17.44 }
{ "_id" : "he nan", "age" : 20 }
{ "_id" : "shan xi", "age" : 16.41176470588235 }

0 Java article

I built some data in db (the data is generated randomly, it can be used) without index, and the document structure is as follows:

Document structure:


 {
  "_id" : ObjectId("509944545"),
  "province" : " hainan ",
  "age" : 21,
  "subjects" : [
  {
  "name" : " Chinese language and literature ",
  "score" : 53
  },
  {
  "name" : " mathematics ",
  "score" : 27
  },
  {
  "name" : " English ",
  "score" : 35
  }
   ],
  "name" : " jaimie "
 }

The next two functions are to be implemented:

The average age of students in Shanghai was calculated Statistics each province each subject average score

Let's do the next 11

The average age of students in Shanghai was calculated

In terms of this requirement, there are several steps to realize the function: 1. Find out the students in Shanghai; 2. Calculate the average age (of course, you can also find out the average of all provinces and then find out the average of Shanghai). So the idea is clear

First, put on $match and take out the students in Shanghai


{$match:{'province':' Shanghai '}}

Then use $group to calculate the average age


db.smessage.aggregate(
{$match: {province: "guang dong"}}
)

{ "_id" : "guang xi", "age" : 15.19047619047619 }
{ "_id" : "guang dong", "age" : 16.05263157894737 }
{ "_id" : "shan dong", "age" : 17.44 }
{ "_id" : "he nan", "age" : 20 }
{ "_id" : "shan xi", "age" : 16.41176470588235 }

3

$avg is a subcommand of $group for averaging, and similarly $sum, $max...
The above two commands are equivalent to


db.smessage.aggregate(
{$match: {province: "guang dong"}}
)

{ "_id" : "guang xi", "age" : 15.19047619047619 }
{ "_id" : "guang dong", "age" : 16.05263157894737 }
{ "_id" : "shan dong", "age" : 17.44 }
{ "_id" : "he nan", "age" : 20 }
{ "_id" : "shan xi", "age" : 16.41176470588235 }

4

Below is the Java code


db.smessage.aggregate(
{$match: {province: "guang dong"}}
)

{ "_id" : "guang xi", "age" : 15.19047619047619 }
{ "_id" : "guang dong", "age" : 16.05263157894737 }
{ "_id" : "shan dong", "age" : 17.44 }
{ "_id" : "he nan", "age" : 20 }
{ "_id" : "shan xi", "age" : 16.41176470588235 }

5

Output results:


db.smessage.aggregate(
{$match: {province: "guang dong"}}
)

{ "_id" : "guang xi", "age" : 15.19047619047619 }
{ "_id" : "guang dong", "age" : 16.05263157894737 }
{ "_id" : "shan dong", "age" : 17.44 }
{ "_id" : "he nan", "age" : 20 }
{ "_id" : "shan xi", "age" : 16.41176470588235 }

6

So that's the end of the project. Let's look at another requirement

Statistics each province each subject average score

To start with a more structured database document, subjects is an array that needs to be 'split' and then counted

The main processing steps are as follows:

2. Divide the rent according to province and subject and calculate the average score of each subject

$unwind array


{$unwind:'$subjects'}

Group according to province and subject, and calculate the average score


{$group:{
   _id:{
     subjname: " $subjects.name " ,  //  The specified group Field of 1 subjects.name,  Rename as  subjname
     province:'$province'     //  The specified group Field of 1 province,  Rename as  province( Did not change )
   },
   AvgScore:{
    $avg: " $subjects.score "     //  right  subjects.score  For the average 
   }
 }

java code is as follows:


db.smessage.aggregate(
{$match: {province: "guang dong"}}
)

{ "_id" : "guang xi", "age" : 15.19047619047619 }
{ "_id" : "guang dong", "age" : 16.05263157894737 }
{ "_id" : "shan dong", "age" : 17.44 }
{ "_id" : "he nan", "age" : 20 }
{ "_id" : "shan xi", "age" : 16.41176470588235 }

9

The output


{ "serverUsed" : "localhost/127.0.0.1:27017" , 
  "result" : [ 
   { "_id" : { "subjname" : " English " , "province" : " hainan "} , "AvgScore" : 58.1} , 
   { "_id" : { "subjname" : " mathematics " , "province" : " hainan "} , "AvgScore" : 60.485} ,
   { "_id" : { "subjname" : " Chinese language and literature " , "province" : " jiangxi "} , "AvgScore" : 55.538} , 
   { "_id" : { "subjname" : " English " , "province" : " Shanghai "} , "AvgScore" : 57.65625} , 
   { "_id" : { "subjname" : " mathematics " , "province" : " guangdong "} , "AvgScore" : 56.690} , 
   { "_id" : { "subjname" : " mathematics " , "province" : " Shanghai "} , "AvgScore" : 55.671875} ,
   { "_id" : { "subjname" : " Chinese language and literature " , "province" : " Shanghai "} , "AvgScore" : 56.734375} , 
   { "_id" : { "subjname" : " English " , "province" : " yunnan "} , "AvgScore" : 55.7301 } ,
   .
   .
   .
   .
   "ok" : 1.0
 }

The statistics are over... Wait a moment, it seems a little too rough, although the statistics come out, but can not see at all, the same province of the subjects are not 1. �

Let's do a little bit of reinforcement,

Side quests: count the subject scores of the same province to 1 (i.e., expect 'province':'xxxxx', avgscores:[{'xxx':xxx},...] In this form)

There is one thing to be done. On the basis of the above statistical results, the average score and score are first rubbed up to 1 with $project, which is as follows


{ "subjinfo" : { "subjname" : " English " ,"AvgScores" : 58.1 } ,"province" : " hainan " }

Then, according to the province group, divide each subject into an average of push to 1 block. The command is as follows:

$project reconstructs group results


{$project : {province:"$_id.province", subjinfo:{"subjname":"$_id.subjname", "avgscore":"$AvgScore"}}

$is grouped again using group


{$group:{_id:"$province", avginfo:{$push:"$subjinfo"}}}

The java code is as follows:


Mongo m = new Mongo("localhost", 27017);
DB db = m.getDB("test");
DBCollection coll = db.getCollection("student");
       
/*  create  $unwind  operation ,  Used to shard arrays */
DBObject unwind = new BasicDBObject("$unwind", "$subjects");
       
/* Group operation */
DBObject groupFields = new BasicDBObject("_id", new BasicDBObject("subjname", "$subjects.name").append("province", "$province"));
groupFields.put("AvgScore", new BasicDBObject("$avg", "$subjects.scores"));
DBObject group = new BasicDBObject("$group", groupFields);
       
/* Reshape Group Result*/
DBObject projectFields = new BasicDBObject();
projectFields.put("province", "$_id.province");
projectFields.put("subjinfo", new BasicDBObject("subjname","$_id.subjname").append("avgscore", "$AvgScore"));
DBObject project = new BasicDBObject("$project", projectFields);
       
/*  The results push to 1 since */
DBObject groupAgainFields = new BasicDBObject("_id", "$province");
groupAgainFields.put("avginfo", new BasicDBObject("$push", "$subjinfo"));
DBObject reshapeGroup = new BasicDBObject("$group", groupAgainFields);
 
/*  To view Group The results of  */
AggregationOutput output = coll.aggregate(unwind, group, project, reshapeGroup);
System.out.println(output.getCommandResult());

The results are as follows:


{ "serverUsed" : "localhost/127.0.0.1:27017" , 
 "result" : [ 
    { "_id" : " liaoning " , "avginfo" : [ { "subjname" : " mathematics " , "avgscore" : 56.46666666666667} , { "subjname" : " English " , "avgscore" : 52.093333333333334} , { "subjname" : " Chinese language and literature " , "avgscore" : 50.53333333333333}]} , 
    { "_id" : "4 sichuan " , "avginfo" : [ { "subjname" : " mathematics " , "avgscore" : 52.72727272727273} , { "subjname" : " English " , "avgscore" : 55.90909090909091} , { "subjname" : " Chinese language and literature " , "avgscore" : 57.59090909090909}]} , 
    { "_id" : " chongqing " , "avginfo" : [ { "subjname" : " Chinese language and literature " , "avgscore" : 56.077922077922075} , { "subjname" : " English " , "avgscore" : 54.84415584415584} , { "subjname" : " mathematics " , "avgscore" : 55.33766233766234}]} , 
    { "_id" : " anhui " , "avginfo" : [ { "subjname" : " English " , "avgscore" : 55.458333333333336} , { "subjname" : " mathematics " , "avgscore" : 54.47222222222222} , { "subjname" : " Chinese language and literature " , "avgscore" : 52.80555555555556}]} 
  .
  .
  .
  ] , "ok" : 1.0}


Related articles: