MongoDB Query Language Part 2 – Count, Distinct, Group, and Introduction to the Aggregation Framework

What we will cover:

  • Counting documents
  • Selecting distinct values
  • Grouping results
  • Aggregation framework
  • Basic Operators in the Aggregation Framework

In case you missed Part 1 in this series:

  • Make sure you view the guide Load Test Data to load a test dataset so you can follow along with our examples.
  • Download a good MongoDB IDE. I recommend Robo 3T, formerly named Robomongo. https://robomongo.org/   
  • Everything in MongoDB is case sensitive including database names, collection names, users, methods, etc.
  • It is important to familiarize yourself with the structure of a JSON document before you begin. This is a good place to start: http://www.json.org/

Counting documents

Count is an aggregation command (not a part of the aggregation framework, we’ll get to that later) that relies on the runCommand to perform a count in MongoDB. The db.runCommand is a helper used to run specific database commands like count. If we want to count all of the documents in the userprofile collection, we would run the following.

db.runCommand ({  
   count:"userprofile"
})

A more common scenario for counting records would be counting based on the value of something. In the following query we will introduce the “query” option for counting. The “query” option lets us specify a condition (all users older than 35) and count all records that meet this condition.

db.runCommand({  
   count:"userprofile",
   query:{  
      "age":{  
         $gt:35
      }
   }
})

Counting this way is not very productive; we’ll need to tap in to the aggregation framework for more advanced queries. We’ll get to that later in this guide.

Select Distinct Values

Like the counting aggregation, the distinct aggregation can also be run with the db.runCommand helper. There is also a distinct() method we can use; let's look at the two examples below. If we want to return all distinct names in our userprofile collection, we could run the following.

db.runCommand({  
   distinct:"userprofile",
   key:"name"
})

Or we could use the distinct() method:

db.userprofile.distinct("name")

In the first example we call the distinct aggregation with the userprofile collection. We have the key value we want to select distinct against, in this example the names of the users. In the second example we just call the distinct() method on the userprofile collection and use the names value to select our distinct user names.

Grouping Results and the Aggregation Framework

The grouping aggregation, which was used in the same way as the examples above, was deprecated in MongoDB version 3.4. If we want to group results we would instead use the aggregation framework. To be more specific, we would use the aggregation pipeline, which is a framework for data aggregation. The terms aggregation pipeline and aggregation framework are interchangeable, in case you hear it called one or the other. The aggregation framework is modeled on the concept of data processing pipelines, where documents enter a multi-stage pipeline that transforms the documents into aggregated results. Each stage transforms the documents as they pass through the pipeline. MongoDB provides the aggregate() method in the format of db.collection.aggregate(). Aggregation operators like group, count, sum, or max begin with a $ when using the aggregation framework. There are many aggregation pipeline operators--we are going to cover the basics.

You will recall in pervious examples in part 1 of this series, we talked about the find() method, filtering, and projection. That filter (the first set of curly braces following the find method) works the same way the $match operator does in the aggregation framework. Let’s compare a query using the two.

Using the find() method, return all documents where the age of each user is greater than 35.

db. userprofile.find({  
   "age":{  
      $gt:35
   }
})

Now we'll return the same result using the aggregation framework:

db.userprofile.aggregate ([  
   {  
      $match:{  
         "age":{  
            $gt:35
         }
      }
   }
])

In the find() method we match records in the first set of curly braces (this is the filter), no match keyword is needed in the find() method. In the aggregate() method we match records using the $match operator. Also notice that the aggregate() method begins with an open paren and an open square bracket. Now that we understand the basic format of the aggregation() method let's take the next step. In the following example we are going to return the total balance for users grouped by gender. This will use the $group operator and the $sum operator.

db.userprofile.aggregate([  
   {  
      $group:{  
         _id:"$gender",
         totalBalance:{  
            $sum:"$balance"
         }
      }
   }
])

The $group operator groups documents and then outputs the distinct grouping for each document to the next stage of the pipeline. In the example above, following the $group operator, the first thing you will see is the _id: field followed by $gender. The _id field is a mandatory field when using $group. We can think of  _id as an alias that represents the fields we want to group on. In the example above, we needed to group on the gender field so the value for _id: is "$gender". The $ preceding gender tells the operator to treat $gender as a Literal, which means that $gender evaluates to a field path—it allows us to directly access the gender field in the document. Next in our statement we see totalBalance, which is the alias name we are providing for the sum of balance. We use $sum to calculate a total of the users balance grouped by gender. In order for us to access the balance field in the document we use a Literal $balance like we did with $gender. I would encourage you to look at the results with and without the $ so you can see how it affects the output.

Now let's look at another example, this time grouping on multiple fields. We will take the same query up top but this time we'll group users by their favorite fruit and gender.

db.userprofile.aggregate([  
   {  
      $group:{  
         _id:{  
            userGender:"$gender",
            favoriteFruits:"$favoriteFruit"
         },
         totalBalance:{  
            $sum:"$balance"
         }
      }
   }
])

The main difference between this query the one before it is that the mandatory _id field is now an object, with both fields from our collection we wanted to group on. userGender is the alias for the gender field and favoriteFruits is the alias for the favorite Fruit field. This represents the grouping for the query. Run the query above to see the results and the grouping.

Using $group with embedded objects is a subject we will visit later in this series. There are additional functions needed when working with arrays of objects and the aggregation framework.

Projection

Like we discussed in part one, projection is the process of limiting what fields get returned in our query. In the find() method, the second set of curly braces represents our projection and no keyword is needed. In the aggregation framework, projection is represented by the $project operator.

db.userprofile.aggregate ([  
   {  
      $match:{  
         "age":{  
            $gt:35
         }
      }
   },
   {  
      $project:{  
         "name":1,
         "email":1,
         "_id":0
      }
   }
])

Counting

There are two ways to count in the aggregation framework, the first is using the $count operator and the second is using the $sum operator. That is not a typo, we can use $sum to return a count of records. Let's look at the different ways we can use count.

db.userprofile.aggregate([  
   {  
      $match:{  
         "age":{  
            $gt:35
         }
      }
   },
   {  
      $count:"totalCount"
   }
])

In the example above , we are returning a count of all documents where the user's age is greater than 35. This will return the alias name "totalCount" along with the number of records that match our filter. This is simple enough, but what if we want to return more than just the record count? Going back to our example where we grouped on gender and favoriteFruits, let's add a count to this query. We want to know how many records exist for each grouping. This query also returns a totalBalance using $sum, and you are about to see why this second method can be confusing.

db.userprofile.aggregate ([  
   {  
      $group:{  
         _id:{  
            userGender:"$gender",
            favoriteFruits:"$favoriteFruit"
         },
         totalBalance:{  
            $sum:"$balance"
         },
         documentCount:{  
            $sum:1
         }
      }
   }
])

In this example $sum is used both to provide a total balance as well as count the number of documents per grouping. You can see the difference in the syntax surrounding $sum. Where you see $sum:1, all this is doing is summing the value 1 for every record returned. If you change this to the number to 0 it will return 0 for the document count.

Putting it all together

Now we need to put all the pieces together. In the following query we are going to provide a total balance and document count for all active users grouped by gender and favorite fruit and we only want to return the balance and the count.

db.userprofile.aggregate ([  
   {  
      $match:{  
         "isActive":true
      }
   },
   {  
      $group:{  
         _id:{  
            userGender:"$gender",
            favoriteFruits:"$favoriteFruit"
         },
         totalBalance:{  
            $sum:"$balance"
         },
         documentCount:{  
            $sum:1
         }
      }
   },
   {  
      $project:{  
         "_id":0,
         "totalBalance":1,
         "documentCount":1
      }
   }
])

It's important to note the _id filter in this query. In our projection we specify _id:0 so it does not return the results of _id. If we were working with the find() method, this would simply suppress the object id in the result set. When used in the aggregate framework with $group, it's going to suppress the grouping fields. We can't suppress the grouping fields directly by applying a projection on those individual fields, instead we apply the projection on the _id that represents the grouping fields.

In the Part 3 of this series we will continue working with the aggregation framework and begin exploring arrays.