With the multitude of new features and tools coming out for MongoDB, it has become easier to analyze, compute and reshape data in ways that would have been slow and inefficient just a few years ago with map-reduce jobs. One major feature that came out a few years ago is the aggregation framework. It was released in 2012 with MongoDB version 2.2., and with every new release since, MongoDB has shored up the aggregation framework with new stages and operators to slice and dice the data in new and creative ways. This, in turn, has helped developers look at the data from new angles and gain new insights.

So, what is aggregation framework?

Aggregation framework is a pipeline-based framework made up of one or more stages. It takes in a stream of documents from a specified source collection and passes them through each and every stage. The output of one stage becomes the input of the next stage. Based on the supplied stages and operators, the system will analyze, compute or reshape the documents as they pass from one stage to another. This process causes no side effects or modifications to the original source data.

Aggregation framework enables a more declarative query style compared to the imperative format provided by the map-reduce paradigm.

Let's look at a few examples to illustrate the power of aggregation framework.

Here is a sample customer collection that holds customer information which includes their billing and service address. The document also has an array of services each customer is subscribed to which includes the type of service, actual price, discounted price and the number of users.

Now, let us explore the data by asking a few questions.

The below aggregation queries have been tested on MongoDB 3.6.1. Run the command - db.version() - from mongo shell to find out the version of mongod.

Explanation of the pipeline:

Stage 1: The $match stage fetches all the documents where the value of customerType is "Direct" and passes those documents to the $count stage. The $match stage takes advantage of indexes and, if possible, should be at the beginning of the pipeline to exclude unnecessary data from flowing through to the next stages.
Stage 2: The $count stage counts the number of documents outputted by the previous $match stage.

The same answer could also be derived using $group stage.

Explanation of the pipeline:

Stage 1: The $group stage groups all the documents with the same customerType key-value pair, and the $sum operator acts as an accumulator by adding 1 to the respective grouped category and maintaining the state until all the documents are passed through the $group stage.

  1. How many direct customers have at least 2 subscriptions?

Explanation of the pipeline:

Stage 1: The $match stage fetches all the documents with customerType set to "Direct" and then checks to see if there is an element in the subscriptions array at position 1, since arrays are zero-based. The $size operator cannot be used in this case because it cannot be combined with $gt or $lt operators.

  1. Get total number of subscriptions for direct customers across all products

Explanation of the pipeline:

Stage 1: $match retrieves and passes only direct customers to the next stage.
Stage 2: $unwind deconstructs the subscriptions array and creates a new document for each element in the subscriptions array.
Stage 3: $count stage keeps a count of the documents outputted by $unwind stage.

  1. List all direct customers’ names with Widgets subscriptions

Explanation:

Stage 1: $match gets all the documents where customerType is equal to "Direct" and passes them to the $unwind stage.
Stage 2: $unwind deconstructs the subscriptions array.
Stage 3: Second $match stage fetches all the documents where productName is equal to "Widgets" in the subscriptions subdocument.
Stage 4: $project stage tells the aggregation framework to select only customerName key-value and ignore the rest. _id is always selected by default. To exclude _id from the output, specify _id:0 in $project stage

  1. List of all subscriptions where discountedPrice is less than actualPrice

Explanation:

Stage 1: $unwind deconstructs the subscriptions array.
Stage 2: $match uses $expr operator to do field comparisons from the same document. In this case, it is used to check if discountedPrice is less than actualPrice in the subdocument.
Stage 3: $project stage selects customerName field and explicitly excludes _id field. It also selects the productName field from the subscriptions subdocument.

  1. Compute discount percentage and include a message if it is less than or more than 20%

Explanation:

Stage 1: Creates a new root document for every element in the subscriptions array.
Stage 2: $project computes new discount percentage field using arithmetic expression operators.
Stage 3: $addFields adds a conditional message using $switch operator without excluding other fields, unlike $project stage.

  1. Count and list all customer with Widgets subscriptions

Explanation:

Stage 1: $unwind deconstructs the subscriptions array.
Stage 2: $project required fields.
Stage 3: $group by productName field, count the number of customers with each type of subscription, and use the $addToSet operator to add the top-level document outputted by $project stage. System variable $$ROOT holds the reference to the top-level document.
Stage 4: $match fetches the documents where _id is equal to Widgets as outputted by $group stage
Stage 5: $project stage maps _id to a more descriptive field name, selects customerCount, customerName and finally explicitly excludes _id.

  1. Calculate subscription total and sort the customers by state and then total fields

Explanation:

Stage 1: $unwind deconstructs the subscriptions array.
Stage 2: $unwind deconstructs the customerAddress array.
Stage 3: $addFields adds a new compute field to the subscriptions subdocument without excluding other fields, unlike $project stage.
Stage 4: $sort stage first sorts the documents by state in ascending order, and then by total in descending order.
Stage 5: $project promotes state and productName from the subdocument level to top-level.
Stage 6: $limit retrieves only the first 3 documents. $limit can be combined with $skip to implement paging.

  1. List all Widgets subscriptions

Explanation:

Stage 1: $project uses $filter operator to keep array elements that match the criteria.
Stage 2: $unwind deconstructs the WidgetsSubscriptions subdocument.
Stage 3: $addFields includes customerName in the WidgetsSubscriptions subdocument.
Stage 4: $replaceRoot replaces the top-level document with the WidgetsSubscriptions subdocument.

  1. Add a computed field to the subdocuments in the subscriptions array to capture the total amount of each subscription

Explanation:

The $map operator computes a new field called total by multiplying discountedPrice and number of users in each item in the subscriptions array.

The above result could also be obtained using $unwind, $addFields and $group

  1. Create a new subdocument that holds the total of all subscriptions and lists them in a comma separated value

Explanation:

The $reduce operator adds a new subdocument that contains the key/value reduced from the subscriptions array.

As you can see, there are numerous stages and operators to help analyze, compute and reshape data in order to gain insight. In this article, we covered a few of the stages and operators to aggregate the data. I encourage you to further explore and find out all the different ways the aggregation framework can optimize your querying needs.

Comments

Subscribe Here!