Current Year, Previous Year, The Missing Dates Problem, and LAG

When working with data that has date gaps, we need to have different ways of filling in the missing dates. This could be something like previous revenue where you want to see what a customer purchased last year but they don't have a sale in the current year. Or maybe you want to carry a value forward over a certain period where you are missing dates, like a rate value over a holiday. I want to go over several different scenarios and ways to solve these problems. We will look at issues with LAG when dealing with missing years, single missing year vs. multiple missing years, and the different methods we can use to fill in the gaps.

The first problem we will look at is the previous year issue. We want to see revenue totals and previous years revenue totals for customers even if they don't have a sale in the current year. For simplicity, our sample data only contains a single customer, but the solutions here have proved performant across larger data sets. We have the customer, sales year, trimester, and revenue. You will see that the customer has no sales in 2013 or 2017 so we can address the missing years. Sales are mixed across different trimesters in different years to provide all the different previous year scenarios.

DROP TABLE IF EXISTS #Sales;

CREATE TABLE #Sales (
     CustomerNumber int
    ,SaleYear int
    ,Trimester int
    ,Revenue decimal(7,2) );

INSERT INTO #Sales
SELECT 333, 2012, 1, 25.00 UNION ALL SELECT 333, 2012, 1, 44.00 UNION ALL SELECT 333, 2012, 3, 18.00 UNION ALL SELECT 333, 2012, 3, 12.00 UNION ALL
SELECT 333, 2014, 1, 18.00 UNION ALL SELECT 333, 2014, 1, 24.00 UNION ALL SELECT 333, 2014, 3, 15.00 UNION ALL SELECT 333, 2014, 3, 24.00 UNION ALL
SELECT 333, 2015, 1, 56.00 UNION ALL SELECT 333, 2015, 2, 6.00  UNION ALL SELECT 333, 2015, 3, 31.00 UNION ALL
SELECT 333, 2016, 1, 43.00 UNION ALL SELECT 333, 2016, 1, 11.00 UNION ALL SELECT 333, 2016, 2, 36.00 UNION ALL SELECT 333, 2016, 3, 31.00 UNION ALL
SELECT 333, 2018, 1, 29.00 UNION ALL SELECT 333, 2018, 1, 33.00 UNION ALL SELECT 333, 2018, 2, 17.00 UNION ALL SELECT 333, 2018, 3, 16.00 ;

In our first example we look at how to get previous year, year over year, without addressing the missing year using the self-join method. Because our customer has multiple sales in the same year and trimester we summarize the data in a CTE before doing our self-join for previous year.

-- Get total revenue by Customer, SaleYear, and Trimester

WITH CTE AS (
    SELECT 
         CustomerNumber
        ,SaleYear
        ,Trimester
        ,SUM(Revenue) AS Revenue
    FROM #Sales 
    GROUP BY CustomerNumber, SaleYear, Trimester )

-- Then use self join to get revenue for previous year

SELECT 
     c.CustomerNumber
    ,c.SaleYear
    ,c.Trimester
    ,c.Revenue AS CurrentRevenue
    ,ISNULL(p.Revenue, 0) AS PreviousRevenue
FROM CTE c
LEFT JOIN CTE p ON c.CustomerNumber = p.CustomerNumber AND c.SaleYear = p.SaleYear + 1 AND c.Trimester = p.Trimester
ORDER BY c.Trimester, c.SaleYear;

In the query above we are self-joining the CTE; if we didn't have to summarize the data we could exclude the CTE and just self join to the table (personal note, I find CTE's more readable for queries like this). The key here is in the join condition c.SaleYear = p.SaleYear + 1. How this works is, if c.SaleYear = 2016 and p.SaleYear = 2015, then if we add 1 to p.SaleYear that equals 2016 and so the join condition evaluates to true, which effectively gives us the previous year record if that record exists. This solution returns the correct value for previous year when that year is missing but does not return the actual missing year (we will get to that shortly). Next let's take a look at the LAG function.

Anytime we start talking about previous value it's common to think about the LAG function, and while LAG works great with contiguous date periods, it falls apart on data sets that are missing dates. Take for example:

WITH CTE AS (
    SELECT 
         CustomerNumber
        ,SaleYear
        ,Trimester
        ,SUM(Revenue) AS Revenue
    FROM #Sales 
    GROUP BY CustomerNumber, SaleYear, Trimester )

SELECT 
     CustomerNumber
    ,SaleYear
    ,Trimester
    ,Revenue
    ,LAG(Revenue, 1, 0) OVER (PARTITION BY CustomerNumber, Trimester ORDER BY Trimester, SaleYear) AS PreviousRevenue 
FROM CTE c
ORDER BY c.Trimester, c.SaleYear

Where dates are contiguous it provides the correct result, but in the case of 2012 to 2014 it is showing 2012 as the previous revenue for 2014 and that is not what we are looking for. If you look at the previous revenue from 2014 to 2016 the previous year data is correct, but again it breaks down between 2016 and 2018. In order to use LAG in this scenario we would need to manufacture all possible years and trimesters for each customer. We will look at that in a minute but first lets look at a better way to fill in the missing years for the previous year problem.

In the following example we are again using a self-join, but this time a FULL OUTER JOIN to create records for missing years where there was a sale in the previous year but not the current year. Remember, we have no sales in 2013 but we still want to see previous year sales from 2012 on a line for 2013 regardless if there are no sales in 2013.

WITH CTE AS (
    SELECT 
         CustomerNumber
        ,SaleYear
        ,Trimester
        ,SUM(Revenue) AS Revenue
    FROM #Sales 
    GROUP BY CustomerNumber, SaleYear, Trimester )

-- Then use self join to get revenue for previous year

SELECT 
     COALESCE(c.CustomerNumber, p.CustomerNumber) AS CustomerNumber
    ,COALESCE(c.SaleYear, p.SaleYear + 1) AS SaleYear
    ,COALESCE(c.Trimester, p.Trimester) AS Trimester
    ,ISNULL(c.Revenue, 0) AS CurrentRevenue
    ,ISNULL(p.Revenue, 0) AS PreviousRevenue
FROM CTE c
FULL OUTER JOIN CTE p ON c.CustomerNumber = p.CustomerNumber AND c.SaleYear = p.SaleYear + 1 AND c.Trimester = p.Trimester
ORDER BY COALESCE(c.Trimester, p.Trimester), COALESCE(c.SaleYear, p.SaleYear);

If you look at the data you will see we now have records for the missing years or trimesters, and we have rows for both 2013 and 2017. This uses the same thinking as the first self-join we looked at with a few differences. First, we are using a FULL OUTER JOIN instead of the LEFT JOIN. The years are filled in in the SELECT using the COALESCE on SaleYear and you can see we are also using the SaleYear + 1 here, this will always give us the missing year. We use COALESCE for the customer and trimester to complete the data set. A simple way to get previous year and fill in missing years.

What if we wanted to do something like this using LAG? Again, we would need to manufacture every possible year and trimester for each customer so we could provide a contiguous date range to LAG on. There are a few different ways to do this, with a date table, a numbers tables, or recursive CTE; we are going to demonstrate this with recursive CTE's because it doesn't require the use of those other tables if you don't have them already, those other options will be more performant in most scenarios.

First we need to get all possible years starting from the first sale year, next we need to get all possible trimesters, then get our revenue totals, and finally, LAG on the previous year.

-- All possible SaleYears
WITH rCTEyr AS (
    SELECT DISTINCT
         CustomerNumber 
        ,MIN(SaleYear) OVER (PARTITION BY CustomerNumber) AS SaleYear
        ,CAST(0.00 AS decimal(7,2)) AS Revenue
    FROM #Sales

    UNION ALL

    SELECT 
         CustomerNumber
        ,SaleYear + 1
        ,Revenue
    FROM rCTEyr
    WHERE SaleYear <= 2017 ),

-- All possible Trimesters
rCTEQtr AS (
    SELECT 
         CustomerNumber
        ,SaleYear
        ,1 AS Trimester
        ,Revenue
    FROM rCTEyr

    UNION ALL

    SELECT 
         CustomerNumber
        ,SaleYear
        ,Trimester + 1
        ,Revenue
    FROM rCTEQtr
    WHERE Trimester < 3), 

-- Get our revenue totals 
CTEfinal AS (
    SELECT
         st.CustomerNumber
        ,st.SaleYear
        ,st.Trimester
        ,ISNULL(SUM(s.Revenue), 0) AS Revenue
    FROM rCTEQtr st
    LEFT JOIN #Sales s ON st.CustomerNumber = s.CustomerNumber AND s.SaleYear = st.SaleYear AND s.Trimester = st.Trimester
    GROUP BY st.CustomerNumber, st.SaleYear, st.Trimester )

SELECT DISTINCT
    CustomerNumber
    ,SaleYear
    ,Trimester
    ,Revenue
    ,LAG(Revenue, 1, 0) OVER (PARTITION BY CustomerNumber, Trimester ORDER BY Trimester, SaleYear) AS PreviousRevenue
FROM CTEfinal
ORDER BY Trimester, SaleYear

This creates a record for every possible combination of year and trimester which is useful, but maybe not so much for the previous year problem. You'll see we have the correct previous years revenue as well as the missing years now, but when you compare this with the simplicity of the self-join, the self-join is a faster, more efficient solution to this problem. This brings us to our second scenario, multiple missing dates or years. 

Imagine you have auto finance rates that change day to day. Rates don't change on the holidays or weekends or special events. You have a report that shows the rate for every day in a given period. We created several gaps in our data set below, gaps which span multiple days. We need to carry our rate forward for each missing day to show the rate didn't change for that day. This is the missing dates issue, we need to roll the value forward each missing day until we hit the next valid record. We are going to use a different data set this time to illustrate this.

DROP TABLE IF EXISTS #Test; 

CREATE TABLE #Test (
     RateDate date
    ,RateValue decimal(5,4) )

INSERT INTO #Test
SELECT '12/29/16', 1.2266
UNION ALL 
SELECT '12/30/16', 1.2345
UNION ALL
SELECT '01/03/17', 1.2240
UNION ALL
SELECT '01/06/17', 1.1902;

You will see that in the table above we have values for the 29th, 30th, 3rd, and the 6th but we are missing the dates in between. We need to carry the last rate value forward for each of the missing days. Let's image we have a report and our report accepts 2 parameters, the start date and the end date. We could generate all the days and then join them to the actual records in the table to effectively create missing dates and roll forward rate values. This is another example where you could use a date table, numbers table, or recursive CTE, we are using the CTE for convenience.

DECLARE 
     @StartDate date = '12/29/16'
    ,@EndDate date = '01/06/17';

WITH rCTE AS (
    SELECT @StartDate AS RateDate
    UNION ALL 
    SELECT DATEADD(DAY, 1, RateDate) FROM rCTE
    WHERE RateDate <= @EndDate )

SELECT 
     r.RateDate
    ,z.RateValue
FROM rCTE r
CROSS APPLY (
    SELECT TOP 1 RateValue
    FROM #Test t
    WHERE t.RateDate <= r.RateDate
    ORDER BY t.RateDate DESC ) AS z

The CROSS APPLY works great here and allows us to carry our rates forward for each day. The CROSS APPLY is the most performant of the different ways to solve this problem. I did want to include one more option because it's a clever way to solve this using ROWS UNBOUNDED, but might not be as performant as the CROSS APPLY.

DECLARE 
     @StartDate date = '12/29/16'
    ,@EndDate date = '01/06/17';

-- Get missing dates
WITH rCTE AS (
    SELECT @StartDate AS RateDate
    UNION ALL 
    SELECT DATEADD(DAY, 1, RateDate) FROM rCTE
    WHERE RateDate < @EndDate ),

-- Join missing dates to result
C1 AS (
    SELECT 
         c.RateDate
        ,RateValue
    FROM rCTE c
    LEFT JOIN #Test t ON c.RateDate = t.RateDate ),

-- Create date groups for your ranges
C2 AS (
    SELECT
         RateDate 
        ,RateValue 
        ,MAX(CASE WHEN RateValue IS NOT NULL THEN RateDate END) OVER(ORDER BY RateDate ROWS UNBOUNDED PRECEDING) AS DateGroup
    FROM C1 )

SELECT
     RateDate
    ,MIN(RateValue) OVER(PARTITION BY DateGroup) AS grp2
FROM C2

In summary, there are many different ways to solve the missing dates problem. It's about knowing when to use what patterns for which problems and ensuring those solutions are performant. The queries here are a great start and give you some options as you begin to explore solutions around the missing dates problems.

Calculate Year Over Year Variance - Subtract From Previous Row Using LAG

Say you want to summarize a value by year and find the variance between each year (row). Imagine you have a table that tracks company sales. You need to summarize sales totals by year and then compare the values year over year and find the variance. Here is a simple way to do this using the LAG function.

* Note * If your data does not have contiguous dates you will need to write different queries to fill in the missing dates. This is known as the missing dates problem and solutions to that can be found HERE

IF OBJECT_ID('tempdb..#SalesData') IS NOT NULL
DROP TABLE #SalesData
DECLARE @StartDate date = '2013-01-01';

WITH rCTE AS (
   SELECT
      CAST(2679 AS decimal(7,2)) AS SalesAmount
     ,@StartDate AS SalesDate
     ,CAST('nadia' as varchar) AS SalesPerson
     ,0 AS TransID

   UNION ALL

   SELECT
      CAST(SalesAmount + (SELECT CAST(CRYPT_GEN_RANDOM(1) AS INT) + 3 - CAST(CRYPT_GEN_RANDOM(1) AS INT)) AS decimal(7,2))
    ,DATEADD(dd, + 3, SalesDate)
    ,REVERSE(SalesPerson)
    ,TransID + 1
   FROM rCTE
   WHERE TransID < 500 )

SELECT
   SalesAmount
  ,SalesDate
  ,SalesPerson
INTO #SalesData
FROM rCTE
WHERE SalesAmount > 1
OPTION (MAXRECURSION 600)

--Let's take a look at the data

SELECT * FROM #SalesData

Now that we have our sales data table, we are going to summarize the data by year and load this to a temp table. Notice the ORDER BY, this is important since we will be using the LAG function next.

SELECT DISTINCT
   YEAR(SalesDate) SalesYear
  ,SUM(SalesAmount) OVER (PARTITION BY YEAR(SalesDate)) AS TotalSalesByYear
INTO #SalesTotals
FROM #SalesData
ORDER BY SalesYear ASC;

After we load our temp table with the summarized annual sales, now we are going to use the LAG function to get the previous row for TotalSalesByYear --this will effectively create a new column for the previous year's sales. You can see in the LAG function below we use the TotalSalesByYear value and then you see a comma and the number 1. The number is the offset or how many rows back you want to look from the current row. You can also specify a default value after the offset, this is the value to return if that offset value returns NULL. By excluding that default value, the default value will return NULL, which is what we want so we don't get a divide by zero error when calculating the variance. Here is a link for LAG usage.

WITH CTEazyE AS (
   SELECT
      SalesYear
     ,TotalSalesByYear
     ,LAG(TotalSalesByYear, 1) OVER(ORDER BY SalesYear ASC) AS PreviousYearSales
   FROM #SalesTotals )    

-- Now we calculate the variance year over year.

SELECT
     SalesYear
    ,TotalSalesByYear
    ,PreviousYearSales
    ,(TotalSalesByYear - PreviousYearSales) AS VarianceAmount
    ,100 * (TotalSalesByYear - PreviousYearSales) / PreviousYearSales AS VariancePercent
FROM CTEazyE

There are other ways to calculate this but the intent of this post was to introduce you to the LAG (LEAD) function. Mind you, the LAG function is a SQL 2012 + feature.