Current Year, Previous Year, The Missing Dates Problem, and LAG

When working with data that has date gaps, we need different ways of filling in the missing dates. This could be something like previous revenue, where you want to see what a customer purchased last year even though they don't have a sale in the current year. Or maybe you want to carry a value forward over a period where dates are missing, like a rate value over a holiday. I want to go over several different scenarios and ways to solve these problems. We will look at issues with LAG when dealing with missing years, a single missing year vs. multiple missing years, and the different methods we can use to fill in the gaps.

The first problem we will look at is the previous year issue. We want to see revenue totals and the previous year's revenue totals for customers even if they don't have a sale in the current year. For simplicity, our sample data only contains a single customer, but the solutions here have proved performant across larger data sets. We have the customer, sale year, trimester, and revenue. You will see that the customer has no sales in 2013 or 2017, so we can address the missing years. Sales are mixed across different trimesters in different years to cover all the different previous year scenarios.

DROP TABLE IF EXISTS #Sales;

CREATE TABLE #Sales (
     CustomerNumber int
    ,SaleYear int
    ,Trimester int
    ,Revenue decimal(7,2) );

INSERT INTO #Sales
SELECT 333, 2012, 1, 25.00 UNION ALL SELECT 333, 2012, 1, 44.00 UNION ALL SELECT 333, 2012, 3, 18.00 UNION ALL SELECT 333, 2012, 3, 12.00 UNION ALL
SELECT 333, 2014, 1, 18.00 UNION ALL SELECT 333, 2014, 1, 24.00 UNION ALL SELECT 333, 2014, 3, 15.00 UNION ALL SELECT 333, 2014, 3, 24.00 UNION ALL
SELECT 333, 2015, 1, 56.00 UNION ALL SELECT 333, 2015, 2, 6.00  UNION ALL SELECT 333, 2015, 3, 31.00 UNION ALL
SELECT 333, 2016, 1, 43.00 UNION ALL SELECT 333, 2016, 1, 11.00 UNION ALL SELECT 333, 2016, 2, 36.00 UNION ALL SELECT 333, 2016, 3, 31.00 UNION ALL
SELECT 333, 2018, 1, 29.00 UNION ALL SELECT 333, 2018, 1, 33.00 UNION ALL SELECT 333, 2018, 2, 17.00 UNION ALL SELECT 333, 2018, 3, 16.00 ;

In our first example we look at how to get the previous year, year over year, using the self-join method without addressing the missing years. Because our customer has multiple sales in the same year and trimester, we summarize the data in a CTE before doing our self-join for the previous year.

-- Get total revenue by Customer, SaleYear, and Trimester

WITH CTE AS (
    SELECT 
         CustomerNumber
        ,SaleYear
        ,Trimester
        ,SUM(Revenue) AS Revenue
    FROM #Sales 
    GROUP BY CustomerNumber, SaleYear, Trimester )

-- Then use self join to get revenue for previous year

SELECT 
     c.CustomerNumber
    ,c.SaleYear
    ,c.Trimester
    ,c.Revenue AS CurrentRevenue
    ,ISNULL(p.Revenue, 0) AS PreviousRevenue
FROM CTE c
LEFT JOIN CTE p ON c.CustomerNumber = p.CustomerNumber AND c.SaleYear = p.SaleYear + 1 AND c.Trimester = p.Trimester
ORDER BY c.Trimester, c.SaleYear;

In the query above we are self-joining the CTE; if we didn't have to summarize the data we could exclude the CTE and just self-join to the table (personal note: I find CTEs more readable for queries like this). The key here is the join condition c.SaleYear = p.SaleYear + 1. If c.SaleYear = 2016 and p.SaleYear = 2015, adding 1 to p.SaleYear gives 2016, so the join condition evaluates to true, which effectively gives us the previous year record if that record exists. This solution returns the correct value for the previous year when that year is missing, but does not return the actual missing year (we will get to that shortly). Next let's take a look at the LAG function.

Anytime we start talking about previous value it's common to think about the LAG function, and while LAG works great with contiguous date periods, it falls apart on data sets that are missing dates. Take for example:

WITH CTE AS (
    SELECT 
         CustomerNumber
        ,SaleYear
        ,Trimester
        ,SUM(Revenue) AS Revenue
    FROM #Sales 
    GROUP BY CustomerNumber, SaleYear, Trimester )

SELECT 
     CustomerNumber
    ,SaleYear
    ,Trimester
    ,Revenue
    ,LAG(Revenue, 1, 0) OVER (PARTITION BY CustomerNumber, Trimester ORDER BY Trimester, SaleYear) AS PreviousRevenue 
FROM CTE c
ORDER BY c.Trimester, c.SaleYear

Where dates are contiguous it provides the correct result, but in the case of 2012 to 2014 it shows 2012 as the previous revenue for 2014, and that is not what we are looking for. If you look at the previous revenue from 2014 to 2016 the previous year data is correct, but again it breaks down between 2016 and 2018. To use LAG in this scenario we would need to manufacture all possible years and trimesters for each customer. We will look at that in a minute, but first let's look at a better way to fill in the missing years for the previous year problem.

In the following example we are again using a self-join, but this time a FULL OUTER JOIN to create records for missing years where there was a sale in the previous year but not the current year. Remember, we have no sales in 2013, but we still want to see the previous year's sales from 2012 on a line for 2013 even though there are no sales in 2013.

WITH CTE AS (
    SELECT 
         CustomerNumber
        ,SaleYear
        ,Trimester
        ,SUM(Revenue) AS Revenue
    FROM #Sales 
    GROUP BY CustomerNumber, SaleYear, Trimester )

-- Then use self join to get revenue for previous year

SELECT 
     COALESCE(c.CustomerNumber, p.CustomerNumber) AS CustomerNumber
    ,COALESCE(c.SaleYear, p.SaleYear + 1) AS SaleYear
    ,COALESCE(c.Trimester, p.Trimester) AS Trimester
    ,ISNULL(c.Revenue, 0) AS CurrentRevenue
    ,ISNULL(p.Revenue, 0) AS PreviousRevenue
FROM CTE c
FULL OUTER JOIN CTE p ON c.CustomerNumber = p.CustomerNumber AND c.SaleYear = p.SaleYear + 1 AND c.Trimester = p.Trimester
ORDER BY COALESCE(c.Trimester, p.Trimester), COALESCE(c.SaleYear, p.SaleYear);

If you look at the data you will see we now have records for the missing years and trimesters, including rows for both 2013 and 2017. This uses the same thinking as the first self-join we looked at, with a few differences. First, we are using a FULL OUTER JOIN instead of the LEFT JOIN. The missing years are filled in in the SELECT by the COALESCE on SaleYear; notice we are again using SaleYear + 1 there, which always gives us the missing year. We use COALESCE for the customer and trimester to complete the data set. A simple way to get the previous year and fill in missing years.

What if we wanted to do something like this using LAG? Again, we would need to manufacture every possible year and trimester for each customer so we could provide a contiguous date range for LAG to work over. There are a few different ways to do this: a date table, a numbers table, or a recursive CTE. We are going to demonstrate this with recursive CTEs because it doesn't require those other tables if you don't already have them, though those other options will be more performant in most scenarios.
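
For reference, here is a minimal sketch of the numbers-table style of year generation, using sys.all_objects purely as a convenient row source since no permanent numbers or date table exists in this post:

-- Generate every year between the first and last sale year from a row source instead of recursion.
-- sys.all_objects is used here only as a convenient source of rows.
DECLARE @MinYear int, @MaxYear int;
SELECT @MinYear = MIN(SaleYear), @MaxYear = MAX(SaleYear) FROM #Sales;

SELECT DISTINCT
     s.CustomerNumber
    ,y.SaleYear
FROM #Sales s
CROSS JOIN (
    SELECT TOP (@MaxYear - @MinYear + 1)
        @MinYear + ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS SaleYear
    FROM sys.all_objects ) AS y;

The rest of this example sticks with the recursive CTE approach.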

First we need to get all possible years starting from the first sale year, next we need to get all possible trimesters, then get our revenue totals, and finally, LAG on the previous year.

-- All possible SaleYears
WITH rCTEyr AS (
    SELECT DISTINCT
         CustomerNumber 
        ,MIN(SaleYear) OVER (PARTITION BY CustomerNumber) AS SaleYear
        ,CAST(0.00 AS decimal(7,2)) AS Revenue
    FROM #Sales

    UNION ALL

    SELECT 
         CustomerNumber
        ,SaleYear + 1
        ,Revenue
    FROM rCTEyr
    WHERE SaleYear <= 2017 ),

-- All possible Trimesters
rCTEQtr AS (
    SELECT 
         CustomerNumber
        ,SaleYear
        ,1 AS Trimester
        ,Revenue
    FROM rCTEyr

    UNION ALL

    SELECT 
         CustomerNumber
        ,SaleYear
        ,Trimester + 1
        ,Revenue
    FROM rCTEQtr
    WHERE Trimester < 3), 

-- Get our revenue totals 
CTEfinal AS (
    SELECT
         st.CustomerNumber
        ,st.SaleYear
        ,st.Trimester
        ,ISNULL(SUM(s.Revenue), 0) AS Revenue
    FROM rCTEQtr st
    LEFT JOIN #Sales s ON st.CustomerNumber = s.CustomerNumber AND s.SaleYear = st.SaleYear AND s.Trimester = st.Trimester
    GROUP BY st.CustomerNumber, st.SaleYear, st.Trimester )

SELECT DISTINCT
    CustomerNumber
    ,SaleYear
    ,Trimester
    ,Revenue
    ,LAG(Revenue, 1, 0) OVER (PARTITION BY CustomerNumber, Trimester ORDER BY Trimester, SaleYear) AS PreviousRevenue
FROM CTEfinal
ORDER BY Trimester, SaleYear

This creates a record for every possible combination of year and trimester, which is useful, but maybe not so much for the previous year problem. You'll see we now have the correct previous year's revenue as well as the missing years, but when you compare this with the simplicity of the self-join, the self-join is a faster, more efficient solution to this problem. This brings us to our second scenario: multiple missing dates or years.

Imagine you have auto finance rates that change day to day. Rates don't change on holidays, weekends, or special events. You have a report that shows the rate for every day in a given period. We created several gaps in the data set below, gaps which span multiple days. We need to carry our rate forward for each missing day to show the rate didn't change on that day. This is the missing dates issue: we need to roll the value forward for each missing day until we hit the next valid record. We are going to use a different data set this time to illustrate this.

DROP TABLE IF EXISTS #Test; 

CREATE TABLE #Test (
     RateDate date
    ,RateValue decimal(5,4) )

INSERT INTO #Test
SELECT '12/29/16', 1.2266
UNION ALL 
SELECT '12/30/16', 1.2345
UNION ALL
SELECT '01/03/17', 1.2240
UNION ALL
SELECT '01/06/17', 1.1902;

You will see that in the table above we have values for the 29th, 30th, 3rd, and 6th, but we are missing the dates in between. We need to carry the last rate value forward for each of the missing days. Let's imagine we have a report that accepts two parameters, the start date and the end date. We could generate all the days and then join them to the actual records in the table to effectively create the missing dates and roll the rate values forward. This is another example where you could use a date table, a numbers table, or a recursive CTE; we are using the recursive CTE for convenience.

DECLARE 
     @StartDate date = '12/29/16'
    ,@EndDate date = '01/06/17';

WITH rCTE AS (
    SELECT @StartDate AS RateDate
    UNION ALL 
    SELECT DATEADD(DAY, 1, RateDate) FROM rCTE
    WHERE RateDate < @EndDate )

SELECT 
     r.RateDate
    ,z.RateValue
FROM rCTE r
CROSS APPLY (
    SELECT TOP 1 RateValue
    FROM #Test t
    WHERE t.RateDate <= r.RateDate
    ORDER BY t.RateDate DESC ) AS z

The CROSS APPLY works great here and allows us to carry our rates forward for each day. Of the different ways to solve this problem, the CROSS APPLY has been the most performant in my experience. I did want to include one more option because it's a clever way to solve this using ROWS UNBOUNDED PRECEDING, though it might not be as performant as the CROSS APPLY.

DECLARE 
     @StartDate date = '12/29/16'
    ,@EndDate date = '01/06/17';

-- Get missing dates
WITH rCTE AS (
    SELECT @StartDate AS RateDate
    UNION ALL 
    SELECT DATEADD(DAY, 1, RateDate) FROM rCTE
    WHERE RateDate < @EndDate ),

-- Join missing dates to result
C1 AS (
    SELECT 
         c.RateDate
        ,RateValue
    FROM rCTE c
    LEFT JOIN #Test t ON c.RateDate = t.RateDate ),

-- Create date groups for your ranges
C2 AS (
    SELECT
         RateDate 
        ,RateValue 
        ,MAX(CASE WHEN RateValue IS NOT NULL THEN RateDate END) OVER(ORDER BY RateDate ROWS UNBOUNDED PRECEDING) AS DateGroup
    FROM C1 )

SELECT
     RateDate
    ,MIN(RateValue) OVER(PARTITION BY DateGroup) AS RateValue
FROM C2

In summary, there are many different ways to solve the missing dates problem. It's about knowing which patterns to use for which problems and ensuring those solutions are performant. The queries here are a great start and give you some options as you begin to explore solutions to the missing dates problem.

Guide to Working With JSON in SQL 2016 Part 2 - Output and Format

In part 2 of this series we are going to explore the process of getting data out of a SQL database and into a usable JSON document. Like FOR XML in the past, Microsoft introduced the FOR JSON clause to format the result of a query as a JSON document. If you want to produce a flat document the FOR JSON clause works just fine, but most documents are not flat and will contain nested objects and arrays, so we need to know how to deal with that. That will be the focus of this post.

Part of the trick to outputting a meaningful JSON document is understanding the limitations and behaviors of the FOR JSON clause and how to work around them. We will explain in the examples below.

First we need to create our tables that represent the relational data.

-- Create the relational tables

CREATE TABLE #Products (
   ProductNumber int
  ,ProductName varchar(25)
  ,ProductType varchar(20)
  ,Runtime smallint );

INSERT INTO #Products 
SELECT 45975, 'Spiderman', 'Movie', 110
INSERT INTO #Products 
SELECT 96587, 'Batman', 'Movie', 98
INSERT INTO #Products 
SELECT 25893, 'SuperMan', 'Movie', 102

-- For releases and assets, only include records for a single movie to keep it simple

CREATE TABLE #Releases (
   Country varchar(30)
  ,HomeReleaseDate datetime2(0)
  ,ProductNumber int )

INSERT INTO #Releases 
SELECT 'United States', DATEADD(dd, -329, GETDATE()), 96587;
INSERT INTO #Releases 
SELECT 'France', DATEADD(dd, -256, GETDATE()), 96587;
INSERT INTO #Releases 
SELECT 'Germany', DATEADD(dd, -215, GETDATE()), 96587;

CREATE TABLE #Assets (
   ProductNumber int
  ,Languages varchar(20)
  ,TitleName varchar(50) )

INSERT INTO #Assets 
SELECT 96587, 'English', 'Batman';
INSERT INTO #Assets 
SELECT 96587, 'French', 'LeBat';
INSERT INTO #Assets 
SELECT 96587, 'German', 'Bachman';

Now we need to create a table where we can combine our results. We need to produce multiple JSON outputs for a single record and then combine all of those outputs. We are going to create our combine table and then insert the first record. We will come back and update this record with supporting data for the final output.

CREATE TABLE #Combine (
   Product nvarchar(max)
  ,Releases nvarchar(max)
  ,Assets nvarchar(max) )

INSERT INTO #Combine (Product)
VALUES ((
   SELECT
      ProductNumber
     ,ProductName
     ,ProductType
     ,Runtime
   FROM #Products
   WHERE ProductNumber = 96587
   FOR JSON PATH, WITHOUT_ARRAY_WRAPPER))

The Product field in the #Combine table represents a JSON document for product 96587, Batman. The Releases field represents an array of embedded objects that is part of the product document, and the same goes for Assets. We are producing three different related JSON results and then combining them into one. The FOR JSON PATH clause by itself cannot output embedded documents, so we have to do this manually (well, actually it can, kind of, if you use ROOT, but you don't have much control of the output, and when dealing with multiple table joins it does what it wants, which is not likely what you want). Also note that in the FOR JSON PATH above we use the option WITHOUT_ARRAY_WRAPPER. This eliminates the square brackets from the JSON output; we only want to exclude them for the parent document. Now we are going to update the other fields and take a look at what we have so far.

UPDATE #Combine 
   SET Releases = (
      SELECT
         ProductNumber
        ,Country
        ,HomeReleaseDate
       FROM #Releases
FOR JSON PATH, ROOT('Releases'))

UPDATE #Combine
   SET Assets = (
      SELECT 
         Languages
        ,TitleName
      FROM #Assets
FOR JSON AUTO, ROOT('Assets') )

SELECT * FROM #Combine

So we effectively have 3 JSON documents we need to combine into a single document. To do this we use the following.

SELECT
   STUFF(Product,LEN(Product),1, ',') + 
   STUFF(STUFF(Releases, 1, 1, ''), LEN(Releases) - 1, 1, ',') +
   STUFF(Assets, 1, 1, '')
FROM #Combine

The query above doesn't do anything special except combine the documents. Copy the result and paste it into a JSON validator to see the end result.

It's not pretty, but it works. Would I use it in production? Probably not. Keep in mind this is only version 1.0; hopefully we will see improvements in future releases. We can hope that Microsoft follows in the footsteps of Postgres with a true native data type. Or, of course, you could do all of this in an application just as easily.

Guide to Working With JSON in SQL 2016 Part 1 - Storing and Querying

Microsoft SQL Server 2016 introduced support for JSON. Microsoft added the FOR JSON clause to build a JSON structured result from an existing relational DB, and they also added a handful of new functions to query a JSON document stored in a SQL table. That is the focus of Part 1 in this series, how to store JSON documents in a SQL database and how to query those documents.

Here are the new functions we will be exploring in this post:

JSON_VALUE – Extract a value from a JSON string
OPENJSON – Table Valued Function that parses JSON text. Returns objects and properties as rows and columns.

A few important notes:

  • JSON is case sensitive. If the JSON object is "FilmDetails" and you try querying "filmdetails", it will fail. Make sure you check your case when troubleshooting your queries.
  • There isn't a JSON data type in SQL. We have JSON support in SQL 2016, not a data type. JSON documents are stored as nvarchar(max), but don't let this discourage you from exploring these features.
  • You can create a CHECK constraint on the table where you store your JSON documents to verify the document is valid JSON (ISJSON(<JSONData>) > 0), but this is not required; see the sketch after this list.
  • You can create your table as memory optimized, FYI.
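
Here is a minimal sketch of that optional CHECK constraint, written against the JSONTable and JSONDocument names used later in this post (the constraint name is just illustrative):

-- Optional: reject rows where JSONDocument is not valid JSON
ALTER TABLE JSONTable
ADD CONSTRAINT CK_JSONTable_JSONDocument_IsJSON CHECK (ISJSON(JSONDocument) > 0);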

If you need a refresher on JSON structures, Look Here.

Here is our table definition and the JSON document we will be storing in the table:

CREATE TABLE JSONTable (
   ID int IDENTITY (1,1) PRIMARY KEY CLUSTERED
  ,JSONDocument nvarchar(max) );

INSERT INTO JSONTable
SELECT '{  
   "FilmDetails":{  
      "ProductNumber":"9912088751",
      "Title":"Brave",
      "Type":"Movie",
      "Runtime":93,
      "ReleaseYear":2012,
      "Synopses":[  
         {  
            "Locale":"en",
            "Text":"Princess Merida uses her bravery and archery skills to battle a curse and restore peace..."
         },
         {  
            "Locale":"de",
            "Text":"Animiert"
         },
         {  
            "Locale":"fr",
            "Text":"Au coeur des contrées sauvages dÉcosse, Merida, la fille du roi Fergus et de la reine Elinor..."
         }
      ],
      "Assets":[  
         {  
            "Name":"Stickers",
            "AssetType":"Stickers",
            "FileType":"PDF",
            "Location":"http://media.brave.stickers.pdf",
            "Locale":"en-US"
         },
         {  
            "Name":"Trailer - WMV",
            "AssetType":"Trailer - WMV",
            "FileType":"Video",
            "Location":"http://youtu.be/Shg79Shgn",
            "Locale":"en-US"
         }
      ]
   }
}';

First let's look at the JSON_VALUE function, we will use this to return values from our JSON document.

SELECT
   JSON_VALUE(JSONDocument, '$.FilmDetails.ProductNumber') AS ProductNumber
  ,JSON_VALUE(JSONDocument, '$.FilmDetails.Title') AS Title
  ,JSON_VALUE(JSONDocument, '$.FilmDetails.Type') AS ContentType
  ,JSON_VALUE(JSONDocument, '$.FilmDetails.Runtime') AS Runtime
  ,JSON_VALUE(JSONDocument, '$.FilmDetails.ReleaseYear') AS ReleaseYear
FROM JSONTable

In the example above, we are simply grabbing the values from the name/value pairs found in the FilmDetails object. Simple enough, but we also have a few arrays defined and we want to extract values from those arrays as well. You could try referencing a specific array element by its index, but if the array has many elements you might not know the position. For example:

SELECT
   JSON_VALUE(JSONDocument, '$.FilmDetails.ProductNumber') AS ProductNumber
  ,JSON_VALUE(JSONDocument, '$.FilmDetails.Synopses[0].Text') AS SynopsesTextEn
  ,JSON_VALUE(JSONDocument, '$.FilmDetails.Synopses[1].Locale') AS Locale
  ,JSON_VALUE(JSONDocument, '$.FilmDetails.Assets[1].Location') AS TrailerLocation
FROM JSONTable

You see the [0] and [1] that follow the Synopses object? That is the index of the array element we are trying to extract our values from. While this does work, let's take a look at another option, OPENJSON. OPENJSON is a table-valued function that returns objects and properties as rows and columns. You can use OPENJSON with the default schema or you can create a custom schema for your output. Each produces very different results, so let's look at both of them. First, the default schema:

SELECT
   JSON_VALUE(JSONDocument, '$.FilmDetails.ProductNumber') AS ProductNumber
  ,JSON_VALUE(JSONDocument, '$.FilmDetails.Title') AS Title
  ,[key] AS JSONObject
  ,[value] AS JSONValue
  ,[type] AS TypeOfValue
FROM JSONTable
CROSS APPLY OPENJSON(JSONDocument)

When you use OPENJSON with the default schema, the function returns a table with one row for each property of the object or for each element in the array. What happens if we target an object in the function?

SELECT
   JSON_VALUE(JSONDocument, '$.FilmDetails.ProductNumber') AS ProductNumber
  ,JSON_VALUE(JSONDocument, '$.FilmDetails.Title') AS Title
  ,[key] AS ArrayElementKey
  ,[value] AS ArrayValue
  ,[type] AS TypeOfValue
FROM JSONTable
CROSS APPLY OPENJSON(JSONDocument, '$.FilmDetails.Synopses')

When the path targets an array (in this case $.FilmDetails.Synopses), OPENJSON returns one row for each element of the array; when it targets an object, it returns the object's first-level properties. That's cool, but we still need to return values from the different arrays as real columns. Let's define a custom schema and see what happens:

SELECT *
FROM JSONTable
CROSS APPLY OPENJSON(JSONDocument, '$.FilmDetails.Synopses')
WITH (
   Locale varchar(3) '$.Locale'
  ,SynopsesText nvarchar(2000) '$.Text')
WHERE JSON_VALUE(JSONDocument, '$.FilmDetails.Title') LIKE '%Brave%'

We added a WHERE clause to simplify the result.

We now have results for Locale and SynopsesText, but we are also getting the entire JSONDocument and ID. We want to be able to return a single filtered result. Notice in the query below we can use the column names defined in the WITH clause of the OPENJSON table-valued function at the beginning of our SELECT. We are also adding values from the FilmDetails object using JSON_VALUE. The output is starting to look more meaningful.

SELECT
   JSON_VALUE(JSONDocument, '$.FilmDetails.ProductNumber') AS ProductNumber
  ,JSON_VALUE(JSONDocument, '$.FilmDetails.Title') AS Title
  ,JSON_VALUE(JSONDocument, '$.FilmDetails.Type') AS ContentType
  ,JSON_VALUE(JSONDocument, '$.FilmDetails.Runtime') AS Runtime
  ,JSON_VALUE(JSONDocument, '$.FilmDetails.ReleaseYear') AS ReleaseYear
  ,Locale
  ,SynopsesText
FROM JSONTable
CROSS APPLY OPENJSON(JSONDocument, '$.FilmDetails.Synopses')
WITH (
   Locale varchar(3) '$.Locale'
  ,SynopsesText nvarchar(2000) '$.Text')
WHERE JSON_VALUE(JSONDocument, '$.FilmDetails.Title') LIKE '%Brave%'
   AND Locale = 'en'

We can assume that our JSON document will have multiple arrays, and we'll need to extract the details from those arrays, so how do we do that? Let's try to add another CROSS APPLY to the "Assets" array and add the values to our main SELECT to see what we get:

SELECT
   JSON_VALUE(JSONDocument, '$.FilmDetails.ProductNumber') AS ProductNumber
  ,JSON_VALUE(JSONDocument, '$.FilmDetails.Title') AS Title
  ,JSON_VALUE(JSONDocument, '$.FilmDetails.Type') AS ContentType
  ,JSON_VALUE(JSONDocument, '$.FilmDetails.Runtime') AS Runtime
  ,JSON_VALUE(JSONDocument, '$.FilmDetails.ReleaseYear') AS ReleaseYear
  ,Locale
  ,SynopsesText
  ,Name AS AssetName
  ,FileType AS AssetFileType
  ,[Location] AS AssetLocation
FROM JSONTable
CROSS APPLY OPENJSON(JSONDocument, '$.FilmDetails.Synopses')
WITH (
   Locale varchar(3) '$.Locale'
  ,SynopsesText nvarchar(2000) '$.Text')
CROSS APPLY OPENJSON(JSONDocument, '$.FilmDetails.Assets')
WITH (
   Name varchar(25) '$.Name'
  ,FileType varchar(25) '$.FileType'
  ,[Location] nvarchar(500) '$.Location' )
WHERE JSON_VALUE(JSONDocument, '$.FilmDetails.Title') LIKE '%Brave%'
   AND Locale = 'en'
   AND FileType = 'video'

At this point you should be returning a single row of filtered data that gets values from multiple arrays and objects in your JSON document. You can also index your JSON data to improve performance, as sketched below. Microsoft has introduced an interesting new feature set with JSON support in 2016; you should check it out in its entirety.
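
A common pattern for that, sketched here under the assumption that you filter on the title most often (the column and index names are illustrative), is to expose the JSON value as a computed column and index it:

-- Expose a frequently-filtered JSON property as a computed column, then index it
ALTER TABLE JSONTable
ADD FilmTitle AS JSON_VALUE(JSONDocument, '$.FilmDetails.Title');

CREATE NONCLUSTERED INDEX IX_JSONTable_FilmTitle ON JSONTable (FilmTitle);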

In part 2 of this series we are going to explore outputting JSON documents from normalized tables. We will look at outputting arrays and assembling complex JSON documents from SQL.

Guide to Using Temporal Tables in SQL 2016

Temporal tables give us the ability to add version history to our data. If you want to track data changes to a table, see what a value was at a specific point in time, find when a value was last changed, or union historic records with current table records, temporal tables can do all of that. Temporal tables can be created on new or existing tables. The history table is an exact replica of the table it tracks; you can think of it as a versioning table with the same definition as the table it's tracking, and the two are paired. This sounds more complicated than it is, so let's get to the scripts.

First we're going to create a new table that also includes our temporal table. 

CREATE TABLE dbo.Orders (
   Id int IDENTITY (1,1) PRIMARY KEY CLUSTERED
  ,OrderNumber int NOT NULL
  ,ProductID int NOT NULL
  ,SalesPersonID int NOT NULL
  ,Amount decimal (7,2) NOT NULL
  ,SalesDate datetime2(0) DEFAULT GETDATE()
  ,ValidFrom datetime2(2) GENERATED ALWAYS AS ROW START -- Required field, can be named whatever you like.
  ,ValidTo datetime2(2) GENERATED ALWAYS AS ROW END -- Required field, can be named whatever you like.
  ,PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo) )
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.Orders_Archive));

Let's look at what's new in our table definition. See the ValidFrom and ValidTo columns; these are referred to as period columns and are used to track when a row was valid. The data type must be datetime2 (here datetime2(2)), and these columns are required. The GENERATED ALWAYS AS ROW START / ROW END arguments specify the start and end of the period for which the record was valid. The PERIOD FOR SYSTEM_TIME argument specifies the names of the columns the system will use to record the valid period of a row of data, in this case ValidFrom and ValidTo. Finally we have the table option WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.Orders_Archive)). SYSTEM_VERSIONING enables versioning on the table you are creating, and HISTORY_TABLE lets us target a specific history table, which could be an existing history table or a new one. The above script will create a new history table named dbo.Orders_Archive.
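
The script above creates a brand new table. For an existing table, the general pattern is to add the period columns with defaults and then switch versioning on; here is a rough sketch where dbo.ExistingOrders and the constraint names are hypothetical:

-- Add the period columns (defaults populate existing rows), then enable versioning
ALTER TABLE dbo.ExistingOrders ADD
   ValidFrom datetime2(2) GENERATED ALWAYS AS ROW START NOT NULL
      CONSTRAINT DF_ExistingOrders_ValidFrom DEFAULT SYSUTCDATETIME()
  ,ValidTo datetime2(2) GENERATED ALWAYS AS ROW END NOT NULL
      CONSTRAINT DF_ExistingOrders_ValidTo DEFAULT CONVERT(datetime2(2), '9999-12-31 23:59:59.99')
  ,PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo);

ALTER TABLE dbo.ExistingOrders
   SET (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.ExistingOrders_Archive));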

Now that we have our temporal table setup, let's see how it works. First we need to insert some data, notice we do not specify values for the ValidFrom and ValidTo columns. 

INSERT INTO dbo.Orders (OrderNumber, ProductID, SalesPersonID, Amount)
SELECT 85462, 989, 11, 10.99
INSERT INTO dbo.Orders (OrderNumber, ProductID, SalesPersonID, Amount)
SELECT 85486, 325, 12, 14.95
INSERT INTO dbo.Orders (OrderNumber, ProductID, SalesPersonID, Amount)
SELECT 85501, 193, 11, 21.99 

Now let's see what's in the tables we just created. We won't have anything in the history table Orders_Archive, not until we actually modify data in the Orders table. Query the tables, then run the update, and then query both tables again to see the results. 

SELECT * FROM dbo.Orders;
SELECT * FROM dbo.Orders_Archive;

 --Now we need to update the Orders table.

UPDATE dbo.Orders
   SET Amount = 50.00
WHERE Id = 2; 

After the update we should now have a single row in the Orders_Archive history table. This row represents the data before it was modified and the period of time that it was valid, using the ValidFrom and ValidTo values.

Now what if we want to query the Orders table and this time include the history values? We can do this using the FOR SYSTEM_TIME clause in our FROM. Here we want to return all values that were valid between yesterday and tomorrow from both the parent and history table. Notice we don't have to query the Orders_Archive table directly; using FOR SYSTEM_TIME allows us to query both tables by referencing only the parent Orders table.

DECLARE 
   @Start datetime2(2) = DATEADD(DD, -1, GETDATE())
  ,@End datetime2(2) = DATEADD(DD, 1, GETDATE())
SELECT * 
FROM dbo.Orders
FOR SYSTEM_TIME BETWEEN @Start AND @End

Here are four of the temporal-specific sub-clauses we can use with the FOR SYSTEM_TIME clause to perform different time-based analysis:

AS OF <date_time>
FROM <start_date_time> TO <end_date_time>
BETWEEN <start_date_time> AND <end_date_time>
CONTAINED IN (<start_date_time> , <end_date_time>)

This gives a lot of flexibility to how we query the data. We can query just the Orders data, just the Orders_Archive data, or both.
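
For example, a minimal sketch using AS OF to see the rows exactly as they looked at a single point in time (the one-hour offset is arbitrary):

-- Return the rows as they existed one hour ago, from Orders and/or Orders_Archive as needed
DECLARE
   @PointInTime datetime2(2) = DATEADD(HH, -1, GETDATE());

SELECT *
FROM dbo.Orders
FOR SYSTEM_TIME AS OF @PointInTime;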

Some notes about table management:

  • You can create indexes on either the Orders or Orders_Archive table; creating an index on one does not create it on the other. The same goes for statistics.
  • Modifying a data type on the parent table Orders will automatically modify the data type in the history table, Orders_Archive.
  • Adding a new column to the Orders table will automatically add that column to the history table.
  • Before you can drop a table that has versioning enabled, you must first disable versioning. You can drop a parent table without dropping the history table; they are independent of each other once versioning is disabled.

-- Adding an index

CREATE NONCLUSTERED INDEX NCIX_SalesPersonID ON dbo.Orders (
SalesPersonID )
WITH (FILLFACTOR = 90);

-- Must be added to both

CREATE NONCLUSTERED INDEX NCIX_SalesPersonID ON dbo.Orders_Archive (
SalesPersonID )
WITH (FILLFACTOR = 90);

-- Altering the data type of an existing column or adding a new column is automatically applied to the history table

ALTER TABLE dbo.Orders ALTER COLUMN Amount decimal(12,2);
ALTER TABLE dbo.Orders ADD Region varchar(3);

-- Disable versioning and drop the table

ALTER TABLE dbo.Orders SET (SYSTEM_VERSIONING = OFF);
DROP TABLE dbo.Orders;
DROP TABLE dbo.Orders_Archive;

Temporal tables are a nice addition to SQL 2016, and believe it or not I think they are available in Standard Edition. Have fun!

Dynamically Build a MERGE Statement - Auto Generated MERGE

Once I worked on a project that involved moving data from DB2 to MSSQL. We set up jobs in SSIS to copy the data over to staging tables in MSSQL Server and then ran MERGE procedures to merge the data from our staging tables to production. We had more than 100 jobs that followed this pattern, so imagine all the work writing those merge statements. After writing a handful of those I decided to come up with a solution that would write them for me, so here you go: the auto-generated merge.

In the script below, we have a table definition that will hold details for your merge. I usually create this once and then truncate it each time I want to write a new merge statement. Following that we have our variables which define your target and source data. All you need to do is make sure the DynamicMerge table is created and plug in the database name, schema name, and table name of your target table and source table in your variables. Follow the naming convention in the example below. The merge statement assumes you are merging two tables that are identical, but of course you could modify it to meet your needs.

This script can handle NULL values in the WHEN MATCHED comparison because we use EXCEPT to find unmatched values between source and target, as illustrated below.
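
If you are wondering why EXCEPT rather than a plain <> comparison: EXCEPT treats two NULLs as equal, while <> evaluates to UNKNOWN when either side is NULL and would silently skip the update. A quick sketch of the difference:

-- EXCEPT: an empty result means the values match, even when both sides are NULL
SELECT 'changed' WHERE EXISTS (SELECT CAST(NULL AS int) EXCEPT SELECT CAST(NULL AS int)); -- returns nothing
SELECT 'changed' WHERE EXISTS (SELECT CAST(NULL AS int) EXCEPT SELECT 5);                 -- returns 'changed'

-- A <> comparison cannot detect the change when one side is NULL
SELECT 'changed' WHERE CAST(NULL AS int) <> 5;                                            -- returns nothing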

I like to output the results to text so I can review the statement before running it or placing it in a stored procedure. So when you run the query you will want to output to text and not grid (the default). To do this you will need to change a few settings in SSMS; more specifically, you need to change the query and text results options to allow more characters to be output to text. From the menu: Query > Query Options > Results > Text (uncheck 'Include Column Headers') and Tools > Options > Query Results > SQL Server > Results to Text (increase the maximum number of characters displayed in each column to something like 8000). If you don't make these two changes the output text will be truncated and/or you will have a bunch of dashes at the top of the script.

--Create the DynamicMerge table

CREATE TABLE Staging.dbo.DynamicMerge (
   TableCatalog varchar(500)
  ,TableSchema varchar(100)
  ,TableName varchar(500)
  ,ColumnName varchar(500)
  ,MergeOn varchar(1000)
  ,UpdateOn varchar(1000)
  ,InsertOn varchar(1000)
  ,ExceptSource varchar(500)
  ,ExceptTarget varchar(500)
  ,IsPK bit DEFAULT 0)

USE <DatabaseName>; -- This is the DB of your Target in the merge, make sure you set this.
GO

SET NOCOUNT ON;
GO

-- Truncate the DynamicMerge table every time you use it
TRUNCATE TABLE Staging.dbo.DynamicMerge;

---------------------------------------------------------

-- Insert details to the DynamicMerge table

DECLARE
   @TableCatalog varchar(500) = 'TargetDBName' -- This is the database name
  ,@TableSchema varchar(500) = 'TargetSchemaName' -- Schema for your target table
  ,@TableName varchar(500) = 'TargetTableName' -- This is the table name
  ,@TargetName varchar(1000) = 'DBName.SchemaName.TargetTableName' -- Three part name
  ,@SourceName varchar(1000) = 'DBName.SchemaName.SourceTableName' -- Three part name

INSERT INTO Staging.dbo.DynamicMerge (
   TableCatalog
  ,TableSchema
  ,TableName
  ,ColumnName
  ,MergeOn
  ,UpdateOn
  ,InsertOn
  ,ExceptSource
  ,ExceptTarget )

SELECT
   TABLE_CATALOG
  ,TABLE_SCHEMA
  ,TABLE_NAME
  ,COLUMN_NAME
  ,'Target1.' + COLUMN_NAME + ' = Source1.' + COLUMN_NAME -- MergeOn
  ,COLUMN_NAME + ' = Source1.' + COLUMN_NAME -- UpdateOn
  ,'Source1.' + COLUMN_NAME -- InsertOn
  ,'Source1.' + COLUMN_NAME -- ExceptSource
  ,'Target1.' + COLUMN_NAME -- ExceptTarget
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_CATALOG = @TableCatalog
   AND TABLE_SCHEMA = @TableSchema
   AND TABLE_NAME = @TableName; --target table name

--------------------------------------------------------

-- Update the Primary Key flag

UPDATE dt
   SET IsPK = 1
FROM Staging.dbo.DynamicMerge dt
INNER JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE ku ON dt.ColumnName =ku.COLUMN_NAME
INNER JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS tc ON tc.CONSTRAINT_TYPE ='PRIMARY KEY'
   AND tc.CONSTRAINT_NAME = ku.CONSTRAINT_NAME
   AND ku.TABLE_NAME = dt.TableName;

------------------------------------------------------------------

-- Check the Results (optional)

--SELECT * FROM Staging.dbo.DynamicMerge;

-----------------------------------------------------------------

-- Create the Column list variable

DECLARE @ColumnList varchar(max)
   SET @ColumnList = (
      SELECT STUFF((SELECT ',' + ColumnName + CHAR(10)
      FROM Staging.dbo.DynamicMerge x1
      WHERE x1.TableName = x2.TableName
      FOR XML PATH ('')), 1, 1,'') AS MergeOnColumns
      FROM Staging.dbo.DynamicMerge x2
GROUP BY TableName);

--------------------------------------------------------

-- Create the MergeOnSource variable - matching the unique key

DECLARE @MergeOn varchar(max)
   SET @MergeOn = (
      SELECT STUFF((SELECT 'AND ' + MergeOn + CHAR(10)
      FROM Staging.dbo.DynamicMerge x1
      WHERE x1.TableName = x2.TableName AND x1.IsPK = 1
      FOR XML PATH ('')), 1, 4,'') -- AS MergeOnColumns
      FROM Staging.dbo.DynamicMerge x2
      WHERE IsPK = 1
GROUP BY TableName);

-----------------------------------------

-- Create the Merge EXCEPT Target variable

DECLARE @MExceptTarget varchar(max)
   SET @MExceptTarget = (
      SELECT STUFF((SELECT ', ' + ExceptTarget
      FROM Staging.dbo.DynamicMerge x1
      WHERE x1.TableName = x2.TableName AND x1.IsPK = 0
      FOR XML PATH ('')), 1, 1,'') -- AS UpdateOnColumns
      FROM Staging.dbo.DynamicMerge x2
      WHERE IsPK = 0
GROUP BY TableName );

-----------------------------------------

-- Create the Merge EXCEPT Source variable

DECLARE @MExceptSource varchar(max)
   SET @MExceptSource = (
      SELECT STUFF((SELECT ', ' + ExceptSource
      FROM Staging.dbo.DynamicMerge x1
      WHERE x1.TableName = x2.TableName AND x1.IsPK = 0
      FOR XML PATH ('')), 1, 1,'') -- AS UpdateOnColumns
      FROM Staging.dbo.DynamicMerge x2
      WHERE IsPK = 0
GROUP BY TableName );

-----------------------------------

-- Create the Merge UPDATE variable

DECLARE @MUpdate varchar(max)
   SET @MUpdate = (
      SELECT STUFF((SELECT ',' + UpdateOn + CHAR(10)
      FROM Staging.dbo.DynamicMerge x1
      WHERE x1.TableName = x2.TableName AND x1.IsPK = 0
      FOR XML PATH ('')), 1, 1,'') -- AS UpdateOnColumns
      FROM Staging.dbo.DynamicMerge x2
      WHERE IsPK = 0
GROUP BY TableName);

-----------------------------------

-- Create the Merge INSERT variable

DECLARE @MInsert varchar(max)
   SET @MInsert = (
      SELECT STUFF((SELECT ',' + InsertOn + CHAR(10)
      FROM Staging.dbo.DynamicMerge x1
      WHERE x1.TableName = x2.TableName
      FOR XML PATH ('')), 1, 1,'') -- AS UpdateOnColumns
      FROM Staging.dbo.DynamicMerge x2
GROUP BY TableName);

--------------------------------------------------------------------------

/* Now build the MERGE statement. In the query results window, make sure to output to text and not grid. You will need to increase the character output from the defaults. This is explained at the beginning of this document. */

-- Note* CHAR(10) is a line feed control character. CHAR(13) is a carriage return.

SELECT 'MERGE ' + @TargetName + ' AS Target1' + CHAR(10)
   + 'USING (SELECT ' + @ColumnList + 'FROM ' + @SourceName + ') AS Source1' + CHAR(10)
   + CHAR(10)
   + 'ON (' + CHAR(10)
   + @MergeOn + ')' + CHAR(10)
   + CHAR(10)
   + 'WHEN MATCHED AND EXISTS' + CHAR(10)
   + '(' + CHAR(10)
   + ' SELECT ' + @MExceptSource + CHAR(10)
   + ' EXCEPT' + CHAR(10)
   + ' SELECT ' + @MExceptTarget + CHAR(10)
   + ')' + CHAR(10)
   + CHAR(10)
   + 'THEN UPDATE' + CHAR(10)
   + 'SET ' + @MUpdate
   + CHAR(10)
   + 'WHEN NOT MATCHED BY TARGET THEN INSERT (' + @ColumnList + ')' + CHAR(10)
   + CHAR(10)
   + 'VALUES (' + @MInsert + ')' + CHAR(10)
   + 'WHEN NOT MATCHED BY SOURCE THEN DELETE;' -- Optional DELETE if that is desired

This script has been tested from SQL 2008 through 2016. It was last used against a SQL 2016 server with 2017 SSMS. It works!

Calculate Year Over Year Variance - Subtract From Previous Row Using LAG

Say you want to summarize a value by year and find the variance between each year (row). Imagine you have a table that tracks company sales. You need to summarize sales totals by year and then compare the values year over year and find the variance. Here is a simple way to do this using the LAG function.

* Note * If your data does not have contiguous dates you will need to write different queries to fill in the missing dates. This is known as the missing dates problem and solutions to that can be found HERE

IF OBJECT_ID('tempdb..#SalesData') IS NOT NULL
DROP TABLE #SalesData
DECLARE @StartDate date = '2013-01-01';

WITH rCTE AS (
   SELECT
      CAST(2679 AS decimal(7,2)) AS SalesAmount
     ,@StartDate AS SalesDate
     ,CAST('nadia' as varchar) AS SalesPerson
     ,0 AS TransID

   UNION ALL

   SELECT
      CAST(SalesAmount + (SELECT CAST(CRYPT_GEN_RANDOM(1) AS INT) + 3 - CAST(CRYPT_GEN_RANDOM(1) AS INT)) AS decimal(7,2))
    ,DATEADD(dd, + 3, SalesDate)
    ,REVERSE(SalesPerson)
    ,TransID + 1
   FROM rCTE
   WHERE TransID < 500 )

SELECT
   SalesAmount
  ,SalesDate
  ,SalesPerson
INTO #SalesData
FROM rCTE
WHERE SalesAmount > 1
OPTION (MAXRECURSION 600)

--Let's take a look at the data

SELECT * FROM #SalesData

Now that we have our sales data table, we are going to summarize the data by year and load it into a temp table. Notice the ORDER BY; strictly speaking, though, the ordering that matters for LAG is the ORDER BY inside its OVER clause in the next query.

SELECT DISTINCT
   YEAR(SalesDate) SalesYear
  ,SUM(SalesAmount) OVER (PARTITION BY YEAR(SalesDate)) AS TotalSalesByYear
INTO #SalesTotals
FROM #SalesData
ORDER BY SalesYear ASC;

After we load our temp table with the summarized annual sales, we are going to use the LAG function to get the previous row's TotalSalesByYear -- this effectively creates a new column for the previous year's sales. You can see in the LAG function below we use the TotalSalesByYear value followed by a comma and the number 1. The number is the offset, or how many rows back you want to look from the current row. You can also specify a default value after the offset; this is the value to return if the offset row does not exist. By excluding that default value, NULL is returned, which is what we want so we don't get a divide-by-zero error when calculating the variance. Here is a link for LAG usage.

WITH CTEazyE AS (
   SELECT
      SalesYear
     ,TotalSalesByYear
     ,LAG(TotalSalesByYear, 1) OVER(ORDER BY SalesYear ASC) AS PreviousYearSales
   FROM #SalesTotals )    

-- Now we calculate the variance year over year.

SELECT
     SalesYear
    ,TotalSalesByYear
    ,PreviousYearSales
    ,(TotalSalesByYear - PreviousYearSales) AS VarianceAmount
    ,100 * (TotalSalesByYear - PreviousYearSales) / PreviousYearSales AS VariancePercent
FROM CTEazyE

There are other ways to calculate this, but the intent of this post was to introduce you to the LAG (and LEAD) function. Mind you, LAG is a SQL 2012+ feature.
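
For completeness, here is a minimal sketch of LEAD against the same #SalesTotals table; it works exactly like LAG but looks forward instead of backward:

-- LEAD looks ahead one row, giving us next year's sales on the current row
SELECT
   SalesYear
  ,TotalSalesByYear
  ,LEAD(TotalSalesByYear, 1) OVER (ORDER BY SalesYear ASC) AS NextYearSales
FROM #SalesTotals;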

Running Totals Using the OVER() Clause with ORDER BY

Window functions and the OVER clause have been around since 2005, but back in 2005 you couldn't use ORDER BY for aggregate functions in the OVER clause. As a matter of fact it wasn't until SQL 2012 that they introduced the ability to use ORDER BY with aggregates in your OVER() clause. Here's how you can use the ORDER BY in the OVER() clause to get running totals.

-- Generate some test data

DECLARE @StartDate date = '2014-01-01'
DECLARE @EndDate date = '2014-12-31';
WITH rCTE AS (
   SELECT
      CAST((500 * RAND()) AS decimal(7,2)) AS SalesAmount
     ,@StartDate AS SalesDate
     ,0 AS Level

   UNION ALL

   SELECT
      CAST(SalesAmount + ((SELECT CHECKSUM(NewId()) % 5) * RAND()) AS decimal(7,2))
     ,DATEADD(dd, + 7, SalesDate)
     ,Level + 1
FROM rCTE
WHERE Level < 52)

SELECT SalesAmount, SalesDate
INTO #SalesData
FROM rCTE

In the first query below we use 3 different OVER clauses to partition the data in different ways. In the first OVER clause we are partitioning by month. Since our sales dates are by day and not by month, we use the DATEPART function to get the month of the SalesDate and partition on that. This gives us the total sales for each month.

Next we want running totals by month. In the second OVER clause we are again using the DATEPART function, but this time we partition by year instead of month and add an ORDER BY on the DATEPART month. This takes the total sales amount and accumulates it in month order, giving us running totals by month -- the key here is the ORDER BY, which is what turns the aggregate into a running total.

The last partition is the same as the second partition except it doesn't include an ORDER BY, so it shows us a total for the entire year.

We use DISTINCT in the SELECT to eliminate duplicate rows. The first query also has a WHERE clause to limit the year to 2014 only.

-- Running totals by month for a single year

SELECT DISTINCT
   DATEPART(YEAR, SalesDate) AS SalesYear
  ,DATEPART(MONTH, SalesDate) AS SalesMonth
  ,SUM(SalesAmount) OVER (PARTITION BY DATEPART(MM,SalesDate)) AS [Totals by Month]
  ,SUM(SalesAmount) OVER (PARTITION BY DATEPART(YY,SalesDate) ORDER BY DATEPART(MM,SalesDate)) AS [Running Totals by Month]
  ,SUM(SalesAmount) OVER (PARTITION BY DATEPART(YY,SalesDate)) AS [Total for the Year]
FROM #SalesData
WHERE DATEPART(YEAR, SalesDate) = 2014

Now we'll look at the same query, but this time remove the year from the WHERE clause so we have months over multiple years. Since we are expanding to multiple years, if we want a simple total by month we need to include both year and month in the PARTITION BY. For running totals, nothing changes from the first query.

-- Running totals by year and month over multiple years

SELECT DISTINCT
   DATEPART(YEAR, SalesDate) AS SalesYear
  ,DATEPART(MONTH, SalesDate) AS SalesMonth
  ,SUM(SalesAmount) OVER (PARTITION BY DATEPART(YEAR,SalesDate), DATEPART(MONTH,SalesDate)) AS [Totals by Month]
  ,SUM(SalesAmount) OVER (PARTITION BY DATEPART(YEAR,SalesDate) ORDER BY DATEPART(MM,SalesDate)) AS [Running Totals by Month]
  ,SUM(SalesAmount) OVER (PARTITION BY DATEPART(YEAR,SalesDate)) AS [Total for the Year]
FROM #SalesData
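
One last note: when an aggregate OVER() clause has an ORDER BY but no explicit frame, SQL Server defaults to RANGE UNBOUNDED PRECEDING, which treats rows that tie on the ORDER BY value as one group; that is exactly why every row in the same month carries the same running total above, letting DISTINCT collapse them. If you ever want a strictly row-by-row running total instead, spell out the frame. A minimal sketch against the same #SalesData table:

-- Per-row running total within each year, accumulating in sale date order
SELECT
   SalesDate
  ,SalesAmount
  ,SUM(SalesAmount) OVER (PARTITION BY DATEPART(YEAR, SalesDate)
                          ORDER BY SalesDate
                          ROWS UNBOUNDED PRECEDING) AS RunningTotalByDay
FROM #SalesData
ORDER BY SalesDate;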

An easy way to get running totals in SQL 2012+.

Recursive Common Table Expressions - The Recursive CTE

I've seen a lot of posts online on how recursive CTEs work. Almost all of those examples use the same set of data and the same scenario – organizational hierarchy for AdventureWorks employees or something similar. It's not a bad example, but I wanted to post something a little different. In this post we will demonstrate exactly how the recursive CTE works, from simple to practical examples, none of which use the organizational hierarchy examples you will find everywhere else.

First, how do we define a recursive CTE (Common Table Expression)?

A recursive CTE is a CTE that is self-referencing. The CTE is made up of an anchor member query and a recursive member query, and recursion is made possible using a UNION ALL between the two members. The recursive member query will always reference the CTE itself; that self-reference is what produces each subsequent row.

The following example is the simplest way of demonstrating the way a recursive CTE works. All we are doing is taking a starting number (0) and adding 5 to it until we hit 50. The anchor query SELECTs the @StartingNumber variable, in this case 0. The recursive member query SELECTs FirstNumber and adds 5 to it (FirstNumber is the alias we defined for @StartingNumber in the anchor query). The anchor query returns the first result and then the recursive member uses that result to get the next result. The anchor result is 0, the recursive member adds 5 to that result for a total of 5.   

A way to visualize this is to imagine you have a temp table associated with your anchor query. The anchor query writes the first result to this temp table and then the recursive member uses that result to produce the next row. The new row is inserted into the temp table and the recursive member keeps on processing. So 0 is the first row, + 5 = 5. Now 5 is the next row, + 5 = 10. Now 10 is the next row, + 5 = 15. Now 15 is the next row, and so on.

With this behavior in mind you can imagine how an incorrectly constructed query could run an infinite number of times. In our example below we have a WHERE clause in the recursive member that will limit our results to a max value of 50.

DECLARE @StartingNumber int = 0
DECLARE @EndingNumber int = 50;

WITH CTEazyE AS (
-- Anchor member

   SELECT
      @StartingNumber AS FirstNumber -- FirstNumber is the alias we will reference in the recursive query

-- The union is required for recursion

   UNION ALL

--Recursive member

   SELECT
      FirstNumber + 5 -- Adds 5 to the FirstNumber value
   FROM CTEazyE
   WHERE FirstNumber < @EndingNumber ) -- Applying a filter to stop recursion.

SELECT *
FROM CTEazyE

Go ahead and make changes to the query above so you can see the different results. If you are comfortable with execution plans, have a look at the plan for this query; it will better explain how the recursion behaves. Remember that temp table I mentioned earlier? That is going to be your table spool in the execution plan.
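
One more safeguard worth knowing about: SQL Server caps recursion at 100 levels by default and raises an error if a recursive CTE goes deeper. You can raise or remove that cap with the MAXRECURSION query hint; here is a minimal sketch using the counting example above:

-- Counting to 1000 in steps of 5 needs 200 recursion levels, more than the default cap of 100
DECLARE @StartingNumber int = 0
DECLARE @EndingNumber int = 1000;

WITH CTEazyE AS (
   SELECT @StartingNumber AS FirstNumber
   UNION ALL
   SELECT FirstNumber + 5
   FROM CTEazyE
   WHERE FirstNumber < @EndingNumber )

SELECT *
FROM CTEazyE
OPTION (MAXRECURSION 300); -- 0 removes the limit entirely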

Now let's look at another example, this time with dates. In the following query we want to return each month (date) for the last 12 months. The query uses a start date, in this example 12 months ago today. It then adds one month to the anchor date to get the next date value. If your starting value is (today's date - 12 months), adding 1 month to that gives you (today's date - 11 months). It keeps doing this until it reaches the current date.

DECLARE @StartDate date = DATEADD(mm, -12, GETDATE())
DECLARE @Enddate date = GETDATE()

;WITH CTEazyE AS (
   SELECT
      @StartDate AS StartDate      

   UNION ALL

   SELECT
      DATEADD(mm,1,StartDate)
   FROM CTEazyE
   WHERE StartDate < @Enddate )          

SELECT *
FROM CTEazyE

Let's look at one more example. Say a developer approaches you and says they need to delete a row from a production table. You know this table has other tables referencing it with foreign keys, but you don't know which ones off the top of your head. It's also possible there are further tables referencing the tables that reference this main table. You don't use cascading deletes, so you need to delete records manually in each of these tables. You need to produce a foreign key chain for all tables involved in this delete.

In the script below we use the @PTable variable to define the target parent table -- this is the table you want to delete the row from. The anchor query returns the object_id and table name of the target parent table. We define the rest of the columns and populate them with NULL values. These NULL fields will get actual values from the recursive member query on each execution that follows the anchor. The key here is the INNER JOIN: we are joining the referenced_object_id field from the recursive query to the ObjectID of the anchor query. The referenced_object_id is the parent table in a foreign key, so each time we produce a new object id we are going to get all objects that reference that object id.

In the example below we are creating a few test tables with FK constraints. Table t2 references t1 and t3 references t2. This query will allow us to see the foreign key chain using our recursive CTE.

USE <databaseName>;


CREATE TABLE t1 (ID INT NOT NULL IDENTITY(1,1) PRIMARY KEY, Blah NVARCHAR(MAX));
CREATE TABLE t2 (ID INT NOT NULL IDENTITY(1,1) PRIMARY KEY, t1ID INT NOT NULL, Blah NVARCHAR(MAX), CONSTRAINT FK_t2 FOREIGN KEY (t1ID) REFERENCES t1 (ID) );
CREATE TABLE t3 (ID INT NOT NULL IDENTITY(1,1) PRIMARY KEY, t2ID INT NOT NULL, Blah NVARCHAR(MAX), CONSTRAINT FK_t3 FOREIGN KEY (t2ID) REFERENCES t2 (ID) );

GO

DECLARE @PTable nvarchar(250) = 't1';

WITH CTEazyE AS (
   SELECT
      [object_id] AS ObjectID
      ,name AS TableName
      ,CAST(NULL AS sysname) AS ForeignKeyName
      ,NULL AS ReferencedTableID
      ,CAST(NULL AS sysname) AS ReferencedTableName
   FROM sys.tables
   WHERE name = @PTable

   UNION ALL

   SELECT
      fk.parent_object_id AS ID
     ,OBJECT_NAME(fk.parent_object_id) AS name
     ,fk.name
     ,ct.ObjectID AS ReferencedTableID
     ,ct.TableName AS ReferencedTableName
FROM sys.foreign_keys fk
INNER JOIN CTEazyE ct ON fk.referenced_object_id = ct.ObjectID )

SELECT DISTINCT
   ObjectID
   ,TableName
   ,ForeignKeyName
   ,ReferencedTableID
   ,ReferencedTableName
FROM CTEazyE
ORDER BY ReferencedTableName;

DROP TABLE t3;
DROP TABLE t2;
DROP TABLE t1;

In the results you should see the first row as the parent table and then t2 which references the parent table. Since t3 references t2 you will also see t3 in the results. The recursive CTE will keep going until it doesn't find any more constraints between tables. You can try this out on a table you know to see more meaningful results.

Recursive CTEs are a nice tool but need to be used with caution, like any recursive type of query. 

Working with XML in SQL Server

Sometimes we need to work with XML inside of SQL. This guide will demonstrate some ways you might store and retrieve XML data from SQL.

For this first example, we are going to store the XML data in an nvarchar field and then cast it to xml when we query the data.

CREATE TABLE #FormData (
   FormID int IDENTITY(1,1)
  ,FormDetailXML nvarchar(1024) );
  
INSERT INTO #FormData (FormDetailXML)
VALUES ('<form><Organization>Contoso</Organization><Contact>Jake Jones</Contact><Title>Manager</Title><EmailFrom>JakeJones@contoso.com</EmailFrom><Phone>555-555-5555</Phone><Address>125 Cherry Ln</Address><City>NY</City><States>NY</States><Zip>99999-9999</Zip></form>');

INSERT INTO #FormData (FormDetailXML)
VALUES ('<form><Organization>Northwind</Organization><Contact>Bill Blass</Contact><Title>Director</Title><EmailFrom>bblass@northwind.com.com</EmailFrom><Phone>333-333-3333</Phone><Address>1024 Bytes Blvd</Address><City>LA</City><States>CA</States><Zip>77777-7777</Zip></form>');

Now that we have our XML strings stored in the table, let's cast the FormDetailXML to XML and view the results. Click on the XML link returned from the query; this will show you an organized view of the data.

SELECT 
   FormID
  ,CAST(FormDetailXML AS xml) AS FormDetailXML
FROM #FormData

Now we are going to SELECT from the table, this time parsing the XML into individual columns per XML element. XML will always have a Node, typically with Elements and/or Attributes. "form" is the Node, and inside the Node are the different Elements. Organization, Contact, Title, Phone, Address, Zip, etc. -- these are all the Elements we want to parse out into individual columns.

Here is our query to parse our XML Elements to individual columns.

SELECT
   FormID
  ,CAST(FormDetailXML AS XML).value('data(/form/Organization)[1]','varchar(100)') AS Organization
  ,CAST(FormDetailXML AS XML).value('data(/form/Contact)[1]','varchar(100)') AS Contact
  ,CAST(FormDetailXML AS XML).value('data(/form/Title)[1]','varchar(100)') AS Title
  ,CAST(FormDetailXML AS XML).value('data(/form/EmailFrom)[1]','varchar(100)') AS EmailAddress
  ,CAST(FormDetailXML AS XML).value('data(/form/Phone)[1]','varchar(100)') AS PhoneNumber
  ,CAST(FormDetailXML AS XML).value('data(/form/Address)[1]','varchar(100)') AS HomeAddress
  ,CAST(FormDetailXML AS XML).value('data(/form/City)[1]','varchar(100)') AS CityCode
  ,CAST(FormDetailXML AS XML).value('data(/form/States)[1]','varchar(100)') AS StateCode
  ,CAST(FormDetailXML AS XML).value('data(/form/Zip)[1]','varchar(100)') AS Zipcode
FROM #FormData

Let's break down a single XQuery line from the query above; we'll use the first line, for Organization:

CAST(FormDetailXML AS XML).value('data(/form/Organization)[1]','varchar(100)') AS Organization

First we are casting the FormDetailXML data that is stored as nvarchar to XML:

CAST(FormDetailXML AS XML)

Next we use the value() method to perform an XQuery against our XML target, in this example 'data(/form/Organization)[1]':

CAST(FormDetailXML AS XML).value

Ref: https://msdn.microsoft.com/en-us/library/ms178030(v=sql.105).aspx

Looking at the structure of our XQuery, we start with the word "data" followed by an open paren. data() is the data function and returns the typed value for each item specified.

('data(/form/Organization)[1]','varchar(100)')

Ref: https://msdn.microsoft.com/en-us/library/ms187038.aspx

Following the data function you see "/form"; that is our XML Node. This is followed by "/Organization"; that is our XML Element. Finally we have our data type, in this example varchar(100). This is the data type you want to render the parsed Element as.

Note - There is another way to write this query, performing the CAST once in a derived table instead of on every line. Here is that version of the query:

SELECT
   XMLFormDetails.value('data(/form/Organization)[1]','varchar(100)') AS Organization
  ,XMLFormDetails.value('data(/form/Contact)[1]','varchar(100)') AS Contact
  ,XMLFormDetails.value('data(/form/Title)[1]','varchar(100)') AS Title
  ,XMLFormDetails.value('data(/form/EmailFrom)[1]','varchar(100)') AS EmailAddress
  ,XMLFormDetails.value('data(/form/Phone)[1]','varchar(100)') AS PhoneNumber
  ,XMLFormDetails.value('data(/form/Address)[1]','varchar(100)') AS HomeAddress
  ,XMLFormDetails.value('data(/form/City)[1]','varchar(100)') AS CityCode
  ,XMLFormDetails.value('data(/form/States)[1]','varchar(100)') AS StateCode
  ,XMLFormDetails.value('data(/form/Zip)[1]','varchar(100)') AS Zipcode
FROM (SELECT CAST(FormDetailXML AS xml) AS XMLFormDetails FROM #FormData) AS fdx

Drop our temp table.

DROP TABLE #FormData;

This should give you a basic understanding of how to get Element values from Nodes in an XML string stored in SQL, but what if you have XML that also has attributes and you want to return values based on the Attribute of a given Element?

If you have ever tried to parse the output from an Extended Events session you probably already have some scripts that help you accomplish this. Let's look at what we need to do to parse an XML string and retrieve values from Elements identified by specific Attributes. We will use the output from an Extended Events session in our next example.

CREATE TABLE #XESessionData (
   FormID int IDENTITY(1,1)
  ,EventDetails nvarchar(1024) );

Below is data captured from an Extended Events session, which we are going to insert into our table:

INSERT INTO #XESessionData (EventDetails)
VALUES ('<event name="sql_statement_completed" package="sqlserver" timestamp="2015-02-13T20:56:19.482Z"><data name="duration"><value>23</value></data>
<data name="cpu_time"><value>889745</value></data><data name="physical_reads"><value>1245</value></data>
<data name="logical_reads"><value>22546</value></data><data name="writes"><value>17</value></data>
<action name="collect_system_time" package="package0"><value>2015-02-13T20:56:19.482Z</value></action>
<action name="session_id" package="sqlserver"><value>71</value></action><action name="database_name" package="sqlserver"><value>SQLAdmin</value></action>
<action name="sql_text" package="sqlserver"><value>use [SQLAdmin]</value></action></event>');

Take a look at the data in its stored form and cast as XML. Click on the XML link to get a better view of the data.

SELECT EventDetails FROM #XESessionData;

SELECT CAST(EventDetails AS xml) AS EventDetailsXML FROM #XESessionData;

--Here is the query you'd use to parse the XML and get your specific values for certain attributes.

SELECT
   xexml.value('(./action[@name="collect_system_time"]/value)[1]', 'varchar(4000)') as EventCollectTime
  ,xexml.value('(./action[@name="session_id"]/value)[1]', 'nvarchar(500)') as SessionID
  ,xexml.value('(./action[@name="database_name"]/value)[1]', 'nvarchar(500)') as DatabaseName
  ,xexml.value('(./data[@name="duration"]/value)[1]', 'bigint') as Duration
  ,xexml.value('(./data[@name="cpu_time"]/value)[1]', 'bigint') as CPUTime
  ,xexml.value('(./data[@name="physical_reads"]/value)[1]', 'bigint') as PhysicalReads
  ,xexml.value('(./data[@name="logical_reads"]/value)[1]', 'bigint') as LogicalReads
  ,xexml.value('(./data[@name="writes"]/value)[1]', 'bigint') as Writes
  ,xexml.value('(./action[@name="sql_text"]/value)[1]', 'varchar(4000)') as SQLText
FROM (SELECT CAST(EventDetails AS XML) XMLEventData FROM #XESessionData ) AS EventTable
CROSS APPLY XMLEventData.nodes('/event') n (xexml)

In the example above, we wrote this query in the same format as the second query in our first example. Here we also used the nodes() method to target the parent node "/event". Our XML has "data" elements, each with a "name" attribute and a "value" child element, and it has "action" elements structured the same way.

Ref: https://msdn.microsoft.com/en-us/library/ms188282.aspx

In the example below we are targeting the "data" element whose name attribute is "duration", and we are returning the value found in its "value" child element.

xexml.value('(./data[@name="duration"]/value)[1]', 'bigint')
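
In these examples the name attribute is only used as a filter. If you want to return an attribute's own value, reference it with the @ syntax; here is a small sketch pulling the name and timestamp attributes directly off the event element:

-- Read attribute values straight from the <event> element using @
SELECT
   xexml.value('@name', 'varchar(200)') AS EventName
  ,xexml.value('@timestamp', 'varchar(30)') AS EventTimestamp
FROM (SELECT CAST(EventDetails AS XML) XMLEventData FROM #XESessionData ) AS EventTable
CROSS APPLY XMLEventData.nodes('/event') n (xexml)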

Working through the above queries should give you a solid start in working with XML in SQL.

DROP TABLE #XESessionData