SQL 数据仓库中的 Group By 选项Group by options in SQL Data Warehouse

有关在开发解决方案时实现 Azure SQL 数据仓库中的 Group By 选项的技巧。Tips for implementing group by options in Azure SQL Data Warehouse for developing solutions.

GROUP BY 的作用是什么?What does GROUP BY do?

GROUP BY T-SQL 子句用于将数据聚合成摘要行集。The GROUP BY T-SQL clause aggregates data to a summary set of rows. GROUP BY 具有 SQL 数据仓库不支持的一些选项。GROUP BY has some options that SQL Data Warehouse does not support. 这些选项都有解决方法。These options have workarounds.

这些选项包括:These options are

  • 带 ROLLUP 的 GROUP BYGROUP BY with ROLLUP
  • GROUPING SETSGROUPING SETS
  • 带 CUBE 的 GROUP BYGROUP BY with CUBE

Rollup 和 grouping sets 选项Rollup and grouping sets options

此处最简单的选项是改为使用 UNION ALL 来执行汇总,而不是依赖显式语法。The simplest option here is to use UNION ALL instead to perform the rollup rather than relying on the explicit syntax. 结果应完全相同The result is exactly the same

下面的示例使用了具有 ROLLUP 选项的 GROUP BY 语句:The following example using the GROUP BY statement with the ROLLUP option:

SELECT [SalesTerritoryCountry]
,      [SalesTerritoryRegion]
,      SUM(SalesAmount)             AS TotalSalesAmount
FROM  dbo.factInternetSales s
JOIN  dbo.DimSalesTerritory t       ON s.SalesTerritoryKey       = t.SalesTerritoryKey
GROUP BY ROLLUP (
                        [SalesTerritoryCountry]
                ,       [SalesTerritoryRegion]
                )
;

通过使用 ROLLUP,前面的示例请求以下聚合:By using ROLLUP, the preceding example requests the following aggregations:

  • 国家/地区和区域Country and Region
  • 国家/地区Country
  • 总计Grand Total

若要替换 ROLLUP 并返回相同的结果,可以使用 UNION ALL 并显式指定所需的聚合:To replace ROLLUP and return the same results, you can use UNION ALL and explicitly specify the required aggregations:

SELECT [SalesTerritoryCountry]
,      [SalesTerritoryRegion]
,      SUM(SalesAmount) AS TotalSalesAmount
FROM  dbo.factInternetSales s
JOIN  dbo.DimSalesTerritory t     ON s.SalesTerritoryKey       = t.SalesTerritoryKey
GROUP BY
       [SalesTerritoryCountry]
,      [SalesTerritoryRegion]
UNION ALL
SELECT [SalesTerritoryCountry]
,      NULL
,      SUM(SalesAmount) AS TotalSalesAmount
FROM  dbo.factInternetSales s
JOIN  dbo.DimSalesTerritory t     ON s.SalesTerritoryKey       = t.SalesTerritoryKey
GROUP BY
       [SalesTerritoryCountry]
UNION ALL
SELECT NULL
,      NULL
,      SUM(SalesAmount) AS TotalSalesAmount
FROM  dbo.factInternetSales s
JOIN  dbo.DimSalesTerritory t     ON s.SalesTerritoryKey       = t.SalesTerritoryKey;

若要替换 GROUPING SETS,示例原则也适用。To replace GROUPING SETS, the sample principle applies. 只需要为希望查看的聚合级别创建 UNION ALL 部分。You only need to create UNION ALL sections for the aggregation levels you want to see.

Cube 选项Cube options

可以使用 UNION ALL 方法创建 GROUP BY WITH CUBE。It is possible to create a GROUP BY WITH CUBE using the UNION ALL approach. 问题在于,代码可能很快就会变得庞大且失控。The problem is that the code can quickly become cumbersome and unwieldy. 若要避免此情况,可以使用这种更高级的方法。To mitigate this, you can use this more advanced approach.

使用上述示例。Let's use the example above.

第一步是定义“cube”,它定义我们想要创建的所有聚合级别。The first step is to define the 'cube' that defines all the levels of aggregation that we want to create. 请务必记下两个派生表的 CROSS JOIN。It is important to take note of the CROSS JOIN of the two derived tables. 这样就会生成所有级别。This generates all the levels for us. 剩余代码确实可以设置格式。The rest of the code is really there for formatting.

CREATE TABLE #Cube
WITH
(   DISTRIBUTION = ROUND_ROBIN
,   LOCATION = USER_DB
)
AS
WITH GrpCube AS
(SELECT    CAST(ISNULL(Country,'NULL')+','+ISNULL(Region,'NULL') AS NVARCHAR(50)) as 'Cols'
,          CAST(ISNULL(Country+',','')+ISNULL(Region,'') AS NVARCHAR(50))  as 'GroupBy'
,          ROW_NUMBER() OVER (ORDER BY Country) as 'Seq'
FROM       ( SELECT 'SalesTerritoryCountry' as Country
             UNION ALL
             SELECT NULL
           ) c
CROSS JOIN ( SELECT 'SalesTerritoryRegion' as Region
             UNION ALL
             SELECT NULL
           ) r
)
SELECT Cols
,      CASE WHEN SUBSTRING(GroupBy,LEN(GroupBy),1) = ','
            THEN SUBSTRING(GroupBy,1,LEN(GroupBy)-1)
            ELSE GroupBy
       END AS GroupBy  --Remove Trailing Comma
,Seq
FROM GrpCube;

下面显示了 CTAS 的结果:The following shows the results of the CTAS:

按多维数据集分组

第二步是指定目标表用于存储临时结果:The second step is to specify a target table to store interim results:

DECLARE
 @SQL NVARCHAR(4000)
,@Columns NVARCHAR(4000)
,@GroupBy NVARCHAR(4000)
,@i INT = 1
,@nbr INT = 0
;
CREATE TABLE #Results
(
 [SalesTerritoryCountry] NVARCHAR(50)
,[SalesTerritoryRegion]  NVARCHAR(50)
,[TotalSalesAmount]      MONEY
)
WITH
(   DISTRIBUTION = ROUND_ROBIN
,   LOCATION = USER_DB
)
;

第三步是是循环访问执行聚合的列 cube。The third step is to loop over our cube of columns performing the aggregation. 查询为 #Cube 临时表中的每个行运行一次,并将结果存储在 #Results 临时表中The query will run once for every row in the #Cube temporary table and store the results in the #Results temp table

SET @nbr =(SELECT MAX(Seq) FROM #Cube);

WHILE @i<=@nbr
BEGIN
    SET @Columns = (SELECT Cols    FROM #Cube where seq = @i);
    SET @GroupBy = (SELECT GroupBy FROM #Cube where seq = @i);

    SET @SQL ='INSERT INTO #Results
              SELECT '+@Columns+'
              ,      SUM(SalesAmount) AS TotalSalesAmount
              FROM  dbo.factInternetSales s
              JOIN  dbo.DimSalesTerritory t  
              ON s.SalesTerritoryKey = t.SalesTerritoryKey
              '+CASE WHEN @GroupBy <>''
                     THEN 'GROUP BY '+@GroupBy ELSE '' END

    EXEC sp_executesql @SQL;
    SET @i +=1;
END

最后,可以通过简单地从 #Results 临时表进行读取来返回结果Lastly, you can return the results by simply reading from the #Results temporary table

SELECT *
FROM #Results
ORDER BY 1,2,3
;

通过将代码拆分成不同的部分并生成循环构造,代码将更好管理和维护。By breaking the code up into sections and generating a looping construct, the code becomes more manageable and maintainable.

后续步骤Next steps

有关更多开发技巧,请参阅 开发概述For more development tips, see development overview.