具体化视图(预览版)Materialized views (preview)

具体化视图公开了对源表的聚合查询。Materialized views expose an aggregation query over a source table. 具体化视图始终返回聚合查询的最新结果(始终是全新的)。Materialized views always return an up-to-date result of the aggregation query (always fresh). 查询具体化视图比直接对源表运行聚合的性能更高,每个查询都会执行该聚合。Querying a materialized view is more performant than running the aggregation directly over the source table, which is performed each query.

备注

具体化视图有一些限制,无法保证适用于所有方案。Materialized views have some limitations, and aren't guaranteed to work well for all scenarios. 使用此功能之前,请查看性能注意事项Review the performance considerations before working with the feature.

使用以下命令来管理具体化视图:Use the following commands to manage materialized views:

为何使用具体化视图?Why use materialized views?

通过为常用聚合的具体化视图投资资源(数据存储、后台 CPU 周期),可以获得以下优势:By investing resources (data storage, background CPU cycles) for materialized views of commonly-used aggregations, you get the following benefits:

  • 性能提升: 对于相同的聚合函数,查询具体化视图通常比查询源表性能更好。Performance improvement: Querying a materialized view commonly performs better than querying the source table for the same aggregation function(s).

  • 时效性: 具体化视图查询始终返回最新结果,而不受上次进行具体化的时间的影响。Freshness: A materialized view query always returns the most up-to-date results, independent of when materialization last took place. 查询组合了视图的具体化部分和源表中尚未具体化的记录(delta 部分),始终提供最新结果。The query combines the materialized part of the view with the records in the source table, which haven't yet been materialized (the delta part), always providing the most up-to-date results.

  • 成本缩减: 与对源表执行聚合操作相比,查询具体化视图消耗群集中的资源较少。Cost reduction: Querying a materialized view consumes less resources from the cluster than doing the aggregation over the source table. 如果只需要聚合,则可以减少源表的保留策略。Retention policy of source table can be reduced if only aggregation is required. 此设置可减少源表的热缓存开销。This setup reduces hot cache costs for the source table.

具体化视图用例Materialized views use cases

下面是可以使用具体化视图解决的常见方案:The following are common scenarios that can be addressed by using a materialized view:

  • 使用 arg_max()(聚合函数)查询每个实体的最后一条记录。Query last record per entity using arg_max() (aggregation function).
  • 使用 any()(聚合函数)消除表中的重复记录。De-duplicate records in a table using any() (aggregation function).
  • 通过对原始数据计算定期统计信息来减少数据的解析。Reduce the resolution of data by calculating periodic statistics over the raw data. 按时间段使用各种聚合函数Use various aggregation functions by period of time.
    • 例如,使用 T | summarize dcount(User) by bin(Timestamp, 1d) 维护每天不同用户的最新快照。For example, use T | summarize dcount(User) by bin(Timestamp, 1d) to maintain an up-to-date snapshot of distinct users per day.

有关所有用例的示例,请参阅具体化视图 create 命令For examples of all use cases, see materialized view create command.

具体化视图的工作原理How materialized views work

具体化视图由两个组件组成:A materialized view is made of two components:

  • 具体化部分 - 一个Azure 数据资源管理器表,其中包含源表中已处理的聚合记录。A materialized part - an Azure Data Explorer table holding aggregated records from the source table, which have already been processed. 该表始终按聚合的分组依据组合保存单个记录。This table always holds a single record per the aggregation's group-by combination.
  • 增量 - 源表中尚未处理的新引入记录。A delta - the newly ingested records in the source table that haven't yet been processed.

查询具体化视图会将具体化部分与增量部分组合起来,提供聚合查询的最新结果。Querying the materialized view combines the materialized part with the delta part, providing an up-to-date result of the aggregation query. 脱机具体化过程将增量中的新记录引入到具体化表,替换现有记录。The offline materialization process ingests new records from the delta to the materialized table, and replaces existing records. 替换是通过重新生成保存待替换记录的区来完成的。The replacement is done by rebuilding extents that hold records to replace. 如果增量中的记录与具体化部分中的所有数据分片持续相交,则每个具体化循环都需要重新生成整个具体化部分,可能无法跟上引入速率。If records in the delta constantly intersect with all data shards in the materialized part, each materialization cycle will require rebuilding the entire materialized part, and may not keep up with the ingestion rate. 在这种情况下,视图会变得不正常,增量会不断增长。In that case, the view will become unhealthy and the delta will constantly grow. 监视部分介绍了如何排查此类情况的问题。The monitoring section explains how to troubleshoot such situations.

具体化视图查询Materialized views queries

  • 查询具体化视图的主要方法是按其名称进行查询,就像查询表引用一样。The primary way of querying a materialized view is by its name, like querying a table reference. 查询具体化视图时,它会将视图的具体化部分与源表中尚未具体化的记录组合在一起 (delta)。When the materialized view is queried, it combines the materialized part of the view with the records in the source table that haven't been materialized yet (the delta). 查询具体化视图时,会始终根据引入到源表的所有记录返回最新结果。Querying the materialized view will always return the most up-to-date results, based on all records ingested to the source table. 有关具体化视图中的具体化和非具体化部分的详细信息,请参阅具体化视图的工作原理For more information about the materialized vs. non-materialized parts in materialized view, see how materialized views work.

    • 将具体化部分与 delta 合并会在幕后执行联接。Combining the materialized part with the delta executes a join behind the scenes. 默认情况下,此联接不是随机的。By default, this join isn't shuffled. 可以通过添加名为 materialized_view_shuffle客户端请求属性来强制查询的随机执行。You can force shuffling of the query by adding a client request property named materialized_view_shuffle. 请参阅以下示例See examples below. 添加此属性有时可以显示提高具体化视图查询的性能。Adding this property can sometimes significantly improve the performance of the materialized view query. 将来,将基于具体化视图的当前状态自动推导随机执行的需求,但现在这需要显式的提示。In the future, the need for shuffling will be automatically deduced based on the materialized view's current state, but for now this requires an explicit "hint".
  • 查询该视图的另一种方法是使用 materialized_view() 函数Another way of querying the view is by using the materialized_view() function. 此选项支持仅查询该视图的具体化部分,同时指定用户愿意容忍的最大延迟。This option supports querying only the materialized part of the view, while specifying the max latency the user is willing to tolerate. 此选项不保证返回最新记录,但与查询整个视图相比,此选项始终更加高效。This option isn't guaranteed to return the most up-to-date records, but it should always be more performant than querying the entire view. 此函数适用于你愿意舍弃一些时效性以提高性能的方案,例如,适用于遥测仪表板。This function is useful for scenarios in which you're willing to sacrifice some freshness for performance, for example for telemetry dashboards.

视图可以参与跨群集或跨数据库查询,但不包括在通配符联合或搜索中。The view can participate in cross-cluster or cross-database queries, but aren't included in wildcard unions or searches.

示例Examples

  1. 查询整个视图。Query the entire view. 包括源表中的最新记录:The most recent records in source table are included:

    ViewName
    
  2. 查询整个视图,并提供提示以使用 shuffle 策略。Query the entire view, and provide a "hint" to use shuffle strategy. 包括源表中的最新记录:The most recent records in source table are included:

    • 示例 #1:基于 Id 列随机执行(类似于使用 hint.shufflekey=Id):Example #1: shuffle based on the Id column (similarly to using hint.shufflekey=Id):
    set materialized_view_shuffle = dynamic([{"Name" : "ViewName", "Keys" : [ "Id" ] }]);
    ViewName
    
    • 示例 #2:基于所有键随机执行(类似于使用 hint.strategy=shuffle):Example #2: shuffle based on all keys (similarly to using hint.strategy=shuffle):
    set materialized_view_shuffle = dynamic([{"Name" : "ViewName" }]);
    ViewName
    
  3. 仅查询该视图的具体化部分,而不考虑其上次进行具体化的时间。Query the materialized part of the view only, regardless of when it was last materialized.

    materialized_view("ViewName")
    

性能注意事项Performance considerations

可能会影响具体化视图运行状况的主要参与者有:The main contributors that can impact a materialized view health are:

  • 群集资源: 与群集上运行的任何其他进程一样,具体化视图使用群集中的资源(CPU、内存)。Cluster resources: Like any other process running on the cluster, materialized views consume resources (CPU, memory) from the cluster. 如果群集超载,将具体化视图添加到其中可能会导致群集性能下降。If the cluster is overloaded, adding materialized views to it may cause a degradation in the cluster's performance. 可使用群集运行状况指标监视群集的运行状况。Monitor your cluster's health using cluster health metrics. 优化的自动缩放目前不将正在考虑的具体化视图运行状况作为自动缩放规则的一部分。Optimized autoscale currently doesn't take materialized views health under consideration as part of autoscale rules.

  • 与具体化数据重叠: 在具体化期间,自上次具体化后引入到源表中的所有新记录(增量)会被处理并具体化到视图中。Overlap with materialized data: During materialization, all new records ingested to source table since the last materialization (the delta) are processed and materialized into the view. 新记录与已具体化记录之间的交集越大,具体化视图的性能就越差。The higher the intersection between new-records and already-materialized-records is, the worse the performance of the materialized view will be. 如果要更新的记录数(例如,在 arg_max 视图中的记录数)是源表的一小部分,则具体化视图的效果最佳。A materialized view will work best if the number of records being updated (for example, in arg_max view) is a small subset of the source table. 如果所有或大多数具体化视图记录需要在每个具体化循环中进行更新,则具体化视图将不能很好地执行。If all or most of the materialized view records need to be updated in every materialization cycle, then the materialized view won't perform well. 请使用区重新生成指标来确定这种情况。Use extents rebuild metrics to identify this situation.

    • 当新记录与具体化视图之间的交集相对较高时,将群集移到引擎 V3 应该会对具体化视图性能产生显著影响。Moving the cluster to Engine V3 should have a significant performance impact and the materialized view, when the intersection between new records and the materialized view is relatively high. 这是因为引擎 V3 中的盘区重新生成阶段比 V2 中的更加优化。This is because the extents rebuild phase in Engine V3 is much more optimized than in V2.
  • 引入速率: 在具体化视图的源表中,对数据量或引入速率没有硬编码限制。Ingestion rate: There are no hard-coded limits on the data volume or ingestion rate in the source table of the materialized view. 但是,具体化视图的建议引入速率不超过 1-2GB/秒。采用更高的引入速率可能仍然可以很好地执行。However, the recommended ingestion rate for materialized views is no more than 1-2GB/sec. Higher ingestion rates may still perform well. 性能取决于群集大小、可用资源以及与现有数据的交集量。Performance depends on cluster size, available resources, and amount of intersection with existing data.

  • 群集中的具体化视图数: 上述注意事项适用于在群集中定义的每个单独的具体化视图。Number of materialized views in cluster: The above considerations apply to each individual materialized view defined in the cluster. 每个视图都使用其自己的资源,许多视图会彼此争用可用资源。Each view consumes its own resources, and many views will compete with each other on available resources. 对于群集中具体化视图的数量,没有硬编码限制。There are no hard-coded limits to the number of materialized views in a cluster. 但是,一般情况下,群集上的具体化视图建议不要超过 10 个。However, the general recommendation is to have no more than 10 materialized views on a cluster. 如果在群集中定义了多个具体化视图,则可能会调整容量策略The capacity policy may be adjusted if more than a single materialized view is defined in the cluster.

  • 具体化视图定义:必须根据可获得最佳查询性能的查询最佳做法定义具体化视图定义。Materialized view definition: The materialized view definition must be defined according to query best practices for best query performance. 有关详细信息,请参阅 create 命令性能提示For more information, see create command performance tips.

具体化视图策略Materialized views policies

你可以定义实例化视图的保留策略缓存策略,就像任何 Azure 数据资源管理器表一样。You can define the retention policy and caching policy of a materialized view, like any Azure Data Explorer table.

默认情况下,具体化视图会派生数据库保留和缓存策略。The materialized view derives the database retention and caching policies by default. 可以使用保留策略控制命令缓存策略控制命令来更改策略。The policies can be changed using retention policy control commands or caching policy control commands.

  • 具体化视图的保留策略与源表的保留策略无关。The retention policy of the materialized view is unrelated to the retention policy of the source table.
  • 如果不以其他方式使用源表记录,则可以尽量减少源表的保留策略。If the source table records aren't otherwise used, the retention policy of the source table can be dropped to a minimum. 具体化视图仍会根据视图上设置的保留策略来存储数据。The materialized view will still store the data according to the retention policy set on the view.
  • 当具体化视图处于预览模式时,建议至少允许数据存储 7 天,并将可恢复性设置为 true。While materialized views are in preview mode, the recommendation is to allow a minimum of at least seven days and recoverability set to true. 此设置可用来从错误中快速恢复以及进行诊断。This setting allows for fast recovery for errors and for diagnostic purposes.

备注

当前不支持源表上的零保留策略。Zero retention policy on the source table is currently not supported.

具体化视图监视Materialized views monitoring

通过以下方式监视具体化视图的运行状况:Monitor the materialized view's health in the following ways:

备注

具体化永远不会跳过任何数据,即使存在恒定的失败。Materialization never skips any data, even if there are constant failures. 该视图始终保证根据源表中的所有记录返回最新的查询快照。The view is always guaranteed to return the most up-to-date snapshot of the query, based on all records in the source table. 恒定的失败会显著降低查询性能,但不会导致视图查询中出现不正确的结果。Constant failures will significantly degrade query performance, but won't cause incorrect results in view queries.

跟踪资源消耗情况Track resource consumption

具体化视图资源消耗: 可以使用 .show commands-and-queries 命令跟踪具体化视图具体化过程消耗的资源。Materialized views resource consumption: the resources consumed by the materialized views materialization process can be tracked using the .show commands-and-queries command. 可以使用以下命令(替换 DatabaseNameViewName)筛选特定视图的记录:Filter the records for a specific view using the following (replace DatabaseName and ViewName):

.show commands-and-queries 
| where Database  == "DatabaseName" and ClientActivityId startswith "DN.MaterializedViews;ViewName;"

排查不正常的具体化视图Troubleshooting unhealthy materialized views

MaterializedViewHealth 指标指示具体化视图是否正常。The MaterializedViewHealth metric indicates whether a materialized view is healthy. 具体化视图可能会由于以下任一或所有原因而变得不正常:A materialized view can become unhealthy for any or all of the following reasons:

  • 具体化过程失败。The materialization process is failing.
  • 群集没有足够的容量来按时具体化所有传入数据。The cluster doesn't have sufficient capacity to materialize all incoming data on-time. 执行中不会有失败。There won't be failures in execution. 但是,视图仍会处于不正常状态,因为它会滞后,无法跟上引入速率。However, the view will still be unhealthy, since it will be lagging behind and not able to keep up with the ingestion rate.

在具体化视图变得不正常之前,其由 MaterializedViewAgeMinutes 指标记录的时间会逐渐增加。Before a materialized view becomes unhealthy, its age, noted by the MaterializedViewAgeMinutes metric, will gradually increase.

故障排除示例Troubleshooting examples

以下示例可帮助你诊断和修复不正常视图:The following examples can help you diagnose and fix unhealthy views:

  • 失败: 源表已更改或删除,视图未设置为 autoUpdateSchema,或者源表中的更改不支持自动更新。Failure: The source table was changed or deleted, the view wasn't set to autoUpdateSchema, or the change in source table isn't supported for auto-updates.
    诊断: 触发了 MaterializedViewResult 指标,且 Result 维度设置为 SourceTableSchemaChange/SourceTableNotFoundDiagnostic: A MaterializedViewResult metric is fired, and the Result dimension is set to SourceTableSchemaChange/SourceTableNotFound.

  • 失败: 具体化过程因群集资源不足而失败,并且达到了查询限制。Failure: Materialization process fails due insufficient cluster resources, and query limits are hit.
    诊断: MaterializedViewResult 指标 Result 维度设置为 InsufficientResourcesDiagnostic: MaterializedViewResult metric Result dimension is set to InsufficientResources. Azure 数据资源管理器会尝试自动从此状态中恢复,因此,此错误可能是暂时性的。Azure Data Explorer will try to automatically recover from this state, so this error may be transient. 但是,如果视图不正常,并且系统持续发出此错误,则当前群集的配置可能无法跟上引入速率,群集需要纵向或横向扩展。However, if view is unhealthy and this error is constantly emitted, it's possible that the current cluster's configuration isn't able to keep up with ingestion rate, and cluster needs to be scaled up or out.

  • 失败: 由于任何其他(未知)原因,具体化过程失败。Failure: The materialization process is failing because of any other (unknown) reason.
    诊断MaterializedViewResult 指标的 Result 将是 UnknownErrorDiagnostic: MaterializedViewResult metric's Result will be UnknownError. 如果此失败频繁发生,请开具支持票证,以便 Azure 数据资源管理器团队进一步进行调查。If this failure happens frequently, open a support ticket for the Azure Data Explorer team to investigate further.

如果没有具体化失败,则每次成功执行时会触发 MaterializedViewResult 指标,且 Result=SuccessIf there are no materialization failures, MaterializedViewResult metric will be fired on every successful execution, with Result=Success. 如果具体化视图滞后(Age 超出阈值),则它可能不正常,不管执行是否成功。A materialized view can be unhealthy, despite successful executions, if it's lagging behind (Age is above threshold). 在以下情况下,可能会出现这种情形:This situation can happen in the following circumstances:

  • 具体化速度较慢,因为在每个具体化循环中要重新生成的区太多。Materialization is slow since there are too many extents to rebuild in each materialization cycle. 若要详细了解区重新生成为何会影响视图性能,请参阅具体化视图的工作原理To learn more about why extents rebuilds impact the view's performance, see how materialized views work.
  • 如果每个具体化循环都需要重新生成视图中接近 100% 的区,此视图可能无法跟上进度并会变得不正常。If each materialization cycle needs to rebuild close to 100% of the extents in the view, the view may not keep up, and will become unhealthy. 每个循环中重新生成的区数在 MaterializedViewExtentsRebuild 指标中提供。The number of extents rebuilt in each cycle is provided in the MaterializedViewExtentsRebuild metric. 对于这种情况,请考虑以下解决方案:Consider the following solutions for this case:
  • 群集中有更多具体化视图,但群集没有足够的容量来运行所有视图。There are additional materialized views in the cluster, and the cluster doesn't have sufficient capacity to run all views. 请参阅具体化视图容量策略来更改执行的具体化视图数量的默认设置。See materialized view capacity policy to change the default settings for number of materialized views executed concurrently.

后续步骤Next steps