Cache policy (hot and cold cache)

Azure Data Explorer stores its ingested data in reliable storage (most commonly Azure Blob Storage), away from its actual processing nodes (such as Azure Compute). To speed up queries on that data, Azure Data Explorer caches it, or parts of it, on the processing nodes' SSD, or even in RAM. Azure Data Explorer includes a sophisticated cache mechanism designed to intelligently decide which data objects to cache. The cache enables Azure Data Explorer to describe the data artifacts that it uses, such as column indexes and column data shards, so that more important data can take priority.

The best query performance is achieved when all ingested data is cached. Sometimes, certain data doesn't justify the cost of keeping it "warm" in local SSD storage. For example, many teams consider rarely accessed older log records to be of lesser importance. They prefer to accept reduced performance when querying this data rather than pay to keep it warm all the time.

Azure Data Explorer cache provides a granular cache policy that customers can use to differentiate between hot data cache and cold data cache. Azure Data Explorer cache attempts to keep all data that falls into the hot data cache category in local SSD (or RAM), up to the defined size of the hot data cache. The remaining local SSD space is used to hold data that isn't categorized as hot. One useful implication of this design is that queries that load lots of cold data from reliable storage won't evict data from the hot data cache. As a result, there won't be a major impact on queries involving the data in the hot data cache.

The main implications of setting the hot cache policy are:

  • Cost: The cost of reliable storage can be dramatically lower than for local SSD. It's currently about 45 times cheaper in Azure.
  • Performance: Data is queried faster when it's in local SSD, particularly for range queries that scan large amounts of data.

Use the cache policy command to manage the cache policy.
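As a minimal sketch of what managing the policy can look like, the following management commands set and then inspect a caching policy. The table name MyTable and the 28-day window are placeholders chosen for illustration, not values from this article. Run each command separately.

```kusto
// Assumption: MyTable is a hypothetical table name.
// Treat data ingested in the last 28 days as hot.
.alter table MyTable policy caching hot = 28d

// Show the caching policy currently set on the table.
.show table MyTable policy caching
```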

Tip

Azure Data Explorer is designed for ad-hoc queries whose intermediate result sets fit in the cluster's total RAM. For large jobs, like map-reduce, where you want to store intermediate results in persistent storage such as an SSD, use the continuous export feature. This feature enables you to run long-running batch queries using services like HDInsight or Azure Databricks.

How cache policy is applied

When data is ingested into Azure Data Explorer, the system keeps track of the date and time of the ingestion, and of the extent that was created. The extent's ingestion date and time value (or the maximum value, if the extent was built from multiple pre-existing extents) is used to evaluate the cache policy.

Note

You can specify a value for the ingestion date and time by using the ingestion property creationTime.
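The evaluation described above can be pictured with a small, self-contained query. This is only an illustration of the comparison, not the service's internal implementation: the extent names, the timestamps, the fixed "now", and the 28-day hot period are all made-up values.

```kusto
// Illustrative model only: every name and value here is hypothetical.
let HotPeriod = 28d;
let Now = datetime(2024-03-01);
datatable(ExtentName:string, IngestionTime:datetime)
[
    "extent-a", datetime(2024-02-20),
    "extent-b", datetime(2024-01-15)
]
// An extent counts as hot when its ingestion time falls inside the hot period.
| extend IsHot = IngestionTime >= Now - HotPeriod
```

With these values, extent-a falls inside the 28-day window and extent-b falls outside it.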

By default, the effective policy is null, which means that all the data is considered hot. A non-null table-level policy overrides a database-level policy.
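The override behavior can be sketched with the following management commands. MyDatabase, MyTable, and both time spans are placeholders for illustration. Run each command separately.

```kusto
// Assumption: MyDatabase and MyTable are hypothetical names.
// Database-level policy: data ingested in the last 7 days is hot.
.alter database MyDatabase policy caching hot = 7d

// Table-level policy: being non-null, it overrides the database-level policy for MyTable.
.alter table MyTable policy caching hot = 28d

// Remove the table-level policy; MyTable is then expected to follow the database-level policy again.
.delete table MyTable policy caching
```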

Scoping queries to hot cache

Kusto supports queries that are scoped down to hot cache data only.

Note

Data scoping applies only to entities that support caching policies, such as tables. It's ignored for other entities, such as external tables.

There are several ways to scope a query:

  • Add a client request property called query_datascope to the query. Possible values: default, all, and hotcache.
  • Use a set statement in the query text: set query_datascope='...'. Possible values are the same as for the client request property.
  • Add a datascope=... text immediately after a table reference in the query body. Possible values are all and hotcache.

The default value indicates use of the cluster default settings, which determine that the query should cover all data.

If there's a discrepancy between the different methods, then set takes precedence over the client request property. Specifying a value for a table reference takes precedence over both.

For example, in the following query all table references use hot cache data only, except for the second reference to "T", which is scoped to all the data:

set query_datascope="hotcache";
T | union U | join (T datascope=all | where Timestamp < ago(365d)) on X

Cache policy vs retention policy

Cache policy is independent of retention policy:

  • Cache policy defines how to prioritize resources. Queries over important data will be faster and resistant to the impact of queries over less important data.
  • Retention policy defines the extent of the queryable data in a table/database (specifically, SoftDeletePeriod).

Configure the cache policy to achieve the optimal balance between cost and performance, based on the expected query pattern.

Example:

  • SoftDeletePeriod = 56d
  • hot cache policy = 28d

In this example, the last 28 days of data will be on the cluster SSD and the additional 28 days of data will be stored in Azure Blob Storage. You can run queries on the full 56 days of data.
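One way to express the example configuration is with the following management commands. The table name MyTable is a placeholder, and applying both policies at table level (rather than database level) is an assumption made for this sketch. Run each command separately.

```kusto
// Assumption: MyTable is a hypothetical table name.
// Retention: keep data queryable for 56 days (SoftDeletePeriod = 56d).
.alter-merge table MyTable policy retention softdelete = 56d

// Caching: keep the most recent 28 days of data in hot cache.
.alter table MyTable policy caching hot = 28d
```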