Configure scraping of Prometheus metrics with Azure Monitor for containers

Prometheus is a popular open source metric monitoring solution and is a part of the Cloud Native Compute Foundation. Azure Monitor for containers provides a seamless onboarding experience to collect Prometheus metrics. Typically, to use Prometheus, you need to set up and manage a Prometheus server with a store. By integrating with Azure Monitor, a Prometheus server is not required. You just need to expose the Prometheus metrics endpoint through your exporters or pods (application), and the containerized agent for Azure Monitor for containers can scrape the metrics for you.

Container monitoring architecture for Prometheus

Note

The minimum agent version supported for scraping Prometheus metrics is ciprod07092019 or later, and the agent version supported for writing configuration and agent errors to the KubeMonAgentEvents table is ciprod10112019.

For more information about the agent versions and what's included in each release, see the agent release notes. To verify your agent version, from the Node tab select a node, and in the properties pane note the value of the Agent Image Tag property.

Scraping of Prometheus metrics is supported with Kubernetes clusters hosted on:

  • Azure Kubernetes Service (AKS)
  • Azure Stack or on-premises

Prometheus scraping settings

Active scraping of metrics from Prometheus is performed from one of two perspectives:

  • Cluster-wide - Scrape an HTTP URL, and discover targets from the listed endpoints of a service. For example, k8s services such as kube-dns and kube-state-metrics, and pod annotations specific to an application. Metrics collected in this context are defined in the ConfigMap section [prometheus_data_collection_settings.cluster].
  • Node-wide - Scrape an HTTP URL, and discover targets from the listed endpoints of a service. Metrics collected in this context are defined in the ConfigMap section [prometheus_data_collection_settings.node].
| Endpoint | Scope | Example |
|---|---|---|
| Pod annotation | Cluster-wide | annotations: `prometheus.io/scrape: "true"`, `prometheus.io/path: "/mymetrics"`, `prometheus.io/port: "8000"`, `prometheus.io/scheme: "http"` |
| Kubernetes service | Cluster-wide | `http://my-service-dns.my-namespace:9100/metrics`, `https://metrics-server.kube-system.svc.cluster.local/metrics` |
| URL/endpoint | Per-node and/or cluster-wide | `http://myurl:9101/metrics` |
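For instance, the pod annotation target type from the table above maps onto a workload's pod template as in the following sketch; the deployment name, image, and port are hypothetical placeholders, not required values:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                           # hypothetical deployment name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
      annotations:
        prometheus.io/scrape: "true"     # opt this pod in to scraping
        prometheus.io/scheme: "http"     # http or https
        prometheus.io/path: "/mymetrics" # only needed if the path is not /metrics
        prometheus.io/port: "8000"       # only needed if the port is not 9102
    spec:
      containers:
        - name: my-app
          image: myregistry/my-app:1.0   # hypothetical image serving the metrics endpoint
          ports:
            - containerPort: 8000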

When a URL is specified, Azure Monitor for containers scrapes only that endpoint. When a Kubernetes service is specified, the service name is resolved with the cluster DNS server to get the IP address, and then the resolved service is scraped.

| Scope | Key | Data type | Value | Description |
|---|---|---|---|---|
| Cluster-wide | | | | Specify any one of the following three methods to scrape endpoints for metrics. |
| Cluster-wide | urls | String | Comma-separated array | HTTP endpoint (either an IP address or a valid URL path). For example: urls=[$NODE_IP/metrics]. ($NODE_IP is a specific Azure Monitor for containers parameter and can be used instead of a node IP address. Must be all uppercase.) |
| Cluster-wide | kubernetes_services | String | Comma-separated array | An array of Kubernetes services to scrape metrics from, such as kube-state-metrics. For example, kubernetes_services = ["https://metrics-server.kube-system.svc.cluster.local/metrics", "http://my-service-dns.my-namespace:9100/metrics"]. |
| Cluster-wide | monitor_kubernetes_pods | Boolean | true or false | When set to true in the cluster-wide settings, the Azure Monitor for containers agent scrapes Kubernetes pods across the entire cluster for the following Prometheus annotations: prometheus.io/scrape, prometheus.io/scheme, prometheus.io/path, prometheus.io/port |
| Cluster-wide | prometheus.io/scrape | Boolean | true or false | Enables scraping of the pod. monitor_kubernetes_pods must be set to true. |
| Cluster-wide | prometheus.io/scheme | String | http or https | Defaults to scraping over HTTP. If necessary, set to https. |
| Cluster-wide | prometheus.io/path | String | Comma-separated array | The HTTP resource path from which to fetch metrics. If the metrics path is not /metrics, define it with this annotation. |
| Cluster-wide | prometheus.io/port | String | 9102 | Specify a port to scrape from. If the port is not set, it defaults to 9102. |
| Cluster-wide | monitor_kubernetes_pods_namespaces | String | Comma-separated array | An allow list of namespaces to scrape metrics from Kubernetes pods. For example, monitor_kubernetes_pods_namespaces = ["default1", "default2", "default3"] |
| Node-wide | urls | String | Comma-separated array | HTTP endpoint (either an IP address or a valid URL path). For example: urls=[$NODE_IP/metrics]. ($NODE_IP is a specific Azure Monitor for containers parameter and can be used instead of a node IP address. Must be all uppercase.) |
| Node-wide or Cluster-wide | interval | String | 60s | The collection interval defaults to one minute (60 seconds). You can modify the collection interval for [prometheus_data_collection_settings.node] and/or [prometheus_data_collection_settings.cluster] using time units such as s, m, h. |
| Node-wide or Cluster-wide | fieldpass / fielddrop | String | Comma-separated array | You can specify certain metrics to be collected or not from the endpoint by setting the allow (fieldpass) and disallow (fielddrop) lists. You must set the allow list first. |

ConfigMap is a global list and only one ConfigMap can be applied to the agent. You cannot have another ConfigMap overruling the collections.
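All of the scrape settings above live under a single prometheus-data-collection-settings key in that one ConfigMap. The following is a minimal sketch of the overall file shape, based on the conventions in this article; the metadata name and kube-system namespace match the configmap "container-azm-ms-agentconfig" messages shown later, while the schema-version value is an assumption drawn from the template file:

kind: ConfigMap
apiVersion: v1
metadata:
  name: container-azm-ms-agentconfig    # the name reported in the apply messages later in this article
  namespace: kube-system                # the namespace the agent pods run in
data:
  schema-version: v1                    # assumed; an unsupported version makes the agent fall back to defaults
  prometheus-data-collection-settings: |-
    # Custom Prometheus metrics data collection settings
    [prometheus_data_collection_settings.cluster]
        interval = "1m"                 ## cluster-wide collection interval
        monitor_kubernetes_pods = true  ## scrape annotated pods
    [prometheus_data_collection_settings.node]
        interval = "1m"                 ## node-wide collection interval
        urls = ["http://$NODE_IP:9103/metrics"]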

Configure and deploy ConfigMaps

Perform the following steps to configure your ConfigMap configuration file for the following clusters:

  • Azure Kubernetes Service (AKS)
  • Azure Stack or on-premises
  1. Download the template ConfigMap yaml file and save it as container-azm-ms-agentconfig.yaml.

  2. Edit the ConfigMap yaml file with your customizations to scrape Prometheus metrics.

    • To collect metrics from Kubernetes services cluster-wide, configure the ConfigMap file using the following example.

      prometheus-data-collection-settings: |- 
      # Custom Prometheus metrics data collection settings
      [prometheus_data_collection_settings.cluster] 
      interval = "1m"  ## Valid time units are s, m, h.
      fieldpass = ["metric_to_pass1", "metric_to_pass12"] ## specify metrics to pass through 
      fielddrop = ["metric_to_drop"] ## specify metrics to drop from collecting
      kubernetes_services = ["http://my-service-dns.my-namespace:9102/metrics"]
      
    • To configure scraping of Prometheus metrics from a specific URL across the cluster, configure the ConfigMap file using the following example.

      prometheus-data-collection-settings: |- 
      # Custom Prometheus metrics data collection settings
      [prometheus_data_collection_settings.cluster] 
      interval = "1m"  ## Valid time units are s, m, h.
      fieldpass = ["metric_to_pass1", "metric_to_pass12"] ## specify metrics to pass through 
      fielddrop = ["metric_to_drop"] ## specify metrics to drop from collecting
      urls = ["http://myurl:9101/metrics"] ## An array of urls to scrape metrics from
      
    • To configure scraping of Prometheus metrics from the agent's DaemonSet for every individual node in the cluster, configure the following in the ConfigMap:

      prometheus-data-collection-settings: |- 
      # Custom Prometheus metrics data collection settings 
      [prometheus_data_collection_settings.node] 
      interval = "1m"  ## Valid time units are s, m, h. 
      urls = ["http://$NODE_IP:9103/metrics"] 
      fieldpass = ["metric_to_pass1", "metric_to_pass2"] 
      fielddrop = ["metric_to_drop"] 
      

      Note

      $NODE_IP is a specific Azure Monitor for containers parameter and can be used instead of a node IP address. It must be all uppercase.

    • To configure scraping of Prometheus metrics by specifying a pod annotation, perform the following steps:

      1. In the ConfigMap, specify the following:

        prometheus-data-collection-settings: |- 
        # Custom Prometheus metrics data collection settings
        [prometheus_data_collection_settings.cluster] 
        interval = "1m"  ## Valid time units are s, m, h
        monitor_kubernetes_pods = true 
        
      2. Specify the following configuration for pod annotations:

        - prometheus.io/scrape: "true"      # Enable scraping for this pod
        - prometheus.io/scheme: "http"      # If the metrics endpoint is secured, set this to `https`; otherwise it defaults to `http`
        - prometheus.io/path: "/mymetrics"  # If the metrics path is not /metrics, define it with this annotation
        - prometheus.io/port: "8000"        # If the port is not 9102, use this annotation
        

        If you want to restrict monitoring to specific namespaces for pods that have annotations, for example to only include pods dedicated to production workloads, set monitor_kubernetes_pods to true in the ConfigMap, and add the namespace filter monitor_kubernetes_pods_namespaces specifying the namespaces to scrape from. For example, monitor_kubernetes_pods_namespaces = ["default1", "default2", "default3"].
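        Combined in the ConfigMap, the two settings look like the following sketch; the namespace names are placeholders:

        prometheus-data-collection-settings: |-
        # Custom Prometheus metrics data collection settings
        [prometheus_data_collection_settings.cluster]
        interval = "1m"  ## Valid time units are s, m, h
        monitor_kubernetes_pods = true
        monitor_kubernetes_pods_namespaces = ["default1", "default2", "default3"]  ## scrape annotated pods only in these namespaces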

  3. Run the following kubectl command: kubectl apply -f <configmap_yaml_file.yaml>.

    Example: kubectl apply -f container-azm-ms-agentconfig.yaml.

The configuration change can take a few minutes to finish before taking effect, and all omsagent pods in the cluster will restart. The restart is a rolling restart for all omsagent pods; they don't all restart at the same time. When the restarts are finished, a message similar to the following is displayed and includes the result: configmap "container-azm-ms-agentconfig" created.

Applying updated ConfigMap

If you have already deployed a ConfigMap to your cluster and you want to update it with a newer configuration, you can edit the ConfigMap file you've previously used, and then apply it using the same commands as before.

For the following Kubernetes environments:

  • Azure Kubernetes Service (AKS)
  • Azure Stack or on-premises

run the command kubectl apply -f <configmap_yaml_file.yaml>.

The configuration change can take a few minutes to finish before taking effect, and all omsagent pods in the cluster will restart. The restart is a rolling restart for all omsagent pods; they don't all restart at the same time. When the restarts are finished, a message similar to the following is displayed and includes the result: configmap "container-azm-ms-agentconfig" updated.

Verify configuration

To verify the configuration was successfully applied to a cluster, use the following command to review the logs from an agent pod: kubectl logs omsagent-fdf58 -n=kube-system.

If there are configuration errors from the omsagent pods, the output will show errors similar to the following:

***************Start Config Processing******************** 
config::unsupported/missing config schema version - 'v21' , using defaults

Errors related to applying configuration changes are also available for review. The following options are available to perform additional troubleshooting of configuration changes and scraping of Prometheus metrics:

  • From an agent pod's logs, using the same kubectl logs command

  • From Live Data (preview). Live Data (preview) logs show errors similar to the following:

    2019-07-08T18:55:00Z E! [inputs.prometheus]: Error in plugin: error making HTTP request to http://invalidurl:1010/metrics: Get http://invalidurl:1010/metrics: dial tcp: lookup invalidurl on 10.0.0.10:53: no such host
    
  • From the KubeMonAgentEvents table in your Log Analytics workspace. Data is sent every hour with Warning severity for scrape errors and Error severity for configuration errors. If there are no errors, the entry in the table has data with severity Info, which reports no errors. The Tags property contains more information about the pod and container ID on which the error occurred, as well as the first occurrence, last occurrence, and count in the last hour. A query sketch follows this list.
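    A hedged sketch of such a query follows; it assumes the severity lands in a Level column alongside Message and Tags, which may differ in your workspace schema:

    KubeMonAgentEvents
    | where TimeGenerated > ago(24h)
    | where Level != "Info"    // keep only scrape (Warning) and configuration (Error) events
    | project TimeGenerated, Level, Message, Tags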

Errors prevent omsagent from parsing the file, causing it to restart and use the default configuration. After you correct the error(s) in the ConfigMap on your clusters, save the yaml file and apply the updated ConfigMap by running the command: kubectl apply -f <configmap_yaml_file.yaml>.

Query Prometheus metrics data

To view Prometheus metrics scraped by Azure Monitor and any configuration/scraping errors reported by the agent, review Query Prometheus metrics data and Query config or scraping errors.
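As a starting point, scraped Prometheus metrics land in the InsightsMetrics table, the same table the usage queries later in this article run against; the metric name below is a hypothetical placeholder:

InsightsMetrics
| where Namespace contains "prometheus"
| where Name == "metric_to_pass1"    // hypothetical metric name, e.g. one listed in fieldpass
| summarize AvgValue = avg(Val) by bin(TimeGenerated, 5m)
| order by TimeGenerated asc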

View Prometheus metrics in Grafana

Azure Monitor for containers supports viewing metrics stored in your Log Analytics workspace in Grafana dashboards. We have provided a template that you can download from Grafana's dashboard repository to get you started, and as a reference to help you learn how to query additional data from your monitored clusters to visualize in custom Grafana dashboards.

Review Prometheus data usage

To identify the daily ingestion volume of each metric in GB and understand whether it is high, the following query is provided.

InsightsMetrics
| where Namespace contains "prometheus"
| where TimeGenerated > ago(24h)
| summarize VolumeInGB = (sum(_BilledSize) / (1024 * 1024 * 1024)) by Name
| order by VolumeInGB desc
| render barchart

The output will show results similar to the following:

Screenshot that shows log query results of data ingestion volume

To estimate the monthly size of each metric in GB and understand whether the volume of data ingested into the workspace is high, the following query is provided.

InsightsMetrics
| where Namespace contains "prometheus"
| where TimeGenerated > ago(24h)
| summarize EstimatedGBPer30dayMonth = (sum(_BilledSize) / (1024 * 1024 * 1024)) * 30 by Name
| order by EstimatedGBPer30dayMonth desc
| render barchart

The output will show results similar to the following:

Log query results of data ingestion volume

Further information on how to monitor data usage and analyze cost is available in Manage usage and costs with Azure Monitor Logs.

Next steps

Learn more about configuring the agent's collection settings for stdout, stderr, and environment variables from container workloads here.