Certificate management in Service Fabric clusters

This article addresses the management aspects of certificates used to secure communication in Azure Service Fabric clusters, and complements the introduction to Service Fabric cluster security as well as the explainer on X.509 certificate-based authentication in Service Fabric. We assume the reader is familiar with fundamental security concepts, and also with the controls that Service Fabric exposes to configure the security of a cluster.

Aspects covered under this title:

  • What exactly is 'certificate management'?
  • Roles and entities involved in certificate management
  • The journey of a certificate
  • Deep dive into an example
  • Troubleshooting and Frequently Asked Questions

But first - a disclaimer: this article attempts to pair its theoretical side with hands-on examples, which require specifics of services, technologies, and so on. Since a sizeable part of the audience is Microsoft-internal, we'll refer to services, technologies, and products specific to Azure. Where the Microsoft-specific details do not apply in your case, feel free to ask Azure support for clarifications or guidance.

Defining certificate management

As we've seen in the companion article, a certificate is a cryptographic object essentially binding an asymmetric key pair with attributes describing the entity it represents. However, it is also a 'perishable' object, in that its lifetime is finite and it is susceptible to compromise - accidental disclosure or a successful exploit can render a certificate useless from a security standpoint. This implies the need to change certificates - either routinely or in response to a security incident. Another aspect of management (which is an entire topic on its own) is the safeguarding of certificate private keys, or of secrets protecting the identities of the entities involved in procuring and provisioning certificates. We refer to the processes and procedures used to obtain certificates and to safely (and securely) transport them to where they are needed as 'certificate management'. Some of the management operations - such as enrollment, policy setting, and authorization controls - are beyond the scope of this article. Still others - such as provisioning, renewal, re-keying, or revocation - are only incidentally related to Service Fabric; nonetheless, we'll address them here to some degree, as understanding these operations can help with properly securing one's cluster.

The goal is to automate certificate management as much as possible to ensure uninterrupted availability of the cluster and offer security assurances, given that the process is user-touch-free. This goal is attainable currently in Azure Service Fabric clusters; the remainder of the article will first deconstruct certificate management, and later will focus on enabling autorollover.

Specifically, the topics in scope are:

  • assumptions related to the separation of attributions between owner and platform, in the context of managing certificates
  • the long pipeline of certificates from issuance to consumption
  • certificate rotation - why, how and when
  • what could possibly go wrong?

Aspects such as securing/managing domain names, enrolling into certificates, or setting up authorization controls to enforce certificate issuance are out of the scope of this article. Refer to the Registration Authority (RA) of your favorite Public Key Infrastructure (PKI) service. Microsoft-internal consumers: please reach out to Azure Security.

Roles and entities involved in certificate management

The security approach in a Service Fabric cluster is a case of "cluster owner declares it, Service Fabric runtime enforces it". By that we mean that almost none of the certificates, keys, or other credentials of identities participating in a cluster's functioning come from the service itself; they are all declared by the cluster owner. Furthermore, the cluster owner is also responsible for provisioning the certificates into the cluster, renewing them as needed, and ensuring the security of the certificates at all times. More specifically, the cluster owner must ensure that:

  • certificates declared in the NodeType section of the cluster manifest can be found on each node of that type, according to the presentation rules
  • certificates declared above are installed inclusive of their corresponding private keys
  • certificates declared in the presentation rules pass the validation rules

Service Fabric, for its part, assumes the responsibilities of:

  • locating/finding certificates matching the declarations in the cluster definition
  • granting access to the corresponding private keys to Service Fabric-controlled entities on a 'need' basis
  • validating certificates in strict accordance with established security best-practices and the cluster definition
  • raising alerts on impending expiration of certificates, or failures to perform the basic steps of certificate validation
  • validating (to some degree) that the certificate-related aspects of the cluster definition are met by the underlying configuration of the hosts

It follows that the certificate management burden (as active operations) falls solely on the cluster owner. In the following sections, we'll take a closer look at each of the management operations, with available mechanisms and their impact on the cluster.

The journey of a certificate

Let us quickly revisit the progression of a certificate from issuance to consumption in the context of a Service Fabric cluster:

  1. A domain owner registers with the RA of a PKI a domain or subject that they'd like to associate with ensuing certificates; the certificates will, in turn, constitute proofs of ownership of said domain or subject.
  2. The domain owner also designates in the RA the identities of authorized requesters - entities that are entitled to request the enrollment of certificates with the specified domain or subject; in Azure, the default identity provider is Azure Active Directory (AAD), and authorized requesters are designated by their corresponding AAD identity (or via security groups).
  3. An authorized requester then enrolls into a certificate via a Secret Management Service (SMS); in Azure, the SMS of choice is Azure Key Vault (AKV), which securely stores and allows the retrieval of secrets/certificates by authorized entities. AKV also renews/re-keys the certificate as configured in the associated certificate policy. (AKV uses AAD as the identity provider.)
  4. An authorized retriever - which we'll refer to as a 'provisioning agent' - retrieves the certificate, inclusive of its private key, from the vault, and installs it on the machines hosting the cluster.
  5. The Service Fabric service (running elevated on each node) grants access to the certificate to allowed Service Fabric entities; these are designated by local groups, and split between ServiceFabricAdministrators and ServiceFabricAllowedUsers.
  6. The Service Fabric runtime accesses and uses the certificate to establish federation, or to authenticate to inbound requests from authorized clients.
  7. The provisioning agent monitors the vault certificate, and triggers the provisioning flow upon detecting renewal; subsequently, the cluster owner updates the cluster definition, if needed, to indicate the intent to roll over the certificate.
  8. The provisioning agent or the cluster owner is also responsible for cleaning up/deleting unused certificates.

For our purposes, the first two steps in the sequence above are largely unrelated; the only connection is that the subject common name of the certificate is the DNS name declared in the cluster definition.

These steps are illustrated below; note the differences in provisioning between certificates declared by thumbprint and common name, respectively.

Fig. 1. Issuance and provisioning flow of certificates declared by thumbprint.

Fig. 2. Issuance and provisioning flow of certificates declared by subject common name.

Certificate enrollment

This topic is covered in detail in the Key Vault documentation; we're including a synopsis here for continuity and easier reference. Continuing with Azure as the context, and using Azure Key Vault as the secret management service, an authorized certificate requester must have at least certificate management permissions on the vault, granted by the vault owner; the requester would then enroll into a certificate as follows:

  • creates a certificate policy in Azure Key Vault (AKV), which specifies the domain/subject of the certificate, the desired issuer, key type and length, intended key usage and more; see Certificates in Azure Key Vault for details.
  • creates a certificate in the same vault with the policy specified above; this, in turn, generates a key pair as vault objects and a certificate signing request signed with the private key, which is then forwarded to the designated issuer for signing
  • once the issuer (Certificate Authority) replies with the signed certificate, the result is merged into the vault, and the certificate is available for the following operations:
    • under {vaultUri}/certificates/{name}: the certificate including the public key and metadata
    • under {vaultUri}/keys/{name}: the certificate's private key, available for cryptographic operations (wrap/unwrap, sign/verify)
    • under {vaultUri}/secrets/{name}: the certificate inclusive of its private key, available for downloading as an unprotected pfx or pem file

Recall that a vault certificate is, in fact, a chronological line of certificate instances, sharing a policy. Certificate versions will be created according to the lifetime and renewal attributes of the policy. It is highly recommended that vault certificates not share subjects or domains/DNS names; it can be disruptive in a cluster to provision certificate instances from different vault certificates, with identical subjects but substantially different other attributes, such as issuer, key usages etc.
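The three identifiers listed above follow one pattern, and the difference between a version-less and a versioned identifier matters later for provisioning. The sketch below is a minimal illustration in Python, not part of any Azure SDK: the helper names, the vault name `contoso-vault`, the certificate name `cluster-cert`, and the version string are all hypothetical.

```python
from urllib.parse import urlsplit


def vault_object_uris(vault_uri: str, name: str) -> dict:
    """Build the three version-less identifiers exposed for one vault certificate."""
    base = vault_uri.rstrip("/")
    return {kind: f"{base}/{kind}/{name}" for kind in ("certificates", "keys", "secrets")}


def is_versioned(object_uri: str) -> bool:
    """Return True when the identifier has the form /{kind}/{name}/{version}."""
    segments = [s for s in urlsplit(object_uri).path.split("/") if s]
    return len(segments) == 3


# Hypothetical vault and certificate names, for illustration only.
uris = vault_object_uris("https://contoso-vault.vault.azure.net/", "cluster-cert")
for kind, uri in uris.items():
    print(kind, uri, "versioned" if is_versioned(uri) else "version-less")
```

A version-less identifier always resolves to the latest instance in the certificate's chronological line, whereas appending a version segment pins one specific instance.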

At this point, a certificate exists in the vault, ready for consumption. Onward to:

Certificate provisioning

We mentioned a 'provisioning agent', which is the entity that retrieves the certificate, inclusive of its private key, from the vault and installs it on each of the hosts of the cluster. (Recall that Service Fabric does not provision certificates.) In our context, the cluster will be hosted on a collection of Azure VMs and/or virtual machine scale sets. In Azure, provisioning a certificate from a vault to a VM/VMSS can be achieved with the following mechanisms - assuming, as above, that the provisioning agent was previously granted 'get' permissions on the vault by the vault owner:

  • ad-hoc: an operator retrieves the certificate from the vault (as pfx/PKCS #12 or pem) and installs it on each node
  • as a virtual machine scale set 'secret' during deployment: the Compute service retrieves, using its first-party identity on behalf of the operator, the certificate from a template-deployment-enabled vault and installs it on each node of the virtual machine scale set; note this allows the provisioning of versioned secrets only
  • using the Key Vault VM extension; this allows the provisioning of certificates using version-less declarations, with periodic refreshing of observed certificates. In this case, the VM/VMSS is expected to have a managed identity that has been granted access to the vault(s) containing the observed certificates.

The ad-hoc mechanism is not recommended for multiple reasons, ranging from security to availability, and won't be discussed here further; for details, refer to certificates in virtual machine scale sets.

The VMSS-/Compute-based provisioning presents security and availability advantages, but it also presents restrictions. It requires - by design - declaring certificates as versioned secrets, which makes it suitable only for clusters secured with certificates declared by thumbprint. In contrast, the Key Vault VM extension-based provisioning will always install the latest version of each observed certificate, which makes it suitable only for clusters secured with certificates declared by subject common name. To emphasize, do not use an autorefresh provisioning mechanism (such as the KVVM extension) for certificates declared by instance (that is, by thumbprint) - the risk of losing availability is considerable.
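The pairing rule in the preceding paragraph can be written down as a small lookup. This is only a sketch of the rule as stated in this article; the labels `thumbprint`, `common_name`, `vmss_secret`, and `keyvault_vm_extension` are names chosen for illustration and do not correspond to any Azure setting.

```python
# Which provisioning mechanism suits which declaration style, per the text above.
SUITABLE = {
    ("thumbprint", "vmss_secret"): True,             # versioned secret, pinned instance
    ("thumbprint", "keyvault_vm_extension"): False,  # autorefresh breaks instance pinning
    ("common_name", "vmss_secret"): False,           # versioned secret never follows rotation
    ("common_name", "keyvault_vm_extension"): True,  # latest version, matched by subject
}


def check_provisioning(declaration: str, mechanism: str) -> str:
    """Return a short verdict for a (declaration, mechanism) pair."""
    try:
        ok = SUITABLE[(declaration, mechanism)]
    except KeyError:
        raise ValueError(f"unknown combination: {declaration!r}, {mechanism!r}")
    return "suitable" if ok else "not suitable: risk of losing availability or missing rotations"


print(check_provisioning("thumbprint", "keyvault_vm_extension"))
print(check_provisioning("common_name", "keyvault_vm_extension"))
```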

Other provisioning mechanisms may exist; the above are currently accepted for Azure Service Fabric clusters.

Certificate consumption and monitoring

As mentioned earlier, the Service Fabric runtime is responsible for locating and using the certificates declared in the cluster definition. The article on certificate-based authentication explained in detail how Service Fabric implements the presentation and validation rules, respectively, and won't be revisited here. We are going to look at access and permission granting, as well as monitoring.

Recall that certificates in Service Fabric are used for a multitude of purposes, from mutual authentication in the federation layer to TLS authentication for the management endpoints; this requires various components or system services to have access to the certificate's private key. The Service Fabric runtime regularly scans the certificate store, looking for matches for each of the known presentation rules; for each of the matching certificates, the corresponding private key is located, and its discretionary access control list is updated to include permissions - typically Read and Execute - granted to the respective identity that requires them. (This process is informally referred to as 'ACLing'.) The process runs on a 1-minute cadence, and also covers 'application' certificates, such as those used to encrypt settings or as endpoint certificates. ACLing follows the presentation rules, so it's important to keep in mind that certificates which are declared by thumbprint and autorefreshed without the ensuing cluster configuration update will not be accessible.
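To see why an autorefreshed certificate declared by thumbprint ends up inaccessible, recall that a thumbprint is a hash (SHA-1 on Windows) of the certificate's DER encoding, so every new certificate instance has a new thumbprint. The following sketch is a simplified model, not Service Fabric's actual matching code: the `InstalledCert` type, the matching helpers, the subject name, and the placeholder byte strings standing in for DER encodings are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class InstalledCert:
    subject_cn: str
    der: bytes  # placeholder bytes standing in for the certificate's DER encoding

    @property
    def thumbprint(self) -> str:
        # A thumbprint is the SHA-1 digest of the encoded certificate.
        return hashlib.sha1(self.der).hexdigest().upper()


def match_by_thumbprint(store, declared_thumbprint):
    return [c for c in store if c.thumbprint == declared_thumbprint.upper()]


def match_by_common_name(store, declared_cn):
    return [c for c in store if c.subject_cn == declared_cn]


old = InstalledCert("cluster.contoso.example", b"placeholder-der-v1")
new = InstalledCert("cluster.contoso.example", b"placeholder-der-v2")
declared_tp = old.thumbprint          # the cluster definition still names the old instance
store_after_refresh = [old, new]      # an agent installed the refreshed certificate

tp_matches = match_by_thumbprint(store_after_refresh, declared_tp)
cn_matches = match_by_common_name(store_after_refresh, "cluster.contoso.example")
print(len(tp_matches), len(cn_matches))
```

In this model the refreshed certificate never matches the thumbprint-based rule, so it would not be ACLed until the cluster definition is updated; the common-name rule matches both instances.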

Certificate rotation

As a side note: IETF RFC 3647 formally defines renewal as the issuance of a certificate with the same attributes as the certificate being replaced (the issuer, the subject's public key, and the subject's information are preserved), and re-keying as the issuance of a certificate with a new key pair, with no restrictions on whether or not the issuer may change. Given the distinction may be important (consider the case of certificates declared by subject common name with issuer pinning), we'll opt for the neutral term 'rotation' to cover both scenarios. (Do keep in mind that, when informally used, 'renewal' refers, in fact, to re-keying.)

We've seen earlier that Azure Key Vault supports automatic certificate rotation: the associated certificate policy defines the point in time, whether by days before expiration or percentage of total lifetime, when the certificate is rotated in the vault. The provisioning agent must be invoked after this point in time, and prior to the expiration of the now-previous certificate, to distribute this new certificate to all of the nodes of the cluster. Service Fabric will assist by raising health warnings when the expiration date of a certificate that is currently in use in the cluster occurs sooner than a predetermined interval. An automatic provisioning agent (i.e. the KeyVault VM extension), configured to observe the vault certificate, will periodically poll the vault, detect the rotation, and retrieve and install the new certificate. Provisioning done via the VM/VMSS 'secrets' feature will require an authorized operator to update the VM/VMSS with the versioned KeyVault URI corresponding to the new certificate.
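The window in which the provisioning agent has to act can be worked out from the two policy styles mentioned above. This is a minimal arithmetic sketch, assuming a policy expressed either as days before expiration or as a percentage of total lifetime; the function and parameter names are illustrative and the dates are made up.

```python
from datetime import datetime, timedelta, timezone


def rotation_time(not_before, not_after, *, days_before_expiry=None, lifetime_percentage=None):
    """Point in time at which the vault rotates the certificate, under one of two triggers."""
    if (days_before_expiry is None) == (lifetime_percentage is None):
        raise ValueError("specify exactly one of days_before_expiry or lifetime_percentage")
    if days_before_expiry is not None:
        return not_after - timedelta(days=days_before_expiry)
    lifetime = not_after - not_before
    return not_before + lifetime * (lifetime_percentage / 100)


def provisioning_window(not_before, not_after, **trigger):
    """The new certificate must reach every node after rotation and before the old one expires."""
    return rotation_time(not_before, not_after, **trigger), not_after


issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
expires = issued + timedelta(days=100)

start, end = provisioning_window(issued, expires, lifetime_percentage=50)
print("provision between", start.isoformat(), "and", end.isoformat())
```

The later the trigger, the narrower the window left for distributing the new certificate, which is one reason to avoid rotating very close to expiration.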

In either case, the rotated certificate is now provisioned to all of the nodes, and we have described the mechanism Service Fabric employs to detect rotations; let us examine what happens next - assuming the rotation applied to the cluster certificate declared by subject common name (all applicable as of the time of this writing, and Service Fabric runtime version 7.1.409):

  • for new connections within, as well as into the cluster, the Service Fabric runtime will find and select the matching certificate with the farthest expiration date (the 'NotAfter' property of the certificate, often abbreviated as 'na')
  • existing connections will be kept alive/allowed to naturally expire or otherwise terminate; an internal handler will have been notified that a new match exists

This translates into the following important observations:

  • The renewal certificate may be ignored if its expiration date is sooner than that of the certificate currently in use.
  • The availability of the cluster, or of the hosted applications, takes precedence over the directive to rotate the certificate; the cluster will converge on the new certificate eventually, but without timing guarantees. It follows that:
    • It may not be immediately obvious to an observer that the rotated certificate completely replaced its predecessor; the only way to ensure that is (for cluster certificates) to reboot the host machines. Note it is not sufficient to restart the Service Fabric nodes, as kernel mode components which form lease connections in a cluster will not be affected. Also note that restarting the VM/VMSS may cause temporary loss of availability. (For application certificates, it is sufficient to restart the respective application instances only.)
    • Introducing a re-keyed certificate that does not meet the validation rules can effectively break the cluster. The most common example of this is the case of an unexpected issuer: the cluster certificates are declared by subject common name with issuer pinning, but the rotated certificate was issued by a new/undeclared issuer.

Certificate cleanup

At this time, there are no provisions in Azure for explicit removal of certificates. It is often a non-trivial task to determine whether or not a given certificate is being used at a given time. This is more difficult for application certificates than for cluster certificates. Service Fabric itself, not being the provisioning agent, will not delete a certificate declared by the user under any circumstance. For the standard provisioning mechanisms:

  • Certificates declared as VM/VMSS secrets will be provisioned as long as they are referenced in the VM/VMSS definition, and they are retrievable from the vault (deleting a vault secret/certificate will fail subsequent VM/VMSS deployments; similarly, disabling a secret version in the vault will also fail VM/VMSS deployments that reference that secret version)
  • Previous versions of certificates provisioned via the KeyVault VM extension may or may not be present on a VM/VMSS node. The agent only retrieves and installs the current version, and does not remove any certificates. Reimaging a node (which typically occurs every month) will reset the certificate store to the content of the OS image, and so previous versions will implicitly be removed. Consider, though, that scaling up a virtual machine scale set will result in only the current version of observed certificates being installed; do not assume homogeneity of nodes with regard to installed certificates.

Simplifying management - an autorollover example

We've described mechanisms, restrictions, outlined intricate rules and definitions, and made dire predictions of outages. It is, perhaps, time to show how to set up automatic certificate management to avoid all of these concerns. We're doing so in the context of an Azure Service Fabric cluster running on a PaaSv2 virtual machine scale set, using Azure Key Vault for secrets management and leveraging managed identities, as follows:

  • Validation of certificates is changed from thumbprint-pinning to subject + issuer pinning: any certificate with a given subject from a given issuer is equally trusted
  • Certificates are enrolled into and obtained from a trusted store (Key Vault), and refreshed by an agent - in this case, the KeyVault VM extension
  • Provisioning of certificates is changed from deployment-time and version-based (as done by ComputeRP) to post-deployment and using version-less KeyVault URIs
  • Access to KeyVault is granted via user-assigned managed identities; the user-assigned (UA) identity is created and assigned to the virtual machine scale set during deployment
  • After deployment, the agent (the KV VM extension) will poll and refresh observed certificates on each node of the virtual machine scale set; certificate rotation is thus fully automated, as SF will automatically pick up the farthest valid certificate

The sequence is fully scriptable/automated and allows a user-touch-free initial deployment of a cluster configured for certificate autorollover. Detailed steps are provided below. We'll use a mix of PowerShell cmdlets and fragments of json templates. The same functionality is achievable with all supported means of interacting with Azure.

Note

This example assumes a certificate exists already in the vault; enrolling and renewing a KeyVault-managed certificate requires prerequisite manual steps as described earlier in this article. For production environments, use KeyVault-managed certificates - a sample script specific to a Microsoft-internal PKI is included below. Certificate autorollover only makes sense for CA-issued certificates; using self-signed certificates, including those generated when deploying a Service Fabric cluster in the Azure portal, is nonsensical, but still possible for local/developer-hosted deployments, by declaring the issuer thumbprint to be the same as that of the leaf certificate.

Starting point

For brevity, we will assume the following starting state:

  • The Service Fabric cluster exists, and is secured with a CA-issued certificate declared by thumbprint.
  • The certificate is stored in a vault, and provisioned as a virtual machine scale set secret.
  • The same certificate will be used to convert the cluster to common name-based certificate declarations, and so can be validated by subject and issuer; if this is not the case, obtain the CA-issued certificate intended for this purpose, and add it to the cluster definition by thumbprint.

Here is a json excerpt from a template corresponding to such a state - note this omits many required settings, and only illustrates the certificate-related aspects:

  "resources": [
    {   // VMSS definition
      "apiVersion": "[variables('vmssApiVersion')]",
      "type": "Microsoft.Compute/virtualMachineScaleSets",
      "name": "[variables('vmNodeTypeName')]",
      "location": "[variables('computeLocation')]",
      "properties": {
        "virtualMachineProfile": {
          "extensionProfile": {
            "extensions": [
            {
                "name": "[concat('ServiceFabricNodeVmExt','_vmNodeTypeName')]",
                "properties": {
                  "type": "ServiceFabricNode",
                  "autoUpgradeMinorVersion": true,
                  "publisher": "Microsoft.Azure.ServiceFabric",
                  "settings": {
                    "clusterEndpoint": "[reference(parameters('clusterName')).clusterEndpoint]",
                    "nodeTypeRef": "[variables('vmNodeTypeName')]",
                    "dataPath": "D:\\SvcFab",
                    "durabilityLevel": "Bronze",
                    "certificate": {
                        "thumbprint": "[parameters('primaryClusterCertificateTP')]",
                        "x509StoreName": "[parameters('certificateStoreValue')]"
                    }
                  },
                  "typeHandlerVersion": "1.1"
                }
            }]
          },
          "osProfile": {
            "adminPassword": "[parameters('adminPassword')]",
            "adminUsername": "[parameters('adminUsername')]",
            "secrets": [
            {
                "sourceVault": {
                    "id": "[resourceId('Microsoft.KeyVault/vaults', parameters('keyVaultName'))]"
                },
                "vaultCertificates": [
                {
                    "certificateStore": "[parameters('certificateStoreValue')]",
                    "certificateUrl": "[parameters('clusterCertificateUrlValue')]"
                }
            ]}]
          }
        }
      }
    },
    {   // cluster definition
        "apiVersion": "[variables('sfrpApiVersion')]",
        "type": "Microsoft.ServiceFabric/clusters",
        "name": "[parameters('clusterName')]",
        "location": "[parameters('clusterLocation')]",
        "certificate": {
            "thumbprint": "[parameters('primaryClusterCertificateTP')]",
            "x509StoreName": "[parameters('certificateStoreValue')]"
        }
    }
  ]

The above essentially says that the certificate with thumbprint [parameters('primaryClusterCertificateTP')], found at the KeyVault URI [parameters('clusterCertificateUrlValue')], is declared by thumbprint as the cluster's sole certificate. Next, we'll set up the additional resources needed to ensure the autorollover of the certificate.

Setting up prerequisite resources

As mentioned before, a certificate provisioned as a virtual machine scale set secret is retrieved from the vault by the Microsoft.Compute Resource Provider service, using its first-party identity and on behalf of the deployment operator. For autorollover, that will change: we'll switch to using a managed identity, assigned to the virtual machine scale set, which is granted permissions to the vault's secrets.

All of the subsequent excerpts should be deployed together; they are listed individually for play-by-play analysis and explanation.

First, define a user-assigned identity (default values are included as examples); refer to the official documentation for up-to-date information:

{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "userAssignedIdentityName": {
      "type": "string",
      "defaultValue": "sftstuaicus",
      "metadata": {
        "description": "User-assigned managed identity name"
      }
    }
  },
  "variables": {
      "vmssApiVersion": "2018-06-01",
      "sfrpApiVersion": "2018-02-01",
      "miApiVersion": "2018-11-30",
      "kvApiVersion": "2018-02-14",
      "userAssignedIdentityResourceId": "[resourceId('Microsoft.ManagedIdentity/userAssignedIdentities', parameters('userAssignedIdentityName'))]"
  },    
  "resources": [
    {
      "type": "Microsoft.ManagedIdentity/userAssignedIdentities",
      "name": "[parameters('userAssignedIdentityName')]",
      "apiVersion": "[variables('miApiVersion')]",
      "location": "[resourceGroup().location]"
    }
  ]}

Then grant this identity access to the vault secrets; refer to the official documentation for current information:

  "resources":
  [{
      "type": "Microsoft.KeyVault/vaults/accessPolicies",
      "name": "[concat(parameters('keyVaultName'), '/add')]",
      "apiVersion": "[variables('kvApiVersion')]",
      "dependsOn": [
        "[variables('userAssignedIdentityResourceId')]"
      ],
      "properties": {
        "accessPolicies": [
          {
            "tenantId": "[reference(variables('userAssignedIdentityResourceId'), variables('miApiVersion')).tenantId]",
            "objectId": "[reference(variables('userAssignedIdentityResourceId'), variables('miApiVersion')).principalId]",
            "permissions": {
              "secrets": [
                "get",
                "list"
              ]
            }
          }
        ]
      }
  }]

In the next step, we'll:

  • assign the user-assigned identity to the virtual machine scale set
  • declare the virtual machine scale set's dependency on the creation of the managed identity, and on the result of granting it access to the vault
  • declare the KeyVault VM extension, requiring that it retrieve the observed certificates on startup (see the official documentation)
  • update the definition of the Service Fabric VM extension to depend on the KVVM extension, and to convert the cluster certificate declaration to common name

We're making these changes in a single step since they fall under the scope of the same resource.

  "parameters": {
    "kvvmextPollingInterval": {
      "type": "string",
      "defaultValue": "3600",
      "metadata": {
        "description": "kv vm extension polling interval in seconds"
      }
    },
    "kvvmextLocalStoreName": {
      "type": "string",
      "defaultValue": "MY",
      "metadata": {
        "description": "kv vm extension local store name"
      }
    },
    "kvvmextLocalStoreLocation": {
      "type": "string",
      "defaultValue": "LocalMachine",
      "metadata": {
        "description": "kv vm extension local store location"
      }
    },
    "kvvmextObservedCertificates": {
      "type": "array",
      "defaultValue": [
                "https://sftestcus.vault.azure.cn/secrets/sftstcncluster",
                "https://sftestcus.vault.azure.cn/secrets/sftstcnserver"
            ],
      "metadata": {
        "description": "kv vm extension observed certificates versionless uri"
      }
    },
    "certificateCommonName": {
      "type": "string",
      "defaultValue": "cus.cluster.sftstcn.system.servicefabric.azure-int",
      "metadata": {
        "description": "Certificate Common name"
      }
    }
  },
  "resources": [
  {
    "apiVersion": "[variables('vmssApiVersion')]",
    "type": "Microsoft.Compute/virtualMachineScaleSets",
    "name": "[variables('vmNodeTypeName')]",
    "location": "[variables('computeLocation')]",
    "dependsOn": [
      "[variables('userAssignedIdentityResourceId')]",
      "[concat('Microsoft.KeyVault/vaults/', concat(parameters('keyVaultName'), '/accessPolicies/add'))]"
    ],
    "identity": {
      "type": "UserAssigned",
      "userAssignedIdentities": {
        "[variables('userAssignedIdentityResourceId')]": {}
      }
    },
    "virtualMachineProfile": {
      "extensionProfile": {
        "extensions": [
        {
          "name": "KVVMExtension",
          "properties": {
            "publisher": "Microsoft.Azure.KeyVault",
            "type": "KeyVaultForWindows",
            "typeHandlerVersion": "1.0",
            "autoUpgradeMinorVersion": true,
            "settings": {
                "secretsManagementSettings": {
                    "pollingIntervalInS": "[parameters('kvvmextPollingInterval')]",
                    "linkOnRenewal": false,
                    "observedCertificates": "[parameters('kvvmextObservedCertificates')]",
                    "requireInitialSync": true
                }
            }
          }
        },
        {
          "name": "[concat('ServiceFabricNodeVmExt','_vmNodeTypeName')]",
          "properties": {
            "type": "ServiceFabricNode",
            "provisionAfterExtensions" : [ "KVVMExtension" ],
            "publisher": "Microsoft.Azure.ServiceFabric",
            "settings": {
              "certificate": {
                "commonNames": [
                  "[parameters('certificateCommonName')]"
                ],
                "x509StoreName": "[parameters('certificateStoreValue')]"
              }
            },
            "typeHandlerVersion": "1.0"
          }
        }
        ]
      },  // extension profile
      "osProfile": {
        "adminPassword": "[parameters('adminPassword')]",
        "adminUsername": "[parameters('adminUsername')]"
      }
    }  // VM profile
  }
  ]

Note that, although not explicitly listed above, we removed the vault certificate URL from the 'osProfile' section of the virtual machine scale set. The last step is to update the cluster definition to change the certificate declaration from thumbprint to common name; here we are also pinning the issuer thumbprints:

  "parameters": {
    "certificateCommonName": {
      "type": "string",
      "defaultValue": "cus.cluster.sftstcn.system.servicefabric.azure-int",
      "metadata": {
        "description": "Certificate Common name"
      }
    },
    "certificateIssuerThumbprint": {
      "type": "string",
      "defaultValue": "1b45ec255e0668375043ed5fe78a09ff1655844d,d7fe717b5ff3593764f4d90654d86e8362ec26c8,3ac7c3cac8de0dd392c02789c8be97474f456960,96ea05926e2e42cc207e358668be2c316857fb5e",
      "metadata": {
        "description": "Certificate issuer thumbprints separated by comma"
      }
    }
  },
  "resources": [
    {
      "apiVersion": "[variables('sfrpApiVersion')]",
      "type": "Microsoft.ServiceFabric/clusters",
      "name": "[parameters('clusterName')]",
      "location": "[parameters('clusterLocation')]",
      "properties": {
        "certificateCommonNames": {
          "commonNames": [{
              "certificateCommonName": "[parameters('certificateCommonName')]",
              "certificateIssuerThumbprint": "[parameters('certificateIssuerThumbprint')]"
          }],
          "x509StoreName": "[parameters('certificateStoreValue')]"
        }
      }
    }
  ]

At this point, you can run the updates mentioned above in a single deployment; for its part, the Service Fabric Resource Provider service will split the cluster upgrade into several steps, as described in the segment on converting cluster certificates from thumbprint to common name.

Analysis and observations

This section is a catch-all for explaining the steps detailed above, as well as for drawing attention to important aspects.

Certificate provisioning, explained

The KVVM extension, as a provisioning agent, runs continuously at a predetermined frequency. On failing to retrieve an observed certificate, it continues to the next in line, and then hibernates until the next cycle. The SFVM extension, as the cluster bootstrapping agent, requires the declared certificates before the cluster can form. This, in turn, means that the SFVM extension can only run after the successful retrieval of the cluster certificate(s), denoted here by the "provisionAfterExtensions" : [ "KVVMExtension" ] clause and by the KeyVault VM extension's "requireInitialSync": true setting. The latter indicates to the KVVM extension that on the first run (after deployment or a reboot) it must cycle through its observed certificates until all are downloaded successfully. Setting this parameter to false, coupled with a failure to retrieve the cluster certificate(s), would result in a failure of the cluster deployment. Conversely, requiring an initial sync with an incorrect or invalid list of observed certificates would result in a failure of the KVVM extension and so, again, a failure to deploy the cluster.
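The interplay between per-cycle retrieval and the requireInitialSync setting can be sketched as a toy model. The Python below is not the extension's implementation: the function names, the `fetch` callback, the placeholder URIs, and the bounded `max_rounds` retry limit are all assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, List, Set


def first_run(observed: Iterable[str],
              fetch: Callable[[str], bool],
              require_initial_sync: bool,
              max_rounds: int = 5) -> Set[str]:
    """Toy model of the provisioning agent's first run.

    Returns the certificate URIs retrieved before a dependent extension
    (here, the Service Fabric one) would be allowed to start.
    """
    pending: List[str] = list(observed)
    retrieved: Set[str] = set()
    for _ in range(max_rounds):  # assumption: retries are bounded in this model
        failed: List[str] = []
        for uri in pending:
            if fetch(uri):
                retrieved.add(uri)
            else:
                failed.append(uri)  # continue with the next certificate in line
        pending = failed
        if not pending or not require_initial_sync:
            break  # without initial sync, leftovers wait for the next polling cycle
    if require_initial_sync and pending:
        # Models the extension failing, which in turn fails the deployment.
        raise RuntimeError(f"initial sync failed for: {pending}")
    return retrieved


def fail_once(uri_to_fail: str) -> Callable[[str], bool]:
    """Hypothetical fetcher: the given URI fails on its first attempt only."""
    attempts: Dict[str, int] = {}

    def fetch(uri: str) -> bool:
        attempts[uri] = attempts.get(uri, 0) + 1
        return not (uri == uri_to_fail and attempts[uri] == 1)

    return fetch


observed = ["vault/secrets/cluster", "vault/secrets/server"]
with_sync = first_run(observed, fail_once("vault/secrets/cluster"), True)
without_sync = first_run(observed, fail_once("vault/secrets/cluster"), False)
```

In this model, `with_sync` ends up holding both URIs, while `without_sync` lacks the cluster certificate after the first pass, which corresponds to the case where the cluster cannot form.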

Certificate linking, explained

You may have noticed the KVVM extension's 'linkOnRenewal' flag, and the fact that it is set to false. Here we address in depth the behavior controlled by this flag and its implications for the functioning of a cluster. Note that this behavior is specific to Windows.

According to its definition:

"linkOnRenewal": <Only Windows. This feature enables auto-rotation of SSL certificates, without necessitating a re-deployment or binding.  e.g.: false>,

Certificates used to establish a TLS connection are typically acquired as a handle via the S-channel Security Support Provider; that is, the client does not directly access the private key of the certificate itself. S-channel supports redirection (linking) of credentials in the form of a certificate extension (CERT_RENEWAL_PROP_ID): if this property is set, its value represents the thumbprint of the 'renewal' certificate, and so S-channel will instead attempt to load the linked certificate. In fact, it will traverse this linked (and hopefully acyclic) list until it ends up with the 'final' certificate, one without a renewal mark. This feature, when used judiciously, is a great mitigation against loss of availability caused by expired certificates, for instance. In other cases, it can be the cause of outages that are difficult to diagnose and mitigate. S-channel traverses certificates along their renewal properties unconditionally, irrespective of subject, issuers, or any other specific attributes that participate in the validation of the resulting certificate by the client. It is possible, indeed, that the resulting certificate has no associated private key, or that the key has not been ACLed to its prospective consumer.
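The traversal of renewal links described above can be pictured with a small Python model. It is only a sketch: the dictionary-based store and the function name are hypothetical, and the explicit cycle and missing-entry checks are additions for the sake of the example, not a statement about how S-channel handles those cases.

```python
from typing import Dict, Optional


def resolve_renewal_chain(renewal_of: Dict[str, Optional[str]], thumbprint: str) -> str:
    """Follow renewal links from `thumbprint` to the 'final' certificate.

    `renewal_of` maps a certificate thumbprint to the thumbprint stored in
    its renewal property, or None when the property is not set.
    """
    seen = set()
    current = thumbprint
    while True:
        if current not in renewal_of:
            raise KeyError(f"linked certificate {current} is not in the store")
        if current in seen:
            raise ValueError(f"renewal links form a cycle at {current}")
        seen.add(current)
        linked = renewal_of[current]
        if not linked:
            return current  # no renewal mark: this is the 'final' certificate
        current = linked  # followed without looking at subject or issuer


# Hypothetical store: A is linked to C; B and C carry no renewal mark.
store = {"A": "C", "B": None, "C": None}
loaded_for_a = resolve_renewal_chain(store, "A")
loaded_for_b = resolve_renewal_chain(store, "B")
```

Note that nothing in the loop inspects the subject or issuer of the linked certificate, which is exactly why a bad link can redirect a caller to an unexpected credential.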

If linking is enabled, the KeyVault VM extension, upon retrieving an observed certificate from the vault, will attempt to find matching, existing certificates in order to link them via the renewal extension property. The matching is based exclusively on Subject Alternative Name (SAN), and works as exemplified below. Assume two existing certificates, as follows:

  A: CN = "Alice's accessories", SAN = {"alice.universalexports.com"}, renewal = ''
  B: CN = "Bob's bits", SAN = {"bob.universalexports.com", "bob.universalexports.net"}, renewal = ''

Assume a certificate C is retrieved by the KVVM extension:

  C: CN = "Mallory's malware", SAN = {"alice.universalexports.com", "bob.universalexports.com", "mallory.universalexports.com"}

A's SAN list is fully included in C's, and so A.renewal = C.thumbprint; B's SAN list has a common intersection with C's, but is not fully included in it, so B.renewal remains empty.
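The matching rule in this example amounts to a subset test on SAN lists, which the following Python sketch reproduces. The `Cert` class, the placeholder thumbprints, and the handling of empty SAN lists are assumptions for illustration; only the subset rule itself comes from the description above.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass
class Cert:
    thumbprint: str
    common_name: str
    san: FrozenSet[str]
    renewal: str = ""  # thumbprint of the linked 'renewal' certificate, if any


def link_on_renewal(existing: Iterable[Cert], retrieved: Cert) -> None:
    """Link every existing certificate whose SAN list is fully included
    in the SAN list of the newly retrieved certificate."""
    for cert in existing:
        if cert.thumbprint == retrieved.thumbprint or not cert.san:
            continue  # assumption: skip the certificate itself and empty SAN lists
        if cert.san <= retrieved.san:  # subset test; CN and issuer play no part
            cert.renewal = retrieved.thumbprint


a = Cert("thumb-A", "Alice's accessories",
         frozenset({"alice.universalexports.com"}))
b = Cert("thumb-B", "Bob's bits",
         frozenset({"bob.universalexports.com", "bob.universalexports.net"}))
c = Cert("thumb-C", "Mallory's malware",
         frozenset({"alice.universalexports.com", "bob.universalexports.com",
                    "mallory.universalexports.com"}))

link_on_renewal([a, b], c)
```

After the call, `a.renewal` holds C's thumbprint and `b.renewal` is still empty, matching the outcome described for certificates A and B.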

Any attempt to invoke AcquireCredentialsHandle (S-channel) in this state on certificate A will actually end up sending C to the remote party. In the case of Service Fabric, the Transport subsystem of a cluster uses S-channel for mutual authentication, and so the behavior described above directly affects the cluster's fundamental communication. Continuing the example above, and assuming A is the cluster certificate, what happens next depends on whether C's private key is accessible:

  • if C's private key is not ACLed to the account that Fabric is running as, we'll see failures to acquire the private key (SEC_E_UNKNOWN_CREDENTIALS or similar)
  • if C's private key is accessible, we'll see authorization failures returned by the other nodes (CertificateNotMatched, unauthorized, and so on)

In either case, transport fails and the cluster may go down; the symptoms vary. To make things worse, the linking depends on the order of renewal, which is dictated by the order of the list of observed certificates of the KVVM extension, the renewal schedule in the vault, or even transient errors that would alter the order of retrieval.

To mitigate such incidents, we recommend that you:

  • do not mix the SANs of different vault certificates; each vault certificate should serve a distinct purpose, and its subject and SAN should reflect that with specificity
  • include the subject common name in the SAN list (as, literally, "CN=")
  • if unsure, disable linking on renewal for certificates provisioned with the KVVM extension

Why use a user-assigned managed identity? What are the implications of using it?

As emerged from the JSON snippets above, a specific sequencing of the operations and updates is required to guarantee the success of the conversion and to maintain the availability of the cluster. Specifically, the virtual machine scale set resource declares and uses its identity to retrieve secrets in a single (from the user's perspective) update. The Service Fabric VM extension (which bootstraps the cluster) depends on the completion of the KeyVault VM extension, which in turn depends on the successful retrieval of the observed certificates. The KVVM extension uses the virtual machine scale set's identity to access the vault, which means that the access policy on the vault must already have been updated prior to the deployment of the virtual machine scale set.

To create a managed identity, or to assign it to another resource, the deployment operator must have the required role (ManagedIdentityOperator) in the subscription or the resource group, in addition to the roles required to manage the other resources referenced in the template.

From a security standpoint, recall that the virtual machine (scale set) is considered a security boundary with regard to its Azure identity. That means that any application hosted on the VM could, in principle, obtain an access token representing the VM, since managed identity access tokens are obtained from the unauthenticated IMDS endpoint. If you consider the VM to be a shared or multi-tenant environment, then this method of retrieving cluster certificates may not be appropriate. It is, however, the only provisioning mechanism suitable for certificate autorollover.
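To make the point about the unauthenticated IMDS endpoint concrete, the Python sketch below shows how little a process on the VM needs in order to request a token for the assigned identity. The helper names are hypothetical. The endpoint address, the `Metadata: true` header, and the `api-version`, `resource`, and `client_id` query parameters follow the publicly documented IMDS token request, but verify them against the current documentation for your cloud (for example, the Key Vault resource URI differs in sovereign clouds).

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

IMDS_TOKEN_URL = "http://169.254.169.254/metadata/identity/oauth2/token"


def build_imds_token_request(resource: str,
                             client_id: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a token request for the VM's managed identity."""
    query = {"api-version": "2018-02-01", "resource": resource}
    if client_id:
        # Selects a specific user-assigned identity when several are assigned.
        query["client_id"] = client_id
    url = IMDS_TOKEN_URL + "?" + urllib.parse.urlencode(query)
    # No credential is attached: the only requirement is the Metadata header.
    return urllib.request.Request(url, headers={"Metadata": "true"})


def fetch_access_token(resource: str,
                       client_id: Optional[str] = None,
                       timeout: float = 5.0) -> str:
    """Send the request; this only works from inside an Azure VM."""
    request = build_imds_token_request(resource, client_id)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["access_token"]


# Assumed resource URI for Key Vault in the global Azure cloud.
request = build_imds_token_request("https://vault.azure.net")
```

Since the request carries no secret, any code running on the VM can issue it, which is the reason the VM itself has to be treated as the security boundary.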

Troubleshooting and frequently asked questions

Q: How do I programmatically enroll into a KeyVault-managed certificate?
A: Find out the name of the issuer from the KeyVault documentation, then replace it in the script below.

    $issuerName = "<depends on your PKI of choice>"
    $clusterVault="sftestcus"
    $clusterCertVaultName="sftstcncluster"
    $clusterCertCN="cus.cluster.sftstcn.system.servicefabric.azure-int"
    Set-AzKeyVaultCertificateIssuer -VaultName $clusterVault -Name $issuerName -IssuerProvider $issuerName
    $distinguishedName="CN=" + $clusterCertCN
    $policy = New-AzKeyVaultCertificatePolicy `
        -IssuerName $issuerName `
        -SubjectName $distinguishedName `
        -SecretContentType 'application/x-pkcs12' `
        -Ekus "1.3.6.1.5.5.7.3.1", "1.3.6.1.5.5.7.3.2" `
        -ValidityInMonths 4 `
        -KeyType 'RSA' `
        -RenewAtPercentageLifetime 75        
    Add-AzKeyVaultCertificate -VaultName $clusterVault -Name $clusterCertVaultName -CertificatePolicy $policy

    # poll the result of the enrollment
    Get-AzKeyVaultCertificateOperation -VaultName $clusterVault -Name $clusterCertVaultName 

Q: What happens when a certificate is issued by an undeclared or unspecified issuer? Where can I obtain the exhaustive list of active issuers of a given PKI?
A: If the certificate declaration specifies issuer thumbprints, and the direct issuer of the certificate is not included in the list of pinned issuers, the certificate will be considered invalid, irrespective of whether or not its root is trusted by the client. It is therefore critical to ensure that the list of issuers is up to date. The introduction of a new issuer is a rare event, and should be widely publicized before that issuer begins to issue certificates.

In general, a PKI will publish and maintain a certification practice statement, in accordance with IETF RFC 7382. Among other information, it will include all active issuers. The way to retrieve this list programmatically may differ from one PKI to another.

For Microsoft-internal PKIs, please consult the internal documentation on the endpoints/SDKs used to retrieve the authorized issuers; it is the cluster owner's responsibility to probe this list periodically and to ensure that their cluster definition includes all expected issuers.
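The issuer-pinning rule from this answer can be expressed as a simple membership test. The Python below is a sketch with hypothetical helper names: it parses a comma-separated list in the same shape as the certificateIssuerThumbprint parameter shown earlier, and the case-insensitive comparison and whitespace trimming are assumptions made for the example.

```python
from typing import Set


def parse_pinned_issuers(value: str) -> Set[str]:
    """Split a comma-separated thumbprint list into a normalized set."""
    return {tp.strip().lower() for tp in value.split(",") if tp.strip()}


def passes_issuer_pinning(direct_issuer_thumbprint: str, pinned: Set[str]) -> bool:
    """Apply the pinning rule only; chain and other validation are out of scope."""
    if not pinned:
        return True  # no issuers declared, so this particular check does not apply
    return direct_issuer_thumbprint.strip().lower() in pinned


# The first two example thumbprints from the cluster definition above.
PINNED = parse_pinned_issuers(
    "1b45ec255e0668375043ed5fe78a09ff1655844d,"
    "d7fe717b5ff3593764f4d90654d86e8362ec26c8"
)

known = passes_issuer_pinning("D7FE717B5FF3593764F4D90654D86E8362EC26C8", PINNED)
unknown = passes_issuer_pinning("00" * 20, PINNED)
```

A periodic job that compares the PKI's published issuer list against the pinned set, using a check like this one, is one way to notice a new issuer before certificates issued by it start being rejected.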

Q: Are multiple PKIs supported?
A: Yes; you may not declare multiple CN entries in the cluster manifest with the same value, but you can list issuers from multiple PKIs corresponding to the same CN. This is not a recommended practice, and certificate transparency practices may prevent such certificates from being issued. However, as a means to migrate from one PKI to another, it is an acceptable mechanism.

Q: What if the current cluster certificate is not CA-issued, or does not have the intended subject?
A: Obtain a certificate with the intended subject, and add it to the cluster's definition as a secondary, by thumbprint. Once the upgrade has completed successfully, initiate another cluster configuration upgrade to convert the certificate declaration to common name.