Production readiness checklist

Are your application and cluster ready to take production traffic? Running and testing your application and cluster doesn't necessarily mean they are ready to go into production. Keep your application and cluster running smoothly by going through the following checklist. We strongly recommend checking off all of these items. You can choose an alternative solution for a particular line item (for example, your own diagnostics framework).

Prerequisites for production

  1. Follow the Azure Service Fabric best practices for Application Design, Security, Networking, Capacity planning and scaling, and Infrastructure as Code.

  2. Implement the Reliable Actors security configuration if you use the Actors programming model.

  3. For clusters with more than 20 cores or 10 nodes, create a dedicated primary node type for system services. Add placement constraints to reserve the primary node type for system services.

  4. Use a D2v2 or higher SKU for the primary node type. Pick a SKU with at least 50 GB of hard disk capacity.

  5. Production clusters must be secure. For an example of setting up a secure cluster, see this cluster template. Use common names for certificates and avoid self-signed certificates.

  6. Add resource constraints on containers and services so that they don't consume more than 75% of node resources.

  7. Understand and set the durability level. Silver or higher durability is recommended for node types running stateful workloads. The primary node type should have its durability level set to Silver or higher.

  8. Understand and pick the reliability level of the node type. Silver or higher reliability is recommended.

  9. Load and scale test your workloads to identify capacity requirements for your cluster.

  10. Monitor your services and applications, and make sure application logs are generated and stored, with alerting. For example, see Add logging to your Service Fabric application.

  11. Always keep both a primary and a secondary certificate on the cluster (so you don't get locked out).

  12. Maintain separate clusters for development, staging, and production.

  13. Test application upgrades and cluster upgrades in development and staging clusters first.

  14. Turn off automatic upgrades in production clusters, and turn them on for development and staging clusters (roll back as needed).

  15. Establish a Recovery Point Objective (RPO) for your service, set up a disaster recovery process, and test it.

  16. Plan for scaling your cluster manually or programmatically.

  17. Plan for patching your cluster nodes.

  18. Establish a CI/CD pipeline so that your latest changes are continually tested, for example, by using Azure DevOps.

  19. Test your development and staging clusters under load with the Fault Analysis Service and induce controlled chaos.

  20. Plan for scaling your applications.
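
The numeric thresholds in items 3, 4, and 6 can be checked mechanically before a deployment. The following Python sketch is illustrative only: the `ClusterPlan` type, its field names, and the `readiness_findings` function are hypothetical and are not part of any Service Fabric SDK. It also assumes that the 75% limit in item 6 applies to the combined share of a node's resources that containers and services may consume.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterPlan:
    """Hypothetical summary of a planned cluster (not a Service Fabric type)."""
    total_cores: int
    node_count: int
    dedicated_primary_node_type: bool
    primary_disk_gb: int
    # Assumption: combined share of a node's resources that containers
    # and services are allowed to consume, as a fraction between 0 and 1.
    resource_limit_fraction: float


def readiness_findings(plan: ClusterPlan) -> List[str]:
    """Return one message per checklist threshold that the plan violates."""
    findings = []

    # Item 3: more than 20 cores or 10 nodes calls for a dedicated primary node type.
    is_large = plan.total_cores > 20 or plan.node_count > 10
    if is_large and not plan.dedicated_primary_node_type:
        findings.append("Create a dedicated primary node type for system services.")

    # Item 4: at least 50 GB of hard disk capacity on the primary node type.
    if plan.primary_disk_gb < 50:
        findings.append("Pick a primary node type SKU with at least 50 GB of disk.")

    # Item 6: containers and services should not exceed 75% of node resources.
    if plan.resource_limit_fraction > 0.75:
        findings.append("Lower resource constraints to at most 75% of node resources.")

    return findings
```

An empty list means the plan passes these three checks; it says nothing about the remaining items, which still need manual review.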

If you're using the Service Fabric Reliable Services or Reliable Actors programming model, check off the following items:

  1. Upgrade applications during local development to check that your service code honors the cancellation token in the RunAsync method and closes custom communication listeners.
  2. Avoid common pitfalls when using Reliable Collections.
  3. Monitor the .NET CLR memory performance counters when running load tests, and check for high rates of garbage collection or runaway heap growth.
  4. Maintain offline backups of Reliable Services and Reliable Actors, and test the restoration process.
  5. Your primary node type's virtual machine instance count should ideally equal the minimum for your cluster's reliability tier. One condition in which it is appropriate to exceed the tier minimum is temporarily, while vertically scaling the virtual machine scale set SKU of your primary node type.
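
Item 5 can be expressed as a simple comparison against the tier minimum. The sketch below is a hedged illustration: the function and constant names are hypothetical, and the per-tier minimums (Bronze 3, Silver 5, Gold 7, Platinum 9) are the values commonly documented for Service Fabric reliability tiers, so verify them against the current documentation before relying on them.

```python
# Assumed minimum primary node counts per reliability tier; verify these
# against the current Service Fabric documentation before relying on them.
RELIABILITY_TIER_MIN_NODES = {
    "Bronze": 3,
    "Silver": 5,
    "Gold": 7,
    "Platinum": 9,
}


def primary_node_count_status(tier: str, instance_count: int) -> str:
    """Compare a primary node type's instance count with its tier minimum."""
    minimum = RELIABILITY_TIER_MIN_NODES[tier]  # raises KeyError for unknown tiers
    if instance_count < minimum:
        return "below minimum"
    if instance_count == minimum:
        return "at minimum"
    # Exceeding the minimum is expected only temporarily, for example
    # while vertically scaling the primary node type's scale set SKU.
    return "above minimum"
```

For example, `primary_node_count_status("Silver", 5)` reports that the count matches the tier minimum, which is the ideal state described in item 5.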

Optional best practices

While the above lists are prerequisites for going into production, the following items should also be considered:

  1. Plug into the Service Fabric health model to extend the built-in health evaluation and reporting.
  2. Deploy a custom watchdog that monitors your application and reports load for resource balancing.

Next steps