Production readiness checklist
Are your application and cluster ready to take production traffic? Running and testing your application and cluster doesn't necessarily mean they are ready to go into production. Keep your application and cluster running smoothly by going through the following checklist. We strongly recommend checking off every item. You can choose alternative solutions for a particular line item (for example, your own diagnostics frameworks).
Prerequisites for production
Configure FabricTransport settings if you're using the Reliable Actors programming model and require secure inter-service communication.
For clusters with more than 20 cores or 10 nodes, create a dedicated primary node type for system services. Add placement constraints to reserve the primary node type for system services.
Use a D2v2 or higher SKU for the primary node type. We recommend a SKU with at least 50 GB of hard disk capacity.
Add resource constraints on containers and services, so that they don't consume more than 75% of node resources.
Understand and set the durability level. Silver or higher durability level is recommended for node types running stateful workloads, and required for production.
Understand and pick the reliability level of the node type. Silver or higher reliability is recommended, and required for production.
Load and scale test your workloads to identify capacity requirements for your cluster.
Your services and applications are monitored and application logs are being generated and stored, with alerting. For example, see Add logging to your Service Fabric application.
The underlying virtual machine scale set infrastructure is monitored with alerting (for example, with Azure Monitor logs).
The cluster always has primary and secondary certificates (so you don't get locked out).
Maintain separate clusters for development, staging, and production.
Turn off automatic upgrades in production clusters, and turn them on for development and staging clusters (roll back as needed).
Establish a Recovery Point Objective (RPO) for your service, and set up a disaster recovery process and test it out.
Plan for scaling your cluster manually or programmatically.
Plan for patching your cluster nodes.
Plan for scaling your applications.
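As a rough illustration of the 75% resource guideline above, the following sketch checks whether a set of per-service limits fits within that budget on a single node. The node size, service names, and limits are hypothetical values chosen for the example, and `Limits` and `fits_budget` are illustrative helpers, not part of any Service Fabric API. In a real cluster the limits would be declared through Service Fabric's resource governance settings rather than computed in a script.

```python
from dataclasses import dataclass

# Guideline from the checklist: stay at or under 75% of node resources.
NODE_BUDGET_FRACTION = 0.75


@dataclass(frozen=True)
class Limits:
    cpu_cores: float
    memory_mb: int


def fits_budget(node: Limits, services: dict[str, Limits],
                fraction: float = NODE_BUDGET_FRACTION) -> bool:
    """Return True if the summed service limits stay within the node budget."""
    total_cpu = sum(s.cpu_cores for s in services.values())
    total_mem = sum(s.memory_mb for s in services.values())
    return (total_cpu <= node.cpu_cores * fraction
            and total_mem <= node.memory_mb * fraction)


# Hypothetical node and per-service limits, for illustration only.
node = Limits(cpu_cores=4, memory_mb=16384)
services = {
    "frontend": Limits(cpu_cores=1.0, memory_mb=4096),
    "orders": Limits(cpu_cores=1.5, memory_mb=6144),
}

# 2.5 of 3.0 budgeted cores and 10240 of 12288 budgeted MB are used.
within_budget = fits_budget(node, services)
```

A check like this could run as part of a deployment pipeline, so that a change to one service's limits that pushes the node past the budget is caught before it reaches production.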
If you're using the Service Fabric Reliable Services or Reliable Actors programming model, the following items need to be checked off:
- Upgrade applications during local development to check that your service code is honoring the cancellation token in the RunAsync method and closing custom communication listeners.
- Avoid common pitfalls when using Reliable Collections.
- Monitor the .NET CLR memory performance counters when running load tests and check for high rates of Garbage Collection or runaway heap growth.
- Maintain offline backup of Reliable Services and Reliable Actors and test the restoration process.
- Your primary node type's virtual machine instance count should ideally equal the minimum for your cluster's reliability tier. Exceeding the tier minimum is appropriate in some conditions, for example temporarily while vertically scaling the SKU of your primary node type's virtual machine scale set.
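The first item in the list above, honoring cancellation and releasing listeners, can be sketched in a language-neutral way. The example below uses Python's `asyncio` task cancellation as a stand-in for the Service Fabric cancellation token. `run_async` and `FakeListener` are hypothetical names for this illustration and are not Service Fabric APIs. The point is only the shape of the pattern: the long-running loop stops promptly when cancellation is requested, and cleanup still runs on the way out.

```python
import asyncio


class FakeListener:
    """Hypothetical stand-in for a custom communication listener."""

    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        self.closed = True


async def run_async(listener: FakeListener) -> None:
    """Long-running work loop that cooperates with cancellation."""
    try:
        while True:
            # Each await is a point where a pending cancellation is delivered.
            await asyncio.sleep(0.01)
    finally:
        # Runs on cancellation too, so the listener is always released.
        await listener.close()


async def main() -> bool:
    listener = FakeListener()
    task = asyncio.create_task(run_async(listener))
    await asyncio.sleep(0.05)  # let the service do some work
    task.cancel()              # simulates an upgrade requesting shutdown
    try:
        await task
    except asyncio.CancelledError:
        pass                   # expected: the task stopped when asked
    return listener.closed


listener_closed = asyncio.run(main())
```

A loop that never reaches a cancellation point, or that swallows the cancellation and keeps running, is the kind of defect that tends to surface as a stuck upgrade, which is why exercising upgrades during local development is on the checklist.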
Optional best practices
While the above lists are prerequisites for going into production, the following items should also be considered:
- Plug into the Service Fabric health model for extending the built-in health evaluation and reporting.
- Deploy a custom watchdog that monitors your application and reports load for resource balancing.