Service discovery resiliency

With Azure Container Apps resiliency, you can proactively prevent, detect, and recover from service request failures using simple resiliency policies. In this article, you learn how to configure Azure Container Apps resiliency policies when initiating requests using Azure Container Apps service discovery.

Note

Currently, resiliency policies can't be applied to requests made using the Dapr Service Invocation API.

Policies are in effect for each request to a container app. You can tailor policies to the container app accepting requests with configurations like:

  • The number of retries
  • Retry and timeout duration
  • Retry matches
  • Circuit breaker consecutive errors, and others

The following screenshot shows how an application uses a retry policy to attempt to recover from failed requests.

Diagram demonstrating container app to container app resiliency using a container app's service name.

Supported resiliency policies

Configure resiliency policies

Whether you configure resiliency policies using Bicep, the CLI, or the Azure portal, you can only apply one policy per container app.

When you apply a policy to a container app, the rules are applied to all requests made to that container app, not to requests made from that container app. For example, a retry policy is applied to a container app named App B. All inbound requests made to App B automatically retry on failure. However, outbound requests sent by App B aren't guaranteed to retry in failure.

The following resiliency example demonstrates all of the available configurations.

resource myPolicyDoc 'Microsoft.App/containerApps/resiliencyPolicies@2023-11-02-preview' = {
  name: 'my-app-resiliency-policies'
  parent: '${appName}'
  properties: {
    timeoutPolicy: {
        responseTimeoutInSeconds: 15
        connectionTimeoutInSeconds: 5
    }
    httpRetryPolicy: {
        maxRetries: 5
        retryBackOff: {
          initialDelayInMilliseconds: 1000
          maxIntervalInMilliseconds: 10000
        }
        matches: {
            headers: [
                {
                    header: 'x-ms-retriable'
                    match: { 
                        exactMatch: 'true'
                    }
                }
            ]
            httpStatusCodes: [
                502
                503
            ]
            errors: [
                'retriable-status-codes'
                '5xx'
                'reset'
                'connect-failure'
                'retriable-4xx'
            ]
        }
    } 
    tcpRetryPolicy: {
        maxConnectAttempts: 3
    }
    circuitBreakerPolicy: {
        consecutiveErrors: 5
        intervalInSeconds: 10
        maxEjectionPercent: 50
    }
    tcpConnectionPool: {
        maxConnections: 100
    }
    httpConnectionPool: {
        http1MaxPendingRequests: 1024
        http2MaxRequests: 1024
    }
  }
}

Policy specifications

Timeouts

Timeouts are used to early-terminate long-running operations. The timeout policy includes the following properties.

properties: {
  timeoutPolicy: {
      responseTimeoutInSeconds: 15
      connectionTimeoutInSeconds: 5
  }
}
Metadata Required Description Example
responseTimeoutInSeconds Yes Timeout waiting for a response from the container app. 15
connectionTimeoutInSeconds Yes Timeout to establish a connection to the container app. 5

Retries

Define a tcpRetryPolicy or an httpRetryPolicy strategy for failed operations. The retry policy includes the following configurations.

httpRetryPolicy

properties: {
    httpRetryPolicy: {
        maxRetries: 5
        retryBackOff: {
          initialDelayInMilliseconds: 1000
          maxIntervalInMilliseconds: 10000
        }
        matches: {
            headers: [
                {
                    header: 'x-ms-retriable'
                    match: { 
                       exactMatch: 'true'
                    }
                }
            ]
            httpStatusCodes: [
                502
                503
            ]
            errors: [
                'retriable-headers'
                'retriable-status-codes'
            ]
        }
    } 
}
Metadata Required Description Example
maxRetries Yes Maximum retries to be executed for a failed http-request. 5
retryBackOff Yes Monitor the requests and shut off all traffic to the impacted service when timeout and retry criteria are met. N/A
retryBackOff.initialDelayInMilliseconds Yes Delay between first error and first retry. 1000
retryBackOff.maxIntervalInMilliseconds Yes Maximum delay between retries. 10000
matches Yes Set match values to limit when the app should attempt a retry. headers, httpStatusCodes, errors
matches.headers Y* Retry when the error response includes a specific header. *Headers are only required properties if you specify the retriable-headers error property. Learn more about available header matches. X-Content-Type
matches.httpStatusCodes Y* Retry when the response returns a specific status code. *Status codes are only required properties if you specify the retriable-status-codes error property. 502, 503
matches.errors Yes Only retries when the app returns a specific error. Learn more about available errors. connect-failure, reset
Header matches

If you specified the retriable-headers error, you can use the following header match properties to retry when the response includes a specific header.

matches: {
  headers: [
    { 
      header: 'x-ms-retriable'
      match: {
        exactMatch: 'true'
      }
    }
  ]
}
Metadata Description
prefixMatch Retries are performed based on the prefix of the header value.
exactMatch Retries are performed based on an exact match of the header value.
suffixMatch Retries are performed based on the suffix of the header value.
regexMatch Retries are performed based on a regular expression rule where the header value must match the regex pattern.
Errors

You can perform retries on any of the following errors:

matches: {
  errors: [
    'retriable-headers'
    'retriable-status-codes'
    '5xx'
    'reset'
    'connect-failure'
    'retriable-4xx'
  ]
}
Metadata Description
retriable-headers HTTP response headers that trigger a retry. A retry is performed if any of the header-matches match the response headers. Required if you'd like to retry on any matching headers.
retriable-status-codes HTTP status codes that should trigger retries. Required if you'd like to retry on any matching status codes.
5xx Retry if server responds with any 5xx response codes.
reset Retry if the server doesn't respond.
connect-failure Retry if a request failed due to a faulty connection with the container app.
retriable-4xx Retry if the container app responds with a 400-series response code, like 409.

tcpRetryPolicy

properties: {
    tcpRetryPolicy: {
        maxConnectAttempts: 3
    }
}
Metadata Required Description Example
maxConnectAttempts Yes Set the maximum connection attempts (maxConnectionAttempts) to retry on failed connections. 3

Circuit breakers

Circuit breaker policies specify whether a container app replica is temporarily removed from the load balancing pool, based on triggers like the number of consecutive errors.

properties: {
    circuitBreakerPolicy: {
        consecutiveErrors: 5
        intervalInSeconds: 10
        maxEjectionPercent: 50
    }
}
Metadata Required Description Example
consecutiveErrors Yes Consecutive number of errors before a container app replica is temporarily removed from load balancing. 5
intervalInSeconds Yes The amount of time given to determine if a replica is removed or restored from the load balance pool. 10
maxEjectionPercent Yes Maximum percent of failing container app replicas to eject from load balancing. Removes at least one host regardless of the value. 50

Connection pools

Azure Container App's connection pooling maintains a pool of established and reusable connections to container apps. This connection pool reduces the overhead of creating and tearing down individual connections for each request.

Connection pools allow you to specify the maximum number of requests or connections allowed for a service. These limits control the total number of concurrent connections for each service. When this limit is reached, new connections aren't established to that service until existing connections are released or closed. This process of managing connections prevents resources from being overwhelmed by requests and maintains efficient connection management.

httpConnectionPool

properties: {
    httpConnectionPool: {
        http1MaxPendingRequests: 1024
        http2MaxRequests: 1024
    }
}
Metadata Required Description Example
http1MaxPendingRequests Yes Used for http1 requests. Maximum number of open connections to a container app. 1024
http2MaxRequests Yes Used for http2 requests. Maximum number of concurrent requests to a container app. 1024

tcpConnectionPool

properties: {
    tcpConnectionPool: {
        maxConnections: 100
    }
}
Metadata Required Description Example
maxConnections Yes Maximum number of concurrent connections to a container app. 100

Resiliency observability

You can perform resiliency observability via your container app's metrics and system logs.

Resiliency logs

From the Monitoring section of your container app, select Logs.

Screenshot demonstrating where to find the logs for your container app.

In the Logs pane, write and run a query to find resiliency via your container app system logs. For example, run a query similar to the following to search for resiliency events and show their:

  • Time stamp
  • Environment name
  • Container app name
  • Resiliency type and reason
  • Log messages
ContainerAppSystemLogs_CL
| where EventSource_s == "Resiliency"
| project TimeStamp_s, EnvironmentName_s, ContainerAppName_s, Type_s, EventSource_s, Reason_s, Log_s

Click Run to run the query and view results.

Screenshot showing resiliency query results based on provided query example.

Resiliency metrics

From the Monitoring menu of your container app, select Metrics. In the Metrics pane, select the following filters:

  • The scope to the name of your container app.
  • The Standard metrics metrics namespace.
  • The resiliency metrics from the drop-down menu.
  • How you'd like the data aggregated in the results (by average, by maximum, etc.).
  • The time duration (last 30 minutes, last 24 hours, etc.).

Screenshot demonstrating how to access the resiliency metrics filters for your container app.

For example, if you set the Resiliency Request Retries metric in the test-app scope with Average aggregation to search within a 30-minute timeframe, the results look like the following:

Screenshot showing the results from example metrics filters for resiliency.