Manage connectivity and reliable messaging by using Azure IoT Hub device SDKs
This article provides high-level guidance to help you design device applications that are more resilient. It shows you how to take advantage of the connectivity and reliable messaging features in Azure IoT device SDKs. The goal of this guide is to help you manage the following scenarios:
- Fixing a dropped network connection
- Switching between different network connections
- Reconnecting because of transient service connection errors
Implementation details may vary by language. For more information, see the API documentation or the specific SDK:
- Python SDK (reliability not yet implemented)
Designing for resiliency
IoT devices often rely on non-continuous or unstable network connections (for example, GSM or satellite). Errors can occur when devices interact with cloud-based services because of intermittent service availability and infrastructure-level or transient faults. An application that runs on a device has to manage the mechanisms for connection and reconnection, as well as the retry logic for sending and receiving messages. The retry strategy requirements also depend heavily on the device's IoT scenario, context, and capabilities.
The Azure IoT Hub device SDKs aim to simplify connecting and communicating from cloud-to-device and device-to-cloud. These SDKs provide a robust way to connect to Azure IoT Hub and a comprehensive set of options for sending and receiving messages. Developers can also modify the existing implementation to tailor the retry strategy to a given scenario.
The relevant SDK features that support connectivity and reliable messaging are covered in the following sections.
Connection and retry
This section gives an overview of the reconnection and retry patterns available when managing connections. It details implementation guidance for using a different retry policy in your device application and lists relevant APIs from the device SDKs.
Error patterns
Connection failures can happen at many levels:
- Network errors: disconnected socket and name resolution errors
- Protocol-level errors for HTTP, AMQP, and MQTT transport: detached links or expired sessions
- Application-level errors that result from either local mistakes (for example, invalid credentials) or service behavior (for example, exceeding the quota or throttling)
The device SDKs detect errors at all three levels. They neither detect nor handle OS-related errors and hardware errors. The SDK design is based on the Transient Fault Handling Guidance from the Azure Architecture Center.
Retry patterns
The following steps describe the retry process when connection errors are detected:
- The SDK detects the error and the level it is associated with: network, protocol, or application.
- The SDK uses the error filter to determine the error type and decide if a retry is needed.
- If the SDK identifies an unrecoverable error, operations like connection, send, and receive are stopped. The SDK notifies the user. Examples of unrecoverable errors include an authentication error and a bad endpoint error.
- If the SDK identifies a recoverable error, it retries according to the specified retry policy until the defined timeout elapses. By default, the SDK uses the exponential back-off with jitter retry policy.
- When the defined timeout expires, the SDK stops trying to connect or send. It notifies the user.
- The SDK allows the user to attach a callback to receive connection status changes.
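The steps above can be sketched as a small retry loop that does not depend on any device SDK. This is only an illustration under stated assumptions: the names `RetrySketch`, `IsRecoverable`, `ExecuteWithRetry`, and the `onStatus` callback are hypothetical, the status strings are invented for the example, and `TimeoutException` and `IOException` merely stand in for recoverable errors while every other exception is treated as unrecoverable.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Threading;

public static class RetrySketch
{
    // Hypothetical error filter: decides whether an error is worth retrying.
    // A real SDK classifies many more network, protocol, and application errors.
    public static bool IsRecoverable(Exception ex) =>
        ex is TimeoutException || ex is IOException;

    // Runs the operation and retries recoverable errors until the timeout elapses.
    // nextDelay maps the 0-based retry count to the wait before the next attempt.
    // onStatus is a hypothetical callback that receives status notifications.
    public static bool ExecuteWithRetry(
        Action operation,
        Func<int, TimeSpan> nextDelay,
        TimeSpan timeout,
        Action<string> onStatus)
    {
        var elapsed = Stopwatch.StartNew();
        for (int retryCount = 0; ; retryCount++)
        {
            try
            {
                operation();
                onStatus("Connected");
                return true;
            }
            catch (Exception ex) when (!IsRecoverable(ex))
            {
                // Unrecoverable error: stop and notify the user.
                onStatus("Disabled: " + ex.GetType().Name);
                return false;
            }
            catch (Exception)
            {
                // Recoverable error: retry only if the wait still fits in the time budget.
                TimeSpan delay = nextDelay(retryCount);
                if (elapsed.Elapsed + delay >= timeout)
                {
                    onStatus("Disconnected: retry timeout expired");
                    return false;
                }
                onStatus("Retrying");
                Thread.Sleep(delay);
            }
        }
    }
}
```

A caller passes the operation to protect, a delay function (for example, one that implements exponential back-off), an overall timeout, and a callback. The method returns `true` once the operation succeeds and `false` when it gives up.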
The SDKs provide three retry policies:
- Exponential back-off with jitter: This default retry policy tends to be aggressive at the start and slows down over time until it reaches a maximum delay. The design is based on the Retry guidance from the Azure Architecture Center.
- Custom retry: For some SDK languages, you can design a custom retry policy that is better suited for your scenario and then inject it into the RetryPolicy. Custom retry isn't available on the C SDK.
- No retry: You can set retry policy to "no retry," which disables the retry logic. The SDK tries to connect once and send a message once, assuming the connection is established. This policy is typically used in scenarios with bandwidth or cost concerns. If you choose this option, messages that fail to send are lost and can't be recovered.
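To make the difference between the three policies concrete, here is a minimal sketch that does not use any SDK types. The interface `IRetrySketchPolicy` and the three classes are hypothetical names chosen for this example. The shape of `ShouldRetry` is an assumption made for illustration, so check the API documentation of your SDK for the actual interface you need to implement.

```csharp
using System;

// Hypothetical interface for illustration; each SDK defines its own retry policy type.
public interface IRetrySketchPolicy
{
    // Returns true and sets retryInterval if another attempt should be made.
    bool ShouldRetry(int currentRetryCount, Exception lastException, out TimeSpan retryInterval);
}

// "No retry": never asks for another attempt.
public sealed class NoRetrySketch : IRetrySketchPolicy
{
    public bool ShouldRetry(int currentRetryCount, Exception lastException, out TimeSpan retryInterval)
    {
        retryInterval = TimeSpan.Zero;
        return false;
    }
}

// Exponential back-off with jitter: short waits at first, growing up to maxDelay.
public sealed class BackoffWithJitterSketch : IRetrySketchPolicy
{
    private readonly int _maxRetries;
    private readonly TimeSpan _baseDelay;
    private readonly TimeSpan _maxDelay;
    private readonly Random _random;

    public BackoffWithJitterSketch(int maxRetries, TimeSpan baseDelay, TimeSpan maxDelay, Random random)
    {
        _maxRetries = maxRetries;
        _baseDelay = baseDelay;
        _maxDelay = maxDelay;
        _random = random;
    }

    public bool ShouldRetry(int currentRetryCount, Exception lastException, out TimeSpan retryInterval)
    {
        retryInterval = TimeSpan.Zero;
        if (currentRetryCount >= _maxRetries)
        {
            return false;
        }

        // Cap the exponent so large retry counts cannot produce an infinite value.
        double factor = Math.Pow(2.0, Math.Min(currentRetryCount, 30));
        // Scale by a random factor in [0.8, 1.2) so that devices do not retry in lockstep.
        double jitter = 0.8 + 0.4 * _random.NextDouble();
        double ms = Math.Min(_baseDelay.TotalMilliseconds * factor * jitter, _maxDelay.TotalMilliseconds);
        retryInterval = TimeSpan.FromMilliseconds(ms);
        return true;
    }
}

// Custom policy example: a fixed wait, applied only to timeouts.
public sealed class FixedIntervalOnTimeoutSketch : IRetrySketchPolicy
{
    private readonly TimeSpan _interval;

    public FixedIntervalOnTimeoutSketch(TimeSpan interval)
    {
        _interval = interval;
    }

    public bool ShouldRetry(int currentRetryCount, Exception lastException, out TimeSpan retryInterval)
    {
        bool retry = lastException is TimeoutException;
        retryInterval = retry ? _interval : TimeSpan.Zero;
        return retry;
    }
}
```

The jitter matters most when many devices lose connectivity at the same moment: without it, they would all retry on the same schedule and could overload the service again.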
Retry policy APIs
SDK | SetRetryPolicy method | Policy implementations | Implementation guidance
---|---|---|---
C/iOS | IOTHUB_CLIENT_RESULT IoTHubClient_SetRetryPolicy | Default: IOTHUB_CLIENT_RETRY_EXPONENTIAL_BACKOFF; Custom: use available retryPolicy; No retry: IOTHUB_CLIENT_RETRY_NONE | C/iOS implementation
Java | SetRetryPolicy | Default: ExponentialBackoffWithJitter class; Custom: implement RetryPolicy interface; No retry: NoRetry class | Java implementation
.NET | DeviceClient.SetRetryPolicy | Default: ExponentialBackoff class; Custom: implement IRetryPolicy interface; No retry: NoRetry class | C# implementation
Node | setRetryPolicy | Default: ExponentialBackoffWithJitter class; Custom: implement RetryPolicy interface; No retry: NoRetry class | Node implementation
Python | Not currently supported | Not currently supported | Not currently supported
The following code samples illustrate this flow:
.NET implementation guidance
The following code sample shows how to define and set the default retry policy:
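// Define and set the default retry policy on an existing DeviceClient (assumed here to be named deviceClient).
// Arguments, in order: retry count, minimum back-off, maximum back-off, delta back-off.
IRetryPolicy retryPolicy = new ExponentialBackoff(int.MaxValue, TimeSpan.FromMilliseconds(100), TimeSpan.FromSeconds(10), TimeSpan.FromMilliseconds(100));
deviceClient.SetRetryPolicy(retryPolicy);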
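To get a feel for what these four arguments mean, the following sketch computes the nominal wait before each retry. It assumes that the policy follows the formula min(minBackoff + (2^n - 1) * deltaBackoff, maxBackoff) for retry number n, with the delta randomized by roughly 20 percent in either direction in practice. That formula is an assumption based on common transient fault handling implementations, not something stated in this article, and `BackoffSchedule` and `NominalDelay` are hypothetical names.

```csharp
using System;

public static class BackoffSchedule
{
    // Nominal (jitter-free) delay before retry n, under the assumed formula.
    public static TimeSpan NominalDelay(int n, TimeSpan minBackoff, TimeSpan maxBackoff, TimeSpan deltaBackoff)
    {
        // Cap the exponent so large retry counts cannot produce an infinite value.
        double exponent = Math.Min(n, 30);
        double increment = (Math.Pow(2.0, exponent) - 1.0) * deltaBackoff.TotalMilliseconds;
        double ms = Math.Min(minBackoff.TotalMilliseconds + increment, maxBackoff.TotalMilliseconds);
        return TimeSpan.FromMilliseconds(ms);
    }

    public static void Main()
    {
        // Same values as the default policy shown above.
        var min = TimeSpan.FromMilliseconds(100);
        var max = TimeSpan.FromSeconds(10);
        var delta = TimeSpan.FromMilliseconds(100);

        for (int n = 0; n < 9; n++)
        {
            Console.WriteLine($"retry {n}: {NominalDelay(n, min, max, delta).TotalMilliseconds} ms");
        }
    }
}
```

Under these assumptions, the nominal delays are 100, 200, 400, 800, 1600, 3200, and 6400 ms, after which every retry waits for the 10-second maximum.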
To avoid high CPU usage, the retries are throttled if the code fails immediately, for example, when there's no network or route to the destination. The minimum time before the next retry is 1 second.
If the service responds with a throttling error, the retry policy is different and can't be changed via public API:
// throttled retry policy
IRetryPolicy retryPolicy = new ExponentialBackoff(RetryCount, TimeSpan.FromSeconds(10),
    TimeSpan.FromSeconds(60), TimeSpan.FromSeconds(5));
SetRetryPolicy(retryPolicy);
The retry mechanism stops after DefaultOperationTimeoutInMilliseconds, which is currently set at 4 minutes.
Other languages implementation guidance
For code samples in other languages, review the following implementation documents. The repository contains samples that demonstrate the use of the retry policy APIs.
- Python SDK (reliability not yet implemented)