Idempotency in Distributed APIs — Part 4: Retry Logic and Exponential Backoff

Idempotency on the server is only half the story. The other half is the client knowing when and how to retry safely. A client that retries too aggressively can turn a recovering service into a failed one. A client that never retries wastes the safety guarantees you just built.

This part covers the mechanics of retry logic: exponential backoff, jitter, retry budgets, and which errors are safe to retry at all.

Which Errors Are Safe to Retry

Not every error means “try again.” Retrying the wrong errors wastes resources or makes things worse.

| Condition | Retry? | Why |
|---|---|---|
| Network timeout | Yes | Request may not have reached the server |
| Connection refused | Yes | Server temporarily unavailable |
| 5xx Server Error | Yes (with limits) | Server-side transient failure |
| 429 Too Many Requests | Yes (respect Retry-After) | Rate limited; back off first |
| 408 Request Timeout | Yes | Server did not finish in time |
| 400 Bad Request | No | Client error; retrying won't help |
| 401 Unauthorized | No | Fix auth first |
| 404 Not Found | No | Resource does not exist |
| 422 Unprocessable Entity | No | Logic error, not transient |
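
In code, the table collapses to a small predicate. A minimal sketch using reqwest's StatusCode (the helper name is ours, not a library function):

use reqwest::StatusCode;

/// True if the status is worth retrying, per the table above.
/// A sketch -- tune the policy to your API's actual semantics.
fn is_retryable_status(status: StatusCode) -> bool {
    status == StatusCode::REQUEST_TIMEOUT       // 408
        || status == StatusCode::TOO_MANY_REQUESTS // 429
        || status.is_server_error()             // 5xx
}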

Exponential Backoff

The simplest retry strategy — retrying immediately after failure — is also the most dangerous at scale. If 10,000 clients all fail simultaneously and all retry at once, they generate a thundering herd that can push a recovering server back into failure.

Exponential backoff spaces retries out by doubling the wait time between each attempt. The delay after attempt n is roughly: base_delay * 2^n.

Retry timeline (base_delay = 1s, delays capped at 30s, jitter omitted for clarity; each attempt takes about 1s):

t=0s    Attempt 1 -> fails
t=1s    wait 1s
t=2s    Attempt 2 -> fails
t=3s    wait 2s
t=5s    Attempt 3 -> fails
t=6s    wait 4s
t=10s   Attempt 4 -> fails
t=11s   wait 8s
t=19s   Attempt 5 -> succeeds (roughly 20s total)

Adding Jitter

Backoff alone is not enough when many clients fail at the same moment. Without randomness, they all retry on the same schedule, in lockstep, and the thundering herd simply moves later in time. Jitter adds randomness to each delay so clients desynchronize.

AWS recommends “full jitter”: pick a random value between zero and the calculated backoff. This spreads retries uniformly across the window and significantly reduces load spikes on recovering services.

use std::time::Duration;
use rand::Rng;

pub struct RetryConfig {
    pub max_attempts: u32,
    pub base_delay_ms: u64,
    pub max_delay_ms: u64,
}

impl RetryConfig {
    pub fn delay_for_attempt(&self, attempt: u32) -> Duration {
        // Exponential backoff: base * 2^attempt. Saturating arithmetic
        // so large attempt counts cap out instead of overflowing.
        let exponential = self
            .base_delay_ms
            .saturating_mul(2u64.saturating_pow(attempt));
        // Cap at max delay
        let capped = exponential.min(self.max_delay_ms);
        // Full jitter: uniform random between 0 and the capped delay
        let jittered = rand::thread_rng().gen_range(0..=capped);
        Duration::from_millis(jittered)
    }
}

// Usage
let config = RetryConfig {
    max_attempts: 5,
    base_delay_ms: 200,
    max_delay_ms: 30_000, // 30 seconds cap
};
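
The same AWS analysis also describes a "decorrelated jitter" variant, which derives each delay from the previous one rather than from the attempt number: sleep = min(cap, random_between(base, prev_sleep * 3)). A sketch under those assumptions (the struct and names here are illustrative, not part of the client below):

use rand::Rng;
use std::time::Duration;

/// Decorrelated jitter: each delay is drawn from [base, prev * 3], capped.
/// Illustrative sketch; the state must persist across attempts.
pub struct DecorrelatedJitter {
    base_ms: u64,
    cap_ms: u64,
    prev_ms: u64,
}

impl DecorrelatedJitter {
    pub fn new(base_ms: u64, cap_ms: u64) -> Self {
        Self { base_ms, cap_ms, prev_ms: base_ms }
    }

    pub fn next_delay(&mut self) -> Duration {
        // Upper bound grows with the previous delay, not the attempt count.
        let upper = self.prev_ms.saturating_mul(3).max(self.base_ms + 1);
        let next = rand::thread_rng()
            .gen_range(self.base_ms..upper)
            .min(self.cap_ms);
        self.prev_ms = next;
        Duration::from_millis(next)
    }
}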

A Full Retry Client in Rust

Here is a complete retry wrapper using reqwest that handles idempotency key persistence across attempts and applies full jitter backoff.

use reqwest::{Client, Response, StatusCode};
use serde::Serialize;
use std::time::Duration;
use tokio::time::sleep;
use uuid::Uuid;

pub struct IdempotentClient {
    inner: Client,
    config: RetryConfig,
}

impl IdempotentClient {
    pub fn new(config: RetryConfig) -> Self {
        Self {
            inner: Client::new(),
            config,
        }
    }

    pub async fn post_with_retry<T: Serialize>(
        &self,
        url: &str,
        body: &T,
    ) -> anyhow::Result<Response> {
        // Generate key once -- reused across all retries
        let idempotency_key = Uuid::new_v4().to_string();
        let body_bytes = serde_json::to_vec(body)?;

        let mut last_error = None;

        for attempt in 0..self.config.max_attempts {
            let result = self
                .inner
                .post(url)
                .header("Idempotency-Key", &idempotency_key)
                .header("Content-Type", "application/json")
                .body(body_bytes.clone())
                .timeout(Duration::from_secs(30))
                .send()
                .await;

            match result {
                Ok(resp) => {
                    let status = resp.status();

                    // Success
                    if status.is_success() {
                        return Ok(resp);
                    }

                    // Do not retry client errors (4xx), except 408 and 429
                    if status.is_client_error()
                        && status != StatusCode::REQUEST_TIMEOUT
                        && status != StatusCode::TOO_MANY_REQUESTS
                    {
                        return Ok(resp); // Return to caller to handle
                    }

                    // For 429, honor Retry-After when it is a delta-seconds
                    // value. (Retry-After may also be an HTTP-date, which
                    // this client does not parse; those fall through to the
                    // normal backoff below.)
                    if status == StatusCode::TOO_MANY_REQUESTS {
                        if let Some(secs) = resp
                            .headers()
                            .get("Retry-After")
                            .and_then(|v| v.to_str().ok())
                            .and_then(|s| s.parse::<u64>().ok())
                        {
                            sleep(Duration::from_secs(secs)).await;
                            continue;
                        }
                    }

                    last_error = Some(anyhow::anyhow!("Retryable error status: {}", status));
                }
                Err(e) if e.is_timeout() || e.is_connect() => {
                    last_error = Some(anyhow::anyhow!("Network error: {}", e));
                }
                Err(e) => {
                    return Err(anyhow::anyhow!("Non-retryable error: {}", e));
                }
            }

            // Not the last attempt -- wait before retrying
            if attempt + 1 < self.config.max_attempts {
                let delay = self.config.delay_for_attempt(attempt);
                tracing::warn!(
                    attempt = attempt + 1,
                    delay_ms = delay.as_millis(),
                    key = %idempotency_key,
                    "Retrying request"
                );
                sleep(delay).await;
            }
        }

        Err(last_error.unwrap_or_else(|| anyhow::anyhow!("Max retries exceeded")))
    }
}
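
Wiring it together might look like this; the endpoint URL and payload struct are placeholders for illustration:

use serde::Serialize;

#[derive(Serialize)]
struct CreatePayment {
    amount_cents: u64,
    currency: String,
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = IdempotentClient::new(RetryConfig {
        max_attempts: 5,
        base_delay_ms: 200,
        max_delay_ms: 30_000,
    });

    let payment = CreatePayment { amount_cents: 1_000, currency: "USD".into() };
    // One logical operation, one idempotency key, up to five attempts.
    let resp = client
        .post_with_retry("https://api.example.com/v1/payments", &payment)
        .await?;
    println!("status: {}", resp.status());
    Ok(())
}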

Retry Budgets

Exponential backoff caps how long any single retry sequence runs. But in a microservice architecture with many instances, each making its own retries, the aggregate retry load can still overwhelm a downstream service. A retry budget limits the fraction of total requests that can be retries at any given time.

The idea: track how many of your last N requests were retries. If retries exceed a threshold (say 10%), stop retrying and fail fast. This prevents a cascade where one struggling service causes all its callers to saturate it further with retry traffic.

use std::sync::atomic::{AtomicU32, Ordering};

pub struct RetryBudget {
    // Cumulative counters -- a simplification. Production budgets (e.g.
    // Finagle's RetryBudget or linkerd's) use a sliding window or token
    // bucket so old traffic ages out. Share this struct via Arc.
    total_requests: AtomicU32,
    retry_requests: AtomicU32,
    max_retry_fraction: f64, // e.g. 0.10 for 10%
}

impl RetryBudget {
    pub fn can_retry(&self) -> bool {
        let total = self.total_requests.load(Ordering::Relaxed) as f64;
        let retries = self.retry_requests.load(Ordering::Relaxed) as f64;
        if total == 0.0 {
            return true;
        }
        (retries / total) < self.max_retry_fraction
    }

    pub fn record_attempt(&self, is_retry: bool) {
        self.total_requests.fetch_add(1, Ordering::Relaxed);
        if is_retry {
            self.retry_requests.fetch_add(1, Ordering::Relaxed);
        }
    }
}
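
Hooked into the retry loop from earlier, the budget gates each retry before the backoff sleep. A sketch, assuming a hypothetical retry_budget field added to IdempotentClient:

for attempt in 0..self.config.max_attempts {
    // Before a retry (i.e. any attempt after the first), consult the budget.
    if attempt > 0 && !self.retry_budget.can_retry() {
        // Fail fast rather than pile onto a struggling downstream service.
        return Err(anyhow::anyhow!("Retry budget exhausted"));
    }
    self.retry_budget.record_attempt(attempt > 0);
    // ... send the request and handle the response as before ...
}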

Non-Idempotent Operations: Do Not Retry

There is one hard rule: never retry a request without an idempotency key if the operation has side effects. Without a key, the server cannot deduplicate — and retrying is identical to sending two separate requests. The client code we built above always attaches an idempotency key for POST requests, which is exactly right. Remove the key and the retry logic becomes dangerous.
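
The failure mode is subtle enough to deserve an illustration: generating the key inside the loop looks almost identical but silently breaks deduplication.

// WRONG: a fresh key per attempt makes every retry look like a new
// operation to the server -- deduplication never kicks in.
for attempt in 0..max_attempts {
    let key = Uuid::new_v4().to_string(); // regenerated each attempt!
    // ... send request with `key` ...
}

// RIGHT: generate once, reuse across all attempts (as post_with_retry does).
let key = Uuid::new_v4().to_string();
for attempt in 0..max_attempts {
    // ... send request with `key` ...
}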

Summary

Retries are necessary for reliability. Naive retries cause thundering herds and duplicate side effects. Exponential backoff with full jitter spreads retry load. Retry budgets prevent cascading amplification. And none of this is safe without idempotency keys on the server side. In Part 5, we move to a different delivery mechanism entirely — message queues — where the retry semantics are controlled by the broker rather than the client.
