Retry for, not retry count
reliability
A common pattern for performing actions that can fail is to retry them. The classic example of this is a network request. It could fail due to a transient error which does not indicate a failure of the application. By retrying the request, you make your app resilient to these sorts of failures. Sounds great! But since the error is transient, we don’t want to just retry once; we want to retry a few times, perhaps with some backoff so we don’t spam the resource we’re hitting. This might look like:
require "httparty"

def get(url, max_retries: 3)
  retries_left = max_retries
  loop do
    begin
      response = HTTParty.get(url)
      if response.success?
        return response
      end
    rescue
      # Treat the error as transient and fall through to the retry logic.
    end
    retries_left -= 1
    if retries_left == 0
      raise "Failed to get #{url} after #{max_retries} retries"
    end
    # Exponential backoff: 2s after the first failure, 4s after the second, ...
    sleep(2 ** (max_retries - retries_left))
  end
end
We’ll retry the request up to three times by default with exponential backoff after each unsuccessful request.
How long will this call block for? It’s not obvious if you just see something like this in the codebase:
get("https://example.com", max_retries: 3)
This will block for at least 6 seconds before failing. That’s 6 seconds total of backoff (it increases exponentially) plus however long it takes for each request to fail. That’s a very long time, especially if someone is waiting for a response.
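A quick back-of-the-envelope check, using the sleeps from the snippet above:

  # With max_retries: 3, the backoffs before the final attempt are 2s and 4s.
  (1..2).sum { |n| 2 ** n } # => 6 seconds of sleeping, plus however long the requests themselves take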
Instead, we should make it obvious how long the call will block for, and leave how many retries fit within that window as a detail of the algorithm. For example:
def get(url, retry_for: 6)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  attempt = 0
  loop do
    begin
      response = HTTParty.get(url)
      if response.success?
        return response
      end
    rescue
      # Treat the error as transient and fall through to the retry logic.
    end
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    if elapsed > retry_for
      raise "Failed to get #{url} after #{retry_for} seconds"
    end
    # Exponential backoff, but the deadline above caps the total wait.
    attempt += 1
    sleep(2 ** attempt)
  end
end
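For comparison, the call site now spells out the time budget rather than a count:

  get("https://example.com", retry_for: 6)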
Now we know exactly how long this call will block for, and it’ll be retried some number of times within that period. I think this is a much better interface because it makes the consequences of retrying much more apparent. It does mean that a very slow request might not get retried at all. However, I think that’s usually a good tradeoff. Is this individual request so important that it should make someone wait around for ages to complete? Unlikely.
If you’re keen on making sure a certain number of retries are performed, we could extend the interface with some retry strategies. For example, “retry 3 times over 6 seconds”. We could calculate what the backoff should be to fit those retries into the window. It wouldn’t be perfect because we don’t know exactly how long each request will take, but we could make a good approximation.
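As a rough sketch (the helper name here is hypothetical, not part of the code above), we could spread exponentially growing backoffs so they sum to the time budget:

  # Hypothetical helper: spread exponential backoffs across a fixed time budget.
  def backoff_schedule(retries: 3, over: 6.0)
    weights = (1..retries).map { |n| 2 ** n }   # 2, 4, 8, ...
    scale = over / weights.sum                  # squeeze them into the budget
    weights.map { |w| w * scale }               # => [~0.86, ~1.71, ~3.43] for 3 retries over 6s
  end

It still wouldn’t be exact, since the requests themselves take time, but it keeps the total wait close to the stated budget.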
I think this comes back to a reliability principle I think about a lot: the least shit for the greatest number. Using a retry count focuses on making this particular task succeed, but we should be zooming out further than that. Retry for helps us focus on the impact that retrying could have on other things.