Retrying a network request with a delay in Combine

Published on: May 25, 2020

Combine comes with a handy retry operator that allows developers to retry an operation that failed. This is most typically used to retry a failed network request. As soon as the network request fails, the retry operator will resubscribe to the DataTaskPublisher, kicking off a new request hoping that the request will succeed this time. When you use retry, you can specify the number of times you want to retry the operation to avoid endlessly retrying a network request that will never succeed.

While this is great in some scenarios, there are also cases where this behavior is not what you want.

For example, if your network request failed due to being rate limited or the server being too busy, you should probably wait a little while before retrying your network call since retrying immediately is unlikely to succeed anyway.

In this week's post you will explore some options you have to implement this behavior using nothing but operators and publishers that are available in Combine out of the box.

Implementing a simple retry

Before I show you the simplest retry-with-a-delay mechanism I could come up with, I want to show you what an immediate retry looks like since I'll be using that as the starting point for this post:

var cancellables = Set<AnyCancellable>()

let url = URL(string: "https://practicalcombine.com")!
let dataTaskPublisher = URLSession.shared.dataTaskPublisher(for: url)

dataTaskPublisher
  .retry(3)
  .sink(receiveCompletion: { completion in
    // handle errors and completion
  }, receiveValue: { response in
    // handle response
  })
  .store(in: &cancellables)

This code will fire a network request, and if the request fails it will be retried three times. That means that at most we'd make this request 4 times in total (once for the initial request and then three more times for the retries).

Note that a 404, 501 or any other error status code does not count as a failed request in Combine. The request made it to the server and the server responded. A failed request typically means that the request wasn't executed because the device making the request is offline, the server failed to respond in a timely manner, or any other reason where we never received a response from the server.

For all of these cases it probably makes sense to retry the request immediately. But how should an HTTP status code of 429 (Too Many Requests) or 503 (Service Unavailable) be handled? These will be seen as successful outcomes by Combine, so we'll need to inspect the server's response, raise an error, and retry the request with a couple of seconds of delay since we don't want to make the server even busier than it already is (or keep hitting our rate limit).

The first step doesn't really have anything to do with our simple retry yet but it's an important prerequisite. We'll need to extract the HTTP status code from the response we received and see if we should retry the request. For simplicity I will only check for 429 and 503 codes. In your code you'll probably want to check which status codes can be returned by your API and adapt accordingly.

enum DataTaskError: Error {
  case invalidResponse, rateLimitted, serverBusy
}

let dataTaskPublisher = URLSession.shared.dataTaskPublisher(for: url)
  .tryMap({ dataTaskOutput -> (data: Data, response: URLResponse) in
    // just so we can debug later in the post
    print("Received a response, checking status code")

    guard let httpResponse = dataTaskOutput.response as? HTTPURLResponse else {
      throw DataTaskError.invalidResponse
    }

    if httpResponse.statusCode == 429 {
      throw DataTaskError.rateLimitted
    }

    if httpResponse.statusCode == 503 {
      throw DataTaskError.serverBusy
    }

    return dataTaskOutput
  })

By applying a tryMap to the dataTaskPublisher we can get the response from the data task and check its HTTP status code. Depending on the status code I throw different errors. If the status code is not 429 or 503, it's up to the subscriber of dataTaskPublisher to handle any errors. Since this tryMap is fairly lengthy, I will omit the definitions of dataTaskPublisher and the DataTaskError enum in the rest of this post and instead just refer to dataTaskPublisher.

Now that we have a publisher that fails when we want it to fail we can implement a delayed retry mechanism.

Implementing a delayed retry

Since retry doesn't allow us to specify a delay we'll need to come up with a clever solution. Luckily, I am not the first person to try to come up with something, because Joseph Heck and Matt Neuburg have both shared their approaches on Stack Overflow.

So why am I writing this if there's already something on Stackoverflow?

Well, neither of the solutions there is the solution. At least not in Xcode 11.5. Maybe they worked in an older Xcode version but I didn't check.

The general idea of their suggestions still stands though. Use a catch to capture any errors, and return the initial publisher with a delay applied from the catch. Then place a retry after the catch operator. That code would look a bit like this:

// This code is not the final solution
dataTaskPublisher
  .tryCatch({ error -> AnyPublisher<(data: Data, response: URLResponse), Error> in
    print("In the tryCatch")

    switch error {
    case DataTaskError.rateLimitted, DataTaskError.serverBusy:
      return dataTaskPublisher
        .delay(for: 3, scheduler: DispatchQueue.global())
        .eraseToAnyPublisher()
    default:
      throw error
    }
  })
  .retry(2)
  .sink(receiveCompletion: { completion in
    print(completion)
  }, receiveValue: { value in
    print(value)
  })
  .store(in: &cancellables)

The solution above uses the dataTaskPublisher from the earlier code snippet. I use a tryCatch to inspect any errors coming from the data task publisher. If the error matches one of the errors where I want to perform a delayed retry, I return the dataTaskPublisher with a delay applied to it. This will delay the delivery of values from the data task publisher that I return from tryCatch. I also erase the resulting publisher to AnyPublisher because it looks nicer.

Note that any errors emitted by dataTaskPublisher are now replaced by a new publisher that's based on dataTaskPublisher. These publishers are not the same publisher. The new publisher will begin running immediately and emit its output with a delay of 3 seconds.

This means that the publisher that has the delay applied will delay the delivery of both its success and failure values by three seconds.

When this second publisher emits an error, the retry will re-subscribe to the initial data task publisher immediately, kicking off a new network request. And this dance continues until retry has been triggered twice. With the code as-is, the output looks a bit like this:

Received a response, checking status code # 0 seconds in from the initial dataTaskPublisher
In the tryCatch # 0 seconds in from the initial dataTaskPublisher
Received a response, checking status code # 0 seconds in from the publisher returned by tryCatch
Received a response, checking status code # 3 seconds in from the initial dataTaskPublisher
In the tryCatch # 3 seconds in from the publisher returned by tryCatch
Received a response, checking status code # 3 seconds in from the publisher returned by tryCatch
Received a response, checking status code # 6 seconds in from the initial dataTaskPublisher
In the tryCatch # 6 seconds in from the publisher returned by tryCatch
Received a response, checking status code # 6 seconds in from the publisher returned by tryCatch

That's not exactly what we expected, right? This shows a total of 6 responses being received; double what we wanted. And more importantly, we make requests in pairs. The publisher that's created in the tryCatch executes immediately but doesn't emit its values until 3 seconds later, which is why it takes three seconds for the initial dataTaskPublisher to fire again.

Let's see how we can fix this. First, I'll show you an interesting yet incorrect approach at implementing this.

An incorrect approach to a delayed retry

We can bring the number of requests down to something more sensible by applying the share() operator to the initial publisher. This ensures that the first data task is only executed once:

dataTaskPublisher.share()
  .tryCatch({ error -> AnyPublisher<(data: Data, response: URLResponse), Error> in
    print("In the tryCatch")
    switch error {
    case DataTaskError.rateLimitted, DataTaskError.serverBusy:
      return dataTaskPublisher
        .delay(for: 3, scheduler: DispatchQueue.global())
        .eraseToAnyPublisher()
    default:
      throw error
    }
  })
  .retry(2)
  .sink(receiveCompletion: { completion in
    print(completion)
  }, receiveValue: { value in
    print(value)
  })
  .store(in: &cancellables)

By applying share() to the dataTaskPublisher a new publisher is created that will execute when it receives its initial subscriber and replays its results for any subsequent subscribers. In our case, this results in the following output:

Received a response, checking status code # 0 seconds in from the initial dataTaskPublisher
In the tryCatch # 0 seconds in from the initial dataTaskPublisher
Received a response, checking status code # 0 seconds in from the publisher returned by tryCatch
In the tryCatch # 3 seconds in from the publisher returned by tryCatch
Received a response, checking status code # 3 seconds in from the publisher returned by tryCatch
In the tryCatch # 6 seconds in from the publisher returned by tryCatch
Received a response, checking status code # 6 seconds in from the publisher returned by tryCatch

We're closer to the desired outcome but not quite there yet. Now that we use a shared publisher as the initial publisher, it will no longer execute its data task and the tryMap that we defined on the dataTaskPublisher earlier is no longer called. The result of the tryMap is cached in the share() and this cached result is immediately emitted when retry resubscribes. This means that share() will re-emit whatever error we received the first time it made its request.

This behavior will make it look like we're correctly retrying our request but there's actually a problem. Or rather, there are a couple of problems with this approach.

The retry operator in Combine will catch any errors that occur upstream and resubscribe to the pipeline so far. This means that any errors that occur above the retry will make us resubscribe to dataTaskPublisher.share(). In other words, the tryCatch that we have after dataTaskPublisher.share() will always receive the same error. So if the initial request failed because we were rate limited and our retried request fails because we couldn't make a request at all, the tryCatch will still think we ran into a rate limit error and retry the request, even though the logic in the tryCatch says we want to throw an error if we encountered something other than DataTaskError.rateLimitted or DataTaskError.serverBusy.

And on top of that, when we encounter something other than DataTaskError.rateLimitted or DataTaskError.serverBusy we still hit our retry with an error. This means that we'll resubscribe to dataTaskPublisher.share(), hit the tryCatch, throw an error, and retry again until we've retried the specified number of times (2 in this example).

We should fix this so that:

  1. We always receive the current / latest error in the tryCatch.
  2. We don't retry when we caught a non-retryable error.

This means that we should get rid of the share() and actually run the network request when the retry resubscribes to dataTaskPublisher while making sure we don't get the extra requests that we wanted to get rid of in the previous section.
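The second requirement can be made explicit by expressing retryability as a property of the error type itself. The isRetryable property below is a hypothetical convenience, not part of the post's pipeline, that keeps the list of retryable cases in one place instead of repeating it in every catch:

```swift
// The error type from earlier in the post, extended with a hypothetical
// convenience property that answers "should this error trigger a retry?"
enum DataTaskError: Error {
    case invalidResponse, rateLimitted, serverBusy
}

extension DataTaskError {
    var isRetryable: Bool {
        switch self {
        case .rateLimitted, .serverBusy:
            return true
        case .invalidResponse:
            return false
        }
    }
}
```

Because the switch is exhaustive, adding a new case to the enum forces a decision about whether it should be retried.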

A correct way to retry a network request with a delay

The first thing we should do in order to fix our retry mechanism is redefine how the dataTaskPublisher property is created. The changes we need to make are fairly small but they have a large impact on our final result. As I mentioned in the previous section, retry will resubscribe to the upstream publisher whenever it encounters an error. This means that a failing network call would trigger our retry even though we only want to retry when we encounter an error that we consider worth retrying the call for. In this post I assume that we should retry for the "rate limited" and "server busy" status codes. Any other failure should not be retried.

To achieve this, we need to make the retry operator think that our network call always succeeds unless we encounter one of our retryable errors. We can do this by converting the network call's output to a Result that has the data task publisher's output as its Output and Error as its Failure. If the network call comes back with a retryable error, we'll throw an error from tryMap to trigger the retry. Otherwise, we'll return a Swift Result that can hold either an error or our output. This will make it look like everything went well so the retry doesn't trigger, but we'll still be able to extract errors later if needed.

Let's take a look at what this means for how the dataTaskPublisher is defined:

let dataTaskPublisher = URLSession.shared.dataTaskPublisher(for: url)
  .tryMap({ dataTaskOutput -> Result<URLSession.DataTaskPublisher.Output, Error> in
    print("Received a response, checking status code")

    guard let response = dataTaskOutput.response as? HTTPURLResponse else {
      return .failure(DataTaskError.invalidResponse)
    }

    if response.statusCode == 429 {
      throw DataTaskError.rateLimitted
    }

    if response.statusCode == 503 {
      throw DataTaskError.serverBusy
    }

    return .success(dataTaskOutput)
  })

If we were to erase this pipeline to AnyPublisher, we'd have the following type for our publisher: AnyPublisher<Result<URLSession.DataTaskPublisher.Output, Error>, Error>. The Error in the Result is what we'll use to send non-retryable errors down the pipeline. The publisher's Error is what we'll use for retryable errors.
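The two error channels can be sketched in plain Swift with a hypothetical classify function (not from the post): a thrown error becomes a failure event that triggers the retry, while a .failure Result travels down the pipeline as ordinary Output:

```swift
enum DataTaskError: Error {
    case invalidResponse, rateLimitted, serverBusy
}

// Hypothetical stand-in for the tryMap's decision logic. A nil status
// code models a response that wasn't an HTTPURLResponse.
func classify(statusCode: Int?, body: String) throws -> Result<String, Error> {
    guard let code = statusCode else {
        // Non-retryable: wrapped in the Result, emitted as a value
        return .failure(DataTaskError.invalidResponse)
    }

    switch code {
    case 429:
        throw DataTaskError.rateLimitted // Retryable: the publisher fails
    case 503:
        throw DataTaskError.serverBusy   // Retryable: the publisher fails
    default:
        // Any real HTTP response, error status codes included, is output
        return .success(body)
    }
}
```

Note that a 404 ends up in the success channel here, matching the earlier observation that any received HTTP response counts as output rather than a failure.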

For example, I don't want to retry my network request when I receive an invalid response, so I map the data task output to .failure(DataTaskError.invalidResponse). This means the request won't be retried, but we can still extract and use the invalid response error after the retry.

When the request succeeded and we're happy I return .success(dataTaskOutput) so I can extract and use the data task output later.

If a retryable error occurred I throw an error so we can catch that error later to set up our delayed retry in a similar fashion to what you've seen in the previous section:

dataTaskPublisher
  .catch({ (error: Error) -> AnyPublisher<Result<URLSession.DataTaskPublisher.Output, Error>, Error> in
    print("In the catch")
    switch error {
    case DataTaskError.rateLimitted,
         DataTaskError.serverBusy:
      print("Received a retryable error")
      return Fail(error: error)
        .delay(for: 3, scheduler: DispatchQueue.main)
        .eraseToAnyPublisher()
    default:
      print("Received a non-retryable error")
      return Just(.failure(error))
        .setFailureType(to: Error.self)
        .eraseToAnyPublisher()
    }
  })
  .retry(2)

Instead of a tryCatch I use catch in this example. We want to catch any errors that originated from making the network request (for example if the request couldn't be made) or the tryMap (if we encountered a retryable error).

In the catch I check whether we encountered one of the retryable errors. If we did, I create a publisher that will immediately fail with the received error. I delay the delivery of this error by three seconds and erase to AnyPublisher so I can have a consistent return type for my catch. This code path will trigger the retry after three seconds and make it so we resubscribe to dataTaskPublisher and execute the network call again, because we no longer use share().

If we encounter a non-retryable error, I return a Just publisher that will immediately emit a single value. Similar to the tryMap, I wrap this error in a Swift Result to make the retry think everything is fine because we don't emit an error from the Just publisher.

At this point, our pipeline will only emit an error if we encounter a retryable error. Any other errors are wrapped in a Result and sent down the pipeline as Output.

We'll want to transform our Result back into a regular error event after the retry, so we'll receive errors in our sink's receiveCompletion and the receiveValue only receives successful output.

Here's how we can achieve this:

dataTaskPublisher
  .catch({ (error: Error) -> AnyPublisher<Result<URLSession.DataTaskPublisher.Output, Error>, Error> in
    print("In the catch")
    switch error {
    case DataTaskError.rateLimitted,
         DataTaskError.serverBusy:
      print("Received a retryable error")
      return Fail(error: error)
        .delay(for: 3, scheduler: DispatchQueue.main)
        .eraseToAnyPublisher()
    default:
      print("Received a non-retryable error")
      return Just(.failure(error))
        .setFailureType(to: Error.self)
        .eraseToAnyPublisher()
    }
  })
  .retry(2)
  .tryMap({ result in
    // Result -> Result.Success or emit Result.Failure
    return try result.get()
  })
  .sink(receiveCompletion: { completion in
    print(completion)
  }, receiveValue: { value in
    print(value)
  })
  .store(in: &cancellables)

By placing a tryMap after the retry we can grab our Result<URLSession.DataTaskPublisher.Output, Error> value and call try result.get() to either return the success case of our result, or throw the error in our failure case.

By doing this, we'll receive errors in receiveCompletion and receiveValue only receives successful values. This means we won't have to deal with the Result in our receiveValue.
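In isolation, Result.get() behaves like this (a minimal sketch with a stand-in error):

```swift
enum DataTaskError: Error {
    case invalidResponse
}

let success: Result<String, Error> = .success("response data")
let failure: Result<String, Error> = .failure(DataTaskError.invalidResponse)

// .get() returns the wrapped value for a success...
let unwrapped = try? success.get()
// ...and throws the wrapped error for a failure, which is what turns
// the Result back into a failure event inside the tryMap
let rethrown = try? failure.get()
```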

The output for this example would look like this:

Received a response, checking status code # 0 seconds after the initial data task
In the catch # 0 seconds after the initial data task
Received a retryable error # 0 seconds after the initial data task
Received a response, checking status code # 3 seconds after the initial data task
In the catch # 3 seconds after the initial data task
Received a retryable error # 3 seconds after the initial data task
Received a response, checking status code # 6 seconds after the initial data task
In the catch # 6 seconds after the initial data task
Received a retryable error # 6 seconds after the initial data task
failure(__lldb_expr_5.DataTaskError.rateLimitted) # 9 seconds after the initial data task

By delaying the delivery of certain errors, we can manipulate the start of the retried request. One downside is that if every request fails, we'll also delay the delivery of the final failure by the specified interval.

One thing I really like about this approach is that you can use different intervals for different errors, and you can even have the server tell you how long you should wait before retrying a request if the server includes this information in an HTTP header or as part of the response body. The delay could be configured in the tryMap where you have access to the HTTP response and you could associate the delay with your custom error case as an associated value.
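As a sketch of that idea, a small helper could read the server-provided delay from a Retry-After header before associating it with the error. The function name and the three-second fallback are assumptions for illustration, not part of the post's code:

```swift
import Foundation

// Hypothetical helper: read a Retry-After value (in seconds) from the
// response headers, falling back to a default delay. HTTP header names
// are case-insensitive, so keys are normalized before the lookup.
// (Only the delta-seconds form of Retry-After is handled here; the
// header can also carry an HTTP-date.)
func retryDelay(fromHeaders headers: [String: String],
                default defaultDelay: TimeInterval = 3) -> TimeInterval {
    let normalized = Dictionary(
        headers.map { ($0.key.lowercased(), $0.value) },
        uniquingKeysWith: { first, _ in first }
    )
    return normalized["retry-after"].flatMap { TimeInterval($0) } ?? defaultDelay
}
```

The resulting interval could then travel as an associated value, for example case rateLimitted(retryAfter: TimeInterval), so the catch can pass it to delay(for:scheduler:).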

In summary

What started as a simple question, "How do I implement a delayed retry in Combine?", turned out to be quite an adventure for me. Every time I thought I had found a solution, like the ones I linked to from Stack Overflow, there was always something about it I didn't like. As it turns out, there is no quick and easy way in Combine to implement a delayed retry that only applies to specific errors. I even had to update this post months after writing it because Alex Grebenyuk pointed out some interesting issues with the initial solution proposed in this post.

In this post you saw the various solutions I've tried, and why they were not to my liking. Delaying the output of the retried publisher, as the intermediate attempts did, ultimately delays delivery of all results, including successes. The tailor-made solution from the last section instead delays the delivery of specific errors rather than attempting to delay the start of the next request, and in my opinion that is much nicer than delaying everything.

I have made the code used in this post available as a GitHub gist here. You can paste it in a Playground and it should work immediately. The code is slightly modified to prove that network calls get re-executed, and I have replaced the network call with a Future so you have full control over the fake network call. To learn more about Combine's Future, you might want to read this post.

If you have your own solution for this problem and think it's more elegant, shorter, or better, then please do reach out to me on Twitter so I can update this post. I secretly hope that this post is obsolete by the time WWDC 2020 comes along but who knows. For now I think this is the best we have.

Categories

Combine