Retrying a network request with a delay in Combine

Published by donnywals

Combine comes with a handy retry operator that allows developers to retry an operation that failed. This is most typically used to retry a failed network request. As soon as the network request fails, the retry operator will resubscribe to the DataTaskPublisher, kicking off a new request hoping that the request will succeed this time. When you use retry, you can specify the number of times you want to retry the operation to avoid endlessly retrying a network request that will never succeed.

While this is great in some scenarios, there are also cases where this behavior is not what you want.

For example, if your network request failed due to being rate limited or the server being too busy, you should probably wait a little while before retrying your network call since retrying immediately is unlikely to succeed anyway.

In this week's post you will explore some options you have to implement this behavior using nothing but operators and publishers that are available in Combine out of the box.

Implementing a simple retry

Before I show you the simplest retry mechanism I could come up with, I want to show you what an immediate retry looks like since I'll be using that as the starting point for this post:

var cancellables = Set<AnyCancellable>()

let url = URL(string: "https://practicalcombine.com")!
let dataTaskPublisher = URLSession.shared.dataTaskPublisher(for: url)

dataTaskPublisher
  .retry(3)
  .sink(receiveCompletion: { completion in
    // handle errors and completion
  }, receiveValue: { response in
    // handle response
  })
  .store(in: &cancellables)

This code will fire a network request, and if the request fails it will be retried three times. That means that at most we'd make this request 4 times in total (once for the initial request and then three more times for the retries).

Note that a 404, 501 or any other error status code does not count as a failed request in Combine. The request made it to the server and the server responded. A failed request typically means that the request wasn't executed because the device making the request is offline, the server failed to respond in a timely manner, or any other reason where we never received a response from the server.
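To make this concrete, here is a minimal sketch showing that a data task publisher's Failure type is URLError, so only transport-level problems ever reach the failure path; error status codes arrive as regular values. The URL is just a placeholder:

```swift
import Combine
import Foundation

var cancellables = Set<AnyCancellable>()
let url = URL(string: "https://practicalcombine.com")!

// DataTaskPublisher's Failure is URLError, so only transport-level
// problems (offline, timeout, DNS failure) end up in the failure case.
URLSession.shared.dataTaskPublisher(for: url)
  .sink(receiveCompletion: { completion in
    if case .failure(let urlError) = completion {
      // Not reached for a 404 or 503; those arrive as values below.
      print("Transport error: \(urlError.code)")
    }
  }, receiveValue: { _, response in
    // A 4xx/5xx response still lands here as a regular value.
    let statusCode = (response as? HTTPURLResponse)?.statusCode
    print("Received a response, status: \(String(describing: statusCode))")
  })
  .store(in: &cancellables)
```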

For all of these cases it probably makes sense to retry the request immediately. But how should an HTTP status code of 429 (Too Many Requests) or 503 (Service Unavailable) be handled? These will be seen as successful outcomes by Combine, so we'll need to inspect the server's response, raise an error, and retry the request with a couple of seconds of delay to avoid hammering the server.

The first step doesn't really have anything to do with our simple retry yet but it's an important prerequisite. We'll need to extract the HTTP status code and see if we should retry the request. For simplicity I will only check for 429 and 503 codes. In your code you'll probably want to check which status codes can be returned by your API and adapt accordingly.

enum DataTaskError: Error {
  case invalidResponse, rateLimitted, serverBusy
}

let dataTaskPublisher = URLSession.shared.dataTaskPublisher(for: url)
  .tryMap({ response -> (data: Data, response: URLResponse) in
    // just so we can debug later in the post
    print("Received a response, checking status code") 

    guard let httpResponse = response.response as? HTTPURLResponse else {
      throw DataTaskError.invalidResponse
    }

    if httpResponse.statusCode == 429 {
      throw DataTaskError.rateLimitted
    }

    if httpResponse.statusCode == 503 {
      throw DataTaskError.serverBusy
    }

    return response
  })

By applying a tryMap to the dataTaskPublisher we can get the response from the data task and check its HTTP status code. Depending on the status code I throw different errors. If the status code is not 429 or 503, it's up to the subscriber of dataTaskPublisher to handle any errors. Since this tryMap is fairly lengthy, I will omit the definitions of dataTaskPublisher and the DataTaskError enum in the rest of this post and instead just refer to dataTaskPublisher.

Now that we have a publisher that fails when we want it to fail we can implement a delayed retry mechanism.

Implementing a delayed retry

Since retry doesn't allow us to specify a delay we'll need to come up with a clever solution. Luckily, I am not the first person to try to come up with something; Joseph Heck and Matt Neuburg have both shared their approaches on Stack Overflow.

So why am I writing this if there's already something on Stackoverflow?

Well, neither of the solutions there is the solution. At least not in Xcode 11.5. Maybe they worked in an older Xcode version but I didn't check.

The general idea of their suggestions still stands though. Use a catch to capture any errors, and return the initial publisher with a delay from the catch. Then place a retry after the catch operator. That code would look a bit like this:

// This code is not the final solution
dataTaskPublisher
  .tryCatch({ error -> AnyPublisher<(data: Data, response: URLResponse), Error> in
    print("In the tryCatch")

    switch error {
    case DataTaskError.rateLimitted, DataTaskError.serverBusy:
      return dataTaskPublisher
        .delay(for: 3, scheduler: DispatchQueue.global())
        .eraseToAnyPublisher()
    default:
      throw error
    }
  })
  .retry(2)
  .sink(receiveCompletion: { completion in
    print(completion)
  }, receiveValue: { value in
    print(value)
  })
  .store(in: &cancellables)

The solution above uses the dataTaskPublisher from the earlier code snippet. I use a tryCatch to inspect any errors coming from the data task publisher. If the error matches one of the errors where I want to perform a delayed retry, I return the dataTaskPublisher with a delay applied to it. This will delay the delivery of values from the data task publisher that I return from tryCatch. I also erase the resulting publisher to AnyPublisher because it looks nicer.

Note that any errors emitted by dataTaskPublisher are now replaced by a new publisher that's based on dataTaskPublisher. These publishers are not the same publisher. The new publisher will begin running immediately and emit its output with a delay of 3 seconds.

This means that the publisher that has the delay applied will delay the delivery of both its success and failure values by three seconds.

When this second publisher emits an error, the retry will re-subscribe to the initial data task publisher immediately, kicking off a new network request. This dance continues until retry has retried twice. With the code as-is, the output looks a bit like this:

Received a response, checking status code # 0 seconds in from the initial dataTaskPublisher
In the tryCatch # 0 seconds in from the initial dataTaskPublisher
Received a response, checking status code # 0 seconds in from the publisher returned by tryCatch
Received a response, checking status code # 3 seconds in from the initial dataTaskPublisher
In the tryCatch # 3 seconds in from the publisher returned by tryCatch
Received a response, checking status code # 3 seconds in from the publisher returned by tryCatch
Received a response, checking status code # 6 seconds in from the initial dataTaskPublisher
In the tryCatch # 6 seconds in from the publisher returned by tryCatch
Received a response, checking status code # 6 seconds in from the publisher returned by tryCatch

That's not exactly what we expected, right? This shows a total of 6 responses being received. Double what we wanted. And more importantly, we make requests in pairs of two. So the publisher that's created in the tryCatch executes immediately, but just doesn't emit its values until 3 seconds later which is why it takes three seconds for the initial dataTaskPublisher to fire again.

We can bring this down to a more sensible number of requests by applying the share() operator to the initial publisher. This will make it so that the first data task is only executed once:

dataTaskPublisher.share()
  .tryCatch({ error -> AnyPublisher<(data: Data, response: URLResponse), Error> in
    print("In the tryCatch")
    switch error {
    case DataTaskError.rateLimitted, DataTaskError.serverBusy:
      return dataTaskPublisher
        .delay(for: 3, scheduler: DispatchQueue.global())
        .eraseToAnyPublisher()
    default:
      throw error
    }
  })
  .retry(2)
  .sink(receiveCompletion: { completion in
    print(completion)
  }, receiveValue: { value in
    print(value)
  })
  .store(in: &cancellables)

By applying share() to the dataTaskPublisher, a new publisher is created that executes its upstream when it receives its first subscriber and shares the results with any subsequent subscribers. In our case, this results in the following output:

Received a response, checking status code # 0 seconds in from the initial dataTaskPublisher
In the tryCatch # 0 seconds in from the initial dataTaskPublisher
Received a response, checking status code # 0 seconds in from the publisher returned by tryCatch
In the tryCatch # 3 seconds in from the publisher returned by tryCatch
Received a response, checking status code # 3 seconds in from the publisher returned by tryCatch
In the tryCatch # 6 seconds in from the publisher returned by tryCatch
Received a response, checking status code # 6 seconds in from the publisher returned by tryCatch

We're closer to the desired outcome, but not quite there yet. Now that we use a shared publisher as the initial publisher, it no longer executes its data task when retry resubscribes, and the tryMap is no longer called. The result of the tryMap is cached by share(), and this cached result is immediately emitted when retry resubscribes. This means that share() will re-emit whatever error we received the first time it made its request.

Every time the shared data task publisher emits its error, the tryCatch creates a new publisher. Since this publisher is not shared, it performs a network call every time, and the result of that call is emitted with a delay.

This is why the initial retry is executed immediately and subsequent retries are executed three seconds after the previous retry.

At this stage it would be nice to somehow get rid of that extraneous request and make sure that we don't make a request immediately after the first failure.

We can get rid of the extraneous request by lowering the retry count by 1. After all, we make the initial request, then we make another request through the catch, and then we hit the retry. So before we attempt the first retry, two requests have been made. There's not much we can do about that.

Unfortunately, achieving this ultimate goal is non-trivial in Combine. Especially if you don't want to mess with the way the source publisher (the dataTaskPublisher in this case) emits its errors. If you're reading this, and you know of an elegant way to retry network calls with a delay that works reliably and does not result in the first request double-firing I would love to see it.

I have come up with two ways to achieve a semi-acceptable retry mechanism. One of them is to get rid of the initial request completely and start our requests in the tryCatch:

Fail(error: DataTaskError.invalidResponse)
  .tryCatch({ error -> AnyPublisher<(data: Data, response: URLResponse), Error> in
    print("In the tryCatch")
    switch error {
    case DataTaskError.rateLimitted, DataTaskError.serverBusy:
      return dataTaskPublisher
        .delay(for: 3, scheduler: DispatchQueue.global())
        .eraseToAnyPublisher()
    default:
      throw error
    }
  })
  .retry(2)
  .sink(receiveCompletion: { completion in
    print(completion)
  }, receiveValue: { value in
    print(value)
  })
  .store(in: &cancellables)

I don't like this approach. I don't like it one bit. But it works! Since the initial publisher always fails, we immediately fall through to the tryCatch where the first, second, and third requests are made. All three seconds apart.

The main pain point I have with all of the approaches I've shown so far is that a successful response is also delayed by three seconds. Wouldn't it be great if we had a way to make a request and publish non-error responses immediately? Or even better, only delay retry attempts for status codes where a delayed retry makes sense? I have one more solution to show you. It works like a charm, only delays responses where it makes sense, and delivers success responses instantly.

It's far more involved than the previous solutions, so it might not be as simple to integrate as you'd hoped.

My final solution

Before I explain my final solution, let's look at the code for my dataTaskPublisher. This replaces the code with the tryMap that I showed you at the beginning of this post:

let dataTaskPublisher = URLSession.shared.dataTaskPublisher(for: url)
  .mapError({ $0 as Error })
  .map({ response -> AnyPublisher<(data: Data, response: URLResponse), Error> in
    print("In the map")
    guard let httpResponse = response.response as? HTTPURLResponse else {
      return Fail(error: DataTaskError.invalidResponse)
        .eraseToAnyPublisher()
    }

    if httpResponse.statusCode == 429 {
      return Fail(error: DataTaskError.rateLimitted)
        .delay(for: 3, scheduler: DispatchQueue.global())
        .eraseToAnyPublisher()
    }

    if httpResponse.statusCode == 503 {
      return Fail(error: DataTaskError.serverBusy)
        .delay(for: 3, scheduler: DispatchQueue.global())
        .eraseToAnyPublisher()
    }

    return Just(response)
      .setFailureType(to: Error.self)
      .eraseToAnyPublisher()
  })
  .switchToLatest()
  .eraseToAnyPublisher()

Most of the new and improved work is done in this code. I manually change the data task publisher's error type to Error so I can use a map to create an all-new publisher with Error as its Failure. This code makes extensive use of Fail and Just to transform the output of my data task into a publisher. When I receive a response that's not convertible to HTTPURLResponse I return Fail(error: DataTaskError.invalidResponse).eraseToAnyPublisher(). This creates a publisher that sends a failure event immediately.

If the httpResponse.statusCode is 429 or 503, I return a Fail publisher that emits the appropriate DataTaskError with a delay. This means that these Fail publishers will hold on to their errors for three seconds before forwarding them. If we didn't get a status code we care about, I return a Just that has Error as its failure type (Just has Never as its failure type by default).

All publishers are erased to AnyPublisher so I have a single return type in my map.

After creating a new publisher in map I use switchToLatest to replace the data task publisher from before the map with the publisher that I create in the map. I then erase the resulting publisher to AnyPublisher again so my final type is AnyPublisher<(data: Data, response: URLResponse), Error>. This almost looks just like a normal type-erased data task publisher, except it has some neat tricks up its sleeve in case we run into an error.

The code to subscribe to dataTaskPublisher and retry requests looks like this:

dataTaskPublisher.share()
  .tryCatch({ error -> AnyPublisher<(data: Data, response: URLResponse), Error> in
    print("In the tryCatch")
    switch error {
    case DataTaskError.rateLimitted, DataTaskError.serverBusy:
      return dataTaskPublisher
    default:
      throw error
    }
  })
  .retry(1)
  .sink(receiveCompletion: { completion in
    print(completion)
  }, receiveValue: { value in
    print(value)
  })
  .store(in: &cancellables)

Just like before, we try the initial request and use share() to prevent each request from being executed twice. If this request comes back with an error where we want to apply a delayed retry, the tryCatch will not receive that error immediately. Instead, it will receive the error after the delay that I applied in the previous code snippet where I created a Fail publisher.

The tryCatch then creates a new data task that will make a new request. And again, if it encounters an error where a delayed retry should happen, that publisher will hold on to the error for the amount of time specified (in this case three seconds) before it leaves the tryCatch and hits the retry.

The output for this example would look like this:

In the map # 0 seconds after the initial data task
In the tryCatch # 3 seconds after the initial data task
In the map # 3 seconds after the initial data task
In the tryCatch # 6 seconds after the initial data task
In the map # 6 seconds after the initial data task

By delaying the delivery of certain errors, we can manipulate the start of the retried request.

One thing I really like about this approach is that you can use different intervals for different errors, and you can even have the server tell you how long you should wait before retrying a request if the server includes this information in an HTTP header or as part of the response body. The delay is configured in the map where you have access to the HTTP response so you could read and use any values that come back as part of the server's response.
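As a sketch of that idea, the hard-coded three seconds in the map could be replaced by the value of the response's Retry-After header. This assumes the server expresses Retry-After in seconds (the header can also be an HTTP date, which this sketch ignores), and it falls back to three seconds when the header is missing:

```swift
import Combine
import Foundation

enum DataTaskError: Error {
  case invalidResponse, rateLimitted, serverBusy
}

let url = URL(string: "https://practicalcombine.com")!

let dataTaskPublisher = URLSession.shared.dataTaskPublisher(for: url)
  .mapError({ $0 as Error })
  .map({ response -> AnyPublisher<(data: Data, response: URLResponse), Error> in
    guard let httpResponse = response.response as? HTTPURLResponse else {
      return Fail(error: DataTaskError.invalidResponse)
        .eraseToAnyPublisher()
    }

    if httpResponse.statusCode == 429 {
      // Use the server-provided Retry-After value when present,
      // falling back to three seconds. Only the "seconds" form of
      // the header is handled here, not the HTTP-date form.
      let retryAfter = httpResponse.value(forHTTPHeaderField: "Retry-After")
        .flatMap({ Double($0) }) ?? 3

      return Fail(error: DataTaskError.rateLimitted)
        .delay(for: .seconds(retryAfter), scheduler: DispatchQueue.global())
        .eraseToAnyPublisher()
    }

    if httpResponse.statusCode == 503 {
      return Fail(error: DataTaskError.serverBusy)
        .delay(for: 3, scheduler: DispatchQueue.global())
        .eraseToAnyPublisher()
    }

    return Just(response)
      .setFailureType(to: Error.self)
      .eraseToAnyPublisher()
  })
  .switchToLatest()
  .eraseToAnyPublisher()
```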

Since the delayed retry is now mostly driven and controlled by how errors are extracted and handled in the map, you could get rid of the share() and tryCatch altogether if you always want to retry your requests:

dataTaskPublisher
  .retry(2)
  .sink(receiveCompletion: { completion in
    print(completion)
  }, receiveValue: { value in
    print(value)
  })
  .store(in: &cancellables)

This does remove that little bit of extra control where you get to decide which errors should be retried by returning a data task publisher for them, but if you plan to always retry I think it's a bit cleaner to write your code like this because it's easier to follow.

In summary

What started as a simple question ("How do I implement a delayed retry in Combine?") turned out to be quite an adventure for me. Every time I thought I had found a solution, like the ones from Stack Overflow I mentioned earlier, there was always something about it I didn't like. As it turns out, there is no quick and easy way in Combine to implement a delayed retry that only applies to specific errors.

In this post you saw the various solutions I've tried, and why they were not to my liking. In the last section I showed you a tailor-made solution that works by delaying the delivery of specific errors rather than attempting to delay the start of the next request. The earlier approaches delayed delivery of all results, including successes, whenever a delay was applied. My final solution does not have this drawback, which, in my opinion, is much nicer than delaying everything.

If you have your own solution for this problem and think it's more elegant, shorter, or better, then please do reach out to me on Twitter so I can update this post. I secretly hope that this post is obsolete by the time WWDC 2020 comes along, but who knows. For now, I think this is the best we have.

