Three Simple Patterns for Retrying Jobs in Elixir

Tags

Three hot air balloons ascending

Processing background jobs successfully in software systems is critical to most businesses. These jobs can include sending a notification through email or SMS; generating a business intelligence report; processing a payment; or anything else that creates value for users or businesses. The Elixir programming language helps us design custom applications in a way that lets them recover from both expected and unexpected errors. The primitives the language offers to deal with unexpected errors is where Elixir truly shines. This post is written for my developer friends who’re adopting Elixir.

I think it would be helpful to clarify the distinction between expected and unexpected errors. Here’s a simple example:

defmodule Mathy do
  def divide(dividend, divisor) do
    if is_number(dividend) and is_number(divisor) do
      {:ok, dividend / divisor}
    else
      {:error, :both_args_must_be_numbers}
    end
  end
end

Mathy.divide(10,2)      # => {:ok, 5.0} - a successful result
Mathy.divide("pizza",3) # => {:error, :both_args_must_be_numbers} - an expected error
Mathy.divide(10,0)      # => ArithmeticError - the function doesn't handle this situation

The last line in the example is an unexpected error because we either never considered it could happen or have no sensible way to handle it - an “exceptional” situation.

Depending on the guarantees our systems must provide, Elixir has a few great options for retrying failed jobs. We can rely on a database, if durability is a priority. We save our pending jobs in a database and update their status as jobs are completed. Whether we decide to use Redis, PostgreSQL, or some other database, depends on our needs. Regardless, the purpose of this article is not to address ‘durable’ jobs. Put them in a database and we’re done. Instead, I’d like to share three simple patterns I’ve seen Elixir engineers rely on to retry failed jobs without touching a database.

Option #1: Simple Retry for Expected Errors

Sometimes a simple retry mechanism for retrying jobs a few times, after expected errors and then giving up, is all that’s needed:

def perform(_, _retry = 0), do: :ignore

def perform(arg, retry \\ 3) do
  case do_work(arg) do
    :error -> perform(arg, retry - 1)
    :ok -> :ok
  end
end

In the example above, perform/2 will call do_work/1. If we handle expected errors gracefully within do_work/1, it will return :error. When this happens we call perform/2 again with an adjusted retry value, to account for the first attempt. Eventually, when the retry value gets to zero, we return :ignore to stop retrying. If do_work/1 returns :ok, at any point, it doesn’t trigger a retry. You can easily add a delay to each retry or even a backoff strategy. Notice that this approach doesn’t account for unexpected errors within do_work/1. If do_work/1 raises an exception, it never gets to return :error, so perform/2 with an adjusted retry argument is never called. The next option on our list is a simple approach we can use to retry jobs after unexpected errors.

Option #2: Letting a Supervisor Restart Tasks

The second option on our list is letting Task.Supervisor take care of the retry logic. This option works best to handle bugs, unexpected input, and flaky network connections. It’s a simple approach for retrying our jobs a few times and then giving up.

To do this we’re going to start a supervisor within a supervision tree and have our tasks be supervised by this supervisor.

To start a supervisor to dynamically supervise tasks, we add our supervisor under a supervision tree:

children = [
  {Task.Supervisor, name: MyApp.TaskSupervisor}
]

Supervisor.start_link(children, strategy: :one_for_one)

This starts a Task.Supervisor process with the given name. Task.Supervisor currently only supports the :one_for_one strategy.

The one_for_one strategy configures the supervisor so that if a child terminates, only that child is restarted. Child in this example refers to a background job.

Then in our Phoenix controller, or some other process, we start a supervised task with start_child/3 or start_child/5:

Task.Supervisor.start_child(MyApp.TaskSupervisor, fn ->
  MyApp.MyTask.perform(args)
end, restart: :transient)

# OR

options = [restart: :transient]

Task.Supervisor.start_child(MyApp.TaskSupervisor, MyApp.MyTask, :perform, args, options)

That’s it, we’re done. If our task fails, MyApp.TaskSupervisor will restart it according to the :restart value. The example above uses :transient but we have a few options to adjust for our needs.

  • :temporary: (default) never restart
  • :transient: restart if the exit is not :normal, :shutdown, or {:shutdown, reason}
  • :permanent: always restart

We can also configure the number of restart attempts allowed in a given time frame before the supervisor gives up:

  • :max_restarts: (default of 3) max number of restarts in :max_seconds
  • :max_seconds: (default of 5) time frame for :max_restarts

A common misunderstanding is thinking that the :max_restarts count is per child. The :max_restarts count will be a total count, including all children. Take a look at option #3 if you want more control as to how many times each job gets retried.

So there it is, an easy way to retry our failed jobs with Task.Supervisor. Up next, an approach that provides more granular control.

Option #3: Spawning Processes from GenServer for More Control

If we need more control, we can create a task using Task.Supervisor.async_nolink/3 or Task.Supervisor.async_nolink/5 inside an OTP behaviour such as GenServer.

This approach requires more effort than option #2 but it allows us to easily specify the retry logic for our background jobs. I really like how handling expected and unexpected errors differently, if needed, can be very straightforward with this approach.

First, similar to option #2, we want to add a supervisor to our supervision tree. This supervisor will be responsible for supervising our jobs. Additionally, we’ll also want to add a GenServer process to our supervision tree. We’ll call it MyApp.Worker. This process will be responsible for starting our jobs and handling our retry logic. Our supervision tree with a Task.Supervisor and GenServer:

children = [
  {Task.Supervisor, name: MyApp.TaskSupervisor},
  MyApp.Worker
]

Supervisor.start_link(children, strategy: :one_for_one)

Now let’s wire up MyApp.Worker for it to start up (start_link/1), start our job (start_task/0), and handle retries (handle_info/2). In this case we’ll wire this GenServer to run single tasks, for the sake of a simple example:

defmodule MyApp.Worker do
  use GenServer

  # Client

  def start_link(_) do
    GenServer.start_link(__MODULE__, %{ref: nil}, name: __MODULE__)
  end

  def start_task do
    GenServer.call(__MODULE__, :start_task)
  end

  # Server (callbacks)

  def init(state) do
    {:ok, state}
  end

  # If ref is already set, we ignore the request to start the task.
  def handle_call(:start_task, _from, %{ref: ref} = state) when is_reference(ref) do
    {:reply, :ok, state}
  end

  # If ref is nil, we start the task and save the task's ref.
  def handle_call(:start_task, _from, %{ref: nil} = state) do
    task = do_start_task()
    {:reply, :ok, %{state | ref: task.ref}}
  end

  # The task is done.
  def handle_info({ref, _result}, %{ref: ref} = state) do
    # No need to continue to monitor
    Process.demonitor(ref, [:flush])

    # Do something with the result here...

    {:noreply, %{state | ref: nil}}
  end

  # Unexpected failure
  def handle_info({:DOWN, ref, :process, _pid, _reason}, %{ref: ref} = state) do
    # restart the task
    task = do_start_task()
    {:noreply, %{state | ref: task.ref}}
  end

  defp do_start_task do
    Task.Supervisor.async_nolink(MyApp.TaskSupervisor, fn ->
      MyApp.MyTask.perform()
    end)
  end
end

When we create a task with async_nolink inside a GenServer we get two very useful things for free:

  1. A message is sent to our GenServer with the result of the task.
  2. Our GenServer monitors our task.

This allows us to add different handle_info/2 callbacks that match on success, expected, and unexpected errors. As it stands now, the handle_info/2 that matches on :DOWN messages is responsible for dealing with unexpected errors. Because we’re monitoring the task, we’ll get a {:DOWN message with a reason no matter how it terminates. If that’s the first message we get from it, it will be because it failed, as the reason will indicate. On the other hand, if the task completes successfully, we’ll get a {ref, answer} message first. At that point we will know the task is complete and won’t want its :DOWN message, so we’ll demonitor the process and :flush its :DOWN message from our mailbox if it’s already there.

As it stands now we’re dealing with unexpected errors but not with expected errors. Other than matching :DOWN messages, we’re not checking if a task completed successfully or if it returned an error. To add special handling for expected errors we’d have to add a handle_info/2 callback to match them. We could add a clause that restarts expected errors like this:

  # Expected error
  def handle_info({ref, {:error, _reason}}, %{ref: ref} = state) do
    # restart the task
    task = do_start_task()
    {:noreply, %{state | ref: task.ref}}
  end

If the return of MyApp.Job.perform/0 is an error tuple, we can match it and trigger a retry.

As we can see in this example, using Task.Supervisor.async_nolink/3 inside a GenServer gives us fine control without a lot of ceremony. Handling success, expected errors, and unexpected errors differently is a matter of adding a callback for each case. We can improve on the example above. Adding a retry count, delay, and/or backoff strategy to the example above wouldn’t take a ton of effort. We can keep track of retries in the process state, then use the retry count to add delays between tries, and give up after a configured maximum as needed. We could also modify this example to track refs of multiple tasks, along with retry counts for each. An alternative approach to tracking multple tasks per process would be to dynamically start a process to track individual tasks on demand.

Wrapping Up

There we have it. Three simple patterns for retrying jobs in Elixir. These are not the only three options though; Elixir and OTP are at our disposal to design more involved retry logic and data processing pipelines, too. Remember, if we need our jobs to survive a crash, save them in a database. Each BEAM process has its own internal state. So an alternative to durable storage is in-memory in the state of a specific process, as shown by options #2 and #3 above. These patterns can be incredibly useful and easy to maintain. I hope you enjoyed reading about them.

DockYard is a digital product agency offering exceptional strategy, design, full stack engineering, web app development, custom software, Ember, Elixir, and Phoenix services, consulting, and training. With a nationwide staff, we’ve got consultants in key markets across the U.S., including Seattle, San Francisco, Denver, Chicago, Dallas, Atlanta, and New York.