Processing background jobs successfully in software systems is critical to most businesses. These jobs can include sending a notification through email or SMS, generating a business intelligence report, processing a payment, or anything else that creates value for users or businesses. The Elixir programming language helps us design applications that can recover from both expected and unexpected errors. The primitives the language offers for dealing with unexpected errors are where Elixir truly shines. This post is written for my developer friends who’re adopting Elixir.
I think it would be helpful to clarify the distinction between expected and unexpected errors. Here’s a simple example:
defmodule Mathy do
  def divide(dividend, divisor) do
    if is_number(dividend) and is_number(divisor) do
      {:ok, dividend / divisor}
    else
      {:error, :both_args_must_be_numbers}
    end
  end
end
Mathy.divide(10, 2)      # => {:ok, 5.0} - a successful result
Mathy.divide("pizza", 3) # => {:error, :both_args_must_be_numbers} - an expected error
Mathy.divide(10, 0)      # => ArithmeticError - the function doesn't handle this situation
The last line in the example is an unexpected error because we either never considered it could happen or have no sensible way to handle it - an “exceptional” situation.
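Once we decide that division by zero is a case we want to handle, we can promote it from an unexpected error to an expected one. Here’s a minimal sketch; the SaferMathy module name and the :division_by_zero atom are mine, not part of the example above:

defmodule SaferMathy do
  # Handle division by zero explicitly instead of letting the ArithmeticError surface.
  def divide(dividend, divisor) when is_number(dividend) and is_number(divisor) do
    if divisor == 0 do
      {:error, :division_by_zero}
    else
      {:ok, dividend / divisor}
    end
  end

  def divide(_dividend, _divisor), do: {:error, :both_args_must_be_numbers}
end

SaferMathy.divide(10, 0) # => {:error, :division_by_zero} - now an expected error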
Depending on the guarantees our systems must provide, Elixir has a few great options for retrying failed jobs. We can rely on a database if durability is a priority: we save our pending jobs in a database and update their status as jobs are completed. Whether we use Redis, PostgreSQL, or some other database depends on our needs. Regardless, the purpose of this article is not to address ‘durable’ jobs; put them in a database and we’re done. Instead, I’d like to share three simple patterns I’ve seen Elixir engineers rely on to retry failed jobs without touching a database.
Option #1: Simple Retry for Expected Errors
Sometimes all we need is a simple mechanism that retries a job a few times after expected errors and then gives up:
def perform(arg, retry \\ 3)

def perform(_arg, 0), do: :ignore

def perform(arg, retry) do
  case do_work(arg) do
    :error -> perform(arg, retry - 1)
    :ok -> :ok
  end
end
In the example above, perform/2 will call do_work/1. If we handle expected errors gracefully within do_work/1, it will return :error. When this happens we call perform/2 again with an adjusted retry value, to account for the first attempt. Eventually, when the retry value gets to zero, we return :ignore to stop retrying. If do_work/1 returns :ok at any point, it doesn’t trigger a retry. You can easily add a delay to each retry or even a backoff strategy, as sketched below.
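Here’s a minimal sketch of that idea, assuming a hypothetical perform_with_backoff/3 that calls the same do_work/1:

def perform_with_backoff(arg, retry \\ 3, attempt \\ 1)

def perform_with_backoff(_arg, 0, _attempt), do: :ignore

def perform_with_backoff(arg, retry, attempt) do
  case do_work(arg) do
    :ok ->
      :ok

    :error ->
      # Wait 100ms, 200ms, 400ms, ... before the next attempt.
      Process.sleep(100 * Integer.pow(2, attempt - 1))
      perform_with_backoff(arg, retry - 1, attempt + 1)
  end
end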
Notice that this approach doesn’t account for unexpected errors within do_work/1. If do_work/1 raises an exception, it never gets to return :error, so perform/2 with an adjusted retry argument is never called. The next option on our list is a simple approach we can use to retry jobs after unexpected errors.
Option #2: Letting a Supervisor Restart Tasks
The second option on our list is letting Task.Supervisor take care of the retry logic. This option works best to handle bugs, unexpected input, and flaky network connections. It’s a simple approach for retrying our jobs a few times and then giving up.
To do this, we start a supervisor that can dynamically supervise tasks by adding it to our supervision tree:
children = [
  {Task.Supervisor, name: MyApp.TaskSupervisor}
]

Supervisor.start_link(children, strategy: :one_for_one)
This starts a Task.Supervisor process with the given name. Task.Supervisor currently only supports the :one_for_one strategy. The :one_for_one strategy configures the supervisor so that if a child terminates, only that child is restarted. A child in this example refers to a background job.
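In a real project this child spec usually lives in the application’s supervision tree. Here’s a minimal sketch, assuming a conventional MyApp.Application module like the one Mix generates:

defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      # The supervisor that will dynamically supervise our background tasks.
      {Task.Supervisor, name: MyApp.TaskSupervisor}
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end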
Then, in our Phoenix controller or some other process, we start a supervised task with start_child/3 or start_child/5:
Task.Supervisor.start_child(MyApp.TaskSupervisor, fn ->
  MyApp.MyTask.perform(args)
end, restart: :transient)

# OR, passing a module, a function name, and a list of arguments
options = [restart: :transient]
Task.Supervisor.start_child(MyApp.TaskSupervisor, MyApp.MyTask, :perform, args, options)
That’s it, we’re done. If our task fails, MyApp.TaskSupervisor will restart it according to the :restart value. The example above uses :transient, but we have a few options to adjust for our needs:

- :temporary (the default): never restart
- :transient: restart if the exit reason is not :normal, :shutdown, or {:shutdown, reason}
- :permanent: always restart
We can also configure the number of restart attempts allowed in a given time frame before the supervisor gives up:
- :max_restarts (default of 3): the maximum number of restarts allowed within :max_seconds
- :max_seconds (default of 5): the time frame for :max_restarts
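If the defaults don’t fit, both values can be passed as options when starting the Task.Supervisor. A minimal sketch, with illustrative numbers:

children = [
  {Task.Supervisor, name: MyApp.TaskSupervisor, max_restarts: 10, max_seconds: 60}
]

Supervisor.start_link(children, strategy: :one_for_one)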
A common misunderstanding is thinking that the :max_restarts count is per child. The :max_restarts count is a total count, including all children. Take a look at option #3 if you want more control over how many times each job gets retried.

So there it is, an easy way to retry our failed jobs with Task.Supervisor. Up next, an approach that provides more granular control.
Option #3: Spawning Processes from GenServer for More Control
If we need more control, we can create a task using Task.Supervisor.async_nolink/3 or Task.Supervisor.async_nolink/5 inside an OTP behaviour such as GenServer.
This approach requires more effort than option #2 but it allows us to easily specify the retry logic for our background jobs. I really like how handling expected and unexpected errors differently, if needed, can be very straightforward with this approach.
First, similar to option #2, we want to add a supervisor to our supervision tree. This supervisor will be responsible for supervising our jobs. Additionally, we’ll also want to add a GenServer process to our supervision tree. We’ll call it MyApp.Worker. This process will be responsible for starting our jobs and handling our retry logic. Here’s our supervision tree with a Task.Supervisor and a GenServer:
children = [
  {Task.Supervisor, name: MyApp.TaskSupervisor},
  MyApp.Worker
]

Supervisor.start_link(children, strategy: :one_for_one)
Now let’s wire up MyApp.Worker so it can start up (start_link/1), start our job (start_task/0), and handle retries (handle_info/2). In this case we’ll wire this GenServer to run single tasks, for the sake of a simple example:
defmodule MyApp.Worker do
  use GenServer

  # Client

  def start_link(_) do
    GenServer.start_link(__MODULE__, %{ref: nil}, name: __MODULE__)
  end

  def start_task do
    GenServer.call(__MODULE__, :start_task)
  end

  # Server (callbacks)

  def init(state) do
    {:ok, state}
  end

  # If ref is already set, we ignore the request to start the task.
  def handle_call(:start_task, _from, %{ref: ref} = state) when is_reference(ref) do
    {:reply, :ok, state}
  end

  # If ref is nil, we start the task and save the task's ref.
  def handle_call(:start_task, _from, %{ref: nil} = state) do
    task = do_start_task()
    {:reply, :ok, %{state | ref: task.ref}}
  end

  # The task is done.
  def handle_info({ref, _result}, %{ref: ref} = state) do
    # No need to continue to monitor
    Process.demonitor(ref, [:flush])
    # Do something with the result here...
    {:noreply, %{state | ref: nil}}
  end

  # Unexpected failure
  def handle_info({:DOWN, ref, :process, _pid, _reason}, %{ref: ref} = state) do
    # Restart the task
    task = do_start_task()
    {:noreply, %{state | ref: task.ref}}
  end

  defp do_start_task do
    Task.Supervisor.async_nolink(MyApp.TaskSupervisor, fn ->
      MyApp.MyTask.perform()
    end)
  end
end
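With that in place, starting a job is a single call to the client function, whether from a Phoenix controller, an IEx session, or anywhere else:

MyApp.Worker.start_task() # => :ok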
When we create a task with async_nolink inside a GenServer we get two very useful things for free:

- A message is sent to our GenServer with the result of the task.
- Our GenServer monitors our task.
This allows us to add different handle_info/2 callbacks that match on success, expected errors, and unexpected errors. As it stands now, the handle_info/2 that matches on :DOWN messages is responsible for dealing with unexpected errors. Because we’re monitoring the task, we’ll get a {:DOWN, ref, :process, pid, reason} message no matter how it terminates. If that’s the first message we get from the task, it will be because it failed, as the reason will indicate. On the other hand, if the task completes successfully, we’ll get a {ref, answer} message first. At that point we know the task is complete and won’t want its :DOWN message, so we demonitor the process and :flush its :DOWN message from our mailbox if it’s already there.
As it stands now, we’re dealing with unexpected errors but not with expected errors. Other than matching :DOWN messages, we’re not checking if a task completed successfully or if it returned an error. To add special handling for expected errors we’d have to add a handle_info/2 callback to match them.
We could add a clause that restarts expected errors like this:
# Expected error
def handle_info({ref, {:error, _reason}}, %{ref: ref} = state) do
  # Restart the task
  task = do_start_task()
  {:noreply, %{state | ref: task.ref}}
end
If the return value of MyApp.MyTask.perform/0 is an error tuple, we can match it and trigger a retry.
As we can see in this example, using Task.Supervisor.async_nolink/3 inside a GenServer gives us fine control without a lot of ceremony. Handling success, expected errors, and unexpected errors differently is a matter of adding a callback for each case. We can also improve on the example above: adding a retry count, delay, and/or backoff strategy wouldn’t take a ton of effort. We can keep track of retries in the process state, then use the retry count to add delays between tries, and give up after a configured maximum, as sketched below. We could also modify this example to track refs of multiple tasks, along with retry counts for each. An alternative approach to tracking multiple tasks per process would be to dynamically start a process to track individual tasks on demand.
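Here’s a minimal sketch of that idea, replacing the :DOWN callback from the example above. It assumes the initial state is extended to %{ref: nil, retries: 0} and adds a @max_retries attribute and a :retry message; those names are illustrative, not part of the example above:

@max_retries 5

# Unexpected failure: schedule a delayed retry while attempts remain.
def handle_info({:DOWN, ref, :process, _pid, _reason}, %{ref: ref, retries: retries} = state)
    when retries < @max_retries do
  # Exponential backoff: 1s, 2s, 4s, ...
  Process.send_after(self(), :retry, :timer.seconds(1) * Integer.pow(2, retries))
  {:noreply, %{state | ref: nil, retries: retries + 1}}
end

# Out of attempts: give up.
def handle_info({:DOWN, ref, :process, _pid, _reason}, %{ref: ref} = state) do
  {:noreply, %{state | ref: nil}}
end

# Time to retry: start the task again.
def handle_info(:retry, %{ref: nil} = state) do
  task = do_start_task()
  {:noreply, %{state | ref: task.ref}}
end

A fuller version would also reset the retries count in the success callback.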
Wrapping Up
There we have it: three simple patterns for retrying jobs in Elixir. These are not the only three options, though; Elixir and OTP are at our disposal to design more involved retry logic and data processing pipelines, too. Remember, if we need our jobs to survive a crash, save them in a database. Each BEAM process has its own internal state, so an alternative to durable storage is keeping job state in memory, in the state of a specific process, as options #2 and #3 above show. These patterns can be incredibly useful and easy to maintain. I hope you enjoyed reading about them.