Deploying Machine Learning Models with Elixir

Sean Moriarity

Machine Learning Advisor

Introduction

In my last post, we walked through how to build an Axon model to detect credit card fraud from anonymized features of real credit card transactions.

While the final model performed relatively well at the fraud detection task, we can’t actually detect any fraud from our Livebook. Models aren’t meant to live in a notebook forever. After your model has been validated (a beast of a topic I will cover in another post), you should begin to consider how you will put your model into production.

In this post, we’ll go over considerations for model deployment and what a deployment might look like for this particular example. This is the first of a few posts I plan to do on machine learning operations, or MLOps.

It’s important to note that there is no definitive guide to model deployments. What your model looks like in production depends entirely on your business needs.

Even before training your model, you should set objectives and evaluate your model against them to determine whether the model you choose to deploy meets your needs, or whether training a model is even worth it.

While this tutorial focuses mainly on deployment options for deep learning models, you might find much more success, at much lower cost, in deploying simpler models. 80% of “machine learning” problems can probably be solved reasonably well with linear regression.

You should really only introduce complexity when it becomes absolutely necessary, or if you’re an academic trying to get published.

Exporting and Versioning Large Models

In order to operationalize a model, you need a way to bring it outside of the environment you trained it in. Fortunately, Axon recently introduced two new functions for the purpose of exporting trained neural networks: Axon.serialize/3 and Axon.deserialize/2.

Axon.serialize/3 serializes tuples of {model, params} into an Elixir binary which you can write to an external file for later use. Under the hood, Axon uses :erlang.term_to_binary/2; however, if you attempt to deserialize the serialized model with :erlang.binary_to_term, you won’t get the result you’re expecting.

Before converting to a binary, Axon does some transformations to get both the Axon model and the parameter container into a form suitable for serialization. These transformations are implementation details that help guarantee backwards compatibility as Axon evolves. You can reasonably assume that any model serialized using Axon.serialize/3 can be deserialized with Axon.deserialize/2.

To ensure compatibility, you should always serialize models with Axon.serialize/3 and always deserialize models with Axon.deserialize/2.

As a final consideration, you should only deserialize models from trusted sources. Axon.deserialize/2 uses the :safe option of :erlang.binary_to_term/2 under the hood, but you still shouldn’t attempt to load models from untrusted sources.

In the credit card fraud example, you train a model with the following training loop:

model_state =
  model
  |> Axon.Loop.trainer(loss, optimizer)
  |> Axon.Loop.metric(:precision)
  |> Axon.Loop.metric(:recall)
  |> Axon.Loop.run(batched_train, epochs: 30, compiler: EXLA)

model_state is the params in the Axon {model, params} tuple. You can write a serialized representation of your Axon model to disk by adding the following lines after your training loop:

model
|> Axon.serialize(params)
|> then(&File.write!("model.axon", &1))

This will save your serialized model to model.axon in the current working directory. As a convention, you should save all of your Axon models with the .axon extension. You can then easily re-use your model later on by reading it from disk into memory:

{model, params} = File.read!("model.axon") |> Axon.deserialize()
Axon.predict(model, params, input)

As you’ll see later on, there are other ways you’ll need to “persist” your models for use with external serving solutions. In that case, I still recommend saving a .axon version of your models in addition to other artifacts necessary for your deployment scenario. This will guarantee you can iterate and experiment with your model from Axon and convert to other persistence formats if necessary.

Persisting your model for use in deployments necessitates at least some sort of storage solution, and, if you plan on iterating over multiple models (which you should), a versioning solution for models as well.

For small projects, you can probably get away with just throwing models in a models directory and using Git LFS. For larger projects, you’ll definitely want a better solution. Model version control has different flavors and requirements at each step of the model development and deployment cycle.

For example, when selecting and training models, you’ll want to track the hyperparameters associated with each model, as well as the metadata and metrics collected during training for each versioned model.
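Axon doesn’t ship tooling for this, but as a minimal sketch of lightweight, hand-rolled model versioning, you could write each trained model alongside a metadata file. The version string, hyperparameter values, and metric values below are purely illustrative, and Jason is assumed to be available:

version = "2022-01-18-001"
File.mkdir_p!("models/fraud/#{version}")

# Serialize the trained model next to a metadata file describing it
model
|> Axon.serialize(model_state)
|> then(&File.write!("models/fraud/#{version}/model.axon", &1))

# Placeholder hyperparameters and metrics for illustration only
metadata = %{
  version: version,
  hyperparameters: %{learning_rate: 1.0e-3, batch_size: 2048, epochs: 30},
  metrics: %{precision: 0.91, recall: 0.84}
}

File.write!("models/fraud/#{version}/metadata.json", Jason.encode!(metadata))

Dedicated experiment tracking tools are a better fit for larger projects, but even this much makes it possible to tie a deployed artifact back to how it was trained.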

Once you select a model, you need a way to update your application to use your new and improved model, preferably without downtime. This is where out-of-the-box model serving tools come into play. Most model serving solutions let you set up model repositories on remote filesystems such as S3, and typically support serving multiple versions of the same model at different endpoints.

In a real-world project, you’d typically also save checkpoints during model training. Checkpoints are snapshots of training state taken at regular intervals. Axon allows you to checkpoint your entire training state using the Axon.Loop.checkpoint event handler.

In the Python ecosystem, checkpointing state is seen as a form of “fault tolerance” because it allows you to resume training from the last good state in the event of a training failure. If you have a long-running training job, it’s definitely good practice to add checkpoints at fixed intervals. For example, in the fraud detection example, the training loop looks like this:

model_state =
  model
  |> Axon.Loop.trainer(loss, optimizer)
  |> Axon.Loop.metric(:precision)
  |> Axon.Loop.metric(:recall)
  |> Axon.Loop.run(batched_train, epochs: 30, compiler: EXLA)

You can simply add Axon.Loop.checkpoint to the loop:

model_state =
  model
  |> Axon.Loop.trainer(loss, optimizer)
  |> Axon.Loop.metric(:precision)
  |> Axon.Loop.metric(:recall)
  |> Axon.Loop.checkpoint()
  |> Axon.Loop.run(batched_train, epochs: 30, compiler: EXLA)

This will save training checkpoints under the checkpoints path with the file pattern checkpoint_{epoch}_{iteration}.ckpt after every training epoch. You can select any checkpoint and resume training with a few Axon functions:

# Load the last checkpoint from epoch 30
path = "checkpoints/checkpoint_30_1000.ckpt"
last_ckpt_state =
  path
  # read file
  |> File.read!()
  # deserialize last training state
  |> Axon.Loop.deserialize_state()
# Create a loop and tell Axon to run from previous state
model_state =
  model
  |> Axon.Loop.trainer(loss, optimizer)
  |> Axon.Loop.metric(:precision)
  |> Axon.Loop.metric(:recall)
  |> Axon.Loop.from_state(last_ckpt_state)
  |> Axon.Loop.run(batched_train, epochs: 30, compiler: EXLA)

Model Deployment Scenarios

With a trained model in hand, you need to go about integrating it into your application.

What that integration looks like is application dependent; however, there are a few common scenarios to consider when it comes to model deployment. Your deployment scenario will dictate how you integrate your model into a production solution. One thing you’ll probably find common to any deployment scenario is that latency is king.

Setting aside functional performance (e.g. how accurate a model is), latency is the most critical metric to consider during model deployment. You would probably get more positive user feedback from serving random predictions in milliseconds than from serving perfect predictions in minutes to hours (please don’t actually do this; it’s incredibly irresponsible). Latency should be the driving consideration when thinking about your deployment scenario.

Cloud vs. Edge

The first thing to consider is whether you want to serve inferences from the cloud or at the edge. For simplicity, I am grouping on-prem solutions into the cloud bucket.

Cloud inference happens over the network: the model(s) live on a server at some endpoint, users make requests to that endpoint, and they receive inferences back. Edge deployments happen on edge devices: the model lives on individual devices, such as mobile phones, and serves inferences on demand without making requests to an inference server.

In some applications the choice of cloud vs. edge is obvious. For example, you wouldn’t try to deploy GPT-3 at the edge.

The choice is typically less obvious when you’re building an application meant to be used at the edge. In that case, you need to consider how deploying to the cloud vs. deploying models at the edge impacts your business needs.

If you intend your application to be functional without a reliable network connection, or to yield low-latency predictions without access to high-speed internet, you’re definitely going to want an edge deployment.

Today, most edge devices come with built-in accelerators and runtimes specifically optimized for machine learning. iPhones, for example, include the CoreML framework, which allows you to run machine learning models at the edge. TensorFlow Lite is another extremely popular framework for edge ML deployments.

The prospects for Elixir machine learning at the edge are particularly exciting when you consider Elixir already has an excellent edge framework in Nerves. If you’d like to help make this ecosystem grow, consider joining the EEF Machine Learning Working Group.

Cloud deployments are probably “easier” in the sense that you won’t need to target multiple devices and worry about optimizing your models to use less compute and storage. In the cloud you can always scale up compute, and the model you deploy can serve predictions to any device with a network connection.

Additionally, updating models in the cloud is significantly easier than updating models at the edge. As with critical firmware updates, you can never get 100% of your users to upgrade to newer versions, so you can essentially guarantee users will be walking around with outdated versions of your model. With a cloud deployment, your server is always the source of truth.

Let’s consider our fraud detection model. Does it make sense to deploy in the cloud or at the edge? While I’m sure there’s a scenario where an edge deployment makes sense, our particular model is probably better deployed in the cloud. We can assume reliable internet because credit card transactions already take place over the internet.

Online vs. Batch Inference

Another consideration for your application is whether you will need to perform online or batch inference.

Online inference happens on demand: your model serves predictions as requests arrive. Batch inference happens offline: you perform batch prediction jobs, typically at a fixed interval, and serve or use cached predictions at application runtime.

Batch inference jobs are somewhat falling out of favor, but they might make sense for your application. In some ways it’s advantageous to perform batch inference because you can scale up compute and deal with predictions in bulk (rather than at batch size 1). Additionally, latency is slightly less of a concern, though you should still be concerned about the latency impact of serving cached predictions.

In certain applications, batch inference doesn’t make sense at all: you can’t possibly make batch predictions ahead of time for every input you might see in production.

Applications that make use of image recognition, for example, are not good candidates for batch inference. You can’t generate predictions for every possible image you might see at runtime ahead of time.

However, even if a model is suitable for batch inference, it doesn’t mean that you should serve it in an offline manner. For example, recommendation systems can be served offline: you periodically update a user’s “embedding” based on their shopping, viewing, and browsing history, and then find products similar to that saved embedding at runtime.

However, adaptive online models typically make for an enhanced user experience. TikTok is a good, albeit extreme, example of the trend towards real-time machine learning. Their model is excellent at capturing and keeping user attention through its recommendations.

Even if you don’t plan on engineering the next TikTok, there are still benefits to performing inference in real-time. I highly recommend reading this article by Chip Huyen (also linked above) on the trend towards real-time machine learning.

Of course, online inference presents an entirely new set of engineering challenges.

First, deep learning models benefit from scale. That is to say that most frameworks are designed to deal with large numbers of examples at once. In a production setting, you’re dealing with a batch size of 1, which means you can’t benefit from parallelization across many examples.

Additionally, serving predictions in an online manner wraps a computationally expensive model in all of the same issues you’d deal with in a traditional application. You need to be concerned with fault tolerance, network latency, etc.

You also need to consider model payload serialization: the format in which you send information to your model can impact both latency and performance. For example, it’s common to send requests with JSON; however, this might not be ideal for sending requests to models. JSON doesn’t support flexible numeric types, so you can silently lose precision sending requests via JSON, leading to degraded performance.

Fortunately, most of the considerations you should have when serving a model for online inference have been built into industry standard model serving solutions. Before we discuss these solutions, I’ll briefly discuss how and why you might settle on a pure Elixir solution.

As a final exercise in determining what kind of deployment makes sense, consider the fraud detection model. Should you perform predictions in an online or offline manner? This decision largely stems from business requirements, and I don’t think you could go wrong in either case.

In a perfect world, you’d be able to immediately decline suspicious transactions before they go through; however, that might not be feasible. Instead, it might be more suitable to validate transactions in bulk very frequently.
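To make the offline option concrete, here is a rough sketch of what a periodic batch scoring job could look like. The Transactions.recent_batches/0 and Transactions.flag_suspicious/1 functions are hypothetical application code, and the interval and threshold are arbitrary:

defmodule FraudDetection.BatchScorer do
  @moduledoc """
  Periodically scores recent transactions in bulk with a trained Axon model.
  """
  use GenServer

  @interval :timer.minutes(5)
  @threshold 0.5

  def start_link({model, params}) do
    GenServer.start_link(__MODULE__, {model, params}, name: __MODULE__)
  end

  @impl true
  def init({model, params}) do
    schedule_run()
    {:ok, {model, params}}
  end

  @impl true
  def handle_info(:run, {model, params} = state) do
    # Each batch is a {transaction_ids, input_tensor} pair (hypothetical helper)
    for {ids, batch} <- Transactions.recent_batches() do
      # One forward pass over an entire batch of transactions
      scores = Axon.predict(model, params, batch)

      ids
      |> Enum.zip(Nx.to_flat_list(scores))
      |> Enum.filter(fn {_id, score} -> score > @threshold end)
      |> Enum.each(fn {id, _score} -> Transactions.flag_suspicious(id) end)
    end

    schedule_run()
    {:noreply, state}
  end

  defp schedule_run, do: Process.send_after(self(), :run, @interval)
end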

Model Serving Solutions

Model serving solutions are software designed specifically to solve the challenges associated with the deployment of deep learning models at scale. Serving online models is difficult. Model serving solutions are designed to make it easier. Let’s quickly touch on the features you get out of the box with most open source model servers.

Flexibility

While there are framework-specific model servers we will discuss later, most model servers will support multiple frameworks out of the box and also give you the option to add runtimes to serve custom model formats.

As an example, NVIDIA Triton Inference Server supports TensorRT, ONNX, TensorFlow, Torch, and more. There are also abstractions which allow you to build custom runtimes into the existing server infrastructure.

Autobatching

As I mentioned before, online model serving typically happens with a batch size of 1. If you get overlapping requests, you’ll have to wait until your model processes each request before moving on to the next one. This is slow! There’s generally minimal latency impact when adding additional examples to a batch because model runtimes process batches in parallel.

Processing a single example at a time is inefficient. Autobatching solves this problem by accepting a slight latency increase while waiting for requests to accumulate, then processing multiple requests at once.

For example, let’s say you want to autobatch requests every 10 milliseconds with a maximum batch size of 64. When your model server receives a request, it will wait up to 10 milliseconds from the first request, or until the queue fills with 64 entries, before sending the batch to the model. For models that receive lots of concurrent requests, autobatching can have a massive impact on performance.
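Model servers implement this for you, but to make the idea concrete, here is a rough sketch of autobatching in Elixir: a GenServer queues incoming requests and runs a single batched prediction once 64 requests accumulate or 10 milliseconds elapse. This is an illustration rather than how Triton or any particular server implements it, and it assumes each request carries a single-example tensor of shape {1, 30}:

defmodule Serving.Autobatcher do
  @moduledoc """
  Illustrative sketch of autobatching: queue requests and run the model once
  per batch, either when 64 requests accumulate or 10 ms elapse.
  """
  use GenServer

  @max_batch_size 64
  @window_ms 10

  def start_link({model, params}) do
    GenServer.start_link(__MODULE__, {model, params}, name: __MODULE__)
  end

  def predict(input), do: GenServer.call(__MODULE__, {:predict, input})

  @impl true
  def init({model, params}) do
    {:ok, %{model: model, params: params, queue: []}}
  end

  @impl true
  def handle_call({:predict, input}, from, state) do
    state = %{state | queue: [{from, input} | state.queue]}

    cond do
      length(state.queue) >= @max_batch_size ->
        {:noreply, flush(state)}

      length(state.queue) == 1 ->
        # First request in a new window: schedule a flush in 10 ms
        Process.send_after(self(), :flush, @window_ms)
        {:noreply, state}

      true ->
        {:noreply, state}
    end
  end

  @impl true
  def handle_info(:flush, state), do: {:noreply, flush(state)}

  defp flush(%{queue: []} = state), do: state

  defp flush(%{model: model, params: params, queue: queue} = state) do
    {froms, inputs} = queue |> Enum.reverse() |> Enum.unzip()

    # Stack single examples into one batch and run a single forward pass
    batch = Nx.concatenate(inputs, axis: 0)
    preds = Axon.predict(model, params, batch)

    # Reply to each caller with its row of the batched prediction
    froms
    |> Enum.with_index()
    |> Enum.each(fn {from, i} -> GenServer.reply(from, preds[i]) end)

    %{state | queue: []}
  end
end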

Versioning

A good model server should allow you to serve multiple versions of the same model. As you deploy a new version of a model to production, you might want to perform some A/B testing on your new model versus your old model before fully rolling everyone over to the new version. To do this, you’ll need to be able to serve multiple versions of your model from different endpoints.
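With Triton, for example, each model version is exposed at its own endpoint following the pattern /v2/models/{name}/versions/{version}/infer, so a crude A/B split could look like the sketch below. The 50/50 split and the req_data payload (built the same way as in the prediction example later in this post) are illustrative:

# Route roughly half of the traffic to version 2 of the fraud model
version = if :rand.uniform() < 0.5, do: 1, else: 2

Req.post!(
  "http://localhost:8000/v2/models/fraud/versions/#{version}/infer",
  Jason.encode!(req_data)
)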

Concurrent execution

The need to serve multiple versions of a model at the same time also implies that a good model server should be capable of serving models concurrently. With limited compute this can be a challenge, as both models might not be capable of executing simultaneously on the same server. In these cases, your model server needs to deal with load balancing.

Many more not listed

There are many other challenges not listed here. There isn’t a model serving solution which solves all of the challenges you might run into perfectly. Additionally, this is still very much an evolving field. It’s only been 10 years (!!) since deep learning had its coming out moment, and even more recently that Google and other large companies started successfully putting deep learning into production.

Serving Axon

With the challenges of model serving in mind, you need to pick a serving solution and get your Axon model into a format capable of running within that solution. For this example, you’ll use NVIDIA Triton Inference Server.

NVIDIA Triton Inference Server is an industry standard model serving solution, and comes out of the box with all of the features you need to successfully put your models into production. You can get up and running relatively quickly with Docker in a few steps.

Step 1: Convert your model to ONNX

Triton supports a ton of different execution formats. You can serve TensorFlow SavedModels, PyTorch TorchScript models, TensorRT Models, ONNX models, etc.

There are roundabout ways to convert Axon to TensorFlow SavedModels and PyTorch TorchScript models; however, there is much more support for ONNX conversion at this time.

If you’ve saved your model as per the directions outlined in a previous section of this post, you can load and serialize the model to ONNX with the following code:

Mix.install([
  {:axon_onnx, "~> 0.1.0-dev", github: "elixir-nx/axon_onnx"},
  {:axon, "~> 0.1.0-dev", github: "elixir-nx/axon", override: true}
])

{model, params} = File.read!("model.axon") |> Axon.deserialize()
AxonOnnx.Serialize.__export__(model, params, path: "model.onnx")

After running this, you’ll have an ONNX model that can be served from Triton. Note that the AxonOnnx API is currently experimental and subject to breaking changes; however, the general idea of exporting an ONNX model from an Axon model and its parameters will remain the same.

Step 2: Make a model repository

Triton requires you to specify a model repository with a structure that looks something like:

- <model_repository_name>
  - <model_endpoint_name>
    - 1
    - 2
  - <model_endpoint_name>
    - 1

Where 1 and 2 are versions 1 and 2 of your model respectively. Triton has a lot of neat things you can do with your model repository.

For example, you can load models from S3 or GCP, dynamically load models, and more. Your directory structure for this example should look like:

- models
  - fraud
    - 1
      - model.onnx

Step 3: Pull Triton

It’s easiest to start Triton from a Docker container. You can pull the pre-built container using:

docker pull nvcr.io/nvidia/tritonserver:21.12-py3

Step 4: Start Triton

With Triton pulled, you can now start the server:

docker run --rm -p8000:8000 -p8001:8001 -p8002:8002 -v/home/sean/blog/deployment_example/models:/models nvcr.io/nvidia/tritonserver:21.12-py3 tritonserver --model-repository=/models --strict-model-config=false

You should replace /home/sean/blog/deployment_example/models in the -v flag with the absolute path to where you created your model repository. The -p flags bind ports 8000, 8001, and 8002 to the Triton server. After a short wait, you should see:

+-------+---------+--------+
| Model | Version | Status |
+-------+---------+--------+
| fraud | 1       | READY  |
+-------+---------+--------+

You’re ready to make predictions!

Step 5: Make a prediction

Triton serves both an HTTP and GRPC endpoint which you can query to make predictions. To get a quick prediction from the HTTP endpoint, you can use the following script:

Mix.install([
  {:nx, "~> 0.1.0"},
  {:req, "~> 0.2.0"},
  {:jason, "~> 1.2"}
])

data = Nx.random_uniform({1, 30}) |> Nx.to_flat_list()

req_data = %{
  inputs: [
    %{
      name: "input_0",
      datatype: "FP32",
      shape: [1, 30],
      data: data
    }
  ]
}

Req.post!("http://localhost:8000/v2/models/fraud/infer", Jason.encode!(req_data)) |> IO.inspect()

Remember that the model you serialized has an input shape of {nil, 30}, so we generate a request which sends a single example with random inputs. FP32 indicates that the model expects {:f, 32} input types.

Triton supports requests in a plain-text format (like you see here) and a binary format. In practice you should probably use the binary format, but the plain-text format is shown here for simplicity.

If you run this script, you’ll see a response that looks something like:

%Req.Response{
  body: %{
    "model_name" => "fraud",
    "model_version" => "1",
    "outputs" => [
      %{
        "data" => [0.22510722279548645],
        "datatype" => "FP32",
        "name" => "dense_4",
        "shape" => [1, 1]
      }
    ]
  },
  ...
}

From here, you can parse the prediction back into an Nx tensor for further processing. You’ve successfully deployed a model using Elixir, ONNX, and NVIDIA Triton Inference Server!
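As a minimal sketch of that last step (assuming you bind the result of Req.post!/2 above to response):

# Pull the output data and shape out of the response body
%{"outputs" => [%{"data" => data, "shape" => shape}]} = response.body

prediction =
  data
  |> Nx.tensor(type: {:f, 32})
  |> Nx.reshape(List.to_tuple(shape))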

Conclusion

This post was meant to serve as an introduction to model deployment and model serving using Elixir and Axon. It’s impossible to cover every detail of model deployment in a single post, but I hope this post points you in the right direction, and gets you thinking about how you can start serving your trained Axon models for others to use.

In my next post, we’ll cover model evaluation and monitoring! Until then!
