Catching Fraud with Elixir and Axon

Sean Moriarity

Machine Learning Advisor

Introduction

Fraud and identity theft affect millions of people every year. Credit card fraud, in which an individual makes an unauthorized payment using a credit card, debit card, or other payment method, is just one common form.

There are around 1 billion credit card transactions every day—so fraud detection systems need to be able to process thousands of transactions per second. Additionally, the cost of fraud can be high. Downtime is not an option. Fraud detection systems must be built with scalability and fault tolerance in mind—both of which are strengths of Elixir as a programming language.

A fraud detection system also needs to be operationally reliable. It must correctly tag fraudulent transactions and avoid overflagging normal transactions or risk losing customer patience and trust. Given the sheer volume of transactions and the amount of data associated with each transaction, it would be impossible to detect fraudulent transactions by hand. Financial transactions come with a rich set of features and an abundance of examples of transactions. It is the perfect application of machine learning.

With Axon, you can marry the operational strengths of Elixir with the functional strengths of deep learning to design a scalable, fault-tolerant, and precise fraud detection system. In this post, you’ll learn how to use Axon to design a model to catch fraudulent transactions.

Installation

You’ll need to install Axon to create and train a model for fraud detection, EXLA for hardware acceleration, Nx for manipulating transaction data, and Explorer for parsing and slicing CSV data.

Mix.install([
  {:axon, "~> 0.1.0-dev", github: "elixir-nx/axon"},
  {:exla, "~> 0.1.0-dev", github: "elixir-nx/nx", sparse: "exla"},
  {:nx, "~> 0.1.0-dev", github: "elixir-nx/nx", sparse: "nx", override: true},
  {:explorer, "~> 0.1.0-dev", github: "elixir-nx/explorer"}
])

The Data

The data for this example can be downloaded on Kaggle. It consists of around 300,000 European card transactions, of which only around 500 are fraudulent. We’ll need to be cognizant of this massive imbalance in our data when designing and evaluating our model. If our model marked every transaction as legitimate, it would still achieve more than 99% accuracy!
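To make the imbalance concrete, the arithmetic behind that accuracy figure can be sketched in plain Elixir. The counts below are the rounded figures quoted above, not exact dataset statistics, and `BaselineSketch` is a hypothetical module name used only for illustration.

```elixir
defmodule BaselineSketch do
  # Accuracy of a "model" that labels every transaction as legitimate:
  # it is right on every legitimate example and wrong on every fraud.
  def always_legit_accuracy(total, fraud) do
    (total - fraud) / total
  end
end

# Rounded counts from the text (assumed, not exact).
accuracy = BaselineSketch.always_legit_accuracy(300_000, 500)
IO.puts("always-legit accuracy: #{Float.round(accuracy * 100, 2)}%")
```

A classifier that never flags anything scores well on accuracy while catching no fraud at all, which is why accuracy is a poor metric for this problem.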

In this dataset, the features were derived from the original transaction attributes using Principal Component Analysis (PCA) in order to anonymize and protect sensitive user information. In a real system, you would do this kind of feature extraction yourself from transaction attributes such as amount, location, and vendor. In an end-to-end system, you would start by using Explorer to conduct an Exploratory Data Analysis (EDA), and then determine an appropriate set of features for a model you want to test.

Start by downloading the data to a local directory and loading the CSV using Explorer:

df = Explorer.DataFrame.read_csv!("creditcard.csv", dtypes: [{"Time", :float}])

Now, split your data into a train set and a test set. Splitting into train and test sets is important to validate that your model does not overfit to specific examples. You should notice above that the data is ordered temporally—the time of transaction increases from the first example to the last.

By taking a fixed number of examples from the end of the dataset, you are marking all transactions before a certain time as train and all transactions after a certain time as test. This is important to note because it could be a form of bias in your dataset. If your train set time window is not sufficiently representative of transactions, you will end up with a model which performs poorly. In a real system, you’d likely want to extend this train window over the course of several days. During data analysis you would want to determine what “normal” data should look like, and ensure both your train and test set are representative of that “normal.”

num_examples = Explorer.DataFrame.n_rows(df)
num_train = ceil(0.85 * num_examples)
num_test = num_examples - num_train

train_df = Explorer.DataFrame.slice(df, 0, num_train)
test_df = Explorer.DataFrame.slice(df, num_train, num_test)

This code takes the first 85% of examples for training, and leaves the last 15% of examples for testing.
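As a quick sanity check, the split arithmetic can be run on its own. The row count below is an assumption (the commonly cited size of the Kaggle credit card fraud CSV, not stated in this post). With it, the test-set size works out to the same number that appears later when inspecting the shape of the test tensor.

```elixir
# Assumption: the Kaggle CSV has 284_807 rows (not stated in this post).
num_examples = 284_807

# Same arithmetic as above: first 85% for training, remainder for testing.
num_train = ceil(0.85 * num_examples)
num_test = num_examples - num_train

IO.inspect({num_train, num_test}, label: "train/test sizes")
```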

Next, you’ll need to split both train and test sets into sets of features and sets of targets. If you recall from my previous Axon article, Axon requires examples to consist of tuples of {features, targets} to train a model. Your target for this example is the Class column; all other columns are features:

x_train_df = Explorer.DataFrame.select(train_df, &(&1 == "Class"), :drop)
y_train_df = Explorer.DataFrame.select(train_df, &(&1 == "Class"), :keep)
x_test_df = Explorer.DataFrame.select(test_df, &(&1 == "Class"), :drop)
y_test_df = Explorer.DataFrame.select(test_df, &(&1 == "Class"), :keep)

Notice how each of your examples is currently an Explorer DataFrame. Axon doesn’t understand how to work with DataFrames. Instead, you need to convert your data into Nx tensors which can be passed into an Axon training loop.

to_tensor = fn df ->
  df
  |> Explorer.DataFrame.names()
  |> Enum.map(&(Explorer.Series.to_tensor(df[&1]) |> Nx.new_axis(-1)))
  |> Nx.concatenate(axis: 1)
end

The function above is a bit verbose, but it gets the job done. There is an active issue for adding a native to_tensor function for Explorer DataFrames. Contributions are welcome!
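If the shape manipulation in to_tensor is hard to picture, here is a rough plain-Elixir analogy that uses nested lists instead of tensors. The column names and values are hypothetical toy data. Each column is a list of per-example values, and joining the columns side by side gives one row of features per example.

```elixir
# Hypothetical toy columns standing in for DataFrame series.
columns = [
  # "Time"
  [0.0, 1.0, 2.0],
  # "V1"
  [-1.3, 1.1, -0.9],
  # "Amount"
  [149.6, 2.7, 378.7]
]

# Zipping the columns yields one row per example: 3 rows of 3 features,
# analogous to concatenating {n, 1} tensors along axis 1.
rows = columns |> Enum.zip() |> Enum.map(&Tuple.to_list/1)
IO.inspect(rows)
```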

x_train = to_tensor.(x_train_df)
y_train = to_tensor.(y_train_df)
x_test = to_tensor.(x_test_df)
y_test = to_tensor.(y_test_df)

You now have four large tensors representing the entirety of the train and test sets. Axon trains in minibatches, which means you pass some number of examples to a single training step, update the model, and move on to the next minibatch. Each minibatch is a tuple {features, targets}, where features is a batched tensor of example features and targets is a batched tensor of example targets. You must pass minibatches in a data structure that implements the Enumerable protocol. It’s most common to use a Stream to lazily load data into an Axon training loop.

batched_train_inputs = Nx.to_batched_list(x_train, 2048)
batched_train_targets = Nx.to_batched_list(y_train, 2048)
batched_train = Stream.zip(batched_train_inputs, batched_train_targets)

batched_test_inputs = Nx.to_batched_list(x_test, 2048)
batched_test_targets = Nx.to_batched_list(y_test, 2048)
batched_test = Stream.zip(batched_test_inputs, batched_test_targets)

batched_train and batched_test will lazily return the {features, targets} tuples required to train and evaluate your Axon model. Your data is now almost ready for training.
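The batching pattern itself does not depend on Nx. As a minimal sketch with hypothetical toy data, the same idea in plain Elixir looks like this. Here the final batch is simply shorter, whereas Nx.to_batched_list has its own handling of leftover examples.

```elixir
# Toy data: 5 examples with 2 features each, and 5 labels (all hypothetical).
features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
targets = [[0], [0], [1], [0], [0]]

batch_size = 2

# Chunk each list into minibatches, then pair them up lazily.
batches =
  Stream.zip(
    Enum.chunk_every(features, batch_size),
    Enum.chunk_every(targets, batch_size)
  )

# Each element is a {features, targets} tuple for one training step.
Enum.each(batches, fn {x, y} -> IO.inspect({x, y}) end)
```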

Before training, there is one final step needed to maximize the performance of your model. You’ll want to normalize the input data so that each column is on a common scale. You can achieve this in a number of ways. For this example, you’ll scale by dividing each feature by the max feature value in the training data, which maps nonnegative features into the range from zero to one:

train_max = Nx.reduce_max(x_train, axes: [0], keep_axes: true)

normalize = fn {batch, target} ->
  {Nx.divide(batch, train_max), target}
end

batched_train = batched_train |> Stream.map(&Nx.Defn.jit(normalize, [&1], compiler: EXLA))
batched_test = batched_test |> Stream.map(&Nx.Defn.jit(normalize, [&1], compiler: EXLA))
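To see what this scaling does, here is a small plain-Elixir sketch of the same idea on nested lists. `ScaleSketch` and the toy values are hypothetical. Note that the test rows are divided by the training maximum, so a test value can land above one if it exceeds anything seen in training.

```elixir
defmodule ScaleSketch do
  # Column-wise maximum over a list of rows.
  def column_max(rows) do
    rows
    |> Enum.zip()
    |> Enum.map(fn col -> col |> Tuple.to_list() |> Enum.max() end)
  end

  # Divide every feature by the corresponding training-set maximum.
  def scale(rows, maxes) do
    Enum.map(rows, fn row ->
      row |> Enum.zip(maxes) |> Enum.map(fn {value, m} -> value / m end)
    end)
  end
end

train = [[10.0, 2.0], [20.0, 4.0], [40.0, 8.0]]
test = [[30.0, 16.0]]

# The maximum is computed from the training data only.
train_max = ScaleSketch.column_max(train)

IO.inspect(ScaleSketch.scale(train, train_max), label: "scaled train")
IO.inspect(ScaleSketch.scale(test, train_max), label: "scaled test")
```

Using only training statistics keeps information about the test set from leaking into the preprocessing step.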

The Model

Fraud detection is a well-researched area of machine learning. There are entire books on the subject. There are a number of models you could choose from. In an end-to-end example, you would want to experiment with many different types of models such as simple regression, decision trees, neural networks, etc. In this example, you’ll just implement one such model. If you’d like to experiment with more, check out the new Scholar package which is intended to include a number of machine learning estimators that can be applied to problems such as this.

The fraud detection data you have is structured, meaning it’s in a tabular form where each value in a row represents some feature of a transaction. You can naturally train a small feed-forward neural network to classify transactions as fraudulent or legitimate. Your model should output a probability between zero and one, with probabilities closer to one indicating a fraudulent transaction. Because you have a relatively small input feature space, you can settle for relatively small intermediate layers. In this example, you can get away with three hidden layers, each with a hidden size (dimensionality) of 256. Feel free to experiment with an architecture different from the one seen here. For example, you might want to try different activations, hidden sizes, and dropout configurations.

While a neural network is a suitable model for this application, there are a few reasons you might not want to use a neural network in practice. For example, you might want a more interpretable model which can help you explain why certain transactions were marked as fraudulent. Additionally, you might find that simpler models achieve comparable performance with less compute. All of these factors would need to be considered in an end-to-end example where you train multiple types of models for the same problem. Choosing the correct model requires reasoning about functional and operational requirements of the overall system.

model =
  Axon.input({nil, 30})
  |> Axon.dense(256)
  |> Axon.relu()
  |> Axon.dense(256)
  |> Axon.relu()
  |> Axon.dropout(rate: 0.3)
  |> Axon.dense(256)
  |> Axon.relu()
  |> Axon.dropout(rate: 0.3)
  |> Axon.dense(1)
  |> Axon.sigmoid()

Your model will look like this:

------------------------------------------------------------------------------------------------------
                                                Model
======================================================================================================
 Layer                                Shape        Policy              Parameters   Parameters Memory
======================================================================================================
 input_0 ( input )                    {nil, 30}    p=f32 c=f32 o=f32   0            0 bytes
 dense_0 ( dense[ "input_0" ] )       {nil, 256}   p=f32 c=f32 o=f32   7936         31744 bytes
 relu_0 ( relu[ "dense_0" ] )         {nil, 256}   p=f32 c=f32 o=f32   0            0 bytes
 dense_1 ( dense[ "relu_0" ] )        {nil, 256}   p=f32 c=f32 o=f32   65792        263168 bytes
 relu_1 ( relu[ "dense_1" ] )         {nil, 256}   p=f32 c=f32 o=f32   0            0 bytes
 dropout_0 ( dropout[ "relu_1" ] )    {nil, 256}   p=f32 c=f32 o=f32   0            0 bytes
 dense_2 ( dense[ "dropout_0" ] )     {nil, 256}   p=f32 c=f32 o=f32   65792        263168 bytes
 relu_2 ( relu[ "dense_2" ] )         {nil, 256}   p=f32 c=f32 o=f32   0            0 bytes
 dropout_1 ( dropout[ "relu_2" ] )    {nil, 256}   p=f32 c=f32 o=f32   0            0 bytes
 dense_3 ( dense[ "dropout_1" ] )     {nil, 1}     p=f32 c=f32 o=f32   257          1028 bytes
 sigmoid_0 ( sigmoid[ "dense_3" ] )   {nil, 1}     p=f32 c=f32 o=f32   0            0 bytes
------------------------------------------------------------------------------------------------------
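The parameter counts in the summary can be checked by hand: a dense layer has one weight for every input-output pair plus one bias per output. The following plain-Elixir sketch (with a hypothetical `ParamSketch` module) reproduces the numbers in the table.

```elixir
defmodule ParamSketch do
  # A dense layer has one weight per input/output pair plus one bias per output.
  def dense_params(inputs, outputs), do: inputs * outputs + outputs
end

# {inputs, outputs} for dense_0 through dense_3 in the model above.
layers = [{30, 256}, {256, 256}, {256, 256}, {256, 1}]
counts = Enum.map(layers, fn {i, o} -> ParamSketch.dense_params(i, o) end)

IO.inspect(counts, label: "parameters per dense layer")
IO.inspect(Enum.sum(counts), label: "total parameters")

# f32 parameters take 4 bytes each.
IO.inspect(Enum.map(counts, &(&1 * 4)), label: "bytes per dense layer")
```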

Training the Model

With your data prepped for training and your model defined, it’s time to train. Recall that your data is incredibly imbalanced, which means you need to account for this imbalance when updating the model. You need to penalize the model for missing fraudulent transactions. You can achieve this penalty in Axon using class weights.

In a problem where you have a balanced dataset where the number of examples per class is equal, you’d want to update your model parameters equally per class. For example, if you are classifying images of cats versus dogs with equal numbers of both classes, you’d want to update your model for failing to classify a picture of a cat proportional to how you’d update your model for failing to classify a picture of a dog.

With an imbalanced dataset, you want your updates to account for the overall data distribution. You want to tell your model to really pay attention to low-density classes because they are much more important than common classes. Axon loss functions such as binary_cross_entropy/3 allow you to pass weight parameters that specify the importance or weight of each class. A common way to specify these weights is to make them inversely proportional to the number of occurrences of each class in the training set. In this example, you should count both positive and negative occurrences and specify the weights accordingly:

fraud = Nx.sum(y_train) |> Nx.to_number()
legit = Nx.size(y_train) - fraud

loss =
  &Axon.Losses.binary_cross_entropy(
    &1,
    &2,
    negative_weight: 1 / legit,
    positive_weight: 1 / fraud,
    reduction: :mean
  )

Axon’s training loop will accept an arity-2 function as a loss function, which means you can parameterize loss functions on target-prediction pairs. Notice how the positive weight will be much larger than the negative weight because the number of fraudulent transactions is far smaller than the number of legitimate transactions. This will ensure that the penalty for incorrectly classifying a fraudulent transaction will be much larger than for a legitimate transaction.
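To get a feel for the size of that penalty, here is a minimal plain-Elixir sketch of a weighted binary cross-entropy for a single example. This is an illustrative formula under stated assumptions, not Axon's actual implementation, and the class counts are rounded figures chosen for the example. `WeightedLossSketch` is a hypothetical name.

```elixir
defmodule WeightedLossSketch do
  # Weighted binary cross-entropy for one example (illustrative formula;
  # Axon's internal implementation may differ in detail).
  def bce(y_true, y_pred, pos_weight, neg_weight) do
    # Clamp the prediction to avoid log(0).
    eps = 1.0e-7
    p = y_pred |> max(eps) |> min(1.0 - eps)

    -(pos_weight * y_true * :math.log(p) +
        neg_weight * (1 - y_true) * :math.log(1.0 - p))
  end
end

# Rounded class counts (assumed for illustration).
fraud = 500
legit = 299_500

# The model is equally wrong in both cases: it assigns probability 0.1
# to the true class.
missed_fraud = WeightedLossSketch.bce(1, 0.1, 1 / fraud, 1 / legit)
false_alarm = WeightedLossSketch.bce(0, 0.9, 1 / fraud, 1 / legit)

IO.inspect(missed_fraud / false_alarm, label: "penalty ratio")
```

With these assumed counts, an equally confident mistake costs roughly legit / fraud times more on a fraudulent example than on a legitimate one.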

Next, you’ll need to define an optimizer. In this example, you will use the adam optimizer with a learning rate of 0.001. The choice of optimizer and learning rate is somewhat arbitrary, though adam typically achieves decent performance. Feel free to experiment with different optimizer and learning rate configurations to see which one yields the best performance!

optimizer = Axon.Optimizers.adam(1.0e-3)

Finally, you’ll need to define and run the training loop. You’ll want to track metrics, but accuracy doesn’t make sense in this case. Remember that just classifying all transactions as legitimate will result in greater than 99% accuracy. Instead, you’ll want to keep track of precision and recall. Precision measures the proportion of positive (fraudulent) classifications that were actually correct. More concretely, precision answers the question: How often is the model correct when it says a transaction is fraudulent? Recall measures the proportion of actual positive (fraudulent) transactions that were identified correctly. More concretely, recall answers the question: How often did the model catch a fraudulent transaction?
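The two definitions can be sketched in plain Elixir on a handful of made-up predictions. The module name, the 0.5 threshold, and the toy data are all assumptions for illustration, and the functions do not guard against division by zero when there are no positive predictions or no positive labels.

```elixir
defmodule MetricSketch do
  # Tally the four outcomes after thresholding predicted probabilities.
  def confusion(probs, labels, threshold \\ 0.5) do
    start = %{true_pos: 0, false_pos: 0, false_neg: 0, true_neg: 0}

    probs
    |> Enum.zip(labels)
    |> Enum.reduce(start, fn {p, y}, acc ->
      predicted = if p > threshold, do: 1, else: 0

      key =
        case {predicted, y} do
          {1, 1} -> :true_pos
          {1, 0} -> :false_pos
          {0, 1} -> :false_neg
          {0, 0} -> :true_neg
        end

      Map.update!(acc, key, &(&1 + 1))
    end)
  end

  # Of everything flagged as fraud, how much really was fraud?
  def precision(%{true_pos: tp, false_pos: fp}), do: tp / (tp + fp)

  # Of all real fraud, how much was flagged?
  def recall(%{true_pos: tp, false_neg: fneg}), do: tp / (tp + fneg)
end

probs = [0.9, 0.8, 0.2, 0.7, 0.1, 0.3, 0.6]
labels = [1, 0, 1, 1, 0, 0, 0]

counts = MetricSketch.confusion(probs, labels)
IO.inspect(counts)
IO.inspect(MetricSketch.precision(counts), label: "precision")
IO.inspect(MetricSketch.recall(counts), label: "recall")
```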

Axon supports precision and recall out of the box, so you can track them without issue in your training loop. To define the loop, start with the Axon.Loop.trainer/3 factory method, instrument the loop with metrics, and then call Axon.Loop.run/3. Feel free to adjust some of the loop parameters such as the number of epochs!

model_state =
  model
  |> Axon.Loop.trainer(loss, optimizer)
  |> Axon.Loop.metric(:precision)
  |> Axon.Loop.metric(:recall)
  |> Axon.Loop.run(batched_train, epochs: 30, compiler: EXLA)

During training, you will see:

Epoch: 0, Batch: 100, loss: 0.0000036 precision: 0.0453421 recall: 0.6534294
Epoch: 1, Batch: 100, loss: 0.0000025 precision: 0.0855192 recall: 0.7102020
Epoch: 2, Batch: 100, loss: 0.0000021 precision: 0.0726182 recall: 0.7330216
Epoch: 3, Batch: 100, loss: 0.0000018 precision: 0.0694896 recall: 0.7416568
Epoch: 4, Batch: 100, loss: 0.0000017 precision: 0.0664449 recall: 0.7444073
Epoch: 5, Batch: 100, loss: 0.0000016 precision: 0.0666172 recall: 0.7664092
Epoch: 6, Batch: 100, loss: 0.0000015 precision: 0.0627265 recall: 0.7720198
Epoch: 7, Batch: 100, loss: 0.0000015 precision: 0.0596212 recall: 0.7724599
Epoch: 8, Batch: 100, loss: 0.0000014 precision: 0.0618357 recall: 0.7760903
Epoch: 9, Batch: 100, loss: 0.0000014 precision: 0.0609748 recall: 0.7772452
Epoch: 10, Batch: 100, loss: 0.0000013 precision: 0.0494180 recall: 0.7753478
Epoch: 11, Batch: 100, loss: 0.0000013 precision: 0.0606631 recall: 0.7859913
Epoch: 12, Batch: 100, loss: 0.0000013 precision: 0.0624361 recall: 0.7868164
Epoch: 13, Batch: 100, loss: 0.0000012 precision: 0.0637707 recall: 0.8099188
Epoch: 14, Batch: 100, loss: 0.0000012 precision: 0.0655236 recall: 0.7998114
Epoch: 15, Batch: 100, loss: 0.0000012 precision: 0.0634568 recall: 0.8133428
Epoch: 16, Batch: 100, loss: 0.0000011 precision: 0.0657043 recall: 0.8067421
Epoch: 17, Batch: 100, loss: 0.0000011 precision: 0.0661718 recall: 0.7993163
Epoch: 18, Batch: 100, loss: 0.0000011 precision: 0.0702126 recall: 0.8174682
Epoch: 19, Batch: 100, loss: 0.0000011 precision: 0.0699272 recall: 0.8059170
Epoch: 20, Batch: 100, loss: 0.0000011 precision: 0.0635841 recall: 0.8119520
Epoch: 21, Batch: 100, loss: 0.0000010 precision: 0.0750699 recall: 0.8186468
Epoch: 22, Batch: 100, loss: 0.0000010 precision: 0.0725016 recall: 0.8114567
Epoch: 23, Batch: 100, loss: 0.0000010 precision: 0.0740755 recall: 0.8031117
Epoch: 24, Batch: 100, loss: 0.0000010 precision: 0.0740027 recall: 0.8257190
Epoch: 25, Batch: 100, loss: 0.0000010 precision: 0.0777240 recall: 0.8260726
Epoch: 26, Batch: 100, loss: 0.0000009 precision: 0.0769449 recall: 0.8318481
Epoch: 27, Batch: 100, loss: 0.0000009 precision: 0.0477672 recall: 0.8093942
Epoch: 28, Batch: 100, loss: 0.0000009 precision: 0.0697432 recall: 0.8230080
Epoch: 29, Batch: 100, loss: 0.0000009 precision: 0.0677453 recall: 0.8194719

The training loop returns a model state which can be used to make predictions and evaluate the model. You can see the model performs decently well during training, but we really only care about performance on the test set, so let’s see how our model does!

Evaluating the Model

First, let’s see how many examples are in the test set by inspecting the shape of our test tensor:

Nx.shape(y_test)

Which will show:

{42721, 1}

Overall we have 42721 transactions. We can calculate how many are fraudulent by computing the sum of labels in the test set because fraudulent transactions have a class label of 1:

Nx.sum(y_test)

Which will show:

#Nx.Tensor<
  s64
  52
>

Overall, there are 52 fraudulent transactions. Now, let’s see how well our model performs at detecting these fraudulent transactions. We can track the raw metrics which go into precision and recall and map those to real-life metrics. For example, in this evaluation loop, you can mark true positives as fraudulent transactions detected, true negatives as legitimate transactions accepted, false positives as legitimate transactions declined, and false negatives as fraudulent transactions accepted:

final_metrics =
  model
  |> Axon.Loop.evaluator(model_state)
  |> Axon.Loop.metric(:true_positives, "fraud_declined", :running_sum)
  |> Axon.Loop.metric(:true_negatives, "legit_accepted", :running_sum)
  |> Axon.Loop.metric(:false_positives, "legit_declined", :running_sum)
  |> Axon.Loop.metric(:false_negatives, "fraud_accepted", :running_sum)
  |> Axon.Loop.run(batched_test, compiler: EXLA)

Running the cell will show:

Batch: 20, fraud_accepted: 9 fraud_declined: 43 legit_accepted: 42214 legit_declined: 742

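So overall our model correctly declined 43 fraudulent transactions and incorrectly declined 742 legitimate ones. It accepted 9 fraudulent transactions and 42214 legitimate transactions. The model declined about 2% of the legitimate transactions it evaluated. Note that these four counts sum to 43008 (21 batches of 2048) rather than 42721, which suggests the final partial batch was padded with repeated examples, so the legitimate counts are likely slightly inflated. These metrics aren’t bad, but there’s definitely room for improvement. See if you can tweak the model or training process to achieve better performance.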
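For reference, the raw counts can be turned back into rates with a few lines of plain Elixir. The counts are copied from the evaluation output above, and the variable names are just for illustration.

```elixir
# Counts copied from the evaluation output above.
counts = %{
  fraud_declined: 43,
  fraud_accepted: 9,
  legit_accepted: 42_214,
  legit_declined: 742
}

# Precision: share of declined transactions that were really fraud.
precision = counts.fraud_declined / (counts.fraud_declined + counts.legit_declined)

# Recall: share of fraudulent transactions that were declined.
recall = counts.fraud_declined / (counts.fraud_declined + counts.fraud_accepted)

# False positive rate: share of legitimate transactions that were declined.
false_positive_rate =
  counts.legit_declined / (counts.legit_declined + counts.legit_accepted)

IO.puts("precision: #{Float.round(precision, 4)}")
IO.puts("recall: #{Float.round(recall, 4)}")
IO.puts("false positive rate: #{Float.round(false_positive_rate, 4)}")
```

Recall is high while precision is low: the model catches most fraud, but most of the transactions it declines are legitimate.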

Moving Forward

In this post, you learned how to build an algorithm that identifies fraudulent credit card transactions. With a trained model, you could opt to expose the model’s predictions as a service in your system and provide real-time decisions on transactions. There are a number of architectural decisions which go into building a real-time machine learning system. Stay tuned for next month to see what those considerations are, and what makes Elixir a great choice for building real-time machine learning. Until next time :)
