Join the Elixir FashionMNIST Challenge

Introduction

The Elixir machine learning ecosystem is growing rapidly. A long period of foundation-building is bearing fruit as our efforts with Nx and its dependent libraries stabilize. Additionally, our high-level libraries such as Bumblebee make it easier than ever to work with state-of-the-art machine learning models in Elixir. Now, it’s time for you to get in on the fun with the Elixir FashionMNIST Challenge!

The FashionMNIST Challenge was started by Scott Mueller. In his own words:

“I’ve published a challenge to the Elixir community to collectively iterate until we have State of the Art (SOTA) accuracy on the FashionMNIST dataset. FashionMNIST is a more difficult dataset than the MNIST dataset. However, it hasn’t been explored as heavily as other datasets. In December 2022, Jeremy Howard had a SOTA 5 epoch accuracy of 92.7%. While it might seem like a small data problem, by striving for SOTA, we’ll be learning techniques that can improve training for other larger image datasets. Additionally, some of those techniques are used in other domains like large language models.”

This challenge is an opportunity for the Elixir community to come together and learn from each other about how to iterate on machine learning solutions in Elixir. Scott kicked the challenge off with his own submission, which achieves 87.4% accuracy on the FashionMNIST dataset. This leaves plenty of room for community improvement. In this post, I’ll create my own submission to give you some ideas on how to approach this problem and potentially win the challenge!

Getting Started

Before beginning, we should probably take a closer look at what exactly FashionMNIST is, and what it means to achieve SOTA accuracy. If you’re familiar with machine learning, you might have heard of the MNIST dataset. The MNIST dataset is a collection of handwritten digits encoded as grayscale images. It’s often used as a “Hello World” for programmers learning machine learning.

FashionMNIST is a dataset designed in the spirit of MNIST, but rather than digits, the images are of items of clothing taken from Zalando’s article images. Like MNIST, the images are 28x28 grayscale. The training set consists of 60,000 images across 10 classes with the following labels:

0 T-shirt/top
1 Trouser
2 Pullover
3 Dress
4 Coat
5 Sandal
6 Shirt
7 Sneaker
8 Bag
9 Ankle boot
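
As a small illustration, the label table above can be written as a lookup map for turning numeric labels (or, later, model predictions) back into readable names. The `class_names` and `sample_labels` names are made up for this sketch; the four sample labels are the first four training labels visible in the Scidata output later in this post.

```elixir
# Class names indexed by label, copied from the table above
class_names = %{
  0 => "T-shirt/top",
  1 => "Trouser",
  2 => "Pullover",
  3 => "Dress",
  4 => "Coat",
  5 => "Sandal",
  6 => "Shirt",
  7 => "Sneaker",
  8 => "Bag",
  9 => "Ankle boot"
}

# The first four training labels as they appear in the raw label binary
sample_labels = [9, 0, 0, 3]

# Map.fetch!/2 raises on an unknown label instead of silently returning nil
named = Enum.map(sample_labels, &Map.fetch!(class_names, &1))
IO.inspect(named)
# prints ["Ankle boot", "T-shirt/top", "T-shirt/top", "Dress"]
```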

The goal of this challenge is to train a model that accurately classifies test images into one of those categories. FashionMNIST strikes a great balance for beginners because it’s a small dataset, which means you can’t win just by throwing more GPUs at the problem, and it’s diverse enough to be difficult.

The dataset also comes with 10,000 test images, which gives you an easy way to benchmark the results. Finally, with the reproducibility of Livebook, it’s easy for everyone to verify your score and see different ways to approach the problem.

Now let’s get into the challenge!

The Data

Before beginning, you’ll need to install some dependencies. For now, you’ll just need Axon, Nx, EXLA, and Scidata:

Mix.install([
  {:nx, "~> 0.5"},
  {:axon, "~> 0.5"},
  {:exla, "~> 0.5"},
  {:scidata, "~> 0.1"}
])

Next, you can use Scidata to grab the data:

{train_images, train_labels} = Scidata.FashionMNIST.download()

{{<<0, 0, 0, 0, ...>>, {:u, 8}, {60000, 1, 28, 28}},
 {<<9, 0, 0, 3, ...>>, {:u, 8}, {60000}}}

Next, you’ll want to convert both images and labels into tensors so you can use them in Axon. You’ll also want to normalize the images. Normalization is the process of changing data in your dataset to use a common scale. This is often an important step in achieving training stability in machine learning problems. You can convert the images using the following code:

{images_binary, images_type, _} = train_images

images =
  images_binary
  |> Nx.from_binary(images_type)
  |> Nx.reshape({:auto, 28, 28, 1})
  |> Nx.divide(255)
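
To see what the division does, here is a minimal sketch on three made-up pixel values. It assumes Nx is already installed in the notebook (for example through the `Mix.install` call above); `raw` and `scaled` are illustrative names only.

```elixir
# Three hypothetical 8-bit pixel intensities: black, dark gray, white
raw = Nx.tensor([0, 51, 255], type: :u8)

# Dividing by 255 maps the 0..255 integer range onto 0.0..1.0
# and promotes the unsigned integers to floats
scaled = Nx.divide(raw, 255)

IO.inspect(Nx.type(scaled))
IO.inspect(Nx.to_flat_list(scaled))
```

Every pixel now lies between 0 and 1, which is the common scale the rest of the pipeline works with.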

Next, you’ll want to convert the labels to a one-hot encoded tensor. One-hot encoding takes a numeric label, in this case a label from 0-9, and converts it into a vector. The vector essentially represents a table where each column maps to one class and the values of each column map to whether the input is a member of that class. For example, the label 3 would map to the vector [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]. You can convert your labels to one-hot encoded versions using this code:

{labels_binary, labels_type, _shape} = train_labels

labels =
  labels_binary
  |> Nx.from_binary(labels_type)
  |> Nx.new_axis(-1)
  |> Nx.equal(Nx.tensor(Enum.to_list(0..9)))

#Nx.Tensor<
  u8[60000][10]
  [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    ...
  ]
>
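
The encoding relies on broadcasting, which is easier to see on a toy input. The sketch below uses two made-up labels and assumes Nx is available in the notebook; `toy_labels`, `classes`, `one_hot`, and `recovered` are illustrative names.

```elixir
# Two hypothetical labels: a Dress (3) and a T-shirt/top (0)
toy_labels = Nx.tensor([3, 0], type: :u8)
classes = Nx.tensor(Enum.to_list(0..9))

# new_axis turns shape {2} into {2, 1}; comparing against shape {10}
# broadcasts to {2, 10}, with a 1 wherever label == class index
one_hot =
  toy_labels
  |> Nx.new_axis(-1)
  |> Nx.equal(classes)

IO.inspect(Nx.shape(one_hot))
# prints {2, 10}
IO.inspect(Nx.to_flat_list(one_hot[0]))
# prints [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

# argmax along the last axis recovers the original numeric labels
recovered = Nx.argmax(one_hot, axis: -1)
IO.inspect(Nx.to_flat_list(recovered))
# prints [3, 0]
```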

And just like that, you have tensors for images and labels and you’re ready to train! As you experiment in Livebook, I recommend forking new sections from this one so you don’t have to repeat this process over and over again. Now, let’s get into training.

Creating an Input Pipeline

When approaching a challenge like this there are three main places you can focus on experimenting with to improve your score: the data, the model, and the training process. The first area is the data.

When it comes to data, the best way to improve model performance is often to just collect more high-quality samples. With this challenge, your dataset is static, but you have options. A common approach with images is to use data augmentation to artificially increase the size of the dataset. Data augmentation applies transformations to input images to make your model robust against changes in color, orientation, and other features. The label of an image is often unaffected by such transformations. This is a fancy way of stating the obvious: An image of a cat is still an image of a cat whether you flip it or rotate it.

To keep it simple, we’ll apply a single transformation to the inputs: We’ll flip a fraction of the images from left to right. You can achieve this effect with:

seed = 42
batch_size = 32

{batched_images, _} =
  images
  |> Nx.to_batched(batch_size)
  |> Enum.map_reduce(Nx.Random.key(seed), fn batch, key ->
    fun =
      Nx.Defn.jit(
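        fn regular, key ->
          # One scalar draw in [0, 1): the whole batch is either flipped or kept
          {mask, key} = Nx.Random.uniform(key)
          # Layout is {batch, height, width, channels}, so axis 2 is width;
          # reversing it mirrors the images left to right
          flipped = Nx.reverse(regular, axes: [2])
          augmented = Nx.select(Nx.greater(mask, 0.5), flipped, regular)
          {augmented, key}
        end,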
        compiler: EXLA
      )

    fun.(batch, key)
  end)

batched_labels =
  labels
  |> Nx.to_batched(batch_size)
  |> Enum.to_list()

[
  #Nx.Tensor<
    u8[32][10]
    [
      [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
      [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
      ...
    ]
  >,
  #Nx.Tensor<
    u8[32][10]
    [
      [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
      [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
      ...
    ]
  >,
  ...
]

This code flips a whole batch at a time: a single random draw per batch decides whether every image in that batch is flipped. Note that this augmentation is suboptimal, as the same batches are flipped on every epoch; however, it’s better than nothing. For your implementation, you might prefer to flip images individually, apply more augmentations, or augment in a way such that successive training epochs are not the same. For now, we’ll keep this implementation simple. Now you’re ready to move on to the model.
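
If you want to flip images individually rather than batch by batch, one possible sketch is shown below. It is not part of the submission in this post: `flip_per_image` is a hypothetical helper, the toy batch is made up, and the code assumes images laid out as {batch, height, width, channels} with Nx available in the notebook.

```elixir
# Hypothetical helper: mirror each image left to right with probability 0.5
flip_per_image = fn batch, key ->
  n = Nx.axis_size(batch, 0)

  # One uniform draw in [0, 1) per image rather than one per batch
  {draws, key} = Nx.Random.uniform(key, 0.0, 1.0, shape: {n, 1, 1, 1})

  # Expand the per-image decision to the full image shape
  mask = draws |> Nx.greater(0.5) |> Nx.broadcast(Nx.shape(batch))

  # Axis 2 is width, so reversing it mirrors left to right
  flipped = Nx.reverse(batch, axes: [2])

  {Nx.select(mask, flipped, batch), key}
end

# Toy batch of four 2x2 single-channel "images"
batch = Nx.iota({4, 2, 2, 1}, type: :f32)
{augmented, next_key} = flip_per_image.(batch, Nx.Random.key(42))

IO.inspect(Nx.shape(augmented))
# prints {4, 2, 2, 1}
```

Because the helper returns a new key, calling it again with `next_key` draws a different pattern of flips, which is one way to make successive epochs see different augmentations.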

Implementing the Model

With your data ready, it’s time to implement a model. Scott’s implementation keeps it simple with a basic feed-forward network. One improvement we can make is to use a convolutional neural network (CNN). CNNs are common for image classification problems, and often significantly outperform plain feed-forward networks. We’ll keep the model relatively small for now:

model =
  Axon.input("features")
  |> Axon.conv(32, kernel_size: {3, 3}, padding: :same, activation: :relu)
  |> Axon.max_pool(kernel_size: 2)
  |> Axon.conv(64, kernel_size: {3, 3}, padding: :same, activation: :relu)
  |> Axon.max_pool(kernel_size: 2)
  |> Axon.flatten()
  |> Axon.dense(128, activation: :relu)
  |> Axon.dense(10, activation: :softmax)

The model is a simple CNN with two convolutional blocks with max pooling applied. This is a very, very simple CNN, which means you have plenty of room to improve this model as well! You might experiment with adding dropout, using normalization, changing activations, etc. There are plenty of approaches to experiment with!
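
As one possible starting point for those experiments, here is a sketch of the same architecture with batch normalization after each convolution and dropout before the output layer. It is untested against the challenge and makes no claim to score better; `regularized_model` is a made-up name, the dropout rate of 0.5 is an arbitrary choice, and the code assumes Axon is installed in the notebook.

```elixir
# Variant of the CNN above: conv -> batch norm -> relu -> pool, twice,
# followed by a dense head with dropout (illustrative, not benchmarked)
regularized_model =
  Axon.input("features", shape: {nil, 28, 28, 1})
  |> Axon.conv(32, kernel_size: {3, 3}, padding: :same)
  |> Axon.batch_norm()
  |> Axon.relu()
  |> Axon.max_pool(kernel_size: 2)
  |> Axon.conv(64, kernel_size: {3, 3}, padding: :same)
  |> Axon.batch_norm()
  |> Axon.relu()
  |> Axon.max_pool(kernel_size: 2)
  |> Axon.flatten()
  |> Axon.dense(128, activation: :relu)
  # Randomly zeroes half of the activations during training only
  |> Axon.dropout(rate: 0.5)
  |> Axon.dense(10, activation: :softmax)
```

Swapping this in for `model` leaves the rest of the notebook unchanged, so it is easy to compare the two under the same training setup.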

The Training Process

The last area where you can improve your model is the training process. This could mean changing the optimizer, introducing additional regularization techniques, tuning hyperparameters such as the number of training epochs, and more. For now, we’ll just introduce a simple learning rate schedule in place of the default static learning rate used by the Adam optimizer.

Learning rate schedules anneal the learning rate throughout the training process. For example, your learning rate might start at 0.01, but drop by some factor every step or every number of steps.
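
To make the idea concrete, here is a plain-Elixir sketch of exponential decay, using the common form `init_value * rate ^ (step / transition_steps)`. This is an illustration of the general technique and is not taken from Axon’s source, so treat the exact formula as an assumption; `decayed_rate` and the numbers passed to it are made up for the example.

```elixir
# Hypothetical helper: learning rate after `step` optimizer steps
decayed_rate = fn step, init_value, rate, transition_steps ->
  init_value * :math.pow(rate, step / transition_steps)
end

# Start at 0.01 and halve every 1,000 steps
rates = for step <- [0, 1000, 2000], do: decayed_rate.(step, 0.01, 0.5, 1000)
IO.inspect(rates)
# prints [0.01, 0.005, 0.0025]
```

Between those checkpoints the rate falls smoothly, so early steps take large updates while later steps make finer adjustments.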

training_seed = 42
epochs = 5
# Initial learning rate for the decay schedule below
initial_learning_rate = 5.0e-3

schedule =
  Axon.Schedules.exponential_decay(
    initial_learning_rate,
    transition_steps: 1850,
    decay_rate: 0.5
  )

optimizer = Axon.Optimizers.adam(schedule)

trained_model_state =
  model
  |> Axon.Loop.trainer(:categorical_cross_entropy, optimizer)
  |> Axon.Loop.metric(:accuracy)
  |> Axon.Loop.run(Stream.zip(batched_images, batched_labels), %{},
    epochs: epochs,
    compiler: EXLA,
    seed: training_seed
  )

17:31:58.556 [debug] Forwarding options: [compiler: EXLA, seed: 42] to JIT compiler
Epoch: 0, Batch: 1850, accuracy: 0.8431928 loss: 0.4290566
Epoch: 1, Batch: 1825, accuracy: 0.9026046 loss: 0.3462997
Epoch: 2, Batch: 1850, accuracy: 0.9271171 loss: 0.2964030
Epoch: 3, Batch: 1825, accuracy: 0.9391943 loss: 0.2632902
Epoch: 4, Batch: 1850, accuracy: 0.9481869 loss: 0.2386037

Evaluating the Model

Now that your model’s trained, you can evaluate it against the test set to get your true score for the Elixir FashionMNIST Challenge. First, download the test set:

{test_images, test_labels} = Scidata.FashionMNIST.download_test()

{{<<0, 0, 0, 0, ...>>, {:u, 8}, {10000, 1, 28, 28}},
 {<<9, 2, 1, 1, ...>>, {:u, 8}, {10000}}}

Normalize and one-hot encode the data in the same way you did your training tensors:

# Take the element types from the test data itself rather than
# reusing the bindings from the training set
{test_images_binary, test_images_type, _} = test_images

test_images =
  test_images_binary
  |> Nx.from_binary(test_images_type)
  |> Nx.reshape({:auto, 28, 28, 1})
  |> Nx.divide(255)

{test_labels_binary, test_labels_type, _} = test_labels

test_labels =
  test_labels_binary
  |> Nx.from_binary(test_labels_type)
  |> Nx.new_axis(-1)
  |> Nx.equal(Nx.tensor(Enum.to_list(0..9)))

#Nx.Tensor<
  u8[10000][10]
  [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    ...
  ]
>

Finally, batch the images and labels and pass them through an Axon evaluation loop:

test_batched_images = Nx.to_batched(test_images, batch_size)
test_batched_labels = Nx.to_batched(test_labels, batch_size)

model
|> Axon.Loop.evaluator()
|> Axon.Loop.metric(:accuracy)
|> Axon.Loop.run(Stream.zip(test_batched_images, test_batched_labels), trained_model_state,
  compiler: EXLA
)

16:52:42.596 [debug] Forwarding options: [compiler: EXLA] to JIT compiler
Batch: 1874, accuracy: 0.9261500

And just like that, we get to 90.7% accuracy! Nice! There are still a lot of improvements we can make to this approach. I strongly encourage you to dive deep into different approaches and models you can apply to this problem. Even if you don’t top the leaderboard, share your result and approach with others! If nothing else, you might help someone else understand the problem better, or come up with a novel way to approach it. If you’re looking for a resource to learn more about creating and training models with Axon, check out my new book Machine Learning in Elixir.

Until next time!