Introducing EXGBoost: Gradient Boosting in Elixir

Introduction

Over the past 2.5 years, we've worked hard to significantly expand Elixir's machine learning capabilities. The Nx project makes it possible for Elixir programmers to implement efficient numeric algorithms directly in Elixir. Nx is a library for array-based programming. That means it's suitable for implementing things like neural networks, some traditional machine learning algorithms, and even applications using ordinary differential equations.

A number of other projects have also sprung out of the Elixir Nx efforts including:

Livebook - Interactive code notebooks for Elixir
Explorer - Dataset analysis and exploration
Bumblebee - Pretrained machine learning models and servings
VegaLite - VegaLite bindings for rich visualizations

The Elixir machine learning ecosystem is slowly closing the capabilities gap between itself and Python. One of the biggest remaining gaps I wrote about in my post comparing Elixir and Python for Data Science was the lack of a library for implementing decision trees in Elixir. Today, I am excited to introduce a library that fills that void: EXGBoost.

EXGBoost provides bindings to the popular XGBoost library. In the words of the documentation:

Xtreme Gradient Boosting (XGBoost) is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It implements machine learning algorithms under the Gradient Boosting framework.

EXGBoost brings a critical missing piece to the table for developers turning to Elixir for machine learning. In this post, we'll explore the EXGBoost API, and discuss why it's such an important addition to the Elixir Nx Ecosystem.

What is Gradient Boosting?

Gradient boosting is a popular machine learning approach for both regression and classification tasks. It is an ensemble method, meaning the final model is composed of many smaller models, and in this case those smaller models are weak classifiers. A weak classifier is quite literally a classifier that barely performs better than random guessing. You can think of gradient boosting as combining predictions from a lot of bad models to produce a single strong prediction.

The weak classifier most often used in gradient boosting is the decision tree. A decision tree is a type of model that learns branches of logic from a training set to classify inputs into one of a number of outputs. Decision trees are easily interpretable models. The trained classifier is quite literally a logical decision tree in which each input variable maps to decision branches and eventual outputs.
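To make the "logical decision tree" idea concrete, here is a minimal sketch in plain Elixir. The TinyTree module, the tree itself, and its thresholds are invented for illustration; nothing here is learned from data or part of EXGBoost.

```elixir
defmodule TinyTree do
  # A leaf holds a prediction. A split sends the input left when the
  # chosen feature is below the threshold, and right otherwise.
  def predict({:leaf, value}, _input), do: value

  def predict({:split, feature, threshold, left, right}, input) do
    if Map.fetch!(input, feature) < threshold do
      predict(left, input)
    else
      predict(right, input)
    end
  end
end

# Hypothetical hand-written tree; thresholds and labels are made up.
tree =
  {:split, :carat, 1.0,
   {:leaf, :cheap},
   {:split, :depth, 60.0, {:leaf, :mid}, {:leaf, :expensive}}}

small = TinyTree.predict(tree, %{carat: 0.5, depth: 62.0})
large = TinyTree.predict(tree, %{carat: 1.5, depth: 62.0})
```

Reading the tree top to bottom gives the exact rule behind each prediction, which is what makes these models easy to interpret.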

Gradient boosting trains many small decision trees in sequence. Each new tree is fit to the errors left over by the trees before it, and the predictions from all of the trees are combined into a final prediction. Gradient boosting is a powerful framework for training simple and interpretable machine learning models.
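The idea can be sketched in plain Elixir for a toy one-dimensional regression problem. This is not how XGBoost or EXGBoost is implemented; the TinyBoost module, the single-split "stump" learner, and the data are all assumptions made for illustration. With squared-error loss, each round fits a stump to the current residuals and adds a scaled-down copy of it to the ensemble.

```elixir
defmodule TinyBoost do
  # Toy gradient boosting for 1-D regression with squared-error loss.
  # For this loss, the negative gradient is the residual y - prediction.
  def train(xs, ys, rounds, learning_rate) do
    base = mean(ys)
    initial = Enum.map(xs, fn _ -> base end)

    {stumps, _preds} =
      Enum.reduce(1..rounds, {[], initial}, fn _round, {stumps, preds} ->
        residuals = Enum.zip_with(ys, preds, fn y, p -> y - p end)
        stump = fit_stump(xs, residuals)

        new_preds =
          Enum.zip_with(xs, preds, fn x, p ->
            p + learning_rate * eval_stump(stump, x)
          end)

        {[stump | stumps], new_preds}
      end)

    {base, learning_rate, stumps}
  end

  # The final prediction is the base value plus every stump's scaled output.
  def predict({base, learning_rate, stumps}, x) do
    Enum.reduce(stumps, base, fn stump, acc ->
      acc + learning_rate * eval_stump(stump, x)
    end)
  end

  # A stump is one split: {threshold, left_value, right_value}.
  # Try every observed x as a threshold and keep the lowest squared error.
  defp fit_stump(xs, targets) do
    pairs = Enum.zip(xs, targets)

    xs
    |> Enum.uniq()
    |> Enum.map(fn threshold ->
      {left, right} = Enum.split_with(pairs, fn {x, _} -> x < threshold end)
      stump = {threshold, mean(Enum.map(left, &elem(&1, 1))), mean(Enum.map(right, &elem(&1, 1)))}
      errors = Enum.map(pairs, fn {x, t} -> :math.pow(t - eval_stump(stump, x), 2) end)
      {Enum.sum(errors), stump}
    end)
    |> Enum.min_by(&elem(&1, 0))
    |> elem(1)
  end

  defp eval_stump({threshold, left, right}, x), do: if(x < threshold, do: left, else: right)

  defp mean([]), do: 0.0
  defp mean(values), do: Enum.sum(values) / length(values)
end

# Made-up data with a jump between x = 3 and x = 4.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 5.0, 5.1, 4.9]

mse = fn model ->
  squared = Enum.zip_with(xs, ys, fn x, y -> :math.pow(y - TinyBoost.predict(model, x), 2) end)
  Enum.sum(squared) / length(squared)
end

one_round = TinyBoost.train(xs, ys, 1, 0.5)
twenty_rounds = TinyBoost.train(xs, ys, 20, 0.5)
```

Comparing `mse.(one_round)` with `mse.(twenty_rounds)` shows the training error shrinking as more weak models are added.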

Why Gradient Boosting?

Gradient boosting is fundamentally different from any of the algorithms that exist in the Elixir machine learning ecosystem today. One of the greatest differences is that it's not easy to implement using purely Nx operations and numerical definitions. While there are efforts in the Python ecosystem, such as Hummingbird, to compile traditional machine learning models such as decision trees to tensor computations, these efforts require a model that has already been trained. The training logic for decision trees is not something that's easily expressed as tensor computations.

Given this fact, why even bother? Scholar has a growing number of traditional machine learning algorithms, and Axon has deep learning covered. As it turns out, gradient boosting is one of the few machine learning techniques that still rivals deep learning for certain modalities. As an example, models based on the gradient boosting framework often outperform their deep learning counterparts on tabular data. Additionally, the trained models are often simpler, faster, and easier to interpret. Gradient boosting is simply too powerful to ignore.

Given that tabular datasets represent a significant chunk of business data, gradient boosting is extremely important to have in a data scientist's toolkit. Interpretability is a bonus. Deep learning is often criticized for a lack of interpretability, because deep learning models are largely black boxes. Interpretability is often an important factor in a company's decision to deploy a machine learning model. All of these factors make the addition of EXGBoost a huge boon to the Elixir machine learning ecosystem.

How can I use EXGBoost?

To get comfortable with EXGBoost, we'll go through a simple example: predicting the price of diamonds based on several attributes from this diamond dataset. Go ahead and download the dataset from Kaggle, and then install the following dependencies:

Mix.install([
  {:nx, "~> 0.5"},
  {:exgboost, "~> 0.2"},
  {:scholar, "~> 0.1"},
  {:explorer, "~> 0.5"}
])

You'll also want to require the Explorer macros:

require Explorer.DataFrame, as: DF

Next, read in the dataset using Explorer:

df = Explorer.DataFrame.from_csv!("/Users/sean/diamonds.csv")

You'll notice we have a number of both qualitative and quantitative features. The first column is just an ID column, so you can discard that:

df = DF.discard(df, 0)

Now, before we can convert the DataFrame to a tensor, we need to convert the string columns to numeric columns. The string columns represent categories that we can just numerically encode:

df =
  DF.mutate(
    df,
    for col <- across(~w[cut color clarity]) do
      {col.name, Explorer.Series.cast(col, :category)}
    end
  )
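Conceptually, a categorical encoding maps each distinct string to an integer code. The sketch below shows that idea in plain Elixir on a made-up sample of the cut column. The codes Explorer actually assigns are an internal detail and may differ from these.

```elixir
# Hypothetical sample of the "cut" column.
cuts = ["Ideal", "Premium", "Good", "Ideal", "Good"]

# Assign each distinct string an integer code in order of first appearance.
codes =
  cuts
  |> Enum.uniq()
  |> Enum.with_index()
  |> Map.new()

# Replace every string with its code.
encoded = Enum.map(cuts, &Map.fetch!(codes, &1))
```

Keep in mind that integer codes like these impose an arbitrary order on the categories, which a tree model will happily split on.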

Now, we want to shuffle this data and split it into training and test sets:

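n_rows = DF.n_rows(df)
split_at = floor(0.8 * n_rows)

df = DF.shuffle(df)
# Ranges are inclusive, so the training slice stops one row before split_at.
# Otherwise that row would land in both the training and test sets.
train_df = DF.slice(df, 0..(split_at - 1))
test_df = DF.slice(df, split_at..-1)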
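A split should put every row in exactly one of the two sets. As a sanity check on the logic, here is the same shuffle-and-split idea on a plain list of made-up row IDs, using only the standard library rather than Explorer:

```elixir
# Hypothetical row IDs standing in for DataFrame rows.
rows = Enum.to_list(1..10)
split_at = floor(0.8 * length(rows))

# Enum.split/2 returns the first split_at elements and the remainder,
# so the two lists never share a row.
{train, test} = rows |> Enum.shuffle() |> Enum.split(split_at)
```

With 10 rows and an 80/20 split, `train` holds 8 rows and `test` holds the remaining 2.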

Finally, we want to convert both DataFrames into tensors:

features = ~w(carat cut color clarity depth table x y z)
targets = ~w(price)

x_train =
  train_df[features]
  |> Nx.stack(axis: 1)

y_train =
  train_df[targets]
  |> Nx.stack(axis: 1)

x_test =
  test_df[features]
  |> Nx.stack(axis: 1)

y_test =
  test_df[targets]
  |> Nx.stack(axis: 1)

With our data ready, we can start using EXGBoost to train a model. The EXGBoost API is intentionally small. At the top level, there are two important functions: train/3 and predict/3. For most use cases, this is all you'll ever need. You can configure both training and prediction with a number of options. There are more complex APIs if you need something more custom; however, for this use case, we'll stick to the basics.

To train a model, you just pass a training set to EXGBoost.train/3. You also need to specify an objective. The objective is the loss function the model is optimized against. In this case, our task is regression, so we'll use the squared-error loss. You can tell EXGBoost to use this loss function by specifying obj: :reg_squarederror:

model = EXGBoost.train(x_train, y_train, obj: :reg_squarederror)
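To give a feel for what a squared-error objective measures, here is a small sketch in plain Elixir. The function names are made up, and the factor of 0.5 is a common convention that I'm assuming here; it only rescales the loss.

```elixir
# Squared-error loss for a single example (assumed 0.5 scaling).
loss = fn y, pred -> 0.5 * (y - pred) * (y - pred) end

# Its gradient with respect to the prediction. Boosting fits each new
# tree to push predictions in the opposite direction of this value.
grad = fn y, pred -> pred - y end

loss_value = loss.(3.0, 1.0)
grad_value = grad.(3.0, 1.0)
```

A prediction of 1.0 for a target of 3.0 gives a negative gradient, signaling that the prediction should move up.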

In addition to specifying an objective, you can have EXGBoost evaluate the model as it trains. To do this, you specify evals, which is a list of tuples of the form {features, targets, name}. In this case, we can evaluate the model on the training set during training:

model =
  EXGBoost.train(x_train, y_train,
    obj: :reg_squarederror,
    evals: [{x_train, y_train, "train"}]
  )

You'll notice our model trained for 10 iterations. You can also customize this by specifying num_boost_rounds:

model =
  EXGBoost.train(x_train, y_train,
    obj: :reg_squarederror,
    evals: [{x_train, y_train, "train"}],
    num_boost_rounds: 15
  )

There are a number of other options to control training such as callbacks, regularization techniques, and more. With a trained model, you simply need to call EXGBoost.predict/3 to get predictions on a set of inputs:

y_pred = EXGBoost.predict(model, x_test)

We can use Scholar to evaluate these predictions:

Scholar.Metrics.mean_absolute_error(Nx.squeeze(y_test), y_pred)

Across the 10788 test examples, our model had a mean absolute error of $333. You can see what this means by inspecting the pairwise error:

Nx.abs(Nx.subtract(Nx.squeeze(y_test), y_pred))

Notice that some predictions are quite close, while others miss the mark by a lot. That happens, and you can try messing around with the training process a bit more to bring this error down!
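If the metric itself is unfamiliar, mean absolute error is just the average of those pairwise absolute differences. Here is the arithmetic on three made-up prices in plain Elixir. These numbers are hypothetical and are not outputs of the diamonds model.

```elixir
# Hypothetical true prices and predictions, in dollars.
y_true = [500.0, 1200.0, 3000.0]
y_hat = [450.0, 1500.0, 2900.0]

# Pairwise absolute errors, then their mean.
abs_errors = Enum.zip_with(y_true, y_hat, fn t, p -> abs(t - p) end)
mae = Enum.sum(abs_errors) / length(abs_errors)
```

One large miss (300 here) pulls the average up even when the other predictions are close, which is why it helps to look at the pairwise errors and not only the summary number.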

Conclusion

EXGBoost brings a new set of capabilities to the Elixir machine learning ecosystem. With a simple API built on top of a powerful framework, it opens a world of possibilities for machine learning developers in the Elixir ecosystem. Before concluding, I need to give a huge shoutout to Andrés Alejos for bringing gradient boosting to the Elixir community. Until next time!
