Three Years of Nx: Growing the Elixir Machine Learning Ecosystem

The Elixir machine learning ecosystem is growing, and fast. Catch up on the latest machine learning blogs and get in touch with DockYard today to learn how we can put cutting-edge machine learning to work for you.

It has officially been three years since the start of the Nx project. The first commit in the project that would become Nx was on October 28th, 2020. The commit itself is actually my first attempt at getting XLA built as a NIF separate from the TensorFlow repository, a feat that was much trickier than I expected given XLA’s (then) entanglement with TensorFlow and my lack of experience with NIFs.

Looking back, I never would have expected the project to grow in the way that it has. The community response to Nx has been nothing short of incredible. In three short years, Nx and related projects have matured from fledgling experiments to production-ready workhorses.

In honor of the three-year anniversary of Nx, I want to take the opportunity to recount the history of the project and talk about the next three years of machine learning in Elixir.

An Unlikely Start

Around 2019 I started working on what would become Genetic Algorithms in Elixir. At the time, I had built a toy framework for genetic algorithms called Genex. I intended for both the Genex project and the book to be solely educational. While I felt that Elixir would be a fun language to do machine learning in, I resigned myself to the same belief that everybody else had—it was not and would not ever be possible to do machine learning in Elixir.

Erlang, and thus Elixir, was not designed to be performant on the types of computations required to do machine learning at scale. Additionally, it was not really possible to call out to native code to do expensive machine learning workloads and number crunching prior to the introduction of dirty NIFs.

There were (and still are) shortcomings to doing machine learning and numerical computing on the BEAM. When I was a newcomer to the language, these shortcomings surprised me. After all, Elixir and Erlang are touted for their concurrency model, and that’s exactly what you need to do numerical computing efficiently, right? Nope. Erlang’s, and thus Elixir’s, concurrency model is suitable for IO-bound workloads. These are the kinds of workloads you find in web applications, which is why Elixir is often seen as a great language for building them. However, these workloads are far different from the CPU-intensive workloads you find in machine learning.

After PragProg released Genetic Algorithms in Elixir, Brian Cardarella reached out to me about potentially working on bringing machine learning to Elixir. At the time, I wasn’t really convinced it was a good idea, but I was eager to work on something the community would want. Brian connected me with José Valim, and we began discussing what a machine-learning project would look like.

To Bind or Not to Bind

We considered a few different approaches (with a helpful suggestion from Jackal Cooper who had worked on machine learning frameworks previously):

Build bindings for an existing framework such as PyTorch or TensorFlow
Find a generic “NDArray” library and write a NIF around it
Look into this new thing called JAX and copy their approach using XLA

I actually briefly explored option one by attempting to write bindings for PyTorch’s lower-level internal framework called ATen. However, there were a few drawbacks to the bindings approach:

When you bind yourself to a particular framework, you lock yourself into that framework’s ecosystem. You are at the mercy of their release cycle.
Bindings, especially those to something like TensorFlow, would prevent us from building an extensible ecosystem. If we went the bindings route, we would end up needing a NIF for each additional library, and interoperability between libraries would suffer.
Elixir’s immutability makes dealing with large data structures that represent tensors difficult. We would either need to work exclusively with references to memory, which is somewhat of an anti-pattern in a functional language, or deal with the excessive copies that come with immutability.

Because of these drawbacks, we decided to explore XLA. XLA was a relatively new project from Google. It stands for Accelerated Linear Algebra, and is a machine-learning compiler that takes a computation graph representing a numerical program and compiles it to a native program that can target the CPU or GPU. XLA was perfect for our purposes because it required the user to write programs in a functional way.

Additionally, XLA forced us to design Nx in an extensible manner. XLA takes a ground-up approach. It relies on a small number of primitive operations. You can compose these primitive operations into complex numerical programs. Due to XLA’s flexibility, we were able to build a library that could support a large number of use cases. In fact, many of the libraries built on top of Nx are pure Elixir. They do not rely on external NIFs.
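To make the "small set of primitives, composed into larger programs" idea concrete, here is a minimal sketch in plain Python. It is not XLA or Nx code: the primitive names, the `Node` class, and the `evaluate` function are all hypothetical and exist only for illustration. It walks the graph with an interpreter, whereas XLA compiles the graph to native code.

```python
import math
from dataclasses import dataclass

# Hypothetical, illustration-only primitive set (XLA defines its own, larger set).
PRIMITIVES = {
    "add": lambda a, b: a + b,
    "divide": lambda a, b: a / b,
    "negate": lambda a: -a,
    "exp": lambda a: math.exp(a),
}


@dataclass(frozen=True)
class Node:
    """One node of a computation graph: a parameter, a constant, or a primitive call."""
    op: str                # "param", "constant", or a key of PRIMITIVES
    args: tuple = ()       # child nodes consumed by this operation
    value: object = None   # parameter name or constant value


def param(name):
    return Node("param", value=name)


def constant(value):
    return Node("constant", value=value)


def apply(op, *args):
    return Node(op, args=tuple(args))


def sigmoid(x):
    """A higher-level function expressed only in terms of the primitives above."""
    return apply(
        "divide",
        constant(1.0),
        apply("add", constant(1.0), apply("exp", apply("negate", x))),
    )


def evaluate(node, env):
    """Interpret the graph recursively; a compiler would emit native code instead."""
    if node.op == "param":
        return env[node.value]
    if node.op == "constant":
        return node.value
    args = [evaluate(arg, env) for arg in node.args]
    return PRIMITIVES[node.op](*args)


# Building the graph does no numerical work; evaluation happens separately.
graph = sigmoid(param("x"))
result = evaluate(graph, {"x": 0.0})
```

Because `sigmoid` is just a composition of primitives, a backend only needs to understand the primitives to run it. That separation is roughly why libraries built on top of such a core can stay in the host language.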

Once we decided to use XLA, we spent much of the next few weeks reverse engineering how it worked and how we could build an Elixir library around it. Slowly, the abstractions that Nx uses today started to develop. As the project developed, José began teasing our work on Twitter with some of our early benchmarks: https://elixirforum.com/t/anyone-who-wants-to-speculate-about-this-tweet-from-jose/35772. Our early benchmarks were encouraging, and it seemed that we had made the right choice using XLA over another framework. Unfortunately, there were also some early (and ongoing) struggles.

One of the drawbacks of going with XLA is that we needed to implement certain things ourselves from scratch. One essential component of the Nx library that gave us (mostly José) headaches for a long time was automatic differentiation. Machine learning algorithms like deep learning rely on gradient-based optimization, which requires the use of automatic differentiation to efficiently compute gradients. If we had written bindings for PyTorch or another library, we would have been given automatic differentiation for free.

However, because we took a “from scratch” approach, we had to write, and rewrite several times, automatic differentiation for Nx. If you ask José about this, I am sure he will fondly recall the joy automatic differentiation brings to him.
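For readers unfamiliar with the technique, the following is a minimal sketch of reverse-mode automatic differentiation on scalars, written in plain Python. It is an illustration under simplifying assumptions, not how Nx implements it: the `Var` class and the `sin` and `backward` helpers are hypothetical names, and a real implementation has to handle tensors, shapes, and far more operations.

```python
import math


class Var:
    """A scalar that remembers which values it was computed from."""

    def __init__(self, value, parents=()):
        self.value = value
        # Each entry is (parent Var, derivative of this value w.r.t. that parent).
        self.parents = parents
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(
            self.value * other.value,
            ((self, other.value), (other, self.value)),
        )


def sin(x):
    return Var(math.sin(x.value), ((x, math.cos(x.value)),))


def backward(output):
    """Propagate d(output)/d(node) from the output back to every input."""
    order, seen = [], set()

    def visit(node):
        # Post-order traversal: parents are appended before the nodes using them.
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    # Walk from the output towards the inputs, applying the chain rule.
    for node in reversed(order):
        for parent, local_derivative in node.parents:
            parent.grad += node.grad * local_derivative


x = Var(2.0)
y = Var(3.0)
z = x * y + sin(x)
backward(z)
# Analytically: dz/dx = y + cos(x) and dz/dy = x.
```

A single backward pass yields the gradient with respect to every input at once, which is what makes this approach efficient for models with many parameters.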

The First Signs of Success

By December of 2020, we had most of the foundations of the project in place, and started looking for applications to test our progress on. For a first test of our work, we decided to see if we could train a neural network on MNIST using Nx. I recall working for a few days trying to get a neural network to properly train. Finally, after a bit of work, we were able to train a neural network from scratch with Nx.

Following the success of training our first neural network, we wanted to see how well the library could be used to implement more complex machine-learning algorithms.

The first step we took was to see if we could implement a machine-learning framework in Nx. In January 2021, I started working on Axon, a library for creating and training neural networks in Elixir.

As Axon developed, it became more and more clear that we would need a way to load pre-trained models from the Python ecosystem into Elixir in order to become a viable alternative to Python for machine learning.

One early attempt at this was implementing conversion to and from ONNX with Axon via axon_onnx. While Axon ONNX works for some models, it proved somewhat difficult to get working for certain complex applications. Because of this, we started exploring other options for making pre-trained models work in Elixir.

Beyond Machine Learning

After the public release of the Nx project in February 2021 at Lambda Days, many people in the Elixir community were excited to see what the Elixir machine-learning ecosystem would become. When drawing comparisons to the Python ecosystem, there were two glaring gaps in our work: code notebooks similar to Jupyter and a library for data manipulation and exploration similar to Pandas.

By January of 2021, José and Jonatan Kłosko had already started working on Livebook. Livebook is an Elixir application for creating interactive and collaborative code notebooks in Elixir. Many of the features built into Livebook came directly from José and Jonatan studying deficiencies and complaints people had with Jupyter notebooks.

While not explicitly designed with machine learning workloads in mind, there are some features built into Livebook that make machine learning and data exploration a joy in Elixir. Almost three years later, Livebook is still innovating in the code notebook space, and proving itself as a useful tool for every Elixir programmer.

The second gap, a library for data manipulation and exploration similar to Pandas, was quickly filled by Chris Grainger. Chris started work on Explorer. Explorer wrapped the Rust library Polars, and provided an implementation of Series and DataFrames in Elixir. Explorer has also grown significantly in the last three years, and allows you to explore and manipulate datasets from a variety of data sources with a unified API.

Growing the Ecosystem

After the release of Nx in February, the ecosystem and projects continued to grow. Notably, we quickly received a number of external contributions, including documentation fixes, additional features, and more. One of the first key contributors (and now a pivotal member of the core team and all-around Nx wizard) was Paulo Valente. Paulo’s first contribution came almost immediately after the public release of Nx. Since then, he’s been a driving force behind the direction of the project.

After building Nx to a more feature-complete state, the focus shifted to supporting features for external libraries. One such library was Scholar, which started in February 2022 as an effort to implement traditional machine learning methods with Nx. Scholar has developed rapidly thanks to the diligent work of Mateusz Sluszniak and many others.

Later, in May of 2022, we started working on an exciting new library capable of loading machine-learning models in Elixir. The project was based on work from Dashbit which made it possible to unpickle Python pickle files directly in Elixir.

This was necessary because PyTorch stored model parameters as pickle files. Given we were now able to unpickle pickle files, we could start reading model parameters from pretrained PyTorch models in Elixir.

Thanks to this we were able to start work on Bumblebee. Bumblebee allows users to load powerful pretrained models directly in Elixir. I would credit Bumblebee with significantly growing the size of the Elixir machine-learning ecosystem, because it lowered the barrier to entry for Elixir users to access machine learning. Bumblebee has unlocked access to incredible things directly in Elixir. For example, you can use Bumblebee to interact with large language models such as Llama2 and Mistral.

…and Much, Much More

I want to emphasize that there are many more important projects and contributors in the ecosystem outside those mentioned in this post. I could go on forever about the incredible work the community has poured into the Elixir machine-learning ecosystem over the last three years.

The Elixir Nx organization has grown to 18 total projects, with many more external libraries depending on Nx. The Nx project has grown to 85 contributors in three years. We have libraries for almost any machine learning application you can imagine. If there’s a way to do something related to machine learning in Python, chances are there’s a way to do the same thing in Elixir.

The Next 3 Years of Nx

It was impossible to predict three years ago what the state of artificial intelligence and machine learning would look like today. For example, I would never have predicted the rise in capabilities of large language models.

I think it is equally impossible now to predict what the state of artificial intelligence will look like in three years; however, I can make bold predictions about the state of Elixir and the Nx project in three years:

Nx will be a competitive deployment option for large foundation models. There is a lot of positive momentum in the world of open-source foundation models. The capabilities of open-source models are growing closer and closer to the capabilities of proprietary models. This swing brings the need for a robust and performant deployment environment. I believe Elixir and Nx are well-suited to be the deployment environment of the large language model future. We will continue adding features that make large-scale model deployments viable. We will also continue adding features that make it possible to deploy large models on smaller and cheaper hardware.
Elixir will be a preferred language for machine learning applications. Regardless of whether Nx is used, I think Elixir has proven itself as a wonderful language for orchestrating AI applications. Nx makes the choice even easier as you can build small parts of your machine learning workflows in Elixir, which greatly simplifies your development stack. In the next three years we will see a growing number of machine learning applications built in Elixir.

Overall, I predict the next three years of Nx will be just as exciting as the first three. And I am very excited to be along for the ride :) Until next time.