Introduction
If you’ve read the book Made to Stick by Chip and Dan Heath, you might recall the excerpt on “listeners” and “tappers.”
Very early on in the book, they describe an experiment conducted in the 90s by Stanford PhD Elizabeth Newton. Newton's experiment divided participants into "listeners" and "tappers." Tappers were tasked with tapping out assigned songs such as "Happy Birthday" and "The Star-Spangled Banner." Listeners were tasked with identifying the songs their counterparts were tapping.
Before the experiment started, tappers were asked to assess the probability with which the listener would correctly guess the song they were assigned to tap. The average tapper believed their listener counterpart would correctly guess the song about 50% of the time. In reality, listeners only correctly guessed the song three out of 120 times.
The purpose of the excerpt in the book is to demonstrate the Curse of Knowledge: When communicating topics from a position of knowledge or authority, it’s often difficult to fully understand what it’s like to lack that particular knowledge or authority.
In other words, you often have a hard time remembering what it was like before you knew what you know now, and that makes it difficult to convey complex topics to beginners. I find this is especially true for machine learning–so this post is an attempt to convey the foundations and basics of machine learning from the perspective of a listener.
WARNING: I haven’t finished the book yet, so I might fail miserably :)
Machines that Learn?
The first question you might ask when starting your journey to Machine Learning enlightenment is What is machine learning? I think it’s easiest to understand machine learning with some historical context, and with a greater understanding of artificial intelligence as a whole.
You probably have some intuition about the definition of artificial intelligence. Formally, artificial intelligence is a system or collection of systems that mimics human intelligence on a particular task.
Of course, this leaves a ton of room for interpretation around the definition of intelligence. Most academics can probably agree on some of the elements that must be present in order for a system to be considered intelligent; however, there are probably just as many things they would vehemently disagree about.
For a general understanding, I think it’s fair to argue that you can consider any system which attempts to mimic the senses, behaviors, or actions of humans to be artificial intelligence. The common aspects of human cognitive ability you’ll see mimicked in artificial intelligence are things like vision, speech, writing, planning, movement, learning, etc.
Artificial Intelligence as an academic field of research sprang out of a 1956 Dartmouth Workshop proposed by John McCarthy, Marvin Minsky, Claude Shannon, and Nathaniel Rochester. The workshop coined the term "artificial intelligence." Since then, the field of artificial intelligence has experienced a number of booms and busts–years of incredible progress and growth followed by long stretches of stagnation known as AI winters.
The early consensus in artificial intelligence research was that logic-based approaches were the most promising path to creating systems that mimicked human intelligence.
At face value, this feels right. The idea that the world is made up of concrete facts, patterns, and rules, and that we should be able to express those facts, patterns, and rules succinctly with logic, is simple and somewhat beautiful.
Initially, logic-based approaches showed great promise and spawned a generation of research and tooling for logic-based artificial intelligence. You can explore many of these classic approaches on GitHub in Peter Norvig's Paradigms of Artificial Intelligence Programming book.
Unfortunately, logic-based approaches break down on tasks that are seemingly simple, but layered with complexity.
For example, what set of rules would you use to describe images of birds? What set of rules would you use to describe the written English language? What set of rules would you use to describe the inflections and sound that make up English speech?
You might be able to come up with a few rules for each of these questions, but you’ll probably quickly realize that your rules don’t cover certain edge cases, or there are specific exceptions to your rules. The problem with attempting to model systems with concrete rules is that the complexity of the ruleset increases exponentially with the complexity of the problem. Enumerating all of the rules which govern the complexity of real-world problems is incredibly challenging. What if there was a way to learn these rules through experience?
Machine learning was born from the idea that machines could be programmed to learn from experience on given tasks. Perhaps they'd be given some initial knowledge about the world, but largely they would learn from observations. Machine learning is thus considered a subset of artificial intelligence. The formal definition of machine learning, as given by Tom Mitchell in his 1997 book Machine Learning, is:
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Those are the three defining characteristics of a machine learning algorithm: experience, tasks, and performance. The beauty of this definition is that it does not make any assumptions about the task, metrics, or training data. You can learn anything.
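To make the definition concrete, here is one way to phrase the bird-classification problem used later in this post in Mitchell's three terms. The wording of each entry is mine, purely for illustration:

# a purely illustrative framing of a problem in terms of task, experience, and performance
mitchell_framing = %{
  task: "decide whether an image contains a bird",
  experience: "a dataset of images, each labeled bird or not bird",
  performance: "the fraction of held-out images classified correctly"
}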
How do machines learn?
Armed with a high-level understanding of what machine learning is, you’re probably more confused or at least more curious to understand how machines learn. Before diving into that, let’s revisit one of the fatal flaws of rule-based approaches.
Remember that the number of rules required to capture a complex process increases exponentially with the complexity of the system. Why is that?
Rule-based approaches are rigid by default–they consist of facts, and they query existing collections of facts with observations. There is no room for uncertainty, and thus no room for handling the countless edge cases that inevitably arise in the chaos of the real world. If, instead, we assumed uncertainty as a principle, we could capture the complexity of the world with fewer, simpler rules.
The classic example of uncertainty providing a better framework for modeling the real-world is from Ian Goodfellow’s Deep Learning Book.
Goodfellow and his co-authors ask the question, What birds fly? A rule based on uncertainty for this question might look something like: Most birds fly. You can even quantify this uncertainty with a probability if you have enough information: 80% of birds fly. Alternatively, a rule without uncertainty would look something like: All birds fly, except young birds, birds that are injured, penguins, ….. with a countless number of exceptions only an ornithologist would know.
The strength of modern machine learning approaches is that they are built on uncertainty. They make use of probability theory–which is the study of rules that quantify uncertainty–to model complex real-world processes.
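As a small illustration of the contrast (everything below is hypothetical, just to show the two styles side by side), compare a hard rule that needs an ever-growing list of exceptions with a single rule that quantifies uncertainty:

# a hard rule: every exception has to be enumerated by hand
flies_rule? = fn bird ->
  bird.species not in ["penguin", "ostrich", "kiwi"] and
    not bird.injured and not bird.juvenile
end

# an uncertain rule: one number captures "most birds fly"
probability_a_bird_flies = 0.8

flies_rule?.(%{species: "penguin", injured: false, juvenile: false})
#=> false, but only because we remembered to special-case penguins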
Okay, but how do they do that? To answer this, we’ll consider machine learning in a supervised setting, because that’s the easiest to understand. The fundamental question in supervised learning is: Given X, how certain are you of Y? You can apply this to any supervised learning problem you might be familiar with:
- Given this image, how certain are you that it contains a bird?
- Given this text, how certain are you that the sentiment is positive?
- Given these stats, how certain are you the Sixers will win the NBA finals? (I’m not very certain.)
In reality, a machine learning model is just a function:

def model(observation) do
  likelihood_of_event(observation)
end
Which returns the likelihood of an event given the observation. These likelihoods are expressed as probabilities, which is why you’ll often see scores associated with predictions. For example, if you train an Axon model to predict whether or not an image contains a bird, you’ll end up with predictions that look like:
#Nx.Tensor<
  f32
  0.75
>
This prediction can more or less be interpreted as a model saying: “I am 75% certain this image contains a bird.” In reality, a machine learning model is just a function that takes an observation and returns a likelihood.
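If you need a hard yes/no answer rather than a score, you typically pick a threshold yourself; 0.5 is a common default, but it's a choice you make, not something the model gives you. A trivial sketch:

# 0.5 is just a common default threshold, not something the model decides
prediction = 0.75

decision = if prediction >= 0.5, do: "bird", else: "not a bird"
#=> "bird"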
But how does a model take a picture of a bird and spit out a probability? The actual process depends on the model–the description you would receive for something like decision trees differs from that of neural networks. I will try to explain the process in more general terms.
First, imagine for a second you were asked to physically sort a number of images on a scale from 0 to 1. Images placed near zero nearly certainly don’t contain a bird. Images placed near one almost certainly do contain birds.
This might be a difficult task to imagine because, to you, images either concretely do or do not contain a bird. Rather than think of the problem as a discrete 0 or 1, try to consider the features of a bird and then think about what kinds of images might lie to the left, in the middle, and to the right.
For example, you might place images of inanimate objects very close to 0. Images of other animals such as cats or dogs might be a little further from 0, but still not too far. As you drift towards the center you might find airplanes and other objects which have defining features in common with birds. As you approach 1, the images you place have more of the defining features of a bird.
At the end of the exercise, you have essentially created a probability distribution of images that contain birds. Mentally, you transformed the input features you saw in the images into a probability. This is the same thing we do in machine learning.
Rather than manually transforming inputs into probabilities, you want to create a function that does it for you. You assume a parameterized probability distribution:
def model(params, inputs) do
  likelihood = f(params, inputs)
  likelihood
end
In the above code, f/2 is a function which answers the question: Given inputs, what is the probability of Y? This prediction is based on its parameterization params.

The presence of params means f/2 can try to replicate many other distributions from a family of probability distributions. When I say probability distribution, I'm really just talking about something that resembles the physical scale you created for bird images earlier. So f/2 is a function which represents a probability distribution and is how we actually obtain probabilities from inputs. But, that still doesn't fully answer the question of how inputs become probabilities.
Remember from the exercise that you were asked to list the defining features of birds and use those features to predict probabilities. We can ask f/2 to do essentially the same thing. Our input images will come in as a vector or matrix of pixel color values. For each image, we'll end up with just a bunch of numbers. We can transform these numbers in a number of ways using params, assuming that params will "emphasize" certain groups of pixel values which contain defining bird features. For example, f/2 might look something like:
def f(params, input) do
  params
  |> Nx.dot(input)
  |> Nx.logistic()
end
This is a linear transformation of the input using params. Visually, you can imagine that all of our original images could be plotted somewhere in space. The linear transformation projects them into a different, perhaps more ordered orientation in space, and the logistic function squeezes this space into a scale from 0 to 1. This is where Nx and some understanding of linear algebra come in handy–you can transform inputs in a number of ways in an attempt to map inputs to probabilities.
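To see the squeezing in action, here is a toy example with made-up numbers: a dot product can land anywhere on the number line, but Nx.logistic/1 maps it back into the 0 to 1 range:

# made-up parameters and input, just to show the shape of the computation
params = Nx.tensor([0.5, -0.25, 0.1])
input = Nx.tensor([1.0, 2.0, 3.0])

params
|> Nx.dot(input)
|> Nx.logistic()
#=> roughly 0.57, a value we can read as a probability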
Now, let’s assume that there exists some real function true_f
and some true_params
which represents the true probability distribution of all images that contain birds. And let’s assume that true_f
looks identical to f
:
def true_f(true_params, input) do
  true_params
  |> Nx.dot(input)
  |> Nx.logistic()
end
If you're just given a sample of images and their corresponding result from true_f, how can you make f match true_f? In other words, how can you learn to model true_f with f? Well, you can try random guessing, or brute-force search; however, you'll quickly find this to be a nearly impossible task. If you instead use some principles from optimization and probability theory, you can recover true_params and true_f more intelligently.
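To get a feel for why blind guessing is hopeless, here is a self-contained sketch. Everything in it is made up for illustration, and it uses a plain dot product (dropping the logistic) to keep the arithmetic easy to follow: it draws random parameter vectors and keeps the best average error over a handful of labeled examples.

# made up for illustration: sixteen "true" parameters we pretend not to know
true_params = Nx.random_uniform({16})

# a small sample of labeled observations from the "true" function
examples =
  for _ <- 1..100 do
    x = Nx.random_uniform({16})
    {x, Nx.dot(true_params, x)}
  end

# average absolute error of a candidate guess across the sample
error_of = fn guess ->
  examples
  |> Enum.map(fn {x, y} -> Nx.to_number(Nx.abs(Nx.subtract(Nx.dot(guess, x), y))) end)
  |> Enum.sum()
  |> Kernel./(length(examples))
end

# even the best of 1,000 random guesses typically leaves a sizable error;
# every additional parameter makes blind guessing exponentially harder
best_error =
  1..1_000
  |> Enum.map(fn _ -> error_of.(Nx.random_uniform({16})) end)
  |> Enum.min()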
Your goal is pretty concrete here: you want f to match true_f, which means params have to match true_params. You want to minimize the difference between f and true_f.
But, how do you measure the difference between two functions without access to true_params? Well, you can approximate it by measuring the difference between your model's output and the true label for each observation in the dataset given to you. In other words, you use your model to predict a probability for each image, measure the difference between the predicted probability and the true label, and then update your model accordingly. You end up with an objective that looks something like:
defn objective(params, input, label) do
  probability = model(params, input)
  measure_difference(label, probability)
end
The form of measure_difference depends on the type of problem you're trying to solve. This difference function is often called a loss function.
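For instance, two losses you'll run into constantly are mean squared error, common for regression-style problems, and binary cross-entropy, common for yes/no questions like "does this image contain a bird?". Here is a minimal sketch of both written with Nx; the Losses module name is mine, but the formulas are the standard ones:

# Losses is just an illustrative module name
defmodule Losses do
  import Nx.Defn

  # mean squared error: the average squared distance between prediction and label
  defn mean_squared_error(label, prediction) do
    label
    |> Nx.subtract(prediction)
    |> Nx.power(2)
    |> Nx.mean()
  end

  # binary cross-entropy: heavily penalizes confident predictions that turn out wrong
  defn binary_cross_entropy(label, prediction) do
    Nx.negate(Nx.mean(label * Nx.log(prediction) + (1 - label) * Nx.log(1 - prediction)))
  end
end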
You now have a clear objective function that contains your model params–by minimizing the objective function you are in essence minimizing the difference between f and true_f. You are learning to represent true_f using f from observations.
Okay, so how do you actually minimize objective/3? The most common form of optimization in machine learning is stochastic gradient descent.
To explain gradient descent, I will use the same analogy I used in a previous blog post. Imagine you’re dropped somewhere randomly in the ocean and you’re asked to find the deepest point in the water using only your depth-finder as a guide. You could move about randomly, but you’d probably end up wandering pretty aimlessly for days. A smarter approach would be: Take depth measurements in small increments in every direction. Determine which direction is the direction of deepest descent. Move in that direction. Repeat.
This is the essence of gradient descent–your objective function is an ocean, and you want to find the deepest point by constantly descending in the direction of steepest descent.
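Here is the analogy as a tiny, self-contained sketch in one dimension; the depth function and every constant are made up. It estimates the slope by probing a little to each side, then steps downhill. Nx's grad, which we'll use next, computes that slope exactly instead of probing:

# "depth" at position x; the deepest point of this toy ocean is at x = 3.0
depth = fn x -> (x - 3.0) * (x - 3.0) end

# probe a small step to either side, estimate the slope, then move downhill
find_deepest_point = fn start, steps ->
  Enum.reduce(1..steps, start, fn _, x ->
    slope = (depth.(x + 1.0e-3) - depth.(x - 1.0e-3)) / 2.0e-3
    x - 0.1 * slope
  end)
end

find_deepest_point.(0.0, 100)
#=> approaches 3.0, the deepest point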
But, how can you find the direction of steepest descent? Mathematically, the gradient of a function points in the direction of steepest ascent, so stepping against the gradient moves you in the direction of steepest descent. In Nx, this is what grad takes care of for you. You can use grad of your objective function with respect to params in order to update params slightly in a direction that minimizes the objective function on an example. Each step looks something like:
defn step(params, input, label) do
  direction_of_descent = grad(params, &objective(&1, input, label))
  params - direction_of_descent * 1.0e-2
end
Here grad with respect to params tells you how to update params to minimize objective. The scaling factor 1.0e-2 is your learning rate, and ensures you don't take too large of a step in any given direction. If you repeat this process iteratively over a given dataset, you end up closely replicating true_f with f.
What does this actually look like in Nx?
Now that you know all of these concepts somewhat in an abstract sense, you should be able to better recognize what’s going on in a concrete sense. Open up a new Livebook and install Nx:
Mix.install([
  {:nx, "~> 0.2.0"}
])
:ok
Now let's generate some training data. Rather than dealing with probabilities, let's just assume true_f spits out a dot product between params and x:
true_params = Nx.random_uniform({16})

true_f = fn params, x ->
  params
  |> Nx.dot(x)
end

train_data =
  for _ <- 1..10000 do
    x = Nx.random_uniform({16})
    {x, true_f.(true_params, x)}
  end
[
{#Nx.Tensor<
f32[16]
[0.07947574555873871, 0.026878003031015396, 0.5867477655410767, 0.5776385068893433, 0.9754040241241455, 0.5079066753387451, 0.3611658215522766, 0.7247434854507446, 0.6224258542060852, 0.0817679837346077, 0.18870306015014648, 0.9963228702545166, 0.6838437914848328, 0.7353075742721558, 0.4642966091632843, 0.6851630210876465]
>,
#Nx.Tensor<
f32
4.527798652648926
>},
{#Nx.Tensor<
f32[16]
[0.3381848633289337, 0.3867352604866028, 0.5400522947311401, 0.03547948971390724, 0.7606191635131836, 0.1566101759672165, 0.291944682598114, 0.42579999566078186, 0.6438153982162476, 0.4992257356643677, 0.30716437101364136, 0.9808345437049866, 0.4933328628540039, 0.4456803798675537, 0.6096257567405701, 0.6286845207214355]
>,
#Nx.Tensor<
f32
4.5737714767456055
>},
{#Nx.Tensor<
f32[16]
[0.06392016261816025, 0.5789399743080139, 0.27656567096710205, 0.6276429295539856, 0.5487242341041565, 0.3903144896030426, 0.051697079092264175, 0.873468816280365, 0.9662443995475769, 0.4221976697444916, 0.5376619100570679, 0.38977575302124023, 0.03834615647792816, 0.09812478721141815, 0.31701961159706116, 0.5563293695449829]
>,
#Nx.Tensor<
f32
3.5150163173675537
>},
...
]
Now create a module which defines your model and objective functions. The form of measure_difference is just the mean squared error between the prediction and the label.
defmodule Learn do
  import Nx.Defn

  # f/2 is our parameterized function: a dot product between params and input
  defn f(params, input) do
    params
    |> Nx.dot(input)
  end

  defn model(params, input) do
    prediction = f(params, input)
    prediction
  end

  # mean squared error between prediction and label
  defn measure_difference(prediction, label) do
    prediction
    |> Nx.subtract(label)
    |> Nx.power(2)
    |> Nx.mean()
  end

  defn objective(params, input, label) do
    prediction = model(params, input)
    measure_difference(prediction, label)
  end

  # one gradient descent step with a learning rate of 1.0e-3
  defn step(params, input, label) do
    direction_of_descent = grad(params, &objective(&1, input, label))
    params - direction_of_descent * 1.0e-3
  end
end
{:module, Learn, <<70, 79, 82, 49, 0, 0, 19, ...>>, {:step, 3}}
Now to actually run the gradient descent, we need an initial set of parameters:
params = Nx.random_normal({16})
#Nx.Tensor<
f32[16]
[-0.7124313116073608, -0.08391565829515457, 0.39997708797454834, 0.37166494131088257, -0.8821085095405579, 1.1669548749923706, -0.7852609157562256, -1.3352124691009521, 0.8125752806663513, 0.5284674763679504, 0.4762270152568817, -1.5248165130615234, -0.5238316059112549, -1.1385467052459717, 2.1005051136016846, -1.7426177263259888]
>
We can reduce through the dataset, applying step at each observation, to slowly learn true_params:
recovered_params =
  for _ <- 1..10, reduce: params do
    params ->
      for {input, label} <- train_data, reduce: params do
        params ->
          Learn.step(params, input, label)
      end
  end
#Nx.Tensor<
f32[16]
[0.45476144552230835, 0.7819427251815796, 0.20503632724285126, 0.20784883201122284, 0.8274795413017273, 0.3705921173095703, 0.6816828846931458, 0.11929456144571304, 0.5488267540931702, 0.9811261892318726, 0.6553477048873901, 0.5045632719993591, 0.6943572163581848, 0.9759575128555298, 0.7533358335494995, 0.455101877450943]
>
Okay, so how well did we do? Let's measure the error between model and true_f on a number of inputs:
differences =
  for _ <- 1..1000 do
    x = Nx.random_uniform({16})
    pred = Learn.model(recovered_params, x)
    actual = true_f.(true_params, x)
    Learn.measure_difference(actual, pred)
  end
  |> Nx.tensor()
  |> Nx.mean()
#Nx.Tensor<
f32
2.6374322864564093e-11
>
The error is nearly 0! Let's see how this compares to our initial parameters:
differences =
  for _ <- 1..1000 do
    x = Nx.random_uniform({16})
    pred = Learn.model(params, x)
    actual = true_f.(true_params, x)
    Learn.measure_difference(actual, pred)
  end
  |> Nx.tensor()
  |> Nx.mean()
#Nx.Tensor<
f32
38.93861389160156
>
It worked!
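As one extra sanity check (this step is not part of the original run above), you can also compare the recovered parameters to the true parameters directly. If training went as well as the error suggests, the largest per-element difference should be tiny:

# extra check, not part of the original walkthrough: largest absolute difference
# between the parameters gradient descent recovered and the true parameters
recovered_params
|> Nx.subtract(true_params)
|> Nx.abs()
|> Nx.reduce_max()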
Conclusion
I hope this article offers you a fresh perspective on what it really means for a machine to learn. If you have any feedback or any topics you’d like to see in the future, please don’t hesitate to reach out.
The Nx team has some exciting things planned for the future. Come chat with us in the EEF ML WG Slack, and be sure to follow me and DockYard on Twitter to keep up with future posts.