Regression using Machine Learning

Authors

Andreu Vall

Mohamed Tarek

1 Learning objectives

Understand the basics of regression analysis using machine learning.
Understand the concept of robust regression and the use of mean absolute error (MAE).
Learn how to use neural networks for capturing nonlinear relationships in data.
Understand the concepts of underfitting and overfitting in machine learning models.
Explore the impact of model complexity and training iterations on model performance.
Learn about validation loss and its role in preventing overfitting.
Understand the importance of regularization and how to apply it in machine learning models.
Explore hyperparameter tuning and its role in optimizing model performance.

2 Setup

First, we need to load the necessary packages. We also set the neural network backend to :simplechains for this tutorial. DeepPumas supports multiple neural network backends. For more on backends, see the DeepPumas documentation.

using DeepPumas
using CairoMakie
using Distributions
using Random
using Optim: BFGS
Random.seed!(123)

set_mlp_backend(:simplechains)

3 A simple machine learning (ML) model

3.1 Linear `true_function` sample

First, we sample some synthetic data from known distributions. The input/covariate/regressor x is sampled from a uniform distribution. And the residual error ϵ is sampled from a standard normal distribution. The output/response y is then derived using the following formula:

\[ y = f(x) + \sigma \cdot \epsilon \]

where \(f(x) = x\) is true_function in the code snippet below, and \(\sigma\) is the standard deviation of the noise term in \(y\).

num_samples = 100
uniform = Uniform(-1, 1)
normal = Normal(0, 1)
σ = 0.25

true_function = x -> x # identity function
x = rand(uniform, 1, num_samples)
ϵ = rand(normal, 1, num_samples)
y = true_function.(x) .+ σ * ϵ

We then visualize the generated data.

fig = scatter(vec(x), vec(y); axis = (xlabel = "x", ylabel = "y"), label = "data")
lines!(-1 .. 1, true_function; color = :gray, label = "true")
axislegend(position = :rb)
fig

3.2 Linear regression model

In this section, we’ll fit a linear regression model to the synthetic data using DeepPumas. MLPDomain is used to define a “neural network” with a single input and single output node, no hidden layers, and an identity activation function. This setup is equivalent to a linear regression model.

When fitting a classical regression model using ordinary least squares (OLS), the objective function to minimize is the mean squared error (MSE) between the predicted values and the true values. The MSE is defined as:

\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]

Minimizing the MSE has an analytic solution that can be computed directly without the need for numerical optimization.

However, in the presence of potential outliers, the mean absolute error (MAE) is often preferred over the MSE as it is less sensitive to outliers. The MAE is defined as:

\[ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \]

Minimizing the MAE is often called robust regression. In DeepPumas, the fit function uses the MAE as the default loss function when fitting regression models. Robust regression generally does not have an analytic solution even in the linear case, so it requires numerical optimization methods to find the best-fitting parameters. In this case, we use the BFGS optimization algorithm, which is a quasi-Newton method that approximates the inverse Hessian matrix to find the optimal parameters iteratively.

To define a linear regression model in DeepPumas, we create an instance of MLPDomain as follows:

linreg = MLPDomain(1, (1, identity); bias = true)

We then call the preprocess function to combine x and y in a single object and call the fit function to fit the model to the data. The preprocess function is used to prepare the data for fitting, and it can handle various types of inputs and outputs. In this case, we are simply combining the input x and output y into a single object that can be passed to the fit function.

target = preprocess(x, y)
fitted_linreg = fit(linreg, target; optim_alg = BFGS())

3.3 Fitting regression models

The fit function returns a fitted model that can be used to make predictions on new data. The fitted model contains the learned parameters of the linear regression model. After fitting, we visualize the model’s predictions alongside the original data and the true underlying function \(f\) to assess the fit.

# get predictions
ŷ = fitted_linreg(x)
# plot the dat and predictions
fig = scatter(vec(x), vec(y); axis = (xlabel = "x", ylabel = "y"), label = "data")
scatter!(vec(x), vec(ŷ); label = "prediction")
lines!(-1 .. 1, true_function; color = :gray, label = "true")
axislegend(position = :rb)
fig

4 Capturing nonlinear relationships

4.1 Quadratic `true_function`

In this section, we explore what happens when the true function is quadratic and we use a linear function to fit the data. We first sample some data using a similar procedure as before but where the true_function is \(f(x) = x^2\).

true_function = x -> x^2
x = rand(uniform, 1, num_samples)
ϵ = rand(normal, 1, num_samples)
y = true_function.(x) .+ σ * ϵ

fig = scatter(vec(x), vec(y); axis = (xlabel = "x", ylabel = "y"), label = "data")
lines!(-1 .. 1, true_function; color = :gray, label = "true")
axislegend()
fig

4.2 Linear regression again

We follow a similar procedure as before:

Fitting a linear model to the data using the DeepPumas functions,
Using the fitted model to make predictions, and
Plotting the data next to the predictions and the true underlying function.

target = preprocess(x, y)
fitted_linreg = fit(linreg, target; optim_alg = BFGS(), optim_options = (; iterations = 50))

ŷ_ex22_50iter = fitted_linreg(x)
fig = scatter(vec(x), vec(y); axis = (xlabel = "x", ylabel = "y"), label = "data")
scatter!(vec(x), vec(ŷ_ex22_50iter); label = "prediction")
lines!(-1 .. 1, true_function; color = :gray, label = "true")
axislegend()
fig

Not surprisingly, the linear function was not able to capture the nonlinear shape of the true data generating quadratic function.

In a traditional linear regression setting, the modeler would then analyze the plot of the response \(y\) vs the covariate \(x\), and/or the plot of the residual \(y - \hat{y}\) vs covariate \(x\), and come to the conclusion that a nonlinear term needs to be added, e.g. a quadratic term. This is still called linear regression because the model would still be linear with respect to the model parameters, despite the nonlinearity in the covariates. In many cases, such manual enhancement of the model works well but can be time-consuming especially with a large number of input covariates.

Alternatively, we can utilize a flexible nonlinear function such as a neural network to fit the nonlinear relationship in the data. Using a neural network can enhance the model’s flexibility and the modeler’s productivity, but it sacrifices model interpretability and compactness, without any further post-processing.

4.3 Using a neural network (NN)

To make the MLPDomain a nonlinear neural network, we introduce 2 changes:

We add a hidden layer between the input and output layers, and
We make the activation function of the hidden layer nonlinear.

The addition of the hidden layer without the nonlinear activation function is useless! This is because multiple linear layers are exactly as expressive as a single linear layer. By adding nonlinearity to the activation function of the hidden layer, we have created a multilayered perceptron (MLP), hence the name of the struct MLPDomain.

We repeat the same procedure used to do linear regression above but with an MLP instead, with a single hidden layer of 10 nodes and a tanh activation function. There are many alternative activation functions that one could have used. In DeepPumas, you can also define your own Julia function and use it as an activation function!

nn = MLPDomain(1, (10, tanh), (1, identity); bias = true)
fitted_nn = fit(nn, target; optim_alg = BFGS(), optim_options = (; iterations = 50))

ŷ = fitted_nn(x)
fig = scatter(vec(x), vec(y); axis = (xlabel = "x", ylabel = "y"), label = "data")
scatter!(vec(x), vec(ŷ); label = "prediction")
lines!(-1 .. 1, true_function; color = :gray, label = "true")
axislegend()
fig

We see that the neural network is able to capture the essence of the nonlinear relationship in the data.

5 Basic underfitting and overfitting

When fitting a flexible neural network to data, it is important to consider the concepts of underfitting and overfitting. Underfitting occurs when the model is too simple to capture the underlying relationship in the data, e.g. the linear model above, while overfitting occurs when the model is too complex and captures noise in the data instead of the true signal.

5.1 Number of iterations

Previously, we used 50 iterations to fit the neural network. We can see how the number of iterations affects the model’s performance by fitting the model with fewer and more iterations. Note that no regularization is used in this example, so the model is prone to overfitting. Fitting with a few iterations is expected to underfit the data, while fitting with many iterations is expected to overfit the data.

We demonstrate this by comparing the predictions of the neural network after 2, 50, and 5000 iterations.

underfit_nn = fit(nn, target; optim_alg = BFGS(), optim_options = (; iterations = 2))
overfit_nn = fit(nn, target; optim_alg = BFGS(), optim_options = (; iterations = 5000))

ŷ_underfit = underfit_nn(x)
ŷ_overfit = overfit_nn(x)

fig = scatter(vec(x), vec(y); axis = (xlabel = "x", ylabel = "y"), label = "data")
scatter!(vec(x), vec(ŷ_underfit); label = "prediction (2 iter)")
scatter!(vec(x), vec(ŷ); label = "prediction (50 iter)")
scatter!(vec(x), vec(ŷ_overfit); label = "prediction (1000 iter)")
lines!(-1 .. 1, true_function; color = :gray, label = "true")
axislegend()
fig

Note that the model with 2 iterations is underfitting the data, while the model with 5000 iterations is overfitting the data. The model with 50 iterations is a good compromise between underfitting and overfitting.

5.2 Larger NN

The size of the neural network can also affect the model’s performance. A larger neural network can capture more complex relationships in the data, but it is also more prone to overfitting. We demonstrate this by comparing the predictions of the previously defined neural network with a larger neural network with 2 hidden layers, each with 32 nodes. Further fitting with 5000 iterations is expected to overfit the data even more than the previous model, since no regularization is used.

nn = MLPDomain(1, (32, tanh), (32, tanh), (1, identity); bias = true)
fitted_nn = fit(nn, target; optim_alg = BFGS(), optim_options = (; iterations = 5000))

ŷ = fitted_nn(x)
fig = scatter(vec(x), vec(y); axis = (xlabel = "x", ylabel = "y"), label = "data")
scatter!(vec(x), vec(ŷ); label = "prediction MLP(1,32,32,1)")
lines!(-1:0.1:1, true_function.(-1:0.1:1); color = :gray, label = "true")
axislegend()
fig

Looking at the predictions, we see the large NN trying to predict the supposedly unpredictable noise instead of the true underlying function.

6 Validation loss, regularization, and hyperparameter tuning

6.1 Validation loss

The loss function is a measure of how well the model is performing on a dataset. In this tutorial, the MAE is used as the loss function when fitting and evaluating models. When evaluating the performance of a machine learning model, it is important to use a validation set that is separate from the training set. This helps to assess how well the model generalizes to unseen data. In this section, we will create a validation set and track the validation loss during training. When the validation loss stops improving, we can stop training to avoid overfitting. Since overfitting is all about fitting the noise in the training data and noise is assumed to be random and different between each data point, a model that is fitted to the noise in the training data will generally not perform well on the validation set which has independent and unpredictable noise components. So the validation loss can be used to detect when overfitting is likely happening.

To demonstrate this, we will first create a validation set by sampling new data points from the same distribution as the training data. We will then fit the model to the training data and track the validation loss during training.

x_train, y_train = x, y
target_train = target

ϵ = rand(normal, 1, num_samples)
x_valid = rand(uniform, 1, num_samples)
y_valid = true_function.(x_valid) .+ σ * ϵ
target_valid = preprocess(x_valid, y_valid)

fig = scatter(vec(x_train), vec(y_train); label = "train")
scatter!(vec(x_valid), vec(y_valid); label = "valid")
lines!(-1 .. 1, true_function; color = :gray, label = "true")
axislegend()
fig

# history of training and validation losses
loss_train_l, loss_valid_l = Float64[], Float64[]

# run an initial fit with 10 iterations to define fitted_nn
fitted_nn = fit(nn, target_train; optim_alg = BFGS(), optim_options = (; iterations = 10))
push!(loss_train_l, mae(fitted_nn(x_train), y_train))
push!(loss_valid_l, mae(fitted_nn(x_valid), y_valid))

# run 10 more iterations at a time, and re-define the global `fitted_nn`, then track the training and validation losses
for _ = 2:100
    global fitted_nn = fit(
        nn,
        target_train,
        coef(fitted_nn);
        optim_alg = BFGS(),
        optim_options = (; iterations = 10),
    )
    push!(loss_train_l, mae(fitted_nn(x_train), y_train))
    push!(loss_valid_l, mae(fitted_nn(x_valid), y_valid))
end

# plotting the training and validation losses
iteration = 10 .* (1:100)
fig, ax = scatterlines(
    iteration,
    Float32.(loss_train_l);
    label = "training",
    axis = (; xlabel = "Iteration", ylabel = "MAE"),
)
scatterlines!(iteration, Float32.(loss_valid_l); label = "validation")
axislegend()
fig

We see that the training loss decreases as the number of iterations increases, while the validation loss initially decreases and then starts to increase after a certain number of iterations. This indicates that the model is overfitting the training data after a certain point. The point at which the validation loss starts to increase is called the “early stopping” point, and it is a good indication of when to stop training to avoid overfitting.

6.2 Regularization

While the modeler can use the validation loss tracking method to determine when to stop training, it is often useful to have a more principled way to prevent overfitting. Regularization is a technique that adds a penalty term to the loss function to discourage the model from fitting the noise in the training data.

Regularization is also commonly in traditional linear regression. In linear regression, the regularization term is added to the loss function when fitting to penalize large coefficients. The two most common types of regularization are L1 (LASSO) and L2 (Ridge) regularization. L1 regularization adds the absolute value of the coefficients to the loss function, while L2 regularization adds the square of the coefficients to the loss function.

\[ \text{L1 regularization penalty: } \lambda \cdot \sum_{i=1}^{p} |w_i| \]

\[ \text{L2 regularization penalty: } \lambda \cdot \sum_{i=1}^{p} w_i^2 \]

where \(\lambda\) is the regularization strength, \(w_i\) are the model coefficients, and \(p\) is the number of coefficients. The regularization strength \(\lambda\) controls how much the regularization term affects the loss function. A larger \(\lambda\) will result in a stronger penalty on the coefficients, while a smaller \(\lambda\) will result in a weaker penalty.

In DeepPumas, we can use L1 or L2 regularization by specifying a reg argument when defining the MLPDomain. The following code snippet shows how to define an MLP with L2 regularization:

reg_nn = MLPDomain(1, (32, tanh), (32, tanh), (1, identity); bias = true, reg = L2(0.1))

We then fit the model using the same procedure as before, but now the loss function will include the regularization term. The regularization term is added to the loss function during training, and it is used to penalize large coefficients in the model. The regularization term is not included in the validation loss, so we can still track the validation loss without the regularization term affecting it.

reg_loss_train_l, reg_loss_valid_l = Float64[], Float64[]

fitted_reg_nn =
    fit(reg_nn, target_train; optim_alg = BFGS(), optim_options = (; iterations = 10))
push!(reg_loss_train_l, mae(fitted_reg_nn(x_train), y_train))
push!(reg_loss_valid_l, mae(fitted_reg_nn(x_valid), y_valid))

for _ = 2:100
    global fitted_reg_nn = fit(
        reg_nn,
        target_train,
        coef(fitted_reg_nn);
        optim_alg = BFGS(),
        optim_options = (; iterations = 10),
    )
    push!(reg_loss_train_l, mae(fitted_reg_nn(x_train), y_train))
    push!(reg_loss_valid_l, mae(fitted_reg_nn(x_valid), y_valid))
end

iteration = 10 .* (1:100)
fig, ax = scatterlines(
    iteration,
    Float32.(loss_train_l);
    label = "training",
    axis = (; xlabel = "Blocks of 10 iterations", ylabel = "Mean absolute loss"),
);
scatterlines!(iteration, Float32.(loss_valid_l); label = "validation");
scatterlines!(iteration, Float32.(reg_loss_train_l); label = "training (L2)");
scatterlines!(iteration, Float32.(reg_loss_valid_l); label = "validation (L2)");
axislegend();
fig

We see that the validation loss does not increase as the number of iterations increases, indicating that the model is not overfitting the training data.

6.3 Programmatic hyperparameter tuning

A very high value of \(\lambda\) can indeed prevent overfitting by driving most parameters to be close to 0. However, this can also lead to severe underfitting. On the other hand, a very low value of \(\lambda\) can lead to overfitting. The optimal value of \(\lambda\) is often determined through hyperparameter tuning, which is the process of searching for the best hyperparameters for a model. Hyperparameter tuning is an essential step in machine learning, as it can significantly improve the model’s performance. To find the best hyperparameters, we can use techniques such as grid search, random search, or Bayesian optimization. These methods systematically explore the hyperparameter space to find the best combination of hyperparameters that minimizes the validation loss.

Cross-validation is often used to evaluate the performance of the model with different hyperparameters, where the dataset is split into multiple folds, and the model is trained and evaluated on each fold.

In DeepPumas, we can use the hyperopt function to perform hyperparameter tuning. The hyperopt function takes a model, a target dataset, and an optional algorithm for hyperparameter optimization. An example using hyperopt to find the best value of \(\lambda\) for L2 regularization is shown below, where we define a range of values for \(\lambda\) to search over by trying resolution = 30 evenly-spaced values in this interval in log scale. 7-fold cross-validation is used to evaluate the performance of the model with different hyperparameter values. The best hyperparameter value is then used to fit the model again using the whole training dataset.

using Pumas: KFold

nn_ho = hyperopt(
    reg_nn,
    target_train,
    GridSearch(
        (; lambda = (1e-7, 1e4));
        criteria = KFold(K = 7),
        scale = :log10,
        resolution = 30,
    ),
)
nn_ho.best_hyperparameters

(lambda = 0.008531678524172805,)

The best model can then be used to make predictions on the validation set, and we can visualize the predictions alongside the validation data and the true underlying function.

ŷ_ho = nn_ho(x_valid)

fig = scatter(vec(x_valid), vec(y_valid); label = "validation data");
scatter!(vec(x_valid), vec(ŷ_ho), label = "prediction (hyperparam opt.)");
lines!(-1 .. 1, true_function; color = :gray, label = "true");
axislegend(; position = :ct);
fig

We see that the hyperparameter tuning has found a good value of \(\lambda\) that prevents overfitting while still capturing the underlying relationship in the data.

Reuse

CC BY-SA 4.0