Logistic Regression
using Pumas
This lesson is really comprehensive and has a wide audience in mind. We recommend that the reader briefly navigate the Table of Contents and choose which topics to dive into. If you are in doubt, feel free to read it from top to bottom.
In this lesson we’ll dive into logistic regression. We’ll start by defining what it is and, more importantly, when to use it.
1 What is Logistic Regression?
Logistic regression is the most widely used generalized linear model (GLM). GLMs use linear regression under the hood for inference but combine it with a non-linear transformation of the dependent variable (also called the response).
This allows us to extend the interface of linear regression to use cases where it could not be applied directly. This happens because of the nature of some dependent variables, which are not linear and, consequently, not continuous. Such instances occur with binary data, discrete data, count data, or ordinal data.
Logistic Regression is a GLM that can handle binary data. That is, when our response variable takes two distinct values only, generally coded as 0 and 1. It handles that by applying a non-linear transformation to the response variable. And guess what? This transformation is called the logistic function! This is why it is called logistic regression.
The logistic function is also called the sigmoid function because of the sigmoidal shape (S-shaped) of its graph.
This is the logistic function:
\[\mathrm{logistic}(x) = \frac{1}{1 + e^{-x}}\]
Logistic regression can be extended to more discrete outcomes than binary, i.e., it can also handle multiclass problems. If you want to learn more, search for softmax regression or multinomial logistic regression.
Let’s take a look at the graph of our logistic function:
using CairoMakie
using AlgebraOfGraphics
logistic_fun(x) = 1 / (1 + exp(-x));
data((; x = -10:0.1:10, y = logistic_fun.(-10:0.1:10))) *
mapping(:x, :y => "logistic(x)") *
visual(Lines; color = :steelblue, linewidth = 3) |> draw
As you can see, the logistic function is a mapping from \((-\infty, +\infty)\) to \((0, 1)\). In other words, it compresses any real number into a value between 0 and 1.
1.1 Dependent Variable \(y\)
This makes it the ideal candidate to model things that are defined between 0 and 1, such as probabilities. Thus, with binary data we can focus on the modeling of the probability that our dependent variable will take values 0 or 1. That is, if we say our dependent variable is \(y\) and the probability of it taking value 1 conditioned on covariate \(x\) is \(P(y=1 \mid x)\), then the logistic regression will give us:
\[P(y=1 \mid x) = \mathrm{logistic}(x) = \frac{1}{1 + e^{-x}}\]
The \(\mid\) in \(P(y \mid x)\) denotes the probability of \(y\) conditioned on \(x\). That is, once we know what value \(x\) takes, what is the probability of \(y\). This is the classical conditional probability notation.
1.2 Covariate \(x\)
Now with the dependent variable explained, let’s focus on covariate(s) \(x\). Let’s begin with the simple case when \(x\) is a scalar (not a vector).
Generally in regression problems, we attribute coefficients to act as weights on our covariate \(x\). These coefficients measure the strength of association between \(x\) and \(y\): large positive coefficients mean a strong positive association, while large negative coefficients indicate a strong negative association.
Coefficients are usually denoted by the Greek letter \(\beta\). Thus, our logistic regression formula becomes:
\[P(y=1 \mid x, \beta) = \mathrm{logistic}(x, \beta) = \frac{1}{1 + e^{-\left(\beta x\right)}}\]
Almost everything that we want to infer or estimate in statistics is denoted by Greek letters, such as \(\beta\) and \(\alpha\), while what is observed is denoted by Roman letters, such as \(x\) and \(y\).
Since \(y\) and \(x\) are known, all we have to do is find the best value of our coefficient \(\beta\). That process is called model fitting, and we will deal with it in detail later in this lesson.
1.2.1 How to Interpret Coefficient \(\beta\)?
Now, suppose we have found our ideal value for \(\beta\). How would we interpret the estimated value of \(\beta\)?
First, to recap, \(\beta\) measures the strength of association between the covariate \(x\) and the dependent variable \(y\). But \(\beta\) is nested inside a non-linear transformation called the logistic function:
\[\mathrm{logistic}(\beta) = \frac{1}{1 + e^{-\beta}}\]
So, our first step is to undo the logistic function. The logit function is the inverse of the logistic function:
\[\mathrm{logistic}^{-1}(x) = \mathrm{logit}(x) = \ln\left(\frac{x}{1 -x}\right)\]
where \(\ln\) is the natural logarithm function.
If we analyze the logit function closely, we can find that inside the \(\ln\) there is an odds in disguise: \(\frac{x}{1 - x}\). Hence, our \(\beta\) can be interpreted as the logarithm of the odds, or in short form: the log-odds.
Odds provide a measure of the likelihood of a particular outcome. They are calculated as the ratio of the number of events that produce that outcome to the number that do not. To convert probability \(p\) to \(\mathrm{odds}\) you can use the formula:
\[\mathrm{odds}(p) = \frac{p}{1-p}\]
The higher the odds, the higher the probability.
Since odds take values in \((0, \infty)\), with 1 interpreted as neutral odds (like probability 0.5), the log-odds have a nice intuition: any negative value, i.e. the \(\ln\) of something in \((0, 1)\), lowers \(P(y=1)\), while any positive value, i.e. the \(\ln\) of something in \((1, \infty)\), increases \(P(y=1)\). A log-odds of 0 is exactly \(\ln(1)\), i.e. odds of 1:1, which means a neutral effect on \(P(y=1)\).
The log-odds are the key to interpreting the coefficient \(\beta\). Any positive value of \(\beta\) means that there is a positive association between \(x\) and \(P(y=1)\), while any negative value indicates a negative association between \(x\) and \(P(y=1)\). Values close to 0 indicate a lack of association between \(x\) and \(P(y=1)\).
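To make these relationships concrete, here is a small Julia sketch (the helper names odds and logit are ours, defined just for illustration; they are not part of any package used in this lesson):
odds(p) = p / (1 - p)
logit(p) = log(p / (1 - p))   # the inverse of the logistic function

odds(0.5)    # 1.0 -> neutral odds
logit(0.5)   # 0.0 -> neutral log-odds, neutral effect on P(y = 1)
logit(0.8)   # ≈ 1.386, positive log-odds: increases P(y = 1)
logit(0.2)   # ≈ -1.386, negative log-odds: decreases P(y = 1)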
1.2.2 Covariate \(\mathbf{x}\) as a Vector
Now that we covered the case when \(x\) is a scalar, let’s tackle the case when \(\mathbf{x}\) is a vector.
Note that bold case letters are restricted to denoting vectors and matrices. So, \(x\) is a scalar and \(\mathbf{x}\) is a vector.
Also, matrices are distinguished from vectors by using uppercase letters. For example, \(\mathbf{A}\) is a matrix while \(\mathbf{a}\) is a vector.
Furthermore, we can denote elements of a vector or matrix by their respective index. If we want the \(i\)th element of a vector \(\mathbf{x}\), we denote the element \(x_i\). Since matrices have two dimensions: rows and columns; we need an extra index. Thus, if we want the element of the \(i\)th row and \(j\)th column of a matrix \(\mathbf{A}\), we denote the element \(A_{ij}\) (also some use the lowercase \(a_{ij}\) notation).
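As a small aside, Julia's indexing syntax mirrors this mathematical notation (a minimal illustration with made-up values):
x = [10, 20, 30]    # a vector x
x[2]                # the element x₂ -> 20
A = [1 2; 3 4]      # a 2×2 matrix A
A[1, 2]             # the element A₁₂ (row 1, column 2) -> 2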
If we replace our covariate scalar for a vector of covariates in the logistic function, we get:
\[\mathrm{logistic}(\mathbf{x}) = \frac{1}{1 + e^{-\mathbf{x}}} = \frac{1}{1 + e^{-\left( x_1 + x_2 + ... + x_k \right)}}\]
where \(k\) is the length of \(\mathbf{x}\), that is the total number of covariates that we have.
The same way covariates can be represented as a vector, the coefficients \(\beta\) can also be represented as a vector \(\boldsymbol{\beta}\). The covariate-coefficient products can be efficiently represented as a matrix-vector product:
\[\beta_1 x_1 + \beta_2 x_2 + ... + \beta_k x_k = \mathbf{X} \cdot \boldsymbol{\beta}\]
here \(\mathbf{X}\) is our data matrix, where each row is an observation and each column a covariate.
To conclude, our logistic function in the case of multiple covariates can be expressed in matrix notation as:
\[\mathrm{logistic}(\mathbf{X}) = \frac{1}{1 + e^{- \left( \mathbf{X} \cdot \boldsymbol{\beta} \right)}}\]
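As a quick sketch of this matrix form (with made-up numbers, not the nausea data we will use later), the computation is just a matrix-vector product followed by an element-wise logistic:
X = [1.0 2.0; 0.5 1.5; 2.0 0.0]    # 3 observations (rows), 2 covariates (columns)
β = [0.3, -0.8]                    # one coefficient per covariate
p = 1 ./ (1 .+ exp.(-(X * β)))     # P(y = 1) for each of the 3 observations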
1.3 Intercept \(\alpha\)
Like in any other regression, we also need an intercept to act as our basal rate. That is the expected value of \(y\), in our model, when all covariates are fixed to 0.
In logistic regression the intercept represents the basal probability of our observations taking value \(y=1\), conditioned that all the covariates are zero:
\[P(y=1 \mid \mathbf{x} = \mathbf{0})\]
Traditionally, intercepts are represented as the greek letter \(\alpha\).
This is the final addition that we need to our logistic function. So, our final version of the logistic function in a logistic regression is:
\[\mathrm{logistic}(\mathbf{X}) = \frac{1}{1 + e^{- \left( \alpha + \mathbf{X} \cdot \boldsymbol{\beta} \right)}}\]
where \(\alpha\) is a scalar and \(\alpha + \mathbf{X} \cdot \boldsymbol{\beta}\) boils down to a scalar-vector addition:
\[\mathbf{\hat{y}} = \alpha + \mathbf{X} \cdot \mathbf{\beta} = \alpha + \mathbf{\hat{X}}\]
where \(\mathbf{\hat{y}}\) and \(\mathbf{\hat{X}}\) are the predicted vector of dependent variables and covariate matrix, respectively.
One more mathematical notation tip: almost everything that has a "hat" on it, like \(\hat{x}\), is flagged as the "predicted" or "estimated" version of the symbol. Thus \(\hat{x}\) could be either a predicted or an estimated \(x\). In other words, we do not observe \(\hat{x}\); we recover it from a model based on model assumptions and conditional on data (though in simulations the data conditioning does not hold).
1.4 Likelihoods
We have almost everything we need for our logistic regression. We are only missing a final touch. Currently, our logistic function outputs a probability, i.e. a number between 0 and 1. That is not what we need in a binary data context.
Hence, we need to convert the probability into a discrete binary outcome. That is the role of the likelihood function. More generally, likelihood functions connect the result of the non-linear mapping to the nature of our dependent variable. And, since we are in a discrete binary outcome setting (logistic regression, with a binary dependent variable), we need a likelihood function that outputs a discrete binary outcome. This could be anything like yes or no, 0 or 1, cat or dog, sad or happy, rain or shine. Actually, the label of the discrete outcome does not matter, so let's use a fail/success discrete outcome from now on.
In logistic regression, we can use primarily two likelihood functions to model binary discrete outcomes:
- Bernoulli distribution
- Binomial distribution
To be precise, likelihood functions are not the same as probability functions, since they do not sum or integrate to 1. They can be seen as unnormalized probability functions.
1.4.1 Bernoulli Likelihood
The Bernoulli distribution is named after mathematician Jacob Bernoulli, who is considered one of the fathers of statistical inference (alongside Laplace, Bayes, Quetelet and others), and also the first to discover the Law of Large Numbers.
The Bernoulli distribution is used to model an independent single trial that has a binary fail/success outcome. It accepts only a single parameter named \(p\) which represents the probability of trial success.
Distributional parameters are values that control how a distribution behaves. They can be integers or reals, take restrained or unrestrained values and be either scalars, vectors or matrices. In a typical statistical inference setting, distributions are used to model data and the parameters are unknown. Thus, parameter inference plays a main role in statistical inference.
In most logistic regression applications, a 0.5 threshold is used to turn the Bernoulli likelihood's probability into an outcome. So, anything with \(p < 0.5\) has outcome fail, or 0, and anything with \(p \geq 0.5\) has outcome success, or 1.
Some fields and applications fine-tune the threshold of the logistic regression's Bernoulli likelihood. Notably, machine learning is one such field, since the focus there is on prediction accuracy rather than parameter inference.
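As a minimal sketch of the conventional threshold (the variable names are ours, just for illustration):
p = 0.62
outcome = p >= 0.5 ? 1 : 0    # 1 (success); with p = 0.38 the outcome would be 0 (fail)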
If your data is composed of observations that have a dependent variable that only takes binary discrete outcomes, you’ll probably use the Bernoulli likelihood inside the logistic regression model.
Take a look at some Bernoulli distribution plots for different \(p\):
using Distributions
x = [0, 1]
probs = [0.5, 0.25, 0.75]
dists = [Bernoulli(p) for p in probs]
pdfs = mapreduce(d -> pdf.(d, x), vcat, dists)
plt =
    data((; probs = repeat(probs; inner = 2), x = repeat(x; outer = 3), pdfs)) *
    mapping(
        :x => "outcome",
        :pdfs => "PMF";
        color = :probs => nonnumeric => "parameter p",
        col = :probs => nonnumeric,
    ) *
    visual(BarPlot)
draw(plt; axis = (; xticks = [0, 1]))
1.4.2 Binomial Likelihood
The binomial distribution is used to model the number of successes out of a number of independent trials, each having a success/fail outcome. It has two parameters: \(n\), the number of trials; and \(p\), the probability of trial success.
The Bernoulli distribution is a special case of the binomial distribution where \(n\) is fixed to 1, i.e. a single independent trial.
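We can check this special-case relationship directly with Distributions.jl (a small illustration, not part of the modeling workflow below):
using Distributions
pdf(Bernoulli(0.3), 1) ≈ pdf(Binomial(1, 0.3), 1)    # true, both are 0.3
pdf(Bernoulli(0.3), 0) ≈ pdf(Binomial(1, 0.3), 0)    # true, both are 0.7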
If your data is composed of columns that can be represented as attempts and successes, and you want to use successes as a dependent variable; then binomial likelihood is a great candidate for the likelihood function inside the logistic regression model.
Sometimes grouping your data from the trial observation level, i.e. every row is a single trial, to a success observation level, i.e. every row is a number of attempts and successes, can make your model much more efficient and faster to run.
For example, if you have data as:
ID | result |
---|---|
a | 1 |
b | 0 |
a | 0 |
a | 1 |
b | 1 |
And all trials (rows) are independent (this is important), you can convert it to the following:
ID | attempts | successes |
---|---|---|
a | 3 | 2 |
b | 2 | 1 |
Then, you can model the same data using a binomial likelihood instead of a Bernoulli likelihood. This will be more efficient because we are grouping rows and, thus, our inference algorithm has less data to process during model fitting.
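As a minimal sketch of this grouping step with DataFrames.jl (using the toy ID/result data from the table above; the column names attempts and successes are our choice):
using DataFrames
trials = DataFrame(ID = ["a", "b", "a", "a", "b"], result = [1, 0, 0, 1, 1])
grouped = combine(
    groupby(trials, :ID),
    :result => length => :attempts,
    :result => sum => :successes,
)
# 2×3 DataFrame: ID a -> 3 attempts, 2 successes; ID b -> 2 attempts, 1 success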
Here are some plots of Binomial distributions for different \(n\) and \(p\):
x = 0:20
probs = [0.5, 0.25, 0.75]
ns = [5, 10, 20]
dists = [Binomial(n, p) for p in probs for n in ns]
pdfs = mapreduce(d -> pdf.(d, x), vcat, dists)
data((;
    probs = repeat([p for p in probs for n in ns]; inner = 21),
    ns = repeat([n for p in probs for n in ns]; inner = 21),
    x = repeat(x, 9),
    pdfs,
)) *
mapping(
    :x => "outcome",
    :pdfs => "PMF";
    color = :probs => nonnumeric => "parameter p",
    row = :probs => nonnumeric,
    col = :ns => nonnumeric,
) *
visual(BarPlot) |> draw
2 When to use Logistic Regression?
Now that we know what logistic regression is, let's cover when to use it. The simple answer is: whenever your dependent variable is dichotomous, i.e. a discrete binary outcome that takes only two distinct values. Then you can use a Bernoulli likelihood inside your logistic regression.
And if your observations (rows) are independent, you can group them into attempts and successes and use a binomial likelihood.
Independence in this context means that knowing information about one observation does not give us any information about the other observations and vice versa. This is a classical assumption in statistical models. The independence assumption generally holds, but if you have time-dependent measurements, then your data probably violates this assumption.
Let’s see some examples that are often modeled as logistic regression:
- Clinical Trial where the outcome is binary, such as the absence or presence of a certain symptom.
- Observational study where the outcome is binary, such as cardiovascular disease.
- Survival analysis that tracks a binary outcome, such as cancer survival studies.
- Patient adherence to a certain treatment.
3 nausea Dataset
Let's use the nausea dataset from PharmaDatasets.jl to run our logistic regression example in this lesson:
using PharmaDatasets
using DataFrames
using DataFramesMeta
nausea = dataset("nausea")
first(nausea, 5)
Row | ID | Group | AUC | isF | SEX | WT | RACE | TRT | NAUSEA |
---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | String7 | Int64 | String15 | String7 | Int64 | |
1 | 1 | 1 | 0 | 1 | Female | 74 | Caucasian | Placebo | 0 |
2 | 2 | 1 | 0 | 0 | Male | 62 | Asian | Placebo | 0 |
3 | 3 | 1 | 0 | 1 | Female | 72 | Caucasian | Placebo | 0 |
4 | 4 | 1 | 0 | 0 | Male | 77 | Asian | Placebo | 0 |
5 | 5 | 1 | 0 | 0 | Male | 44 | Caucasian | Placebo | 0 |
We can see that we have 200 observations and also some key variables:
- :NAUSEA (dependent variable): which has a binary outcome – 0, 1.
- :SEX: discrete binary – Female, Male.
- :WT: weight in kg.
- :RACE: discrete/categorical with 5 levels – Caucasian, Asian, Other, Hispanic, Black.
- :TRT: treatment variable with 4 levels – Placebo, Dose20, Dose40, Dose80.
- :Group: the same as :TRT but with integers – 1, 2, 3, 4.
- :AUC: Area Under the Curve.
3.1 Exploratory Data Analysis
Let's see some summary statistics with describe():
describe(nausea)
Row | variable | mean | min | median | max | nmissing | eltype |
---|---|---|---|---|---|---|---|
 | Symbol | Union… | Any | Union… | Any | Int64 | DataType |
1 | ID | 100.5 | 1 | 100.5 | 200 | 0 | Int64 |
2 | Group | 2.5 | 1 | 2.5 | 4 | 0 | Int64 |
3 | AUC | 561.85 | 0 | 190.5 | 4016 | 0 | Int64 |
4 | isF | 0.475 | 0 | 0.0 | 1 | 0 | Int64 |
5 | SEX |  | Female |  | Male | 0 | String7 |
6 | WT | 72.135 | 33 | 72.0 | 120 | 0 | Int64 |
7 | RACE |  | Asian |  | Other | 0 | String15 |
8 | TRT |  | Dose20 |  | Placebo | 0 | String7 |
9 | NAUSEA | 0.335 | 0 | 0.0 | 1 | 0 | Int64 |
As we can see, :AUC has mean 561.85 and median 190.5, which tells us it has a long-tailed distribution (asymmetric, shifted to the right).
Also, our :WT variable has a mean and median that are not so different, which signals a bell-shaped (normally-distributed) variable. We have weight values ranging from 33 kg to 120 kg.
To conclude our summary statistics, :NAUSEA, our dependent variable, has mean 0.335, which means that around 33.5% of our patients had a positive outcome for nausea symptoms.
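Since :NAUSEA only takes values 0 and 1, its mean is just the proportion of 1s, which we can verify directly (a small check, assuming the nausea DataFrame loaded above):
using Statistics
mean(nausea.NAUSEA)    # ≈ 0.335, matching the describe() output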
It is also important to check the categorical columns to see the number of unique values (:nunique) along with the first and last values:
describe(nausea, :first, :last, :nunique; cols = Not([:NAUSEA, :AUC, :WT]))
Row | variable | first | last | nunique |
---|---|---|---|---|
Symbol | Any | Any | Union… | |
1 | ID | 1 | 200 | |
2 | Group | 1 | 4 | |
3 | isF | 1 | 0 | |
4 | SEX | Female | Male | 2 |
5 | RACE | Caucasian | Caucasian | 5 |
6 | TRT | Placebo | Dose80 | 4 |
Ok, no surprises in our categorical covariates.
@by nausea [:TRT, :SEX, :NAUSEA] :count = length(:SEX)
Row | TRT | SEX | NAUSEA | count |
---|---|---|---|---|
String7 | String7 | Int64 | Int64 | |
1 | Placebo | Female | 0 | 19 |
2 | Placebo | Female | 1 | 9 |
3 | Placebo | Male | 0 | 22 |
4 | Dose20 | Female | 0 | 12 |
5 | Dose20 | Female | 1 | 6 |
6 | Dose20 | Male | 0 | 30 |
7 | Dose20 | Male | 1 | 2 |
8 | Dose40 | Female | 0 | 14 |
9 | Dose40 | Female | 1 | 14 |
10 | Dose40 | Male | 0 | 21 |
11 | Dose40 | Male | 1 | 1 |
12 | Dose80 | Female | 0 | 3 |
13 | Dose80 | Female | 1 | 18 |
14 | Dose80 | Male | 0 | 12 |
15 | Dose80 | Male | 1 | 17 |
3.2 Visualizations
Now we turn to some visualizations to see what is going on beyond the summary statistics in our nausea dataset.
First, we'll explore our dependent variable: :NAUSEA.
data(nausea) *
mapping(:RACE, :NAUSEA; color = :SEX, dodge = :SEX) *
expectation() |> AlgebraOfGraphics.draw
Female patients have more nausea than male patients. Let's see if this holds for different treatment arms in our nausea data:
plt =
    data(nausea) *
    mapping(
        :RACE,
        :NAUSEA;
        color = :SEX,
        dodge = :SEX,
        layout = :TRT => sorter(["Placebo", "Dose20", "Dose40", "Dose80"]),
    ) *
    expectation()
AlgebraOfGraphics.draw(plt; axis = (; xticklabelrotation = π / 6))
Definitely, male patients experience fewer nausea symptoms than female patients at all dosages. There were no male patients experiencing nausea symptoms in the placebo arm.
Now let's turn our attention to our covariates :WT and :AUC. First, we will remove the placebo observations, as their AUC will always be zero:
nausea_no_plb = @rsubset nausea :TRT != "Placebo";
Also, we would like to analyze the dose-normalized AUC, so we will compute this value:
@rtransform! nausea_no_plb @astable begin
    dose_str = chop(:TRT, head = 4, tail = 0)
    :DOSE = parse(Int, dose_str)
    :AUC_norm = :AUC / :DOSE
end;
We used the chop function to extract dose information from the :TRT column. Make sure to check our Strings tutorial to learn more about Julia's numerous facilities for working with Strings.
We can clearly see a negative relationship between :WT and :AUC:
data(nausea_no_plb) *
mapping(:WT, :AUC_norm => "Dose-normalized AUC") *
(visual(Scatter; alpha = 0.5) + AlgebraOfGraphics.linear()) |> draw
Also, this negative relationship has an interaction effect with dosage. Specifically, the higher the dosage, the stronger the negative relationship between :WT and :AUC. In other words, dosage enhances the relationship between :WT and :AUC, which is an inverse relationship.
dose_renamer = renamer(map(i -> i => "$i mg", unique(nausea_no_plb.DOSE)));
plt =
    data(nausea_no_plb) *
    mapping(:WT, :AUC; color = :DOSE => dose_renamer => "Dose") *
    (visual(Scatter; alpha = 0.5) + AlgebraOfGraphics.linear())
draw(plt)
By plotting the relationship between :AUC and dosage while also considering nausea symptoms (our dependent variable :NAUSEA), we see that nausea events are tied to higher :AUC values:
plt =
    data(nausea) *
    mapping(:Group => "Dose (mg)", :AUC; color = :NAUSEA => nonnumeric) *
    (visual(Scatter; alpha = 0.5) + smooth())
draw(plt; axis = (; xticks = (1:4, ["Placebo", "20", "40", "80"])))
4 Pumas modeling
Finally, we’ll create a Pumas model for logistic regression.
4.1 Pumas’ Workflow
Pumas' workflow:
- Define a model. It can be a PumasModel or a PumasEMModel.
- Define a Subject and Population.
- Fit your model.
- Perform inference on the model's population-level parameters (also known as fixed effects).
- Predict from a fitted model using the empirical Bayes estimates, or simulate random observations from a fitted model using the sampling distribution (also known as the likelihood).
- Model diagnostics and visual predictive checks.
Pumas modeling is a highly iterative process. Eventually, you’ll probably go back and forth in the stages of the workflow. This is natural and expected in Pumas’ modeling. Just remember to have good code organization and version control for all of the workflow iterations.
Before we define the model, we need to convert the :RACE column to dummy variables. Currently :RACE is a string:
unique(nausea.RACE)
5-element Vector{InlineStrings.String15}:
"Caucasian"
"Asian"
"Other"
"Hispanic"
"Black"
@rtransform! nausea begin
:RACE_Caucasian = ifelse(:RACE == "Caucasian", 1, 0)
:RACE_Asian = ifelse(:RACE == "Asian", 1, 0)
:RACE_Other = ifelse(:RACE == "Other", 1, 0)
:RACE_Hispanic = ifelse(:RACE == "Hispanic", 1, 0)
:RACE_Black = ifelse(:RACE == "Black", 1, 0)
end
4.1.1 Defining a model in Pumas
To define a model in Pumas, you can use either the macros or the function interfaces. We will cover the macros interface in this lesson.
To define a model using the macro interface, you start with a begin ... end block of code with @model:
my_model = @model begin end
This creates an empty model. Now we need to populate it with model components. These are additional macros that we can include inside the @model definition. We won't be covering all the possible macros that we can use in this lesson, but here is the full list:
- @param, fixed effects specifications.
- @random, random effects specifications.
- @covariates, covariate names.
- @pre, pre-processing variables for the dynamic system and statistical specification.
- @dosecontrol, specification of any dose control parameters present in the model.
- @vars, shorthand notation.
- @init, initial conditions for the dynamic system.
- @dynamics, dynamics of the model.
- @derived, statistical modeling of dependent variables.
- @observed, model information to be stored in the model solution.
4.1.1.1 Model parameter with @param
Ok, let's start with our logistic_model using the macro @model interface.
First, we'll define our parameters with the @param macro:
@param begin
    ParameterName ∈ Domain(...)
    ...
end
To use the "in" (∈) operator in Julia, you can either replace ∈ with in or type \in <TAB> to get the Unicode character.
By using the begin ... end block we can specify one parameter per line. We will just name the parameters the same as their respective covariates, while also adding the β prefix to make clear that they are coefficients.
Regarding the Domain(...), Pumas has several types of domains for you to specify in your @param block. Here is a list:
- RealDomain for scalar parameters
- VectorDomain for vectors
- PDiagDomain for positive definite matrices with diagonal structure
- PSDDomain for general positive semi-definite matrices
We will only use scalar parameters in our logistic regression model, so we just need the RealDomain.
If you don't specify any arguments inside the domain constructor, it will either error (for some domains that have required arguments) or use the defaults. In the case of RealDomain() without arguments, it just uses the following:
RealDomain(; lower = -∞, upper = ∞, init = 0)
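For illustration only (we do not need it in this lesson), a hypothetical positive-only parameter θ with initial value 1.0 could be declared with explicit arguments:
θ ∈ RealDomain(lower = 0.0, init = 1.0)    # hypothetical bounded parameter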
@param begin
    α ∈ RealDomain()
    βAUC ∈ RealDomain()
    βisF ∈ RealDomain()
    βRACE_Caucasian ∈ RealDomain()
end
4.1.1.2 Model covariates with @covariates
We also have to specify our model covariates using the macro @covariates. This can be done using a one-liner syntax, without a begin ... end block:
@covariates Group AUC isF WT # ... and so on
Or using the begin ... end block and specifying one covariate per line. This is what we will do since our model has several covariates, which makes it easy to visualize them and also to comment out a covariate to perform iterative model exploration:
@covariates begin
    AUC
    isF
    RACE_Caucasian
end
4.1.1.3 Pre-processing variables with @pre
We can specify all the necessary variables and statistical pre-processing with the @pre macro.
The @pre block is traditionally used to specify the inputs of the Ordinary Differential Equation (ODE) system used in non-linear mixed-effects (NLME) PK/PD models.
But we will use @pre to specify our variable transformations instead.
Here, we define all the coefficient-covariate multiplications. Note that we add the prefix _ to the resulting variables. Finally, we define linear_pred, the sum of these terms, as the linear predictor that should be added to the intercept and passed to the non-linear logistic transformation:
@pre begin
    _AUC = AUC * βAUC
    _isF = isF * βisF
    _RACE_Caucasian = RACE_Caucasian * βRACE_Caucasian
    linear_pred = _AUC + _isF + _RACE_Caucasian
end
4.1.1.4 Statistical modeling of dependent variables with @derived
Our final block, @derived
, is used to specify all the assumed distributions of observed variables that are derived from the blocks above. This is where we include our dependent variable: NAUSEA
.
Note that NAUSEA
is being declared as following a Bernoulli
distribution. This is why we use the tilde notation ~
. It means (much like the mathematical model notation) that NAUSEA
follows a Bernoulli
distribution. Since NAUSEA
is a vector of values, we need to broadcast, i.e. vectorize, the operation with the dot .
operator:
NAUSEA ~ Bernoulli.(p)
where p is the single probability parameter that parametrizes the Bernoulli distribution. In some cases, we would need to do a lot of broadcasting on the right-hand side of the ~. For example, since NAUSEA depends on the logistic function of the whole linear predictor, i.e. the logistic function of α + linear_pred, we would need to vectorize everything:
NAUSEA ~ Bernoulli.(logistic.(α .+ linear_pred))
This is cumbersome, so we can use the @. macro, which tells Julia to apply the . to every operator and function call after it:
NAUSEA ~ @. Bernoulli(logistic(α + linear_pred))
Much better!
We are using the @. macro which tells Julia to vectorize (add the "dot syntax" to) all operations and function calls to the right of it.
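Here is a tiny standalone illustration of what @. does (the variable names are ours, unrelated to the model):
xs = [-1.0, 0.0, 1.0]
a = 0.5
y1 = 1 ./ (1 .+ exp.(-(a .+ xs)))    # manual broadcasting with dots everywhere
y2 = @. 1 / (1 + exp(-(a + xs)))     # @. inserts all the dots for us
y1 == y2                             # true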
In this block we can use all variables defined in the previous blocks, in our example the @param, @covariates and @pre blocks.
Finally, we need our logistic function. It could be tricky to use our logistic_fun from the beginning of the tutorial, because it is not numerically stable. Instead, we can use the logistic function from Pumas.
To use the logistic function from Pumas, you just need to:
using Pumas
# now we should have logistic available
logistic(x)
This is our @derived block:
@derived begin
    NAUSEA ~ @. Bernoulli(logistic(α + linear_pred))
end
Here is our full model, once we combine all the macro blocks we just covered:
logistic_model = @model begin
    @param begin
        α ∈ RealDomain()
        βAUC ∈ RealDomain()
        βisF ∈ RealDomain()
        βRACE_Caucasian ∈ RealDomain()
    end

    @covariates begin
        AUC
        isF
        RACE_Caucasian
    end

    @pre begin
        _AUC = AUC * βAUC
        _isF = isF * βisF
        _RACE_Caucasian = RACE_Caucasian * βRACE_Caucasian
        linear_pred = _AUC + _isF + _RACE_Caucasian
    end

    @derived begin
        NAUSEA ~ @. Bernoulli(logistic(α + linear_pred))
    end
end
PumasModel
Parameters: α, βAUC, βisF, βRACE_Caucasian
Random effects:
Covariates: AUC, isF, RACE_Caucasian
Dynamical system variables:
Dynamical system type: No dynamical model
Derived: NAUSEA
Observed: NAUSEA
4.1.2 Define a Subject and Population
Once we have our model defined, we have to specify a Subject and a Population.
In Pumas, subjects (i.e. observations) are represented by the Subject type, and collections of subjects, represented as vectors of Subjects, are defined as a Population.
Subjects can be constructed with the Subject() constructor, for example:
Subject()
Subject
ID: 1
We just constructed a Subject that has ID equal to 1 and no extra information. In order to provide extra information, you would need to pass keyword arguments to the Subject() constructor.
You can use the following keyword arguments inside Subject():
- id: identifier.
- observations: a named tuple of the dependent variables.
- covariates: a named tuple containing the covariates, or nothing.
- events: a vector of Events.
- time: a vector of time stamps for the observations.
For this lesson, we will just use the covariates keyword argument.
This is an example of a Subject with ID equal to 42 and with some covariates specified as a named tuple:
Subject(id = 42, covariates = (; AUC = 100.0, isF = 1, RACE_Caucasian = 1))
Subject
ID: 42
Covariates: AUC, isF, RACE_Caucasian
Since a Population is just a vector of Subjects, we can use a simple map() over the IDs present in our nausea dataset:
pop = map(
    i -> Subject(
        id = i,
        covariates = (;
            AUC = nausea.AUC[i],
            isF = nausea.isF[i],
            RACE_Caucasian = nausea.RACE_Caucasian[i],
        ),
    ),
    1:maximum(nausea.ID),
)
Population
Subjects: 200
Covariates: AUC, isF, RACE_Caucasian
Observations:
4.1.2.1 Reading Subjects directly from a DataFrame
Another option is to use the read_pumas() function to read Subjects (and a Population) directly from a DataFrame. This is more convenient than the map() with the Subject() constructor inside.
The read_pumas() function accepts a DataFrame as the first argument, followed by the following keyword options (which are very similar to the Subject() constructor):
- observations: dependent variables specified by a vector of column names, i.e. [:NAUSEA].
- covariates: covariates specified by a vector of column names, i.e. [:Group, :AUC, :isF, ...].
- id: specifies the ID column of the DataFrame.
- time: specifies the time column of the DataFrame.
- event_data: toggles assertions applicable to event data, either true or false.
Our nausea dataset does not have any event data, so we should set event_data = false. Also, we don't have a :time column, but Pumas needs us to specify a column for this keyword argument. So let's create a dummy column :time where all values are 0.0:
@rtransform! nausea :time = 0.0;
Now, we can create our population with read_pumas():
nausea_pop = read_pumas(
    nausea;
    observations = [:NAUSEA],
    covariates = [:AUC, :isF, :RACE_Caucasian],
    id = :ID,
    time = :time,
    event_data = false,
)
Population
Subjects: 200
Covariates: AUC, isF, RACE_Caucasian
Observations: NAUSEA
4.1.3 Fit your model
Now we are ready to fit our model! We already have a model specified, logistic_model, along with a Population: nausea_pop. We can proceed with model fitting.
Model fitting in Pumas has the purpose of estimating parameters and is done by calling the fit() function with the following positional arguments:
- A Pumas model.
- A Population.
- A named tuple of the initial parameter values.
- An inference algorithm.
4.1.3.1 Initial Parameter Values
We already covered the model and the Population; now let's talk about the initial parameter values. They are the 3rd positional argument inside fit().
You can specify your initial parameters as a named tuple. For instance, if you want a certain parameter, β1, to have an initial value of 3.14, you can do so by passing it inside a named tuple as the 3rd positional argument of fit():
fit(model, population, (; β1 = 3.14))
You can also use the helper function init_params(), which will recover all the initial parameters we specified inside the model's @param block.
Since we used the RealDomain() defaults, all of our example's initial parameters are set to 0.0.
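For instance, calling it on our model should return a named tuple of zeros (a quick sketch, assuming the logistic_model defined above):
init_params(logistic_model)
# (α = 0.0, βAUC = 0.0, βisF = 0.0, βRACE_Caucasian = 0.0)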
4.1.3.2 Inference Algorithms
Finally, our last (4th) positional argument is the choice of inference algorithm.
Pumas has the following available inference algorithms:
Marginal Likelihood Estimation:
- NaivePooled(): first-order approximation without random effects.
- FO(): first-order approximation.
- FOCE(): first-order conditional estimation with automatic interaction detection.
- LaplaceI(): second-order Laplace approximation.
Bayesian with Markov Chain Monte Carlo (MCMC):
- BayesMCMC(): MCMC using the No-U-Turn Sampler (NUTS).
We can also use a Maximum A Posteriori (MAP) estimation procedure for any marginal likelihood estimation algorithm. You just need to call the MAP() constructor with the desired marginal likelihood algorithm inside, for instance:
fit(model, population, init_params(model), MAP(FOCE()))
Ok, we are ready to fit our model. Let's use NaivePooled() since we are not using random effects in this lesson:
nausea_fit =
    fit(logistic_model, nausea_pop, init_params(logistic_model), Pumas.NaivePooled())
[ Info: Checking the initial parameter values.
[ Info: The initial negative log likelihood and its gradient are finite. Check passed.
Iter Function value Gradient norm
0 1.386294e+02 1.974700e+04
* time: 0.021867990493774414
1 1.356988e+02 8.944033e+03
* time: 0.9566919803619385
2 1.342070e+02 2.125159e+03
* time: 0.9570560455322266
3 1.339666e+02 1.108582e+03
* time: 0.957381010055542
4 1.337406e+02 2.078764e+03
* time: 0.9576878547668457
5 1.310077e+02 1.007065e+04
* time: 0.9579918384552002
6 1.265254e+02 1.686018e+04
* time: 0.9583158493041992
7 1.161665e+02 2.192761e+04
* time: 0.9586238861083984
8 1.068977e+02 1.629863e+04
* time: 0.9589409828186035
9 1.027594e+02 8.332132e+03
* time: 0.9592418670654297
10 1.015439e+02 3.226482e+03
* time: 0.9595439434051514
11 1.012961e+02 6.634163e+02
* time: 0.9598488807678223
12 1.012610e+02 1.404645e+02
* time: 0.9601528644561768
13 1.012369e+02 5.328375e+02
* time: 0.9604718685150146
14 1.011695e+02 1.187483e+03
* time: 0.960777997970581
15 1.010163e+02 2.052503e+03
* time: 0.9611070156097412
16 1.006134e+02 3.399937e+03
* time: 0.9614319801330566
17 9.963205e+01 5.310699e+03
* time: 0.9617319107055664
18 9.740393e+01 7.633769e+03
* time: 0.9620349407196045
19 9.334203e+01 9.125369e+03
* time: 0.9623620510101318
20 8.823769e+01 7.815018e+03
* time: 0.9626798629760742
21 8.455556e+01 3.872739e+03
* time: 0.9630389213562012
22 8.353764e+01 7.223117e+02
* time: 0.9633560180664062
23 8.347553e+01 2.676252e+01
* time: 0.9636669158935547
24 8.347313e+01 4.926090e+01
* time: 0.9639739990234375
25 8.347050e+01 2.712655e+01
* time: 0.9642889499664307
26 8.346397e+01 3.819543e+01
* time: 0.9645969867706299
27 8.344834e+01 1.619095e+02
* time: 0.9649059772491455
28 8.340893e+01 3.819645e+02
* time: 0.9652290344238281
29 8.331269e+01 7.375814e+02
* time: 0.9655280113220215
30 8.308737e+01 1.255942e+03
* time: 0.9658360481262207
31 8.262078e+01 1.855313e+03
* time: 0.9661610126495361
32 8.188723e+01 2.193684e+03
* time: 0.9665088653564453
33 8.125091e+01 1.734779e+03
* time: 0.9668159484863281
34 8.101368e+01 7.173788e+02
* time: 0.9671790599822998
35 8.095983e+01 5.583734e+01
* time: 0.9676189422607422
36 8.095442e+01 2.110857e+01
* time: 0.9680428504943848
37 8.095407e+01 9.856148e-01
* time: 0.9685089588165283
38 8.095402e+01 4.973406e+00
* time: 0.9689359664916992
39 8.095402e+01 1.654385e+00
* time: 0.9693639278411865
40 8.095402e+01 2.071506e-01
* time: 0.9697868824005127
41 8.095402e+01 1.090796e-02
* time: 0.9701988697052002
42 8.095402e+01 2.247146e-04
* time: 0.9706318378448486
FittedPumasModel
Successful minimization: true
Likelihood approximation: NaivePooled
Likelihood Optimizer: BFGS
Dynamical system type: No dynamical model
Log-likelihood value: -80.95402
Number of subjects: 200
Number of parameters: Fixed Optimized
0 4
Observation records: Active Missing
NAUSEA: 200 0
Total: 200 0
-------------------------------
Estimate
-------------------------------
α -3.3495
βAUC 0.0023738
βisF 2.4945
βRACE_Caucasian -0.30689
-------------------------------
4.1.4 Perform inference on the model’s population-level parameters
Once the model is fitted, we can analyze our inference and estimates. But first, let's discuss the meaning of the estimated coefficients in a logistic regression model.
4.1.4.1 Interpreting the Coefficients
We can see that after the model is fitted, it prints a result with a summary and a table of the parameter estimates.
We can also recover the estimates as a named tuple with coef():
coef(nausea_fit)
(α = -3.3494807952008214,
βAUC = 0.002373839339878319,
βisF = 2.494522739422596,
βRACE_Caucasian = -0.306888370940882,)
Or as a DataFrame with coeftable():
coeftable(nausea_fit)
Row | parameter | estimate |
---|---|---|
String | Float64 | |
1 | α | -3.34948 |
2 | βAUC | 0.00237384 |
3 | βisF | 2.49452 |
4 | βRACE_Caucasian | -0.306888 |
Remember that the logistic regression’s coefficients are the log-odds, i.e. the natural logarithm of the odds of \(y\) taking value 1.
Log-odds have the convenient property that anything negative pulls down the odds (and, consequently, the probability) of \(y\) taking value 1, anything positive pulls them up, and zero has a neutral effect. But since they are on a log scale, they are not easily interpretable. This is why, in order to interpret them, we often undo the log transformation by exponentiating the log-odds, i.e. exponentiating the coefficients.
We can do that easily with the DataFrame returned by coeftable():
@rtransform coeftable(nausea_fit) :estimate_exp = exp(:estimate)
Row | parameter | estimate | estimate_exp |
---|---|---|---|
String | Float64 | Float64 | |
1 | α | -3.34948 | 0.0351026 |
2 | βAUC | 0.00237384 | 1.00238 |
3 | βisF | 2.49452 | 12.1159 |
4 | βRACE_Caucasian | -0.306888 | 0.735733 |
Now let us interpret the :estimate_exp column.
Our basal rate α is the odds of \(y\) taking value 1, i.e. an observation having nausea symptoms, when all the other covariates are zero.
Logistic regression, like linear regression, assumes that the effects combine linearly. That means that a prediction can be made by taking the basal rate α and adding covariate values multiplied by their β coefficients. This assumption also holds for parameter/model inference. Hence, all of our coefficients represent compounding effects on our dependent variable \(y\).
Let us now interpret βAUC, the covariate AUC's coefficient. Its exponentiated value is approximately 1.0024, which means that for every increase of a single unit in AUC we can expect an increase of about 0.24% in the odds of having nausea symptoms. Since these effects multiply, a 200-unit increase in AUC would scale the odds by roughly 1.0024^200 ≈ 1.61, i.e. about a 61% increase in the odds of having nausea symptoms.
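We can verify this multiplicative reasoning directly from the fitted coefficients (a small sketch using the coef() accessor covered just below):
auc_or_200 = exp(200 * coef(nausea_fit).βAUC)    # ≈ 1.61, the odds ratio implied by a 200-unit AUC increase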
The βisF coefficient represents the effect of the dummy variable isF. Its exponentiated value is over 12, which means that going from isF = 0 to isF = 1 multiplies the odds of having nausea symptoms by roughly 12 (an increase of about 1,100%). Since isF is a dummy variable taking either 0 or 1 as values, this means that female observations have roughly 12 times the odds of having nausea symptoms compared to male observations.
If you have ever heard of the odds ratio or relative risk, this is closely related to what we are measuring here with our exponentiated log-odds: each exponentiated coefficient is an odds ratio, the factor by which the odds of the dependent variable taking value 1 change for a one-unit increase in the covariate.
We can also inspect our estimated coefficients' standard errors (SE) along with the 95% confidence intervals with the infer() function:
infer(nausea_fit)
[ Info: Calculating: variance-covariance matrix.
[ Info: Done.
Asymptotic inference results using sandwich estimator
Successful minimization: true
Likelihood approximation: NaivePooled
Likelihood Optimizer: BFGS
Dynamical system type: No dynamical model
Log-likelihood value: -80.95402
Number of subjects: 200
Number of parameters: Fixed Optimized
0 4
Observation records: Active Missing
NAUSEA: 200 0
Total: 200 0
---------------------------------------------------------------------------------
Estimate SE 95.0% C.I.
---------------------------------------------------------------------------------
α -3.3495 0.59431 [-4.5143 ; -2.1846 ]
βAUC 0.0023738 0.00035443 [ 0.0016792; 0.0030685]
βisF 2.4945 0.52323 [ 1.469 ; 3.52 ]
βRACE_Caucasian -0.30689 0.41582 [-1.1219 ; 0.50811 ]
---------------------------------------------------------------------------------
Also, if you prefer another confidence interval level, you can choose it with the keyword argument level inside infer().
For instance, one common level for confidence intervals is 89%:
infer(nausea_fit; level = 0.89)
[ Info: Calculating: variance-covariance matrix.
[ Info: Done.
Asymptotic inference results using sandwich estimator
Successful minimization: true
Likelihood approximation: NaivePooled
Likelihood Optimizer: BFGS
Dynamical system type: No dynamical model
Log-likelihood value: -80.95402
Number of subjects: 200
Number of parameters: Fixed Optimized
0 4
Observation records: Active Missing
NAUSEA: 200 0
Total: 200 0
---------------------------------------------------------------------------------
Estimate SE 89.0% C.I.
---------------------------------------------------------------------------------
α -3.3495 0.59431 [-4.2993 ; -2.3997 ]
βAUC 0.0023738 0.00035443 [ 0.0018074; 0.0029403]
βisF 2.4945 0.52323 [ 1.6583 ; 3.3307 ]
βRACE_Caucasian -0.30689 0.41582 [-0.97145 ; 0.35768 ]
---------------------------------------------------------------------------------
4.1.5 Predict from a fitted model or Simulate random observations from a fitted model
Now that we understand our model estimates, we can use them to make predictions or simulate random observations.
4.1.5.1 Predictions with predict()
In order to calculate predictions from a fitted model, we need to use the predict() function and pass two positional arguments:
- A fitted Pumas model.
- Either a single Subject or a collection of them with a Population.
For example, to make predictions for a single Subject, we can:
my_subject = Subject(id = 42, covariates = (; AUC = 100.0, isF = 1, RACE_Caucasian = 1))
Subject
ID: 42
Covariates: AUC, isF, RACE_Caucasian
predict() also needs a keyword argument obstimes, which represents the observation times at which we want to predict the dependent variable. The value must be a vector. Since we used a dummy 0.0 value as the :time variable in our read_pumas call, we can pass the same 0.0, but as a vector, to the predict() function.
predict(nausea_fit, my_subject; obstimes = [0.0])
SubjectPrediction
ID: 42
Predictions: NAUSEA: (n=1)
Covariates: AUC, isF, RACE_Caucasian
Time: [0.0]
We can also generate predictions for a Population:
predict(nausea_fit, nausea_pop; obstimes = [0.0])
Prediction
Subjects: 200
Predictions: NAUSEA
Covariates: AUC, isF, RACE_Caucasian
We can also use the function probstable() for discrete models. It provides an easy and elegant way of outputting the outcome probabilities of all discrete dependent variables in a handy DataFrame format. See below that we have the :NAUSEA_iprob1 and :NAUSEA_iprob2 columns. These columns follow the format :DEPVAR_iprobVALUE. Since NAUSEA can take only two discrete outcomes, i.e. it can only be 0 or 1, :NAUSEA_iprob1 maps to NAUSEA taking value 0 and :NAUSEA_iprob2 maps to value 1.
probs_table = probstable(nausea_fit)
first(probs_table, 5)
Row | id | time | NAUSEA_iprob1 | NAUSEA_iprob2 |
---|---|---|---|---|
String | Float64 | Float64 | Float64 | |
1 | 1 | 0.0 | 0.761668 | 0.238332 |
2 | 2 | 0.0 | 0.966088 | 0.0339122 |
3 | 3 | 0.0 | 0.761668 | 0.238332 |
4 | 4 | 0.0 | 0.966088 | 0.0339122 |
5 | 5 | 0.0 | 0.974824 | 0.0251759 |
4.1.5.2 Simulations with simobs()
We can generate random observations from our fitted Pumas model with the simobs() function. You need to supply at least three positional arguments:
- A non-fitted Pumas model, in our case logistic_model.
- Either a single Subject or a collection of them with a Population.
- A named tuple of parameter values.
We already covered the 1st and 2nd positional arguments in our previous examples. Let's focus on the 3rd positional argument: param. It can be a named tuple with the parameter values. For instance, we can use our estimated parameters from the fitted model with coef(nausea_fit):
coef(nausea_fit)
(α = -3.3494807952008214,
βAUC = 0.002373839339878319,
βisF = 2.494522739422596,
βRACE_Caucasian = -0.306888370940882,)
simobs(logistic_model, nausea_pop, coef(nausea_fit));
You can also call simobs() with a single argument, a fitted Pumas model:
nausea_sim = simobs(nausea_fit);
4.1.6 Model diagnostics
Finally, our last step is to assess model diagnostics.
4.1.6.1 Assessing model diagnostics
To assess model diagnostics we can use the inspect() function on our fitted Pumas model:
inspect(nausea_fit)
[ Info: Calculating predictions.
[ Info: Calculating empirical bayes.
[ Info: Evaluating individual parameters.
[ Info: Evaluating dose control parameters.
[ Info: Done.
FittedPumasModelInspection
Fitting was successful: true
4.1.6.1.1 AIC
We can also check the Akaike Information Criterion (AIC):
aic(nausea_fit)
169.90803984884482
4.1.6.1.2 BIC
Besides AIC, there is also the Bayesian Information Criterion (BIC):
bic(nausea_fit)
183.10130931503696
4.1.6.2 Visual predictive checks
To conclude, we can inspect visual predictive checks with the vpc_plot() function. But first, we need to generate a VPC object with the vpc() function.
By default, vpc() will use :time as the independent variable for the VPC. We need to change that, because our :time variable is just a vector of 0s in the logistic regression. Let's use :AUC instead:
vpc_nausea = vpc(nausea_fit, covariates = [:AUC]);
Now, we need to use the vpc_plot() function with our newly created VPC object. For this, we need to import the package PumasUtilities:
using PumasUtilities
vpc_plot(vpc_nausea)
5 Conclusion
We hope you enjoyed this tutorial on logistic regression. Logistic regression is one of the bread-and-butter modeling techniques and is the go-to technique for discrete binary data. Always keep it close by in your arsenal of statistical modeling techniques.