Scientific Machine Learning

Authors:
Mohamed Tarek

Scientific Machine Learning (SciML)

SciML

  • Scientific Machine Learning (SciML) is an interdisciplinary and active field of research that combines machine learning techniques with scientific computing and domain knowledge to model complex systems.
  • SciML aims to leverage mechanistic models (e.g., differential equations) and data-driven approaches to improve predictions, understand underlying processes, and make informed decisions in scientific domains.
  • SciML has applications in various scientific fields, including physics, chemistry, biology, and engineering, where understanding and predicting complex phenomena is crucial.
  • By integrating mechanistic models with data-driven methods, SciML can provide more accurate and interpretable models, especially in scenarios where data is limited or noisy.
  • A subset of SciML is sometimes called physics-informed machine learning, which focuses on incorporating known physical laws and principles into machine learning models to ensure that predictions are consistent with established scientific knowledge.

SciML

  • SciML often also includes using machine learning for modelling scientific data, e.g. operator learning, surrogate modeling, etc.
  • Therefore, sequence models for time series data are often included in SciML, especially when the time series data is generated by a complex system that can be modeled using mechanistic models, e.g. ODEs/PDEs/UDEs.

Universal Differential Equations (UDEs)

Universal Differential Equations (UDEs)

  • Universal Differential Equations (UDEs) are a class of models that combine mechanistic knowledge (e.g., known physical laws) with data-driven components (e.g., neural networks) to model complex systems.

  • If some parts of the dynamics are known, we can encode that knowledge in the structure of the ODEs and use any universal approximator (e.g., neural network) to model the unknown parts of the dynamics.

    • Other universal approximators include Gaussian processes and splines.
  • We focus on neural networks as the universal approximator in this lecture, but the same principles apply to other types of universal approximators.

  • We can either start from a neural ODE and add mechanistic terms, or start from a mechanistic model and add neural network terms to capture unknown dynamics.

Universal Differential Equations (UDEs)

  • UDEs allow us to encode known relationships and structures in the model, making it more data-efficient, more interpretable, and more likely to generalize well to new scenarios.
  • The parameters of the mechanistic terms can be learned simultaneously with the parameters of the neural network, allowing for a unified training process that optimizes both components together.

UDEs in Pharmacokinetics/Pharmacodynamics (PK/PD)

  • Let \(\text{Depot}(t)\), \(\text{Central}(t)\), and \(R(t)\) represent the drug amount in the depot compartment, central compartment, and the response at time \(t\), respectively.
  • We can model the dynamics of these variables using a system of ordinary differential equations (ODEs) that may include both mechanistic terms (e.g., known PK/PD relationships) and data-driven terms (e.g., neural networks to capture unknown dynamics).
  • We can encode different levels of mechanistic knowledge and assumptions in the structure of the ODEs.

UDEs in Pharmacokinetics/Pharmacodynamics (PK/PD)

Case 1

\[ \begin{aligned} \frac{d\text{Depot}}{dt} & = \text{NN}(\text{Depot}, \text{Central}, R)[1] \\ \frac{d\text{Central}}{dt} & = \text{NN}(\text{Depot}, \text{Central}, R)[2] \\ \frac{dR}{dt} & = \text{NN}(\text{Depot}, \text{Central}, R)[3] \end{aligned} \]

  • Number of states

Case 2

\[ \begin{aligned} \frac{d\text{Depot}}{dt} & = -\text{NN}_1(\text{Depot}) \\ \frac{d\text{Central}}{dt} & = \text{NN}_1(\text{Depot}) - \text{NN}_2(\text{Central}) \\ \frac{dR}{dt} & = \text{NN}_3(\text{Central}, R) \end{aligned} \]

  • Number of states
  • Relationships
  • Dependence/independence of terms
  • Conservation between Depot and Central

UDEs in Pharmacokinetics/Pharmacodynamics (PK/PD)

Case 3

\[ \begin{aligned} \frac{d\text{Depot}}{dt} & = -K_a \cdot \text{Depot} \\ \frac{d\text{Central}}{dt} & = K_a \cdot \text{Depot} - \frac{\text{CL}}{V_c} \cdot \text{Central} \\ \frac{dR}{dt} & = \text{NN} \Bigg( \frac{\text{Central}}{V_c}, R \Bigg) \end{aligned} \]

  • Explicit knowledge of some terms
  • Relationships (R independent of Depot!)

Case 4

\[ \begin{aligned} \frac{d\text{Depot}}{dt} & = -K_a \cdot \text{Depot} \\ \frac{d\text{Central}}{dt} & = K_a \cdot \text{Depot} - \frac{\text{CL}}{V_c} \cdot \text{Central} \\ \frac{dR}{dt} & = k_\text{in} \cdot \Bigg( 1 + \text{NN}\Bigg( \frac{\text{Central}}{V_c} \Bigg) \Bigg) - k_\text{out} \cdot R \end{aligned} \]

  • Precise position of the unknown function.
  • Precise input to the unknown function.
  • Lots of knowledge!
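Case 4 can be sketched as a right-hand-side function in which the neural network only fills the one unknown term. This is a minimal illustration, not a trained model: the MLP weights are random placeholders, and all parameter values (\(K_a\), CL, \(V_c\), \(k_\text{in}\), \(k_\text{out}\)) are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny MLP standing in for the trained NN(Central/Vc).
# Random weights are placeholders; in practice they are learned jointly
# with any unknown mechanistic parameters.
W1, b1 = rng.normal(size=(8, 1)), np.zeros(8)
W2, b2 = rng.normal(size=(1, 8)), np.zeros(1)

def nn(conc):
    h = np.tanh(W1 @ np.atleast_1d(conc) + b1)
    return (W2 @ h + b2)[0]

# Hypothetical parameter values, for illustration only.
Ka, CL, Vc, k_in, k_out = 1.0, 2.0, 10.0, 0.5, 0.3

def rhs(t, y):
    depot, central, r = y
    conc = central / Vc
    d_depot = -Ka * depot                          # known mechanistic term
    d_central = Ka * depot - (CL / Vc) * central   # known mechanistic term
    d_r = k_in * (1.0 + nn(conc)) - k_out * r      # NN fills the unknown part
    return np.array([d_depot, d_central, d_r])

# One Euler step from a 100-unit bolus dose in the depot compartment;
# R starts at its baseline k_in/k_out.
y = np.array([100.0, 0.0, k_in / k_out])
y = y + 0.1 * rhs(0.0, y)
```

In a real workflow the right-hand side would be passed to an adaptive ODE solver and the NN weights optimized against data, but the structure of `rhs` is the point: the more mechanistic knowledge we have, the smaller the role the NN has to play.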

UDE Case Studies

Physics Informed Neural Networks (PINNs)

Physics Informed Neural Networks (PINNs)

  • Physics Informed Neural Networks (PINNs) are a specific type of SciML model that incorporates known physical laws (e.g., conservation of mass, energy, momentum) into the loss function when training neural networks.
  • PINNs involve training a neural network to approximate the solution of a PDE or ODE.
  • PINNs can be used for a wide range of applications, including PK/PD, quantitative system pharmacology, fluid dynamics, heat transfer modeling, and material science, where the underlying physics/biology is (partially) understood and can be leveraged to improve model accuracy and generalization.
  • PINNs are often called biology-informed neural networks (BINNs) or pharmacology-informed neural networks (PhINNs) in the contexts of biology and pharmacology, but the core idea is the same. We use PINNs to refer to all such models.

PINNs in Pharmacokinetics

  • Assume a 2 compartment PK model: \[ \begin{aligned} \frac{d\text{Depot}}{dt} & = -K_a \cdot \text{Depot} \\ \frac{d\text{Central}}{dt} & = K_a \cdot \text{Depot} - \frac{\text{CL}}{V_c} \cdot \text{Central} \end{aligned} \]
  • Assume a single bolus dose of 100 units at time 0 so the initial conditions of Depot and Central are 100 and 0, respectively.
  • The PINN would involve defining a neural network \([\hat{\text{Depot}}, \hat{\text{Central}}] = \text{NN}(t)\) that predicts the drug amounts in the depot and central compartments at any time \(t\).
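For this linear 2-compartment model with a single bolus dose, a closed-form solution exists, which is convenient for generating synthetic training data for the PINN. The parameter values below are assumptions chosen for illustration.

```python
import numpy as np

# Illustrative parameter values (assumptions, not from the text).
Ka, CL, Vc, dose = 1.0, 2.0, 10.0, 100.0
k = CL / Vc  # first-order elimination rate constant

def depot(t):
    # Depot(0) = dose, first-order absorption.
    return dose * np.exp(-Ka * t)

def central(t):
    # Closed-form solution of dCentral/dt = Ka*Depot - k*Central
    # with Central(0) = 0 (valid when Ka != k).
    return dose * Ka / (Ka - k) * (np.exp(-k * t) - np.exp(-Ka * t))

t = np.linspace(0.0, 24.0, 13)
conc = central(t) / Vc  # concentrations the PINN would be fit to
```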

PINNs in Pharmacokinetics

  • The loss function combines a data-fitting term with terms that penalize violations of the above ODE and of the initial conditions, for example: \[ \begin{aligned} L &= \frac{1}{N} \sum_{i=1}^N \left| \frac{\hat{\text{Central}}(t_i)}{V_c} - \text{Concentration}_i \right|^2 \\ & \quad\quad\quad + \frac{1}{M} \sum_{i=1}^{M} \lambda_{i,1} \left| \frac{d\hat{\text{Depot}}}{dt}(t_i) + K_a \cdot \hat{\text{Depot}}(t_i) \right|^2 \\ & \quad\quad\quad + \frac{1}{M} \sum_{i=1}^{M} \lambda_{i,2} \left| \frac{d\hat{\text{Central}}}{dt}(t_i) - K_a \cdot \hat{\text{Depot}}(t_i) + \frac{\text{CL}}{V_c} \cdot \hat{\text{Central}}(t_i) \right|^2 \\ & \quad\quad\quad + \lambda_{\text{IC,Depot}} |\hat{\text{Depot}}(0) - 100|^2 + \lambda_{\text{IC,Central}} |\hat{\text{Central}}(0)|^2 \end{aligned} \]
  • The derivative terms \(\frac{d\hat{\text{Depot}}}{dt}\) and \(\frac{d\hat{\text{Central}}}{dt}\) can be computed using automatic differentiation.
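The loss above can be sketched as follows. This is a dependency-free illustration with assumed values: the MLP weights are random placeholders for the parameters being optimized, the observed data are dummy numbers, and central finite differences stand in for the automatic differentiation a real framework would use to compute the time derivatives.

```python
import numpy as np

rng = np.random.default_rng(0)
# Small MLP t -> [Depot_hat(t), Central_hat(t)]; random weights stand in
# for the parameters being optimized.
W1, b1 = rng.normal(size=(16, 1)), np.zeros(16)
W2, b2 = rng.normal(size=(2, 16)), np.zeros(2)

def nn(t):
    h = np.tanh(W1[:, 0] * t + b1)
    return W2 @ h + b2  # [Depot_hat(t), Central_hat(t)]

def dnn_dt(t, eps=1e-4):
    # Finite difference in place of automatic differentiation.
    return (nn(t + eps) - nn(t - eps)) / (2 * eps)

# Assumed PK parameters, dummy observations, and penalty weights.
Ka, CL, Vc, dose = 1.0, 2.0, 10.0, 100.0
lam_ode, lam_ic = 1.0, 1.0
t_obs = np.array([1.0, 2.0, 4.0, 8.0])
conc_obs = np.array([5.0, 6.0, 4.0, 2.0])
t_col = np.linspace(0.0, 12.0, 25)  # M collocation ("virtual") points

def pinn_loss():
    # Data-fitting term over the N observations.
    data = np.mean([(nn(t)[1] / Vc - c) ** 2 for t, c in zip(t_obs, conc_obs)])
    # ODE residual terms at the M collocation points.
    res = 0.0
    for t in t_col:
        d_hat = dnn_dt(t)
        dep, cen = nn(t)
        r1 = d_hat[0] + Ka * dep
        r2 = d_hat[1] - Ka * dep + (CL / Vc) * cen
        res += lam_ode * (r1 ** 2 + r2 ** 2)
    res /= len(t_col)
    # Initial-condition terms.
    dep0, cen0 = nn(0.0)
    ic = lam_ic * ((dep0 - dose) ** 2 + cen0 ** 2)
    return data + res + ic

L = pinn_loss()
```

Training would repeat this loss evaluation inside a gradient-based optimizer such as Adam, differentiating `pinn_loss` with respect to the MLP weights.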

PINNs in Pharmacokinetics

  • \(M\) is the number of time points at which we penalize the violation of the ODE constraints, acting as virtual observations.
  • \(M\) can be different from \(N\) which is the number of time points for which we have observed data.
  • The average over the \(M\) virtual observations can be replaced with an expectation, sampling time points from a given time domain in each iteration of a stochastic optimization algorithm, e.g. sampling \(t_i\) uniformly from 0 to the maximum time in the training data.
  • The average over the \(N\) actual observations can also be replaced with an expectation, sampling time points from the training data in each iteration in a stochastic optimization algorithm, shuffling between epochs, etc.
  • \(\lambda\) is a set of hyperparameters that control the strength of the physics-informed loss terms relative to the data-fitting term.
  • Unlike most hyperparameters, \(\lambda\) is typically not tuned using validation performance; instead, it is often adjusted during training (e.g., via adaptive weighting schemes) to balance the contributions of the data-fitting and physics-informed terms.

PINNs in Pharmacokinetics

  • Training PINNs involves differentiating the loss function with respect to the parameters of the neural network to get the gradient for optimization.

    • The loss itself involves derivatives of the neural network outputs with respect to time.
    • Both nested derivatives can be computed using automatic differentiation, which is supported by modern deep learning frameworks.
  • If the dynamics of the system are not fully known, they can be described using a UDE while still using the PINN loss.

  • The ODE/PDE/UDE parameters can be learned together with the neural network weights representing the solution.

  • The constraints can be enforced on a subset of the compartments if the dynamics of some compartments are better understood than others.

PINNs in Pharmacokinetics

  • If there are multiple bolus doses (discontinuities in Depot), we can use a different neural network for each segment of the time series between bolus doses, and enforce constraints at the boundaries of these segments by adding the following terms to the loss function: \[ \begin{aligned} L_{\text{boundary}} = & \sum_{j=1}^{J} \lambda_{\text{Depot}, j} |\hat{\text{Depot}}(t_j^-) - \hat{\text{Depot}}(t_j^+) + \text{Dose}_j|^2 \\ &+ \sum_{j=1}^{J} \lambda_{\text{Central}, j} |\hat{\text{Central}}(t_j^-) - \hat{\text{Central}}(t_j^+)|^2 \end{aligned} \]
  • \(t_j\) is the time of the \(j\)-th bolus dose of \(J\) doses, \(\text{Dose}_j\) is the amount of the \(j\)-th bolus dose, and \(\hat{\text{Depot}}(t_j^-)\) and \(\hat{\text{Depot}}(t_j^+)\) are the predictions of the piecewise neural network just before and just after the bolus dose, respectively (similarly for \(\hat{\text{Central}}\)).
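The boundary penalty can be sketched directly from the formula. The predictions and dose amounts below are hypothetical numbers standing in for the outputs of the piecewise neural networks at the dose times.

```python
import numpy as np

# Hypothetical values for J = 2 bolus doses.
dose_amounts = np.array([100.0, 100.0])
depot_minus = np.array([12.0, 15.0])     # Depot_hat(t_j^-)
depot_plus = np.array([112.0, 115.0])    # Depot_hat(t_j^+)
central_minus = np.array([40.0, 55.0])   # Central_hat(t_j^-)
central_plus = np.array([40.5, 54.0])    # Central_hat(t_j^+)
lam_depot, lam_central = 1.0, 1.0

# Depot must jump up by the dose amount; Central must be continuous.
l_boundary = (
    lam_depot * np.sum((depot_minus - depot_plus + dose_amounts) ** 2)
    + lam_central * np.sum((central_minus - central_plus) ** 2)
)
```

Here the Depot terms vanish (each segment starts exactly one dose above where the previous one ended), so the penalty comes entirely from the small discontinuities in Central.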

Benefits of PINNs

  • The known PDE/ODE may not be sufficient to fully describe the data observed.

    • However, known dynamics can still guide the learning process by softly enforcing physical constraints on the neural network’s predictions through the loss function.
  • PINNs are typically more data efficient than purely data-driven models.

    • They leverage known physical relationships to guide the learning process, especially in scenarios where data is limited or noisy.
  • PINNs are easy to implement and well suited to GPUs.

    • Can be trained using standard GPU-accelerated deep learning frameworks and stochastic optimization algorithms, e.g. Adam.
    • Typically more accessible compared to advanced numerical solvers, especially for complex partial differential equations without validated and optimized classical numerical solver implementations.

Limitations of PINNs

Even if the dynamics are well understood, PINNs can often be used to avoid solving the ODE/PDE directly using classical numerical methods. However, this is generally not recommended because:

  • Classical numerical solvers generally have better theoretical guarantees of convergence, accuracy and stability than PINNs, and are often more efficient for solving well-studied ODEs/PDEs.
  • PINNs only softly enforce the physics constraints, so some violation will almost always be there.
  • PINNs also struggle with stiff ODEs and multi-scale problems without further specialization, similar to classical numerical solvers.

Limitations of PINNs

  • The solution is sensitive to the values of the penalty coefficients \(\lambda\).

  • It can be difficult to find good values for \(\lambda\), especially if the constraints are on different scales.

    • If \(\lambda\) is too low, the constraints will be ignored.
    • If \(\lambda\) is too high, the constraints dominate the loss. However, the NN may not be capable of satisfying the constraints exactly without sacrificing the quality of the data fit.

PINNs for Inverse Problems

  • Despite the limitations of PINNs, when learning the parameters of a model from data (inverse problem), an accurate solution may not be necessary at the beginning of training.

  • PINNs can learn to approximate the solution while learning the parameters simultaneously.

  • In some (but not all) cases, this may be more efficient than repeatedly solving the ODE/PDE with different parameters using an accurate numerical solver.

  • Solving inverse problems with gradient-based algorithms also requires differentiating through the numerical solver.

    • This is generally more difficult and computationally expensive than calculating the gradient of the PINN loss.
  • The PINN solution can be used to initialize a more accurate but slower method that uses a classical numerical solver and exact gradient calculations.

Deep Operator Networks (DeepONets)

Deep Operator Networks (DeepONets)

  • DeepONets are a type of neural network architecture designed to learn operators, which are mappings between function spaces.
  • An operator \(\mathcal{G}\) maps an input function \(x(u)\) for \(u \in \mathcal{U}\) to an output function \(y(v) = \mathcal{G}(x)(v)\) for \(v \in \mathcal{V}\).
  • \(\mathcal{U}\) and \(\mathcal{V}\) can be different domains, and the input and output functions can have different dimensions and structures.
  • DeepONets satisfy the universal approximation theorem for operators, which states that they can approximate any continuous operator to arbitrary accuracy given sufficient network capacity.
  • Many problems in science and engineering can be formulated as learning an operator, such as solving differential equations, where the operator maps initial conditions or parameters to the solution of the equation.

Deep Operator Networks (DeepONets)

  • DeepONets can be trained once and used as a surrogate model for the operator, allowing for efficient predictions of the output function for new input functions without needing to solve the underlying equations again.
  • For example, we can simulate the PK/PD dynamics for a range of dosing regimens and patient characteristics, and train a DeepONet to learn the operator that maps these inputs to the resulting drug concentration or response curves.
  • Once trained, the DeepONet can quickly predict the drug concentration or response curve for new dosing regimens and patient characteristics without needing to solve the underlying PK/PD equations again, which can be computationally expensive.
  • There are also physics informed versions of DeepONets that can incorporate known physical laws and principles into the training process, improving the accuracy and generalization of the model, especially in scenarios with limited data.

Deep Operator Networks (DeepONets)

  • A DeepONet typically consists of two main components:

    • Branch network:

      1. Takes the measurements of the input function \(x(u)\).
      2. Processes them to extract relevant features/embeddings that capture the essential characteristics of the input function. \[ B = \text{NN}_{\text{branch}}(x(u_1), x(u_2), \ldots, x(u_n)) \]
    • Trunk network:

      1. Takes the evaluation point \(v\) as input.
      2. Produces features that are combined with the branch features to give the output function value \(y(v)\) at that point. \[ T = \text{NN}_{\text{trunk}}(v) \]

Deep Operator Networks (DeepONets)

  • The measurements passed to the branch network can optionally include the sensor locations \(u_i\) alongside the function values \(x(u_i)\).
  • The final output of the DeepONet is obtained by combining the outputs of the branch and trunk networks, typically through a dot product: \[ \hat{y}(v) = \sum_{k = 1}^K B_k \cdot T_k \]
  • \(K\) is the number of features/embeddings extracted by the branch network, which is also the dimension of the output of the trunk network.
  • The branch and trunk networks can be implemented using various neural network architectures, such as feed-forward networks, convolutional networks, or recurrent networks, depending on the nature of the input and output functions.
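A minimal untrained forward pass ties the pieces together: a branch MLP maps the sensor measurements to \(K\) features, a trunk MLP maps the evaluation point to \(K\) features, and a dot product combines them. All weights and sizes below are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sensors, K = 10, 8  # number of measurement points of x(u); embedding size K

# Branch network: sensor values -> K features (one hidden layer).
Wb1, bb1 = rng.normal(size=(32, n_sensors)), np.zeros(32)
Wb2, bb2 = rng.normal(size=(K, 32)), np.zeros(K)
# Trunk network: evaluation point v -> K features.
Wt1, bt1 = rng.normal(size=(32, 1)), np.zeros(32)
Wt2, bt2 = rng.normal(size=(K, 32)), np.zeros(K)

def branch(x_vals):
    return Wb2 @ np.tanh(Wb1 @ x_vals + bb1) + bb2

def trunk(v):
    return Wt2 @ np.tanh(Wt1[:, 0] * v + bt1) + bt2

def deeponet(x_vals, v):
    # Combine branch and trunk features with a dot product:
    # y_hat(v) = sum_k B_k * T_k
    return float(branch(x_vals) @ trunk(v))

# Input function sampled at fixed sensor locations u_1..u_n.
u = np.linspace(0.0, 1.0, n_sensors)
x_vals = np.sin(np.pi * u)
y_hat = deeponet(x_vals, 0.5)  # predicted y(v) at v = 0.5
```

Training would evaluate `deeponet` at many \((x, v_i)\) pairs and minimize the squared error against the true \(y(v_i)\), optimizing the branch and trunk weights jointly.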

Deep Operator Networks (DeepONets)

  • The loss for a single input-output function pair can be defined as the sum (or mean) squared error between the predicted output function values and the true output function values at a set of evaluation points: \[ L = \sum_{i=1}^{N} \left| \hat{y}(v_i) - y(v_i) \right|^2 \]
  • If there are \(M > 1\) input-output function pairs in the training data, the loss can be averaged across all pairs: \[ L = \frac{1}{M} \sum_{m=1}^{M} \sum_{i=1}^{N_m} \left| \hat{y}_m(v_i) - y_m(v_i) \right|^2 \]
  • The training process involves optimizing the parameters of the branch and trunk networks simultaneously to minimize this loss, using standard optimization algorithms such as stochastic gradient descent or Adam.

DeepONets as Sequence to Vector to Sequence Models

  • For sequence to vector to sequence modelling, we can consider both the input and output functions as functions of time, i.e., \(x(t)\) and \(y(t)\).
  • Each subject can be considered a different input-output function pair.
  • In the sequence to vector to sequence setting, every input \(x(t_1)\) is associated with every output \(y(t_2)\) even if \(t_2 < t_1\), i.e. future inputs can affect past outputs, which is not the case in sequence to sequence models.
  • This potentially breaks causality but the model may learn to ignore the future inputs when making predictions for past outputs if the training data is consistent with causality, i.e. if the training data does not contain examples where future inputs affect past outputs.
  • If the input sequence time points are all before the output sequence time points, there is no issue.

Some Alternatives to DeepONets

  • Fourier Neural Operators (FNOs) are an alternative to DeepONets that use Fourier transforms to learn operators, which can be more efficient for certain types of problems, especially those involving cyclic or oscillating data.
  • Laplace Neural Operators (LNOs) are another alternative that use the more general Laplace transforms to learn operators.
  • Operator learning is an active area of research.

Conclusion

SciML has Much More!

  • Model discovery: Using data-driven methods to discover underlying physical laws or governing equations from observed data.
  • Uncertainty quantification: Combining mechanistic models with probabilistic machine learning techniques to quantify uncertainty in predictions and model parameters.
  • Optimal control and reinforcement learning: Using machine learning to optimize control strategies for complex systems governed by mechanistic models.
  • Residual modeling: Predicting the residual error of mechanistic models using a data-driven model and using it to correct the predictions of mechanistic models.
  • Model reduction: Using machine learning to identify and retain the most important components of a mechanistic model.
  • And much more! The field of SciML is rapidly evolving and encompasses a wide range of techniques and applications that go beyond the scope of this lecture.