Introduction to Deep Learning
Authors:
Abdelwahed Khamis, Mohamed Tarek
Perceptron
- The perceptron is a simple model of a biological neuron, introduced by Frank Rosenblatt in 1958.
- It is a binary classifier that maps input features to one of two classes using a linear decision boundary.
- The perceptron computes a weighted sum of the input features, applies an activation function (step function), and outputs a class label.
- The perceptron can be trained using the Perceptron Learning Algorithm (PLA), which adjusts the weights based on misclassifications.
Rosenblatt’s Perceptron Learning Algorithm (PLA)
- Initialize the weight vector
\[\mathbf w^{(0)} \leftarrow \mathbf 0 \;(\text{or small random values}), \qquad
b^{(0)} \leftarrow 0\]
- Loop over the training set \(\{(\mathbf x_i,\,y_i)\}_{i=1}^{N}\) with \(y_i \in \{-1,+1\}\)
- Predict \(\hat y_i = \operatorname{sign}(\mathbf w\!\cdot\!\mathbf x_i + b)\)
- If it is correct, do nothing
- If it is wrong, update
\[\mathbf w \gets \mathbf w + \eta \cdot y_i \cdot \mathbf x_i, \qquad
b \gets b + \eta \cdot y_i\] where \(\eta>0\) is a fixed learning rate (often \(\eta = 1\)).
- Repeat passes (epochs) until all points are classified correctly or a preset iteration budget is exhausted.
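A minimal NumPy sketch of this update rule (the toy data, learning rate, and epoch budget below are illustrative assumptions, not part of the original algorithm statement):

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, max_epochs=100):
    """Rosenblatt's PLA. X has shape (N, d); labels y are in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            y_hat = 1.0 if (w @ x_i + b) >= 0 else -1.0
            if y_hat != y_i:              # misclassified point -> update
                w += eta * y_i * x_i
                b += eta * y_i
                errors += 1
        if errors == 0:                   # all points classified correctly
            break
    return w, b

# Toy linearly separable data (illustrative only)
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = perceptron_train(X, y)
```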
Multi-Layer Perceptron (MLP)
- A neuron computes a weighted sum of inputs, adds a bias, and applies an activation function. \[ y = f\!\biggl(\sum_{i=1}^{n} w_i x_i + b\biggr) \] where
- \(w_i\): weights
- \(b\): bias
- \(f(\cdot)\): activation function
- A Multi-Layer Perceptron (MLP) is a feed-forward neural network with one or more hidden layers of neurons.
- They are called feed-forward because information flows in one direction, from input to output, as opposed to recurrent networks, where information can loop back and create cycles.
Multi-Layer Perceptron (MLP)
- Each layer transforms the input data, allowing the network to learn complex mappings.
- MLPs are just functions with many parameters (weights and biases) that can be optimized to fit any function.
- Activation functions introduce non-linearity, enabling neural networks to learn complex mappings.
- Without them, a deep network collapses to a single linear transformation.
- Properties of activation functions:
- Nonlinear
- Differentiable almost everywhere for gradient-based learning
- Preferably, computationally efficient
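As a concrete illustration, a small feed-forward MLP in PyTorch (the layer sizes here are arbitrary placeholders):

```python
import torch
import torch.nn as nn

# Each Linear layer computes W x + b; the activation adds the non-linearity.
mlp = nn.Sequential(
    nn.Linear(10, 32),   # input layer: 10 features -> 32 hidden units
    nn.ReLU(),
    nn.Linear(32, 32),   # hidden layer
    nn.ReLU(),
    nn.Linear(32, 1),    # output layer: a single regression output
)

x = torch.randn(4, 10)   # batch of 4 samples
y_hat = mlp(x)           # forward pass: information flows input -> output
print(y_hat.shape)       # torch.Size([4, 1])
```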
Common Activation Functions
- Sigmoid: smooth; bounded output; suffers from vanishing gradients for large |x|.
- Rectified Linear Unit (ReLU): simple, efficient; non-smooth at 0; outputs zero for x < 0, so units can “die” (get stuck with zero output and zero gradient); unbounded output.
- Hyperbolic Tangent (tanh): zero-centered; smooth; bounded output; also suffers from vanishing gradients.
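For reference, these three functions can be written directly in NumPy (a sketch, not a library implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # bounded in (0, 1)

def relu(x):
    return np.maximum(0.0, x)         # 0 for x < 0, identity for x >= 0

def tanh(x):
    return np.tanh(x)                 # zero-centered, bounded in (-1, 1)
```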
Other Common Activation Functions
| Activation | Formula | Output range |
|---|---|---|
| Leaky ReLU | \(f(x) = \max(\alpha x, x)\) | \((-\infty, \infty)\) |
| Softmax | \(f_i(x) = \frac{e^{x_i}}{\sum_j e^{x_j}}\) | \((0, 1)\) |
- Leaky ReLU: mitigates dying ReLU; small gradient for x < 0; unbounded output.
- Softmax: used for multi-class classification; outputs probabilities summing to 1.
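Both can be sketched in NumPy as follows (\(\alpha\) is a small slope such as 0.01; subtracting the max in softmax is a standard numerical-stability trick):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small non-zero slope for x < 0

def softmax(x):
    z = x - np.max(x)                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()                     # outputs are positive and sum to 1
```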
Vanishing Gradient & Activation Functions
- The vanishing gradient problem occurs when gradients become very small, slowing or halting learning.
- Functions like sigmoid squash inputs to (0, 1). For large positive/negative inputs, their derivatives become very small, and gradients shrink when propagated through many layers — causing vanishing gradients.
- Both sigmoid and tanh are saturating activation functions, where the derivative tends to zero for large input magnitudes.
- ReLU doesn’t saturate in the positive region, preserving gradient magnitude and mitigating vanishing gradients. However, its output is unbounded, so large activations can still contribute to exploding gradients.
- For more information about activation functions, see https://en.wikipedia.org/wiki/Activation_function
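A quick numerical illustration of saturation: the sigmoid derivative is at most 0.25, so the product of local derivatives through a stack of sigmoid activations shrinks rapidly (the depth and pre-activation value below are arbitrary choices):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # maximum value is 0.25 (at x = 0)

# Chain rule through a stack of sigmoid activations: multiply local derivatives.
grad = 1.0
x = 2.0                           # a moderately large pre-activation
for layer in range(10):
    grad *= sigmoid_grad(x)
print(grad)                       # ~1e-10: the gradient has effectively vanished
```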
Universal Approximation Theorem
A feed-forward neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets of \(\mathbb{R}^n\), given appropriate activation functions (e.g., sigmoid, ReLU).
- The first formal statement was provided by Cybenko (1989) for sigmoid activations.
- Later it was generalized multiple times to other activation functions, discontinuous functions, deeper networks and non-compact domains.
- This theorem underpins the theoretical foundation of neural networks, demonstrating their capacity to model complex functions.
- However, it does not provide guidance on the number of neurons needed or how to train such networks effectively.
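As an illustration (not a proof), a single-hidden-layer network can be fit to a simple 1-D function; the width, optimizer, target function, and number of steps below are arbitrary choices:

```python
import torch
import torch.nn as nn

# One hidden layer with a sigmoidal (tanh) activation, as in the classic statement.
net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))

x = torch.linspace(-3, 3, 200).unsqueeze(1)
y = torch.sin(2 * x)                      # target continuous function

opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for step in range(2000):
    loss = ((net(x) - y) ** 2).mean()     # mean squared error
    opt.zero_grad()
    loss.backward()
    opt.step()

print(loss.item())                        # small -> good approximation on [-3, 3]
```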
Deep Learning vs Neural Networks
- Deep learning refers to the use of deep mathematical models, typically neural networks with multiple hidden layers, to learn representations of data and relationships within it.
- The depth of these functions allows them to learn hierarchical representations of data, capturing complex patterns and abstractions.
- There are other classes of deep models beyond neural networks, such as deep Gaussian processes and deep Boltzmann machines.
- However, in practice, deep learning is often synonymous with deep neural networks due to their widespread success across various domains.
Neural Network Architectures
Among neural networks, there are different architectures designed for specific data types and tasks, such as:
- Multi-Layer Perceptrons (MLPs) for tabular data,
- Convolutional Neural Networks (CNNs) for image data,
- Recurrent Neural Networks (RNNs) for sequential data,
- Transformers for natural language processing, and
- Graph Neural Networks (GNNs) for graph-structured data.
Neural Networks for Supervised Learning
- Learn a mapping from inputs to known targets.
\[
\hat{y} = f_\theta(x)
\]
- The network adjusts weights to minimize the mean prediction error. \[
\hat{\theta} = \arg \min_\theta \frac{1}{N} \sum_{i=1}^{N} \ell(f_\theta(x_i), y_i)
\]
- Regression loss \[
\ell(\hat{y}, y) = \|\hat{y} - y\|^2
\]
- Classification loss \[
\ell(\hat{y}, y) = -\sum_{c} y_c \log(\hat{y}_c)
\]
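In PyTorch, these two losses correspond to `nn.MSELoss` and `nn.CrossEntropyLoss`; note that `CrossEntropyLoss` expects raw logits and integer class labels rather than one-hot vectors (the shapes below are made up for illustration):

```python
import torch
import torch.nn as nn

# Regression: squared error between prediction and target.
mse = nn.MSELoss()
y_hat = torch.randn(8, 1)                 # predictions for 8 samples
y = torch.randn(8, 1)                     # targets
reg_loss = mse(y_hat, y)

# Classification: cross-entropy. PyTorch applies softmax internally,
# so it takes raw logits and integer class indices (not one-hot vectors).
ce = nn.CrossEntropyLoss()
logits = torch.randn(8, 3)                # 8 samples, 3 classes
labels = torch.randint(0, 3, (8,))        # class indices in {0, 1, 2}
clf_loss = ce(logits, labels)
```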
Neural Networks for Unsupervised Learning
Autoencoder
- Compresses input to a smaller latent embedding. \[
z = \text{Encoder}_{\theta_1}(x)
\]
- Reconstructs the input from that embedding. \[
\hat{x} = \text{Decoder}_{\theta_2}(z)
\]
- Trains to minimize reconstruction error. \[
\begin{aligned}
\hat{\theta} &= \arg \min_{\theta_1, \theta_2} \frac{1}{N} \sum_{i=1}^{N} \|x_i - \hat{x}_i\|^2 \\
\hat{x}_i &= \text{Decoder}_{\theta_2}(\text{Encoder}_{\theta_1}(x_i))
\end{aligned}
\]
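A minimal PyTorch autoencoder sketch matching the equations above (the input dimension, latent dimension, and training length are arbitrary assumptions):

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(20, 8), nn.ReLU(), nn.Linear(8, 3))   # x -> z
decoder = nn.Sequential(nn.Linear(3, 8), nn.ReLU(), nn.Linear(8, 20))   # z -> x_hat

params = list(encoder.parameters()) + list(decoder.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

x = torch.randn(64, 20)                   # a batch of unlabeled inputs
for step in range(100):
    z = encoder(x)                        # compress to the latent embedding
    x_hat = decoder(z)                    # reconstruct the input
    loss = ((x - x_hat) ** 2).mean()      # reconstruction error
    opt.zero_grad()
    loss.backward()
    opt.step()
```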
Neural Networks for Unsupervised Learning
Linear Autoencoder (Principal Component Analysis)
- A linear autoencoder with a single hidden layer and linear activations learns to project data onto its top principal components.
- Assume the data is centered (zero mean).
- The encoder maps input \(x\) to a lower-dimensional representation \(z\): \[
z = W_{\text{enc}} \cdot x
\]
- The decoder reconstructs the input from \(z\): \[
\hat{x} = W_{\text{dec}} \cdot z
\]
Neural Networks for Unsupervised Learning
Linear Autoencoder (Principal Component Analysis)
- Training minimizes the reconstruction error: \[
\hat{\theta} = \arg \min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \|x_i - \hat{x}_i\|^2
\]
- The optimal weights span the same subspace as the top \(k\) principal components of the data.
- \(W_{\text{enc}}\) can be taken to be the top \(k\) eigenvectors of the data covariance matrix (one of many equivalent optima, since the latent space is only determined up to an invertible transformation).
- \(W_{\text{dec}}\) is then the transpose of \(W_{\text{enc}}\).
- Biases are not needed for centered data.
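The PCA connection can be checked directly in NumPy: the top-\(k\) eigenvectors of the covariance matrix play the role of the optimal encoder/decoder weights (a sketch on random centered data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
X = X - X.mean(axis=0)                    # center the data (zero mean)

k = 3
cov = X.T @ X / X.shape[0]                # data covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending eigenvalues
V_k = eigvecs[:, -k:]                     # top-k eigenvectors (columns)

W_enc = V_k.T                             # encoder: z = W_enc @ x
W_dec = V_k                               # decoder: x_hat = W_dec @ z  (= W_enc.T)

Z = X @ W_enc.T                           # latent codes for all samples
X_hat = Z @ W_dec.T                       # reconstructions
recon_error = np.mean((X - X_hat) ** 2)   # per-entry variance not captured by the top-k components
```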
Gradient-Based Optimization
Stochastic Gradient Descent (SGD)
- Full-batch gradient descent (GD)
- Uses all \(N\) samples each step
- Accurate but computationally heavy
- Stochastic gradient descent (SGD)
- Uses random mini-batches \(\mathcal{B}_t\), \(|\mathcal{B}_t| = m \ll N\) \[
\theta_{t+1} = \theta_t - \eta \cdot \nabla_\theta
\frac{1}{m} \sum_{i \in \mathcal{B}_t} \ell(f_\theta(x_i), y_i)
\]
- Uses a subset of the data per step → cheaper iterations.
- Adds gradient noise → better exploration.
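A bare-bones NumPy sketch of one epoch of mini-batch SGD on linear least squares (the data, model, and hyperparameters are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))            # N = 1000 samples, 5 features
w_true = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ w_true + 0.1 * rng.normal(size=1000)

w = np.zeros(5)                           # parameters theta
eta, m = 0.1, 32                          # learning rate and batch size

idx = rng.permutation(len(X))             # shuffle once per epoch
for start in range(0, len(X), m):
    batch = idx[start:start + m]          # mini-batch B_t
    X_b, y_b = X[batch], y[batch]
    grad = 2.0 / len(batch) * X_b.T @ (X_b @ w - y_b)   # gradient of the mean squared error
    w = w - eta * grad                    # SGD update
```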
Stochastic Gradient Descent (SGD)
- Under certain conditions, converges in expectation to a stationary point (usually a local minimum).
- Requires an unbiased gradient estimate: \[
\mathbb{E}_{\mathcal{B}_t}\left[\nabla_\theta
\frac{1}{m} \sum_{i \in \mathcal{B}_t} \ell(f_\theta(x_i), y_i)\right]
= \nabla_\theta
\frac{1}{N} \sum_{i=1}^{N} \ell(f_\theta(x_i), y_i)
\]
- This holds if mini-batches are sampled uniformly at random from the dataset or if data is shuffled each epoch.
- An epoch is one full pass over the dataset.
Stochastic Gradient Descent (SGD)
- For convex functions, reaching a high-precision solution (in expectation) typically requires more epochs with SGD than with full-batch GD.
- In practice, SGD is often faster to converge than GD (especially for low precision) due to cheaper iterations and better exploration when the loss is non-convex.
- A highly precise solution is often not needed in ML applications.
- A high gradient variance can help escape sharp minima but slows down convergence to high precision.
- The batch size \(m\) controls the gradient variance:
- Smaller \(m\) → higher variance
- Larger \(m\) → lower variance
- The optimal batch size depends on the specific problem and dataset and is often chosen based on hardware constraints (e.g., GPU memory).
Stochastic Gradient Descent (SGD)
- Another key condition is a diminishing learning rate: \[ \sum_{t=1}^\infty \eta_t = \infty \quad \text{and} \quad \sum_{t=1}^\infty \eta_t^2 < \infty \] For example, \(\eta_t = \frac{\eta_0}{1 + \lambda t}\) satisfies this for \(\lambda > 0\) and \(\eta_0 > 0\).
- \(t\) is the iteration index (or epoch index if decaying per epoch).
- Constant or piecewise-decayed learning rates are sometimes used for efficiency.
- Constant learning rates can still converge to a neighborhood of a minimum, but may not converge to a local minimum precisely.
Practical SGD Algorithm
- Shuffle data each epoch
- Split data into mini-batches of size \(m\)
- Loop over mini-batches, updating weights each time
- Apply learning-rate decay or cosine schedules \[ \eta_t = \eta_0 \cdot \frac{1}{1 + \lambda t} \] \[ \eta_t = \eta_{min} + \frac{1}{2}(\eta_0 - \eta_{min})\left(1 + \cos\left(\frac{\pi t}{T}\right)\right) \]
- Apply early stopping, monitoring validation loss to avoid overfitting
- Use learning rate warm-up, gradually increasing the learning rate from a small value to the target value over a few epochs, before decaying it.
- Apply gradient clipping to prevent exploding gradients.
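A sketch of how these pieces fit together in a PyTorch training loop (model, data, and schedule settings are placeholders; warm-up and a proper validation set are omitted for brevity):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data and model (illustrative shapes only).
X, y = torch.randn(1000, 10), torch.randn(1000, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)  # reshuffles every epoch

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=50)      # cosine decay over 50 epochs
loss_fn = nn.MSELoss()

for epoch in range(50):
    for xb, yb in loader:                                              # loop over mini-batches
        loss = loss_fn(model(xb), yb)
        opt.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
        opt.step()
    sched.step()                                                       # apply the learning-rate schedule
    # In practice: evaluate on a held-out validation set here and stop early
    # if the validation loss has not improved for a few epochs.
```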
Tradeoffs in SGD
- Batch size (m) trade-off
- Smaller m
- More noise leading to better exploration
- Faster updates and lower memory footprint
- Larger m
- Smoother gradients leading to more stable convergence
- Slower updates and higher memory usage
Tradeoffs in SGD
- Learning rate (\(\eta\)) trade-off
- Larger \(\eta\)
- Can converge faster but with a higher risk of overshooting or divergence
- May require decay schedules or gradient clipping
- Smaller \(\eta\)
- Slower but more precise convergence
- May need more epochs; can get stuck in local minima
Beyond Vanilla SGD
- Momentum: smooths the updates and accelerates along consistent directions. \[
v_{t+1} = \beta v_t + (1-\beta)\nabla_\theta L(\theta_t),
\qquad
\theta_{t+1} = \theta_t - \eta v_{t+1}
\]
- Adaptive methods: adjust per-parameter learning rates using running averages.
- RMSProp, for each scalar parameter \(\theta\): \[
s_{t+1} = \beta s_t + (1-\beta)(\nabla_\theta L(\theta_t))^2,
\qquad
\theta_{t+1} = \theta_t - \eta \frac{\nabla_\theta L(\theta_t)}{\sqrt{s_{t+1}} + \epsilon}
\]
- Adam
Adam (Adaptive Moment) Optimizer
For each scalar parameter \(\theta\):
\[
m_{t+1} = \beta_1 m_t + (1-\beta_1)\nabla_\theta L(\theta_t), \quad
s_{t+1} = \beta_2 s_t + (1-\beta_2)(\nabla_\theta L(\theta_t))^2,
\] \[
\hat{m}_{t+1} = \frac{m_{t+1}}{1 - \beta_1^{t+1}}, \quad
\hat{s}_{t+1} = \frac{s_{t+1}}{1 - \beta_2^{t+1}},
\] \[
\theta_{t+1} = \theta_t - \eta \frac{\hat{m}_{t+1}}{\sqrt{\hat{s}_{t+1}} + \epsilon}
\]
- Combines momentum and adaptive learning rates.
- Default hyperparameters: \(\beta_1=0.9\), \(\beta_2=0.999\), \(\epsilon=10^{-8}\).
- Works well in practice for many tasks, especially with sparse gradients.
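The update equations above translate directly into NumPy (a sketch of a single update step, not a replacement for `torch.optim.Adam`):

```python
import numpy as np

def adam_step(theta, grad, m, s, t, eta=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m and s are the running first/second moment estimates."""
    m = beta1 * m + (1 - beta1) * grad            # first moment (momentum)
    s = beta2 * s + (1 - beta2) * grad ** 2       # second moment (per-parameter scale)
    m_hat = m / (1 - beta1 ** t)                  # bias correction (t starts at 1)
    s_hat = s / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(s_hat) + eps)
    return theta, m, s
```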
Common Regularization Techniques
- Penalty-based regularization (weight decay) to improve generalization and reduce overfitting
- Dropout layers to add more noise and improve robustness
- Batch normalization layers to stabilize training
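In PyTorch, dropout and batch normalization are simply layers inserted into the model (a sketch with arbitrary sizes; remember to call `model.train()` / `model.eval()` so these layers switch behaviour correctly):

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),   # normalizes activations over the batch -> more stable training
    nn.ReLU(),
    nn.Dropout(p=0.5),    # randomly zeroes 50% of activations during training
    nn.Linear(64, 1),
)

# model.train()  -> dropout and batch-norm batch statistics are active
# model.eval()   -> dropout is disabled, batch-norm uses running statistics
```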
Penalty-based Regularization
L1 (Lasso) vs L2 (Ridge) Regularization
- Regularization helps prevent overfitting by penalizing model complexity.
- L1 (Lasso) adds sum of absolute weights: \[\text{Loss}_{L1} = \text{Loss}_{\text{original}} + \lambda \sum_i |w_i|\]
- L2 (Ridge) adds sum of squared weights: \[\text{Loss}_{L2} = \text{Loss}_{\text{original}} + \lambda \sum_i w_i^2\]
- L2 regularization shrinks weights smoothly toward zero but rarely makes any weight exactly zero, so it does not produce sparsity.
- Sparsity refers to having many weights exactly zero.
- L1 regularization tends to result in more sparse weights.
L1 (Lasso) vs L2 (Ridge) Regularization
- L1 can perform feature selection by driving some weights to zero.
- Which variable is selected can be unstable under small data changes.
- Features selected are generally associated, not necessarily causal.
- L2 is more robust when many highly correlated features exist, penalty is smooth at 0.
- L1 is non-smooth at 0, which means the gradient of the training loss plus regularization may not be close to 0 near a local optimum.
- Elastic net combines both L1 and L2 penalties. \[ \text{Loss}_{\text{EN}} = \text{Loss}_{\text{original}} + \lambda_1 \sum_i |w_i| + \lambda_2 \sum_i w_i^2 \]
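A sketch of adding these penalties to a PyTorch loss by hand (the \(\lambda\) values are illustrative; in practice the L2 penalty is often applied via the optimizer’s `weight_decay` argument, discussed below):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
mse = nn.MSELoss()

lam1, lam2 = 1e-4, 1e-4                               # illustrative penalty strengths
l1 = sum(p.abs().sum() for p in model.parameters())   # L1 (Lasso) term
l2 = sum((p ** 2).sum() for p in model.parameters())  # L2 (Ridge) term

loss_l1 = mse(model(x), y) + lam1 * l1                # Lasso-regularized loss
loss_l2 = mse(model(x), y) + lam2 * l2                # Ridge-regularized loss
loss_en = mse(model(x), y) + lam1 * l1 + lam2 * l2    # Elastic net
```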
Scaling of Regularization Terms
- Summing over all parameters means larger models have larger penalties.
- The regularization strength \(\lambda\) thus depends on model size.
- It is possible to average the regularization term over the number of parameters, i.e., use \(\frac{1}{P} \sum_i |w_i|\) or \(\frac{1}{P} \sum_i w_i^2\) where \(P\) is the number of parameters.
- This normalization makes the regularization strength \(\lambda\) less sensitive to model size, allowing for more consistent tuning across different architectures.
- However, this is less common in practice; most implementations use the unnormalized sum.
Weight Decay vs L2 Regularization
- Weight decay is a regularization technique that adds a penalty proportional to the magnitude of the weights to the loss function during training.
- The SGD update with weight decay \(\lambda\) is: \[
\theta_{t+1} = \theta_t - \eta \left( \nabla_\theta L(\theta_t) + \lambda \theta_t \right)
\]
- The loss \(L(\theta)\) above does not explicitly include the weight decay term; instead, it is applied directly in the update step.
- A similar adaptation can be made for other optimizers like Adam, called AdamW.
- In standard SGD, weight decay and L2 regularization are mathematically equivalent (when scaled by the learning rate). That is, applying L2 regularization to the loss produces weight updates that shrink weights, the same effect produced by weight decay in SGD.
Weight Decay vs L2 Regularization
- PyTorch’s AdamW implements decoupled weight decay, meaning weight decay is applied directly to weights in the update step rather than being added as a penalty term inside the loss function. See the official docs: PyTorch Documentation
- With adaptive optimizers like Adam, L2 regularization added to the loss interacts with adaptive learning rates and does not produce the same effect as simple weight decay — adaptive scaling distorts the regularization effect.
- AdamW decouples weight decay, preserving the intended shrinkage effect on weights without interfering with the adaptive gradient updates, which has been found to work well in practice.
- Summary
- For SGD, weight decay = L2 regularization.
- For AdamW, weight decay acts like an L2-type shrinkage step, but is decoupled from the gradient of the loss.
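Both approaches are exposed through the `weight_decay` argument of PyTorch optimizers (the learning rates and decay values below are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)

# SGD: weight_decay behaves like an L2 penalty added to the loss (the two are equivalent here).
opt_sgd = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

# AdamW: decoupled weight decay -- the shrinkage is applied directly to the weights
# in the update step, outside of the adaptive gradient scaling.
opt_adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# Adam with weight_decay instead adds the term to the gradient (L2-style coupling).
opt_adam = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)
```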
Learning in Small Data Regimes
Transfer Learning
Train on a source task/domain with rich data, then reuse its representations/weights to improve a target task with limited data.
When should it help?
- High representation overlap: source captures features relevant to target.
- Small target dataset with similar modality (e.g., time‑series \(\rightarrow\) time‑series).
- Red flag: Large domain gap + tiny target set \(\rightarrow\) risk of negative transfer.
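A common fine-tuning recipe in PyTorch: load a pretrained backbone, freeze its weights, and train only a new task-specific head (a sketch assuming torchvision is available and a hypothetical 5-class target task):

```python
import torch.nn as nn
from torchvision import models

# Source task: ImageNet-pretrained backbone with general-purpose features.
backbone = models.resnet18(weights="IMAGENET1K_V1")

for p in backbone.parameters():
    p.requires_grad = False              # freeze the pretrained representations

# Target task: replace the classification head for a (hypothetical) 5-class problem.
backbone.fc = nn.Linear(backbone.fc.in_features, 5)

# Only the new head's parameters will be updated during training.
trainable = [p for p in backbone.parameters() if p.requires_grad]
```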
Self-supervised Learning
Task Aggregation
Data Growing / Augmentation
When does Augmentation Work?
- Label‑preserving transforms of inputs that reflect real invariances.
- Time‑series examples: small time‑warp, jitter (±5–10%), window cropping/shift, light magnitude scaling, masking.
- Realistic noise injection that mirrors measurement/sensor error.
- Domain‑aware synthesis: PBPK / pop‑PK simulations to cover rare dosing regimens/subgroups.
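A minimal NumPy sketch of label-preserving time-series augmentations along these lines (the noise level, scaling range, and crop fraction are illustrative choices that should mirror realistic measurement variation):

```python
import numpy as np

rng = np.random.default_rng(0)

def jitter(x, sigma=0.05):
    """Additive noise mimicking sensor/measurement error."""
    return x + rng.normal(0.0, sigma * x.std(), size=x.shape)

def scale(x, low=0.9, high=1.1):
    """Light magnitude scaling (roughly +/-10%)."""
    return x * rng.uniform(low, high)

def random_crop(x, frac=0.9):
    """Window cropping/shift: keep a random contiguous window."""
    n = int(len(x) * frac)
    start = rng.integers(0, len(x) - n + 1)
    return x[start:start + n]

x = np.sin(np.linspace(0, 10, 200))        # a toy series
augmented = [jitter(x), scale(x), random_crop(x)]
```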