Introduction to Deep Learning
Authors:
Abdelwahed Khamis, Mohamed Tarek
Perceptron
- The perceptron is a simple model of a biological neuron, introduced by Frank Rosenblatt in 1958.
- It is a binary classifier that maps input features to one of two classes using a linear decision boundary.
- The perceptron computes a weighted sum of the input features, applies an activation function (step function), and outputs a class label.
- The perceptron can be trained using the Perceptron Learning Algorithm (PLA), which adjusts the weights based on misclassifications.
Rosenblatt’s Perceptron Learning Algorithm (PLA)
- Initialize the weight vector
\[\mathbf w^{(0)} \leftarrow \mathbf 0 \;(\text{or small random values}), \qquad
b^{(0)} \leftarrow 0\]
- Loop over the training set \(\{(\mathbf x_i,\,y_i)\}_{i=1}^{N}\) with \(y_i \in \{-1,+1\}\)
- Predict \(\hat y_i = \operatorname{sign}(\mathbf w\!\cdot\!\mathbf x_i + b)\)
- If it is correct, do nothing
- If it is wrong, update
\[\mathbf w \gets \mathbf w + \eta \cdot y_i \cdot \mathbf x_i, \qquad
b \gets b + \eta \cdot y_i\] where \(\eta>0\) is a fixed learning rate (often \(\eta = 1\)).
- Repeat passes (epochs) until all points are classified correctly or a preset iteration budget is exhausted.
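A minimal NumPy sketch of this update rule (the toy data, learning rate, and epoch budget below are illustrative assumptions, not part of the original algorithm statement):

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, max_epochs=100):
    """Rosenblatt's PLA. X has shape (N, d); labels y are in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            y_hat = 1.0 if (w @ x_i + b) >= 0 else -1.0
            if y_hat != y_i:              # misclassified point -> update
                w += eta * y_i * x_i
                b += eta * y_i
                errors += 1
        if errors == 0:                   # all points classified correctly
            break
    return w, b

# Toy linearly separable data (illustrative only)
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = perceptron_train(X, y)
```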
Multi-Layer Perceptron (MLP)
- A neuron computes a weighted sum of inputs, adds a bias, and applies an activation function. \[ y = f\!\biggl(\sum_{i=1}^{n} w_i x_i + b\biggr) \] where
- \(w_i\): weights
- \(b\): bias
- \(f(\cdot)\): activation function
- A Multi-Layer Perceptron (MLP) is a feed-forward neural network with one or more hidden layers of neurons.
- They are called feed-forward because information flows in one direction, from input to output, as opposed to recurrent networks, where information can loop back and create cycles.
Multi-Layer Perceptron (MLP)
- Each layer transforms the input data, allowing the network to learn complex mappings.
- MLPs are just functions with many parameters (weights and biases) that can be optimized to fit any function.
- Activation functions introduce non-linearity, enabling neural networks to learn complex mappings.
- Without them, a deep network collapses to a single linear transformation.
- Properties of activation functions:
- Nonlinear
- Differentiable almost everywhere for gradient-based learning
- Preferably, computationally efficient
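As a concrete illustration, a small feed-forward MLP in PyTorch (the layer sizes here are arbitrary placeholders):

```python
import torch
import torch.nn as nn

# Each Linear layer computes W x + b; the activation adds the non-linearity.
mlp = nn.Sequential(
    nn.Linear(10, 32),   # input layer: 10 features -> 32 hidden units
    nn.ReLU(),
    nn.Linear(32, 32),   # hidden layer
    nn.ReLU(),
    nn.Linear(32, 1),    # output layer: a single regression output
)

x = torch.randn(4, 10)   # batch of 4 samples
y_hat = mlp(x)           # forward pass: information flows input -> output
print(y_hat.shape)       # torch.Size([4, 1])
```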
Common Activation Functions
- Sigmoid: smooth; bounded output; suffers from vanishing gradients for large |x|.
- Rectified Linear Unit (ReLU): simple, efficient; non-smooth at 0; outputs zero for x < 0, so units can “die” (get stuck with zero output and zero gradient); unbounded output.
- Hyperbolic Tangent (tanh): zero-centered; smooth; bounded output; also suffers from vanishing gradients.
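For reference, these three functions can be written directly in NumPy (a sketch, not a library implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # bounded in (0, 1)

def relu(x):
    return np.maximum(0.0, x)         # 0 for x < 0, identity for x >= 0

def tanh(x):
    return np.tanh(x)                 # zero-centered, bounded in (-1, 1)
```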
Other Common Activation Functions
| Activation | Formula | Output range |
|---|---|---|
| Leaky ReLU | \(f(x) = \max(\alpha x, x)\) | \((-\infty, \infty)\) |
| Softmax | \(f_i(x) = \frac{e^{x_i}}{\sum_j e^{x_j}}\) | \((0, 1)\) |
- Leaky ReLU: mitigates dying ReLU; small gradient for x < 0; unbounded output.
- Softmax: used for multi-class classification; outputs probabilities summing to 1.
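Both can be sketched in NumPy as follows (\(\alpha\) is a small slope such as 0.01; subtracting the max in softmax is a standard numerical-stability trick):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small non-zero slope for x < 0

def softmax(x):
    z = x - np.max(x)                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()                     # outputs are positive and sum to 1
```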
Vanishing Gradient & Activation Functions
- The vanishing gradient problem occurs when gradients become very small, slowing or halting learning.
- Functions like sigmoid squash inputs to (0, 1). For large positive/negative inputs, their derivatives become very small, and gradients shrink when propagated through many layers — causing vanishing gradients.
- Both sigmoid and tanh are saturating activation functions, where the derivative tends to zero for large input magnitudes.
- ReLU doesn’t saturate in the positive region, preserving gradient magnitude and mitigating vanishing gradients. However, its output is unbounded, so large activations can still contribute to exploding gradients.
- For more information about activation functions, see https://en.wikipedia.org/wiki/Activation_function
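A quick numerical illustration of saturation: the sigmoid derivative is at most 0.25, so the product of local derivatives through a stack of sigmoid activations shrinks rapidly (the depth and pre-activation value below are arbitrary choices):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # maximum value is 0.25 (at x = 0)

# Chain rule through a stack of sigmoid activations: multiply local derivatives.
grad = 1.0
x = 2.0                           # a moderately large pre-activation
for layer in range(10):
    grad *= sigmoid_grad(x)
print(grad)                       # ~1e-10: the gradient has effectively vanished
```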
Universal Approximation Theorem
A feed-forward neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets of \(\mathbb{R}^n\), given appropriate activation functions (e.g., sigmoid, ReLU).
- The first formal statement was provided by Cybenko (1989) for sigmoid activations.
- Later it was generalized multiple times to other activation functions, discontinuous functions, deeper networks and non-compact domains.
- This theorem underpins the theoretical foundation of neural networks, demonstrating their capacity to model complex functions.
- However, it does not provide guidance on the number of neurons needed or how to train such networks effectively.
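As an illustration (not a proof), a single-hidden-layer network can be fit to a simple 1-D function; the width, optimizer, target function, and number of steps below are arbitrary choices:

```python
import torch
import torch.nn as nn

# One hidden layer with a sigmoidal (tanh) activation, as in the classic statement.
net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))

x = torch.linspace(-3, 3, 200).unsqueeze(1)
y = torch.sin(2 * x)                      # target continuous function

opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for step in range(2000):
    loss = ((net(x) - y) ** 2).mean()     # mean squared error
    opt.zero_grad()
    loss.backward()
    opt.step()

print(loss.item())                        # small -> good approximation on [-3, 3]
```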
Deep Learning vs Neural Networks
- Deep learning refers to the use of deep mathematical models, typically neural networks with multiple hidden layers, to learn representations of data and relationships within it.
- The depth of these functions allows them to learn hierarchical representations of data, capturing complex patterns and abstractions.
- There are other classes of deep models beyond neural networks, such as deep Gaussian processes and deep Boltzmann machines.
- However, in practice, deep learning is often synonymous with deep neural networks due to their widespread success across various domains.
Neural Network Architectures
Among neural networks, there are different architectures designed for specific data types and tasks, such as:
- Multi-Layer Perceptrons (MLPs) for tabular data,
- Convolutional Neural Networks (CNNs) for image data,
- Recurrent Neural Networks (RNNs) for sequential data,
- Transformers for natural language processing, and
- Graph Neural Networks (GNNs) for graph-structured data.
Neural Networks for Supervised Learning
- Learn a mapping from inputs to known targets.
\[
\hat{y} = f_\theta(x)
\]
- The network adjusts weights to minimize the mean prediction error. \[
\hat{\theta} = \arg \min_\theta \frac{1}{N} \sum_{i=1}^{N} \ell(f_\theta(x_i), y_i)
\]
- Regression loss \[
\ell(\hat{y}, y) = \|\hat{y} - y\|^2
\]
- Classification loss \[
\ell(\hat{y}, y) = -\sum_{c} y_c \log(\hat{y}_c)
\]
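In PyTorch, these two losses correspond to `nn.MSELoss` and `nn.CrossEntropyLoss`; note that `CrossEntropyLoss` expects raw logits and integer class labels rather than one-hot vectors (the shapes below are made up for illustration):

```python
import torch
import torch.nn as nn

# Regression: squared error between prediction and target.
mse = nn.MSELoss()
y_hat = torch.randn(8, 1)                 # predictions for 8 samples
y = torch.randn(8, 1)                     # targets
reg_loss = mse(y_hat, y)

# Classification: cross-entropy. PyTorch applies softmax internally,
# so it takes raw logits and integer class indices (not one-hot vectors).
ce = nn.CrossEntropyLoss()
logits = torch.randn(8, 3)                # 8 samples, 3 classes
labels = torch.randint(0, 3, (8,))        # class indices in {0, 1, 2}
clf_loss = ce(logits, labels)
```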
Neural Networks for Unsupervised Learning
Autoencoder
- Compresses input to a smaller latent embedding. \[
z = \text{Encoder}_{\theta_1}(x)
\]
- Reconstructs the input from that embedding. \[
\hat{x} = \text{Decoder}_{\theta_2}(z)
\]
- Trains to minimize reconstruction error. \[
\begin{aligned}
\hat{\theta} &= \arg \min_{\theta_1, \theta_2} \frac{1}{N} \sum_{i=1}^{N} \|x_i - \hat{x}_i\|^2 \\
\hat{x}_i &= \text{Decoder}_{\theta_2}(\text{Encoder}_{\theta_1}(x_i))
\end{aligned}
\]
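A minimal PyTorch autoencoder sketch matching the equations above (the input dimension, latent dimension, and training length are arbitrary assumptions):

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(20, 8), nn.ReLU(), nn.Linear(8, 3))   # x -> z
decoder = nn.Sequential(nn.Linear(3, 8), nn.ReLU(), nn.Linear(8, 20))   # z -> x_hat

params = list(encoder.parameters()) + list(decoder.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

x = torch.randn(64, 20)                   # a batch of unlabeled inputs
for step in range(100):
    z = encoder(x)                        # compress to the latent embedding
    x_hat = decoder(z)                    # reconstruct the input
    loss = ((x - x_hat) ** 2).mean()      # reconstruction error
    opt.zero_grad()
    loss.backward()
    opt.step()
```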
Neural Networks for Unsupervised Learning
Linear Autoencoder (Principal Component Analysis)
- A linear autoencoder with a single hidden layer and linear activations learns to project data onto its top principal components.
- Assume the data is centered (zero mean).
- The encoder maps input \(x\) to a lower-dimensional representation \(z\): \[
z = W_{\text{enc}} \cdot x
\]
- The decoder reconstructs the input from \(z\): \[
\hat{x} = W_{\text{dec}} \cdot z
\]
Neural Networks for Unsupervised Learning
Linear Autoencoder (Principal Component Analysis)
- Training minimizes the reconstruction error: \[
\hat{\theta} = \arg \min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \|x_i - \hat{x}_i\|^2
\]
- The optimal weights span the same subspace as the top \(k\) principal components of the data.
- \(W_{\text{enc}}\) can be taken to be the top \(k\) eigenvectors of the data covariance matrix (one of many equivalent optima, since the latent space is only determined up to an invertible transformation).
- \(W_{\text{dec}}\) is then the transpose of \(W_{\text{enc}}\).
- Biases are not needed for centered data.
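The PCA connection can be checked directly in NumPy: the top-\(k\) eigenvectors of the covariance matrix play the role of the optimal encoder/decoder weights (a sketch on random centered data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
X = X - X.mean(axis=0)                    # center the data (zero mean)

k = 3
cov = X.T @ X / X.shape[0]                # data covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending eigenvalues
V_k = eigvecs[:, -k:]                     # top-k eigenvectors (columns)

W_enc = V_k.T                             # encoder: z = W_enc @ x
W_dec = V_k                               # decoder: x_hat = W_dec @ z  (= W_enc.T)

Z = X @ W_enc.T                           # latent codes for all samples
X_hat = Z @ W_dec.T                       # reconstructions
recon_error = np.mean((X - X_hat) ** 2)   # per-entry variance not captured by the top-k components
```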
Gradient-Based Optimization
Stochastic Gradient Descent (SGD)
- Full-batch gradient descent (GD)
- Uses all \(N\) samples each step
- Accurate but computationally heavy
- Stochastic gradient descent (SGD)
- Uses random mini-batches \(\mathcal{B}_t\), \(|\mathcal{B}_t| = m \ll N\) \[
\theta_{t+1} = \theta_t - \eta \cdot \nabla_\theta
\frac{1}{m} \sum_{i \in \mathcal{B}_t} \ell(f_\theta(x_i), y_i)
\]
- Uses a subset of the data per step → cheaper iterations.
- Adds gradient noise → better exploration.
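A bare-bones NumPy sketch of one epoch of mini-batch SGD on linear least squares (the data, model, and hyperparameters are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))            # N = 1000 samples, 5 features
w_true = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ w_true + 0.1 * rng.normal(size=1000)

w = np.zeros(5)                           # parameters theta
eta, m = 0.1, 32                          # learning rate and batch size

idx = rng.permutation(len(X))             # shuffle once per epoch
for start in range(0, len(X), m):
    batch = idx[start:start + m]          # mini-batch B_t
    X_b, y_b = X[batch], y[batch]
    grad = 2.0 / len(batch) * X_b.T @ (X_b @ w - y_b)   # gradient of the mean squared error
    w = w - eta * grad                    # SGD update
```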
Stochastic Gradient Descent (SGD)
- Under certain conditions, converges in expectation to a stationary point (usually a local minimum).
- Requires an unbiased gradient estimate: \[
\mathbb{E}_{\mathcal{B}_t}\left[\nabla_\theta
\frac{1}{m} \sum_{i \in \mathcal{B}_t} \ell(f_\theta(x_i), y_i)\right]
= \nabla_\theta
\frac{1}{N} \sum_{i=1}^{N} \ell(f_\theta(x_i), y_i)
\]
- This holds if mini-batches are sampled uniformly at random from the dataset or if data is shuffled each epoch.
- An epoch is one full pass over the dataset.
Stochastic Gradient Descent (SGD)
- For convex functions, reaching a high-precision solution (in expectation) typically requires more epochs with SGD than with full-batch GD.
- In practice, SGD is often faster to converge than GD (especially for low precision) due to cheaper iterations and better exploration when the loss is non-convex.
- A highly precise solution is often not needed in ML applications.
- A high gradient variance can help escape sharp minima but slows down convergence to high precision.
- The batch size \(m\) controls the gradient variance:
- Smaller \(m\) → higher variance
- Larger \(m\) → lower variance
- The optimal batch size depends on the specific problem and dataset and is often chosen based on hardware constraints (e.g., GPU memory).
Stochastic Gradient Descent (SGD)
- Another key condition is a diminishing learning rate: \[ \sum_{t=1}^\infty \eta_t = \infty \quad \text{and} \quad \sum_{t=1}^\infty \eta_t^2 < \infty \] For example, \(\eta_t = \frac{\eta_0}{1 + \lambda t}\) satisfies this for \(\lambda > 0\) and \(\eta_0 > 0\).
- \(t\) is the iteration index (or epoch index if decaying per epoch).
- Constant or piecewise-decayed learning rates are sometimes used for efficiency.
- Constant learning rates can still converge to a neighborhood of a minimum, but may not converge to a local minimum precisely.
Practical SGD Algorithm
- Shuffle data each epoch
- Split data into mini-batches of size \(m\)
- Loop over mini-batches, updating weights each time
- Apply learning-rate decay or cosine schedules \[ \eta_t = \eta_0 \cdot \frac{1}{1 + \lambda t} \] \[ \eta_t = \eta_{min} + \frac{1}{2}(\eta_0 - \eta_{min})\left(1 + \cos\left(\frac{\pi t}{T}\right)\right) \]
- Apply early stopping, monitoring validation loss to avoid overfitting
- Use learning rate warm-up, gradually increasing the learning rate from a small value to the target value over a few epochs, before decaying it.
- Apply gradient clipping to prevent exploding gradients.
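A sketch of how these pieces fit together in a PyTorch training loop (model, data, and schedule settings are placeholders; warm-up and a proper validation set are omitted for brevity):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data and model (illustrative shapes only).
X, y = torch.randn(1000, 10), torch.randn(1000, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)  # reshuffles every epoch

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=50)      # cosine decay over 50 epochs
loss_fn = nn.MSELoss()

for epoch in range(50):
    for xb, yb in loader:                                              # loop over mini-batches
        loss = loss_fn(model(xb), yb)
        opt.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
        opt.step()
    sched.step()                                                       # apply the learning-rate schedule
    # In practice: evaluate on a held-out validation set here and stop early
    # if the validation loss has not improved for a few epochs.
```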
Tradeoffs in SGD
- Batch size (m) trade-off
- Smaller m
- More noise leading to better exploration
- Faster updates and lower memory footprint
- Larger m
- Smoother gradients leading to more stable convergence
- Slower updates and higher memory usage
Tradeoffs in SGD
- Learning rate (\(\eta\)) trade-off
- Larger \(\eta\)
- Can converge faster but with a higher risk of overshooting or divergence
- May require decay schedules or gradient clipping
- Smaller \(\eta\)
- Slower but more precise convergence
- May need more epochs; can get stuck in local minima
Beyond Vanilla SGD
- Momentum: smooths the updates and accelerates along consistent directions. \[
v_{t+1} = \beta v_t + (1-\beta)\nabla_\theta L(\theta_t),
\qquad
\theta_{t+1} = \theta_t - \eta v_{t+1}
\]
- Adaptive methods: adjust per-parameter learning rates using running averages.
- RMSProp, for each scalar parameter \(\theta\): \[
s_{t+1} = \beta s_t + (1-\beta)(\nabla_\theta L(\theta_t))^2,
\qquad
\theta_{t+1} = \theta_t - \eta \frac{\nabla_\theta L(\theta_t)}{\sqrt{s_{t+1}} + \epsilon}
\]
- Adam
Adam (Adaptive Moment) Optimizer
For each scalar parameter \(\theta\):
\[
m_{t+1} = \beta_1 m_t + (1-\beta_1)\nabla_\theta L(\theta_t), \quad
s_{t+1} = \beta_2 s_t + (1-\beta_2)(\nabla_\theta L(\theta_t))^2,
\] \[
\hat{m}_{t+1} = \frac{m_{t+1}}{1 - \beta_1^{t+1}}, \quad
\hat{s}_{t+1} = \frac{s_{t+1}}{1 - \beta_2^{t+1}},
\] \[
\theta_{t+1} = \theta_t - \eta \frac{\hat{m}_{t+1}}{\sqrt{\hat{s}_{t+1}} + \epsilon}
\]
- Combines momentum and adaptive learning rates.
- Default hyperparameters: \(\beta_1=0.9\), \(\beta_2=0.999\), \(\epsilon=10^{-8}\).
- Works well in practice for many tasks, especially with sparse gradients.
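The update equations above translate directly into NumPy (a sketch of a single update step, not a replacement for `torch.optim.Adam`):

```python
import numpy as np

def adam_step(theta, grad, m, s, t, eta=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m and s are the running first/second moment estimates."""
    m = beta1 * m + (1 - beta1) * grad            # first moment (momentum)
    s = beta2 * s + (1 - beta2) * grad ** 2       # second moment (per-parameter scale)
    m_hat = m / (1 - beta1 ** t)                  # bias correction (t starts at 1)
    s_hat = s / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(s_hat) + eps)
    return theta, m, s
```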
Common Regularization Techniques
- Penalty-based regularization (weight decay) to improve generalization and reduce overfitting
- Dropout layers to add more noise and improve robustness
- Batch normalization layers to stabilize training
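In PyTorch, dropout and batch normalization are simply layers inserted into the model (a sketch with arbitrary sizes; remember to call `model.train()` / `model.eval()` so these layers switch behaviour correctly):

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),   # normalizes activations over the batch -> more stable training
    nn.ReLU(),
    nn.Dropout(p=0.5),    # randomly zeroes 50% of activations during training
    nn.Linear(64, 1),
)

# model.train()  -> dropout and batch-norm batch statistics are active
# model.eval()   -> dropout is disabled, batch-norm uses running statistics
```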
Penalty-based Regularization
L1 (Lasso) vs L2 (Ridge) Regularization
- Regularization helps prevent overfitting by penalizing model complexity.
- L1 (Lasso) adds sum of absolute weights: \[\text{Loss}_{L1} = \text{Loss}_{\text{original}} + \lambda \sum_i |w_i|\]
- L2 (Ridge) adds sum of squared weights: \[\text{Loss}_{L2} = \text{Loss}_{\text{original}} + \lambda \sum_i w_i^2\]
- L2 regularization shrinks weights smoothly toward zero but rarely makes any weight exactly zero, so it does not produce sparsity.
- Sparsity refers to having many weights exactly zero.
- L1 regularization tends to result in more sparse weights.
L1 (Lasso) vs L2 (Ridge) Regularization
- L1 can perform feature selection by driving some weights to zero.
- Which variable is selected can be unstable under small data changes.
- Features selected are generally associated, not necessarily causal.
- L2 is more robust when many highly correlated features exist, penalty is smooth at 0.
- L1 is non-smooth at 0, which means the gradient of the training loss plus regularization may not be close to 0 near a local optimum.
- Elastic net combines both L1 and L2 penalties. \[ \text{Loss}_{\text{EN}} = \text{Loss}_{\text{original}} + \lambda_1 \sum_i |w_i| + \lambda_2 \sum_i w_i^2 \]
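A sketch of adding these penalties to a PyTorch loss by hand (the \(\lambda\) values are illustrative; in practice the L2 penalty is often applied via the optimizer’s `weight_decay` argument, discussed below):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
mse = nn.MSELoss()

lam1, lam2 = 1e-4, 1e-4                               # illustrative penalty strengths
l1 = sum(p.abs().sum() for p in model.parameters())   # L1 (Lasso) term
l2 = sum((p ** 2).sum() for p in model.parameters())  # L2 (Ridge) term

loss_l1 = mse(model(x), y) + lam1 * l1                # Lasso-regularized loss
loss_l2 = mse(model(x), y) + lam2 * l2                # Ridge-regularized loss
loss_en = mse(model(x), y) + lam1 * l1 + lam2 * l2    # Elastic net
```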
Scaling of Regularization Terms
- Summing over all parameters means larger models have larger penalties.
- The regularization strength \(\lambda\) thus depends on model size.
- It is possible to average the regularization term over the number of parameters, i.e., use \(\frac{1}{P} \sum_i |w_i|\) or \(\frac{1}{P} \sum_i w_i^2\) where \(P\) is the number of parameters.
- This normalization makes the regularization strength \(\lambda\) less sensitive to model size, allowing for more consistent tuning across different architectures.
- However, this is less common in practice; most implementations use the unnormalized sum.
Weight Decay vs L2 Regularization
- Weight decay is a regularization technique that adds a penalty proportional to the magnitude of the weights to the loss function during training.
- The SGD update with weight decay \(\lambda\) is: \[
\theta_{t+1} = \theta_t - \eta \left( \nabla_\theta L(\theta_t) + \lambda \theta_t \right)
\]
- The loss \(L(\theta)\) above does not explicitly include the weight decay term; instead, it is applied directly in the update step.
- A similar adaptation can be made for other optimizers like Adam, called AdamW.
- In standard SGD, weight decay and L2 regularization are mathematically equivalent (when scaled by the learning rate). That is, applying L2 regularization to the loss produces weight updates that shrink weights, the same effect produced by weight decay in SGD.
Weight Decay vs L2 Regularization
- PyTorch’s AdamW implements decoupled weight decay, meaning weight decay is applied directly to weights in the update step rather than being added as a penalty term inside the loss function. See the official docs: PyTorch Documentation
- With adaptive optimizers like Adam, L2 regularization added to the loss interacts with adaptive learning rates and does not produce the same effect as simple weight decay — adaptive scaling distorts the regularization effect.
- AdamW decouples weight decay, preserving the intended shrinkage effect on weights without interfering with the adaptive gradient updates, which has been found to work well in practice.
- Summary
- For SGD, weight decay = L2 regularization.
- For AdamW, weight decay acts like an L2-type shrinkage step, but is decoupled from the gradient of the loss.
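Both approaches are exposed through the `weight_decay` argument of PyTorch optimizers (the learning rates and decay values below are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)

# SGD: weight_decay behaves like an L2 penalty added to the loss (the two are equivalent here).
opt_sgd = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

# AdamW: decoupled weight decay -- the shrinkage is applied directly to the weights
# in the update step, outside of the adaptive gradient scaling.
opt_adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# Adam with weight_decay instead adds the term to the gradient (L2-style coupling).
opt_adam = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)
```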
Learning in Small Data Regimes
Transfer Learning
Train on a source task/domain with rich data, then reuse its representations/weights to improve a target task with limited data.
When should it help?
- High representation overlap: source captures features relevant to target.
- Small target dataset with similar modality (e.g., time‑series \(\rightarrow\) time‑series).
- Red flag: Large domain gap + tiny target set \(\rightarrow\) risk of negative transfer.
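A common fine-tuning recipe in PyTorch: load a pretrained backbone, freeze its weights, and train only a new task-specific head (a sketch assuming torchvision is available and a hypothetical 5-class target task):

```python
import torch.nn as nn
from torchvision import models

# Source task: ImageNet-pretrained backbone with general-purpose features.
backbone = models.resnet18(weights="IMAGENET1K_V1")

for p in backbone.parameters():
    p.requires_grad = False              # freeze the pretrained representations

# Target task: replace the classification head for a (hypothetical) 5-class problem.
backbone.fc = nn.Linear(backbone.fc.in_features, 5)

# Only the new head's parameters will be updated during training.
trainable = [p for p in backbone.parameters() if p.requires_grad]
```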
Self-supervised Learning
Task Aggregation
Data Growing / Augmentation
When does Augmentation Work?
- Label‑preserving transforms of inputs that reflect real invariances.
- Time‑series examples: small time‑warp, jitter (±5–10%), window cropping/shift, light magnitude scaling, masking.
- Realistic noise injection that mirrors measurement/sensor error.
- Domain‑aware synthesis: PBPK / pop‑PK simulations to cover rare dosing regimens/subgroups.
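A minimal NumPy sketch of label-preserving time-series augmentations along these lines (the noise level, scaling range, and crop fraction are illustrative choices that should mirror realistic measurement variation):

```python
import numpy as np

rng = np.random.default_rng(0)

def jitter(x, sigma=0.05):
    """Additive noise mimicking sensor/measurement error."""
    return x + rng.normal(0.0, sigma * x.std(), size=x.shape)

def scale(x, low=0.9, high=1.1):
    """Light magnitude scaling (roughly +/-10%)."""
    return x * rng.uniform(low, high)

def random_crop(x, frac=0.9):
    """Window cropping/shift: keep a random contiguous window."""
    n = int(len(x) * frac)
    start = rng.integers(0, len(x) - n + 1)
    return x[start:start + n]

x = np.sin(np.linspace(0, 10, 200))        # a toy series
augmented = [jitter(x), scale(x), random_crop(x)]
```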