Continuous Data: Numerical data that can take any value within a range.
Discrete Data: Numerical data that can only take specific, separate values.
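As a quick illustration of the distinction, here is a minimal NumPy sketch (the variable names and distributions are illustrative, not from the text): a continuous measurement can take any value in a range, while a discrete count can only take separate integer values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Continuous: body temperature in °C can take any value within a range
temperatures = rng.normal(loc=36.8, scale=0.4, size=5)

# Discrete: number of hospital visits can only take non-negative integers
visits = rng.poisson(lam=2.0, size=5)

print(temperatures.dtype)  # floating-point values
print(visits.dtype)        # integer counts
```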
Goal: Interpolation, extrapolation, data generation or representation learning based on historical data.
Can be discriminative, generative, conditional generative or representation learning models.
Warning
Using a trained model to make predictions for a new data point is commonly referred to as “inference” in the machine learning community! This is different from the statistical definition of inference, which refers to drawing conclusions about variable relationships from data.
Even assuming the causal model is correct, accurately identifying the model's parameters (with low bias and variance) may still not be possible given the available data.
Finding and identifying a causal model from observational data alone (aka causal inference) is generally impossible without making strong assumptions.
Causal inference is an active area of research in statistics and machine learning.
The choice of model type depends on the research question and the nature of the data.
Models can be hybrid, combining elements of different types to address complex questions.
For example, many pharmacometric models encode causal mechanisms and include a causal treatment effect that is identified from randomized controlled trial data.
However, such models also commonly incorporate non-causal, associative covariate models to improve predictive performance.
Incorporating some domain knowledge and causal structure in an otherwise black-box model can improve the model’s generalization performance in unseen settings.
Hybrid mechanistic and empirical associative/black-box models are commonly known as semi-mechanistic models, semi-empirical models, or, more recently, scientific machine learning models.
Informal definition
Formal “estimation bias”
Bias can occur in various stages of the data science process, e.g. during data collection and preprocessing, model selection, model training/fitting, and/or model evaluation.
The following are common data collection biases that can lead to biased parameter estimates and predictions downstream.
Selection Bias: Systematic differences between the sample and the population.
Sampling Bias: Non-random sampling leading to a non-representative sample.
Survivorship Bias: Focusing on subjects that passed a selection process, ignoring those that did not.
Attrition Bias: Systematic differences due to participants dropping out of a study.
Exclusion Bias: Systematic exclusion of certain groups from the sample.
Volunteer Bias: Systematic differences between volunteers and non-volunteers.
Confounding Bias: Bias arising from a third variable that influences both the treatment and the outcome, leading to a spurious association.
Performance Bias: Systematic differences in care, treatment, or exposure provided to participants in different groups, other than the intervention being studied.
Information Bias: Systematic errors in data measurement or classification.
Ascertainment Bias: Systematic error arising from differences in how outcomes or exposures are detected, identified, or recorded between groups in a study, leading to unequal likelihoods of detecting the event or condition. This is sometimes considered a subtype of information bias.
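Confounding bias from the list above can be demonstrated with a small simulation (an illustrative sketch; the data-generating process is assumed, not from the text): a confounder Z drives both a "treatment" X and an outcome Y, so a naive regression of Y on X shows a strong association even though the true causal effect of X on Y is zero, while adjusting for Z recovers it.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Confounder Z influences both the "treatment" X and the outcome Y.
z = rng.normal(size=n)
x = z + rng.normal(size=n)        # X depends on Z only
y = 2.0 * z + rng.normal(size=n)  # Y depends on Z only; true effect of X on Y is 0

# Naive slope of Y on X (ignoring Z) is biased away from 0.
naive_slope = np.cov(x, y)[0, 1] / np.var(x)

# Adjusting for Z via multiple regression recovers the true effect of ~0.
X = np.column_stack([np.ones(n), x, z])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(round(naive_slope, 2))  # ≈ 1.0 (spurious association)
print(round(beta[1], 2))      # ≈ 0.0 (confounder-adjusted)
```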
The following are common modelling biases that can lead to biased parameter estimates and poor generalization, e.g. poor extrapolation or poor performance in a new environment.
Omitted Variable Bias: Bias from leaving out important variables, e.g. measured confounders that are correlated with both the treatment and outcome.
Collider Bias: Bias from conditioning on a variable that is influenced by both the treatment and outcome, creating a spurious association.
Post-Treatment Bias: Bias from adjusting for variables that are affected by the treatment, which can block part of the treatment effect, underestimating the effect size.
Model Misspecification Bias: Bias from using an incorrect functional form or distributional assumption for the model, e.g. assuming linearity when the true relationship is non-linear.
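Collider bias from the list above is easy to reproduce numerically (an illustrative sketch with an assumed data-generating process): X and Y are generated independently, but selecting on a collider C that both influence induces a spurious negative correlation between them.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# X and Y are independent; C is a collider influenced by both.
x = rng.normal(size=n)
y = rng.normal(size=n)
c = x + y + rng.normal(size=n)

# Marginal correlation between X and Y is ~0.
print(round(np.corrcoef(x, y)[0, 1], 2))

# Conditioning on the collider (here: selecting rows with C > 1)
# induces a spurious negative correlation between X and Y.
mask = c > 1.0
print(round(np.corrcoef(x[mask], y[mask])[0, 1], 2))
```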
When designing empirical models, we often have to make assumptions about the data and the underlying relationships between variables.
These assumptions introduce bias into the model, but by reducing variance they can also improve the model's performance and generalization ability. This is known as the bias-variance trade-off.
Inductive bias is the set of assumptions that a learning algorithm uses to predict outputs for inputs it has not encountered during training. It arises from the assumed empirical model structure and from the training algorithm, e.g. regularization.
Different empirical models make different assumptions about the data, e.g. spatial locality or sparsity, which can lead to different inductive biases.
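The bias-variance trade-off induced by regularization can be sketched as follows (an illustrative simulation; the sample size, penalty, and true slope are assumed): across many small samples, the ordinary least-squares slope is unbiased but noisy, while a ridge-penalized slope is shrunk toward zero (biased) but has lower variance.

```python
import numpy as np

rng = np.random.default_rng(7)
true_beta = 2.0
n, reps, lam = 20, 2000, 5.0

ols_estimates, ridge_estimates = [], []
for _ in range(reps):
    x = rng.normal(size=n)
    y = true_beta * x + rng.normal(scale=2.0, size=n)
    # OLS slope (unbiased) vs ridge slope (shrunk toward 0, hence biased)
    ols_estimates.append(x @ y / (x @ x))
    ridge_estimates.append(x @ y / (x @ x + lam))

ols = np.array(ols_estimates)
ridge = np.array(ridge_estimates)
print("OLS   bias:", round(ols.mean() - true_beta, 2),
      "variance:", round(ols.var(), 3))
print("ridge bias:", round(ridge.mean() - true_beta, 2),
      "variance:", round(ridge.var(), 3))
```

The ridge estimator trades a systematic underestimate of the slope for a smaller spread across repeated samples.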
Assuming the model is correct, parameter estimation bias is defined as:
\[ \text{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta \]
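A classic concrete instance of this definition (an illustrative simulation, not from the text): the maximum-likelihood variance estimator, which divides by \(n\), has bias \(-\sigma^2/n\), while the Bessel-corrected estimator dividing by \(n-1\) is unbiased.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0   # variance of N(0, 2^2)
n, reps = 5, 200_000

samples = rng.normal(scale=2.0, size=(reps, n))

# Maximum-likelihood estimator (divides by n): biased downward
mle = samples.var(axis=1, ddof=0)
# Bessel-corrected estimator (divides by n-1): unbiased
unbiased = samples.var(axis=1, ddof=1)

# Estimate Bias(theta_hat) = E[theta_hat] - theta by averaging over reps
print(round(mle.mean() - true_var, 2))       # ≈ -0.8, i.e. -sigma^2/n
print(round(unbiased.mean() - true_var, 2))  # ≈ 0.0
```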
The following are common estimation/fitting/analysis biases that can lead to biased parameter estimates and poor generalization.
Note
Consistency of Estimators: An estimator is consistent if it converges in probability to the true value of the parameter being estimated as the sample size increases. Formally, for an estimator \(\hat{\theta}_n\) of a parameter \(\theta\), consistency means: \[ \lim_{n \to \infty} P(|\hat{\theta}_n - \theta| > \epsilon) = 0 \quad \text{for all } \epsilon > 0. \] Consistency ensures that with enough data, the estimator will produce values arbitrarily close to the true parameter value.
Note
In practice, we only have finite data so consistency does not guarantee low bias or variance in finite samples.
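The consistency definition above can be checked empirically (an illustrative sketch; the distribution and \(\epsilon\) are assumed): for the sample mean of an exponential distribution, the estimated probability \(P(|\hat{\theta}_n - \theta| > \epsilon)\) shrinks toward 0 as \(n\) grows, even though it is clearly nonzero at small \(n\).

```python
import numpy as np

rng = np.random.default_rng(3)
true_mean, eps, reps = 1.0, 0.1, 10_000

# Estimate P(|mean_n - mu| > eps) for increasing n: it shrinks toward 0,
# which is exactly the convergence-in-probability statement above.
probs = []
for n in [10, 100, 1000]:
    means = rng.exponential(scale=true_mean, size=(reps, n)).mean(axis=1)
    p = np.mean(np.abs(means - true_mean) > eps)
    probs.append(p)
    print(n, round(p, 3))
```

Note that at each finite \(n\) the probability is still positive, matching the point above: consistency is an asymptotic guarantee, not a finite-sample one.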