Continuous Data: Numerical data that can take any value within a range.
Discrete Data: Numerical data that can only take specific, separate values.
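As a quick illustration of the distinction, here is a minimal NumPy sketch (the variable names and distributions are illustrative, not from the text): a continuous measurement can take any value in a range, while a discrete count can only take separate integer values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Continuous: body temperature in °C can take any value within a range
temperatures = rng.normal(loc=36.8, scale=0.4, size=5)

# Discrete: number of hospital visits can only take non-negative integers
visits = rng.poisson(lam=2.0, size=5)

print(temperatures.dtype)  # floating-point values
print(visits.dtype)        # integer counts
```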
Goal: Interpolation, extrapolation, data generation or representation learning based on historical data.
Can be discriminative, generative, conditional generative or representation learning models.
Warning
Using a trained model to make predictions for a new data point is commonly referred to as “inference” in the machine learning community! This is different from the statistical definition of inference, which refers to drawing conclusions about variable relationships from data.
Even assuming the causal model is correct, accurately identifying the model's parameters (with low bias and variance) may still not be possible given the available data.
Finding and identifying a causal model from observational data alone (aka causal inference) is generally impossible without making strong assumptions.
Causal inference is an active area of research in statistics and machine learning.
The choice of model type depends on the research question and the nature of the data.
Models can be hybrid, combining elements of different types to address complex questions.
For example, many pharmacometric models encode causal mechanisms and include a causal treatment effect that is identified from randomized controlled trial data.
However, such models also commonly incorporate non-causal, associative covariate models to improve predictive performance.
Incorporating some domain knowledge and causal structure in an otherwise black-box model can improve the model’s generalization performance in unseen settings.
Hybrid mechanistic and empirical associative/black-box models are commonly known as semi-mechanistic models, semi-empirical models, or, more recently, scientific machine learning models.
Informal definition
Formal “estimation bias”
Bias can occur in various stages of the data science process, e.g. during data collection and preprocessing, model selection, model training/fitting, and/or model evaluation.
The following are common data collection biases that can lead to biased parameter estimates and predictions downstream.
Selection Bias: Systematic differences between the sample and the population.
Sampling Bias: Non-random sampling leading to a non-representative sample.
Survivorship Bias: Focusing on subjects that passed a selection process, ignoring those that did not.
Attrition Bias: Systematic differences due to participants dropping out of a study.
Exclusion Bias: Systematic exclusion of certain groups from the sample.
Volunteer Bias: Systematic differences between volunteers and non-volunteers.
Confounding Bias: Bias arising from a third variable that influences both the treatment and the outcome, leading to a spurious association.
Performance Bias: Systematic differences in care, treatment, or exposure provided to participants in different groups, other than the intervention being studied.
Information Bias: Systematic errors in data measurement or classification.
Ascertainment Bias: Systematic error arising from differences in how outcomes or exposures are detected, identified, or recorded between groups in a study, leading to unequal likelihoods of detecting the event or condition. This is sometimes considered a subtype of information bias.
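Confounding bias from the list above can be demonstrated with a small simulation (an illustrative sketch; the data-generating process is assumed, not from the text): a confounder Z drives both a "treatment" X and an outcome Y, so a naive regression of Y on X shows a strong association even though the true causal effect of X on Y is zero, while adjusting for Z recovers it.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Confounder Z influences both the "treatment" X and the outcome Y.
z = rng.normal(size=n)
x = z + rng.normal(size=n)        # X depends on Z only
y = 2.0 * z + rng.normal(size=n)  # Y depends on Z only; true effect of X on Y is 0

# Naive slope of Y on X (ignoring Z) is biased away from 0.
naive_slope = np.cov(x, y)[0, 1] / np.var(x)

# Adjusting for Z via multiple regression recovers the true effect of ~0.
X = np.column_stack([np.ones(n), x, z])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(round(naive_slope, 2))  # ≈ 1.0 (spurious association)
print(round(beta[1], 2))      # ≈ 0.0 (confounder-adjusted)
```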
The following are common modelling biases that can lead to biased parameter estimates and poor generalization, e.g. poor extrapolation or poor performance in a new environment.
Omitted Variable Bias: Bias from leaving out important variables, e.g. measured confounders that are correlated with both the treatment and outcome.
Collider Bias: Bias from conditioning on a variable that is influenced by both the treatment and outcome, creating a spurious association.
Post-Treatment Bias: Bias from adjusting for variables that are affected by the treatment, which can block part of the treatment effect, underestimating the effect size.
Model Misspecification Bias: Bias from using an incorrect functional form or distributional assumption for the model, e.g. assuming linearity when the true relationship is non-linear.
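Collider bias from the list above is easy to reproduce numerically (an illustrative sketch with an assumed data-generating process): X and Y are generated independently, but selecting on a collider C that both influence induces a spurious negative correlation between them.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# X and Y are independent; C is a collider influenced by both.
x = rng.normal(size=n)
y = rng.normal(size=n)
c = x + y + rng.normal(size=n)

# Marginal correlation between X and Y is ~0.
print(round(np.corrcoef(x, y)[0, 1], 2))

# Conditioning on the collider (here: selecting rows with C > 1)
# induces a spurious negative correlation between X and Y.
mask = c > 1.0
print(round(np.corrcoef(x[mask], y[mask])[0, 1], 2))
```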
When designing empirical models, we often have to make assumptions about the data and the underlying relationships between variables.
These assumptions introduce bias into the model, but by reducing variance they can also improve the model's performance and generalization ability. This is known as the bias-variance trade-off.
Inductive bias is the set of assumptions that a learning algorithm uses to predict outputs for inputs it has not encountered during training. It arises from the assumed empirical model structure and from the training algorithm, e.g. regularization.
Different empirical models make different assumptions about the data, e.g. spatial locality or sparsity, which can lead to different inductive biases.
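The bias-variance trade-off induced by regularization can be sketched as follows (an illustrative simulation; the sample size, penalty, and true slope are assumed): across many small samples, the ordinary least-squares slope is unbiased but noisy, while a ridge-penalized slope is shrunk toward zero (biased) but has lower variance.

```python
import numpy as np

rng = np.random.default_rng(7)
true_beta = 2.0
n, reps, lam = 20, 2000, 5.0

ols_estimates, ridge_estimates = [], []
for _ in range(reps):
    x = rng.normal(size=n)
    y = true_beta * x + rng.normal(scale=2.0, size=n)
    # OLS slope (unbiased) vs ridge slope (shrunk toward 0, hence biased)
    ols_estimates.append(x @ y / (x @ x))
    ridge_estimates.append(x @ y / (x @ x + lam))

ols = np.array(ols_estimates)
ridge = np.array(ridge_estimates)
print("OLS   bias:", round(ols.mean() - true_beta, 2),
      "variance:", round(ols.var(), 3))
print("ridge bias:", round(ridge.mean() - true_beta, 2),
      "variance:", round(ridge.var(), 3))
```

The ridge estimator trades a systematic underestimate of the slope for a smaller spread across repeated samples.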
Assuming the model is correct, parameter estimation bias is defined as:
\[ \text{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta \]
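A classic concrete instance of this definition (an illustrative simulation, not from the text): the maximum-likelihood variance estimator, which divides by \(n\), has bias \(-\sigma^2/n\), while the Bessel-corrected estimator dividing by \(n-1\) is unbiased.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0   # variance of N(0, 2^2)
n, reps = 5, 200_000

samples = rng.normal(scale=2.0, size=(reps, n))

# Maximum-likelihood estimator (divides by n): biased downward
mle = samples.var(axis=1, ddof=0)
# Bessel-corrected estimator (divides by n-1): unbiased
unbiased = samples.var(axis=1, ddof=1)

# Estimate Bias(theta_hat) = E[theta_hat] - theta by averaging over reps
print(round(mle.mean() - true_var, 2))       # ≈ -0.8, i.e. -sigma^2/n
print(round(unbiased.mean() - true_var, 2))  # ≈ 0.0
```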
The following are common estimation/fitting/analysis biases that can lead to biased parameter estimates and poor generalization.
Note
Consistency of Estimators: An estimator is consistent if it converges in probability to the true value of the parameter being estimated as the sample size increases. Formally, for an estimator \(\hat{\theta}_n\) of a parameter \(\theta\), consistency means: \[ \lim_{n \to \infty} P(|\hat{\theta}_n - \theta| > \epsilon) = 0 \quad \text{for all } \epsilon > 0. \] Consistency ensures that with enough data, the estimator will produce values arbitrarily close to the true parameter value.
Note
In practice, we only have finite data so consistency does not guarantee low bias or variance in finite samples.
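The consistency definition above can be checked empirically (an illustrative sketch; the distribution and \(\epsilon\) are assumed): for the sample mean of an exponential distribution, the estimated probability \(P(|\hat{\theta}_n - \theta| > \epsilon)\) shrinks toward 0 as \(n\) grows, even though it is clearly nonzero at small \(n\).

```python
import numpy as np

rng = np.random.default_rng(3)
true_mean, eps, reps = 1.0, 0.1, 10_000

# Estimate P(|mean_n - mu| > eps) for increasing n: it shrinks toward 0,
# which is exactly the convergence-in-probability statement above.
probs = []
for n in [10, 100, 1000]:
    means = rng.exponential(scale=true_mean, size=(reps, n)).mean(axis=1)
    p = np.mean(np.abs(means - true_mean) > eps)
    probs.append(p)
    print(n, round(p, 3))
```

Note that at each finite \(n\) the probability is still positive, matching the point above: consistency is an asymptotic guarantee, not a finite-sample one.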