A Comprehensive Machine Learning Glossary for Product Managers
Definitions lean toward how terms show up in product decisions, not just textbooks. Where a concept has a common misconception or a product implication worth flagging, it's included.
A
Accuracy Share of correct predictions: (TP + TN) / (TP + TN + FP + FN). Sounds like the obvious metric to optimize, but it's misleading when one class dominates. If 99% of transactions are legitimate, a model that flags nothing as fraud achieves 99% accuracy and catches zero fraud. Useful only when classes are roughly balanced and false positives and false negatives have similar costs.
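The imbalance pitfall is easy to see in a few lines of Python. The transaction counts below are made up for illustration:

```python
# Illustrative data: 1,000 transactions, only 10 of them fraud (label 1).
y_true = [1] * 10 + [0] * 990

# A degenerate "model" that never flags anything as fraud.
y_pred = [0] * 1000

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
frauds_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print(accuracy)       # 0.99 — looks great on the dashboard
print(frauds_caught)  # 0 — and catches nothing
```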
Activation Function A nonlinear function applied to each neuron's output in a neural network. Without activations, stacking multiple layers still produces a linear function overall: you'd gain computation but not representational power. Activations like ReLU, sigmoid, and tanh introduce nonlinearity, which is what allows deep networks to learn complex patterns. Think of them as the mechanism that lets the network "bend" to fit curved relationships.
Algorithm A general modeling approach or template that maps inputs to outputs. Examples: linear regression, decision trees, random forests, gradient boosting, neural networks. Choosing an algorithm is a trade-off between performance, interpretability, and operational cost. A more complex algorithm isn't always better: it's often slower, harder to explain, and more expensive to run.
Artificial Intelligence (AI) Umbrella term for systems that perform tasks associated with human intelligence: perception, language understanding, reasoning, decision-making. Machine learning is one subfield of AI. Deep learning is a subfield of machine learning. In practice, most AI in products today is machine learning, and most ML in products is supervised learning.
Attention A mechanism inside transformer models that allows the model to weigh how relevant each part of an input is to each other part. In a sentence like "the cat sat on the mat because it was tired," attention helps the model figure out that "it" refers to "cat" and not "mat." This ability to handle long-range relationships is what makes transformers so effective for language tasks. Also increasingly used in vision and multimodal systems.
AUROC (Area Under the ROC Curve) A single number summarizing how well a model ranks positives above negatives across all possible thresholds. 1.0 means perfect ranking, 0.5 means random guessing. Useful for comparing model candidates. Be cautious under extreme class imbalance: the metric can look strong even when the model performs poorly on the minority class that usually matters most.
B
Backpropagation The algorithm that trains neural networks. After a forward pass produces a prediction, the error is propagated backward through the network layer by layer, calculating how much each weight contributed to the mistake. Those gradients are then used to update the weights in a direction that reduces error. This process repeats across thousands or millions of training examples until the model converges.
Bag of Words A simple text representation that turns a document into a count of how many times each word appears, ignoring order and context. "The dog bit the man" and "the man bit the dog" are identical under bag of words. It's fast and sometimes useful, but it misses meaning that depends on sequence. Most modern NLP replaces it with embeddings or transformer-based representations.
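A minimal sketch with Python's `collections.Counter` shows exactly how word order is lost:

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase and split on whitespace; order and context are discarded.
    return Counter(text.lower().split())

a = bag_of_words("The dog bit the man")
b = bag_of_words("The man bit the dog")

print(a == b)  # True — identical counts, opposite meanings
```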
Bagging (Bootstrap Aggregating) An ensemble technique where many models are trained on different random samples of the training data (sampled with replacement). Because each model sees a slightly different dataset, their errors are less correlated. Averaging across them reduces variance and produces a more stable result than any single model. Random forests are built on this principle.
Baseline A simple reference model or heuristic used to judge whether ML adds value. Could be "always predict the most common class," a simple rule ("flag any transaction over $10,000"), or a linear regression. A strong baseline is essential: if your complex model barely beats a baseline, the added cost and risk may not be worth it. Without a baseline, you can't quantify what ML is actually contributing.
Batch Gradient Descent An optimization approach that computes gradients using the entire training dataset before taking a single update step. This produces stable, accurate gradient estimates but is computationally impractical for large datasets because you have to process all the data before making any progress. Mostly replaced in practice by mini-batch gradient descent.
Bias This word has two distinct meanings that are easy to conflate.
In the bias-variance trade-off: bias is the error that comes from a model being too simple to capture the real patterns in data. A linear model applied to a curved relationship has high bias because it's consistently wrong in the same direction.
In fairness and data: bias refers to systematic skew in data collection or model outputs that can disadvantage certain groups. For example, a hiring model trained on historical decisions may learn to replicate past discrimination. Always specify which meaning you intend.
Bias Term (Intercept) The constant term in a linear model, usually written as w0 or b. It sets the model's baseline prediction when all input features are zero. Without it, the model is forced to pass through the origin, which is often wrong.
C
Calibration Whether a model's predicted probabilities match actual frequencies. When a well-calibrated model outputs 0.8, the predicted event should actually occur about 80% of the time. Calibration matters for any product that uses probability scores to make decisions or communicate risk to users. A model can have strong ranking performance (high AUROC) but be poorly calibrated, meaning the scores don't translate directly into reliable probabilities.
Categorical Data Variables that represent discrete groups rather than numeric values: country, device type, plan tier, product category. Most ML algorithms require numeric inputs, so categorical variables need to be encoded before use. Common approaches include one-hot encoding (a separate binary column per category), ordinal encoding (assigning ordered numbers), and learned embeddings.
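One-hot encoding, the most common approach, can be sketched in a few lines. The plan tiers here are hypothetical:

```python
def one_hot(value, categories):
    # One binary column per known category; unseen values become all zeros.
    return [1 if value == c else 0 for c in categories]

plans = ["free", "pro", "enterprise"]  # hypothetical plan tiers
print(one_hot("pro", plans))      # [0, 1, 0]
print(one_hot("unknown", plans))  # [0, 0, 0]
```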
Classification A supervised learning task where the output is a category rather than a number. Binary classification has two possible outputs (fraud vs. legitimate, churn vs. retain). Multi-class classification has more than two (content category, severity level, intent class). The model typically produces a probability score per class, and a threshold or argmax converts that into a final label.
Class Imbalance When one class is much more common than another in training data. A dataset where 1% of transactions are fraudulent has severe class imbalance. It makes accuracy misleading and requires careful metric selection (precision/recall instead of accuracy), and sometimes data techniques like resampling or cost-sensitive training. Most real-world classification problems have some imbalance.
Clustering An unsupervised technique that groups data points by similarity without using labels. The goal is not prediction but organization: finding natural groupings that help humans and systems understand structure in data. Used for customer segmentation, topic discovery, anomaly detection, and exploratory analysis. The results are only as meaningful as your choice of which features define "similar."
Coefficient In a linear model, the weight assigned to a feature that indicates how much the prediction changes when that feature increases by one unit, holding all other features constant. A coefficient of 0.5 on "years of tenure" means the model predicts 0.5 units higher for each additional year. Coefficients are useful for understanding feature influence, but should be interpreted carefully when features are on different scales or correlated.
Coefficient of Determination (R²) For regression, measures the proportion of variance in the target that the model explains compared to a simple baseline of predicting the mean. A value of 0.85 means the model explains 85% of the variance. Useful as a high-level signal but doesn't tell you how costly errors are in practice. A model can have a high R² and still produce economically damaging errors in the tail.
Confusion Matrix A table that breaks classification outcomes into four buckets: true positives (correctly predicted positive), true negatives (correctly predicted negative), false positives (predicted positive, actually negative), and false negatives (predicted negative, actually positive). The foundation for metrics like precision, recall, and false positive rate. Looking at the raw confusion matrix is often more informative than any single metric, because it shows where errors are concentrated.
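The four buckets can be computed directly; the labels and predictions below are illustrative:

```python
def confusion_counts(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

# Illustrative labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)        # 2 4 1 1

precision = tp / (tp + fp)   # 2/3: of everything flagged, how much was real
recall = tp / (tp + fn)      # 2/3: of everything real, how much was caught
```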
Continuous Data Numeric values that can take infinitely many values within a range: time elapsed, revenue, temperature, distance. Contrasted with discrete data, which comes in countable steps. The distinction matters for modeling choices and evaluation metrics.
Convolutional Neural Network (CNN) A neural network architecture designed for images and other spatially structured data. Instead of connecting every input pixel to every neuron, CNNs use small filters that slide across the image (convolutions), allowing the network to detect patterns like edges, textures, and shapes regardless of where they appear. Pooling layers reduce spatial dimensions to keep computation manageable. CNNs learn hierarchical visual features: early layers detect edges, later layers detect objects.
Cost Function (Loss Function) A numeric measure of how wrong the model's predictions are during training. The training process tries to minimize this number. Choosing a loss function is effectively choosing what the model is optimized for, and it should reflect actual business costs. A model trained to minimize mean squared error will behave differently than one trained to minimize mean absolute error, even on the same data.
CRISP-DM A framework for structuring ML projects across six phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment. Useful as shared language across product, data science, and engineering to avoid the common failure mode where modeling starts before the business question is clear.
Cross-Validation An evaluation technique that trains and validates across multiple different splits of the data to get a more stable estimate of real-world performance. For example, 5-fold cross-validation splits data into five parts, trains five models each leaving out a different fold, and averages the results. Important note: time series data requires time-aware validation where training always uses past data and validation uses future data, not random splits.
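The mechanics of splitting data into folds can be sketched in plain Python (in practice, libraries such as scikit-learn provide this, with shuffling and time-aware variants):

```python
def k_fold_indices(n, k):
    # Each example lands in the validation fold exactly once.
    # Not appropriate for time series without time-aware modification.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, val
        start += size

folds = list(k_fold_indices(10, 5))
print(len(folds))   # 5
print(folds[0][1])  # [0, 1] — the first validation fold
```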
D
Data Observed information used for modeling. Includes structured tables (databases, spreadsheets) and unstructured content (text, images, audio, video). In product contexts, "data" is not just a static asset: it also includes the pipelines that produce it, its freshness, who has access to it, how it was collected, and what biases it carries.
Data Drift Change in the statistical distribution of input data over time. Example: a new product flow changes the features your model sees, or user behavior shifts after a marketing campaign. Drift can degrade model performance silently while dashboards still look fine. Monitoring for drift is one of the most important operational concerns after a model ships.
Data Leakage When training uses information that wouldn't be available at prediction time, or when test data indirectly influences model development. Leakage produces inflated offline metrics that collapse in production. Common examples: including future events as features, normalizing data before splitting train and test, or using a feature that's a proxy for the label. If your offline performance looks suspiciously good, check for leakage first.
Decision Tree A model that makes predictions through a sequence of if-then questions. Each question splits the data into subgroups, and the process continues until a final prediction is produced. Easy to visualize and explain: you can trace exactly which conditions led to a prediction. Handles nonlinear relationships and feature interactions naturally. The main risk is overfitting if the tree is allowed to grow too deep.
Deep Learning A subfield of ML that uses neural networks with many layers. The depth allows the network to learn increasingly abstract representations of data: early layers detect low-level patterns (edges in an image, common word sequences in text), later layers detect higher-level concepts (faces, sentiment). Particularly strong on unstructured data (images, text, audio, video) because it can learn features directly from raw inputs instead of requiring manual feature engineering. Usually requires substantial data and compute.
Deployment Moving a trained model into production use. Deployment is not a one-time event. It includes setting up inference infrastructure, monitoring for drift and degradation, building feedback loops for new labels, planning for retraining, and defining who owns the model long-term. Many ML projects succeed in development and fail in deployment because these operational concerns weren't planned for.
Discrete Data Countable numeric values that come in steps rather than varying continuously: number of purchases, number of support tickets, age in years. Distinct from continuous data in how it's modeled and interpreted.
E
Embedding A dense vector representation learned from data that captures similarity. Words with similar meaning end up with similar vectors. Users with similar behavior end up with similar vectors. Items that are frequently purchased together end up close in embedding space. Embeddings are used as inputs to models, for search and recommendation, and for clustering. The key property is that distance in embedding space carries semantic meaning.
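Cosine similarity is the usual way to measure closeness in embedding space. A toy sketch with made-up 3-dimensional vectors (real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    # 1.0 = pointing the same direction; near 0 = unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up vectors: "cat" and "kitten" should land near each other.
cat = [0.9, 0.1, 0.0]
kitten = [0.85, 0.15, 0.05]
invoice = [0.0, 0.2, 0.95]

print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
```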
Ensemble Model A model that combines predictions from multiple models to improve performance and robustness. The intuition is that averaging reduces the impact of any one model's mistakes. Examples include random forests (averaging many decision trees), gradient boosting (trees trained sequentially), and stacking (using a model to combine other models). Usually improves accuracy but reduces interpretability: with hundreds of trees, it's hard to explain any single prediction.
Evaluation The process of measuring model performance in a way that reflects real-world constraints and business value. Evaluation isn't just running metrics: it includes error analysis (what does the model get wrong and why), segment analysis (who is it worse for), robustness checks, and online experimentation when the stakes are high. A model can have strong offline metrics and still fail to create product value.
F
False Negative (FN) The model predicts negative, but the true label is positive. A missed fraud case. A disease that wasn't flagged. A churning user that wasn't identified in time to intervene. How costly a false negative is depends entirely on the domain and the downstream action.
False Positive (FP) The model predicts positive, but the true label is negative. A legitimate transaction flagged as fraud. A healthy patient sent for unnecessary follow-up. An email incorrectly marked as spam. False positives erode trust and can create real harm depending on what action is triggered.
False Positive Rate (FPR) Among all actual negatives, the fraction incorrectly predicted as positive: FP / (FP + TN). Used in ROC curve analysis. A high false positive rate means the model is generating many unnecessary alerts or interventions.
Feature An input variable fed into the model. Can be raw (age, country, device type) or engineered (rolling 7-day spend, days since last login, ratio of completed sessions). Features must be available at prediction time and consistent between training and production. Feature quality is often the single largest driver of model performance, and also the most common source of mistakes.
Feature Engineering Creating or transforming features to make patterns learnable. Examples: applying a log transform to revenue to compress extreme values, creating an interaction term between two features, computing a rolling average over a time window. Good feature engineering encodes domain knowledge that the algorithm might not learn on its own. Often more valuable than switching to a more complex algorithm.
Feature Selection Deciding which features to include in the model. Too few features and the model may underfit. Too many irrelevant features and the model may overfit, train slowly, and become harder to maintain. Feature selection can be manual (domain expertise), statistical (correlation), or automatic (regularization, importance scores). Keeping the feature set clean and defensible also reduces leakage risk.
Fine-Tuning Continuing training of a pretrained model on a new, task-specific dataset. The model starts from learned representations (from a large general dataset) and adapts them to your specific problem. Usually requires less data and compute than training from scratch. The standard approach for NLP and vision tasks where large pretrained models are available.
G
Generalization The ability of a model to perform well on new, unseen data, not just the data it was trained on. Generalization is the actual goal of ML: a model that memorizes training examples but fails on real users isn't useful. Evaluated using held-out test sets and, ultimately, production monitoring.
Gradient Descent An optimization algorithm that iteratively updates model parameters in the direction that reduces the loss function. The gradient tells you the slope of the loss with respect to each parameter. Moving in the opposite direction (descending) reduces the loss. Variants include stochastic gradient descent (one example at a time), mini-batch (small batches), and adaptive methods like Adam that adjust the learning rate automatically.
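The core loop is short. A minimal sketch minimizing a one-parameter toy loss, (w − 3)², whose gradient is 2(w − 3):

```python
# Toy loss with one parameter: loss(w) = (w - 3) ** 2, minimized at w = 3.
w = 0.0
learning_rate = 0.1

for _ in range(100):
    gradient = 2 * (w - 3)
    w -= learning_rate * gradient  # step opposite the gradient to reduce loss

print(round(w, 4))  # 3.0 — converged to the minimum
```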
Ground Truth The correct label or outcome used for training and evaluation. In product settings, ground truth is often messier than it sounds: it can be noisy (human labelers disagree), delayed (you don't know if a user churned until 30 days later), or ambiguous (what counts as a "conversion"?). Label quality sets an upper bound on model quality.
H
Hidden Layer A layer inside a neural network that sits between the input and output layers. Hidden layers learn intermediate representations: features that aren't directly in the input but that the network finds useful for making predictions. The more hidden layers, the more abstract the representations the network can build.
Hyperparameter A configuration value set before training that controls model behavior and complexity. Examples: learning rate, number of trees, tree depth, regularization strength, number of layers, batch size. Unlike model parameters (learned from data), hyperparameters are set by the practitioner and tuned using validation performance. Getting hyperparameters wrong can make a good algorithm look bad.
I
Information Gain A common criterion decision trees use to choose splits (Gini impurity is a popular alternative). It measures how much a given split reduces impurity (how mixed the labels are) at a node. A split that perfectly separates two classes has high information gain. A split that produces groups just as mixed as before has zero. The algorithm tries all possible splits at each node and picks the one with the highest information gain.
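A sketch of the entropy-based calculation on a six-example toy dataset:

```python
import math

def entropy(labels):
    # How mixed the labels are: 0.0 = pure, 1.0 = a 50/50 binary split.
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(parent, left, right):
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = [1, 1, 1, 0, 0, 0]
perfect = information_gain(parent, [1, 1, 1], [0, 0, 0])
useless = information_gain(parent, [1, 0], [1, 1, 0, 0])

print(perfect)  # 1.0 — the split removes all uncertainty
print(useless)  # ~0.0 — each side is as mixed as the parent
```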
Inference Running a trained model on new data to generate predictions. Distinct from training, which is the process of learning the model. Inference cost, latency, and throughput are often the real operational constraints in production: a model that takes 500ms to return a prediction may be unusable in a real-time user flow.
Interpretability How easily a human can understand why a model made a specific prediction. Linear models and shallow decision trees are highly interpretable: you can inspect weights and trace decisions. Large ensembles and deep networks are much less so. Interpretability matters for trust, debugging, compliance, and any product where you need to explain or defend individual decisions to users or regulators.
K
K-Fold Cross-Validation A cross-validation method that divides the data into K equal parts (folds), trains K separate models each time leaving out a different fold as the validation set, and averages performance across folds. More reliable than a single train-test split because the estimate is less sensitive to which examples happened to end up in which split. Not appropriate for time series data without modification.
K-Means Clustering A clustering algorithm that groups data by iteratively assigning points to the nearest of K cluster centers and moving those centers to the mean of their assigned points. Fast and scalable. Works well when clusters are roughly spherical and similar in size. Sensitive to the initial placement of centers and to feature scaling (since it uses distance). Requires you to choose K in advance, which is itself a product decision about how many distinct groups are meaningful.
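The assign-then-update loop can be sketched in plain Python on made-up 2-D points (real use would rely on a library and scale features first; results depend on initialization):

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers

# Made-up 2-D points forming two well-separated groups.
points = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers = kmeans(points, k=2)
print(sorted(centers))  # one center near (0.1, 0.1), one near (5.0, 5.0)
```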
L
Label The target value used in supervised learning: the "right answer" the model is trained to predict. Churned or not. Fraud or legitimate. A revenue amount. Label quality and consistency usually matter more than algorithm choice. Noisy, inconsistent, or delayed labels can make any algorithm look bad.
LASSO (L1 Regularization) A regularization technique that adds a penalty proportional to the absolute value of each coefficient: λ * Σ |w|. Unlike Ridge, LASSO can push some coefficients exactly to zero, effectively removing those features from the model. This makes it useful both for preventing overfitting and for automatic feature selection when you want a simpler, more interpretable model.
Learning Rate The step size used in gradient descent when updating model parameters. Too high and updates overshoot, causing training to oscillate or diverge. Too low and training converges very slowly or gets stuck. One of the most important hyperparameters to tune. Modern optimizers like Adam adapt the learning rate automatically during training.
Linear Regression A model that predicts a numeric value as a weighted sum of input features: y = w0 + w1x1 + w2x2 + ... Fast to train, interpretable, and often a surprisingly strong baseline. The right starting point for most regression problems before escalating to more complex algorithms.
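For a single feature, the best-fit line has a closed-form solution. A sketch with noise-free toy data generated from y = 2 + 3x, so the fit recovers the weights exactly:

```python
def fit_line(xs, ys):
    # Ordinary least squares for one feature: y = w0 + w1 * x.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    w1 = num / den
    w0 = mean_y - w1 * mean_x
    return w0, w1

# Noise-free toy data from y = 2 + 3x.
xs = [0, 1, 2, 3, 4]
ys = [2, 5, 8, 11, 14]
w0, w1 = fit_line(xs, ys)
print(w0, w1)  # 2.0 3.0
```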
Logistic Regression A classification model that applies a sigmoid function to a linear score to produce a probability between 0 and 1. Despite the name, it's a classifier, not a regression model. Strong baseline for binary classification and risk scoring. Interpretable, fast, and well-understood. Often outperforms complex models when data is limited or when the relationship between features and the outcome is roughly linear.
Loss Function See Cost Function. For PMs: the loss function is how the model "learns what matters." If your loss doesn't align with business costs, you can optimize the model and hurt the product. For example, a model minimizing average error might perform well on most cases and catastrophically on rare but high-stakes ones.
M
MAE (Mean Absolute Error) Average of the absolute differences between predictions and true values: mean(|ŷ - y|). Treats all errors proportionally, making it robust to outliers. Easy to interpret in the original units of the target. Preferred when you want consistent closeness and don't want a few extreme cases to dominate the metric.
MAPE (Mean Absolute Percent Error) Average of absolute percentage errors: mean(|(ŷ - y) / y|). Useful for communicating error in percentage terms to stakeholders. Breaks down when the true value is near zero, where small absolute errors produce enormous percentage errors. Use cautiously and only when the target is always meaningfully above zero.
MSE (Mean Squared Error) Average of squared errors: mean((ŷ - y)²). Squaring penalizes large errors more than small ones. Useful when large misses are disproportionately costly. Less intuitive than MAE because it's in squared units. Often used alongside RMSE (its square root) to return to original units.
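Computing all three metrics on the same illustrative predictions shows how differently they treat one large miss:

```python
# Illustrative forecasts: three small misses and one large one.
y_true = [100, 200, 300, 400]
y_pred = [110, 190, 310, 480]

errors = [p - t for p, t in zip(y_pred, y_true)]   # [10, -10, 10, 80]

mae = sum(abs(e) for e in errors) / len(errors)
mse = sum(e ** 2 for e in errors) / len(errors)
rmse = mse ** 0.5
mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / len(errors)

print(mae)             # 27.5 — every miss counted in proportion to its size
print(round(rmse, 1))  # 40.9 — the single 80-unit miss dominates
print(round(mape, 4))  # 0.0958 — about a 9.6% average percent error
```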
Mini-Batch Gradient Descent The standard approach to training neural networks and many other large models. Updates model weights using a small batch of training examples (typically 32, 64, or 128) per step rather than one example or the full dataset. Efficient on modern hardware, provides stable enough gradient estimates for reliable training, and allows training to proceed continuously as data arrives.
Model A learned function that maps inputs to outputs. In product contexts, "the model" rarely refers to the mathematical function alone. It includes the data pipeline that produces features, the threshold that converts probabilities to decisions, the monitoring that tracks performance, and the retraining process that keeps it current. A model without these components is a demo, not a product.
N
Neural Network A model composed of layers of connected neurons, each applying a weighted sum followed by a nonlinear activation. Neural networks can approximate extremely complex functions given enough data and compute. The key advantage over simpler models is representation learning: instead of requiring hand-engineered features, the network learns useful intermediate representations directly from raw inputs.
Non-Parametric Algorithm A model that doesn't assume a fixed functional form and can grow in complexity with data. Decision trees and k-nearest neighbors are examples. More flexible than parametric models, but can be more expensive and harder to interpret. The trade-off: they can fit more complex patterns but also have more ways to overfit.
O
Offline Evaluation Testing model performance on historical data before deployment. Necessary but not sufficient. Offline metrics measure how well the model fits historical patterns, not whether it creates value in real user interactions. Offline gains don't always translate to product impact, and sometimes a model with slightly lower offline metrics is actually better in production because it generalizes differently.
Online Evaluation Testing model impact in real user environments, typically via A/B testing. Stronger evidence for business outcomes because it measures actual user behavior, not historical patterns. More complex and risk-sensitive than offline evaluation because real users are affected.
Outcomes The business results you care about: cost saved, revenue generated, risk reduced, time saved. Outcomes are expressed in business language and are what justify building ML in the first place. Choosing model metrics that don't connect to outcomes is one of the most common ways ML projects produce technically impressive results that don't matter.
Outputs What the model actually produces: a prediction, a probability score, a ranked list. Outputs are evaluated using technical metrics. The discipline of evaluation is choosing output metrics that reliably predict whether your desired outcomes are being achieved.
Overfitting When a model fits the noise in training data rather than the underlying signal, performing well on training data but poorly on new data. The model has essentially memorized quirks of the training set rather than learning generalizable patterns. Mitigated with more data, regularization, simpler models, early stopping, and rigorous validation.
P
Parametric Algorithm A model with a fixed structure and a fixed number of parameters, regardless of how much data you have. Linear and logistic regression are examples. The structure is chosen in advance, and training learns the best weights within that structure. More interpretable and computationally efficient than non-parametric models, but less flexible.
Perceptron The earliest artificial neuron model: a weighted sum of inputs passed through a hard threshold (outputs either 0 or 1). The conceptual ancestor of modern neural networks. Important historically and useful for building intuition about how neurons work, but too limited for real tasks.
Pooling Layer A layer in a CNN that reduces spatial dimensions by summarizing small regions of the previous layer. Max pooling keeps the strongest activation in each region; average pooling takes the mean. Reduces the number of parameters and makes the network somewhat invariant to small shifts in the input.
Precision Among all cases the model predicts as positive, the fraction that are actually positive: TP / (TP + FP). High precision means fewer false alarms. Important when acting on a false positive is costly: sending a discount to a user who wasn't going to churn, blocking a legitimate transaction, or generating an alert that wastes analyst time.
Probability Threshold The cutoff that converts a model's probability output into a binary decision. A model predicting fraud might output 0.73, and you decide to flag any prediction above 0.5. But 0.5 is often the wrong threshold. The right threshold depends on the relative costs of false positives and false negatives, and on operational capacity. Threshold selection is a product decision, not a technical default.
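Sweeping the threshold over hypothetical scores makes the trade concrete: lower thresholds catch more positives (higher recall) at the cost of more false alarms:

```python
def precision_recall_at(threshold, scores, labels):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and t == 1 for p, t in zip(preds, labels))
    fp = sum(p == 1 and t == 0 for p, t in zip(preds, labels))
    fn = sum(p == 0 and t == 1 for p, t in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical model scores and true labels (1 = positive class).
scores = [0.95, 0.80, 0.70, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    0,    1,    0,    0]

for threshold in (0.3, 0.5, 0.7):
    p, r = precision_recall_at(threshold, scores, labels)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

Picking among those operating points is exactly the product decision the entry describes.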
R
Random Forest An ensemble of decision trees, each trained on a bootstrapped sample of the data and using a random subset of features at each split. Averaging across many trees reduces variance and produces more stable predictions than any single tree. Often a strong default for tabular data: handles nonlinear relationships, doesn't require heavy feature engineering, and is robust to outliers. The trade-off is interpretability: with hundreds of trees, you can't explain individual predictions the way you can with a single tree.
Recall (Sensitivity, True Positive Rate) Among all cases that are actually positive, the fraction the model correctly identifies: TP / (TP + FN). High recall means fewer missed cases. Important when false negatives are costly: missing a fraud case, missing a disease, failing to catch a safety violation. Precision and recall trade off against each other: you usually improve one by worsening the other.
Regression A supervised learning task where the output is a continuous numeric value. Predict revenue, estimate delivery time, forecast demand. Distinct from classification, which predicts categories. The distinction matters for choosing algorithms, loss functions, and evaluation metrics.
Regularization Techniques that add a penalty to discourage overly complex models, improving generalization. L1 (LASSO) and L2 (Ridge) add penalties on the size of model weights. Dropout randomly deactivates neurons during training. Early stopping halts training before the model starts memorizing noise. Regularization is the primary tool for controlling the bias-variance trade-off.
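The L1 and L2 penalty terms themselves are simple. A sketch with made-up weights and penalty strength λ = 0.1:

```python
# Hypothetical model weights and penalty strength (lambda).
weights = [2.0, -0.5, 0.0, 3.0]
lam = 0.1

l1_penalty = lam * sum(abs(w) for w in weights)   # LASSO term
l2_penalty = lam * sum(w ** 2 for w in weights)   # Ridge term

# Training minimizes data_loss + penalty, so large weights become
# expensive and the model is pushed toward simpler solutions.
print(round(l1_penalty, 4))  # 0.55
print(round(l2_penalty, 4))  # 1.325
```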
Reinforcement Learning A learning paradigm where an agent learns by taking actions in an environment and receiving rewards or penalties based on outcomes. Unlike supervised learning (which learns from labeled examples), reinforcement learning learns from interaction. Used for game-playing, recommendation optimization, and adaptive systems. Powerful but harder to deploy responsibly because the system changes its own behavior over time, which can create unintended feedback loops.
ReLU (Rectified Linear Unit) An activation function defined as max(0, x): returns 0 for negative inputs and the input itself for positive inputs. Simple, computationally cheap, and empirically effective. One of the most widely used activations in deep networks because it trains faster and is less prone to vanishing gradients than sigmoid or tanh.
Ridge Regression (L2 Regularization) A regularization technique that adds a penalty proportional to the squared value of each coefficient: λ * Σ w². Shrinks coefficients toward zero but rarely to exactly zero. Particularly effective when features are correlated (collinearity), because it distributes weight across correlated features rather than arbitrarily assigning all weight to one.
ROC Curve A plot of true positive rate (recall) against false positive rate across all possible thresholds. Each point on the curve represents a different operating point for the model. The area under the curve (AUROC) summarizes overall ranking quality. Useful for comparing models and understanding the trade-off between catching positives and raising false alarms, but can be misleading under severe class imbalance.
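Each point on the curve comes from sweeping a threshold over the model's scores. A minimal sketch, using made-up scores and labels:

```python
# One ROC operating point per threshold, computed from model scores.
# Scores and labels below are made-up illustration values.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   0,   1,   0,   0]  # 1 = actual positive

def roc_point(threshold):
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    pos = sum(labels)
    neg = len(labels) - pos
    return fp / neg, tp / pos  # (false positive rate, true positive rate)

curve = [roc_point(t) for t in (0.95, 0.75, 0.5, 0.25, 0.05)]
```

Raising the threshold moves you toward the bottom-left of the curve (fewer false alarms, more misses); lowering it moves you toward the top-right.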
S
Sigmoid Function A function that maps any real number to a value between 0 and 1: σ(z) = 1 / (1 + e^(-z)). Used in logistic regression to convert a linear score into a probability. Also used as an activation function in neural networks, though largely replaced in hidden layers by ReLU for practical reasons.
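The formula, written out directly:

```python
import math

def sigmoid(z):
    """Map any real score to (0, 1): sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))
```

A score of 0 maps to exactly 0.5; large positive scores approach 1 and large negative scores approach 0.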
Softmax A function that takes a vector of scores (one per class) and converts them into a probability distribution that sums to 1. The class with the highest score gets the highest probability. Used in multi-class classification at the output layer of a neural network.
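A minimal implementation (subtracting the maximum score first is a standard numerical-stability trick and doesn't change the result):

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1.

    Subtracting the max score first avoids overflow for large scores
    without changing the output.
    """
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```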
Spatial Relationships Data structure based on proximity in space: adjacent pixels in an image, nearby coordinates on a map, neighboring cells in a grid. CNN architectures are specifically designed to exploit spatial relationships by using local filters that share weights across locations.
Structured Data Tabular data with well-defined fields and consistent schema: database tables, spreadsheets, CSV exports. Easier to query, join, and feed directly into models than unstructured data. Most traditional ML (linear models, trees, gradient boosting) works on structured data.
Supervised Learning Learning from labeled examples where the "right answer" is known for each training case. The model learns a mapping from inputs to outputs that generalizes to new cases. Includes classification (predicting categories) and regression (predicting numbers). The most common form of ML in products.
T
Target (Label, Y) The output variable the model is trained to predict. Defining the target is a product decision: "churn" isn't a target until you define what churn means (subscription canceled, no login in 30 days, explicit cancellation request). The target definition shapes everything downstream, including what data you need, how you evaluate the model, and what action the product takes.
Temporal Relationships Data structure where time order matters: user event sequences, sensor readings, financial time series. Models for temporal data need to respect order: you can only train on the past and validate on the future. Random train-test splits are incorrect for time series because they let the model "see the future" during training.
Test Set A held-out dataset used exactly once for final evaluation after all model development is complete. Must never be touched during feature engineering, hyperparameter tuning, or model selection, because any use contaminates it and makes your performance estimate optimistic. The test set is your best estimate of how the model will perform in production.
Time Series Data Data indexed by time where the sequence of observations matters. Forecasting, anomaly detection, and behavior modeling often involve time series. Standard cross-validation doesn't apply: you must always train on earlier data and validate on later data to avoid leaking future information into training.
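The "train on the past, validate on the future" rule amounts to a sorted, chronological split rather than a random one. A minimal sketch with made-up (timestamp, value) pairs:

```python
# Chronological train/validation split for time-ordered data.
# `events` are made-up (timestamp, value) pairs; real data would be far larger.
events = [(1, 10), (2, 12), (3, 11), (4, 15), (5, 14), (6, 18)]

events.sort(key=lambda e: e[0])          # make sure time order is respected
cutoff = int(len(events) * 0.8)          # e.g. first ~80% for training
train = events[:cutoff]                  # strictly earlier data
valid = events[cutoff:]                  # strictly later data
```

A random shuffle here would mix future observations into the training set, which is exactly the leakage both this entry and Temporal Relationships warn about.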
Training Set The data used to fit model parameters. The model is optimized to perform well on training data, but the goal is always generalization to unseen data. Using training set performance as a proxy for real-world performance is a common mistake.
Transfer Learning Starting from a model that was pretrained on a large general dataset and adapting it to your specific task. Instead of training from scratch, you inherit learned representations (edges and shapes for vision, grammar and semantics for language) and fine-tune on your data. Dramatically reduces the amount of labeled data and compute needed. The standard approach for most NLP and vision tasks today.
Transformer A neural network architecture built around attention mechanisms. Introduced for language tasks and now dominant in NLP. The key innovation is the ability to relate any part of an input to any other part directly, rather than processing sequences step by step. This allows transformers to capture long-range dependencies and contextual meaning far more effectively than earlier architectures. Increasingly applied to vision, audio, and multimodal tasks.
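The core operation, scaled dot-product attention, is compact enough to sketch. This assumes NumPy is available; shapes and values are toy stand-ins, since real models use learned projections of token embeddings:

```python
import numpy as np

# Minimal scaled dot-product attention, the core operation of a transformer.
# Shapes and values are toy; real models use learned projections of tokens.
def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # relevance of each position to every other
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V, weights                     # weighted mix of value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 positions, 8 dimensions each
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
output, weights = attention(Q, K, V)
```

The weights matrix is where "it refers to cat, not mat" lives: row i says how much position i attends to every other position, with no step-by-step sequential processing.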
True Negative (TN) A case that is actually negative and that the model correctly predicts as negative. A legitimate transaction correctly cleared. A healthy patient correctly identified as low-risk.
True Positive (TP) A case that is actually positive and that the model correctly predicts as positive. A fraudulent transaction correctly flagged. A churning user correctly identified.
U
Underfitting When a model is too simple to capture the real patterns in data. It performs poorly on both training and test data. The opposite of overfitting. Caused by too few features, too constrained a model, or too much regularization. Addressed by adding features, using a more expressive model, or relaxing regularization.
Unstructured Data Data without a fixed schema: text, images, audio, video. Can't be directly read into most traditional ML models without processing. Usually requires embedding or deep learning approaches to extract useful representations. The volume of unstructured data in most organizations far exceeds structured data, but it's harder and more expensive to use.
Unsupervised Learning Learning without labeled examples. The goal isn't prediction but organization: finding structure, clusters, and patterns that help make sense of data. Includes clustering and dimensionality reduction. Useful as an exploratory tool, for segmentation, and for discovering patterns before you know what to predict.
V
Validation Set Data held out during model development to tune hyperparameters and compare model candidates. Distinct from the test set: the validation set is used repeatedly during development, which means it provides a somewhat optimistic estimate. The test set is used only once, after everything is finalized, to get a clean estimate.
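The three-way split can be sketched as follows; the 60/20/20 proportions are a common convention, not a rule, and the data here is a made-up stand-in:

```python
import random

# Three-way split: train for fitting, validation for tuning, test used once.
# The 60/20/20 sizes are a common convention, not a rule.
examples = list(range(100))   # stand-in for your dataset rows
random.seed(0)                # reproducible shuffle for illustration
random.shuffle(examples)

train = examples[:60]         # used to fit model parameters
valid = examples[60:80]       # reused during development to compare candidates
test = examples[80:]          # touched exactly once, at the very end
```

(For time series data, replace the shuffle with a chronological split, as noted under Time Series Data.)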
Variance In the bias-variance trade-off: error from a model being too sensitive to the specific training data it saw. A high-variance model fits training data well but produces different results on different samples. Reduced with more data, regularization, simpler models, or ensembling.
Vectorization Representing data as numeric vectors and using matrix operations to compute efficiently across many examples simultaneously. Crucial for making deep learning and large-scale training practical. Without vectorization, training even moderate-sized neural networks would be prohibitively slow.
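A small illustration of the idea, assuming NumPy is available: the same dot product computed with an explicit Python loop and with a single vectorized operation, which pushes the loop into optimized native code:

```python
import numpy as np

# Same dot product two ways: an explicit Python loop vs. one vectorized call.
a = np.arange(1_000, dtype=np.float64)
b = np.arange(1_000, dtype=np.float64)

loop_result = 0.0
for x, y in zip(a, b):        # one Python-level operation per element
    loop_result += x * y

vectorized_result = float(a @ b)  # one matrix/vector operation over all elements
```

Both produce the same number, but on realistic sizes the vectorized form is orders of magnitude faster, which is why frameworks express training as matrix operations.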
W
Word Embeddings Dense vector representations of words learned from large text corpora, where words that appear in similar contexts end up with similar vectors. "King" and "queen" are close in embedding space. "Bank" (financial) and "bank" (river) are far apart. Embeddings are used as the input representation for most NLP models and capture semantic relationships that bag-of-words misses entirely.
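"Close in embedding space" is usually measured with cosine similarity. A minimal sketch with made-up 3-dimensional toy vectors (real embeddings have hundreds of dimensions):

```python
import math

# Cosine similarity between toy "embeddings". The 3-d vectors below are
# made up for illustration; real embeddings are learned and much longer.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

king  = [0.9, 0.8, 0.1]
queen = [0.85, 0.82, 0.12]
mat   = [0.1, 0.05, 0.9]
```

With these toy values, cosine(king, queen) is close to 1 while cosine(king, mat) is not, mirroring the "similar contexts, similar vectors" property.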
Additional terms PMs often need
A/B Testing for ML Controlled online experiments that measure real product impact by comparing a model-driven experience against a baseline. Essential when model outputs affect user-facing behavior. The only way to know whether offline metric improvements actually translate into business value. Requires careful design to avoid biases from novelty effects, network effects, and measurement lag.
Data Pipeline The system that collects, cleans, joins, transforms, and serves features and labels to the model. Many ML failures are pipeline failures, not model failures. Data pipelines are often harder to maintain than the model itself, because they depend on upstream systems that change without warning.
Drift Monitoring Ongoing checks that input distributions, prediction distributions, and outcome metrics haven't shifted in ways that would degrade model performance. Models decay as the world changes: new user behaviors, new product flows, seasonal shifts. Without drift monitoring, you won't know a model has gone stale until users complain or a metric tanks.
Human-in-the-Loop A system design where humans review, label, or override model outputs for a subset of cases. Common in content moderation, fraud review, and high-stakes classification. Allows the product to ship with lower model confidence thresholds (because humans catch edge cases) and creates a feedback loop for improving labels and retraining.
Latency Budget The maximum time allowed for inference before it degrades user experience. A recommendation that takes 2 seconds to generate is fine for an email digest but unusable in a real-time search result. Latency budget is a hard product constraint that rules out certain model choices regardless of their accuracy.
Model Versioning Tracking which version of a model was used when, trained on which data, with which hyperparameters, and producing which predictions. Critical for debugging production issues, auditing decisions, satisfying regulatory requirements, and rolling back when something goes wrong.
Precision-Recall Curve A plot of precision against recall across all possible thresholds. More informative than a ROC curve when the positive class is rare, because it ignores true negatives, which dominate under class imbalance and can make a ROC curve look deceptively good. Allows you to visualize the trade-off and choose an operating point based on the relative costs of false positives and false negatives.
Segmented Metrics Evaluating model performance separately across cohorts: by region, device type, language, user tenure, product tier. Overall metrics can mask the fact that the model works well for the majority and poorly for important minorities. Segmented evaluation reveals hidden risks and is increasingly a requirement for responsible deployment.
Threshold Tuning Deliberately choosing the probability cutoff that converts model scores into decisions, based on the business costs of different error types and the operational capacity of downstream teams. The default of 0.5 is almost never the right answer. Threshold tuning is one of the highest-leverage, lowest-effort improvements available after a model is trained.
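Cost-based threshold selection can be sketched as follows; the scores, labels, and cost figures below are made-up illustration values:

```python
# Pick the threshold that minimizes expected cost, given asymmetric error
# costs. Scores, labels, and costs below are made-up illustration values.
scores = [0.95, 0.9, 0.7, 0.6, 0.4, 0.35, 0.2, 0.1]
labels = [1,    1,   0,   1,   1,   0,    0,   0]  # 1 = actual positive

COST_FP = 1.0    # e.g. cost of one unnecessary manual review
COST_FN = 10.0   # e.g. cost of one missed fraud case

def cost_at(threshold):
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp * COST_FP + fn * COST_FN

# Candidate thresholds: every observed score, plus the naive default 0.5.
best = min(set(scores) | {0.5}, key=cost_at)
```

With these numbers the cost-minimizing threshold is well below 0.5, because missing a positive is ten times more expensive than a false alarm, which is exactly why the 0.5 default is rarely right.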
