A Comprehensive Machine Learning Glossary for Product Managers
Definitions lean toward how terms show up in product decisions, not just textbooks. Where a concept has a common misconception or a product implication worth flagging, it's included.
A
Accuracy Share of correct predictions: (TP + TN) / (TP + TN + FP + FN). Sounds like the obvious metric to optimize, but it's misleading when one class dominates. If 99% of transactions are legitimate, a model that flags nothing as fraud achieves 99% accuracy and catches zero fraud. Useful only when classes are roughly balanced and false positives and false negatives have similar costs.
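The imbalance pitfall is easy to see in a few lines of Python. The transaction counts below are made up for illustration:

```python
# Illustrative data: 1,000 transactions, only 10 of them fraud (label 1).
y_true = [1] * 10 + [0] * 990

# A degenerate "model" that never flags anything as fraud.
y_pred = [0] * 1000

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
frauds_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print(accuracy)       # 0.99 — looks great on the dashboard
print(frauds_caught)  # 0 — and catches nothing
```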
Activation Function A nonlinear function applied to each neuron's output in a neural network. Without activations, stacking multiple layers still produces a linear function overall: you'd gain computation but not representational power. Activations like ReLU, sigmoid, and tanh introduce nonlinearity, which is what allows deep networks to learn complex patterns. Think of them as the mechanism that lets the network "bend" to fit curved relationships.
Algorithm A general modeling approach or template that maps inputs to outputs. Examples: linear regression, decision trees, random forests, gradient boosting, neural networks. Choosing an algorithm is a trade-off between performance, interpretability, and operational cost. A more complex algorithm isn't always better: it's often slower, harder to explain, and more expensive to run.
Artificial Intelligence (AI) Umbrella term for systems that perform tasks associated with human intelligence: perception, language understanding, reasoning, decision-making. Machine learning is one subfield of AI. Deep learning is a subfield of machine learning. In practice, most AI in products today is machine learning, and most ML in products is supervised learning.
Attention A mechanism inside transformer models that allows the model to weigh how relevant each part of an input is to each other part. In a sentence like "the cat sat on the mat because it was tired," attention helps the model figure out that "it" refers to "cat" and not "mat." This ability to handle long-range relationships is what makes transformers so effective for language tasks. Also increasingly used in vision and multimodal systems.
AUROC (Area Under the ROC Curve) A single number summarizing how well a model ranks positives above negatives across all possible thresholds. 1.0 means perfect ranking, 0.5 means random guessing. Useful for comparing model candidates. Be cautious under extreme class imbalance: the metric can look strong even when the model performs poorly on the minority class that usually matters most.
B
Backpropagation The algorithm that trains neural networks. After a forward pass produces a prediction, the error is propagated backward through the network layer by layer, calculating how much each weight contributed to the mistake. Those gradients are then used to update the weights in a direction that reduces error. This process repeats across thousands or millions of training examples until the model converges.
Bag of Words A simple text representation that turns a document into a count of how many times each word appears, ignoring order and context. "The dog bit the man" and "the man bit the dog" are identical under bag of words. It's fast and sometimes useful, but it misses meaning that depends on sequence. Most modern NLP replaces it with embeddings or transformer-based representations.
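A minimal sketch with Python's `collections.Counter` shows exactly how word order is lost:

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase and split on whitespace; order and context are discarded.
    return Counter(text.lower().split())

a = bag_of_words("The dog bit the man")
b = bag_of_words("The man bit the dog")

print(a == b)  # True — identical counts, opposite meanings
```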
Bagging (Bootstrap Aggregating) An ensemble technique where many models are trained on different random samples of the training data (sampled with replacement). Because each model sees a slightly different dataset, their errors are less correlated. Averaging across them reduces variance and produces a more stable result than any single model. Random forests are built on this principle.
Baseline A simple reference model or heuristic used to judge whether ML adds value. Could be "always predict the most common class," a simple rule ("flag any transaction over $10,000"), or a linear regression. A strong baseline is essential: if your complex model barely beats a baseline, the added cost and risk may not be worth it. Without a baseline, you can't quantify what ML is actually contributing.
Batch Gradient Descent An optimization approach that computes gradients using the entire training dataset before taking a single update step. This produces stable, accurate gradient estimates but is computationally impractical for large datasets because you have to process all the data before making any progress. Mostly replaced in practice by mini-batch gradient descent.
Bias This word has two distinct meanings that are easy to conflate.
In the bias-variance trade-off: bias is the error that comes from a model being too simple to capture the real patterns in data. A linear model applied to a curved relationship has high bias because it's consistently wrong in the same direction.
In fairness and data: bias refers to systematic skew in data collection or model outputs that can disadvantage certain groups. For example, a hiring model trained on historical decisions may learn to replicate past discrimination. Always specify which meaning you intend.
Bias Term (Intercept) The constant term in a linear model, usually written as w0 or b. It sets the model's baseline prediction when all input features are zero. Without it, the model is forced to pass through the origin, which is often wrong.
C
Calibration Whether a model's predicted probabilities match actual frequencies. When a well-calibrated model outputs 0.8, the predicted event should actually occur about 80% of the time. Calibration matters for any product that uses probability scores to make decisions or communicate risk to users. A model can have strong ranking performance (high AUROC) but be poorly calibrated, meaning the scores don't translate directly into reliable probabilities.
Categorical Data Variables that represent discrete groups rather than numeric values: country, device type, plan tier, product category. Most ML algorithms require numeric inputs, so categorical variables need to be encoded before use. Common approaches include one-hot encoding (a separate binary column per category), ordinal encoding (assigning ordered numbers), and learned embeddings.
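One-hot encoding, the most common approach, can be sketched in a few lines. The plan tiers here are hypothetical:

```python
def one_hot(value, categories):
    # One binary column per known category; unseen values become all zeros.
    return [1 if value == c else 0 for c in categories]

plans = ["free", "pro", "enterprise"]  # hypothetical plan tiers
print(one_hot("pro", plans))      # [0, 1, 0]
print(one_hot("unknown", plans))  # [0, 0, 0]
```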
Classification A supervised learning task where the output is a category rather than a number. Binary classification has two possible outputs (fraud vs. legitimate, churn vs. retain). Multi-class classification has more than two (content category, severity level, intent class). The model typically produces a probability score per class, and a threshold or argmax converts that into a final label.
Class Imbalance When one class is much more common than another in training data. A dataset where 1% of transactions are fraudulent has severe class imbalance. It makes accuracy misleading and requires careful metric selection (precision/recall instead of accuracy), and sometimes data techniques like resampling or cost-sensitive training. Most real-world classification problems have some imbalance.
Clustering An unsupervised technique that groups data points by similarity without using labels. The goal is not prediction but organization: finding natural groupings that help humans and systems understand structure in data. Used for customer segmentation, topic discovery, anomaly detection, and exploratory analysis. The results are only as meaningful as your choice of which features define "similar."
Coefficient In a linear model, the weight assigned to a feature that indicates how much the prediction changes when that feature increases by one unit, holding all other features constant. A coefficient of 0.5 on "years of tenure" means the model predicts 0.5 units higher for each additional year. Coefficients are useful for understanding feature influence, but should be interpreted carefully when features are on different scales or correlated.
Coefficient of Determination (R²) For regression, measures the proportion of variance in the target that the model explains compared to a simple baseline of predicting the mean. A value of 0.85 means the model explains 85% of the variance. Useful as a high-level signal but doesn't tell you how costly errors are in practice. A model can have a high R² and still produce economically damaging errors in the tail.
Confusion Matrix A table that breaks classification outcomes into four buckets: true positives (correctly predicted positive), true negatives (correctly predicted negative), false positives (predicted positive, actually negative), and false negatives (predicted negative, actually positive). The foundation for metrics like precision, recall, and false positive rate. Looking at the raw confusion matrix is often more informative than any single metric, because it shows where errors are concentrated.
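The four buckets can be computed directly; the labels and predictions below are illustrative:

```python
def confusion_counts(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

# Illustrative labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)        # 2 4 1 1

precision = tp / (tp + fp)   # 2/3: of everything flagged, how much was real
recall = tp / (tp + fn)      # 2/3: of everything real, how much was caught
```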
Continuous Data Numeric values that can take infinitely many values within a range: time elapsed, revenue, temperature, distance. Contrasted with discrete data, which comes in countable steps. The distinction matters for modeling choices and evaluation metrics.
Convolutional Neural Network (CNN) A neural network architecture designed for images and other spatially structured data. Instead of connecting every input pixel to every neuron, CNNs use small filters that slide across the image (convolutions), allowing the network to detect patterns like edges, textures, and shapes regardless of where they appear. Pooling layers reduce spatial dimensions to keep computation manageable. CNNs learn hierarchical visual features: early layers detect edges, later layers detect objects.
Cost Function (Loss Function) A numeric measure of how wrong the model's predictions are during training. The training process tries to minimize this number. Choosing a loss function is effectively choosing what the model is optimized for, and it should reflect actual business costs. A model trained to minimize mean squared error will behave differently than one trained to minimize mean absolute error, even on the same data.
CRISP-DM A framework for structuring ML projects across six phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment. Useful as shared language across product, data science, and engineering to avoid the common failure mode where modeling starts before the business question is clear.
Cross-Validation An evaluation technique that trains and validates across multiple different splits of the data to get a more stable estimate of real-world performance. For example, 5-fold cross-validation splits data into five parts, trains five models each leaving out a different fold, and averages the results. Important note: time series data requires time-aware validation where training always uses past data and validation uses future data, not random splits.
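The mechanics of splitting data into folds can be sketched in plain Python (in practice, libraries such as scikit-learn provide this, with shuffling and time-aware variants):

```python
def k_fold_indices(n, k):
    # Each example lands in the validation fold exactly once.
    # Not appropriate for time series without time-aware modification.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, val
        start += size

folds = list(k_fold_indices(10, 5))
print(len(folds))   # 5
print(folds[0][1])  # [0, 1] — the first validation fold
```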
D
Data Observed information used for modeling. Includes structured tables (databases, spreadsheets) and unstructured content (text, images, audio, video). In product contexts, "data" is not just a static asset: it also includes the pipelines that produce it, its freshness, who has access to it, how it was collected, and what biases it carries.
Data Drift Change in the statistical distribution of input data over time. Example: a new product flow changes the features your model sees, or user behavior shifts after a marketing campaign. Drift can degrade model performance silently while dashboards still look fine. Monitoring for drift is one of the most important operational concerns after a model ships.
Data Leakage When training uses information that wouldn't be available at prediction time, or when test data indirectly influences model development. Leakage produces inflated offline metrics that collapse in production. Common examples: including future events as features, normalizing data before splitting train and test, or using a feature that's a proxy for the label. If your offline performance looks suspiciously good, check for leakage first.
Decision Tree A model that makes predictions through a sequence of if-then questions. Each question splits the data into subgroups, and the process continues until a final prediction is produced. Easy to visualize and explain: you can trace exactly which conditions led to a prediction. Handles nonlinear relationships and feature interactions naturally. The main risk is overfitting if the tree is allowed to grow too deep.
Deep Learning A subfield of ML that uses neural networks with many layers. The depth allows the network to learn increasingly abstract representations of data: early layers detect low-level patterns (edges in an image, common word sequences in text), later layers detect higher-level concepts (faces, sentiment). Particularly strong on unstructured data (images, text, audio, video) because it can learn features directly from raw inputs instead of requiring manual feature engineering. Usually requires substantial data and compute.
Deployment Moving a trained model into production use. Deployment is not a one-time event. It includes setting up inference infrastructure, monitoring for drift and degradation, building feedback loops for new labels, planning for retraining, and defining who owns the model long-term. Many ML projects succeed in development and fail in deployment because these operational concerns weren't planned for.
Discrete Data Countable numeric values that come in steps rather than varying continuously: number of purchases, number of support tickets, age in years. Distinct from continuous data in how it's modeled and interpreted.
E
Embedding A dense vector representation learned from data that captures similarity. Words with similar meaning end up with similar vectors. Users with similar behavior end up with similar vectors. Items that are frequently purchased together end up close in embedding space. Embeddings are used as inputs to models, for search and recommendation, and for clustering. The key property is that distance in embedding space carries semantic meaning.
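Cosine similarity is the usual way to measure closeness in embedding space. A toy sketch with made-up 3-dimensional vectors (real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    # 1.0 = pointing the same direction; near 0 = unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up vectors: "cat" and "kitten" should land near each other.
cat = [0.9, 0.1, 0.0]
kitten = [0.85, 0.15, 0.05]
invoice = [0.0, 0.2, 0.95]

print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
```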
Ensemble Model A model that combines predictions from multiple models to improve performance and robustness. The intuition is that averaging reduces the impact of any one model's mistakes. Examples include random forests (averaging many decision trees), gradient boosting (trees trained sequentially), and stacking (using a model to combine other models). Usually improves accuracy but reduces interpretability: with hundreds of trees, it's hard to explain any single prediction.
Evaluation The process of measuring model performance in a way that reflects real-world constraints and business value. Evaluation isn't just running metrics: it includes error analysis (what does the model get wrong and why), segment analysis (who is it worse for), robustness checks, and online experimentation when the stakes are high. A model can have strong offline metrics and still fail to create product value.
F
False Negative (FN) The model predicts negative, but the true label is positive. A missed fraud case. A disease that wasn't flagged. A churning user that wasn't identified in time to intervene. How costly a false negative is depends entirely on the domain and the downstream action.
False Positive (FP) The model predicts positive, but the true label is negative. A legitimate transaction flagged as fraud. A healthy patient sent for unnecessary follow-up. An email incorrectly marked as spam. False positives erode trust and can create real harm depending on what action is triggered.
False Positive Rate (FPR) Among all actual negatives, the fraction incorrectly predicted as positive: FP / (FP + TN). Used in ROC curve analysis. A high false positive rate means the model is generating many unnecessary alerts or interventions.
Feature An input variable fed into the model. Can be raw (age, country, device type) or engineered (rolling 7-day spend, days since last login, ratio of completed sessions). Features must be available at prediction time and consistent between training and production. Feature quality is often the single largest driver of model performance, and also the most common source of mistakes.
Feature Engineering Creating or transforming features to make patterns learnable. Examples: applying a log transform to revenue to compress extreme values, creating an interaction term between two features, computing a rolling average over a time window. Good feature engineering encodes domain knowledge that the algorithm might not learn on its own. Often more valuable than switching to a more complex algorithm.
Feature Selection Deciding which features to include in the model. Too few features and the model may underfit. Too many irrelevant features and the model may overfit, train slowly, and become harder to maintain. Feature selection can be manual (domain expertise), statistical (correlation), or automatic (regularization, importance scores). Keeping the feature set clean and defensible also reduces leakage risk.
Fine-Tuning Continuing training of a pretrained model on a new, task-specific dataset. The model starts from learned representations (from a large general dataset) and adapts them to your specific problem. Usually requires less data and compute than training from scratch. The standard approach for NLP and vision tasks where large pretrained models are available.
G
Generalization The ability of a model to perform well on new, unseen data, not just the data it was trained on. Generalization is the actual goal of ML: a model that memorizes training examples but fails on real users isn't useful. Evaluated using held-out test sets and, ultimately, production monitoring.
Gradient Descent An optimization algorithm that iteratively updates model parameters in the direction that reduces the loss function. The gradient tells you the slope of the loss with respect to each parameter. Moving in the opposite direction (descending) reduces the loss. Variants include stochastic gradient descent (one example at a time), mini-batch (small batches), and adaptive methods like Adam that adjust the learning rate automatically.
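The core loop is short. A minimal sketch minimizing a one-parameter toy loss, (w − 3)², whose gradient is 2(w − 3):

```python
# Toy loss with one parameter: loss(w) = (w - 3) ** 2, minimized at w = 3.
w = 0.0
learning_rate = 0.1

for _ in range(100):
    gradient = 2 * (w - 3)
    w -= learning_rate * gradient  # step opposite the gradient to reduce loss

print(round(w, 4))  # 3.0 — converged to the minimum
```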
Ground Truth The correct label or outcome used for training and evaluation. In product settings, ground truth is often messier than it sounds: it can be noisy (human labelers disagree), delayed (you don't know if a user churned until 30 days later), or ambiguous (what counts as a "conversion"?). Label quality sets an upper bound on model quality.
H
Hidden Layer A layer inside a neural network that sits between the input and output layers. Hidden layers learn intermediate representations: features that aren't directly in the input but that the network finds useful for making predictions. The more hidden layers, the more abstract the representations the network can build.
Hyperparameter A configuration value set before training that controls model behavior and complexity. Examples: learning rate, number of trees, tree depth, regularization strength, number of layers, batch size. Unlike model parameters (learned from data), hyperparameters are set by the practitioner and tuned using validation performance. Getting hyperparameters wrong can make a good algorithm look bad.
I
Information Gain A common criterion decision trees use to choose splits (Gini impurity is a popular alternative). It measures how much a given split reduces impurity (how mixed the labels are) at a node. A split that perfectly separates two classes has high information gain. A split that produces groups just as mixed as before has zero. The algorithm tries all possible splits at each node and picks the one with the highest information gain.
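A sketch of the entropy-based calculation on a six-example toy dataset:

```python
import math

def entropy(labels):
    # How mixed the labels are: 0.0 = pure, 1.0 = a 50/50 binary split.
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(parent, left, right):
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = [1, 1, 1, 0, 0, 0]
perfect = information_gain(parent, [1, 1, 1], [0, 0, 0])
useless = information_gain(parent, [1, 0], [1, 1, 0, 0])

print(perfect)  # 1.0 — the split removes all uncertainty
print(useless)  # ~0.0 — each side is as mixed as the parent
```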
Inference Running a trained model on new data to generate predictions. Distinct from training, which is the process of learning the model. Inference cost, latency, and throughput are often the real operational constraints in production: a model that takes 500ms to return a prediction may be unusable in a real-time user flow.
Interpretability How easily a human can understand why a model made a specific prediction. Linear models and shallow decision trees are highly interpretable: you can inspect weights and trace decisions. Large ensembles and deep networks are much less so. Interpretability matters for trust, debugging, compliance, and any product where you need to explain or defend individual decisions to users or regulators.
K
K-Fold Cross-Validation A cross-validation method that divides the data into K equal parts (folds), trains K separate models each time leaving out a different fold as the validation set, and averages performance across folds. More reliable than a single train-test split because the estimate is less sensitive to which examples happened to end up in which split. Not appropriate for time series data without modification.
K-Means Clustering A clustering algorithm that groups data by iteratively assigning points to the nearest of K cluster centers and moving those centers to the mean of their assigned points. Fast and scalable. Works well when clusters are roughly spherical and similar in size. Sensitive to the initial placement of centers and to feature scaling (since it uses distance). Requires you to choose K in advance, which is itself a product decision about how many distinct groups are meaningful.
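The assign-then-update loop can be sketched in plain Python on made-up 2-D points (real use would rely on a library and scale features first; results depend on initialization):

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers

# Made-up 2-D points forming two well-separated groups.
points = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers = kmeans(points, k=2)
print(sorted(centers))  # one center near (0.1, 0.1), one near (5.0, 5.0)
```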
L
Label The target value used in supervised learning: the "right answer" the model is trained to predict. Churned or not. Fraud or legitimate. A revenue amount. Label quality and consistency usually matter more than algorithm choice. Noisy, inconsistent, or delayed labels can make any algorithm look bad.
LASSO (L1 Regularization) A regularization technique that adds a penalty proportional to the absolute value of each coefficient: λ * Σ |w|. Unlike Ridge, LASSO can push some coefficients exactly to zero, effectively removing those features from the model. This makes it useful both for preventing overfitting and for automatic feature selection when you want a simpler, more interpretable model.
Learning Rate The step size used in gradient descent when updating model parameters. Too high and updates overshoot, causing training to oscillate or diverge. Too low and training converges very slowly or gets stuck. One of the most important hyperparameters to tune. Modern optimizers like Adam adapt the learning rate automatically during training.
Linear Regression A model that predicts a numeric value as a weighted sum of input features: y = w0 + w1x1 + w2x2 + ... Fast to train, interpretable, and often a surprisingly strong baseline. The right starting point for most regression problems before escalating to more complex algorithms.
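For a single feature, the best-fit line has a closed-form solution. A sketch with noise-free toy data generated from y = 2 + 3x, so the fit recovers the weights exactly:

```python
def fit_line(xs, ys):
    # Ordinary least squares for one feature: y = w0 + w1 * x.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    w1 = num / den
    w0 = mean_y - w1 * mean_x
    return w0, w1

# Noise-free toy data from y = 2 + 3x.
xs = [0, 1, 2, 3, 4]
ys = [2, 5, 8, 11, 14]
w0, w1 = fit_line(xs, ys)
print(w0, w1)  # 2.0 3.0
```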
Logistic Regression A classification model that applies a sigmoid function to a linear score to produce a probability between 0 and 1. Despite the name, it's a classifier, not a regression model. Strong baseline for binary classification and risk scoring. Interpretable, fast, and well-understood. Often outperforms complex models when data is limited or when the relationship between features and the outcome is roughly linear.
Loss Function See Cost Function. For PMs: the loss function is how the model "learns what matters." If your loss doesn't align with business costs, you can optimize the model and hurt the product. For example, a model minimizing average error might perform well on most cases and catastrophically on rare but high-stakes ones.
M
MAE (Mean Absolute Error) Average of the absolute differences between predictions and true values: mean(|ŷ - y|). Treats all errors proportionally, making it robust to outliers. Easy to interpret in the original units of the target. Preferred when you want consistent closeness and don't want a few extreme cases to dominate the metric.
MAPE (Mean Absolute Percent Error) Average of absolute percentage errors: mean(|(ŷ - y) / y|). Useful for communicating error in percentage terms to stakeholders. Breaks down when the true value is near zero, where small absolute errors produce enormous percentage errors. Use cautiously and only when the target is always meaningfully above zero.
MSE (Mean Squared Error) Average of squared errors: mean((ŷ - y)²). Squaring penalizes large errors more than small ones. Useful when large misses are disproportionately costly. Less intuitive than MAE because it's in squared units. Often used alongside RMSE (its square root) to return to original units.
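Computing all three metrics on the same illustrative predictions shows how differently they treat one large miss:

```python
# Illustrative forecasts: three small misses and one large one.
y_true = [100, 200, 300, 400]
y_pred = [110, 190, 310, 480]

errors = [p - t for p, t in zip(y_pred, y_true)]   # [10, -10, 10, 80]

mae = sum(abs(e) for e in errors) / len(errors)
mse = sum(e ** 2 for e in errors) / len(errors)
rmse = mse ** 0.5
mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / len(errors)

print(mae)             # 27.5 — every miss counted in proportion to its size
print(round(rmse, 1))  # 40.9 — the single 80-unit miss dominates
print(round(mape, 4))  # 0.0958 — about a 9.6% average percent error
```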
Mini-Batch Gradient Descent The standard approach to training neural networks and many other large models. Updates model weights using a small batch of training examples (typically 32, 64, or 128) per step rather than one example or the full dataset. Efficient on modern hardware, provides stable enough gradient estimates for reliable training, and allows training to proceed continuously as data arrives.
Model A learned function that maps inputs to outputs. In product contexts, "the model" rarely refers to the mathematical function alone. It includes the data pipeline that produces features, the threshold that converts probabilities to decisions, the monitoring that tracks performance, and the retraining process that keeps it current. A model without these components is a demo, not a product.
N
Neural Network A model composed of layers of connected neurons, each applying a weighted sum followed by a nonlinear activation. Neural networks can approximate extremely complex functions given enough data and compute. The key advantage over simpler models is representation learning: instead of requiring hand-engineered features, the network learns useful intermediate representations directly from raw inputs.
Non-Parametric Algorithm A model that doesn't assume a fixed functional form and can grow in complexity with data. Decision trees and k-nearest neighbors are examples. More flexible than parametric models, but can be more expensive and harder to interpret. The trade-off: they can fit more complex patterns but also have more ways to overfit.
O
Offline Evaluation Testing model performance on historical data before deployment. Necessary but not sufficient. Offline metrics measure how well the model fits historical patterns, not whether it creates value in real user interactions. Offline gains don't always translate to product impact, and sometimes a model with slightly lower offline metrics is actually better in production because it generalizes differently.
Online Evaluation Testing model impact in real user environments, typically via A/B testing. Stronger evidence for business outcomes because it measures actual user behavior, not historical patterns. More complex and risk-sensitive than offline evaluation because real users are affected.
Outcomes The business results you care about: cost saved, revenue generated, risk reduced, time saved. Outcomes are expressed in business language and are what justify building ML in the first place. Choosing model metrics that don't connect to outcomes is one of the most common ways ML projects produce technically impressive results that don't matter.
Outputs What the model actually produces: a prediction, a probability score, a ranked list. Outputs are evaluated using technical metrics. The discipline of evaluation is choosing output metrics that reliably predict whether your desired outcomes are being achieved.
Overfitting When a model fits the noise in training data rather than the underlying signal, performing well on training data but poorly on new data. The model has essentially memorized quirks of the training set rather than learning generalizable patterns. Mitigated with more data, regularization, simpler models, early stopping, and rigorous validation.
P
Parametric Algorithm A model with a fixed structure and a fixed number of parameters, regardless of how much data you have. Linear and logistic regression are examples. The structure is chosen in advance, and training learns the best weights within that structure. More interpretable and computationally efficient than non-parametric models, but less flexible.
Perceptron The earliest artificial neuron model: a weighted sum of inputs passed through a hard threshold (outputs either 0 or 1). The conceptual ancestor of modern neural networks. Important historically and useful for building intuition about how neurons work, but too limited for real tasks.
Pooling Layer A layer in a CNN that reduces spatial dimensions by summarizing small regions of the previous layer. Max pooling keeps the strongest activation in each region; average pooling takes the mean. Reduces the number of parameters and makes the network somewhat invariant to small shifts in the input.
Precision Among all cases the model predicts as positive, the fraction that are actually positive: TP / (TP + FP). High precision means fewer false alarms. Important when acting on a false positive is costly: sending a discount to a user who wasn't going to churn, blocking a legitimate transaction, or generating an alert that wastes analyst time.
Probability Threshold The cutoff that converts a model's probability output into a binary decision. A model predicting fraud might output 0.73, and you decide to flag any prediction above 0.5. But 0.5 is often the wrong threshold. The right threshold depends on the relative costs of false positives and false negatives, and on operational capacity. Threshold selection is a product decision, not a technical default.
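Sweeping the threshold over hypothetical scores makes the trade concrete: lower thresholds catch more positives (higher recall) at the cost of more false alarms:

```python
def precision_recall_at(threshold, scores, labels):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and t == 1 for p, t in zip(preds, labels))
    fp = sum(p == 1 and t == 0 for p, t in zip(preds, labels))
    fn = sum(p == 0 and t == 1 for p, t in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical model scores and true labels (1 = positive class).
scores = [0.95, 0.80, 0.70, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    0,    1,    0,    0]

for threshold in (0.3, 0.5, 0.7):
    p, r = precision_recall_at(threshold, scores, labels)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

Picking among those operating points is exactly the product decision the entry describes.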
R
Random Forest An ensemble of decision trees, each trained on a bootstrapped sample of the data and using a random subset of features at each split. Averaging across many trees reduces variance and produces more stable predictions than any single tree. Often a strong default for tabular data: handles nonlinear relationships, doesn't require heavy feature engineering, and is robust to outliers. The trade-off is interpretability: with hundreds of trees, you can't explain individual predictions the way you can with a single tree.
Recall (Sensitivity, True Positive Rate) Among all cases that are actually positive, the fraction the model correctly identifies: TP / (TP + FN). High recall means fewer missed cases. Important when false negatives are costly: missing a fraud case, missing a disease, failing to catch a safety violation. Precision and recall trade off against each other: you usually improve one by worsening the other.
Regression A supervised learning task where the output is a continuous numeric value. Predict revenue, estimate delivery time, forecast demand. Distinct from classification, which predicts categories. The distinction matters for choosing algorithms, loss functions, and evaluation metrics.
Regularization Techniques that add a penalty to discourage overly complex models, improving generalization. L1 (LASSO) and L2 (Ridge) add penalties on the size of model weights. Dropout randomly deactivates neurons during training. Early stopping halts training before the model starts memorizing noise. Regularization is the primary tool for controlling the bias-variance trade-off.
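The L1 and L2 penalty terms themselves are simple. A sketch with made-up weights and penalty strength λ = 0.1:

```python
# Hypothetical model weights and penalty strength (lambda).
weights = [2.0, -0.5, 0.0, 3.0]
lam = 0.1

l1_penalty = lam * sum(abs(w) for w in weights)   # LASSO term
l2_penalty = lam * sum(w ** 2 for w in weights)   # Ridge term

# Training minimizes data_loss + penalty, so large weights become
# expensive and the model is pushed toward simpler solutions.
print(round(l1_penalty, 4))  # 0.55
print(round(l2_penalty, 4))  # 1.325
```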
Reinforcement Learning A learning paradigm where an agent learns by taking actions in an environment and receiving rewards or penalties based on outcomes. Unlike supervised learning (which learns from labeled examples), reinforcement learning learns from interaction. Used for game-playing, recommendation optimization, and adaptive systems. Powerful but harder to deploy responsibly because the system changes its own behavior over time, which can create unintended feedback loops.
ReLU (Rectified Linear Unit) An activation function defined as max(0, x): returns 0 for negative inputs and the input itself for positive inputs. Simple, computationally cheap, and empirically effective. One of the most widely used activations in deep networks because it trains faster and is less prone to vanishing gradients than sigmoid or tanh.
Ridge Regression (L2 Regularization) A regularization technique that adds a penalty proportional to the squared value of each coefficient: λ * Σ w². Shrinks coefficients toward zero but rarely to exactly zero. Particularly effective when features are correlated (collinearity), because it distributes weight across correlated features rather than arbitrarily assigning all weight to one.
ROC Curve A plot of true positive rate (recall) against false positive rate across all possible thresholds. Each point on the curve represents a different operating point for the model. The area under the curve (AUROC) summarizes overall ranking quality. Useful for comparing models and understanding the trade-off between catching positives and raising false alarms, but can be misleading under severe class imbalance.
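Each point on the curve comes from sweeping a threshold over the model's scores. A minimal sketch, using made-up scores and labels:

```python
# One ROC operating point per threshold, computed from model scores.
# Scores and labels below are made-up illustration values.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   0,   1,   0,   0]  # 1 = actual positive

def roc_point(threshold):
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    pos = sum(labels)
    neg = len(labels) - pos
    return fp / neg, tp / pos  # (false positive rate, true positive rate)

curve = [roc_point(t) for t in (0.95, 0.75, 0.5, 0.25, 0.05)]
```

Raising the threshold moves you toward the bottom-left of the curve (fewer false alarms, more misses); lowering it moves you toward the top-right.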
S
Sigmoid Function A function that maps any real number to a value between 0 and 1: σ(z) = 1 / (1 + e^(-z)). Used in logistic regression to convert a linear score into a probability. Also used as an activation function in neural networks, though largely replaced in hidden layers by ReLU for practical reasons.
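The formula, written out directly:

```python
import math

def sigmoid(z):
    """Map any real score to (0, 1): sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))
```

A score of 0 maps to exactly 0.5; large positive scores approach 1 and large negative scores approach 0.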
Softmax A function that takes a vector of scores (one per class) and converts them into a probability distribution that sums to 1. The class with the highest score gets the highest probability. Used in multi-class classification at the output layer of a neural network.
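A minimal implementation (subtracting the maximum score first is a standard numerical-stability trick and doesn't change the result):

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1.

    Subtracting the max score first avoids overflow for large scores
    without changing the output.
    """
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```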
Spatial Relationships Data structure based on proximity in space: adjacent pixels in an image, nearby coordinates on a map, neighboring cells in a grid. CNN architectures are specifically designed to exploit spatial relationships by using local filters that share weights across locations.
Structured Data Tabular data with well-defined fields and consistent schema: database tables, spreadsheets, CSV exports. Easier to query, join, and feed directly into models than unstructured data. Most traditional ML (linear models, trees, gradient boosting) works on structured data.
Supervised Learning Learning from labeled examples where the "right answer" is known for each training case. The model learns a mapping from inputs to outputs that generalizes to new cases. Includes classification (predicting categories) and regression (predicting numbers). The most common form of ML in products.
T
Target (Label, Y) The output variable the model is trained to predict. Defining the target is a product decision: "churn" isn't a target until you define what churn means (subscription canceled, no login in 30 days, explicit cancellation request). The target definition shapes everything downstream, including what data you need, how you evaluate the model, and what action the product takes.
Temporal Relationships Data structure where time order matters: user event sequences, sensor readings, financial time series. Models for temporal data need to respect order: you can only train on the past and validate on the future. Random train-test splits are incorrect for time series because they let the model "see the future" during training.
Test Set A held-out dataset used exactly once for final evaluation after all model development is complete. Must never be touched during feature engineering, hyperparameter tuning, or model selection, because any use contaminates it and makes your performance estimate optimistic. The test set is your best estimate of how the model will perform in production.
Time Series Data Data indexed by time where the sequence of observations matters. Forecasting, anomaly detection, and behavior modeling often involve time series. Standard cross-validation doesn't apply: you must always train on earlier data and validate on later data to avoid leaking future information into training.
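The "train on the past, validate on the future" rule amounts to a sorted, chronological split rather than a random one. A minimal sketch with made-up (timestamp, value) pairs:

```python
# Chronological train/validation split for time-ordered data.
# `events` are made-up (timestamp, value) pairs; real data would be far larger.
events = [(1, 10), (2, 12), (3, 11), (4, 15), (5, 14), (6, 18)]

events.sort(key=lambda e: e[0])          # make sure time order is respected
cutoff = int(len(events) * 0.8)          # e.g. first ~80% for training
train = events[:cutoff]                  # strictly earlier data
valid = events[cutoff:]                  # strictly later data
```

A random shuffle here would mix future observations into the training set, which is exactly the leakage both this entry and Temporal Relationships warn about.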
Training Set The data used to fit model parameters. The model is optimized to perform well on training data, but the goal is always generalization to unseen data. Using training set performance as a proxy for real-world performance is a common mistake.
Transfer Learning Starting from a model that was pretrained on a large general dataset and adapting it to your specific task. Instead of training from scratch, you inherit learned representations (edges and shapes for vision, grammar and semantics for language) and fine-tune on your data. Dramatically reduces the amount of labeled data and compute needed. The standard approach for most NLP and vision tasks today.
Transformer A neural network architecture built around attention mechanisms. Introduced for language tasks and now dominant in NLP. The key innovation is the ability to relate any part of an input to any other part directly, rather than processing sequences step by step. This allows transformers to capture long-range dependencies and contextual meaning far more effectively than earlier architectures. Increasingly applied to vision, audio, and multimodal tasks.
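The core operation, scaled dot-product attention, is compact enough to sketch. This assumes NumPy is available; shapes and values are toy stand-ins, since real models use learned projections of token embeddings:

```python
import numpy as np

# Minimal scaled dot-product attention, the core operation of a transformer.
# Shapes and values are toy; real models use learned projections of tokens.
def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # relevance of each position to every other
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V, weights                     # weighted mix of value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 positions, 8 dimensions each
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
output, weights = attention(Q, K, V)
```

The weights matrix is where "it refers to cat, not mat" lives: row i says how much position i attends to every other position, with no step-by-step sequential processing.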
True Negative (TN) A case that is actually negative and that the model correctly predicts as negative. A legitimate transaction correctly cleared. A healthy patient correctly identified as low-risk.
True Positive (TP) A case that is actually positive and that the model correctly predicts as positive. A fraudulent transaction correctly flagged. A churning user correctly identified.
U
Underfitting When a model is too simple to capture the real patterns in data. It performs poorly on both training and test data. The opposite of overfitting. Caused by too few features, too constrained a model, or too much regularization. Addressed by adding features, using a more expressive model, or relaxing regularization.
Unstructured Data Data without a fixed schema: text, images, audio, video. Can't be directly read into most traditional ML models without processing. Usually requires embedding or deep learning approaches to extract useful representations. The volume of unstructured data in most organizations far exceeds structured data, but it's harder and more expensive to use.
Unsupervised Learning Learning without labeled examples. The goal isn't prediction but organization: finding structure, clusters, and patterns that help make sense of data. Includes clustering and dimensionality reduction. Useful as an exploratory tool, for segmentation, and for discovering patterns before you know what to predict.
V
Validation Set Data held out during model development to tune hyperparameters and compare model candidates. Distinct from the test set: the validation set is used repeatedly during development, which means it provides a somewhat optimistic estimate. The test set is used only once, after everything is finalized, to get a clean estimate.
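The three-way split can be sketched as follows; the 60/20/20 proportions are a common convention, not a rule, and the data here is a made-up stand-in:

```python
import random

# Three-way split: train for fitting, validation for tuning, test used once.
# The 60/20/20 sizes are a common convention, not a rule.
examples = list(range(100))   # stand-in for your dataset rows
random.seed(0)                # reproducible shuffle for illustration
random.shuffle(examples)

train = examples[:60]         # used to fit model parameters
valid = examples[60:80]       # reused during development to compare candidates
test = examples[80:]          # touched exactly once, at the very end
```

(For time series data, replace the shuffle with a chronological split, as noted under Time Series Data.)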
Variance In the bias-variance trade-off: error from a model being too sensitive to the specific training data it saw. A high-variance model fits training data well but produces different results on different samples. Reduced with more data, regularization, simpler models, or ensembling.
Vectorization Representing data as numeric vectors and using matrix operations to compute efficiently across many examples simultaneously. Crucial for making deep learning and large-scale training practical. Without vectorization, training even moderate-sized neural networks would be prohibitively slow.
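A small illustration of the idea, assuming NumPy is available: the same dot product computed with an explicit Python loop and with a single vectorized operation, which pushes the loop into optimized native code:

```python
import numpy as np

# Same dot product two ways: an explicit Python loop vs. one vectorized call.
a = np.arange(1_000, dtype=np.float64)
b = np.arange(1_000, dtype=np.float64)

loop_result = 0.0
for x, y in zip(a, b):        # one Python-level operation per element
    loop_result += x * y

vectorized_result = float(a @ b)  # one matrix/vector operation over all elements
```

Both produce the same number, but on realistic sizes the vectorized form is orders of magnitude faster, which is why frameworks express training as matrix operations.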
W
Word Embeddings Dense vector representations of words learned from large text corpora, where words that appear in similar contexts end up with similar vectors. "King" and "queen" are close in embedding space. "Bank" (financial) and "bank" (river) are far apart. Embeddings are used as the input representation for most NLP models and capture semantic relationships that bag-of-words misses entirely.
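"Close in embedding space" is usually measured with cosine similarity. A minimal sketch with made-up 3-dimensional toy vectors (real embeddings have hundreds of dimensions):

```python
import math

# Cosine similarity between toy "embeddings". The 3-d vectors below are
# made up for illustration; real embeddings are learned and much longer.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

king  = [0.9, 0.8, 0.1]
queen = [0.85, 0.82, 0.12]
mat   = [0.1, 0.05, 0.9]
```

With these toy values, cosine(king, queen) is close to 1 while cosine(king, mat) is not, mirroring the "similar contexts, similar vectors" property.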
Additional terms PMs often need
A/B Testing for ML Controlled online experiments that measure real product impact by comparing a model-driven experience against a baseline. Essential when model outputs affect user-facing behavior. The only way to know whether offline metric improvements actually translate into business value. Requires careful design to avoid biases from novelty effects, network effects, and measurement lag.
Data Pipeline The system that collects, cleans, joins, transforms, and serves features and labels to the model. Many ML failures are pipeline failures, not model failures. Data pipelines are often harder to maintain than the model itself, because they depend on upstream systems that change without warning.
Drift Monitoring Ongoing checks that input distributions, prediction distributions, and outcome metrics haven't shifted in ways that would degrade model performance. Models decay as the world changes: new user behaviors, new product flows, seasonal shifts. Without drift monitoring, you won't know a model has gone stale until users complain or a metric tanks.
Human-in-the-Loop A system design where humans review, label, or override model outputs for a subset of cases. Common in content moderation, fraud review, and high-stakes classification. Allows the product to ship with lower model confidence thresholds (because humans catch edge cases) and creates a feedback loop for improving labels and retraining.
Latency Budget The maximum time allowed for inference before it degrades user experience. A recommendation that takes 2 seconds to generate is fine for an email digest but unusable in a real-time search result. Latency budget is a hard product constraint that rules out certain model choices regardless of their accuracy.
Model Versioning Tracking which version of a model was used when, trained on which data, with which hyperparameters, and producing which predictions. Critical for debugging production issues, auditing decisions, satisfying regulatory requirements, and rolling back when something goes wrong.
Precision-Recall Curve A plot of precision against recall across all possible thresholds. More informative than a ROC curve when the positive class is rare, because it ignores true negatives, which dominate under class imbalance and can make a ROC curve look deceptively good. Allows you to visualize the trade-off and choose an operating point based on the relative costs of false positives and false negatives.
Segmented Metrics Evaluating model performance separately across cohorts: by region, device type, language, user tenure, product tier. Overall metrics can mask the fact that the model works well for the majority and poorly for important minorities. Segmented evaluation reveals hidden risks and is increasingly a requirement for responsible deployment.
Threshold Tuning Deliberately choosing the probability cutoff that converts model scores into decisions, based on the business costs of different error types and the operational capacity of downstream teams. The default of 0.5 is almost never the right answer. Threshold tuning is one of the highest-leverage, lowest-effort improvements available after a model is trained.
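Cost-based threshold selection can be sketched as follows; the scores, labels, and cost figures below are made-up illustration values:

```python
# Pick the threshold that minimizes expected cost, given asymmetric error
# costs. Scores, labels, and costs below are made-up illustration values.
scores = [0.95, 0.9, 0.7, 0.6, 0.4, 0.35, 0.2, 0.1]
labels = [1,    1,   0,   1,   1,   0,    0,   0]  # 1 = actual positive

COST_FP = 1.0    # e.g. cost of one unnecessary manual review
COST_FN = 10.0   # e.g. cost of one missed fraud case

def cost_at(threshold):
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp * COST_FP + fn * COST_FN

# Candidate thresholds: every observed score, plus the naive default 0.5.
best = min(set(scores) | {0.5}, key=cost_at)
```

With these numbers the cost-minimizing threshold is well below 0.5, because missing a positive is ten times more expensive than a false alarm, which is exactly why the 0.5 default is rarely right.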
