The Benchmark First Rule: A Product Manager’s Guide to Machine Learning (Part IV)
Dec 5, 2025
With generative AI everywhere, it is tempting to treat linear models as obsolete.
That is a strategic mistake.
Linear models are the quiet workhorses of machine learning. They are fast, data-efficient, and unusually transparent. They also teach the core logic behind more complex systems: weights, loss functions, and optimization. Even modern neural networks still rely on linear operations as their backbone.
If you build ML-powered products, linear models are not optional knowledge. They are the baseline you should beat and the simplest tool that often ships.
The core idea: linear models are parametric, and that is the point
Linear models are parametric. They assume a fixed mathematical form for the relationship between inputs and outputs, then learn a set of coefficients inside that form.
Unlike many non-parametric approaches, which can grow more complex as data grows, a linear model has a stable structure. It learns a fixed number of parameters.
That constraint creates a clear trade-off:
If the true relationship is complex and nonlinear, linear models can underfit.
If data is limited, noisy, or you need interpretability and speed, linear models can be hard to beat.
A useful way to think about it: the job of a linear model is not to discover any shape. It is to find the best set of weights for a shape you chose on purpose.
1) Linear regression: prediction as geometry
Linear regression assumes the target can be modeled as a weighted sum of the input features.
Simple linear regression
With one feature:
y = w0 + w1 * x
w0 is the bias term, also called the intercept. It is the predicted value when x = 0.
w1 is the weight for the feature. It represents how much the prediction changes when x increases by one unit.
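To make that reading concrete, here is a tiny Python sketch; the intercept and weight values are made up for illustration, not taken from a fitted model.

# Hypothetical weights: intercept w0 = 50, feature weight w1 = 0.8.
w0, w1 = 50.0, 0.8

def predict(x):
    # The prediction is the intercept plus the weight times the feature value.
    return w0 + w1 * x

print(predict(0))    # 50.0 -> the intercept: the prediction when x = 0
print(predict(10))   # 58.0
print(predict(11))   # 58.8 -> increasing x by one unit moves the prediction by w1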
Multiple linear regression
Real products rarely rely on a single feature. With many features:
y = w0 + w1 * x1 + w2 * x2 + ... + wp * xp
Each coefficient is a statement about influence: how strongly that feature pushes the prediction up or down, holding others constant.
For a price model, your features might include square footage, location signals, number of bedrooms, and recent comparable sales. The model learns how to combine them into one prediction.
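As a sketch of what that looks like in code, the snippet below fits a multiple regression with scikit-learn's LinearRegression on a tiny made-up pricing dataset; the feature names and prices are invented for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical listings: [square_feet, bedrooms, distance_to_center_km]
X = np.array([
    [1200, 2, 5.0],
    [1500, 3, 3.5],
    [900,  1, 8.0],
    [2000, 4, 2.0],
    [1700, 3, 4.0],
])
y = np.array([250000, 320000, 180000, 450000, 360000])  # made-up sale prices

model = LinearRegression().fit(X, y)

print(model.intercept_)   # w0
print(model.coef_)        # w1..wp: one coefficient per feature, holding the others constant

# Combine the learned weights into one prediction for a new listing.
print(model.predict([[1400, 3, 4.5]]))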
The cost function: Sum of Squared Error
To learn the weights, the model needs a way to measure wrongness.
A common objective is Sum of Squared Error (SSE):
SSE = Σ (y_hat - y)^2
Squaring does two things:
It makes errors positive.
It penalizes large errors more than small ones.
For linear regression, there is often a closed-form solution that finds the best weights directly. In practice, many teams still use iterative optimizers because they scale better and allow extensions like regularization.
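To make both halves of that statement concrete, here is a small numpy sketch: it measures SSE for any candidate weights and then finds the best weights in closed form; the data points are made up.

import numpy as np

# Made-up data: one feature, with a column of ones so w[0] acts as the intercept.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
X = np.column_stack([np.ones_like(x), x])

def sse(w):
    # Sum of squared errors between predictions (X @ w) and targets.
    residuals = X @ w - y
    return float(residuals @ residuals)

# Closed-form least-squares solution (minimizes SSE directly) via numpy's lstsq.
w_best, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_best)                      # approximately [0.0, 2.03]: intercept and slope
print(sse(w_best))                 # the smallest achievable SSE on this data
print(sse(np.array([0.0, 1.0])))   # any other guess, like slope 1, scores worse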
2) Polynomial regression: linear models can fit curves
A common misconception is that linear models only fit straight lines.
The model is linear in the weights, not necessarily in the original feature.
If you transform features, you can model nonlinear relationships while still using a linear framework.
Examples:
Add x^2 or x^3 as new features.
Use log(x) for relationships that compress at higher values.
Use interaction terms like x1 * x2 when features combine multiplicatively.
This is called polynomial regression when you add polynomial terms. The model is still linear in the weights, but the curve emerges from the transformed inputs.
This is often the best next step after a baseline linear model: try smarter features before jumping to a more complex algorithm.
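Here is a minimal sketch of that step with scikit-learn; the quadratic data, the degree-2 choice, and the random seed are all arbitrary illustration choices.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up curved data: the target grows roughly with the square of x.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50).reshape(-1, 1)
y = 3.0 + 0.5 * x.ravel() ** 2 + rng.normal(scale=2.0, size=50)

# Same linear machinery; the inputs are expanded to include x and x^2.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)

# The model is still linear in its weights; the curve comes from the features.
print(model.predict([[2.0], [5.0], [9.0]]))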
3) Regularization: the complexity tax, controlled
Linear regression that only minimizes SSE will happily assign large weights if they help fit the training data. With enough features, you can get impressive training performance and disappointing test performance.
Regularization adds a penalty that discourages overly large coefficients:
J(w) = SSE + λ * Penalty(w)
λ controls penalty strength.
Higher λ pushes weights toward smaller values, reducing variance and improving generalization.
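As a sketch of that objective in code, the function below adds a penalty term to the SSE; the penalty argument is a placeholder so the same structure covers the L1 and L2 variants described next, and the tiny dataset is made up.

import numpy as np

def regularized_loss(w, X, y, lam, penalty):
    # J(w) = SSE + lambda * Penalty(w)
    residuals = X @ w - y
    return float(residuals @ residuals) + lam * penalty(w)

def l1_penalty(w):
    return np.sum(np.abs(w[1:]))   # sum of absolute weights (intercept excluded)

def l2_penalty(w):
    return np.sum(w[1:] ** 2)      # sum of squared weights (intercept excluded)

# Made-up example: the first column of X is the intercept term.
X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 5.0]])
y = np.array([5.0, 7.0, 11.0])
w = np.array([1.0, 2.0])

print(regularized_loss(w, X, y, lam=0.0, penalty=l1_penalty))   # plain SSE
print(regularized_loss(w, X, y, lam=1.0, penalty=l1_penalty))   # SSE plus the L1 penalty
print(regularized_loss(w, X, y, lam=1.0, penalty=l2_penalty))   # SSE plus the L2 penalty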
LASSO (L1 regularization)
LASSO uses the sum of absolute weights as the penalty:
Penalty(w) = Σ |wj|
It can drive some coefficients exactly to zero.
That makes it useful for feature selection and simpler explanations.
Use LASSO when you want a smaller, cleaner set of features.
Ridge (L2 regularization)
Ridge uses the sum of squared weights as the penalty:
Penalty(w) = Σ wj^2
It shrinks coefficients, but rarely makes them exactly zero.
It is especially helpful when features are highly correlated, a situation called collinearity.
Use Ridge when you have many correlated features and you want stability.
A practical product translation:
LASSO tends to simplify the model by dropping weak signals.
Ridge tends to stabilize the model by spreading weight across correlated signals.
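To see that contrast directly, here is a small scikit-learn comparison on made-up data with two nearly identical features and one irrelevant one; the alpha values are arbitrary.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
n = 200

# Two highly correlated signals plus one pure-noise feature.
signal = rng.normal(size=n)
X = np.column_stack([
    signal,
    signal + rng.normal(scale=0.05, size=n),   # near-duplicate of the first column
    rng.normal(size=n),                        # irrelevant feature
])
y = 3.0 * signal + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# LASSO tends to zero out redundant or weak signals;
# Ridge tends to keep both correlated columns with smaller, shared weights.
print("LASSO coefficients:", lasso.coef_)
print("Ridge coefficients:", ridge.coef_)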
4) Logistic regression: linear logic for classification
Linear regression is not suitable for classification because it can output values below 0 or above 1, which do not behave like probabilities.
Logistic regression fixes this by mapping the linear score through a sigmoid function.
First compute a linear score:
z = w0 + w1 * x1 + ... + wp * xp
Then convert it to a probability with the sigmoid:
σ(z) = 1 / (1 + e^(-z))
Now the output is between 0 and 1 and can be interpreted as:
P(y = 1 | X)
In product terms, logistic regression gives you a risk score. The final decision depends on a threshold you choose based on the cost of false positives and false negatives.
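A minimal sketch of that flow, with made-up weights and an illustrative threshold:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights: an intercept and two feature weights.
w = np.array([-1.0, 0.8, -0.5])

def risk_score(x1, x2):
    z = w[0] + w[1] * x1 + w[2] * x2   # the linear score
    return sigmoid(z)                  # mapped to P(y = 1 | X)

p = risk_score(2.0, 1.0)
print(p)   # a probability between 0 and 1

# The threshold is a product decision, driven by the cost of each kind of mistake.
THRESHOLD = 0.7
print("flag for review" if p >= THRESHOLD else "no action")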
Why logistic regression is trained differently
Logistic regression usually does not have a simple closed-form solution, so it is commonly trained with iterative optimization such as gradient descent.
The intuition:
Compute how changing each weight changes the loss.
Update weights in the direction that reduces loss.
Repeat until improvements flatten.
Two practical knobs matter:
Learning rate: how big each update step is
Regularization: how aggressively you prevent overly confident weights
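Here is a compact sketch of that loop for logistic regression in numpy, using the standard log-loss gradient; the data, learning rate, and regularization strength are all made up for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up binary data: 100 rows, an intercept column plus two features.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
true_w = np.array([0.5, 2.0, -1.0])
y = (sigmoid(X @ true_w) > rng.uniform(size=100)).astype(float)

w = np.zeros(3)        # start from zero weights
learning_rate = 0.1    # how big each update step is
lam = 0.01             # L2 regularization strength

for _ in range(500):
    p = sigmoid(X @ w)                 # current predicted probabilities
    grad = X.T @ (p - y) / len(y)      # gradient of the log loss
    grad[1:] += lam * w[1:]            # penalty gradient (intercept left unpenalized)
    w -= learning_rate * grad          # step in the direction that reduces loss

print(w)   # compare with true_w above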
5) Softmax regression: linear models for multiple classes
Binary classification is not enough for many products: sentiment categories, intent classes, content labels, topic routing.
Softmax regression extends logistic regression to multi-class classification.
It computes a score for each class, then converts scores to probabilities that sum to 1. The predicted class is the one with the highest probability.
This is a linear classifier at scale: one weight vector per class, one prediction based on weighted sums.
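A minimal numpy sketch of that computation, with made-up scores for three classes:

import numpy as np

def softmax(scores):
    # Subtracting the max keeps the exponentials numerically stable.
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

# Hypothetical linear scores: one weighted sum per class.
scores = np.array([2.0, 0.5, -1.0])
probs = softmax(scores)

print(probs)             # probabilities for each class
print(probs.sum())       # 1.0
print(np.argmax(probs))  # index of the predicted class (here, class 0)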
Product implications: why linear models should be your starting point
The benchmark rule
Start with linear or logistic regression to establish a baseline.
If a complex model beats the baseline only marginally, the added cost and risk may not justify it. This is especially true when:
you need real-time inference,
you have limited data,
you need explainability,
the feature pipeline is the real bottleneck.
Interpretability
Linear models are transparent:
You can inspect weights.
You can reason about directionality.
You can explain what features push outcomes up or down.
In regulated domains or trust-sensitive products, this is a major advantage.
Efficiency
Linear models train quickly and run cheaply.
That makes them ideal for:
edge deployment,
high-throughput ranking,
low-latency decisions,
rapid iteration during product discovery.
The real risk: underfitting
The fixed structure that makes linear models stable is also their limit. If the world is nonlinear and the signal requires complex interactions, a linear model can plateau.
The best response is rarely to jump straight to a deep neural network. Often, it is:
better features,
interaction terms,
nonlinear transforms,
or a tree-based model that captures interactions automatically.
A practical playbook
Start simple: build a linear or logistic baseline before anything else.
Check for nonlinear patterns: add transforms like log, square, and interactions before escalating model complexity.
Regularize early: if training performance is far better than test performance, use Ridge or LASSO.
Treat thresholds as a product decision: for logistic models, choose thresholds based on the cost of mistakes and operational capacity.
Use weights for insight: inspect coefficients to spot surprising drivers, data issues, or leakage risks.
Scenario: from horsepower to pixels
If you are predicting fuel efficiency, a simple model might use horsepower to predict miles per gallon. If the relationship is curved, you can add horsepower squared or logged horsepower as features. You still have a linear model, but it fits a curve more realistically.
For image classification, the same logic scales. Each pixel value is a feature. An 8 x 8 image has 64 features. For each class, the model learns a set of weights across those 64 inputs and produces probabilities with softmax.
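As a sketch of that image case, the snippet below uses scikit-learn's built-in 8 x 8 digits dataset with a multi-class logistic regression; the split ratio and iteration limit are arbitrary choices.

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 8 x 8 grayscale digits: each image is flattened into 64 pixel features.
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# One weight vector per digit class, combined through softmax-style scoring.
clf = LogisticRegression(max_iter=2000)
clf.fit(X_train, y_train)

print(clf.coef_.shape)             # (10, 64): 10 classes, 64 pixel weights each
print(clf.score(X_test, y_test))   # accuracy of the purely linear baseline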
Different domains, same foundation: weights, loss, and optimization.
Takeaways
Linear models are parametric by design, which makes them fast, stable, and data-efficient.
Coefficients are weights that quantify how features influence predictions.
SSE is a common objective for regression; regularization helps prevent overfitting.
Polynomial features and transformations let linear models fit nonlinear patterns.
Logistic regression turns linear scores into probabilities for classification.
Softmax generalizes the same idea to multiple classes.
Linear models are the baseline you should start with and the benchmark complex models must earn the right to replace.

