The Benchmark-First Rule: A Product Manager’s Guide to Machine Learning (Part IV)
The core idea: parametric by design
Linear models are parametric. They assume a fixed mathematical form for the relationship between inputs and outputs, then learn a set of coefficients inside that form.
Unlike non-parametric approaches, which can grow more complex as data grows, a linear model has a stable structure. It learns a fixed number of parameters. That constraint creates a clear trade-off: if the true relationship is complex and nonlinear, linear models can underfit. But if data is limited or noisy, or you need interpretability and speed, they can be hard to beat.
The job of a linear model is not to discover any shape. It's to find the best set of weights for a shape you chose on purpose.
Linear models are also the conceptual backbone of more complex systems. Weights, loss functions, and optimization all originate here. Even modern neural networks rely on linear operations as their core building block.
Linear regression: prediction as a weighted sum
Linear regression assumes the target can be modeled as a weighted sum of the input features.
Simple linear regression with one feature:

ŷ = w0 + w1 · x
w0 is the bias term (intercept): the predicted value when x = 0.
w1 is the weight for the feature: how much the prediction changes when x increases by one unit.
Multiple linear regression with many features:

ŷ = w0 + w1·x1 + w2·x2 + … + wn·xn
Each coefficient is a statement about influence: how strongly that feature pushes the prediction up or down, holding others constant. For a price model, your features might include square footage, location signals, number of bedrooms, and recent comparable sales. The model learns how to combine them into one prediction.
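To make the weighted sum concrete, here is a minimal sketch; the feature names, weights, and bias are hypothetical, chosen only to illustrate the arithmetic, not taken from a real price model:

```python
# A multiple-regression prediction is just a weighted sum plus a bias.
# All values below are hypothetical, for illustration only.
features = {"sqft": 1500, "bedrooms": 3, "location_score": 0.8}
weights = {"sqft": 120.0, "bedrooms": 9000.0, "location_score": 50000.0}
bias = 25000.0  # w0: the predicted price when every feature is zero

prediction = bias + sum(weights[name] * value for name, value in features.items())
print(prediction)
```

Each term in the sum is one coefficient's contribution; reading them off is exactly how you reason about which features push the prediction up or down.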
Learning the weights: Sum of Squared Error
To learn the weights, the model needs a way to measure wrongness. A common objective is Sum of Squared Error (SSE):

SSE = Σi (yi − ŷi)²

where yi is the actual value and ŷi is the model's prediction for example i.
Squaring does two things: it makes errors positive, and it penalizes large errors more than small ones. For linear regression, there's often a closed-form solution that finds the best weights directly. In practice, many teams use iterative optimizers because they scale better and allow extensions like regularization.
Polynomial regression: linear models can fit curves
A common misconception is that linear models only fit straight lines. The model is linear in the weights, not necessarily in the original features. If you transform features, you can model nonlinear relationships while still using a linear framework.
Examples of useful transformations:
Add x² or x³ as new features
Use log(x) for relationships that compress at higher values
Use interaction terms like x1 * x2 when features combine multiplicatively
This is called polynomial regression when you add polynomial terms. The model is still linear in the weights, but the curve emerges from the transformed inputs.
This is usually the best next step after a baseline: try smarter features before reaching for a more complex algorithm.
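A sketch of the transformation step. The weights here are hypothetical (in practice they are learned exactly as before); the point is that the model stays a weighted sum:

```python
# Polynomial regression = linear regression on transformed features.
# The model is still linear in the weights; the curve comes from x and x**2.
def to_poly_features(x):
    return [x, x ** 2]

# Hypothetical learned weights for y ≈ w0 + w1*x + w2*x²
w0, w1, w2 = 2.0, -1.0, 0.5

def predict(x):
    f1, f2 = to_poly_features(x)
    return w0 + w1 * f1 + w2 * f2  # still a plain weighted sum

print([predict(x) for x in [0, 1, 2, 3]])
```

The predictions trace a parabola, yet nothing about the learning machinery changed — only the inputs did.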
Regularization: controlling overfitting
Linear regression that only minimizes SSE will assign large weights if they help fit the training data. With enough features, you can get impressive training performance and disappointing test performance. Regularization adds a penalty that discourages overly large coefficients:

Loss = SSE + λ · (penalty on weight size)
λ controls penalty strength. Higher λ pushes weights toward smaller values, reducing variance and improving generalization.
LASSO (L1 regularization) uses the sum of absolute weights. It can drive some coefficients exactly to zero, which makes it useful for feature selection and simpler explanations. Use LASSO when you want a smaller, cleaner set of features.
Ridge (L2 regularization) uses the sum of squared weights. It shrinks coefficients but rarely makes them exactly zero, which is especially useful when features are highly correlated (collinearity). Use Ridge when you have many correlated features and want stability.
A practical translation: LASSO simplifies the model by dropping weak signals. Ridge stabilizes it by spreading weight across correlated signals.
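The two penalties differ only in how they measure weight size. A minimal sketch of the penalty terms themselves (the weights and λ value are arbitrary, chosen for illustration; in practice λ is tuned on validation data):

```python
# L1 (LASSO) penalizes the sum of absolute weights;
# L2 (Ridge) penalizes the sum of squared weights.
weights = [3.0, -0.5, 0.0, 2.0]
lam = 0.1  # λ: penalty strength, normally chosen by validation

l1_penalty = lam * sum(abs(w) for w in weights)  # LASSO term
l2_penalty = lam * sum(w ** 2 for w in weights)  # Ridge term

# Total loss = SSE + penalty; larger weights cost more under either scheme.
print(l1_penalty, l2_penalty)
```

Note how the squared penalty punishes the large weight (3.0) much harder than the small one (−0.5), which is why Ridge spreads weight across correlated signals while L1's uniform per-unit cost lets weak weights hit exactly zero.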
Logistic regression: linear logic for classification
Linear regression can't be used directly for classification because it can output values below 0 or above 1, which don't behave like probabilities. Logistic regression fixes this by mapping the linear score through a sigmoid function.
First compute a linear score:

z = w0 + w1·x1 + w2·x2 + … + wn·xn

Then convert to a probability with the sigmoid:

σ(z) = 1 / (1 + e^(−z))
The output is now between 0 and 1 and can be interpreted as P(y = 1 | X). In product terms, logistic regression gives you a risk score. The final decision depends on a threshold you choose based on the cost of false positives and false negatives.
Training logistic regression usually requires iterative optimization (like gradient descent) rather than a closed-form solution. The intuition: compute how changing each weight changes the loss, update weights in the direction that reduces loss, repeat until improvements flatten. Two practical knobs: learning rate (how big each step is) and regularization (how aggressively you prevent overconfident weights).
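The loop described above can be sketched in a few lines. The data is a toy one-feature dataset and the learning rate and iteration budget are arbitrary; real implementations add regularization and convergence checks:

```python
import math

# Toy 1-feature dataset: larger x → more likely class 1.
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]

w0, w1 = 0.0, 0.0  # start from zero weights
lr = 0.1           # learning rate: step size per update

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for _ in range(2000):  # "repeat until improvements flatten" (fixed budget here)
    # Gradient of the log loss with respect to each weight.
    g0 = sum(sigmoid(w0 + w1 * x) - y for x, y in zip(xs, ys)) / len(xs)
    g1 = sum((sigmoid(w0 + w1 * x) - y) * x for x, y in zip(xs, ys)) / len(xs)
    w0 -= lr * g0  # step in the direction that reduces loss
    w1 -= lr * g1

# Low-x examples now score below 0.5, high-x examples above it.
print(sigmoid(w0 + w1 * 0.5), sigmoid(w0 + w1 * 4.0))
```

The two knobs from the paragraph above are visible here: `lr` is the learning rate, and a regularization term would be added to each gradient to keep the weights from growing overconfident.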
Softmax regression: linear models for multiple classes
Binary classification isn't enough for many products: sentiment categories, intent classes, content labels, topic routing. Softmax regression extends logistic regression to multi-class problems. It computes a score for each class, then converts scores to probabilities that sum to 1. The predicted class is the one with the highest probability.
This is a linear classifier at scale: one weight vector per class, one prediction based on weighted sums.
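A sketch of the score-to-probability step. The per-class scores are illustrative stand-ins for the weighted sums; subtracting the max score before exponentiating is a standard numerical-stability trick:

```python
import math

def softmax(scores):
    # Subtract the max score before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# One linear score per class (hypothetical values for three intent classes).
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
print(probs, probs.index(max(probs)))  # probabilities sum to 1; class 0 wins
```

The predicted class is simply the index of the largest probability — which, since softmax is monotonic, is also the class with the largest raw score.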
From horsepower to pixels: the same logic at different scales
If you're predicting fuel efficiency, a simple model uses horsepower to predict miles per gallon. If the relationship is curved, add horsepower squared or logged horsepower as features. You still have a linear model, but it fits the curve more realistically.
For image classification, the same logic scales. Each pixel value is a feature. An 8×8 image has 64 features. For each class, the model learns a set of weights across those 64 inputs and produces probabilities with softmax.
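A shape-level sketch of that setup. The pixel values and weights are random placeholders (a real model learns the weights); the point is that an image classifier is still one weight vector per class over flattened features:

```python
import random

random.seed(0)
# An 8x8 grayscale image flattened into 64 pixel features.
image = [[random.random() for _ in range(8)] for _ in range(8)]
features = [px for row in image for px in row]  # 64 numbers

n_classes = 10  # e.g. digit classes 0–9
# One weight vector (64 weights) plus a bias per class. Random values here,
# purely to show the shapes involved; training would learn them.
weights = [[random.random() for _ in range(64)] for _ in range(n_classes)]
biases = [0.0] * n_classes

scores = [b + sum(w * f for w, f in zip(ws, features))
          for ws, b in zip(weights, biases)]
print(len(features), len(scores))  # 64 features in, one score per class out
```

Feeding `scores` through softmax, exactly as in the multi-class section above, turns the ten scores into ten class probabilities.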
Different domains, same foundation: weights, loss, optimization.
Why linear models should be your starting point
The benchmark-first rule: build a linear or logistic baseline before anything else. If a complex model beats it only marginally, the added cost and risk may not justify it. This is especially true when you need real-time inference, have limited data, need explainability, or when the feature pipeline is the real bottleneck.
Interpretability: linear models are transparent. You can inspect weights, reason about directionality, and explain what features push outcomes up or down. In regulated domains or trust-sensitive products, this is a significant advantage.
Efficiency: linear models train quickly and run cheaply. Useful for edge deployment, high-throughput ranking, low-latency decisions, and rapid iteration during product discovery.
The real risk is underfitting. If the world is nonlinear and the signal requires complex interactions, a linear model will plateau. The best response usually isn't a deep neural network. It's better features, interaction terms, nonlinear transforms, or a tree-based model that captures interactions automatically.
Questions to be able to answer when using linear models
Have you built a linear or logistic baseline before escalating complexity?
Are there nonlinear patterns worth capturing with transforms (log, square, interactions) before switching algorithms?
If training performance is far better than test performance, have you tried Ridge or LASSO?
For logistic models: have you chosen a threshold based on the cost of mistakes, not just 0.5?
Have you inspected the coefficients for surprising drivers, data issues, or leakage risks?
Takeaways
Linear models are parametric by design: fast, stable, and data-efficient.
Coefficients are weights that quantify how features influence predictions.
SSE is a common objective for regression. Regularization (LASSO or Ridge) prevents overfitting.
Polynomial features and transformations let linear models fit nonlinear patterns without changing the algorithm.
Logistic regression turns linear scores into probabilities for binary classification. Softmax extends this to multiple classes.
Linear models are the baseline you should start with and the benchmark complex models must earn the right to replace.
