The Precision Trap: A Product Manager’s Guide to Machine Learning (Part III)
Outcomes versus outputs
The most important distinction in evaluation is between outcomes and outputs.
Outcomes are the real-world effects you want: cost saved, revenue generated, time reduced, risk avoided, customer experience improved. They're expressed in business language.
Outputs are what the model produces: a prediction, a probability score, a ranked list, a confidence estimate. These are evaluated using technical metrics like error rates, precision, recall, or calibration.
The rule: define the outcome first, then choose output metrics that reflect it. If you start with a popular metric because it's familiar, you often optimize the wrong thing. If you start with the business outcome, metric choices become much clearer.
A model with 99% accuracy is often a model that has failed. Not because the math is wrong, but because the metric is telling a comforting story that has nothing to do with what the product needs.
Where metrics show up in the lifecycle
Metrics aren't a final exam. They appear at three different points, each with a different purpose.
During iteration: you compare candidates and decide which direction to pursue. This uses validation sets or cross-validation.
Before launch: you evaluate once on a truly unseen test set to estimate real-world generalization. This should happen once, at the end.
After deployment: you track performance over time, detect drift, and catch degradation early.
A good offline score is a starting point, not a finish line.
Evaluating regression models
Regression predicts a continuous number: price, demand, duration, consumption, lifetime value.
The central question isn't "how wrong is it" but "what kind of wrong hurts us."
Mean Squared Error (MSE) and RMSE square the errors before averaging. Large misses get punished heavily, which means a few big errors can dominate the score. This is useful when being very wrong is disproportionately costly: underestimating demand can cause stock-outs, which is far worse than mild overestimation. RMSE brings the metric back to the original units of the target, making it easier to interpret.
Use MSE or RMSE when large outliers are particularly harmful.
Mean Absolute Error (MAE) averages the absolute distance between prediction and truth. It treats all errors linearly, is less sensitive to outliers, and is generally easier to explain.
Use MAE when you care about being consistently close and don't want a few extreme cases to dominate evaluation.
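A minimal sketch of the difference, using made-up numbers: two prediction sets have the same MAE, but RMSE flags the one whose error is concentrated in a single large miss.

```python
# Illustrative toy data, not from any real dataset.
import math

def rmse(y_true, y_pred):
    # Square errors, average, then take the root to return to original units
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    # Average absolute distance; every unit of error counts the same
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [100, 102, 98, 101, 100]
close  = [98, 104, 96, 103, 98]     # consistently off by 2
spiky  = [100, 102, 98, 101, 90]    # perfect except one miss of 10

print(mae(y_true, close), mae(y_true, spiky))    # 2.0 and 2.0 — MAE can't tell them apart
print(rmse(y_true, close), rmse(y_true, spiky))  # 2.0 vs ~4.47 — RMSE punishes the big miss
```

If the single miss of 10 is the one that causes a stock-out, RMSE is the metric that surfaces it.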
Mean Absolute Percentage Error (MAPE) expresses error as a percentage, which is useful for stakeholder communication and for comparing across targets with different scales. The caution: when the true value can be near zero, percentage errors explode and become misleading. A small absolute miss on a small true value can look like a catastrophic failure.
Use MAPE when the target never approaches zero and percent-based communication matters.
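The near-zero caution is easy to demonstrate with two invented cases that share the same absolute miss:

```python
def mape(y_true, y_pred):
    # Average absolute error as a percentage of the true value
    return 100 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Identical absolute miss of 2, very different percentage stories:
print(mape([100], [98]))   # 2.0% — looks fine
print(mape([0.5], [2.5]))  # 400.0% — a tiny true value makes a small miss look catastrophic
```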
R-squared measures how much variation in the target your model explains compared to a naive baseline that predicts the mean. It's useful for intuition but has real limits: it doesn't tell you how costly the errors are, it can look acceptable while hiding dangerous tail failures, and it can mislead outside its assumptions. Error metrics tied to actual costs tend to be more actionable in product settings.
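A sketch of the "compared to predicting the mean" framing, with invented numbers chosen to show a respectable score hiding a large tail miss:

```python
def r_squared(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model's squared error
    ss_tot = sum((t - mean) ** 2 for t in y_true)               # mean-baseline's squared error
    return 1 - ss_res / ss_tot

y_true = [100, 200, 300, 400, 500]
print(r_squared(y_true, [300] * 5))   # 0.0 — no better than always predicting the mean
print(r_squared(y_true, y_true))      # 1.0 — perfect predictions
print(r_squared(y_true, [100, 200, 300, 400, 350]))  # 0.775 — looks decent, yet the
                                                     # largest case was missed by 150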
Evaluating classification models
Classification predicts categories: fraud vs. legitimate, churn vs. not churn, safe vs. unsafe.
The reason accuracy becomes a trap is class imbalance. If 99% of cases are negative, predicting "negative" for everything gives 99% accuracy and zero utility. That's not a model. It's a mirror of the base rate.
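The trap in two lines of arithmetic, with an assumed 1% positive rate:

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# 1 positive among 100 cases; a "model" that always predicts negative:
y_true = [1] + [0] * 99
y_pred = [0] * 100
print(accuracy(y_true, y_pred))  # 0.99 — yet it catches zero positives
```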
The confusion matrix breaks outcomes into four buckets:
True positives (TP): predicted positive, actually positive
True negatives (TN): predicted negative, actually negative
False positives (FP): predicted positive, actually negative
False negatives (FN): predicted negative, actually positive
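The four buckets can be counted directly, which is all a confusion matrix is. The labels here are invented for illustration:

```python
def confusion_counts(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # caught positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly ignored negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed positives
    return tp, tn, fp, fn

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]
print(confusion_counts(y_true, y_pred))  # (2, 4, 1, 1)
```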
Once you see these counts, you can evaluate in terms of business consequences, not aggregate scores.
Recall answers: out of all real positives, how many did we catch? High recall means you miss fewer positives. This matters when false negatives are costly or dangerous.
Precision answers: out of all predicted positives, how many were correct? High precision means fewer false alarms. This matters when false positives are expensive, annoying, or harmful.
Improving one often worsens the other. That's not a flaw. It's the trade-off. Choose recall when missing a true case is unacceptable. Choose precision when acting on a false alarm creates high cost or erodes trust.
Thresholds are the hidden product decision. Most classifiers output a probability score, not a yes or no. You convert that probability into a decision using a threshold, and 0.5 is common but often wrong. The right threshold depends on the cost of false positives vs. false negatives and the capacity of downstream teams. If a fraud model triggers manual review, the threshold is constrained by review capacity. If a safety model triggers an intervention, it's constrained by risk tolerance. A model score becomes a product decision the moment you pick a threshold.
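A sketch of the mechanics, with made-up scores: the same model yields different precision/recall operating points depending only on where the threshold is set.

```python
def precision_recall(y_true, scores, threshold):
    # Convert scores to yes/no decisions at the chosen threshold
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, preds))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, preds))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, preds))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]

for thr in (0.35, 0.5, 0.75):
    p, r = precision_recall(y_true, scores, thr)
    print(thr, round(p, 2), round(r, 2))
# 0.35 → catches every positive but with more false alarms;
# 0.75 → every alert is correct but half the positives are missed
```

Nothing about the model changed between those lines; only the product decision did.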
ROC curves plot the true positive rate (recall) against the false positive rate across every possible threshold. AUROC summarizes the curve into a single number: 1.0 is perfect separation, 0.5 is random guessing. Useful for comparing model families, but it can paint an overly optimistic picture when classes are extremely imbalanced, because true negatives dominate.
Precision-recall curves are often more informative when positives are rare. They focus directly on the questions that usually matter: how many real cases do we catch, and how many of our alerts are actually correct? If your problem has very few positives, default to PR curves.
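A back-of-envelope calculation shows why: at an operating point that looks strong on a ROC curve, rarity alone can make most alerts false alarms. The rates below are assumptions for illustration.

```python
population = 1_000_000
prevalence = 0.001   # assumed: 0.1% of cases are positive
tpr = 0.80           # recall at some threshold — a good-looking ROC point
fpr = 0.01           # false positive rate at the same threshold

positives = population * prevalence   # 1,000 real positives
negatives = population - positives    # 999,000 negatives
tp = tpr * positives                  # 800 caught
fp = fpr * negatives                  # 9,990 false alarms

precision = tp / (tp + fp)
print(round(precision, 3))  # ~0.074 — more than 9 in 10 alerts are wrong
```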
Evaluation is more than a score
Strong evaluation answers questions a product team can act on. Where does the model fail? Which user segments see worse performance? Are errors concentrated in specific contexts? Is the model overconfident? Does it degrade when inputs shift?
This means error analysis, not just metrics. Inspect false positives and false negatives. Group errors by segment, device type, geography, language, tenure, or any attribute tied to product experience. Look at the hard cases that drive cost, not only average performance. If you can't explain the failure modes, you can't design guardrails.
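A minimal sketch of segment-level error analysis in plain Python; the field names and segments are hypothetical, and in practice the records would come from your evaluation set.

```python
from collections import defaultdict

records = [
    {"segment": "mobile",  "y_true": 1, "y_pred": 0},
    {"segment": "mobile",  "y_true": 1, "y_pred": 1},
    {"segment": "desktop", "y_true": 0, "y_pred": 0},
    {"segment": "desktop", "y_true": 1, "y_pred": 1},
    {"segment": "desktop", "y_true": 0, "y_pred": 0},
]

errors = defaultdict(lambda: [0, 0])  # segment -> [error count, total count]
for r in records:
    errors[r["segment"]][1] += 1
    if r["y_true"] != r["y_pred"]:
        errors[r["segment"]][0] += 1

for seg, (err, total) in errors.items():
    print(seg, err / total)  # errors concentrate in one segment, invisible in the aggregate
```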
Choosing metrics based on the cost of being wrong
For regression: if one large miss causes a major operational failure, prefer MSE or RMSE. If consistent small misses create customer dissatisfaction, prefer MAE. If stakeholder alignment matters and the domain supports it, use MAPE carefully.
For classification: in turbulence prediction, a false negative is a safety risk while a false positive causes a minor reroute, so you optimize for recall and accept more false positives. In spam detection for a business inbox, false positives hide legitimate messages and break trust, so you prioritize precision and build additional safety layers before filtering aggressively.
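One way to make that reasoning explicit is to price the confusion matrix instead of scoring it. The counts and costs below are invented to illustrate the comparison:

```python
def total_cost(fp, fn, cost_fp, cost_fn):
    # Weight each error type by its assumed business cost
    return fp * cost_fp + fn * cost_fn

# Two candidate operating points for a safety-style model:
high_recall    = {"fp": 50, "fn": 1}   # catches nearly everything, many false alarms
high_precision = {"fp": 2,  "fn": 20}  # few alarms, more misses

cost_fp, cost_fn = 10, 1_000  # assumed: a miss is 100x worse than a false alarm
print(total_cost(**high_recall, cost_fp=cost_fp, cost_fn=cost_fn))     # 1500
print(total_cost(**high_precision, cost_fp=cost_fp, cost_fn=cost_fn))  # 20020
```

Flip the cost ratio, as in the spam example, and the preferred operating point flips with it.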
The right metric is the one that matches the actual cost curve of mistakes in your product.
When reframing beats tuning
A team trying to help electric utilities prepare for severe weather starts with regression: predict the exact number of outages per town. The model struggles. Predicting exact counts is hard, and small differences are operationally meaningless while still penalizing the model heavily.
After customer conversations, the team reframes: utilities don't need exact outage counts. They need a severity level to allocate crews and equipment. The target becomes classification: predict severity on a 1 to 5 scale.
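The reframing itself can be as simple as bucketing the raw target; the cut points below are hypothetical stand-ins for ones the utilities would define operationally.

```python
def severity(outages):
    # Assumed cut points between severity levels 1-5
    thresholds = [5, 20, 50, 150]
    level = 1
    for t in thresholds:
        if outages > t:
            level += 1
    return level

print(severity(3), severity(30), severity(400))  # 1 3 5
```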
Performance improves not because the algorithm changed, but because the task definition matched the real outcome. Evaluation did its job: it forced alignment between output metrics and product decisions.
Troubleshooting order
When a model underperforms, debug in this order. The first two categories can make any algorithm look bad, so start there.
Problem framing and metric choice: are you solving the right problem, and measuring success in a way that reflects business value?
Data quantity and quality: do you have enough labeled examples, are labels consistent, are missing values or schema changes distorting training?
Feature design and availability: are you missing key signals, are features available at prediction time, is there leakage?
Model fit and complexity: underfitting because the model is too simple, or overfitting because it's too complex?
Irreducible error: some problems have a hard ceiling due to noise. If the signal is weak, the maximum achievable performance might be lower than stakeholders expect.
Takeaways
Start with outcomes, then pick model metrics that reflect real-world costs.
For regression: MSE or RMSE when large misses are especially costly, MAE when consistent closeness matters, MAPE carefully when targets don't approach zero.
For classification: accuracy misleads under imbalance. Use confusion matrices, precision, and recall to understand error trade-offs.
Threshold selection is a product decision. It should reflect business costs and operational capacity.
ROC curves help compare models broadly; precision-recall curves are better when positives are rare.
Troubleshoot in order: framing, data, features, fit, then accept irreducible noise.
