The Precision Trap: A Product Manager’s Guide to Machine Learning (Part III)
Dec 4, 2025
In machine learning, a model with 99% accuracy is often a model that has failed.
Not because the math is wrong, but because the metric is telling you a comforting story. One that has little to do with what your product needs.
Evaluation is the moment where a prototype either becomes a dependable capability or gets exposed as a lab experiment. It is where you learn whether your model solves the business problem, or whether it simply learned patterns that do not translate into value, safety, or trust.
To do evaluation well, you have to look past single scores and understand trade-offs: what kinds of mistakes your model makes, how expensive they are, who they affect, and how performance holds up once the world changes.
The core idea: outcomes versus outputs
The most important distinction in evaluation is between outcomes and outputs.
Outcomes are the real-world effects you want:
Cost saved
Revenue generated
Time reduced
Risk avoided
Customer experience improved
Outcomes are expressed in business language. They describe impact.
Outputs are what the model produces:
A prediction (number or category)
A probability score
A ranked list
A confidence estimate
Outputs are evaluated using technical metrics such as error rates, precision, recall, or calibration.
The rule that keeps teams honest is simple:
Define the outcome first, then choose output metrics that reflect it.
If you start with a metric because it is popular, you often optimize the wrong thing. If you start with the business outcome, your metric choices become much clearer.
Where metrics show up in the lifecycle
Metrics are not a final exam. They are used at three points, and each has a different goal.
Model selection during iteration
You compare candidates and decide which direction is best. This often uses validation sets or cross-validation.
Final evaluation before launch
You evaluate once on a truly unseen test set to estimate real-world generalization.
Monitoring after deployment
You track performance over time, detect drift, and catch degradation early.
If you only think about metrics at step two, you will miss the operational reality. A good offline score is a starting point, not a finish line.
Evaluating regression models
Regression is used when you predict a continuous number: price, demand, duration, consumption, value.
The central question is not “how wrong is it,” but “what kind of wrong hurts us.”
Mean Squared Error and RMSE
Mean Squared Error (MSE) squares the errors before averaging them.
What that means in practice:
Large misses get punished heavily.
A few big errors can dominate the score.
This is useful when being very wrong is disproportionately costly. For example, underestimating demand can cause stock-outs, which can be far worse than mild overestimation.
Because MSE is in squared units, teams often use Root Mean Squared Error (RMSE), which brings the metric back to the original units of the target.
Use MSE or RMSE when large outliers are particularly harmful.
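As a rough sketch in Python (using NumPy, with made-up numbers), here is how a single large miss dominates MSE and RMSE:

```python
import numpy as np

y_true = np.array([100, 102, 98, 101, 99], dtype=float)
y_pred = np.array([101, 100, 99, 100, 60], dtype=float)  # one large miss

errors = y_pred - y_true
mse = np.mean(errors ** 2)   # squaring lets the big miss dominate
rmse = np.sqrt(mse)          # back in the original units of the target

print(f"MSE:  {mse:.1f}")    # ~305.6, driven almost entirely by the single 39-unit miss
print(f"RMSE: {rmse:.1f}")   # ~17.5, even though four of five predictions were nearly perfect
```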
Mean Absolute Error
Mean Absolute Error (MAE) averages the absolute distance between prediction and truth.
What that means in practice:
It treats all errors linearly.
It is less sensitive to outliers than MSE.
It is often easier to interpret.
Use MAE when you care about being consistently close and you do not want a few extreme cases to dominate the evaluation.
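Using the same made-up numbers, a sketch of how MAE reflects the typical miss rather than the worst one:

```python
import numpy as np

y_true = np.array([100, 102, 98, 101, 99], dtype=float)
y_pred = np.array([101, 100, 99, 100, 60], dtype=float)  # same data as the MSE sketch

mae = np.mean(np.abs(y_pred - y_true))
print(f"MAE: {mae:.1f}")  # ~8.8: the outlier still matters, but it no longer dominates
```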
Mean Absolute Percent Error
Mean Absolute Percent Error (MAPE) expresses error as a percentage.
What it is good for:
Communicating to stakeholders who reason in percentages.
Comparing across targets with different scales, in some cases.
The caution:
When the true value can be near zero, percentage errors can explode and become misleading.
A small absolute miss on a small true value can look like a huge failure.
Use MAPE when the target never approaches zero and when percent-based communication matters, but treat it carefully.
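A minimal sketch of the near-zero pitfall, with invented values:

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percent Error, in percent."""
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Targets far from zero: the percentage reads sensibly
print(mape(np.array([200.0, 210.0]), np.array([190.0, 220.0])))  # ~4.9%

# One near-zero true value: a tiny absolute miss explodes the score
print(mape(np.array([0.1, 210.0]), np.array([1.1, 220.0])))      # ~502%
```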
R-squared: what it says and what it does not
R-squared is often used to communicate how much variation in the target your model explains compared to a naive baseline that predicts the mean.
It can be helpful for intuition, but it has limitations:
It does not tell you how costly the errors are.
It can look acceptable while hiding dangerous tail failures.
It can be misleading when used outside its assumptions.
In product settings, error metrics tied to actual costs tend to be more actionable than R-squared alone.
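A sketch of the mean-baseline comparison behind R-squared, using invented numbers where a respectable score still hides one painful tail miss:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """1 minus (model error / error of a baseline that always predicts the mean)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

y_true = np.array([10.0, 12.0, 9.0, 50.0])   # one extreme case in the tail
y_pred = np.array([11.0, 11.0, 10.0, 35.0])  # the model misses that case badly

print(f"R-squared:  {r_squared(y_true, y_pred):.2f}")             # ~0.81 looks healthy...
print(f"Worst miss: {np.max(np.abs(y_true - y_pred)):.0f} units")  # ...but the 15-unit tail miss may be what actually hurts
```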
Evaluating classification models
Classification is used when you predict categories: fraud vs legitimate, churn vs not churn, safe vs unsafe, severity levels.
The reason accuracy becomes a trap is class imbalance.
If 99% of cases are negative, then predicting “negative” for everything gives you 99% accuracy and 0% utility. That is not a strong model. It is a mirror of the base rate.
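A quick sketch of that trap, with simulated labels at a 1% positive rate:

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)  # ~1% positives
y_pred = np.zeros_like(y_true)                    # a "model" that always says negative

accuracy = np.mean(y_pred == y_true)
caught = np.sum((y_pred == 1) & (y_true == 1))
recall = caught / np.sum(y_true == 1)

print(f"Accuracy: {accuracy:.1%}")  # ~99%, yet the model does nothing
print(f"Recall:   {recall:.1%}")    # 0.0%: it never catches a single positive
```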
The confusion matrix: your evaluation foundation
A confusion matrix breaks outcomes into four buckets:
True positives (TP): predicted positive, actually positive
True negatives (TN): predicted negative, actually negative
False positives (FP): predicted positive, actually negative
False negatives (FN): predicted negative, actually positive
Once you see these counts, you can evaluate the model in terms of business consequences, not just aggregate scores.
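A sketch of computing the four buckets directly, with toy labels:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 0, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # caught real positives
tn = np.sum((y_pred == 0) & (y_true == 0))  # correctly ignored negatives
fp = np.sum((y_pred == 1) & (y_true == 0))  # false alarms
fn = np.sum((y_pred == 0) & (y_true == 1))  # missed real positives

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")   # TP=2 TN=3 FP=1 FN=2
```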
Precision, recall, and why they fight each other
Recall answers: out of all real positives, how many did we catch?
High recall means you miss fewer positives.
This matters when false negatives are costly or dangerous.
Precision answers: out of all predicted positives, how many were correct?
High precision means fewer false alarms.
This matters when false positives are expensive, annoying, or harmful.
Improving one often worsens the other. That is not a flaw. It is the trade-off.
A practical way to frame it for product decisions:
Choose recall when missing a true case is unacceptable.
Choose precision when acting on a false alarm creates high cost or erodes trust.
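Using the toy counts from the confusion matrix sketch above, precision and recall are just two different ratios:

```python
tp, tn, fp, fn = 2, 3, 1, 2  # counts from the toy confusion matrix above

precision = tp / (tp + fp)   # of everything we flagged, how much was real?
recall = tp / (tp + fn)      # of everything real, how much did we catch?

print(f"Precision: {precision:.0%}")  # 67%: one in three alerts is a false alarm
print(f"Recall:    {recall:.0%}")     # 50%: half the real positives slip through
```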
Thresholds: the hidden product decision
Most classifiers output a probability score, not a yes or no.
You convert that probability into a decision using a threshold:
A threshold of 0.5 is common, but it is often wrong for real products.
The right threshold depends on the cost of FP vs FN and the capacity of downstream teams.
If a fraud model triggers manual review, the threshold is constrained by review capacity. If a safety model triggers an intervention, the threshold is constrained by risk tolerance.
A model score becomes a product decision the moment you pick a threshold.
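A sketch of that decision, assuming you already have model scores (the numbers here are invented):

```python
import numpy as np

scores = np.array([0.95, 0.80, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10])  # hypothetical model scores
y_true = np.array([1,    1,    0,    1,    0,    1,    0,    0])

for threshold in (0.5, 0.3):
    y_pred = (scores >= threshold).astype(int)
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")

# threshold=0.5: false positives=1, false negatives=1
# threshold=0.3: false positives=2, false negatives=0  (more alerts, fewer misses)
```

Lowering the threshold catches more real cases at the price of more alerts; where you land depends on what downstream teams and users can absorb.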
ROC curves and AUROC
An ROC curve plots recall (the true positive rate) against the false positive rate at every possible threshold.
AUROC summarizes that curve into a single number:
1.0 is perfect separation
0.5 is random guessing
ROC curves are useful for comparing model families, but they can paint an overly optimistic picture when classes are extremely imbalanced, because true negatives can dominate.
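A sketch using scikit-learn (assuming it is available), reusing the invented scores from the threshold example:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

scores = np.array([0.95, 0.80, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10])
y_true = np.array([1,    1,    0,    1,    0,    1,    0,    0])

fpr, tpr, thresholds = roc_curve(y_true, scores)  # one point per candidate threshold
auroc = roc_auc_score(y_true, scores)

for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}  recall={t:.2f}")
print(f"AUROC: {auroc:.2f}")  # ~0.81 for this toy data; 1.0 is perfect, 0.5 is guessing
```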
Precision-Recall curves for imbalanced problems
When positives are rare, a precision-recall curve is often more informative.
It focuses on precision and recall directly, which is usually closer to the product reality:
How many real cases do we catch?
How many of our alerts are actually correct?
If your problem has very few positives, default to PR curves for a clearer view of trade-offs.
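A sketch of a precision-recall curve on an invented rare-positive example, again assuming scikit-learn:

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

scores = np.array([0.9, 0.7, 0.6, 0.5, 0.4, 0.35, 0.3, 0.2, 0.15, 0.1])  # hypothetical scores
y_true = np.array([1,   0,   1,   0,   0,   0,    0,   0,   0,    0])    # only 2 positives in 10

precision, recall, thresholds = precision_recall_curve(y_true, scores)
ap = average_precision_score(y_true, scores)  # a single-number summary of the PR curve

# Each (recall, precision) pair corresponds to a threshold you could actually ship
for p, r in zip(precision, recall):
    print(f"recall={r:.2f}  precision={p:.2f}")
print(f"Average precision: {ap:.2f}")  # ~0.83 for this toy data
```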
Interpretation: performance is not just a score
Evaluation should answer questions a product team can act on:
Where does the model fail?
Which user segments see worse performance?
Are errors concentrated in specific contexts?
Is the model overconfident?
Does the model degrade when inputs shift?
A strong evaluation includes error analysis, not just metrics:
Inspect false positives and false negatives.
Group errors by segment, device type, geography, language, tenure, or any attribute tied to product experience.
Look at the “hard cases” that drive cost, not only average performance.
If you cannot explain the failure modes, you cannot design guardrails.
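A sketch of segment-level error analysis with pandas; the column names and segments are hypothetical stand-ins for whatever attributes matter in your product:

```python
import pandas as pd

# Hypothetical evaluation frame: one row per prediction
df = pd.DataFrame({
    "segment": ["mobile", "mobile", "desktop", "desktop", "desktop", "mobile"],
    "y_true":  [1, 0, 1, 0, 1, 1],
    "y_pred":  [0, 0, 1, 0, 1, 0],
})

# Flag the costly error type you care about, then see where it concentrates
df["false_negative"] = (df["y_true"] == 1) & (df["y_pred"] == 0)
print(df.groupby("segment")["false_negative"].mean())
# desktop    0.000000
# mobile     0.666667   <- the misses cluster in one segment
```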
Product implications: choose metrics based on the cost of being wrong
Metrics are not neutral. They encode trade-offs.
Regression trade-off example
If one large miss causes a major operational failure, prefer MSE or RMSE.
If consistent small misses create customer dissatisfaction, prefer MAE.
If the goal is stakeholder alignment and the domain supports it, use MAPE carefully.
Classification trade-off example
In turbulence prediction, a false negative can be a safety risk, while a false positive might cause a minor reroute. In that case, you likely optimize for recall and accept more false positives.
In spam detection for a business inbox, false positives can hide legitimate messages and break trust. In that case, you likely prioritize precision and build additional safety layers before filtering aggressively.
The right metric is the one that matches the actual cost curve of mistakes in your product.
A practical troubleshooting playbook
When a model underperforms, the best teams debug in a specific order. They start with the most fundamental causes first.
Problem framing and metric choice
Are you solving the right problem?
Are you measuring success in a way that reflects business value?
Are you optimizing a proxy that is misaligned with the outcome?
Data quantity and quality
Do you have enough labeled examples?
Are labels consistent?
Are missing values, outliers, or schema changes distorting training?
Feature design and availability
Are you missing key signals?
Are features available at prediction time?
Are you accidentally leaking future information?
Model fit and complexity
Are you underfitting because the model is too simple?
Are you overfitting because it is too complex?
Have you tried reasonable baselines and alternatives?
Inherent error
Some problems have a hard ceiling due to noise. If the signal is weak, the maximum achievable performance might be lower than stakeholders expect.
This order matters because the first two categories can make any algorithm look bad.
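One concrete check that cuts across the first four steps is comparing your model against a naive baseline. A minimal sketch with scikit-learn's DummyClassifier, on synthetic data that stands in for your own:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: replace with your own features and labels
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + rng.normal(scale=2.0, size=1000) > 1.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression().fit(X_train, y_train)

# If the trained model barely beats the dummy, suspect framing, data, or features
# before blaming the algorithm; or accept that the signal may simply be weak.
print("Baseline recall:", recall_score(y_test, baseline.predict(X_test)))
print("Model recall:   ", recall_score(y_test, model.predict(X_test)))
```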
Case study: when reframing beats tuning
Imagine a team trying to help electric utilities prepare for severe weather.
The initial framing is regression: predict the exact number of outages per town.
The model struggles. Predicting exact counts is difficult, and small differences can be operationally meaningless while still penalizing the model heavily.
After customer conversations, the team reframes the need:
Utilities do not need exact outage counts. They need a severity level to allocate crews and equipment.
So the target becomes classification: predict severity on a 1 to 5 scale.
Now evaluation aligns with decisions:
Is the severity band correct?
How often do we under-predict severe events?
How many false alarms can operations tolerate?
Performance becomes better not because the algorithm changed, but because the task definition matched the real outcome.
That is evaluation doing its job: forcing alignment between output metrics and product decisions.
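A sketch of what that reframing can look like in code: mapping raw outage counts into severity bands, with invented band edges standing in for thresholds the utility would actually agree to:

```python
import numpy as np

# Hypothetical outage counts per town (predicted or observed)
outage_counts = np.array([0, 3, 12, 45, 180, 900])

# Hypothetical band edges separating severity levels 1 (minor) through 5 (extreme)
band_edges = [5, 25, 100, 500]
severity = np.digitize(outage_counts, band_edges) + 1

print(severity)  # [1 1 2 3 4 5]
```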
Takeaways
Start with outcomes, then pick model metrics that reflect real-world costs.
For regression, choose MSE or RMSE when large misses are especially costly; choose MAE when consistent closeness matters; use MAPE carefully when targets do not approach zero.
For classification, accuracy can be misleading under imbalance. Use confusion matrices, precision, and recall to understand error trade-offs.
Threshold selection is a product decision. It should reflect business costs and operational capacity.
ROC curves help compare models broadly; precision-recall curves are often better when positives are rare.
Troubleshoot in order: framing, data, features, fit, then accept irreducible noise.

