Why Great ML Products Start With Better Questions: A Product Manager’s Guide to Machine Learning (Part II)

Dec 3, 2025

In the rush to “do AI,” many teams treat machine learning as a math problem or a shopping exercise: pick an algorithm, train it, ship it.


That is rarely where success or failure is decided.


The difference between a model that creates durable business value and one that collapses in production usually comes down to process. Not the algorithm. Not the tooling. Process.


Model development is an iterative lifecycle that starts before the first training run and continues long after launch. It forces you to make trade-offs between data, complexity, speed, explainability, and risk. If you get those trade-offs wrong, you can end up with a model that looks great in a notebook and fails quietly in the real world.




The core idea: a model is an approximation, not the truth


A machine learning model is a mathematical approximation of a relationship in the world:


y = f(X) + ϵ

  • y is the target you want to predict or decide.

  • X is the set of features you use as inputs.

  • ϵ is irreducible noise: randomness, missing context, measurement error, and factors you cannot observe.


Two implications matter for product teams:

  1. The goal is not perfection. The goal is the best usable approximation under real constraints.

  2. Some error is permanent. You cannot road-map your way to 100 percent accuracy if the problem contains noise.
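A tiny simulation makes the second implication concrete (synthetic data; numpy assumed): even if you knew the true f exactly, your error could not fall below the noise floor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated world: y = f(X) + noise, where f is known only because we wrote it.
X = rng.uniform(0, 10, size=10_000)
noise = rng.normal(0, 1.0, size=X.shape)   # irreducible error, std = 1.0
y = 2.0 * X + 3.0 + noise

# Even the *perfect* model f(X) = 2X + 3 cannot beat the noise floor.
perfect_predictions = 2.0 * X + 3.0
mse = np.mean((y - perfect_predictions) ** 2)
print(f"MSE of the true f: {mse:.2f}")  # ~1.0 = the noise variance, not 0
```

No amount of modeling effort moves that number toward zero; only changing what you can observe does.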


The modeling process exists to find the best approximation that is good enough to ship, safe enough to trust, and cheap enough to run.




The lifecycle: a disciplined loop, not a straight line


A useful way to think about model development is as a six-phase loop. It is not a waterfall. It is normal to revisit earlier phases after you learn something during evaluation or deployment.


Phase 1: Business Understanding


Most ML projects fail here, often politely.


“Predict churn” is not a target. It is a wish.


A target becomes real only when it is measurable:

  • What is the exact definition of churn?

  • Over what time window?

  • For which segment?

  • What actions will the product take based on the prediction?

  • What is the cost of false positives and false negatives?


A strong framing includes:

  • Decision and action: What will change in the product if the model is right?

  • Constraints: latency, cost per prediction, privacy, regulatory requirements, user trust.

  • Success metrics: both model metrics (like precision/recall) and product metrics (like retention lift, time saved, error reduction).

  • Baseline: what happens today without ML, or with a simple heuristic.


If you cannot articulate the baseline, you cannot quantify value.
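As a sketch of what "measurable" means here, this is one hypothetical churn definition in pandas — the 30-day windows, the eligibility rule, and the label encoding are all assumptions you would replace with your own answers to the questions above:

```python
import pandas as pd

# Hypothetical definition (one of many possible):
#   churn = no activity in the 30 days after a cutoff date,
#   among users who were active in the 30 days before it.
events = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "event_date": pd.to_datetime(
        ["2025-01-05", "2025-02-10", "2025-01-20", "2025-01-15", "2025-01-28"]
    ),
})
cutoff = pd.Timestamp("2025-02-01")

before = events[(events.event_date >= cutoff - pd.Timedelta(days=30))
                & (events.event_date < cutoff)]
after = events[(events.event_date >= cutoff)
               & (events.event_date < cutoff + pd.Timedelta(days=30))]

eligible = set(before.user_id)   # active in the window before the cutoff
retained = set(after.user_id)    # active in the window after it
labels = {u: int(u not in retained) for u in eligible}  # 1 = churned
print(labels)
```

Every choice in that snippet — window length, eligibility, what counts as activity — changes who the model learns about, which is why it belongs in business understanding, not in a notebook footnote.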


Phase 2: Data Understanding


Here you map what exists versus what the problem needs.


  • What data sources do you have?

  • How complete are they?

  • Where are the blind spots?

  • How stable are definitions over time?

  • What biases are baked into collection?


This phase is where many teams discover that their “labels” are proxies, and proxies come with compromises.


Phase 3: Data Preparation


This is where most real effort tends to land.


Data preparation includes:

  • Cleaning duplicates and missing values

  • Creating consistent identifiers across systems

  • Joining events, users, accounts, sessions, devices

  • Building features that are available at prediction time

  • Setting up reproducible pipelines so training datasets can be rebuilt


A simple litmus test: if you cannot reproduce a training dataset tomorrow from raw sources, you do not have a product-grade pipeline yet. You have a demo.
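One way to operationalize that litmus test (a sketch with a made-up schema, not a prescribed pipeline design): make the dataset build a pure function of raw data plus configuration, and fingerprint the output so reproducibility can be checked automatically.

```python
import hashlib
import json

# Hypothetical config: everything that shapes the dataset lives here, versioned.
CONFIG = {"snapshot_date": "2025-02-01", "label_window_days": 30, "version": "v3"}

def build_training_rows(raw_rows, config):
    """Deterministic transform: same raw data + same config -> same dataset."""
    return [
        {"user_id": r["user_id"], "n_events": r["n_events"],
         "label": int(r["n_events"] == 0)}
        for r in sorted(raw_rows, key=lambda r: r["user_id"])  # stable ordering
    ]

def dataset_fingerprint(rows):
    """Hash the built dataset so a rebuild can be verified in CI."""
    return hashlib.sha256(json.dumps(rows, sort_keys=True).encode()).hexdigest()[:12]

raw = [{"user_id": 2, "n_events": 0}, {"user_id": 1, "n_events": 5}]
rows = build_training_rows(raw, CONFIG)
print(dataset_fingerprint(rows))
```

If tomorrow's rebuild produces a different fingerprint from the same raw snapshot and config, something in the pipeline is non-deterministic — and that is worth knowing before launch, not after.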


Phase 4: Modeling


Only now are you truly ready to train.


This phase typically looks like:

  • Establish a baseline model

  • Train candidate model families

  • Tune hyperparameters

  • Compare performance using validation data

  • Iterate based on failure analysis (what it gets wrong and why)


This is also where you decide how much complexity you can afford, technically and operationally.
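A minimal version of the baseline-then-candidate loop, using scikit-learn on synthetic data (the model choices here are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_informative=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Step 1: a trivial baseline sets the floor any candidate must beat.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Step 2: a candidate model family, compared on the same validation data.
candidate = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

print(f"baseline val accuracy:  {baseline.score(X_val, y_val):.2f}")
print(f"candidate val accuracy: {candidate.score(X_val, y_val):.2f}")
```

The gap between those two numbers, not the candidate's score alone, is what tells you whether the problem is learnable and whether the added complexity is earning its keep.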


Phase 5: Evaluation


Evaluation is where you decide whether the model is good enough for the intended use.


That includes:

  • Offline metrics on unseen data

  • Error analysis by segment (who is it worse for?)

  • Robustness checks (what happens when inputs drift?)

  • Risk checks (does it create harmful failure modes?)

  • Calibration and threshold setting (when do you take action?)


A common structure is:

  • Training set: fit model parameters

  • Validation set: tune and compare versions

  • Test set: evaluate once, at the end, as your clean estimate of real-world performance
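With scikit-learn, that structure is two successive splits — the test set is carved out first and left untouched (synthetic data for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 4)), rng.integers(0, 2, size=1000)

# First carve out the test set, then do not look at it again until the end.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Then split the remainder into training (fit) and validation (tune/compare).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

The ordering matters: any decision made after peeking at the test set quietly turns it into a second validation set, and you lose your clean estimate.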


The biggest pitfall here is data leakage. If information that would not exist at prediction time slips into training or selection, results will look great and collapse later.


If performance feels suspiciously high, assume leakage until proven otherwise.


Phase 6: Deployment


Deployment is not “push to prod.” It is the beginning of running an ML system.


A production deployment needs:

  • Inference infrastructure (batch or real-time)

  • Monitoring for data drift and prediction drift

  • Feedback loops for outcomes and labels

  • Alerting and rollback plans

  • A retraining strategy and ownership model


The world changes. If you do not plan for drift, the model will decay quietly while dashboards keep looking fine.
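As one concrete monitoring signal, many teams track the population stability index (PSI) per feature — comparing live input distributions against the training-time distribution. A sketch (the thresholds in the comment are a common rule of thumb, not a standard):

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI for one feature: how far the live distribution has drifted.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 investigate."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # catch out-of-range values
    e = np.histogram(expected, bins=edges)[0] / len(expected)
    a = np.histogram(actual, bins=edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
train_feature = rng.normal(0, 1, 10_000)   # distribution at training time
live_feature = rng.normal(1.0, 1, 10_000)  # the world has shifted

print(f"PSI: {population_stability_index(train_feature, live_feature):.2f}")
```

A check like this runs on inputs alone, so it fires before outcome labels arrive — often your earliest warning that the model is decaying.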




The four components every model depends on


Once the problem and data are in decent shape, model development still rests on four pillars.


1) Features


Features are the measurable signals you feed into the model.


Good features have three qualities:

  • They are available at prediction time.

  • They are stable enough to generalize.

  • They have a plausible connection to the target.


2) Algorithm family


This is the general model template: linear models, trees, gradient boosting, neural networks.

The right choice depends on the data type, constraints, and the kind of errors you can tolerate.


3) Hyperparameters


These control model behavior and complexity: depth of trees, regularization strength, learning rate, number of layers.

Hyperparameters are not decoration. They define where you sit on the bias-variance spectrum.


4) Loss function


The loss function encodes what “wrong” means during training.

If your loss does not align with business costs, you can optimize the model and still hurt the product.
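One lightweight way to pull training loss toward business costs — shown here via scikit-learn's class_weight, with a made-up 9:1 cost ratio standing in for costs you would estimate, not guess — is to weight the costlier error class more heavily:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Imbalanced problem where missing a positive (say, a churner) is costly.
X, y = make_classification(n_samples=4000, weights=[0.9, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# class_weight tells the loss that a missed positive costs ~9x a missed negative.
costed = LogisticRegression(max_iter=1000, class_weight={0: 1, 1: 9}).fit(X_train, y_train)

r_plain = recall_score(y_val, plain.predict(X_val))
r_costed = recall_score(y_val, costed.predict(X_val))
print(f"recall, plain loss:  {r_plain:.2f}")
print(f"recall, costed loss: {r_costed:.2f}")
```

The weighted model trades some precision for recall — exactly the kind of trade-off that should be decided by the business framing from Phase 1, not left to a default loss.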




Feature selection: high leverage, easy to get wrong


Feature selection is often the largest driver of performance and the biggest source of unforced errors.


Four practical ways teams find strong features:

  1. Domain expertise
    Talk to the people who know the system. Ask what changes before the target changes.

  2. Visualization
    Plot candidate features against the target and look for structure: separation, monotonic trends, thresholds.

  3. Statistics
    Correlations and related tests help rank candidates early, especially when you have many options.

  4. Model-based inspection
    Train a reasonable baseline model and inspect which signals matter, then iterate.
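Methods 3 and 4 can be sketched in a few lines on synthetic data (the feature names and effect sizes are invented for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000

# Synthetic candidates: one real signal, one weak signal, one pure noise.
signal, weak, noise = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
y = 3.0 * signal + 0.5 * weak + rng.normal(size=n)
X = np.column_stack([signal, weak, noise])
names = ["signal", "weak", "noise"]

# Method 3, statistics: rank candidates by |correlation| with the target.
corrs = [abs(np.corrcoef(X[:, i], y)[0, 1]) for i in range(X.shape[1])]
print("correlation ranking:", sorted(zip(names, corrs), key=lambda t: -t[1]))

# Method 4, model-based inspection: a baseline model's importances tell a
# similar story, and can also surface non-linear signals correlation misses.
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
print("importances:", dict(zip(names, model.feature_importances_.round(2))))
```

In practice you would run both: correlation is cheap for a first cut, and model-based importance catches interactions and non-linear effects that a single correlation cannot.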


A key nuance: “more features” is not always better.

  • Too few features increase bias and can make the task unsolvable.

  • Too many careless features increase variance, add noise, and raise leakage risk.


A pragmatic approach: start with a broad but defensible set, then prune aggressively once you understand what is real signal and what is noise.




Complexity and the bias-variance trade-off


Model error is not one thing. It is a blend of:

  • Bias: error from being too simplistic.

  • Variance: error from being too sensitive to noise in the training data.

  • Irreducible error: noise you cannot eliminate.


In practice:

  • High bias (underfitting): consistently wrong in the same way.

  • High variance (overfitting): great on training data, disappointing in the real world.


Your job is not to maximize complexity. It is to find the sweet spot where performance generalizes and operational costs stay sane.
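The trade-off is easy to see with polynomial degree as the complexity knob (synthetic data; the degrees are chosen for illustration): too little capacity underfits, too much fits the noise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, 60)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.3, 60)   # curved truth + irreducible noise
X_train, y_train = X[::2], y[::2]                # alternate points: 30 for training,
X_val, y_val = X[1::2], y[1::2]                  # 30 for validation

results = {}
for degree in (1, 4, 20):                        # too simple, about right, too flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    results[degree] = (model.score(X_train, y_train), model.score(X_val, y_val))
    print(f"degree {degree:2d}: train R2 {results[degree][0]:5.2f}, "
          f"val R2 {results[degree][1]:5.2f}")
```

Watch the gap, not just the scores: high bias shows up as both numbers being low, while high variance shows up as a strong training score with a large drop on validation.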




Product implications: why “better model” can mean “worse product”


A model with slightly higher offline accuracy can still be a poor business choice if:

  • it costs far more to run,

  • it is too slow for your UX,

  • it cannot be explained when users demand reasons,

  • it increases support burden,

  • it introduces unacceptable risk.


The “best” model is the one that creates product value reliably, within cost, latency, and trust requirements.




A practical playbook for building models that ship


  • Define success early: Pick model metrics that connect to product outcomes. Decide how you will measure lift.

  • Write down the decision rule: What action happens at what threshold? Who is affected? What is the rollback plan?

  • Start with a baseline: A simple heuristic or a linear model baseline clarifies whether the problem is learnable.

  • Treat data work as product work: Invest in labels, instrumentation, and reproducible pipelines.

  • Protect the test set: Do not touch it during feature engineering or model selection.

  • Watch for overfitting: Large gaps between training and validation performance usually mean you need simpler models or stronger regularization.

  • Choose complexity that matches constraints: Make latency and cost first-class requirements, not afterthoughts.

  • Plan for monitoring: Track drift, performance by segment, and outcome metrics, then retrain intentionally.




Takeaways


  • Modeling success is primarily process, not algorithm choice.

  • A model is an approximation of reality, and irreducible noise is always present.

  • Feature work is often the highest leverage step and the easiest place to make costly mistakes.

  • Evaluation discipline matters: keep training, validation, and test strictly separated and guard against leakage.

  • Deployment turns a model into a system: monitoring, drift, retraining, and ownership determine long-term value.
