The Strategy of Patterns: A Product Manager’s Guide to Machine Learning (Part V)

Dec 6, 2025

Linear models are powerful, but the real world rarely behaves linearly.


Many product problems are conditional. A feature matters only in certain contexts. A pattern flips depending on segment, season, or user intent. In these situations, forcing the world into a single weighted sum can feel like arguing with reality.


Tree-based models solve this by adapting their shape to the data instead of fitting a fixed template. And when you have data without labels, clustering helps you organize the mess into usable structure.


Decision trees, ensemble methods, and K-means clustering form a practical toolkit for navigating nonlinearity, uncertainty, and unlabeled data.




The core idea: two shifts that unlock new capabilities


This overview is really about two conceptual moves.


Shift 1: from parametric to non-parametric supervised learning


Parametric models assume a fixed functional form, such as a single weighted sum, and learn the weights within it.

Non-parametric models like decision trees are different. They are not locked into one equation. They grow in complexity based on what the data demands, which makes them effective for nonlinear relationships and feature interactions.


Shift 2: from supervised to unsupervised learning


Supervised learning has labels. You know the “right answer” for historical examples.

Unsupervised learning does not. The goal is not prediction, but organization: finding structure, groups, and anomalies that help humans and systems make sense of the data.


A mature ML strategy knows when to use each approach, and what each approach costs in terms of interpretability, stability, and computation.




Part I: Decision trees


A decision tree is a model that predicts by asking a sequence of questions.

Each question splits the dataset into smaller groups. You keep splitting until you reach a final decision.


The anatomy of a tree


  • Nodes: where a question is asked (example: “Is age > 30?”)

  • Branches: the path taken based on the answer

  • Leaves: the endpoints, where the final prediction is produced


This structure makes trees intuitive. You can trace a single prediction end-to-end and see which conditions drove the outcome.


How trees choose splits: impurity and information gain


Training a tree is mostly about one decision repeated many times: what question should I ask next?

Trees choose splits that reduce impurity.


Impurity is a measure of how mixed the labels are at a node (Gini impurity and entropy are the two most common measures):

  • If a node is 50% “yes” and 50% “no,” impurity is at its maximum.

  • If a node becomes almost all “yes,” impurity drops.


Information gain expresses the improvement from a split:


Information Gain = Impurity(parent) - Weighted Impurity(children)


The algorithm tries many candidate splits and selects the one with the highest information gain, meaning the split that creates the cleanest separation. The children's impurity is weighted by the share of training samples that fall into each child, so a clean but tiny child cannot dominate the score.
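
As a rough illustration, here is a minimal sketch of how one candidate split might be scored, using Gini impurity as the mixedness measure. The labels and the split itself are made up for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: how mixed a set of labels is (0 = perfectly pure)."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def information_gain(parent, left_child, right_child):
    """Impurity(parent) minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted_children = (len(left_child) / n) * gini(left_child) + \
                        (len(right_child) / n) * gini(right_child)
    return gini(parent) - weighted_children

# Toy example: a candidate split that separates the labels fairly cleanly.
parent = ["yes", "yes", "yes", "no", "no", "no"]
left   = ["yes", "yes", "yes", "no"]   # e.g. rows where age > 30
right  = ["no", "no"]                  # e.g. rows where age <= 30

print(round(gini(parent), 3))                            # 0.5: maximally mixed
print(round(information_gain(parent, left, right), 3))   # 0.25: impurity dropped
```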


How trees make predictions


Once an example reaches a leaf, the tree outputs:

  • Classification tree: majority class in that leaf

  • Regression tree: average target value of the training samples in that leaf


So a tree prediction is based on the behavior of the training data that landed in the same final bucket.
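
A minimal sketch with scikit-learn (assuming it is installed); the tiny datasets are invented purely to show the two kinds of leaf output.

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: the leaf outputs the majority class of its training samples.
X_cls = [[22], [25], [31], [40], [45], [52]]   # a single feature, e.g. age
y_cls = ["no", "no", "yes", "yes", "yes", "no"]
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_cls, y_cls)
print(clf.predict([[33]]))   # majority class of the leaf that 33 lands in

# Regression: the leaf outputs the average target of its training samples.
X_reg = [[50], [60], [80], [120], [150], [200]]   # e.g. house size in m^2
y_reg = [100, 120, 160, 250, 310, 400]            # e.g. price in thousands
reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_reg, y_reg)
print(reg.predict([[90]]))   # mean target of the matching leaf
```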


The main trade-off: depth and overfitting


Depth is the key hyperparameter.

  • Shallow trees are easy to understand but may underfit.

  • Deep trees can fit training data extremely well, but often overfit by learning noise as if it were signal.


This is why trees are both powerful and risky. They can capture complex interactions, but they can also memorize quirks.


Practical guardrails include (see the sketch after this list):

  • maximum depth

  • minimum samples per leaf

  • minimum samples required to split

  • pruning strategies
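
In scikit-learn terms (a sketch, assuming that library), these guardrails map directly onto constructor arguments; the values here are placeholders to tune, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; replace with your own features and labels.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

tree = DecisionTreeClassifier(
    max_depth=4,            # cap how many questions can be chained together
    min_samples_leaf=20,    # every leaf must keep at least this many samples
    min_samples_split=50,   # a node needs this many samples before it may split
    ccp_alpha=0.001,        # cost-complexity pruning: trims splits that add little
    random_state=0,
).fit(X, y)
print(tree.get_depth())     # confirm the depth cap held
```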




Part II: Ensembles


A single decision tree is unstable. Small changes in training data can lead to a different set of splits and a different model.

Ensembles fix this by combining many models into one.


The core idea is simple: averaging reduces variance.

If one model overfits a quirk in its training data, the others are unlikely to overfit in exactly the same way. Aggregation cancels out the noise and keeps what is consistent.


How ensembles produce a final output


Ensembles combine predictions with an aggregation function (sketched in code after this list):

  • Classification: majority vote

  • Regression: average, sometimes weighted average
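
A minimal sketch of both aggregation rules, with made-up member predictions standing in for real models:

```python
from collections import Counter
from statistics import mean

# Hypothetical predictions from five ensemble members for one example.
class_votes = ["churn", "stay", "churn", "churn", "stay"]
price_preds = [310_000, 295_000, 330_000, 305_000, 315_000]

# Classification: the class with the most votes wins.
majority = Counter(class_votes).most_common(1)[0][0]
print(majority)            # "churn" (3 votes out of 5)

# Regression: average the members' numeric predictions.
print(mean(price_preds))   # the average of the five members' predictions
```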


A key point for product teams: ensembles often improve performance, but interpretability drops. You trade clarity for stability.




Part III: Random forests and bagging


A random forest is a popular ensemble method built from decision trees.

It uses a technique called bagging, short for bootstrap aggregating.


Bagging in plain language


Suppose you have 100 rows of data.


To train Tree 1:

  • sample 100 rows with replacement

  • some rows appear multiple times

  • some rows are left out entirely


Repeat this to train many trees. Each tree sees a slightly different dataset, so their errors are less correlated. That makes averaging effective.
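
A sketch of bootstrap sampling in plain Python; the 100-row dataset is just a list of row indices here.

```python
import random

random.seed(0)
rows = list(range(100))          # stand-in for 100 rows of data

def bootstrap_sample(rows):
    """Sample len(rows) rows with replacement, as bagging does for each tree."""
    return [random.choice(rows) for _ in rows]

sample = bootstrap_sample(rows)
unique = set(sample)
print(len(sample))               # 100 rows, but...
print(len(unique))               # ...only about 63 distinct rows on average
print(len(rows) - len(unique))   # the rest were left out of this tree's data
```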


Why random forests also sample features


Random forests add another trick: feature randomness.

At each split, a tree may only choose from a random subset of features. This prevents all the trees from fixating on the same “strong” feature and becoming too similar.


Design choices you control (illustrated in the sketch after this list):

  • number of trees

  • maximum tree depth or minimum samples per leaf

  • number of features considered per split

  • sampling strategy
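
The same choices expressed as scikit-learn arguments (a sketch, assuming that library; the values are illustrative starting points, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; swap in your own features and labels.
X, y = make_classification(n_samples=1_000, n_features=12, random_state=0)

forest = RandomForestClassifier(
    n_estimators=300,       # number of trees in the forest
    max_depth=None,         # let trees grow, but...
    min_samples_leaf=5,     # ...keep leaves from memorizing single rows
    max_features="sqrt",    # features considered per split (feature randomness)
    bootstrap=True,         # rows sampled with replacement per tree (bagging)
    random_state=0,
    n_jobs=-1,
).fit(X, y)
print(forest.score(X, y))   # training accuracy; validate on held-out data in practice
```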


Random forests are often a strong default when:

  • the relationship is nonlinear,

  • feature interactions matter,

  • you want solid performance without heavy feature engineering.




Part IV: Clustering and K-means


Clustering is unsupervised learning: grouping items without labels.

Instead of asking “what should we predict,” clustering asks “what naturally belongs together.”


The most important decision in clustering: similarity


Clustering does not discover the “true” groups in your business. It groups according to the features you choose.


Similarity depends on your intent.

  • Group customers by geography and you get geographic segments.

  • Group customers by purchase behavior and you get behavioral segments.

  • Group articles by word usage and you get topical clusters.


Clustering is less about the algorithm and more about defining what “similar” should mean for your product goal.


K-means clustering


K-means is a widely used clustering algorithm that groups points by distance to a cluster center.


Goal: minimize the sum of squared distances between points and their assigned cluster center


Process (each step is mirrored in the code sketch after this list):

  1. Choose K (number of clusters)

  2. Initialize K centers randomly

  3. Assign each point to the nearest center

  4. Move each center to the mean of its assigned points

  5. Repeat steps 3 and 4 until centers stop moving
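
A compact NumPy sketch of exactly those five steps, run on made-up 2D points; a production implementation would also handle edge cases such as a center losing all of its points.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up 2D points forming three rough blobs.
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

k = 3                                                        # step 1: choose K
centers = points[rng.choice(len(points), k, replace=False)]  # step 2: random init

for _ in range(100):
    # Step 3: assign each point to the nearest center.
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 4: move each center to the mean of its assigned points.
    new_centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    # Step 5: stop when the centers stop moving.
    if np.allclose(new_centers, centers):
        break
    centers = new_centers

print(np.round(centers, 2))   # should sit near (0, 0), (5, 5), and (0, 5)
```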


What makes K-means useful:

  • it is fast and simple

  • it scales well

  • it provides clear clusters when the data has roughly spherical group structure


Common pitfalls (with the usual mitigations sketched in code after this list):

  • you must choose K

  • results can change with different initializations

  • it struggles when clusters have different densities or irregular shapes

  • feature scaling matters because distance is sensitive to units
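
A sketch of the usual mitigations with scikit-learn (assuming that library): scale the features first, use several random initializations, and compare a few values of K before committing.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; replace with your own feature matrix.
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
X_scaled = StandardScaler().fit_transform(X)   # distance is sensitive to units

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    score = silhouette_score(X_scaled, km.labels_)
    print(k, round(km.inertia_, 1), round(score, 3))   # compare before choosing K
```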




Product implications: performance, interpretability, and cost


Tree-based models and clustering expand what you can build, but they force trade-offs.


Performance


Trees and ensembles often outperform linear baselines on nonlinear data with interactions.


Interpretability


  • A single tree is easy to explain.

  • A forest of hundreds of trees is not.
    If you need a defensible explanation per decision, that constraint should influence your choice early.


Computational cost


  • Linear models are cheap.

  • Trees are moderate.

  • Ensembles cost more to train and more to run, because inference requires many models plus aggregation.


The right model is the one that fits your product constraints, not the one that wins a leaderboard.





A practical playbook


  • Use a decision tree when you want a quick, interpretable model that captures nonlinearity with minimal preprocessing.

  • Use a random forest when a single tree overfits or is unstable and you need better generalization.

  • Use clustering when you have unlabeled data and need segmentation, discovery, or structure.

  • Do a similarity audit for clustering: write down the business meaning of “similar” before you choose features.

  • Treat K as a product decision: test multiple values, compare cluster usefulness, and prefer clusters that lead to distinct actions.





Examples that make it concrete


A simple animal classification tree


You want to classify dog, lizard, bird, and moose.

  • Split 1: “Does it have horns?” Yes means moose.

  • Split 2: “Does it have two legs?” Yes means bird.

  • Split 3: “Is it green?” Yes means lizard, otherwise dog.


That is a tree doing what trees do best: encoding conditional logic.
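
Written as code, that same tree is just nested conditionals; the function below is a hand-written illustration, not a trained model.

```python
def classify_animal(has_horns: bool, has_two_legs: bool, is_green: bool) -> str:
    """Hand-written decision tree: each 'if' is a node, each return a leaf."""
    if has_horns:         # Split 1
        return "moose"
    if has_two_legs:      # Split 2
        return "bird"
    if is_green:          # Split 3
        return "lizard"
    return "dog"

print(classify_animal(has_horns=False, has_two_legs=False, is_green=True))  # lizard
```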


Weather forecasting and ensembles


Weather systems often use ensembles: multiple models, multiple scenarios, aggregated into one forecast. The reason is not style, it is robustness. When uncertainty is high, averaging across plausible models reduces the chance that one bad assumption dominates the outcome.

The same logic applies in business forecasting and risk prediction.


Clustering houses instead of predicting prices


If you cluster houses by size, you group cottages and mansions.
If you cluster by age, you group historic homes regardless of size.

Neither is “right.” The correct clusters depend on what decision the product needs to support.





Takeaways


  • Non-parametric models like trees adapt to data and handle nonlinear relationships naturally.

  • Decision trees select splits by reducing impurity, often summarized as information gain.

  • Overfitting is the primary risk for deep trees, so depth and minimum samples per leaf matter.

  • Ensembles reduce variance by aggregating many models, improving stability and generalization.

  • Random forests use bagging plus feature randomness to create diverse trees whose average is strong.

  • Clustering is only as meaningful as your definition of similarity.

  • K-means iteratively assigns points to centers and moves centers to the mean until stable.

  • Model choice is always a trade-off between performance, interpretability, and cost.

Let's talk product