The Strategy of Patterns: A Product Manager’s Guide to Machine Learning (Part V)

Two conceptual shifts

This part covers two moves that expand what ML can do.

From parametric to non-parametric supervised learning. Parametric models assume a fixed form and learn weights within it. Non-parametric models like decision trees aren't locked into one equation. They grow in complexity based on what the data demands, which makes them effective for nonlinear relationships and feature interactions.

From supervised to unsupervised learning. Supervised learning has labels: you know the right answer for historical examples. Unsupervised learning doesn't. The goal isn't prediction but organization: finding structure, groups, and anomalies that help humans and systems make sense of data.

Knowing when to use each approach, and what each costs in interpretability, stability, and computation, is what a mature ML strategy looks like.

Decision trees

A decision tree predicts by asking a sequence of questions. Each question splits the dataset into smaller groups. You keep splitting until you reach a final decision.

The anatomy of a tree:

  • Nodes: where a question is asked. Example: "Is age > 30?"

  • Branches: the path taken based on the answer.

  • Leaves: the endpoints, where the final prediction is produced.

This structure makes trees intuitive. You can trace a single prediction end-to-end and see exactly which conditions drove the outcome.

How trees choose splits: impurity and information gain

Training a tree is mostly about one decision repeated many times: what question should I ask next? Trees choose splits that reduce impurity, a measure of how mixed the labels are at a node. If a node is 50% "yes" and 50% "no," it's very mixed. If a split makes one child almost entirely "yes," impurity drops.

Information gain expresses the improvement from a split: the impurity of the parent node minus the size-weighted average impurity of the child nodes.

information gain = impurity(parent) − weighted average impurity(children)

The algorithm tries many candidate splits and selects the one with the highest information gain: the split that creates the cleanest separation.
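This is concrete enough to compute by hand. A minimal sketch using Gini impurity (one common impurity measure; entropy is another) — the labels below are invented for illustration:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2). 0.0 means pure; 0.5 is a 50/50 binary mix."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def information_gain(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5  # maximally mixed node: gini = 0.5

# A clean split separates the classes completely: gain = 0.5.
clean_split = information_gain(parent, ["yes"] * 5, ["no"] * 5)

# A poor split leaves both children as mixed as the parent: gain = 0.0.
poor_split = information_gain(parent, ["yes", "no"], ["yes"] * 4 + ["no"] * 4)

print(clean_split, poor_split)
```

The tree-growing algorithm runs exactly this comparison over many candidate splits and keeps the winner.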

How trees make predictions

Once an example reaches a leaf, the tree outputs:

  • Classification tree: majority class in that leaf.

  • Regression tree: average target value of the training samples in that leaf.

A tree prediction is based on the behavior of the training data that landed in the same final bucket.
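Both leaf rules are one-liners. A sketch with made-up leaf contents:

```python
from collections import Counter
from statistics import mean

# Training examples that landed in one leaf of a classification tree:
leaf_labels = ["dog", "dog", "dog", "lizard"]

# Training targets that landed in one leaf of a regression tree (e.g. prices):
leaf_targets = [310_000, 295_000, 330_000]

classification_pred = Counter(leaf_labels).most_common(1)[0][0]  # majority class
regression_pred = mean(leaf_targets)                             # average target

print(classification_pred, regression_pred)
```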

The main trade-off: depth and overfitting

Depth is the key hyperparameter. Shallow trees are easy to understand but may underfit. Deep trees can fit training data extremely well but often overfit by learning noise as if it were signal.

Practical guardrails: maximum depth, minimum samples per leaf, minimum samples required to split, and pruning strategies.
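If your team uses scikit-learn, each of these guardrails maps to a constructor parameter. A sketch on synthetic data (the parameter values are illustrative, not recommendations):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)  # noisy labels

tree = DecisionTreeClassifier(
    max_depth=4,           # cap how many questions deep the tree can go
    min_samples_leaf=20,   # every leaf must keep at least 20 training rows
    min_samples_split=50,  # a node needs 50 rows before it may split
    ccp_alpha=0.01,        # cost-complexity pruning of low-value branches
).fit(X, y)

print(tree.get_depth())  # bounded by max_depth
```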

A simple example: classifying animals

You want to classify dogs, lizards, birds, and moose.

  • Split 1: "Does it have horns?" → Yes means moose.

  • Split 2: "Does it have two legs?" → Yes means bird.

  • Split 3: "Is it green?" → Yes means lizard, otherwise dog.

That's a tree doing what trees do best: encoding conditional logic in a sequence of binary decisions.
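Encoded literally, the example is just nested conditionals — each `if` is a node, each `return` a leaf:

```python
def classify_animal(has_horns: bool, two_legs: bool, is_green: bool) -> str:
    """The three splits from the example, written as conditional logic."""
    if has_horns:    # split 1
        return "moose"
    if two_legs:     # split 2
        return "bird"
    if is_green:     # split 3
        return "lizard"
    return "dog"

print(classify_animal(False, False, True))
```

A trained tree is the same structure, except the algorithm chose the questions, their order, and the thresholds from data.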

Ensembles

A single decision tree is unstable. Small changes in training data can lead to a different set of splits and a different model. Ensembles reduce this instability by combining many models into one.

The core idea: averaging reduces variance. If one model overfits a strange pattern, other models won't overfit the same way. Aggregation cancels noise and keeps what is consistent.

How ensembles produce a final output:

  • Classification: majority vote.

  • Regression: average, sometimes weighted.

The trade-off for product teams: ensembles often improve performance, but interpretability drops. You trade clarity for stability.
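Both aggregation rules fit in a few lines. The model outputs and weights below are invented for illustration:

```python
from collections import Counter
from statistics import mean

# Five models' outputs for the same input:
class_votes = ["churn", "stay", "churn", "churn", "stay"]
regression_preds = [12.0, 11.5, 12.8, 11.9, 12.3]

majority = Counter(class_votes).most_common(1)[0][0]  # classification: majority vote
average = mean(regression_preds)                      # regression: simple average

# Weighted average, e.g. weighting better-validated models more heavily:
weights = [0.3, 0.1, 0.3, 0.2, 0.1]
weighted = sum(w * p for w, p in zip(weights, regression_preds))

print(majority, average, weighted)
```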

Weather forecasting as an analogy

Weather systems often use ensembles: multiple models, multiple scenarios, aggregated into one forecast. Not for style but for robustness. When uncertainty is high, averaging across plausible models reduces the chance that one bad assumption dominates the outcome. The same logic applies in business forecasting and risk prediction.

Random forests and bagging

A random forest is a popular ensemble method built from decision trees, using a technique called bagging (bootstrap aggregating).

Bagging in plain language: suppose you have 100 rows of data. To train Tree 1, sample 100 rows with replacement: some rows appear multiple times, some are missing. Repeat this to train many trees. Each tree sees a slightly different dataset, so their errors are less correlated. That makes averaging effective.

Feature randomness: random forests add another trick. When each tree considers a split, it can only consider a random subset of features. This prevents all trees from fixating on the same strong feature and becoming too similar.

Design choices you control: number of trees, maximum tree depth or minimum samples per leaf, number of features considered per split, and sampling strategy.

Random forests are often a strong default when the relationship is nonlinear, feature interactions matter, and you want solid performance without heavy feature engineering.
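In scikit-learn, those design choices are the main `RandomForestClassifier` parameters. A sketch on synthetic data (values are illustrative defaults, not tuned recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,     # number of trees
    max_depth=None,       # let trees grow; rely on averaging to control variance
    min_samples_leaf=2,   # light per-tree regularization
    max_features="sqrt",  # features considered per split (feature randomness)
    bootstrap=True,       # bagging: each tree sees a bootstrap sample
    random_state=0,
)
scores = cross_val_score(forest, X, y, cv=5)
print(scores.mean())
```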

Clustering and K-means

Clustering is unsupervised learning: grouping items without labels. Instead of asking "what should we predict," it asks "what naturally belongs together."

The most important decision in clustering: similarity

Clustering doesn't discover the "true" groups in your business. It groups according to the features you choose. Group customers by geography and you get geographic segments. Group by purchase behavior and you get behavioral segments. Group articles by word usage and you get topical clusters.

Clustering is less about the algorithm and more about defining what "similar" should mean for your product goal.

Example: clustering houses

If you cluster houses by size, you group cottages and mansions together. If you cluster by age, you group historic homes regardless of size. Neither is "right." The correct clusters depend on what decision the product needs to support.

K-means clustering

K-means groups points by distance to a cluster center. Goal: minimize the sum of distances between points and their assigned center.

The algorithm starts with K centers (often placed at random) and repeats two steps until nothing changes:

  • Assignment: each point is assigned to its nearest center.

  • Update: each center moves to the mean of the points assigned to it.

What makes K-means useful: fast, simple, scales well, provides clear clusters when the data has roughly spherical group structure.

Common pitfalls: you must choose K upfront, results can change with different initializations, it struggles when clusters have different densities or irregular shapes, and feature scaling matters because distance is sensitive to units.
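The assign-then-update loop is short enough to write out in full. A minimal from-scratch sketch on toy 2-D points (not production code; real libraries add smarter initialization such as k-means++ and multiple restarts):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Toy K-means: assign points to their nearest center, then move each
    center to the mean of its cluster, until no center moves."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize centers at k random points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])),
            )
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its assigned points.
        new_centers = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centers[j]
            for j, cluster in enumerate(clusters)
        ]
        if new_centers == centers:  # stable: converged
            break
        centers = new_centers
    return centers, clusters

# Two well-separated blobs, three points each:
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centers, clusters = kmeans(points, k=2)
print(sorted(len(cluster) for cluster in clusters))
```

Note that squared Euclidean distance drives everything here, which is why feature scaling matters: a feature measured in large units dominates the distance unless you normalize first.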

Trade-offs across the three approaches

Performance: trees and ensembles often outperform linear baselines on nonlinear data with interactions.

Interpretability: a single tree is easy to explain. A forest of hundreds of trees is not. If you need a defensible explanation per decision, that constraint should shape your model choice early, not after you've built it.

Computational cost: linear models are cheap, trees are moderate, ensembles cost more to train and run because inference requires many models plus aggregation.

The right model fits your product constraints, not a leaderboard.

Questions to be able to answer

  • Is the relationship you're modeling likely nonlinear or conditional? If yes, a tree-based approach is worth trying before more complex alternatives.

  • Does a single tree overfit or produce unstable results? If yes, move to a random forest.

  • Do you have unlabeled data that needs structure? Clustering, but write down the business meaning of "similar" before choosing features.

  • For K-means: have you tested multiple values of K and compared cluster usefulness? Do the resulting clusters lead to distinct actions?

  • If using an ensemble: do you have a plan for explaining decisions when interpretability is required?

Takeaways

  • Non-parametric models like trees adapt to data and handle nonlinear relationships naturally.

  • Decision trees select splits by reducing impurity, summarized as information gain.

  • Overfitting is the primary risk for deep trees. Depth and minimum samples per leaf are the main controls.

  • Ensembles reduce variance by aggregating many models, improving stability and generalization.

  • Random forests use bagging plus feature randomness to create diverse trees whose average is strong.

  • Clustering is only as meaningful as your definition of similarity.

  • K-means iteratively assigns points to centers and moves centers to the mean until stable.

  • Model choice is always a trade-off between performance, interpretability, and cost.

Product Intelligence Atlas

Applied thinking on product and AI, from someone doing the work.

I started the Atlas as a place to put things I didn't want to lose. Notes from courses, prompts that actually worked, observations from client work that felt worth writing down. It grew from there. Now it's where I think through AI and product management in public: what I'm learning, what I'm building, what I think is worth paying attention to.


Let's talk product

Maxime John · AI-fluent PM · Based in Germany, relocating to Portland, OR

Open to PM roles at US companies, remote now and on-site in Portland, OR from Q4 2026.

Job conversations, project ideas, and good product discussions all welcome.

Open to PM roles in the US

Available for remote work now

On-site in Portland, OR from Q4 2026
