How an LLM Works End-to-End: From Text to Tokens to Output
Nov 12, 2025
As a product manager, you do not need to derive transformer equations to make good decisions. You do need a concrete mental model of what happens from user text to model output, and where the sharp edges live.
This briefing walks the full path:
Text → tokens → context window → transformer attention → logits → sampling → output text
Along the way we will anchor everything in implementation patterns that show up in real products: cost control, latency tuning, prompt design, and reliability under load.
Key terms we will use: tokens, context window, attention, inference, logits, sampling.
The full pipeline in one picture
When a user types:
“Summarize this email thread and propose a reply.”
A production LLM system does roughly this:
Normalize text (minor, model specific)
Tokenize text into token IDs (integers)
Assemble the prompt into a single sequence (system + developer + user + tools + retrieved context)
Check context window limits and apply truncation or retrieval strategies
Run inference (forward pass through the transformer) to produce logits for the next token
Sample a token from logits (greedy or stochastic)
Append that token to the sequence and repeat the inference, sampling, and append steps until a stop condition is hit
Detokenize token IDs back into text
Optionally stream tokens to the UI as they are produced
Everything product relevant hangs off a few facts:
The model operates on tokens, not characters or words.
The model generates output by predicting the next token repeatedly.
Attention and caching dominate runtime behavior, which is why context length drives latency.
Outputs are not “the answer”. They are a sample from a probability distribution.
Let’s build intuition piece by piece.
1) Tokenization: what it is, and why tokens matter
What tokenization is
Tokenization is the process of converting text into a sequence of token IDs.
Tokens are not the same as words.
Common words might be 1 token.
Rare words might be multiple tokens.
Punctuation and whitespace often become tokens.
Different languages, emojis, and code can tokenize very differently.
Most modern LLMs use subword tokenization families such as Byte Pair Encoding (BPE), Unigram (SentencePiece), or closely related variants. The practical effect is:
The model has a finite “vocabulary” of token pieces (for example: "ing", " pre", "http", "{")
Text is split into the best-matching pieces
Each piece maps to an integer ID
Those integers are what the model actually processes
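To make this concrete, here is a minimal sketch using the open-source tiktoken library. This is one BPE vocabulary among many; your model’s tokenizer may differ, so treat the counts as illustrative.

```python
# pip install tiktoken
import tiktoken

# "cl100k_base" is one widely used BPE vocabulary; other models use other vocabularies.
enc = tiktoken.get_encoding("cl100k_base")

text = "Summarize this email thread and propose a reply."
token_ids = enc.encode(text)     # text -> list of integer token IDs (exact IDs depend on the vocabulary)
print(len(token_ids), "tokens")  # token count, not word count
print(enc.decode(token_ids))     # IDs -> original text
```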
Why tokens matter for cost
Nearly all API pricing and internal capacity planning is in tokens:
Input tokens: the prompt you send
Output tokens: the completion you receive
If your prompt is 20,000 tokens and your output is 1,000 tokens, you pay for both and you also consume runtime on both.
Product implication: prompt bloat has a direct line to gross margin.
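As a back-of-envelope sketch of that line to margin, with made-up per-token prices (not any vendor’s real pricing):

```python
# Hypothetical prices, purely for illustration.
price_per_input_token = 3.00 / 1_000_000    # $3 per million input tokens
price_per_output_token = 15.00 / 1_000_000  # $15 per million output tokens

input_tokens, output_tokens = 20_000, 1_000
cost = input_tokens * price_per_input_token + output_tokens * price_per_output_token
print(f"${cost:.4f} per call")  # $0.0750 per call; at 1M calls per month that is $75,000
```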
Practical patterns:
Put stable policy text into a short system prompt, not repeated user instructions.
Avoid pasting full documents when retrieval plus quotes will do.
Cache expensive prompt prefixes for repeated tasks (more on prefix caching later).
Why tokens matter for latency
Tokenization itself is usually fast, but token count affects latency in two deeper ways:
Prefill cost: processing the input prompt before the model can generate the first output token.
Decode cost: generating each output token, one step at a time.
Both scale with token counts, but not equally.
In many transformer implementations:
Prefill cost grows with input length.
Decode cost grows with output length, but each decode step also attends to prior tokens. This is why long contexts can slow down per-token generation too.
Rule of thumb for PM intuition:
Cutting input tokens often reduces “time to first token”.
Cutting output tokens reduces total time, especially if you stream.
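A rough latency model makes the two levers visible. The throughput numbers below are invented for illustration; real values depend on model, hardware, and load.

```python
# Hypothetical throughputs, for intuition only.
prefill_tokens_per_sec = 5_000   # how fast the prompt is processed
decode_tokens_per_sec = 50       # how fast output tokens are generated

def estimate_latency(input_tokens: int, output_tokens: int) -> tuple[float, float]:
    time_to_first_token = input_tokens / prefill_tokens_per_sec
    total_time = time_to_first_token + output_tokens / decode_tokens_per_sec
    return time_to_first_token, total_time

print(estimate_latency(20_000, 500))  # long prompt: ~4.0s before the first token appears
print(estimate_latency(2_000, 500))   # trimmed prompt: ~0.4s to first token, same decode cost
```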
Why tokens matter for prompting quality
Prompts “work” because the model sees patterns in sequences of tokens.
Tokenization affects patterns in subtle ways:
A small rewording can change token boundaries, changing what the model “recognizes”.
Lists, headings, and consistent separators often help because they create stable token patterns.
Code blocks tokenize predictably, which is why structured prompts often behave better than prose.
Practical prompting patterns that exploit token stability:
Use clear section labels:
Task:, Constraints:, Output format:
Use consistent delimiters for data: triple backticks, JSON, or XML-like tags
Keep formatting consistent across requests if you want consistent behavior
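A sketch of what such a token-stable skeleton might look like. The labels and tags are one convention, not a requirement of any API.

```python
# Hypothetical template; reuse the same labels and delimiters on every request.
prompt_template = """Task:
Summarize the email thread below and propose a reply.

Constraints:
- One-paragraph summary, then a reply of at most 120 words.
- Keep the reply polite.

Output format:
Summary: <one paragraph>
Reply: <draft reply>

<email_thread>
{email_thread}
</email_thread>
"""

prompt = prompt_template.format(email_thread="...")
# Identical structure across calls produces identical token patterns,
# which tends to make the model's behavior more consistent.
```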
2) Context window: what it is, and what breaks when you exceed it
What a context window is
A model’s context window is the maximum number of tokens it can consider at once in a single forward pass.
This includes:
System instructions
Developer instructions
User messages
Tool outputs
Retrieved documents
The model’s own previous replies in the thread
In other words: the context window is your model’s short term working memory, measured in tokens.
What actually happens near and beyond the limit
There are a few common behaviors, depending on the API and the model:
Hard error: the request is rejected for exceeding context length.
Truncation: the system drops earlier tokens to fit.
Sliding window: only the most recent tokens are used.
Summarization or compression: a service layer compresses older context.
The most important thing is not the exact behavior. It is the failure mode it creates.
What breaks in real products
When you exceed the context window, these are the real breakages you see:
A) Instruction loss and “amnesia”
If truncation drops the system or developer instructions, the model stops following your rules.
Symptoms:
Suddenly wrong output format
Safety or compliance constraints not honored
Tone shifts
Tool use stops
B) Retrieval dilution
If you add too much retrieved content, the model’s attention is spread across many tokens. Even within the limit, relevance can degrade.
Symptoms:
The model quotes the wrong section
Answers become generic
It misses the key detail that is present in the context
C) Latency blowups
Even before you hit the limit, long context increases time to first token and sometimes slows decoding.
Symptoms:
The UI feels frozen before streaming begins
P95 latency climbs sharply with long threads
Timeouts increase
D) Cost explosions that look like “mysterious spend”
Long chats that include full tool outputs and repeated policies can silently multiply token usage.
Symptoms:
Token dashboards show huge input tokens per call
Unit economics degrade as usage scales
Users feel output quality is not improving despite higher cost
Practical patterns to manage context
You have a small set of robust strategies.
1) Budget tokens explicitly
Treat context like a budget:
Reserve a fixed chunk for instructions.
Reserve a fixed chunk for user input.
Reserve a fixed chunk for retrieved context.
Reserve a fixed chunk for output.
Many production systems keep a “prompt budgeter” that measures tokens and trims or compresses.
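A minimal sketch of such a budgeter. The count_tokens heuristic is a stand-in for your model’s real tokenizer, and the numbers are illustrative.

```python
def count_tokens(text: str) -> int:
    # Stand-in for a real tokenizer call such as len(enc.encode(text)).
    return max(1, len(text) // 4)  # rough heuristic: ~4 characters per English token

CONTEXT_LIMIT = 16_000
BUDGET = {"instructions": 1_000, "user_input": 3_000, "retrieved": 8_000, "output": 4_000}
assert sum(BUDGET.values()) <= CONTEXT_LIMIT

def fit_to_budget(text: str, max_tokens: int) -> str:
    # Crude character-level trim for the sketch; real systems summarize or re-rank instead.
    return text if count_tokens(text) <= max_tokens else text[: max_tokens * 4]

retrieved_context = fit_to_budget("...long retrieved documents...", BUDGET["retrieved"])
user_input = fit_to_budget("...user message...", BUDGET["user_input"])
```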
2) Retrieval augmented generation (RAG), but with discipline
Instead of pasting everything, retrieve only the most relevant chunks.
Implementation patterns:
Chunk documents by semantic boundaries (headings, paragraphs), not fixed character count.
Retrieve top-k chunks and cap them by tokens.
Add citations or chunk IDs so you can debug what the model saw.
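A sketch of the “top-k, capped by tokens” idea. The chunk records and their token counts are assumed to come from your own retriever and tokenizer.

```python
def select_chunks(ranked_chunks: list[dict], token_budget: int) -> list[dict]:
    # Take chunks in relevance order until the token budget is spent.
    selected, used = [], 0
    for chunk in ranked_chunks:
        if used + chunk["tokens"] > token_budget:
            break
        selected.append(chunk)
        used += chunk["tokens"]
    return selected

chunks = [
    {"id": "doc1#3", "text": "Security review checklist ...", "tokens": 800},
    {"id": "doc2#1", "text": "Launch timeline ...", "tokens": 1200},
    {"id": "doc1#7", "text": "Vendor SLA terms ...", "tokens": 900},
]
context = select_chunks(chunks, token_budget=2_000)
# Keep the chunk IDs in the prompt so you can later debug exactly what the model saw.
prompt_context = "\n\n".join(f"[{c['id']}]\n{c['text']}" for c in context)
```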
3) Summarize older conversation into state
For long-lived threads, store a compact state representation:
user preferences
decisions made
key facts
open tasks
Then include that state, not the full transcript.
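One way to represent that compact state; the fields here are illustrative, not a standard.

```python
from dataclasses import dataclass, field

@dataclass
class ConversationState:
    # A compact, structured stand-in for the full transcript.
    user_preferences: dict = field(default_factory=dict)
    decisions: list[str] = field(default_factory=list)
    key_facts: list[str] = field(default_factory=list)
    open_tasks: list[str] = field(default_factory=list)

    def to_prompt(self) -> str:
        return (
            f"Preferences: {self.user_preferences}\n"
            f"Decisions so far: {self.decisions}\n"
            f"Key facts: {self.key_facts}\n"
            f"Open tasks: {self.open_tasks}"
        )

state = ConversationState(
    user_preferences={"tone": "concise"},
    decisions=["Ship the summary feature behind a flag"],
    key_facts=["Vendor missed the security review"],
    open_tasks=["Draft reply to the vendor"],
)
# A few hundred tokens of state can replace thousands of tokens of raw transcript.
```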
4) Use “tool outputs are not chat history” discipline
Tool outputs can be huge. Common pattern:
Store tool output in your database
Feed the model a short excerpt plus a pointer and a summary
Only paste full content when truly necessary
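A sketch of that pattern. The store and summarize arguments are hypothetical stand-ins for your database client and a summarization step.

```python
def handle_tool_output(tool_name: str, raw_output: str, store, summarize) -> str:
    # Persist the full output, hand the model only an excerpt, a summary, and a pointer.
    record_id = store.save(tool=tool_name, payload=raw_output)  # full output lives in your DB
    excerpt = raw_output[:500]                                  # short, bounded excerpt
    return (
        f"Tool {tool_name} returned {len(raw_output)} characters "
        f"(stored as {record_id}).\n"
        f"Summary: {summarize(raw_output)}\n"
        f"Excerpt:\n{excerpt}"
    )
```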
3) Next-token prediction: what it means in practice
The core idea
An LLM is trained to do one thing extremely well:
Given a sequence of tokens, predict the probability distribution of the next token.
That is next-token prediction.
It does not “search for the best answer” in a human sense. It produces a distribution over possible next tokens, then you pick one token and repeat.
So the model is a probability engine that builds text one token at a time.
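In pseudocode-ish Python, the whole mechanism is one loop. The model and sample_next names are hypothetical stand-ins for the pieces the rest of this briefing describes.

```python
def generate(model, sample_next, prompt_ids: list[int], max_new_tokens: int, stop_id: int) -> list[int]:
    # Predict a distribution over the next token, pick one, append, repeat.
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(ids)            # scores over the vocabulary for the *next* token
        next_id = sample_next(logits)  # greedy or stochastic choice (see the sampling section)
        ids.append(next_id)
        if next_id == stop_id:
            break
    return ids
```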
Why this explains so many model behaviors
The model is sensitive to phrasing
If you change the prompt, you change the token sequence, which changes the distribution over next tokens.
This is why:
“Write a summary” and “Summarize in 3 bullets with a recommendation” can behave very differently.
A structured template can outperform a clever paragraph.
The model does not inherently optimize truth
Training encourages matching patterns in data. Truth can be correlated with those patterns, but it is not guaranteed.
So next-token prediction is compatible with:
confident wrong answers
plausible fabrication
shallow reasoning when the prompt does not force structure
Product implication: if correctness matters, you need scaffolding. Retrieval, verification steps, constrained output, and tool calls.
The model can “continue” almost anything
Give it half a poem, a JSON stub, or a code snippet, and it will predict likely continuations. That is the same mechanism.
This is why strong delimiters and explicit formats help. You are shaping the continuation distribution.
4) Attention: an intuitive model of “context matters”
What attention is, without equations
Attention is how the model decides which previous tokens to use when producing the next token.
At each layer, for each token position, the model computes something like:
Which other tokens are relevant to me right now?
How much weight should I give them?
You can think of it as dynamic, learned referencing.
If the model is generating a token in the sentence:
“The capital of Austria is ____.”
It will assign high weight to tokens like “capital” and “Austria”, and low weight to irrelevant earlier parts of the prompt.
That is attention in the most useful intuitive sense: relevance weighting across the context.
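For readers who want one level more detail, here is a minimal single-head attention sketch in numpy. The weights are random and the shapes tiny, so it only illustrates the relevance-weighting idea, not a real model.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, dim = 6, 16                      # 6 tokens, 16-dimensional vectors
x = rng.normal(size=(seq_len, dim))       # token representations at some layer

Wq, Wk, Wv = (rng.normal(size=(dim, dim)) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv          # queries, keys, values

scores = Q @ K.T / np.sqrt(dim)           # how relevant is each token to each other token
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax per row
output = weights @ V                      # each position: a relevance-weighted mix of values

# Real decoder models also mask out future positions (causal masking); omitted here for brevity.
print(weights[5].round(2))                # the last token's attention over all 6 positions, sums to 1
```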
Why attention explains “context matters”
Because attention is how information moves from earlier tokens into the model’s decision about the next token, two practical truths follow:
If the relevant information is not in the context window, the model cannot use it.
If the relevant information is buried in noise, attention might not focus on it reliably.
So “context matters” is not mystical. It is the set of tokens the model can attend to, and how well it can isolate the right ones.
What makes attention expensive
Naively, attention compares every token to every other token in the sequence, so compute and memory grow roughly quadratically as sequences get long.
Modern inference stacks use multiple optimizations:
KV cache: store key/value tensors from prior tokens so you do not recompute them during decoding.
Memory efficient attention kernels: speedups at the GPU level.
Paged attention or similar memory managers: reduce fragmentation and allow longer contexts.
Architectural variants: some models use local attention, sliding attention, or other mechanisms to scale.
As a PM, you do not need to memorize kernels. You do need to know the product consequence:
Long prompts do not only cost more. They often slow down and increase tail latency.
Optimizations exist, but context length is still a first-class performance variable.
5) Inference: what happens at runtime
Training vs inference
Training: the model learns weights from huge datasets.
Inference: you run the fixed model weights to produce outputs for your input.
Prefill vs decode
Inference typically has two phases:
Prefill (processing the prompt)
The model takes your full input token sequence and computes internal representations. This is where time to first token is decided.
If you have a long conversation history, prefill is usually your biggest tax.
Decode (generating tokens)
The model generates one token at a time. After each generated token, it updates the KV cache and repeats.
This is why:
Output length directly drives latency.
Streaming feels fast because you get tokens as they are decoded, even if total completion time is longer.
The practical performance levers
In real systems, these are common implementation patterns:
Streaming: improves perceived latency and engagement.
Stop sequences: limit rambling and reduce output tokens.
Max output tokens: hard cap to protect cost and latency.
Batching: serve multiple requests together on GPUs to increase throughput.
Quantization: reduce model precision to run faster or cheaper, sometimes at some quality cost.
Speculative decoding: use a smaller draft model to propose tokens, then verify with the main model to speed up decoding.
Prefix caching: cache repeated system and developer prefixes, especially for agentic workflows that call the model repeatedly.
From a product perspective, you want to expose only the controls that matter, and hide the ones users will misuse.
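As one concrete example, here is roughly how several of those levers appear in the OpenAI Python SDK’s chat completions call. Other providers expose similar knobs under different names, so treat the exact argument names and the model name as provider- and version-specific.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize the email thread in one sentence."},
    ],
    max_tokens=150,    # hard cap on output tokens: protects cost and latency
    temperature=0.2,   # low randomness for a summarization task
    stop=["\n\n"],     # stop sequence: end generation at a blank line
    stream=True,       # stream tokens as they are decoded for better perceived latency
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```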
6) Logits: the raw output before “choosing” a token
What logits are
At each decode step, the model outputs a vector of numbers, one per token in the vocabulary. Those numbers are called logits.
Logits are unnormalized scores. Higher logit means the token is more likely.
To turn logits into probabilities, you apply softmax. But in practice, you rarely need probabilities. You need a way to pick the next token.
That is where sampling comes in.
Why PMs should care about logits
Because many “personality” and “reliability” behaviors are just transformations applied to logits before sampling.
Examples:
Temperature rescales logits.
Top-k and top-p truncate options.
Repetition penalties push down logits for previously used tokens.
Biasing can push up or down specific tokens (use cautiously).
If you want deterministic output, you are essentially constraining how logits turn into the next token.
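A tiny sketch of what “temperature rescales logits” means numerically, using a made-up four-token vocabulary.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    exps = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exps / exps.sum()

logits = np.array([4.0, 3.0, 1.0, -2.0])  # raw scores for a toy 4-token vocabulary

print(softmax(logits).round(3))           # baseline distribution
print(softmax(logits / 0.5).round(3))     # temperature 0.5: sharper, the top token dominates
print(softmax(logits / 1.5).round(3))     # temperature 1.5: flatter, more variety when sampling
```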
7) Sampling: how the model chooses what to say
Sampling is the decision rule
Once you have logits, you choose the next token. That process is sampling.
Common strategies:
Greedy decoding
Always pick the highest probability token.
Pros: deterministic, often concise.
Cons: can get stuck in bland or locally optimal phrasing.
Temperature
A scalar that flattens or sharpens the distribution.
Lower temperature: more deterministic, less variety.
Higher temperature: more creative, more risk.
Top-k
Only consider the top k tokens by probability.
Reduces weird low-probability picks.
Can be overly restrictive if k is too small.
Top-p (nucleus sampling)
Consider the smallest set of tokens whose cumulative probability exceeds p.
Adapts to confidence. When the model is confident, the set is small. When uncertain, the set grows.
Often a good default for natural language.
Stop tokens and stop sequences
Hard stop conditions that end generation when certain tokens appear.
Crucial for bounded outputs.
Great for tool calls and structured formats.
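A minimal nucleus (top-p) sampler over the same toy logits, next to a greedy pick; numpy only, no real model involved.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    exps = np.exp(logits - logits.max())
    return exps / exps.sum()

def sample_top_p(logits: np.ndarray, p: float, rng: np.random.Generator) -> int:
    probs = softmax(logits)
    order = np.argsort(probs)[::-1]                        # token indices, most likely first
    cumulative = np.cumsum(probs[order])
    nucleus = order[: np.searchsorted(cumulative, p) + 1]  # smallest set whose mass exceeds p
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()  # renormalize within the nucleus
    return int(rng.choice(nucleus, p=nucleus_probs))

rng = np.random.default_rng(0)
logits = np.array([4.0, 3.0, 1.0, -2.0])
print(int(np.argmax(logits)))                # greedy: always token 0
print(sample_top_p(logits, p=0.9, rng=rng))  # nucleus at p=0.9: token 0 or 1; tokens 2 and 3 excluded
```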
Why sampling explains “inconsistency”
If you allow randomness (temperature, top-p), two runs can produce different outputs from the same prompt.
Even with low randomness, small differences in prompt tokens can change the distribution enough to change outcomes.
Product implication:
For workflows that require reproducibility, push toward deterministic settings and constrained formats.
For ideation products, embrace higher diversity but add guardrails.
8) A concrete end-to-end walkthrough
Let’s run the whole loop with a simplified example.
User input:
“Write a one sentence summary of: The launch was delayed because the vendor missed the security review.”
Step 1: Tokenize
The text becomes token IDs like:
[1012, 345, 8921, ...]
You never see these IDs in product UI, but they are the truth inside the model.
Step 2: Build the prompt sequence
Your system might prepend:
System: “You are a helpful assistant…”
Developer: “Output must be one sentence.”
User: the text
All of that becomes one token sequence.
Step 3: Prefill inference
The model processes all input tokens, producing internal states and populating the KV cache.
Step 4: Compute logits for the next token
The model outputs logits for every possible next token.
The highest scoring tokens might correspond to pieces like:
“The”
“Launch”
“was”
“delayed”
Step 5: Sample
With greedy decoding, you pick “The”. With temperature and top-p sampling, you will usually still pick “The”, but you might sometimes pick “Launch”.
Step 6: Append and repeat
Now the model predicts the next token given the updated sequence. It continues until it produces an end token or hits your max tokens or a stop sequence.
Step 7: Detokenize
Token IDs convert back to text.
Output:
“The launch was delayed because the vendor did not complete the security review in time.”
That is it. No hidden scratchpad is required for the basic mechanism. Everything emerges from repeated next-token prediction, conditioned on context via attention.
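To see the loop in running code, here is a minimal sketch using Hugging Face transformers with GPT-2 as a small stand-in model: no chat formatting, greedy decoding, illustration only.

```python
# pip install transformers torch
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "One sentence summary: The launch was delayed because the vendor missed the security review."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids           # Steps 1-2: tokenize, build sequence

with torch.no_grad():
    for _ in range(30):                                                # decode loop
        logits = model(input_ids).logits[:, -1, :]                     # Steps 3-4: forward pass, next-token logits
        next_id = torch.argmax(logits, dim=-1, keepdim=True)           # Step 5: greedy "sampling"
        input_ids = torch.cat([input_ids, next_id], dim=-1)            # Step 6: append and repeat
        if next_id.item() == tokenizer.eos_token_id:
            break

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))        # Step 7: detokenize

# Note: this sketch re-runs the full forward pass each step; real serving reuses the KV cache
# so only the newest token is processed per decode step.
```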
9) Practical product patterns that fall directly out of this model
Pattern 1: Prompt budgeting as a first-class feature
If tokens are cost and latency, treat them like memory in a mobile app.
Implementation habits:
Count tokens before sending.
Enforce caps per segment (instructions, retrieved context, user input).
Prefer retrieval and summarization over raw paste.
PM payoff: predictable cost, fewer context-related bugs, better tail latency.
Pattern 2: Separate “knowledge” from “instructions”
Instructions should be short, stable, and always present.
Knowledge should be retrieved, scoped, and minimized.
This reduces the chance that truncation destroys behavior.
Pattern 3: Design for time to first token
Users perceive “fast” as “responds quickly”, not “finishes quickly”.
Implementation choices:
Stream output.
Reduce prompt size.
Cache prefixes.
Use short tool calls first, then expand if needed.
Pattern 4: Force structure to reduce sampling risk
If you need reliability, do not ask for “a good answer”.
Ask for:
a JSON schema
bullet points with fixed headings
a plan with numbered steps
citations to retrieved chunk IDs
Structure narrows the set of likely next tokens, which increases consistency.
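A sketch of asking for, and checking, a fixed JSON shape. The schema and field names here are made up for illustration.

```python
import json

OUTPUT_INSTRUCTIONS = """Respond with JSON only, matching exactly:
{"summary": "<one sentence>", "recommendation": "<one sentence>", "citations": ["<chunk id>", "..."]}"""

def parse_or_retry(raw_reply: str) -> dict:
    # Validate the shape before trusting the output; retry or fall back on failure.
    data = json.loads(raw_reply)
    missing = {"summary", "recommendation", "citations"} - data.keys()
    if missing:
        raise ValueError(f"Model reply missing fields: {missing}")
    return data
```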
Pattern 5: Make overflow behavior explicit
If your context can overflow, choose a strategy and make it deliberate:
truncate oldest user messages but keep system instructions
summarize older context into state
retrieve rather than paste
show the user what is being used if transparency matters
A surprising amount of “LLM flakiness” is actually uncontrolled truncation.
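A sketch of the first strategy: drop the oldest non-system turns but never the system instructions. The count_tokens heuristic is a stand-in for your real tokenizer.

```python
def count_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # stand-in heuristic; use the real tokenizer in production

def fit_messages(messages: list[dict], limit: int) -> list[dict]:
    # messages: [{"role": "system" | "user" | "assistant", "content": "..."}, ...]
    # Always keep system messages; drop the oldest other turns until the rest fits.
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and sum(count_tokens(m["content"]) for m in system + rest) > limit:
        rest.pop(0)  # drop the oldest turn, not the newest
    return system + rest
```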
10) Common misunderstandings, corrected
“If it is in the chat, the model knows it”
Only if it is still within the context window that was sent to the model for that call.
“The model is reasoning over facts”
It is generating tokens that are likely given the context and its training. If you need factual grounding, you need retrieval or tools.
“Longer prompts always help”
Longer prompts can help, but they can also dilute attention, push you toward truncation, increase latency, and raise cost. More tokens are not free.
“Temperature is just creativity”
Temperature is the knob that changes how sharply the model prefers the top token. Higher temperature increases variety and risk. Lower temperature increases consistency and can increase repetitiveness.
Closing mental model
If you remember nothing else, remember this:
Tokens are the model’s true input and output units.
The context window is hard physics. Beyond it, the model cannot see.
The model generates text via next-token prediction, not by retrieving “the answer”.
Attention is the mechanism that lets it use context. It is also why noise and length matter.
Inference is runtime execution that splits into prefill and decode.
Logits are raw next-token scores.
Sampling turns logits into actual words, creating both creativity and inconsistency.
With that model, most product decisions become clearer: what to cache, what to retrieve, what to cap, what to stream, and what to structure.
A useful next step is to turn this into a checklist for reviewing any LLM feature spec: token budget, context strategy, latency plan, and output constraints.