The Randomness You Didn't Ask For: Understanding Non-Determinism in LLMs

I had a great conversation with my friend Eugene Meidinger on LLMs in general and non-determinism in particular, and that led me down a very deep rabbit hole. Having since emerged out of said hole, I have some interesting insights to share.

In fact, he created a pull request to fix some of the more egregious security issues my first vibe coding attempt had produced, and I asked him how he had found them. He gave me the prompt he had used with Claude Code, and I ran it on my repository.

My instance of Claude Code did not catch the same issues that Eugene’s did.

Welcome to non-determinism in Large Language Models.

If you’ve worked with LLMs, you’ve hit this. A flaky test. A customer complaint about inconsistent responses. That unsettling moment when you realized you couldn’t reproduce a bug because the system behaves differently every time.

I’ve already mentioned non-determinism in a previous blog post, but this post goes deeper on what non-determinism is and where it comes from. A follow-up will cover what to do about it.

Why This Matters
#

Here’s the thing: non-determinism in LLMs creates real problems. Not theoretical ones, but operational ones:

Flaky tests and CI failures. Your test suite passes locally, fails in CI, passes again on re-run. Same code, same inputs, different outputs. You start ignoring failures because “sometimes it just does that”—which is how bugs slip into production.

Irreproducible bugs. A user reports an issue. You can’t reproduce it because the system generates different output now. Was there ever a bug? Is there still one? You can’t know.

Inconsistent user experiences. The same user asks the same question twice and gets contradictory responses. Trust erodes fast.

Compliance nightmares. If you can’t predict outputs, how do you certify regulatory compliance? How do you audit? How do you prove the system won’t say something it shouldn’t?

Unreliable agents. Agents that book appointments, execute trades, or manage infrastructure need predictability. An agent that might do something different each time can’t be deployed safely.

When Non-Determinism Is Actually Desirable
#

Before we go further: non-determinism isn’t always the enemy. Creative writing, brainstorming, and recommendations all benefit from variation. You want the model to surprise you sometimes.

The goal isn’t deterministic LLMs. It’s controlled stochasticity: randomness where you want it, predictability where you need it. The problem isn’t randomness itself - it’s randomness in places you definitely don’t want it.

What Non-Determinism Actually Is
#

Let’s get precise. Non-determinism in LLM systems comes from multiple sources at different layers. I found six distinct layers where randomness creeps in. Most people only think about one.

Model-Level Sources
#

Token sampling is the obvious one. LLMs don’t output text directly - they output probability distributions over possible next tokens. The sampling process converts those probabilities into actual tokens. With temperature > 0 or nucleus sampling (top-p) (more about these in a bit), this conversion involves randomness. Different runs sample different tokens, even for identical inputs.
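To make this concrete, here is a minimal sketch of temperature sampling, with nothing provider-specific: toy logits for three candidate tokens, a softmax scaled by temperature, and a weighted random draw. The logit values are made up for illustration.

```python
import math
import random

def sample_token(logits, temperature=1.0):
    """Sample the next token id from raw logits, scaled by temperature."""
    if temperature == 0:
        # Greedy decoding: always the most probable token, no randomness.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    # Softmax over the scaled logits (subtract max for numerical stability).
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Weighted random draw: this is where non-determinism enters.
    return random.choices(range(len(logits)), weights=probs)[0]

logits = [2.0, 1.9, 0.5]  # hypothetical scores for three candidate tokens
print(sample_token(logits, temperature=0))    # always token 0
print(sample_token(logits, temperature=1.0))  # varies from run to run
```

At temperature 0 the function collapses to an argmax and always returns the same token; at temperature 1 the second token, with nearly the same logit, wins almost as often as the first.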

Floating-point variance is more subtle. GPUs perform massively parallel operations, and the order can vary between runs. Due to rounding, (a + b) + c doesn’t always equal a + (b + c) in floating-point math. Multiply tiny rounding differences across billions of operations, and occasionally two tokens end up with nearly identical probabilities—say, 0.1847 vs 0.1846—and a floating-point wobble flips which one gets selected. Even with temperature = 0, you can get different outputs on different runs. [1][2][3]
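You can see the non-associativity directly in a few lines of Python; no GPU required:

```python
# Floating-point addition is not associative: grouping changes rounding.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
print(left == right)  # False
print(left, right)    # 0.6000000000000001 0.6
```

Both sums are off by less than 10^-15, which is harmless in isolation. The problem is scale: reductions across billions of values, summed in whatever order the hardware schedules them, accumulate exactly this kind of wobble.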

Model updates and silent versioning break reproducibility over time. Providers update models regularly, sometimes without explicit notice. The model you tested against last month isn’t the model running today. Same prompt, same parameters, different underlying weights.

Prompt-Level Sources
#

Underspecified instructions are the biggest culprit here. “Analyze this data” is not a specification - it’s a vague gesture toward a category of actions. The model fills in the gaps differently each time. What aspects to analyze? How deeply? In what format? Every ambiguity is a branch point where different runs diverge.

Ambiguous constraints create similar problems. “Keep it brief” admits many readings. Is 50 words brief? 200? The token probabilities shift based on how the constraint interacts with the rest of the context, and that interaction varies.

Competing goals force trade-offs that get resolved inconsistently. “Be concise but thorough” is a contradiction—you can’t maximize both. The balance shifts between runs. Every prompt with goals in tension introduces variance.

System-Level Sources
#

Tool and skill selection in function-calling systems adds another layer of non-determinism. Given a user request and a set of available tools, the model decides which tool to call. That decision can vary. Maybe it calls the search tool first this time and the database tool first next time. The final answer depends on the order.

Retrieval order in RAG systems turns out to matter more than people realize. If your retrieval system returns documents in a slightly different order—because similarity scores are tied, or the index changed, or parallelism introduced variance—the model sees different context. Different context, different output.
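The tied-scores case is easy to demonstrate. In the sketch below (document names and scores are invented), Python’s `sorted` is stable, so documents with identical similarity scores keep whatever order the index handed them back in; if that incoming order varies between runs, so does your top-k:

```python
# Two retrieval runs score the same documents identically, but the
# index returns the candidates in a different order (e.g. parallel shards).
run_a = [("doc1", 0.92), ("doc2", 0.92), ("doc3", 0.80)]
run_b = [("doc2", 0.92), ("doc1", 0.92), ("doc3", 0.80)]

def top_k(candidates, k=1):
    # sorted() is stable: tied scores keep their incoming order,
    # so the winner of a tie depends on shard/index ordering.
    ranked = sorted(candidates, key=lambda x: -x[1])
    return [doc for doc, _ in ranked][:k]

print(top_k(run_a))  # ['doc1']
print(top_k(run_b))  # ['doc2']
```

Same corpus, same scores, different context for the model. A deterministic tie-breaker (document id, for instance) is the usual fix, but most retrieval pipelines don’t have one.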

Parallel calls and race conditions create ordering dependencies. If your system makes multiple API calls in parallel and the model processes results as they arrive, the order of processing depends on network latency and system load. This is non-deterministic by design.
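A toy simulation makes the ordering dependency visible. The tool names and latencies below are made up; the point is that processing-as-they-arrive follows latency, not submission order:

```python
import concurrent.futures
import time

def call_api(name, latency):
    time.sleep(latency)  # simulated network latency
    return name

with concurrent.futures.ThreadPoolExecutor() as pool:
    # Submitted in this order: search first, then database.
    futures = [pool.submit(call_api, "search", 0.2),
               pool.submit(call_api, "database", 0.01)]
    # Processing results as they complete: order follows latency instead.
    arrival_order = [f.result()
                     for f in concurrent.futures.as_completed(futures)]

print(arrival_order)  # ['database', 'search']
```

In production the latencies aren’t fixed constants, so the arrival order genuinely changes between runs, and any downstream logic that consumes results as they arrive inherits that variance.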

External APIs bring their own randomness. Your LLM calls a weather API - that’s a different response today than yesterday. It checks the current time - that’s different every time. Any integration with the outside world introduces state that changes between runs.

Agent-Level Sources
#

Agentic systems multiply non-determinism. Every decision point compounds.

Planning variability is inherent to open-ended tasks. “Research this topic and write a summary” requires selecting sources, ordering them, deciding when to stop gathering, choosing a structure. Each choice point introduces variance.

Self-reflection loops add another dimension. An agent that critiques and revises its own work produces different outputs depending on what it flags during self-critique.

Tool retries after failures introduce path-dependence. If a tool call fails and the agent retries differently, the output depends on what failed and how. Failures themselves are often non-deterministic (network issues, rate limits), so recovery paths vary too.

Infrastructure-Level Sources
#

Load balancing routes your request to different GPU clusters on different runs. Even identical hardware can produce slightly different floating-point results. You can’t control or even know which cluster handled your request.

Quantization differences matter more than people realize. The model might run in FP16 on one server and INT8 on another. Same weights, different precision, different outputs.

Request batching groups requests for efficiency. Recent research [5] showed this is the primary cause of non-determinism at temperature 0: your request shares a batch with others, and batch composition varies with server load.

Provider-Level Sources
#

Hidden system prompts can change without notice. You’re not just testing your prompt—you’re testing your prompt plus whatever the provider prepended.

Response caching at some providers means cache hits return stored responses while misses generate fresh ones—causing sudden behavioral shifts unrelated to your code.

Silent A/B testing means you might be in an experiment. Providers test model variants and infrastructure changes on live traffic. Yesterday’s model might not be today’s.

Notice the pattern? Each layer adds just a little uncertainty, and each source on its own seems manageable. But they compound, and together they produce chaos you can’t reason about.

What Non-Determinism Is NOT
#

Not every inconsistency is non-determinism. I’ve seen teams spend weeks “fixing variance” when the actual problem was something else entirely.

One team blamed non-determinism for inconsistent customer sentiment classifications. They tuned temperature, added seeds, restructured prompts—nothing helped. The actual problem? Their prompt said “classify as positive, negative, or neutral” but their evaluation expected “Positive”, “Negative”, or “Neutral” with capital letters. Half their “variance” was just case sensitivity in string matching.
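The fix for that class of problem is a normalization step before comparison, not any sampling parameter. A minimal sketch (the function name and label set are hypothetical, modeled on the story above):

```python
def normalize_label(raw):
    """Map raw model output onto canonical labels,
    ignoring case and surrounding whitespace."""
    canonical = {"positive", "negative", "neutral"}
    label = raw.strip().lower()
    return label if label in canonical else None

print(normalize_label("Positive"))    # 'positive'
print(normalize_label("  NEUTRAL ")) # 'neutral'
print(normalize_label("unsure"))      # None — flag for review
```

Only after outputs that mean the same thing compare as equal can you measure how much genuine variance is left.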

Some things that look like non-determinism are actually something else:

Bugs in prompt logic produce unexpected outputs, but consistently. That’s a bug, not randomness—fix the prompt.

Poor evaluation criteria make outputs seem inconsistent. If “good” is fuzzy, you’ll perceive variance that isn’t there.

Expected variation in natural language isn’t a problem. “The meeting is at 3pm” and “We’ll meet at 3:00 in the afternoon” mean the same thing. If your system treats these as inconsistent, your evaluation is too strict.

The Sampling Controls Everyone Reaches For
#

When people first encounter non-determinism, they reach for sampling parameters: lower temperature, smaller top-p, random seeds. These help, but less than you’d hope.

Temperature = 0 means greedy decoding—always pick the most probable token. But floating-point variance means even greedy decoding isn’t truly deterministic [4]. And greedy decoding constrains token selection, not reasoning paths. The model can express the same idea ten different ways while always selecting the most probable next token.

Top-p and top-k restrict which tokens are candidates. Useful for reducing tail risk, but the model can still take different approaches while staying within the allowed token set.
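For reference, nucleus (top-p) filtering can be sketched in a few lines; a simplified version that assumes the probabilities are already normalized:

```python
def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability >= p."""
    ranked = sorted(enumerate(probs), key=lambda x: -x[1])
    kept, cumulative = [], 0.0
    for idx, prob in ranked:
        kept.append(idx)
        cumulative += prob
        if cumulative >= p:
            break
    return kept

print(top_p_filter([0.5, 0.3, 0.15, 0.05], p=0.9))  # [0, 1, 2]
```

The long tail of unlikely tokens is gone, but the model still samples freely among the survivors, which is exactly why top-p trims tail risk without delivering determinism.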

Random seeds help reproduce specific runs but break across model versions, context changes, and streaming. Useful for debugging, not production.

Here’s the key insight: sampling controls reduce variance in token selection, but they don’t constrain reasoning paths. For consistent answers, you need structural constraints, not just parameter tuning.

That’s what the next post covers: the techniques that actually work for controlling non-determinism—structured outputs, skills, specification-driven workflows, and architectural patterns that isolate variance where you want it. Until you control it, you’re not building systems. You’re rolling dice in production. And the house always wins.


What’s Your Experience?
#

I’d love to hear from practitioners about where non-determinism has bitten you. The flaky test that wasted a week. The production incident caused by inconsistent outputs. The compliance audit that couldn’t be completed. Reach out to me on LinkedIn or BlueSky—I read everything.


References
#

Model-Level Non-Determinism
#

[1] Does Temperature 0 Guarantee Deterministic LLM Outputs? - Vincent Schmalbach
[2] Why Temperature=0 Doesn’t Guarantee Determinism in LLMs - Michael Brenndoerfer
[3] Zero Temperature Randomness in LLMs - Martynas Šubonis
[4] Why is deterministic output from LLMs nearly impossible? - Unstract

Infrastructure and Batching
#

[5] Defeating Nondeterminism in LLM Inference - Horace He, Thinking Machines Lab


Photo by Pavel Danilyuk: https://www.pexels.com/photo/close-up-photo-of-casino-roulette-7594187/