I Stopped Asking the LLM to Remember Everything
A practical story about replacing full-history prompting with a small state machine for faster, calmer, more predictable LLM conversations.
The first version felt reasonable.
We had a short conversational flow. The user answered a few questions. The model decided what to ask next. At the end, the backend needed a clean little bundle of structured data.
So we did the obvious thing: send the full conversation to the model every time, add instructions like “ask one question at a time,” and trust the model to infer where we were.
For a while, this was fine.
Then the conversation got longer. The prompt got longer. The model got less consistent. The UX started doing that painful thing where it looked smart for three turns and then asked the user something they had already answered.
That was the moment the design flaw became obvious:
We were asking the model to be both the conversational layer and the source of truth.
Those should not be the same job.
[Interactive demo: transcript memory vs explicit state. In transcript-memory mode (roughly 2,730 tokens, 2.6s latency, 73% correct flow), the model rereads the whole chat, infers what has already happened, then guesses the next best question. The demo tracks three fields (goal captured, audience to ask next, timeline still missing) and previews the structured output that explicit-state mode returns.]
The cute prototype became prompt soup
The early implementation was simple:
User → LLM with full transcript → LLM infers state → response
Every request included the full chat, the same instructions, and a growing pile of contextual crumbs. The model had to reread the conversation, decide which fields were already collected, interpret the newest message, choose the next field, and phrase the next question nicely.
That is too much responsibility hiding inside a prompt.
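Compressed into code, that responsibility looked something like this. A sketch with illustrative names, not our exact code; call_llm stands in for whatever model client you use:

# The transcript-memory loop: the whole conversation is re-sent
# every turn, and the model must re-infer state from prose.
SYSTEM_INSTRUCTIONS = "Ask one question at a time. Do not repeat questions."

def call_llm(prompt: str) -> str:
    """Placeholder for your model client."""
    raise NotImplementedError

def ask_next_question(transcript: list[dict]) -> str:
    # The prompt grows with every turn: instructions + full history.
    prompt = SYSTEM_INSTRUCTIONS + "\n" + "\n".join(
        f"{turn['role']}: {turn['text']}" for turn in transcript
    )
    return call_llm(prompt)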
At first, the problems looked unrelated:
- responses got slower
- occasional requests timed out
- the assistant repeated questions
- steps were skipped
- validation edge cases kept leaking into the prompt
But they had the same root cause. The model was reconstructing state from text every turn.
That meant every new instruction became another patch on top of an already fuzzy system:
“Ask one question at a time. Do not repeat questions. Skip fields the user already answered. Validate the new answer. If the answer is ambiguous, ask a follow-up. Do not mark the flow complete too early. Actually, also preserve a friendly tone.”
This is how prompts turn into policy junk drawers. Every bug adds a sentence. Every sentence adds a new edge case. Soon, nobody can say with confidence what the system will do.
The better split: model decides what to say, system decides what is true
The migration was not magical. It was mostly a responsibility cleanup.
Instead of sending the model the whole transcript and asking it to infer reality, we started sending explicit state:
{
  "user_input": "example user message",
  "state": {
    "goal": "launch a cleaner onboarding flow",
    "audience": null,
    "timeline": null
  }
}
And we asked for structured output:
{
  "message": "Who is this flow for?",
  "is_ready": false,
  "suggestions": ["Founders", "Internal platform users"],
  "next_field": "audience",
  "state": {
    "goal": "launch a cleaner onboarding flow",
    "audience": null,
    "timeline": null
  }
}
The important line is this:
The model decides what to ask next. The system decides what is true.
Once that boundary exists, the whole flow becomes easier to reason about. The backend owns validation, persistence, and the canonical state. The model owns the human part: wording, helpful suggestions, and graceful recovery when the user says something messy.
The new loop
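A minimal sketch of the loop, assuming a call_llm placeholder that returns the structured JSON shown above (field names match the earlier payloads; the field list is assumed for illustration):

import json

# Assumed required fields for this flow; yours will differ.
REQUIRED_FIELDS = ("goal", "audience", "timeline")

def call_llm(payload: dict) -> str:
    """Placeholder: send the payload to your model, get JSON text back."""
    raise NotImplementedError

def handle_turn(user_input: str, state: dict) -> dict:
    # The model sees only the latest input plus explicit state,
    # never the whole transcript.
    reply = json.loads(call_llm({"user_input": user_input, "state": state}))

    # The backend decides what is true: accept only known fields
    # from the model's proposed state.
    for field in REQUIRED_FIELDS:
        value = reply.get("state", {}).get(field)
        if value is not None:
            state[field] = value

    # Readiness is computed deterministically, never trusted.
    reply["is_ready"] = all(state.get(f) is not None for f in REQUIRED_FIELDS)
    reply["state"] = state
    return reply  # backend persists state, UI renders reply["message"]

Each turn, the request is just the newest input plus that compact state object, which is why the token counts stop growing.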
Why this made the product feel calmer
The most obvious win was speed. The payload stopped growing with every turn.
Instead of repeatedly sending the transcript, we sent the latest user input plus a compact state object. In this flow, average tokens per request dropped from roughly 2,100 to roughly 600. The largest requests went from 5,000+ tokens to around 1,100.
That translated directly into latency:
| Metric | Before | After |
|---|---|---|
| Average response time | 2.3s | 1.0s |
| P95 latency | 5.5s | 2.0s |
| Timeout rate | 5.8% | 0.7% |
But the more interesting win was predictability.
When the model had to infer state from a transcript, correctness depended on interpretation. When the backend passed explicit state, the model no longer needed to rediscover the plot every time.
- Before: about 76% of flows moved through the right steps without repeats or skips.
- After: about 95% of flows moved cleanly once the backend owned state.
- Result: repeated questions dropped from 15% to 2%, and invalid state updates fell below 1%.
The user does not care that your prompt is clever. They care that the assistant does not forget what they just said.
The architecture became boring in the best way
The old flow looked elegant in a diagram and chaotic in production:
User → LLM (full history) → LLM infers state → response
The new flow is slightly more mechanical, but much easier to operate:
User → Backend updates state → LLM (state + input) → structured response → Backend validates → persist
Those extra backend steps are the difference between “the model probably knows” and “the system knows.”
And once the system knows, you get normal engineering tools back (sketched after this list):
- schema validation
- deterministic transitions
- logs that explain what changed
- smaller prompts
- easier retries
- safer UI states
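For example, schema validation and transition rules can live in plain code instead of prompt text. A sketch matching the structured output above; the specific checks are illustrative:

# Reject malformed replies before they can corrupt state.
ALLOWED_KEYS = {"message", "is_ready", "suggestions", "next_field", "state"}

def validate_reply(reply: dict, state: dict) -> None:
    extra = set(reply) - ALLOWED_KEYS
    if extra:
        raise ValueError(f"unexpected keys: {extra}")
    if not isinstance(reply.get("message"), str):
        raise ValueError("message must be a string")
    # Deterministic transition rule: the model may not silently
    # erase a field the user already answered.
    for field, value in state.items():
        if value is not None and reply.get("state", {}).get(field) is None:
            raise ValueError(f"reply dropped collected field: {field}")

A failed check becomes a logged error and a cheap retry with the same compact payload, instead of another sentence bolted onto the prompt.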
The model still matters. It is just no longer the database, the reducer, the validator, and the copywriter all at once.
A small rule I now trust
If an LLM flow has required fields, completion criteria, validation, or branching, I do not want the model to infer the canonical state from prose.
I want a state object.
That state object can be tiny. It can be boring. It can be three nullable fields and a next step. The point is not ceremony. The point is ownership.
Design heuristic: make ambiguity the model's job and truth the system's job.
Let the model handle fuzzy language, tone, suggestions, and recovery. Let deterministic code handle known facts, allowed transitions, readiness, and persistence.
This is especially useful when the product needs to feel conversational but the business logic needs to be strict.
The model can say:
“Almost there. What would success look like?”
The system should know:
{
  "next_field": "success_criteria",
  "is_ready": false
}
That combination is the sweet spot. Friendly on the surface. Boring underneath.
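Deterministic readiness is a few lines of code, not a prompt instruction. A sketch; the field order here is assumed for illustration:

# System-owned truth: next_field and is_ready never come from prose.
FIELD_ORDER = ("goal", "audience", "timeline", "success_criteria")

def derive_progress(state: dict) -> dict:
    missing = [f for f in FIELD_ORDER if state.get(f) is None]
    return {
        "next_field": missing[0] if missing else None,
        "is_ready": not missing,
    }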
The lesson
The migration was not really from prompts to state machines. It was from vibes to boundaries.
Full-history prompting made the model reconstruct the past before it could decide the future. Explicit state let the backend carry the past, while the model focused on the next useful sentence.
That cut latency roughly in half, reduced failures dramatically, and made the conversation feel more stable.
The big lesson is simple:
Stop asking the model to infer state. Start providing it explicitly.
Your prompt gets shorter. Your flow gets calmer. Your users stop answering the same question twice.