Why Your AI Gets Worse the Longer You Talk to It

Most people have had this experience: you are deep into a long AI session, things are going well, and then the model starts contradicting itself, forgetting instructions, or producing answers that feel less precise. It can feel like the AI is getting tired. In reality, what you are seeing is usually a context-management problem, not fatigue.

The Transcript Problem

Modern chat models are broadly stateless from one turn to the next. They do not "remember" your conversation the way a person does. Instead, each new response is generated from the information supplied in the current request, which typically includes system instructions, recent messages, tool results, and other context assembled by the product.

A useful mental model is a consultant who rereads the case file before every call. The consultant may sound continuous and informed, but the continuity comes from repeatedly receiving the file, not from human-style memory. That is roughly how chat systems work in practice.

That file is usually called the context window. It has a finite size, and model quality can degrade as the context grows longer and denser. Anthropic's documentation explicitly notes that performance can decline in very long contexts, including weaker recall and lower accuracy on earlier details.

The compounding effect is real. Your prompts may be short, but model responses are often long, and tool use can add substantial volume quickly. File contents, search results, code output, and generated summaries all consume room in the active context.

Context window — how it fills

Drag the slider to see how context accumulates across a session and where reliability starts to slip.

Start of session Window limit

Turns 2 turns

Clear — full reliability

Caution — drift risk

Critical — reset recommended

The Tools Are Not All the Same

Different AI products manage context differently, which is why they can feel different even when they rely on similar underlying mechanics. The shared pattern is simple: finite context, selective inclusion of prior material, and tradeoffs between continuity, speed, and reliability.

Claude and ChatGPT both operate with finite context windows, but the exact limits depend on the model and product surface. Anthropic now supports up to 1 million tokens for newer models such as Claude Opus 4.8 and Sonnet 4.6 in the API, while other Claude models still use 200,000-token windows. OpenAI also treats memory as a separate layer from the active context window. Saved memories can personalize future chats, but they do not eliminate the model's finite per-session context constraints.

Claude Code is different mainly because of workload shape. It may read files, inspect outputs, and summarize intermediate state while working through a task, so it can consume context much faster than a normal back-and-forth chat. Anthropic also documents compaction, which compresses prior interaction into a summary so a session can continue with more headroom, though that tradeoff can lose detail.

Cowork sits in a different category again. Claude Cowork is Anthropic's desktop agent for multi-step knowledge work: you describe an outcome, grant access to the files or apps it needs, and it works through the task on your behalf. Context is managed at the task level rather than as one endless chat thread, so each task acts more like a self-contained working session with its own instructions, intermediate state, and deliverables.

Tool or surface	Context notes	What matters in practice
Claude chat	Depends on model. Some Claude models support 200k tokens. Newer API models support up to 1M tokens.	Long, focused sessions work better than sprawling ones.
Claude Code	Same basic finite-context mechanic, but tool activity can add context rapidly.	Coding sessions can hit context pressure sooner than plain chat.
Cowork	Task-scoped session in Anthropic's desktop agent. Context accumulates within a task rather than across one giant ongoing conversation.	Best for outcome-oriented file and workflow tasks where the work has a clear beginning and end.
ChatGPT	Finite context window plus a separate memory layer for saved preferences or facts.	Memory can help personalization, but it does not replace active context management.

What Goes Wrong and Why

Two failures become more common in long sessions: instruction drift and factual slippage. As more material accumulates, the model has to balance more competing signals, and important constraints can become less salient than they were at the beginning.

Instruction drift

You define tone, structure, or constraints up front. The model gradually stops following them. The earlier instructions did not vanish. They lost effective weight inside a crowded context.

Factual slippage

Errors become more likely as context quality degrades. The model is juggling many prior turns, tool outputs, and partially relevant details. The safest claim is not that hallucinations always spike late, but that reliability can decline in long, overloaded sessions.

How to Work With This

Start each session with the most important context up front. Put the task, constraints, desired output format, and any must-remember facts near the beginning so they have the best chance of staying salient.

Keep sessions narrow. One session for one problem usually performs better than one giant thread covering planning, writing, debugging, research, and editing all at once.

When a session starts to feel off, do a handoff. Write a short summary of the goal, key decisions, unresolved questions, and constraints, then start a fresh session with that summary. In tools that support compaction, that process may be partly automated, but the same tradeoff applies: more headroom, less raw detail.

For multi-stage work, treat AI sessions like project workstreams. Close each session with a compact written state snapshot, then open the next one with that snapshot instead of dragging the entire history forward. The model will often perform better because it is reading a cleaner file, not a noisier one.

The Core Idea

The context window is not a bug so much as a structural property of current large-language-model systems. Better models, larger windows, and smarter product design can reduce the pain, but they do not remove the need to manage context well.

That is why experienced users often seem to get better results from the same tool. They are not just writing better prompts. They are managing scope, preserving signal, and resetting before the session becomes cluttered.

Part of the Proof of Value series from Busted Eye. Submit your details below if you want to talk through what this means for your team.

The Transcript Problem

Context window — how it fills

The Tools Are Not All the Same

What Goes Wrong and Why

Instruction drift

Factual slippage

How to Work With This

The Core Idea

Let's talk about your first 90 days.