How Good Is an LLM's "Memory"?

Large Language Models (LLMs) like ChatGPT, Claude, and Gemini have brilliant memories of the text they were trained on. Wikipedia, Stack Exchange, GitHub: they'll tell you everything they've learnt from these sources.

But what about everything else, where up-to-date information, internal company data, or even knowledge of your most recent conversation is needed? That has to be provided to the chatbot at run-time.

Here’s the ugly secret: each time you hit “send,” the bot essentially wakes up from suspended animation. If your question isn’t already baked into its current knowledge, a second system, a “retriever,” scrambles to fetch the missing facts and whispers them into the model’s digital ear.
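To make that hand-off concrete, here’s a minimal sketch of the retrieve-then-generate loop. The `retriever` and `llm` objects are hypothetical stand-ins for whatever search index and model API a real system uses; none of this is a specific library’s interface.

```python
# A sketch of the retrieve-then-generate hand-off. `retriever` and `llm`
# are hypothetical stand-ins, not a real library's API.

def answer(question: str) -> str:
    # 1. A separate system searches an external store for relevant snippets.
    snippets = retriever.search(question, top_k=3)  # hypothetical retriever

    # 2. The snippets get pasted into the prompt next to the question.
    prompt = (
        "Answer using only the context below.\n\n"
        "Context:\n" + "\n".join(snippets) + "\n\n"
        "Question: " + question
    )

    # 3. The stateless model sees nothing beyond this single prompt;
    #    if the right fact isn't in `snippets`, it can only guess.
    return llm.generate(prompt)  # hypothetical model client
```

Everything the model “knows” about your situation has to squeeze through that one prompt, every single turn.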

That hand-off is where hallucinations can hatch, and it boils down to two inherent flaws:

1. Garbage-In, Garbage-Out Context

Like a Google search, the retriever only works if your prompt contains words that match the right documents. Users rarely treat chatbots like search bars, so the algorithm rummages through a haystack of hints. Shockingly, it misses needles.
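To see the mismatch in action, here’s a toy retriever built on TF-IDF keyword overlap (the documents and queries are invented for illustration). A keyword-rich query finds the right document; a natural, conversational phrasing of the same need shares no terms with it and comes back empty-handed:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Q3 refund policy: customers may return hardware within 30 days.",
    "Office Wi-Fi troubleshooting guide for the Berlin site.",
    "2024 travel expense limits for client-facing staff.",
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(docs)

def top_doc(query: str) -> str:
    # Score each document by keyword overlap with the query.
    scores = cosine_similarity(vectorizer.transform([query]), doc_matrix)[0]
    return docs[scores.argmax()] if scores.max() > 0 else "(no match found)"

# Keyword-rich query: shares terms with the right document.
print(top_doc("refund policy hardware return"))
# -> "Q3 refund policy: customers may return hardware within 30 days."

# Conversational query: same intent, zero overlapping keywords.
print(top_doc("someone wants their money back, what do I tell them?"))
# -> "(no match found)"
```

Production systems soften this with embeddings and query rewriting, but the dependence on how the user happens to phrase things never fully disappears.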

2. Long-Context Loss

Suppose the retriever is a rock star and dumps pages of gold into the context window. Now the LLM must rifle through it. Despite marketing hype about “million-token windows,” mainstream models start to falter past 8k–16k tokens (roughly 12–24 pages). Benchmarks such as Fiction.liveBench peg their accuracy at around a coin-flip once you hit that length. And remember: most of that window is already clogged with system instructions and your chat history. Good luck squeezing real updates in there.
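Some back-of-the-envelope arithmetic shows how little room is actually left for fresh facts. Every number below is an illustrative assumption, not a vendor spec:

```python
# Back-of-the-envelope context budget. All figures are assumptions
# chosen to illustrate the squeeze, not measurements of any model.

CONTEXT_WINDOW = 16_000   # assumed effective window before recall degrades
system_prompt  = 1_500    # tool definitions, persona, safety rules
chat_history   = 9_000    # a long-running conversation
reply_reserve  = 1_000    # space the model needs for its own answer

room_for_retrieval = CONTEXT_WINDOW - system_prompt - chat_history - reply_reserve
pages = room_for_retrieval / 650  # ~650 tokens per page, a rough rule of thumb

print(f"{room_for_retrieval} tokens left for retrieved documents (~{pages:.0f} pages)")
# -> 4500 tokens left for retrieved documents (~7 pages)
```

Under those assumptions, the retriever’s “pages of gold” get rationed down to a handful before the model even starts reading.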

We keep expecting LLMs to act like omniscient coworkers, but today they’re closer to interns with photographic memories of 2023, struggling to recall what happened yesterday. Until retrieval gets as smart as generation, and models learn to juggle truly massive contexts without dropping half the tokens on the floor, hallucinations are here to stay.

The good news: the best minds are working on this problem. There’s been some recent success, but nothing completely game-changing. Between concepts like Windsurf’s “Shared Timeline”, Self-RAG, CRAG, and CAG, there’s a myriad of ongoing projects.

Our opinion: another ChatGPT-level breakthrough is required to fix the problem.

Measure How Much Productivity You Could Gain With Our Calculator

Our productivity calculator reveals how much Traffyk could save your business, and how much your productivity could improve, when inefficient workforce communication is reduced.