
In the last few weeks, users have been complaining about ChatGPT inserting mention of goblins, gremlins, trolls and ogres, into their chats regardless of context. E.g. when debugging code ChatGPT refers to itself as a “goblin with a flashlight”.
Similarly, Claude users have been complaining that Claude incessantly attempts to get the user to go to bed and sleep.
To understand what is happening here I want to introduce the concept of LLM Generality vs Specificity:
An LLM with high generality can do many different tasks e.g. Chat, code, write poetry but to an average level.
An LLM with high specificity can do one of these tasks to an expert level.
A long-standing problem in data science is that models cannot have both high generality and specificity. Tune one up and the other suffers. In one of the above examples, OpenAI introduced personalities to chats (specificity), one of them being “nerdy” and it has leaked into other aspects of model use like coding.
Anthropic, on the other hand is trying to add temporal awareness to Claude’s skills. It provides the model your computer’s datetime with every message you send. An example of generality. Now it is trying to use this new capability during chats by predicting when the user should be tired.
To understand how this is occurring you need to understand the model training process. Currently, at the AI labs, there are thousands of model checkpoints but only the “best” are released to the public e.g. (Opus 4.7 or Codex 5.5).
Hundreds of millions of hours of training have been put into getting the models to the initial checkpoint before others are made. Each checkpoint is a result of a researcher curating a dataset (e.g. writing 1000’s of nerdy goblin filled conversations) and training the model on this, effectively adding or deepening a capability. The researcher then runs their new checkpoint through the suite of evaluations (e.g. can it count the number of r’s in strawberry) and an overall behaviour profile and score is produced. A judgement call is made as to whether this checkpoint is good enough to be released to the public. You best believe there are sleepy goblin evaluations added to their test suites now.
What many don’t realise is that you can create your own checkpoints. If you want to tune generality vs specificity for your business task, the labs offer this as a service. This is sometimes one of the best ways to increase accuracy for specific internal use-cases. However, you need to know what you’re doing as it is a relatively costly venture.