Strange behavior
So: What did they find? Anthropic looked at 10 different behaviors in Claude. One involved the use of different languages. Does Claude have a part that speaks French and another that speaks Chinese, and so on?
The team found that Claude used components independent of any language to answer a question or solve a problem, then picked a specific language when it replied. Ask it "What is the opposite of small?" in English, French, and Chinese, and Claude will first use language-neutral components related to "smallness" and "opposites" to come up with an answer. Only then does it pick a specific language in which to reply. This suggests that large language models can learn things in one language and apply them in other languages.
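You can try the prompt side of that experiment yourself with the Anthropic Python SDK. The sketch below just sends the same question in three languages and prints the replies; the model alias and the translated prompts are illustrative, and the interpretability probing itself is not something the public API exposes.

```python
from anthropic import Anthropic

client = Anthropic()  # expects ANTHROPIC_API_KEY in the environment

# The same question in three languages (illustrative translations).
prompts = {
    "English": "What is the opposite of small?",
    "French": "Quel est le contraire de petit ?",
    "Chinese": "小的反义词是什么？",
}

for language, prompt in prompts.items():
    reply = client.messages.create(
        model="claude-3-5-sonnet-latest",  # assumed model alias
        max_tokens=50,
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"{language}: {reply.content[0].text}")
```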
Anthropic also looked at how Claude solved simple math problems. The team found that the model seems to have developed its own internal strategies, unlike anything it would have seen in its training data. Ask Claude to add 36 and 59 and the model will go through a series of steps, including first adding a selection of approximate values (add 40ish and 60ish, add 57ish and 36ish). Toward the end of its process, it comes up with the value 92ish. Meanwhile, another sequence of steps focuses on the last digits, 6 and 9, and determines that the answer must end in a 5. Putting that together with 92ish gives the correct answer of 95.
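To make that two-path idea concrete, here is a rough Python sketch of the description above. It is only an illustration; the function names and the fuzziness are made up and have nothing to do with Anthropic's actual analysis or what goes on inside Claude.

```python
import random

def fuzzy_estimate(a: int, b: int) -> int:
    """Rough-magnitude path: a deliberately sloppy estimate of the sum.
    For 36 + 59 this lands somewhere around '92ish'."""
    return a + b + random.randint(-4, 4)

def last_digit_of_sum(a: int, b: int) -> int:
    """Precise path: only the final digits matter; 6 + 9 ends in 5."""
    return (a % 10 + b % 10) % 10

def combine_paths(a: int, b: int) -> int:
    """Snap the fuzzy estimate to the nearest number whose last digit
    matches the precise path: 92ish plus 'ends in 5' gives 95."""
    estimate = fuzzy_estimate(a, b)
    digit = last_digit_of_sum(a, b)
    base = estimate // 10 * 10 + digit
    return min((base - 10, base, base + 10), key=lambda n: abs(n - estimate))

print(combine_paths(36, 59))  # 95
```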
And yet, if you then ask Claude how it worked that out, it will say something like: "I added the ones (6 + 9 = 15), carried the 1, then added the 10s (3 + 5 + 1 = 9), resulting in 95." In other words, it gives you the common approach found everywhere online rather than what it actually did. Yep! LLMs are weird. (And not to be trusted.)
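For comparison, the explanation Claude gives matches ordinary column addition with a carry; in code, that looks more like this (again, just an illustration for two-digit numbers):

```python
def column_addition(a: int, b: int) -> int:
    """The textbook method Claude claims to use: add the ones digits,
    carry if needed, then add the tens digits plus the carry."""
    ones = a % 10 + b % 10                 # 6 + 9 = 15
    carry, ones_digit = divmod(ones, 10)   # carry the 1, keep the 5
    tens = a // 10 + b // 10 + carry       # 3 + 5 + 1 = 9
    return tens * 10 + ones_digit          # 95

print(column_addition(36, 59))  # 95
```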

This is clear evidence that large language models will give reasons for what they do that do not necessarily reflect what they actually did. But that is true for people too, says Batson: "You ask somebody, 'Why did you do that?' And they go, 'Hmm, I guess it's because I was ...' You know, maybe not."
Biran thinks this finding is especially interesting. Many researchers study the behavior of large language models by asking them to explain their actions. But that could be a risky approach, he says: "As models continue to get stronger, they must be equipped with better guardrails. I believe, and this work also shows, that relying only on the model's outputs is not enough."
A third task that Anthropic studied was writing poems. The researchers wanted to know whether the model really did just wing it, predicting one word at a time. Instead, they found that Claude somehow looked ahead, choosing the word at the end of the next line several words in advance.
For example, when Claude was given the prompt "A rhyming couplet: He saw a carrot and had to grab it," the model replied: "His hunger was like a starving rabbit." But using their microscope, they saw that Claude had already hit upon the word "rabbit" while it was still processing "grab it." It then seemed to write the next line with that ending already in place.
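To make the distinction concrete, here is a deliberately silly Python toy. It is not how Claude works; it just contrasts "settle on the rhyme word first, then write toward it" with pure word-by-word generation (the rhyme table and the lines are made up).

```python
import random

# Hypothetical rhyme lookup for the couplet in the article.
RHYMES = {"grab it": ["rabbit", "habit"]}

def plan_then_write(first_line_ending: str) -> str:
    """Planning behaviour: pick the rhyme word first, then write the rest
    of the line so that it leads up to that ending."""
    ending = RHYMES[first_line_ending][0]   # e.g. "rabbit"
    lead_up = {"rabbit": "His hunger was like a starving",
               "habit": "Snacking on veggies was his favourite"}[ending]
    return f"{lead_up} {ending}"

def word_by_word(first_line_ending: str) -> str:
    """Pure next-word prediction with no plan: the final word is chosen
    last, so the rhyme only works out if you get lucky."""
    line = ["His", "hunger", "was", "like", "a", "starving"]
    line.append(random.choice(["wolf", "rabbit", "lion"]))  # may not rhyme
    return " ".join(line)

print(plan_then_write("grab it"))   # always ends in a rhyme
print(word_by_word("grab it"))      # only sometimes ends in a rhyme
```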