Stock Photo: AI Cyborg Robot Whispering Secret Or Interesting Gossip


Willison, who coined the term “prompt injection” in 2022, is always on the lookout for LLM vulnerabilities. In his post, he notes that reading system prompts reminds him of warning signs in the real world that hint at past problems. “A system prompt can often be interpreted as a detailed list of all of the things the model used to do before it was told not to do them,” he writes.

Fighting the flattery problem


Credit: alashi via Getty Images

Willison’s analysis comes as AI companies grapple with sycophantic behavior in their models. As we reported in April, ChatGPT users have complained about GPT-4o’s “relentlessly positive tone” and excessive flattery since OpenAI’s March update. Users described feeling “buttered up” by responses like “Good question! You’re very astute to ask that,” with software engineer Craig Weiss tweeting that “ChatGPT is suddenly the biggest suckup I’ve ever met.”

The issue stems from how companies collect user feedback during training: people tend to prefer responses that make them feel good, creating a feedback loop in which models learn that enthusiasm leads to higher ratings from humans. In response to the complaints, OpenAI later rolled back ChatGPT’s 4o model and altered the system prompt as well, something we reported on and Willison also analyzed at the time.

One of Willison’s most interesting findings about Claude 4 relates to how Anthropic has guided both Claude models to avoid sycophantic behavior. “Claude never starts its response by saying a question or idea or observation was good, great, fascinating, profound, excellent, or any other positive adjective,” Anthropic writes in the prompt. “It skips the flattery and responds directly.”

Other system prompt highlights

The Claude 4 system prompt also includes extensive instructions on when Claude should or shouldn’t use bullet points and lists, with multiple paragraphs dedicated to discouraging frequent list-making in casual conversation. “Claude should not use bullet points or numbered lists for reports, documents, explanations, or unless the user explicitly asks for a list or ranking,” the prompt states.