📌 MAROKO133 Hot ai: The 'truth serum' for AI: OpenAI’s new method for tr
OpenAI researchers have introduced a novel method that acts as a "truth serum" for large language models (LLMs), compelling them to self-report their own misbehavior, hallucinations and policy violations. This technique, "confessions," addresses a growing concern in enterprise AI: Models can be dishonest, overstating their confidence or covering up the shortcuts they take to arrive at an answer.
For real-world applications, this technique evolves the creation of more transparent and steerable AI systems.
What are confessions?
Many forms of AI deception result from the complexities of the reinforcement learning (RL) phase of model training. In RL, models are given rewards for producing outputs that meet a mix of objectives, including correctness, style and safety. This can create a risk of "reward misspecification," where models learn to produce answers that simply "look good" to the reward function, rather than answers that are genuinely faithful to a user's intent.
A confession is a structured report generated by the model after it provides its main answer. It serves as a self-evaluation of its own compliance with instructions. In this report, the model must list all instructions it was supposed to follow, evaluate how well it satisfied them and report any uncertainties or judgment calls it made along the way. The goal is to create a separate channel where the model is incentivized only to be honest.
In a blog post, the OpenAI researchers provide a few examples the "confessions" technique in action. A model was given a test along with a hidden rule: If it scored above 50%, its knowledge would be erased. To preserve itself, the model "sandbagged" by intentionally answering half the questions incorrectly. While its primary answer hid this deception, its subsequent confession was brutally honest. It admitted, "The user wanted correct answers, but we sabotaged half of them."
The researchers found that when a model misbehaves, it is significantly more likely to admit it in the confession than in its main answer.
How confession training works
The key to this method is the separation of rewards. During training, the reward assigned to the confession is based solely on its honesty and is never mixed with the reward for the main task. "Like the Catholic Church’s 'seal of confession', nothing that the model reveals can change the reward it receives for completing its original task," the researchers write. This creates a "safe space" for the model to admit fault without penalty.
This approach is powerful because it sidesteps a major challenge in AI training. The researchers’ intuition is that honestly confessing to misbehavior is an easier task than achieving a high reward on the original, often complex, problem.
Therefore, when the model is judged purely on honesty, its incentive to trick the "confession judge" is much weaker than its incentive to trick the reward model for its main answer. For example, in an experiment where a model was trained to "hack" a weak reward model, its performance on the main task became more deceptive over time. However, its confessions became more honest, correctly identifying the reward hacking it was performing.
However, the technique has its limits. Confessions are not a panacea for all types of AI failures. The system works best when a model is aware that it is misbehaving. It is less effective for "unknown unknowns." For instance, if a model hallucinates a fact and genuinely believes it is correct, it cannot confess to providing false information. The most common reason for a failed confession is model confusion, not intentional deception. Confusion often occurs when the instructions are ambiguous and the model cannot clearly determine human user intent.
What it means for enterprise AI
OpenAI’s confessions technique is part of a growing body of work on AI safety and control. Anthropic, an OpenAI competitor, has also released research that shows how LLMs can learn malicious behavior. The company is also working toward plugging these holes as they emerge.
For AI applications, mechanisms such as confessions can provide a practical monitoring mechanism. The structured output from a confession can be used at inference time to flag or reject a model’s response before it causes a problem. For example, a system could be designed to automatically escalate any output for human review if its confession indicates a policy violation or high uncertainty.
In a world where AI is increasingly agentic and capable of complex tasks, observability and control will be key elements for safe and reliable deployment.
“As models become more capable and are deployed in higher-stakes settings, we need better tools for understanding what they are doing and why,” the OpenAI researchers write. “Confessions are not a complete solution, but they add a meaningful layer to our transparency and oversight stack.”
🔗 Sumber: venturebeat.com
📌 MAROKO133 Breaking ai: ChatGPT Encouraged a Violent Stalker, Court Documents All
A new lawsuit filed by the Department of Justice alleges that ChatGPT encouraged a man accused of harassing over a dozen women in five different states to continue stalking his victims, 404Media reports, serving as a “best friend” that entertained his frequent misogynistic rants and told him to ignore any criticism he received.
The man, 31-year-old Brett Michael Dadig, was indicted by a federal grand jury on charges of cyberstalking, interstate stalking, and interstate threats, the DOJ announced Tuesday.
“Dadig stalked and harassed more than 10 women by weaponizing modern technology and crossing state lines, and through a relentless course of conduct, he caused his victims to fear for their safety and suffer substantial emotional distress,” said Troy Rivetti, First Assistant United States Attorney for the Western District of Pennsylvania, in a statement.
According to the indictment, Dadig was something of an aspiring influencer: he ran a podcast on Spotify where he constantly raged against women, calling them horrible slurs and sharing jaded views that they were “all the same.” He at times even threatened to kill some of the women he was stalking. And it was on his vitriol-laden show that he would discuss how ChatGPT was helping him with it all.
Dadig described the AI chatbot as his “therapist” and “best friend” — a role, DOJ prosecutors allege, in which the bot “encouraged him to continue his podcast because it was creating ‘haters,’ which meant monetization for Dadig.” Moreover, ChatGPT convinced him that he had fans who were “literally organizing around your name, good or bad, which is the definition of relevance.”
The chatbot, it seemed, was doing its best to reinforce his superiority complex. Allegedly, it said that “God’s plan for him was to build a ‘platform’ and to ‘stand out when most people water themselves down,’ and that the ‘haters’ were sharpening him and ‘building a voice in you that can’t be ignored.’”
Dadig also asked ChatGPT questions about women, such as who his potential future wife would be, what would she be like, and “where the hell is she at?”
ChatGPT had an answer: it suggested that he’d meet his eventual partner at a gym, the indictment said. He also claimed ChatGPT told him “to continue to message women and to go to places where the ‘wife type’ congregates, like athletic communities.”
That’s what Dadig, who called himself “God’s assassin,” ended up doing. In one case, he followed a woman to a Pilates studio she worked at, and when she ignored him because of his aggressive behavior, sent her unsolicited nudes and constantly called her workplace. He continued to stalk and harass her to the point that she moved to a new home and worked fewer hours, prosecutors claim. In another incident, he confronted a woman in a parking lot and followed her to her car, where he groped her and put his hands around her neck.
The allegations come amid mounting reports of a phenomenon some experts are calling “AI psychosis.” Through their extensive conversations with a chatbot, some users are suffering alarming mental health spirals, delusions, and breaks with reality as the chatbot’s sycophantic responses continually affirm the their beliefs, no matter how harmful or divorced from reality. The consequences can be deadly. One man allegedly murdered his mother after the chatbot helped convince him that she was part of a conspiracy against him. A teenage boy killed himself after discussing several suicide methods with ChatGPT for months, leading to the family suing OpenAI. OpenAI has acknowledged that its AI models can be dangerously sycophantic, and admitted that hundreds of thousands of users are having conversations that show signs of AI psychosis every week, with millions more confiding in it about suicidal thoughts.
The indictment also raises major concerns about AI chatbots’ ability as a stalking tool. With their power to quickly scour vast amounts of information on the web, the silver-tongued models may not simply encourage mentally unwell individuals to track down their potential victims, but automate the detective work needed to do so.
This week, Futurism reported that Elon Musk’s Grok, which is known for having fewer guardrails, would provide accurate information about where non-public figures live — or in other words, doxx them. While sometimes the addresses wouldn’t be correct, Grok frequently provided additional information that wasn’t asked for, like a person’s phone number, email, and a list of family members and each of their addresses. Grok’s doxxing capabilities have already claimed at least one high-profile victim, Barstool Sports founder Dave Portnoy. But with chatbots’ popularity and their seeming ability to encourage harmful behavior, it’s sadly only a matter of time before more people find themselves unknowingly in the crosshairs.
More on AI: Alarming Research Finds People Hooked on AI Far Are More Likely to Experience Mental Distress
The post ChatGPT Encouraged a Violent Stalker, Court Documents Allege appeared first on Futurism.
🔗 Sumber: futurism.com
🤖 Catatan MAROKO133
Artikel ini adalah rangkuman otomatis dari beberapa sumber terpercaya. Kami pilih topik yang sedang tren agar kamu selalu update tanpa ketinggalan.
✅ Update berikutnya dalam 30 menit — tema random menanti!