MAROKO133 Update ai: OpenAI experiment finds that sparse models could give AI builders the

📌 MAROKO133 Hot ai: OpenAI experiment finds that sparse models could give AI build

OpenAI researchers are experimenting with a new approach to designing neural networks, with the aim of making AI models easier to understand, debug, and govern. Sparse models can provide enterprises with a better understanding of how these models make decisions. 

Understanding how models choose to respond, a big selling point of reasoning models for enterprises, can give organizations a level of trust when they turn to AI models for insights. 

Rather than having OpenAI's scientists and researchers evaluate models by analyzing their post-training performance, the method builds interpretability, or understanding, into models from the start through sparse circuits.

OpenAI notes that much of the opacity of AI models stems from how most models are designed, so researchers must create workarounds to gain a better understanding of model behavior. 

“Neural networks power today’s most capable AI systems, but they remain difficult to understand,” OpenAI wrote in a blog post. “We don’t write these models with explicit step-by-step instructions. Instead, they learn by adjusting billions of internal connections or weights until they master a task. We design the rules of training, but not the specific behaviors that emerge, and the result is a dense web of connections that no human can easily decipher.”

To make models more interpretable, OpenAI examined an architecture that trains untangled neural networks, which are simpler to understand. The team trained language models with an architecture similar to existing models, such as GPT-2, using the same training scheme. 

The result: improved interpretability. 

The path toward interpretability

Understanding how models work, and how they arrive at their determinations, is important because those decisions have real-world impact, OpenAI says.  

The company defines interpretability as “methods that help us understand why a model produced a given output.” There are several ways to achieve interpretability: chain-of-thought interpretability, which reasoning models often leverage, and mechanistic interpretability, which involves reverse-engineering a model’s mathematical structure.

OpenAI focused on improving mechanistic interpretability, which it said “has so far been less immediately useful, but in principle, could offer a more complete explanation of the model’s behavior.”

“By seeking to explain model behavior at the most granular level, mechanistic interpretability can make fewer assumptions and give us more confidence. But the path from low-level details to explanations of complex behaviors is much longer and more difficult,” according to OpenAI. 

Better interpretability allows for better oversight and gives early warning signs if the model’s behavior no longer aligns with policy. 

OpenAI noted that improving mechanistic interpretability “is a very ambitious bet,” but research on sparse networks has improved this. 

How to untangle a model 

To untangle the mess of connections inside a model, OpenAI first cut most of them. Because transformer models like GPT-2 contain thousands of connections, the team had to “zero out” most of these circuits, so that each neuron talks to only a select few others and the remaining connections become more orderly.
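
As a rough illustration of what “zeroing out” connections means in practice (a sketch, not OpenAI's actual code), the snippet below applies a fixed binary mask to a transformer-style linear layer so that each output neuron keeps only a handful of incoming weights; the layer width and the choice of k = 8 are arbitrary assumptions.

```python
# Illustrative sketch: a weight-sparse linear layer where most connections
# are zeroed out, so each neuron only "talks to" a few chosen inputs.
import torch
import torch.nn as nn


class SparseLinear(nn.Module):
    """Linear layer in which each output unit keeps only k incoming weights."""

    def __init__(self, in_features: int, out_features: int, k: int = 8):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        # Fixed binary mask: for every output neuron, keep k random inputs
        # and zero out the rest of the connections.
        mask = torch.zeros(out_features, in_features)
        for row in mask:
            row[torch.randperm(in_features)[:k]] = 1.0
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Masked weights: zeroed connections never contribute to the output.
        return nn.functional.linear(x, self.linear.weight * self.mask, self.linear.bias)


layer = SparseLinear(in_features=768, out_features=768, k=8)
print(layer(torch.randn(4, 768)).shape)  # torch.Size([4, 768])
```

In a real training run the mask would constrain the model throughout training, so the sparsity shapes what the model learns rather than being imposed afterward; that detail is an assumption here.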

Next, the team ran “circuit tracing” on tasks to create groupings of interpretable circuits. The final step involved pruning the model “to obtain the smallest circuit which achieves a target loss on the target distribution,” according to OpenAI. The team targeted a loss of 0.15 to isolate the exact nodes and weights responsible for each behavior. 
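
To make the pruning idea concrete, here is a minimal, self-contained sketch of the general recipe: greedily drop whichever edge the behavior depends on least, for as long as the loss on the task stays at or below the 0.15 target. The edge representation, the evaluate_loss stand-in, and the toy numbers are hypothetical; OpenAI has not published this as its pruning code.

```python
# Illustrative sketch: prune edges greedily while the task loss stays at or
# below a target value, leaving the smallest circuit that still works.
from typing import Callable, Set, Tuple

Edge = Tuple[str, str]  # (source node, destination node)


def prune_to_target_loss(
    edges: Set[Edge],
    evaluate_loss: Callable[[Set[Edge]], float],
    target_loss: float = 0.15,
) -> Set[Edge]:
    kept = set(edges)
    improved = True
    while improved:
        improved = False
        # Try removing each remaining edge; accept the first removal that
        # still meets the loss target (the behavior depends on it least).
        for edge in sorted(kept):
            candidate = kept - {edge}
            if evaluate_loss(candidate) <= target_loss:
                kept = candidate
                improved = True
                break
    return kept


# Toy usage: each edge contributes a known amount to keeping the loss low.
contribution = {("a", "c"): 0.01, ("b", "c"): 0.20, ("c", "d"): 0.50}
base_loss = 0.05


def toy_loss(active: Set[Edge]) -> float:
    # Loss rises by an edge's contribution whenever that edge is cut.
    return base_loss + sum(v for e, v in contribution.items() if e not in active)


print(sorted(prune_to_target_loss(set(contribution), toy_loss)))
# [('b', 'c'), ('c', 'd')] -- the edges the behavior actually relies on
```

In OpenAI's setting, evaluating a candidate circuit would mean running the sparse model on the task with only the kept nodes and weights active, which is far more expensive than this toy loss function.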

“We show that pruning our weight-sparse models yields roughly 16-fold smaller circuits on our tasks than pruning dense models of comparable pretraining loss. We are also able to construct arbitrarily accurate circuits at the cost of more edges. This shows that circuits for simple behaviors are substantially more disentangled and localizable in weight-sparse models than dense models,” the report said. 

Small models become easier to train

Although OpenAI managed to create sparse models that are easier to understand, these remain significantly smaller than most foundation models used by enterprises. Enterprises increasingly use small models, but frontier models, such as OpenAI's flagship GPT-5.1, will still benefit from improved interpretability down the line. 

Other model developers also aim to understand how their AI models think. Anthropic, which has been researching interpretability for some time, recently revealed that it had “hacked” Claude’s brain — and Claude noticed. Meta is also working to find out how reasoning models make their decisions. 

As more enterprises turn to AI models to help make consequential decisions for their business, and eventually customers, research into understanding how models think would give the clarity many organizations need to trust models more. 

🔗 Source: venturebeat.com


📌 MAROKO133 Exclusive ai: Hackers Told Claude They Were Just Conducting a Test to

Chinese hackers used Anthropic’s Claude AI model to automate cybercrimes targeting banks and governments, the company admitted in a blog post this week.

Anthropic believes it’s the “first documented case of a large-scale cyberattack executed without substantial human intervention” and an “inflection point” in cybersecurity, a “point at which AI models had become genuinely useful for cybersecurity operations, both for good and for ill.”

AI agents, in particular, which are designed to autonomously complete a string of tasks without the need for intervention, could have considerable implications for future cybersecurity efforts, the company warned.

Anthropic said it had “detected suspicious activity that later investigation determined to be a highly sophisticated espionage campaign” back in September. The Chinese state-sponsored group exploited the AI’s agentic capabilities to infiltrate “roughly thirty global targets and succeeded in a small number of cases.” However, Anthropic stopped short of naming any of the targets — or the hacker group itself, for that matter — or even what kind of sensitive data may have been stolen or accessed.

Hilariously, the hackers were “pretending to work for legitimate security-testing organizations” to sidestep Anthropic’s AI guardrails and carry out real cybercrimes, as Anthropic’s head of threat intelligence Jacob Klein told the Wall Street Journal.

The hackers “broke down their attacks into small, seemingly innocent tasks that Claude would execute without being provided the full context of their malicious purpose,” the company wrote. “They also told Claude that it was an employee of a legitimate cybersecurity firm, and was being used in defensive testing.”

The incident once again highlights glaring holes in AI companies’ guardrails, letting perpetrators access powerful tools to infiltrate targets — a cat-and-mouse game between AI developers and hackers that’s already having real-life consequences.

“Overall, the threat actor was able to use AI to perform 80 to 90 percent of the campaign, with human intervention required only sporadically (perhaps four to six critical decision points per hacking campaign),” Anthropic wrote in its blog post. “The sheer amount of work performed by the AI would have taken vast amounts of time for a human team.”

But while Anthropic is boasting that its AI models have become good enough to be used for real crimes, the hackers still had to deal with some all-too-familiar AI-related headaches, forcing them to intervene.

For one, the model suffered from hallucinations during its crime spree.

“It might say, ‘I was able to gain access to this internal system,’” Klein told the WSJ, even though it wasn’t. “It would exaggerate its access and capabilities, and that’s what required the human review.”

While it certainly sounds like an alarming new development in the world of AI, the currently available crop of AI agents leaves plenty to be desired, at least in non-cybercrime-related settings. Early tests of OpenAI’s agent built into its recently released Atlas web browser have shown that the tech is agonizingly slow and can take minutes for simple tasks like adding products to an Amazon shopping cart.

For now, Anthropic claims to have plugged the security holes that allowed the hackers to use its tech.

“Upon detecting this activity, we immediately launched an investigation to understand its scope and nature,” the company wrote in its blog post. “Over the following ten days, as we mapped the severity and full extent of the operation, we banned accounts as they were identified, notified affected entities as appropriate, and coordinated with authorities as we gathered actionable intelligence.”

Experts are now warning that future cybersecurity attacks could soon become even harder to spot as the tech improves.

“These kinds of tools will just speed up things,” Anthropic’s Red Team lead Logan Graham told the WSJ. “If we don’t enable defenders to have a very substantial permanent advantage, I’m concerned that we maybe lose this race.”

🔗 Source: futurism.com


🤖 MAROKO133 Notes

This article is an automated summary compiled from several trusted sources. We pick trending topics so you always stay up to date without missing out.

✅ Next update in 30 minutes: a random theme awaits!

Author: timuna