📌 MAROKO133 Update AI: Which Agent Causes Task Failures and When? Researchers from
Meet the authors
Institutions: Penn State University, Duke University, Google DeepMind, University of Washington, Meta, Nanyang Technological University, and Oregon State University. The co-first authors are Shaokun Zhang of Penn State University and Ming Yin of Duke University.
In recent years, LLM Multi-Agent systems have garnered widespread attention for their collaborative approach to solving complex problems. However, it’s a common scenario for these systems to fail at a task despite a flurry of activity. This leaves developers with a critical question: which agent, at what point, was responsible for the failure? Sifting through vast interaction logs to pinpoint the root cause feels like finding a needle in a haystack—a time-consuming and labor-intensive effort.
This is a familiar frustration for developers. In increasingly complex Multi-Agent systems, failures are not only common but also incredibly difficult to diagnose due to the autonomous nature of agent collaboration and long information chains. Without a way to quickly identify the source of a failure, system iteration and optimization grind to a halt.
To address this challenge, researchers from Penn State University and Duke University, in collaboration with institutions including Google DeepMind, have introduced the novel research problem of “Automated Failure Attribution.” They have constructed the first benchmark dataset for this task, Who&When, and have developed and evaluated several automated attribution methods. This work not only highlights the complexity of the task but also paves a new path toward enhancing the reliability of LLM Multi-Agent systems.
The paper has been accepted as a Spotlight presentation at the top-tier machine learning conference, ICML 2025, and the code and dataset are now fully open-source.
Paper: https://arxiv.org/pdf/2505.00212
Code: https://github.com/mingyin1/Agents_Failure_Attribution
Dataset: https://huggingface.co/datasets/Kevin355/Who_and_When
Research Background and Challenges
LLM-driven Multi-Agent systems have demonstrated immense potential across many domains. However, these systems are fragile; errors by a single agent, misunderstandings between agents, or mistakes in information transmission can lead to the failure of the entire task.
Currently, when a system fails, developers are often left with manual and inefficient methods for debugging:
– Manual Log Archaeology: Developers must manually review lengthy interaction logs to find the source of the problem.
– Reliance on Expertise: The debugging process is highly dependent on the developer’s deep understanding of the system and the task at hand.
This “needle in a haystack” approach to debugging is not only inefficient but also severely hinders rapid system iteration and the improvement of system reliability. There is an urgent need for an automated, systematic method to pinpoint the cause of failures, effectively bridging the gap between “evaluation results” and “system improvement.”
Core Contributions
This paper makes several groundbreaking contributions to address the challenges above:
1. Defining a New Problem: The paper is the first to formalize “automated failure attribution” as a specific research task. This task is defined by identifying the failure-responsible agent and the decisive error step that led to the task’s failure.
2. Constructing the First Benchmark Dataset, Who&When: This dataset includes a wide range of failure logs collected from 127 LLM Multi-Agent systems, which were either algorithmically generated or hand-crafted by experts to ensure realism and diversity. Each failure log is accompanied by fine-grained human annotations for:
– Who: The agent responsible for the failure.
– When: The specific interaction step where the decisive error occurred.
– Why: A natural language explanation of the cause of the failure.
3. Exploring Initial “Automated Attribution” Methods: Using the Who&When dataset, the paper designs and assesses three distinct methods for automated failure attribution:
– All-at-Once: This method provides the LLM with the user query and the complete failure log, asking it to identify the responsible agent and the decisive error step in a single pass. While cost-effective, it may struggle to pinpoint precise errors in long contexts.
– Step-by-Step: This approach mimics manual debugging by having the LLM review the interaction log sequentially, making a judgment at each step until the error is found. It is more precise at locating the error step but incurs higher costs and risks accumulating errors.
– Binary Search: A compromise between the first two methods, this strategy repeatedly divides the log in half, using the LLM to determine which segment contains the error. It then recursively searches the identified segment, offering a balance of cost and performance.
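The three strategies above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Step` record and the three judge callables are hypothetical stand-ins for LLM calls, and the real prompts and log format may differ.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    agent: str    # name of the agent that acted at this step
    content: str  # what the agent said or did


# Hypothetical judge signatures; each one stands in for an LLM call.
WholeLogJudge = Callable[[str, List[Step]], Tuple[str, int]]  # returns (agent, step index)
StepJudge = Callable[[str, List[Step]], bool]                 # is the last step of this prefix the decisive error?
SegmentJudge = Callable[[str, List[Step], int, int], bool]    # does log[lo:hi] contain the decisive error?


def all_at_once(query: str, log: List[Step], judge: WholeLogJudge) -> Tuple[str, int]:
    """One call over the full log: cheapest, but the judge sees a long context."""
    return judge(query, log)


def step_by_step(query: str, log: List[Step], judge: StepJudge) -> Optional[Tuple[str, int]]:
    """Walk the log in order and stop at the first step judged to be the decisive error."""
    for i in range(len(log)):
        if judge(query, log[: i + 1]):
            return log[i].agent, i
    return None  # the judge never flagged a step


def binary_search(query: str, log: List[Step], judge: SegmentJudge) -> Tuple[str, int]:
    """Halve the candidate range until a single step remains."""
    if not log:
        raise ValueError("log must contain at least one step")
    lo, hi = 0, len(log)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if judge(query, log, lo, mid):
            hi = mid  # error is in the first half
        else:
            lo = mid  # otherwise assume it is in the second half
    return log[lo].agent, lo
```

The cost trade-off is visible in the number of judge calls: one for `all_at_once`, up to one per step for `step_by_step`, and roughly the base-2 logarithm of the log length for `binary_search`.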
Experimental Results and Key Findings
Experiments were conducted in two settings: one where the LLM knows the ground truth answer to the problem the Multi-Agent system is trying to solve (With Ground Truth) and one where it does not (Without Ground Truth). The primary model used was GPT-4o, though other models were also tested. The systematic evaluation of these methods on the Who&When dataset yielded several important insights:
– A Long Way to Go: Current methods are far from perfect. Even the best-performing single method achieved an accuracy of only about 53.5% in identifying the responsible agent and a mere 14.2% in pinpointing the exact error step. Some methods performed even worse than random guessing, underscoring the difficulty of the task.
– No “All-in-One” Solution: Different methods excel at different aspects of the problem. The All-at-Once method is better at identifying “Who,” while the Step-by-Step method is more effective at determining “When.” The Binary Search method provides a middle-ground performance.
– Hybrid Approaches Show Promise but at a High Cost: The researchers found that combining different methods, such as using the All-at-Once approach to identify a potential agent and then applying the Step-by-Step method to find the error, can improve overall performance. However, this comes with a significant increase in computational cost.
– State-of-the-Art Models Struggle: Surprisingly, even the most advanced reasoning models, like OpenAI o1 and DeepSeek R1, find this task challenging. This h…
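The two accuracy figures quoted above, one for the responsible agent and one for the error step, can be computed as in the sketch below. It assumes each level is scored independently by exact match against the human annotation; the `Attribution` tuple layout is hypothetical and the paper's evaluation script may differ.

```python
from typing import List, Tuple

Attribution = Tuple[str, int]  # (responsible agent, decisive error step); hypothetical layout


def attribution_accuracy(
    predictions: List[Attribution], labels: List[Attribution]
) -> Tuple[float, float]:
    """Return (agent-level accuracy, step-level accuracy) using exact match."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equally long")
    pairs = list(zip(predictions, labels))
    agent_hits = sum(pred[0] == gold[0] for pred, gold in pairs)
    step_hits = sum(pred[1] == gold[1] for pred, gold in pairs)
    return agent_hits / len(labels), step_hits / len(labels)
```

Scoring the two levels separately is what lets one method lead on "Who" while another leads on "When".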
Content automatically truncated.
🔗 Source: syncedreview.com
📌 MAROKO133 Hot AI: The creator of Claude Code just revealed his workflow, and dev
When the creator of the world's most advanced coding agent speaks, Silicon Valley doesn't just listen — it takes notes.
For the past week, the engineering community has been dissecting a thread on X from Boris Cherny, the creator and head of Claude Code at Anthropic. What began as a casual sharing of his personal terminal setup has spiraled into a viral manifesto on the future of software development, with industry insiders calling it a watershed moment for the startup.
"If you're not reading the Claude Code best practices straight from its creator, you're behind as a programmer," wrote Jeff Tang, a prominent voice in the developer community. Kyle McNease, another industry observer, went further, declaring that with Cherny's "game-changing updates," Anthropic is "on fire," potentially facing "their ChatGPT moment."
The excitement stems from a paradox: Cherny's workflow is surprisingly simple, yet it allows a single human to operate with the output capacity of a small engineering department. As one user noted on X after implementing Cherny's setup, the experience "feels more like Starcraft" than traditional coding — a shift from typing syntax to commanding autonomous units.
Here is an analysis of the workflow that is reshaping how software gets built, straight from the architect himself.
How running five AI agents at once turns coding into a real-time strategy game
The most striking revelation from Cherny's disclosure is that he does not code in a linear fashion. In the traditional "inner loop" of development, a programmer writes a function, tests it, and moves to the next. Cherny, however, acts as a fleet commander.
"I run 5 Claudes in parallel in my terminal," Cherny wrote. "I number my tabs 1-5, and use system notifications to know when a Claude needs input."
By utilizing iTerm2 system notifications, Cherny effectively manages five simultaneous work streams. One agent runs a test suite while another refactors a legacy module and a third drafts documentation. He also runs "5-10 Claudes on claude.ai" in his browser, using a "teleport" command to hand off sessions between the web and his local machine.
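The pattern of numbered sessions that work concurrently and signal when they need attention can be illustrated with a toy Python model. Everything here is hypothetical: `agent_session` only simulates work with a no-op await, and the queue stands in for system notifications. It does not call Claude Code or iTerm2.

```python
import asyncio
from typing import List, Tuple


async def agent_session(tab: int, task: str, inbox: asyncio.Queue) -> str:
    """Simulate one numbered session that works, then asks for input."""
    await asyncio.sleep(0)  # stand-in for the agent doing real work
    await inbox.put((tab, f"tab {tab} needs input on: {task}"))
    return f"tab {tab} finished: {task}"


async def run_fleet(tasks: List[str]) -> Tuple[List[str], List[Tuple[int, str]]]:
    """Run one session per task concurrently and collect their notifications."""
    inbox: asyncio.Queue = asyncio.Queue()
    sessions = [agent_session(tab, task, inbox) for tab, task in enumerate(tasks, start=1)]
    results = await asyncio.gather(*sessions)  # results come back in task order
    notifications = []
    while not inbox.empty():
        notifications.append(inbox.get_nowait())
    return list(results), sorted(notifications)


if __name__ == "__main__":
    done, notes = asyncio.run(run_fleet(["run tests", "refactor module", "draft docs"]))
    for _, message in notes:
        print(message)
```

The human in this model reads only the notification queue, which is the point of the setup described above: attention goes to whichever session is blocked, not to each session in turn.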
This validates the "do more with less" strategy articulated by Anthropic President Daniela Amodei earlier this week. While competitors like OpenAI pursue trillion-dollar infrastructure build-outs, Anthropic is proving that superior orchestration of existing models can yield exponential productivity gains.
The counterintuitive case for choosing the slowest, smartest model
In a surprising move for an industry obsessed with latency, Cherny revealed that he exclusively uses Anthropic's heaviest, slowest model: Opus 4.5.
"I use Opus 4.5 with thinking for everything," Cherny explained. "It's the best coding model I've ever used, and even though it's bigger & slower than Sonnet, since you have to steer it less and it's better at tool use, it is almost always faster than using a smaller model in the end."
For enterprise technology leaders, this is a critical insight. The bottleneck in modern AI development isn't the generation speed of the token; it is the human time spent correcting the AI's mistakes. Cherny's workflow suggests that paying the "compute tax" for a smarter model upfront eliminates the "correction tax" later.
One shared file turns every AI mistake into a permanent lesson
Cherny also detailed how his team solves the problem of AI amnesia. Standard large language models do not "remember" a company's specific coding style or architectural decisions from one session to the next.
To address this, Cherny's team maintains a single file named CLAUDE.md in their git repository. "Anytime we see Claude do something incorrectly we add it to the CLAUDE.md, so Claude knows not to do it next time," he wrote.
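A small helper shows how "every mistake becomes a rule" could be mechanized. This is a sketch under assumptions: the article does not describe the file's layout, so the one-bullet-per-rule format and the `add_rule` helper are hypothetical, and the example rule text is invented for illustration.

```python
from pathlib import Path


def add_rule(repo_root: Path, rule: str, filename: str = "CLAUDE.md") -> bool:
    """Append a rule as a bullet line unless it is already present.

    Returns True if the file was changed, False if the rule already existed.
    The bullet-list layout is an assumption, not a documented format.
    """
    path = repo_root / filename
    existing = path.read_text(encoding="utf-8") if path.exists() else ""
    entry = f"- {rule.strip()}"
    if entry in existing.splitlines():
        return False
    # Make sure the new entry starts on its own line.
    separator = "" if existing == "" or existing.endswith("\n") else "\n"
    path.write_text(existing + separator + entry + "\n", encoding="utf-8")
    return True


if __name__ == "__main__":
    changed = add_rule(Path("."), "Prefer named exports over default exports")
    print("rule added" if changed else "rule already present")
```

Because the file lives in the git repository, each added rule is reviewed and versioned like any other change.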
This practice transforms the codebase into a self-correcting organism. When a human developer reviews a pull request and spots an error, they don't just fix the code; they tag the AI to update its own instructions. "Every mistake becomes a rule," noted Aakash Gupta, a product leader analyzing the thread. The longer the team works together, the smarter the agent becomes.
Slash commands and subagents automate the most tedious parts of development
The "vanilla" workflow one observer praised is powered by rigorous automation of repetitive tasks. Cherny uses slash commands, custom shortcuts checked into the project's repository, to trigger complex operations with a single short command.
He highlighted a command called /commit-push-pr, which he invokes dozens of times daily. Instead of manually typing git commands, writing a commit message, and opening a pull request, the agent handles the bureaucracy of version control autonomously.
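For a sense of what such a command replaces, the manual sequence looks roughly like the sketch below. It is not the definition of Cherny's /commit-push-pr command, which the article does not show. The helper names are hypothetical, and the last step assumes the GitHub CLI (`gh`) is installed and authenticated.

```python
import subprocess
from typing import List


def commit_push_pr_commands(message: str) -> List[List[str]]:
    """Build the manual steps: stage, commit, push, open a pull request."""
    return [
        ["git", "add", "--all"],
        ["git", "commit", "-m", message],
        ["git", "push", "--set-upstream", "origin", "HEAD"],
        ["gh", "pr", "create", "--fill"],  # --fill takes title and body from the commits
    ]


def run_all(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print the commands, or execute them and stop at the first failure."""
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_all(commit_push_pr_commands("Fix login redirect"))  # dry run only
```

In the workflow described above, the agent also writes the commit message itself, which this sketch leaves to the caller.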
Cherny also deploys subagents — specialized AI personas — to handle specific phases of the development lifecycle. He uses a code-simplifier to clean up architecture after the main work is done and a verify-app agent to run end-to-end tests before anything ships.
Why verification loops are the real unlock for AI-generated code
If there is a single reason Claude Code has reportedly hit $1 billion in annual recurring revenue so quickly, it is likely the verification loop. The AI is not just a text generator; it is a tester.
"Claude tests every single change I land to claude.ai/code using the Claude Chrome extension," Cherny wrote. "It opens a browser, tests the UI, and iterates until the code works and the UX feels good."
He argues that giving the AI a way to verify its own work — whether through browser automation, running bash commands, or executing test suites — improves the quality of the final result by "2-3x." The agent doesn't just write code; it proves the code works.
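The verification loop can be written down as a generic generate, check, retry cycle. This is a minimal sketch: `generate` and `verify` are hypothetical callables standing in for the agent and for whatever checker is available (a test suite, a bash command, browser automation), and the round limit is an assumption added to keep the loop bounded.

```python
from typing import Callable, Optional, Tuple

Generate = Callable[[Optional[str]], str]     # takes prior feedback, returns a candidate
Verify = Callable[[str], Tuple[bool, str]]    # returns (passed, feedback for the next round)


def verify_loop(generate: Generate, verify: Verify, max_rounds: int = 5) -> Tuple[str, int]:
    """Regenerate until the verifier passes; return the candidate and the round count."""
    feedback: Optional[str] = None
    for round_number in range(1, max_rounds + 1):
        candidate = generate(feedback)
        passed, feedback = verify(candidate)
        if passed:
            return candidate, round_number
    raise RuntimeError(f"no candidate passed verification in {max_rounds} rounds")
```

The key design choice is that the verifier's feedback is fed back into the next generation step, so each failed check narrows the next attempt.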
What Cherny's workflow signals about the future of software engineering
The reaction to Cherny's thread suggests a pivotal shift in how developers think about their craft. For years, "AI coding" meant an autocomplete function in a text editor — a faster way to type. Cherny has demonstrated that it can now function as an operating system for labor itself.
"Read this if you're already an engineer… and want more power," Jeff Tang summarized on X.
The tools to multiply human output by a factor of five are already here. They require only a willingness to stop thinking of AI as an assistant and start treating it as a workforce. The progr…
Content automatically truncated.
🔗 Source: venturebeat.com
🤖 MAROKO133 Note
This article is an automated summary compiled from several trusted sources. We pick trending topics so you can stay up to date.