MAROKO133 Breaking ai: Databricks research reveals that building better AI judges isn't…


The intelligence of AI models isn't what's blocking enterprise deployments. It's the inability to define and measure quality in the first place.

That's where AI judges are now playing an increasingly important role. In AI evaluation, a "judge" is an AI system that scores outputs from another AI system.
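In its simplest form, a judge is just a model prompted with a scoring rubric. Here is a minimal sketch of the pattern, assuming a hypothetical `call_llm` helper that stands in for any chat-completion client; the article does not describe Databricks' actual implementation:

```python
# Minimal LLM-as-judge sketch. `call_llm` is a hypothetical helper that
# sends a prompt to a model and returns its text reply; swap in any client.
JUDGE_PROMPT = """Rate the RESPONSE to the QUESTION for factual accuracy
on a 1-5 scale. Reply with a single integer.

QUESTION: {question}
RESPONSE: {response}
Score:"""

def judge_score(question: str, response: str, call_llm) -> int:
    reply = call_llm(JUDGE_PROMPT.format(question=question, response=response))
    return int(reply.strip())  # naive parsing; real code should validate the reply
```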

Judge Builder is Databricks' framework for creating judges and was first deployed as part of the company's Agent Bricks technology earlier this year. The framework has evolved significantly since its initial launch in response to direct user feedback and deployments.

Early versions focused on technical implementation, but customer feedback revealed that the real bottleneck was organizational alignment. Databricks now offers a structured workshop process that guides teams through three core challenges: getting stakeholders to agree on quality criteria, capturing domain expertise from a limited pool of subject matter experts, and deploying evaluation systems at scale.

"The intelligence of the model is typically not the bottleneck, the models are really smart," Jonathan Frankle, Databricks' chief AI scientist, told VentureBeat in an exclusive briefing. "Instead, it's really about asking, how do we get the models to do what we want, and how do we know if they did what we wanted?"

The 'Ouroboros problem' of AI evaluation

Judge Builder addresses what Pallavi Koppol, a Databricks research scientist who led the development, calls the "Ouroboros problem." An Ouroboros is an ancient symbol that depicts a snake eating its own tail.

Using AI systems to evaluate AI systems creates a circular validation challenge.

"You want a judge to see if your system is good, if your AI system is good, but then your judge is also an AI system," Koppol explained. "And now you're saying like, well, how do I know this judge is good?"

The solution is measuring "distance to human expert ground truth" as the primary scoring function. By minimizing the gap between how an AI judge scores outputs versus how domain experts would score them, organizations can trust these judges as scalable proxies for human evaluation.
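One way to make that scoring function concrete, as a sketch rather than Databricks' actual code, is to treat the distance as the mean absolute gap between judge scores and expert labels on a shared example set:

```python
# Sketch: "distance to human expert ground truth" as mean absolute error.
def distance_to_ground_truth(judge_scores: list[float],
                             expert_scores: list[float]) -> float:
    assert len(judge_scores) == len(expert_scores) and judge_scores
    return sum(abs(j - e) for j, e in zip(judge_scores, expert_scores)) / len(judge_scores)

# A judge scoring [4, 5, 2] where experts said [5, 5, 1] is off by ~0.67
# per example; calibration aims to drive this gap toward zero.
print(distance_to_ground_truth([4, 5, 2], [5, 5, 1]))  # 0.666...
```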

This approach differs fundamentally from traditional guardrail systems or single-metric evaluations. Rather than asking whether an AI output passed or failed on a generic quality check, Judge Builder creates highly specific evaluation criteria tailored to each organization's domain expertise and business requirements.

The technical implementation also sets it apart. Judge Builder integrates with Databricks' MLflow and prompt optimization tools and can work with any underlying model. Teams can version control their judges, track performance over time and deploy multiple judges simultaneously across different quality dimensions.
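The article doesn't detail that integration, but the versioning idea can be approximated with vanilla MLflow tracking primitives; the judge name, prompt version, and agreement metric below are illustrative:

```python
import mlflow

# Sketch: version a judge and track its expert-agreement over time using
# plain MLflow. Judge Builder's real integration is richer than this.
with mlflow.start_run(run_name="tone-judge-v2"):
    mlflow.log_param("judge_name", "customer_tone")    # which judge
    mlflow.log_param("prompt_version", 2)              # judge version
    mlflow.log_metric("agreement_with_experts", 0.81)  # e.g. 1 - normalized MAE
```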

Lessons learned: Building judges that actually work

Databricks' work with enterprise customers revealed three critical lessons that apply to anyone building AI judges.

Lesson one: Your experts don't agree as much as you think. When quality is subjective, organizations discover that even their own subject matter experts disagree on what constitutes acceptable output. A customer service response might be factually correct but use an inappropriate tone. A financial summary might be comprehensive but too technical for the intended audience.

"One of the biggest lessons of this whole process is that all problems become people problems," Frankle said. "The hardest part is getting an idea out of a person's brain and into something explicit. And the harder part is that companies are not one brain, but many brains."

The fix is batched annotation with inter-rater reliability checks. Teams annotate examples in small groups, then measure agreement scores before proceeding. This catches misalignment early. In one case, three experts gave ratings of 1, 5 and neutral for the same output before discussion revealed they were interpreting the evaluation criteria differently.

Companies using this approach achieve inter-rater reliability scores as high as 0.6 compared to typical scores of 0.3 from external annotation services. Higher agreement translates directly to better judge performance because the training data contains less noise.
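The article doesn't name the reliability statistic Databricks uses; for two raters assigning categorical scores, Cohen's kappa is one standard choice, sketched here with scikit-learn:

```python
# Sketch: gate annotation batches on inter-rater agreement before
# training a judge. The threshold and labels are illustrative.
from sklearn.metrics import cohen_kappa_score

rater_a = [1, 5, 3, 4, 5, 1, 2]  # expert A's labels on one small batch
rater_b = [1, 4, 3, 4, 5, 2, 2]  # expert B's labels on the same batch

kappa = cohen_kappa_score(rater_a, rater_b)
if kappa < 0.4:
    print(f"kappa={kappa:.2f}: realign on criteria before annotating more")
else:
    print(f"kappa={kappa:.2f}: agreement is workable, keep going")
```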

Lesson two: Break down vague criteria into specific judges. Instead of one judge evaluating whether a response is "relevant, factual and concise," create three separate judges. Each targets a specific quality aspect. This granularity matters because a failing "overall quality" score reveals something is wrong but not what to fix.
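As a sketch of that decomposition (the rubric text and names are invented for illustration), the vague composite becomes three independently scored judges:

```python
# Sketch: one vague "overall quality" judge split into three narrow ones,
# so a low score points at a specific fix. Rubrics are illustrative.
JUDGES = {
    "relevance":   "Does the response address the user's actual question? (1-5)",
    "factuality":  "Is every claim supported by the provided context? (1-5)",
    "conciseness": "Is the response free of filler and repetition? (1-5)",
}

def evaluate(question: str, response: str, judge_fn) -> dict[str, int]:
    # judge_fn(rubric, question, response) -> int is a hypothetical scorer,
    # e.g. the judge_score helper sketched earlier with the rubric inlined.
    return {name: judge_fn(rubric, question, response)
            for name, rubric in JUDGES.items()}
```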

The best results come from combining top-down requirements, such as regulatory constraints and stakeholder priorities, with bottom-up discovery of observed failure patterns. One customer built a top-down judge for correctness but discovered through data analysis that correct responses almost always cited the top two retrieval results. This insight became a new production-friendly judge that could proxy for correctness without requiring ground-truth labels.

Lesson three: You need fewer examples than you think. Teams can create robust judges from just 20-30 well-chosen examples. The key is selecting edge cases that expose disagreement rather than obvious examples where everyone agrees.

"We're able to run this process with some teams in as little as three hours, so it doesn't really take that long to start getting a good judge," Koppol said.

Production results: From pilots to seven-figure deployments

Frankle shared three metrics Databricks uses to measure Judge Builder's success: whether customers want to use it again, whether they increase AI spending and whether they progress further in their AI journey.

On the first metric, one customer created more than a dozen judges after their initial workshop. "This customer made more than a dozen judges after we walked them through doing this in a rigorous way for the first time with this framework," Frankle said. "They really went to town on judges and are now measuring everything."

For the second metric, the business impact is clear. "There are multiple customers who have gone through this workshop and have become seven-figure spenders on GenAI at Databricks in a way that they weren't before," Frankle said.

The third metric reveals Judge Builder's strategic value. Customers who previously hesitated to use advanced techniques like reinforcement learning now feel confident deploying them because they can measure whether improvements actually occurred.

"There are customers who have gone and done very advanced things after having had these judges where they were reluctant to do so before," Frankle said. "They've moved from doing a little bit of prompt engineering to doing reinforcement learning with us. Why spend the money on reinforcement learning, and why spend the energy on reinforcement learning if you don't know whether it actually made a difference?"

What enterprises should do now

The teams successfully moving AI from pilot to production treat judges not as one-time artifacts but as evolving assets that grow with their systems.

Databricks recommends three practical steps. First, focus on high-impact…

Content automatically shortened.

🔗 Source: venturebeat.com


📌 MAROKO133 Update ai: Waymo Haunted by Killing of Beloved Neighborhood Cat

If years of activist campaigns, political lobbying, and vandalism couldn't slow Waymo's roll, the horrific death of a beloved neighborhood kitty just might.

A tabby cat named KitKat has become a cause célèbre for driverless car regulation in California after a Waymo ran him down last week, leaving the neighborhood heartbroken. According to the San Francisco Chronicle, the death and ensuing community backlash have inspired district supervisor Jackie Fielder to appeal to the governor and state regulators to reconsider who has the final say on autonomous ride-hail vehicles.

Fielder's push is apparently inspired by a dead bill in the state Senate that would shift authority over driverless cars to municipal authorities, effectively giving local residents a more direct say in what is and isn't allowed on their streets. Per the Chronicle, the local leader is tying KitKat's death into a broader campaign against autonomous vehicles, which includes concerns about congestion, noise, data privacy, and weakened public transit.

"Here in the Mission [district], we will never forget our sweet KitKat, we will always put community before tech oligarchs," Fielder said in an Instagram reel. "AVs [autonomous vehicles] collect endless amounts of data on us and from a road ridership struggling for public transportation, contribute to traffic congestion, and also drive harmful mining practices in the Global South."

The district supervisor held a press conference on Tuesday at Randa's Market, the corner store KitKat called home, to announce a resolution she's introducing to the San Francisco Board of Supervisors. If successful, the resolution would issue an official call to state legislators to allow municipal voters to have the final say.

"If I were the Waymo PR team, I would be hoping that this whole KitKat thing just dies, and that's not happening," Fielder said.

When driverless cars like Waymo's first hit US roads, they did so largely without the say of residents who would be forced to share the streets with these for-profit experiments. In California, for example, Governor Gavin Newsom, a longtime ally of big tech companies, helped lobby state politicians to fast-track the approval of driverless cars, going over the heads of local lawmakers in cities like San Francisco.

Fielder's challenge to state lawmakers will likely face fierce opposition from politicians like Newsom, who've made their bones on cozy relationships with the tech industry's biggest players.

More on Waymo: Waymo CEO Says Society Is Ready for One of Its Cars to Kill Someone


🔗 Source: futurism.com


🤖 MAROKO133 Note

This article is an automatically generated summary drawn from several trusted sources. We pick trending topics so you always stay up to date.

✅ Next update in 30 minutes, a random topic awaits!

Author: timuna