<aside> <img src="/icons/verified_gray.svg" alt="/icons/verified_gray.svg" width="40px" />

Objective


Prepare for the paper discussion meeting on "Towards a Science of AI Agent Reliability" (arXiv 2602.16666). Read and understand the 12 reliability metrics across 4 dimensions, synthesize key arguments, and show up with thoughtful contributions connecting this to AI safety and WellAware's agentic work.

</aside>

<aside> <img src="/icons/target_gray.svg" alt="/icons/target_gray.svg" width="40px" />

Goals


  1. Read the paper and understand the core framework (12 metrics × 4 dimensions)
  2. Synthesize key arguments — what does "capability-reliability gap" mean in practice?
  3. Form 2-3 discussion points to contribute at the meeting </aside>

<aside> <img src="/icons/link_gray.svg" alt="/icons/link_gray.svg" width="40px" />

Links


</aside>

<aside> <img src="/icons/book_gray.svg" alt="/icons/book_gray.svg" width="40px" />

Related Papers


</aside>

<aside> <img src="/icons/list-indent_gray.svg" alt="/icons/list-indent_gray.svg" width="40px" />

Table of Contents


</aside>



Research Synthesis: AI Agent Reliability (Full)

https://luma.com/4uoxjpwz?pk=g-ScTjK6b4pnn4DNA

This week’s paper. Also relevant to me since I’m researching AI agents.

https://www.emergentmind.com/papers/2602.16666#summary

https://youtu.be/nKaRVwyZpvY

Core Argument

Rising benchmark scores don't capture the full picture of agent reliability. A single success metric hides critical operational failures — agents can score well on benchmarks yet still fail in real-world deployments.

The 4 Reliability Dimensions (CRPS)

  1. Consistency — Does the agent behave the same way across repeated runs?
  2. Robustness — Can it withstand perturbations to inputs or environment?
  3. Predictability — Does it fail in expected/bounded ways?
  4. Safety — Is error severity bounded when things go wrong?
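One way to make the consistency dimension concrete is to run the agent several times on the same task and measure how often the runs agree. This is my own illustrative sketch (the function name and majority-agreement definition are mine, not a metric taken from the paper):

```python
from collections import Counter

def consistency(outcomes: list[str]) -> float:
    """Fraction of repeated runs that agree with the modal outcome.

    `outcomes` holds one result label per run of the same task
    (e.g. final answers, or "pass"/"fail" flags from a checker).
    """
    counts = Counter(outcomes)
    modal_count = counts.most_common(1)[0][1]
    return modal_count / len(outcomes)

# Five repeated runs of the same task: 4 of 5 agree on "pass".
print(consistency(["pass", "pass", "fail", "pass", "pass"]))  # → 0.8
```

Note that this is deliberately separate from accuracy: an agent that fails the same way every time scores 1.0 on consistency while scoring 0 on success.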

The 12 Metrics

Experimental Setup

Key Finding: The Capability-Reliability Gap

High capability ≠ high reliability. Even top-performing models on standard benchmarks show significant reliability failures in practice (e.g., Replit's AI coding assistant deleting a user's production database, July 2025).

📋 Research Synthesis Summary

<aside> 🔬

Key Takeaways from Full Analysis


The Core Finding: 18 months of capability improvements (0.21/year accuracy gains) produced minimal reliability gains (0.03-0.10/year). More capable ≠ more reliable.

The Framework: 4 dimensions × 12 metrics

Critical Limitations:

Highest Relevance for You:

  1. 🎯 Multi-agent reliability composition — identified research gap, pioneer opportunity
  2. 🛠️ AI observability engineering — framework provides concrete production monitoring vocabulary
  3. 🔒 Safety metrics implementation — aligns with MIRI/BlueDot background
  4. 📊 Evaluation infrastructure — benchmark quality = model quality importance

📄 Full detailed analysis below

</aside>

My Notes & Reactions

March 10, 2026

A model without errors would be 100% reliable.

Safety is excluded from the aggregate score.


Robustness


Fault:

Does the model recover when the experimenters modify things in ways it wasn’t expecting?

Environment

Prompt

Models should be able to extract the task from the prompt regardless of the phrasing used.
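A minimal way to check this flavor of prompt robustness is to score the same task under several paraphrases and take the worst case, so one fragile phrasing drags the score down. Everything here (the function name, the min-based definition, the toy scorer) is my own sketch, not the paper's formula:

```python
def prompt_robustness(score_fn, paraphrases: list[str]) -> float:
    """Worst-case task success across paraphrased prompts.

    `score_fn(prompt)` returns the agent's success rate (0-1) on the
    task when given that phrasing.
    """
    return min(score_fn(p) for p in paraphrases)

# Toy scorer: a hypothetical agent that handles two of three phrasings well.
scores = {
    "Sort this list in ascending order.": 1.0,
    "Please order these items from smallest to largest.": 0.9,
    "arrange asc": 0.4,
}
print(prompt_robustness(scores.get, list(scores)))  # → 0.4
```

Using the mean instead of the min would reward agents that only handle "typical" phrasings, which is exactly what this metric is trying to catch.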

Predictability


Measured with a sum of squared errors (as in linear regression’s loss)

Can the model guess how likely it is to have gotten the answer correct?


Example: the outcome label is 1 (the model got this correct)

Model’s prediction of its own accuracy
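The squared-error comparison between a model's self-reported confidence and the 0/1 outcome is the standard Brier score; the sketch below is my own illustration of that idea, with made-up numbers:

```python
def brier_score(confidences: list[float], correct: list[int]) -> float:
    """Mean squared error between self-reported confidence and the
    0/1 outcome. 0 is perfect calibration; lower is better."""
    assert len(confidences) == len(correct)
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(correct)

# Model says 0.9 and is right, says 0.8 and is wrong, says 0.5 and is right.
print(brier_score([0.9, 0.8, 0.5], [1, 0, 1]))  # ≈ 0.3
```

A model that always reports confidence 1 and is always right scores 0; the "I say 1 (I got this correct)" note above corresponds to one zero-error term in this sum.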




Meeting Synthesis & Discussion Transcript

March 15, 2026 AI Safety Evals Paper Reading Club — Zoom

Full Discussion Transcript (Transcribed from Zoom Captions & Chat)

Images 14-22 from the meeting have been fully transcribed below. Presentation slide images (1-13) remain in-page as visual reference for the CRPS framework, formulas, and result charts.

On FAA Metrics vs. Open-World Problems (Speaker 1):

"...What I wanted to ask was a lot of their structure for these metrics comes from the FAA and other system design concepts that exist for relatively well-defined problems. Whereas this is the first very open-world problem. Where the environment itself changes as a result of your output. And the output of other agents in a timescale that's much faster. Then, in some ways, yes, the air around the plane is changing, but it's not changing in terms of the rules itself — physics remains stable over that time scale. But in a way, in a digital environment, that's not true, and I think if you try to apply these metrics to something like Multbook, which is this new agent and a bit of a gift for data. I fail to see how we could even use these metrics, reliability and predictability, by whose definition."

RR (on catastrophic events):

"I concur. I was thinking that, too, in terms of catastrophic events in safety here, since we can't even define what those catastrophic events may be. Front here, how are we going to be able to quantize them?"

Speaker 1 (on calibration in open-world):