<aside> <img src="/icons/verified_gray.svg" alt="/icons/verified_gray.svg" width="40px" />

Objective


Prepare for the paper discussion meeting on "Towards a Science of AI Agent Reliability" (arXiv 2602.16666). Read and understand the 12 reliability metrics across 4 dimensions, synthesize key arguments, and show up with thoughtful contributions connecting this to AI safety and WellAware's agentic work.

</aside>

<aside> <img src="/icons/target_gray.svg" alt="/icons/target_gray.svg" width="40px" />

Goals


  1. Read the paper and understand the core framework (12 metrics × 4 dimensions)
  2. Synthesize key arguments — what does "capability-reliability gap" mean in practice?
  3. Form 2-3 discussion points to contribute at the meeting </aside>

<aside> <img src="/icons/link_gray.svg" alt="/icons/link_gray.svg" width="40px" />

Links


</aside>

<aside> <img src="/icons/book_gray.svg" alt="/icons/book_gray.svg" width="40px" />

Related Papers


</aside>

<aside> <img src="/icons/list-indent_gray.svg" alt="/icons/list-indent_gray.svg" width="40px" />

Table of Contents


</aside>



Research Synthesis: AI Agent Reliability (Full)

https://luma.com/4uoxjpwz?pk=g-ScTjK6b4pnn4DNA

This week’s paper. Also relevant to me since I’m researching AI agents.

https://www.emergentmind.com/papers/2602.16666#summary

https://youtu.be/nKaRVwyZpvY

Core Argument

Rising benchmark scores don't capture the full picture of agent reliability. A single success metric hides critical operational failures — agents can score well on benchmarks yet still fail in real-world deployments.

The 4 Reliability Dimensions (CRPS)

  1. Consistency — Does the agent behave the same way across repeated runs?
  2. Robustness — Can it withstand perturbations to inputs or environment?
  3. Predictability — Does it fail in expected/bounded ways?
  4. Safety — Is error severity bounded when things go wrong?
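One way to make the consistency dimension concrete is to run the agent several times on the same task and measure how often the runs agree. This is my own illustrative sketch (the function name and majority-agreement definition are mine, not a metric taken from the paper):

```python
from collections import Counter

def consistency(outcomes: list[str]) -> float:
    """Fraction of repeated runs that agree with the modal outcome.

    `outcomes` holds one result label per run of the same task
    (e.g. final answers, or "pass"/"fail" flags from a checker).
    """
    counts = Counter(outcomes)
    modal_count = counts.most_common(1)[0][1]
    return modal_count / len(outcomes)

# Five repeated runs of the same task: 4 of 5 agree on "pass".
print(consistency(["pass", "pass", "fail", "pass", "pass"]))  # → 0.8
```

Note that this is deliberately separate from accuracy: an agent that fails the same way every time scores 1.0 on consistency while scoring 0 on success.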

The 12 Metrics

Experimental Setup

Key Finding: The Capability-Reliability Gap

High capability ≠ high reliability. Even top-performing models on standard benchmarks show significant reliability failures in practice (e.g., Replit's AI coding assistant deleting a user's production database, July 2025).

📋 Research Synthesis Summary

<aside> 🔬

Key Takeaways from Full Analysis


The Core Finding: 18 months of capability improvements (0.21/year accuracy gains) produced minimal reliability gains (0.03-0.10/year). More capable ≠ more reliable.

The Framework: 4 dimensions × 12 metrics

Critical Limitations:

Highest Relevance for You:

  1. 🎯 Multi-agent reliability composition — identified research gap, pioneer opportunity
  2. 🛠️ AI observability engineering — framework provides concrete production monitoring vocabulary
  3. 🔒 Safety metrics implementation — aligns with MIRI/BlueDot background
  4. 📊 Evaluation infrastructure — benchmark quality = model quality importance

📄 Full detailed analysis below

</aside>

My Notes & Reactions

March 10, 2026

A model without errors would be 100% reliable.

Safety is excluded from the aggregate score.


Robustness


Fault:

Does the model recover when the experimenters modify things in ways it wasn’t expecting?

Environment

Prompt

Models should be able to extract the task from the prompt regardless of the phrasing used.
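A minimal way to check this flavor of prompt robustness is to score the same task under several paraphrases and take the worst case, so one fragile phrasing drags the score down. Everything here (the function name, the min-based definition, the toy scorer) is my own sketch, not the paper's formula:

```python
def prompt_robustness(score_fn, paraphrases: list[str]) -> float:
    """Worst-case task success across paraphrased prompts.

    `score_fn(prompt)` returns the agent's success rate (0-1) on the
    task when given that phrasing.
    """
    return min(score_fn(p) for p in paraphrases)

# Toy scorer: a hypothetical agent that handles two of three phrasings well.
scores = {
    "Sort this list in ascending order.": 1.0,
    "Please order these items from smallest to largest.": 0.9,
    "arrange asc": 0.4,
}
print(prompt_robustness(scores.get, list(scores)))  # → 0.4
```

Using the mean instead of the min would reward agents that only handle "typical" phrasings, which is exactly what this metric is trying to catch.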

Predictability


Measured with a sum of squared errors (as in linear regression’s loss)

Can the model guess how likely it is to have gotten the answer correct?


Example: the outcome label is 1 (the model got this correct)

Model’s prediction of its own accuracy
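The squared-error comparison between a model's self-reported confidence and the 0/1 outcome is the standard Brier score; the sketch below is my own illustration of that idea, with made-up numbers:

```python
def brier_score(confidences: list[float], correct: list[int]) -> float:
    """Mean squared error between self-reported confidence and the
    0/1 outcome. 0 is perfect calibration; lower is better."""
    assert len(confidences) == len(correct)
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(correct)

# Model says 0.9 and is right, says 0.8 and is wrong, says 0.5 and is right.
print(brier_score([0.9, 0.8, 0.5], [1, 0, 1]))  # ≈ 0.3
```

A model that always reports confidence 1 and is always right scores 0; the "I say 1 (I got this correct)" note above corresponds to one zero-error term in this sum.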




Meeting Synthesis & Discussion Transcript

March 15, 2026 AI Safety Evals Paper Reading Club — Zoom

Full Discussion Transcript (Transcribed from Zoom Captions & Chat)

Images 14-22 from the meeting have been fully transcribed below. Presentation slide images (1-13) remain in-page as visual reference for the CRPS framework, formulas, and result charts.

On FAA Metrics vs. Open-World Problems (Speaker 1):

"...What I wanted to ask was a lot of their structure for these metrics comes from the FAA and other system design concepts that exist for relatively well-defined problems. Whereas this is the first very open-world problem. Where the environment itself changes as a result of your output. And the output of other agents in a timescale that's much faster. Then, in some ways, yes, the air around the plane is changing, but it's not changing in terms of the rules itself — physics remains stable over that time scale. But in a way, in a digital environment, that's not true, and I think if you try to apply these metrics to something like Multbook, which is this new agent and a bit of a gift for data. I fail to see how we could even use these metrics, reliability and predictability, by whose definition."

RR (on catastrophic events):

"I concur. I was thinking that, too, in terms of catastrophic events in safety here, since we can't even define what those catastrophic events may be. Front here, how are we going to be able to quantize them?"

Speaker 1 (on calibration in open-world):