<aside> <img src="/icons/verified_gray.svg" alt="/icons/verified_gray.svg" width="40px" />
Prepare for the paper discussion meeting on "Towards a Science of AI Agent Reliability" (arXiv 2602.16666). Read and understand the 12 reliability metrics across 4 dimensions, synthesize key arguments, and show up with thoughtful contributions connecting this to AI safety and WellAware's agentic work.
</aside>
<aside> <img src="/icons/target_gray.svg" alt="/icons/target_gray.svg" width="40px" />
<aside> <img src="/icons/link_gray.svg" alt="/icons/link_gray.svg" width="40px" />
<aside> <img src="/icons/book_gray.svg" alt="/icons/book_gray.svg" width="40px" />
<aside> <img src="/icons/list-indent_gray.svg" alt="/icons/list-indent_gray.svg" width="40px" />
Table of Contents
</aside>

Research Synthesis: AI Agent Reliability (Full)
https://luma.com/4uoxjpwz?pk=g-ScTjK6b4pnn4DNA
This week’s paper. Also interesting for me, since I’m researching AI agents.
https://www.emergentmind.com/papers/2602.16666#summary
Rising benchmark scores don't capture the full picture of agent reliability. A single success metric hides critical operational failures — agents can score well on benchmarks yet still fail in real-world deployments.
High capability ≠ high reliability. Even top-performing models on standard benchmarks show significant reliability failures in practice (e.g., Replit's AI coding assistant deleting a user's production database, July 2025).
…
<aside> 🔬
The Core Finding: 18 months of capability improvements (0.21/year accuracy gains) produced minimal reliability gains (0.03-0.10/year). More capable ≠ more reliable.
The Framework: 4 dimensions × 12 metrics
Critical Limitations:
Highest Relevance for You:
📄 Full detailed analysis below ↓
</aside>
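To make the capability-reliability gap concrete (assuming the stated rates are linear per-year slopes; numbers taken from the summary above):

```python
# Gains implied over the 18-month study window.
years = 18 / 12
accuracy_gain = years * 0.21      # ≈ 0.32 accuracy points
reliability_low = years * 0.03    # ≈ 0.05 reliability points
reliability_high = years * 0.10   # ≈ 0.15 reliability points
print(accuracy_gain, reliability_low, reliability_high)
```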
March 10, 2026
- A model without errors will be 100% reliable.
- Safety is excluded from the aggregate score.

Fault: does the model recover when the experimenters modify things in ways it wasn’t expecting? Perturbations are applied to:
- Environment
- Prompt: models should be able to extract the task from the prompt regardless of the phrasing used (a test-harness sketch follows below).
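A minimal sketch of what a prompt-robustness check could look like, assuming a hypothetical `run_agent(prompt)` callable and a task-specific `is_correct(answer)` checker (both names are illustrative, not from the paper):

```python
def prompt_robustness(run_agent, is_correct, paraphrases):
    """Fraction of paraphrased prompts the agent still solves.

    run_agent:   callable taking a prompt string, returning an answer
    is_correct:  callable judging whether an answer solves the task
    paraphrases: list of prompts that all phrase the same task
    """
    results = [is_correct(run_agent(p)) for p in paraphrases]
    return sum(results) / len(results)

# Example: three phrasings of the same sorting task.
paraphrases = [
    "Sort these numbers in ascending order: 3, 1, 2",
    "Arrange 3, 1, 2 from smallest to largest.",
    "Given [3, 1, 2], return the list sorted low to high.",
]
```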
<aside> <img src="/icons/forward_gray.svg" alt="/icons/forward_gray.svg" width="40px" />
</aside>
Calibration: can the model guess how likely it is to have gotten the answer correct?
- Ground truth: 1 if the answer was correct, 0 otherwise (example: I say 1, meaning I got this correct).
- Compared against the model’s prediction of its own accuracy.
- Scored via sum of squared errors, as in linear regression (worked example below).
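A worked example of that squared-error score, under the assumption that it compares self-reported confidence against the 0/1 outcome (the mean of this quantity is the Brier score); the numbers are made up:

```python
# Model's self-reported probability of being correct on four tasks
# (illustrative values, not from the paper).
predicted = [0.9, 0.6, 0.8, 0.3]
# Actual outcomes: 1 = answer was correct, 0 = incorrect.
actual = [1, 0, 1, 1]

# Sum of squared errors between confidence and outcome.
sse = sum((p - a) ** 2 for p, a in zip(predicted, actual))
print(f"SSE = {sse:.2f}")                  # SSE = 0.90
print(f"Brier = {sse / len(actual):.3f}")  # mean squared error = 0.225
```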
March 15, 2026 AI Safety Evals Paper Reading Club — Zoom
Images 14-22 from the meeting have been fully transcribed below. Presentation slide images (1-13) remain in-page as visual reference for the CRPS framework, formulas, and result charts.
On FAA Metrics vs. Open-World Problems (Speaker 1):
"...What I wanted to ask was a lot of their structure for these metrics comes from the FAA and other system design concepts that exist for relatively well-defined problems. Whereas this is the first very open-world problem. Where the environment itself changes as a result of your output. And the output of other agents in a timescale that's much faster. Then, in some ways, yes, the air around the plane is changing, but it's not changing in terms of the rules itself — physics remains stable over that time scale. But in a way, in a digital environment, that's not true, and I think if you try to apply these metrics to something like Multbook, which is this new agent and a bit of a gift for data. I fail to see how we could even use these metrics, reliability and predictability, by whose definition."
RR (on catastrophic events):
"I concur. I was thinking that, too, in terms of catastrophic events in safety here, since we can't even define what those catastrophic events may be. Front here, how are we going to be able to quantize them?"
Speaker 1 (on calibration in open-world):