<aside> <img src="/icons/calendar-week_lightgray.svg" alt="Calendar icon" width="40px" />

Research Agenda


  1. Tools for AI Research </aside>

<aside> <img src="/icons/bullseye_gray.svg" alt="Bullseye icon" width="40px" />

Goals this Week


  1. Create questions and survey for the alignment measurement problem paper
  2. Apply for AI Research positions
  3. Research multi-layered AI guardrails

</aside>

<aside> <img src="/icons/book_gray.svg" alt="Book icon" width="40px" />

Pages


AI Research Tools

Agentic AI System Design

Toward a Science of AI Agent Reliability

</aside>

<aside> <img src="/icons/book_gray.svg" alt="Book icon" width="40px" />

Reading Stack


https://www.aisafetybook.com/

https://www.alignmentforum.org/

AI Consciousness

https://ai-frontiers.org/articles/the-evidence-for-ai-consciousness-today

https://www.alignmentforum.org/posts/hopeRDfyAgQc4Ez2g/how-i-stopped-being-sure-llms-are-just-making-up-their

Papers

https://arxiv.org/abs/2304.03279

https://arxiv.org/html/2402.06782

</aside>

<aside> <img src="/icons/preview_gray.svg" alt="Preview icon" width="40px" />

Watch


</aside>

<aside> <img src="/icons/new-badge_gray.svg" alt="New badge icon" width="40px" />

Takeaways


These are some of the leading tools for researching the AI literature.

There is a framework for AI consciousness here; I think it would be very interesting to review. Kaj has also written about the topic on LessWrong.

I attended a review of the paper Toward a Science of AI Agent Reliability hosted by BlueDot Impact. My notes are here. There is a sizable capability-reliability gap for AI agents: even though LLMs are getting more capable (they can solve more kinds of tasks), reliability (whether they can complete a given task every time) seems to be lagging. Unless reliability catches up, we will either need to build under the assumption of relatively permanent human-in-the-loop oversight, or we will end up with increasingly capable but chaotic agents, which is the most dangerous scenario: the ways agents fail at a task may be very difficult to predict in advance. This unreliability would also compound in multi-agent systems.
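The compounding point can be illustrated with a toy calculation (a sketch I added, assuming each step fails independently, which is a simplification):

```python
# Toy illustration of compounding unreliability in multi-agent pipelines.
# Assumption: each agent (or step) succeeds independently with probability p,
# so end-to-end reliability of an n-step pipeline is p ** n.

def pipeline_reliability(p: float, n: int) -> float:
    """End-to-end success probability for n independent steps."""
    return p ** n

# A single agent that is 95% reliable looks strong in isolation,
# but chaining ten such agents drops end-to-end reliability below 60%.
print(round(pipeline_reliability(0.95, 1), 3))   # 0.95
print(round(pipeline_reliability(0.95, 10), 3))  # 0.599
```

Real agent failures are of course correlated, so this is only a lower-bound intuition for why per-task reliability matters more than headline capability as systems grow.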

</aside>

Log

Research on Multi-Agents

https://www.alignmentforum.org/posts/hHnpn3mEPbJMFLj4g/quick-thoughts-on-the-implications-of-multi-agent-views-of

https://docs.google.com/spreadsheets/d/1oOdrQ80jDK-aGn-EVdDt3dg65GhmzrvBWzJ6MUZB8n4/edit?usp=sharing

200 Concrete Problems In Interpretability Spreadsheet

Past BlueDot projects

https://fuzzyhead.substack.com/p/reproducing-metrs-re-bench-reward

https://blog.bluedot.org/p/shutdown-resistance-revisited-replicating

https://blog.bluedot.org/p/learnings-from-building-my-first

https://blog.bluedot.org/p/causal-probes

Interpretability Research

Inspect Evals Technical Contribution Guide