Adrià Garriga-Alonso

Research Scientist

FAR AI

My goal is to prevent existential risk from AI. To this end, I research how neural networks work internally, including:

How can we evaluate the accuracy of an interpretability explanation?
How can we find explanations of the algorithm the NN implements, at lower labor and compute costs?
What explains the behavior of agent-like AIs? What do they want?

I am doing this work at FAR AI. If you are interested in it, email me, and consider joining us! You can also join us if you want to pursue your own independent AI safety agenda.

Previously I worked at Redwood Research on interpretability research and software development.

I hold a PhD in machine learning, for which I was advised by Prof. Carl Rasmussen at the University of Cambridge. My research focused on improving uncertainty quantification in neural networks (NNs) using Bayesian principles.

Publications


Towards Automatic Circuit Discovery for mechanistic interpretability

We identify the common workflow for mechanistic interpretability work, and automate its “systematic ablations” step with a new …

Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso

Preprint Code

Causal Scrubbing: a method for rigorously testing interpretability hypotheses

The key idea behind causal scrubbing is to test interpretability hypotheses via behavior-preserving resampling ablations. We apply this …

Lawrence Chan, Adrià Garriga-Alonso, Nicholas Goldowsky-Dill, Ryan Greenblatt, Jenny Nitishinskaya, Ansh Radhakrishnan, Buck Shlegeris, Nate Thomas

Preprint

Recent Blogs

Remote development with Unison

An Alternative Population Ethics

Embarbussaments

Contest writeup: Murcia qualifiers 2015

How to multiply polynomials in Θ(n log n) time
