Computational Superposition in a Toy Model of the U-AND Problem

I’ve been working on some AI Safety research. It’s kinda dense for a blog, so I’m hosting it elsewhere.

It’s an investigation into how ML models do boolean logic at the most fundamental level. Under an assumption of feature sparsity, which is common for large models, certain patterns appear.

Read it on LessWrong
