Feeds of @NeelNanda5 on Twitter

Neel Nanda

Mechanistic Interpretability lead at DeepMind. Formerly @AnthropicAI, independent. In this to reduce AI X-risk. Neural networks can be understood, let's go do it!

Monicaxie (@immonicax) started following @NeelNanda5 on Oct 16, 2024

Neel Nanda

@NeelNanda5

100 Following • 19.2K Followers

Note: Being badly named doesn't mean current "unlearning" work is useless, I just think it should be renamed to eg "knowledge suppression" and people should be realistic about its limitations and how robust it's likely to be

1.1K views • 12 likes • 2 months ago

This looks like a lovely paper! I've long thought that the term "unlearning" was BS. IMO unlearning means removing knowledge from the model's weights. But people tend to measure behaviour, which is indistinguishable from suppression in later layers. I'm glad someone showed this!

11.4K views • 174 likes • 2 months ago

Monicaxie (@immonicax) and Pushpak Kedia (@pushpakkedia) started following @NeelNanda5 on Oct 1, 2024

Neel Nanda

@NeelNanda5

96 Following • 18.8K Followers

I really hope OpenAI can work these issues out, with or without Sam. AGI is a profound and incredibly consequential technology, and I want all companies working on it to be as sensible and well-functioning as possible.

1.9K views • 49 likes • 3 months ago

Like, regardless of whether you think x-risk is a bunch of sci-fi nonsense, that is just not a well run company, and it is the job of a board to do things about that.

3.8K views • 87 likes • 3 months ago