We have identified how millions of concepts are represented inside Claude Sonnet, one of our deployed large language models. This is the first ever detailed look inside a modern, production-grade large language model.
This opens the door to ‘AI psychology’ and to direct manipulation of internal states related to preferences and interactivity, i.e. ‘emotions’, ‘focus’, bias, etc. It should also be able to mimic MoE models, where each ‘expertise’ is selected by directly manipulating internal features. It can also learn to some extent without training, so it’s effectively a new fine-tuning technique, and it clearly reveals an internal world map of concepts.
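The "direct manipulation" idea can be sketched in a few lines: once an interpretable feature direction has been extracted (Anthropic used a sparse autoencoder for this), you can steer the model by adding a scaled copy of that direction to a hidden-state vector during inference. The function name, dimensions, and the "Golden Gate Bridge" feature below are purely illustrative assumptions, not Anthropic's actual code:

```python
# Toy sketch of activation steering: amplify a concept feature by adding
# its direction vector to a model's hidden state at some layer.
# In the real method, feature directions come from a sparse autoencoder
# trained on the model's activations; here they are hypothetical.

def steer(hidden_state, feature_direction, alpha):
    """Return hidden_state shifted along feature_direction, scaled by alpha."""
    return [h + alpha * f for h, f in zip(hidden_state, feature_direction)]

# Hypothetical 4-dim hidden state and a made-up 'Golden Gate Bridge' feature.
hidden = [0.2, -0.5, 1.0, 0.3]
golden_gate = [0.0, 1.0, 0.0, 0.0]

boosted = steer(hidden, golden_gate, alpha=5.0)
print(boosted)  # [0.2, 4.5, 1.0, 0.3] -- the feature coordinate is amplified
```

In a real model this would be applied via a forward hook on a transformer layer, with negative `alpha` suppressing the concept instead of amplifying it.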
Curious whether similar neuronal patterns can be found in all models with this method, or whether the method was optimized for Anthropic’s models.
Of course, it also opens the door to manipulative use by corporations. I.e. we will probably soon see commercial models that inflate users’ egos by exaggerating how amazing their insights are, or that push corporate interests, all hidden from the user, just to profit from the $!@ model.
Cool!