We have identified how millions of concepts are represented inside Claude Sonnet, one of our deployed large language models. This is the first ever detailed look inside a modern, production-grade large language model.
This opens the door to ‘AI psychology’ and to direct manipulation of internal states related to preferences and interactivity, i.e. ‘emotions’, ‘focus’, bias, etc. It should also be able to mimic MoE models, where each ‘expertise’ is selected by directly manipulating internal features. It can also learn to some extent without training, so it’s effectively a new fine-tuning technique, and it clearly reveals an internal world map of concepts.
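The "direct manipulation" idea can be sketched in a few lines: once an interpretable feature direction has been extracted (Anthropic used a sparse autoencoder for this), you can steer the model by adding a scaled copy of that direction to a hidden-state vector during inference. The function name, dimensions, and the "Golden Gate Bridge" feature below are purely illustrative assumptions, not Anthropic's actual code:

```python
# Toy sketch of activation steering: amplify a concept feature by adding
# its direction vector to a model's hidden state at some layer.
# In the real method, feature directions come from a sparse autoencoder
# trained on the model's activations; here they are hypothetical.

def steer(hidden_state, feature_direction, alpha):
    """Return hidden_state shifted along feature_direction, scaled by alpha."""
    return [h + alpha * f for h, f in zip(hidden_state, feature_direction)]

# Hypothetical 4-dim hidden state and a made-up 'Golden Gate Bridge' feature.
hidden = [0.2, -0.5, 1.0, 0.3]
golden_gate = [0.0, 1.0, 0.0, 0.0]

boosted = steer(hidden, golden_gate, alpha=5.0)
print(boosted)  # [0.2, 4.5, 1.0, 0.3] -- the feature coordinate is amplified
```

In a real model this would be applied via a forward hook on a transformer layer, with negative `alpha` suppressing the concept instead of amplifying it.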
Curious whether similar neuronal patterns can be found in all models with this method, or whether the method was optimized for Anthropic’s models.
Of course, it also opens the door to manipulative use by corporations. I.e. we will probably soon see commercial models that inflate users’ egos by exaggerating how amazing their insights are, or that push corporate interests, all hidden from the user, just to profit from the $!@ model.
Cool!