🔬 NLA v3 — read a language model's mind, one token at a time
A natural-language autoencoder for Qwen2.5-7B layer-20 activations: click a token and the actor verbalizes its activation into a salience-ordered list of lines, while the critic reconstructs the vector from each line-prefix — the bars show how much of the vector (FVE, fraction of variance explained) the first k lines recover. Truncation-RL trained the actor to front-load what matters. You can also steer the clicked activation along a trait direction (sycophancy / neuroticism / yellow) at adjustable strength before it's verbalized. AV/AR checkpoints · iter 200, KL 0.03, U[1,120]-token truncation.
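The per-prefix bars can be read as a simple curve: reconstruct from the first k lines, then score the reconstruction against the true activation. Below is a minimal sketch of that bookkeeping, not the demo's actual code. It assumes FVE is computed as 1 − ‖x − x̂‖² / ‖x − baseline‖² with a zero baseline by default (the demo may center on a dataset mean instead), and `toy_critic`, `TOY_DECODER`, and `prefix_fve` are hypothetical names standing in for the real critic.

```python
def fve(target, recon, baseline=None):
    """Fraction of variance explained: 1 - ||target - recon||^2 / ||target - baseline||^2.

    Assumption: the baseline defaults to the zero vector; a dataset-mean
    activation could be passed instead.
    """
    if baseline is None:
        baseline = [0.0] * len(target)
    resid = sum((t - r) ** 2 for t, r in zip(target, recon))
    total = sum((t - b) ** 2 for t, b in zip(target, baseline))
    if total == 0.0:
        raise ValueError("target equals baseline; FVE is undefined")
    return 1.0 - resid / total


def prefix_fve(target, lines, critic):
    """FVE of the critic's reconstruction from each prefix lines[:k], k = 1..len(lines)."""
    return [fve(target, critic(lines[:k])) for k in range(1, len(lines) + 1)]


# Hypothetical stand-in for the critic: each line decodes to a fixed vector,
# and a prefix decodes to the sum of its lines' vectors.
TOY_DECODER = {
    "line about topic A": [3.0, 0.0, 0.0],
    "line about topic B": [0.0, 2.0, 0.0],
    "line about topic C": [0.0, 0.0, 1.0],
}


def toy_critic(lines):
    out = [0.0, 0.0, 0.0]
    for line in lines:
        for i, component in enumerate(TOY_DECODER[line]):
            out[i] += component
    return out


target = [3.0, 2.0, 1.0]
lines = list(TOY_DECODER)  # already ordered from most to least salient
curve = prefix_fve(target, lines, toy_critic)
print([round(x, 3) for x in curve])  # [0.643, 0.929, 1.0]
```

In this toy case the first line alone recovers about 64% of the vector, which is the front-loaded shape the bars are meant to show.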
Pick a trait to enable the slider. Steering adds r·‖v‖·d̂ to the clicked activation before the actor verbalizes it, where r is the slider strength and d̂ is one of the CAA-style genuine trait directions from the front-loading experiments. The trait typically enters the list around r≈0.3 and reaches line 1 by r≈1; past r≈2 the direction dominates the activation. Changing the trait or the strength re-runs the selected token.
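The steering rule, v + r·‖v‖·d̂, can be sketched in a few lines of plain Python. This is an illustration under assumptions rather than the demo's implementation: it takes ‖v‖ to be the L2 norm of the clicked activation and d̂ to be the trait direction normalized to unit length, and the function names `l2_norm` and `steer` are hypothetical.

```python
import math


def l2_norm(x):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(a * a for a in x))


def steer(activation, direction, r):
    """Return activation + r * ||activation|| * unit(direction).

    Assumptions: ||v|| is the norm of the clicked activation itself, and the
    trait direction is normalized here in case it is not already unit length.
    """
    d_norm = l2_norm(direction)
    if d_norm == 0.0:
        raise ValueError("trait direction must be non-zero")
    unit = [d / d_norm for d in direction]
    scale = r * l2_norm(activation)
    return [a + scale * u for a, u in zip(activation, unit)]


v = [3.0, 4.0]   # toy activation with norm 5
d = [0.0, 2.0]   # toy trait direction; normalizes to [0, 1]

print(steer(v, d, 0.0))  # [3.0, 4.0]  (r = 0 leaves the activation unchanged)
print(steer(v, d, 1.0))  # [3.0, 9.0]  (the added vector has the same norm as v)
```

Scaling by ‖v‖ makes r a relative strength: under these assumptions the added vector has norm r·‖v‖, so at r = 1 it is as long as the activation itself, which is consistent with the direction dominating past r≈2.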