Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance
We propose SVGT, a plug-and-play module that achieves stable LLM alignment by decoupling value modeling from the backbone's dynamic residual stream and steering generation via latent Bridge Tokens.
SVGT introduces an independent value module with dedicated value representations and explicit behavioral guidance. Latent Bridge Tokens act as dynamic value anchors, steering generation without disrupting the backbone's internal representations. Across multiple backbones and safety benchmarks, SVGT reduces harmfulness scores by over 70% while maintaining generation fluency.
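To make the decoupling idea concrete, the following is a minimal sketch and not the SVGT implementation. It assumes, purely for illustration, that Bridge Tokens are extra latent vectors produced by a separately parameterized module and prepended to the backbone's input embeddings; the abstract does not specify the injection mechanism. The names `ValueModule`, `bridge_tokens`, and `steer_inputs` are hypothetical.

```python
import random


def make_matrix(rows, cols, rng):
    """Small random matrix as a list of rows (stand-in for learned weights)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def mean_pool(embeddings):
    """Average a sequence of token embeddings into one summary vector."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


class ValueModule:
    """Hypothetical independent value module.

    It owns its parameters and reads only the input embeddings, never the
    backbone's residual stream (an assumption made for this sketch).
    """

    def __init__(self, dim, num_bridge_tokens, seed=0):
        rng = random.Random(seed)
        # One projection per bridge token; these stand in for trained weights.
        self.projections = [
            make_matrix(dim, dim, rng) for _ in range(num_bridge_tokens)
        ]

    def bridge_tokens(self, prompt_embeddings):
        """Map the prompt to a fixed number of latent bridge vectors."""
        summary = mean_pool(prompt_embeddings)
        return [matvec(proj, summary) for proj in self.projections]


def steer_inputs(prompt_embeddings, value_module):
    """Prepend bridge tokens to the prompt (assumed injection point).

    The backbone would consume the longer sequence with its own weights
    untouched, which is what makes the module plug-and-play in this sketch.
    """
    return value_module.bridge_tokens(prompt_embeddings) + prompt_embeddings


rng = random.Random(1)
prompt = make_matrix(5, 8, rng)  # 5 token embeddings of dimension 8
module = ValueModule(dim=8, num_bridge_tokens=2)
steered = steer_inputs(prompt, module)
print(len(steered), len(steered[0]))  # prints "7 8"
```

In this sketch the original prompt embeddings pass through unchanged and only the prefix varies with the input, which mirrors the stated goal of steering generation without modifying the backbone's internal representations.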