Verification Series

Fine-grained Activation Manipulation by Contrastive Orthogonal Unalignment for Large Language Model

22nd January 2026, 11:00
Yi Dong
UOL

Abstract

Large language models have been widely applied, but can inadvertently
encode sensitive or harmful information, raising significant safety concerns.
Machine unlearning has emerged to address this concern; however, existing
training-time unlearning approaches, which rely on coarse-grained loss
combinations, struggle to precisely separate knowledge and to balance
removal effectiveness with model utility. In contrast, we propose Fine-grained
Activation manipuLation by Contrastive Orthogonal uNalignment (FALCON), a novel
representation-guided unlearning approach that leverages information-theoretic
guidance for efficient parameter selection, employs contrastive mechanisms to
enhance representation separation, and projects conflicting gradients onto
orthogonal subspaces to resolve conflicts between forgetting and retention
objectives. Extensive experiments demonstrate that FALCON achieves superior
unlearning effectiveness while maintaining model utility, exhibiting robust
resistance against knowledge recovery attempts.
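The gradient-projection idea mentioned in the abstract can be illustrated with a minimal sketch. The following is not the authors' implementation; it assumes a PCGrad-style rule in which, whenever the forgetting and retention gradients conflict (negative inner product), the forgetting gradient is projected onto the subspace orthogonal to the retention gradient:

```python
import numpy as np

def project_orthogonal(g_forget: np.ndarray, g_retain: np.ndarray) -> np.ndarray:
    """Resolve a conflict between two objective gradients.

    If g_forget points against g_retain (negative dot product), remove its
    component along g_retain, leaving only the part orthogonal to the
    retention direction. Otherwise return g_forget unchanged.
    """
    dot = g_forget @ g_retain
    if dot < 0:  # the objectives conflict
        g_forget = g_forget - (dot / (g_retain @ g_retain)) * g_retain
    return g_forget

# Example: gradients that partially oppose each other.
g_f = np.array([-1.0, 1.0])   # forgetting gradient (hypothetical)
g_r = np.array([1.0, 0.0])    # retention gradient (hypothetical)
g_proj = project_orthogonal(g_f, g_r)
print(g_proj)  # component along g_r has been removed
```

After projection, the update no longer pushes against the retention objective, since the resulting vector is orthogonal to the retention gradient.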