Is Inference Mediated by Distinct Semantic Structures in LLMs? A Mechanistic Interpretation

2026-05-25

Computation and Language
AI summary

The authors studied how language models represent relationships between sentences, focusing on whether the models encode the specific changes (semantic operations) that connect one sentence to another. They used pairs of sentences differing by a single change and found that the models' internal representations reflect these operations, not just the final label or answer. By manipulating the models' internal states, the authors showed that these semantic directions influence predictions, though the strength of the effect varies across models. This suggests that understanding and controlling language models should focus on these semantic operations rather than on the output labels alone.

Transformer models, Natural Language Inference, semantic operations, activation steering, singular value decomposition (SVD), layer-wise activations, decoder models, subspace analysis
Authors
Nura Aljaafari, Marco Valentino, André Freitas
Abstract
Predicting a label correctly does not necessarily require representing the operation that produces it. Transformer representations are known to carry label-level information, but whether they encode semantic operations producing those labels is unclear. We investigate this in Natural Language Inference using controlled premise-hypothesis pairs that differ by a single semantic transformation. Using layer-wise activations, we estimate operation-level subspaces via SVD and test their causal relevance through activation steering in four open-weight decoder models. Transformation effects are decodable with $84.8$-$99\%$ accuracy and occupy partially distinct but overlapping subspaces, exceeding random-subspace baselines. Steering experiments show that these directions causally influence predictions, though steerability varies across models; cross-operation steering further reveals structured interference and a dissociation between subspace selectivity and cross-operation independence. These findings indicate that the models encode not only that a hypothesis relates to a premise but also, in part, how it does so, implying that mechanistic analysis and control should operate at the level of semantic operations rather than predicted labels alone.
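The subspace estimation and steering steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how activations are pooled, whether differences are centered, or how the subspace rank is chosen. The function names (`operation_subspace`, `random_subspace`, `captured_energy`, `steer`), the use of raw hypothesis-minus-premise differences, and the synthetic data standing in for real layer activations are all assumptions made for illustration.

```python
import numpy as np


def operation_subspace(premise_acts, hypothesis_acts, k):
    """Return k orthonormal directions (rows) spanning the dominant
    premise-to-hypothesis activation differences at one layer.
    Assumption: raw, uncentered differences are used."""
    diffs = hypothesis_acts - premise_acts  # (n_pairs, d_model)
    # Rows of vt are right singular vectors, ordered by singular value.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k]


def random_subspace(d_model, k, rng):
    """Baseline: k orthonormal directions drawn at random via QR."""
    q, _ = np.linalg.qr(rng.standard_normal((d_model, k)))
    return q.T


def captured_energy(diffs, basis):
    """Fraction of the squared norm of diffs lying inside span(basis)."""
    proj = diffs @ basis.T
    return float((proj ** 2).sum() / (diffs ** 2).sum())


def steer(hidden, direction, alpha):
    """Shift hidden states by alpha along a unit-norm direction."""
    return hidden + alpha * direction


# Synthetic stand-in for real layer activations (assumption): each
# hypothesis equals its premise plus a shift confined to a planted
# 2-dimensional subspace, plus a small amount of isotropic noise.
rng = np.random.default_rng(0)
n_pairs, d_model, k = 200, 64, 2
planted = random_subspace(d_model, k, rng)
premise = rng.standard_normal((n_pairs, d_model))
shift = rng.standard_normal((n_pairs, k)) @ planted
hypothesis = premise + shift + 0.01 * rng.standard_normal((n_pairs, d_model))

basis = operation_subspace(premise, hypothesis, k)
diffs = hypothesis - premise
print("estimated subspace:", captured_energy(diffs, basis))
print("random subspace:   ", captured_energy(diffs, random_subspace(d_model, k, rng)))
```

In an actual experiment the two activation matrices would be hidden states collected from a chosen layer of the model, and `steer` would be applied to that layer's output during the forward pass so that its effect on the predicted label can be measured. The random-subspace comparison mirrors the kind of baseline the abstract mentions, although the exact metric used there is not specified.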