Keywords: Complex-valued neural networks, Transformers, Softmax attention, Phase coherence, Token competition, Query-key similarity, Long-range memory, Hierarchical reasoning, Attention mechanisms, Neural network generalization
Abstract
Complex-valued Transformers have largely inherited softmax attention from real-valued architectures. However, row-normalised token competition is not necessarily aligned with phase-preserving computation. In this paper, we introduce the Phase-Coherent Transformer (PCT), which applies a real-valued, element-independent, smooth gate to L2-normalised complex query-key similarities. PCT thereby replaces competitive attention with token-non-competing attention and is designed to preserve phase information across layers. On mid-scale benchmarks spanning long-range memory, hierarchical long-range reasoning, positional retrieval, phase-based memory and superposition, and image classification, PCT generalises strongly across task categories. Under parameter-fair comparison, PCT consistently outperforms both the standard softmax Transformer and its direct complex-valued counterpart. Even on tasks traditionally considered difficult for complex-valued neural networks, such as NIAH and LRA-Text, PCT remains competitive with Multiscreen, the strongest real-valued neural-network baseline in our comparison. Experiments with gates that deliberately violate the PCT conditions show that the design is not incidental: smooth gates that preserve negatively aligned phase components remain strong, gates that delete such components collapse on long-range retrieval, and gates whose outputs grow excessively large suffer clear performance degradation. PCT also shows no depth-related accuracy collapse across the tested depth range. These results support multi-layer phase-coherent structure in attention as a promising design principle for generalisation in complex-valued Transformers.
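To make the construction concrete, the sketch below instantiates the stated conditions in PyTorch. The function name pct_attention and the choice of sigmoid(Re s) as the gate are illustrative assumptions, not details taken from the paper; the code only encodes the requirements named in the abstract: a real-valued, element-independent, smooth gate over L2-normalised complex query-key similarities, with no row normalisation and no deletion of negatively aligned components.

```python
import torch

def pct_attention(q, k, v, gate=torch.sigmoid, eps=1e-8):
    # q, k, v: complex tensors of shape (..., seq, dim).
    # `gate` must be real-valued, element-independent, and smooth;
    # sigmoid of the real part is a placeholder choice, not the
    # gate specified in the paper.

    # L2-normalise queries and keys so every similarity has modulus <= 1.
    qn = q / torch.linalg.vector_norm(q, dim=-1, keepdim=True).clamp_min(eps)
    kn = k / torch.linalg.vector_norm(k, dim=-1, keepdim=True).clamp_min(eps)

    # Complex query-key similarities via the Hermitian inner product.
    s = qn @ kn.conj().transpose(-2, -1)  # complex, shape (..., seq, seq)

    # Element-wise real gate: no row normalisation, so tokens do not
    # compete, and negatively aligned components are attenuated rather
    # than deleted (sigmoid(-x) > 0), matching the ablation findings.
    w = gate(s.real)

    # A real weight scales each complex value without rotating it,
    # so the phase of every value token is preserved.
    return w.to(v.dtype) @ v

# Usage on random complex inputs.
seq, dim = 8, 16
q = torch.randn(seq, dim, dtype=torch.complex64)
k = torch.randn(seq, dim, dtype=torch.complex64)
v = torch.randn(seq, dim, dtype=torch.complex64)
out = pct_attention(q, k, v)  # complex, shape (8, 16)
```

Because the weights are real and applied per entry, the output is a phase-preserving linear mix of the value tokens; replacing the element-wise gate with a row-wise softmax would reintroduce exactly the token competition the paper removes.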