Structure-Informed Multiple Sequence Alignment: A Formal Model and Hardness Results

2026-06-01Computational Complexity

Computational Complexity
AI summary

The authors introduce a new problem called MSA-S, which improves sequence alignment by including structural information as pairs of positions that should align. They model the problem mathematically and prove that deciding if a good enough alignment exists is computationally very hard (NP-complete). They also show it's hard to even approximate the best alignment within any reasonable accuracy for certain cases, including just two sequences. These findings help explain why adding structure to sequence alignment makes it challenging from a computational viewpoint.

multiple sequence alignmentstructure-informed alignmentNP-completeNP-hardapproximation algorithmscontact mapaffine gap penaltiespairwise scoringcomputational complexityoptimization problem
Authors
Yoshiki Kanazawa, Naphan Benchasattabuse, Michal Hajdušek, Rodney Van Meter
Abstract
We formulate a structure-informed multiple sequence alignment problem, denoted MSA-S. The model abstracts biological sequences as strings and structural information as designated position-pairs. It augments a fixed pairwise string score, defined by a fixed non-gap symbol-pair scoring rule and fixed affine gap penalties, with a binary overlap score on designated position-pairs, which can be interpreted as a contact-map overlap score in structural applications. This yields a fixed-score, integer-valued optimization model suitable for complexity-theoretic analysis. Under this formulation, we show that the decision problem MSA-S-DEC is NP-complete for a broad class of fixed pairwise string scoring schemes. We also show that NP-hardness persists even under the restriction that every designated position-pair set is nonempty and the pair-overlap threshold is strictly positive. For the associated scalarized optimization problem MSA-S-OPT(lambda) with any fixed rational constant lambda >= 1, we further show that, under the canonical unit scheme for the non-gap symbol-pair scoring rule, MSA-S-OPT(lambda) admits no polynomial-time approximation scheme (PTAS) even for two input strings (k = 2), unless P = NP. These results establish a formal complexity-theoretic baseline for structure-informed multiple sequence alignment.