From Genes to Tokens: a GWAS-inspired Approach for Interpretable Stylometric Analysis
2026-06-08 • Computation and Language
AI summary
The authors present a method for identifying which words (the 'gene' tokens) are strongly associated with particular writers (the 'phenotypes'). Each token is tested with logistic regression, and a multiple-comparison correction guards against associations that appear only by chance when many tokens are tested at once. Applied to texts in English, German, and Russian, the method finds words that are statistically distinctive of individual authors. This helps explain how an author's style can be identified from word choice.
stylometry, genome-wide association studies (GWAS), logistic regression, multiple-comparison correction, lexical markers, authorship attribution, corpora, statistical significance, tokens, phenotype
Authors
Dmitry Pronin, Evgeny Kazartsev
Abstract
This short paper introduces a stylometric interpretation method inspired by genome-wide association studies (GWAS). Each "gene" token's association with "phenotype" authorship is tested using logistic regression with multiple-comparison correction. Applied to English, German, and Russian corpora, the method detects statistically significant lexical markers distinctive of individual authors.
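To make the per-token testing idea concrete, here is a minimal sketch in Python using only the standard library. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not specify the feature encoding, the test statistic, or the correction procedure, so the sketch assumes per-document token counts as the predictor, a likelihood-ratio test against an intercept-only model, and Benjamini-Hochberg adjustment. The function names (`fit_logistic`, `association_pvalue`, `benjamini_hochberg`) and the toy counts are hypothetical.

```python
import math

def sigmoid(z):
    # Clamp the linear predictor to avoid overflow in exp().
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(x, y, max_iter=50, tol=1e-10):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by Newton-Raphson.

    Returns (b1, log_likelihood). One predictor only, which is all a
    single-token association test needs in this sketch.
    """
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p            # gradient w.r.t. intercept
            g1 += (yi - p) * xi     # gradient w.r.t. slope
            h00 += w                # entries of X'WX
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det < 1e-12:             # near-singular (e.g. perfect separation)
            break
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    ll = 0.0
    for xi, yi in zip(x, y):
        p = sigmoid(b0 + b1 * xi)
        ll += math.log(p if yi else 1.0 - p)
    return b1, ll

def association_pvalue(x, y):
    """Likelihood-ratio test of the token slope against an intercept-only model."""
    n, n1 = len(y), sum(y)
    n0 = n - n1
    ll_null = n1 * math.log(n1 / n) + n0 * math.log(n0 / n)
    b1, ll_full = fit_logistic(x, y)
    stat = max(0.0, 2.0 * (ll_full - ll_null))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return b1, math.erfc(math.sqrt(stat / 2.0))

def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running = min(running, pvals[i] * m / rank)
        adjusted[i] = running
    return adjusted

# Synthetic toy data (assumption): 12 documents by author "A", then 12 by "B".
authors = ["A"] * 12 + ["B"] * 12
counts = {
    "whilst": [3, 2, 4, 3, 1, 2, 3, 1, 4, 2, 3, 2] + [0, 1, 0, 0, 2, 0, 1, 0, 1, 0, 0, 1],
    "the":    [5, 6, 5, 7, 6, 5, 6, 7, 5, 6, 6, 5] + [6, 5, 7, 5, 6, 6, 5, 7, 6, 5, 5, 6],
    "and":    [4, 5, 3, 4, 6, 5, 4, 3, 5, 4, 5, 4] + [5, 4, 4, 3, 5, 5, 6, 4, 3, 5, 4, 5],
}

y = [1 if a == "A" else 0 for a in authors]   # "phenotype": written by A or not
tokens = list(counts)
fits = [association_pvalue(counts[t], y) for t in tokens]
adj = benjamini_hochberg([p for _, p in fits])
slopes = {t: b1 for t, (b1, _) in zip(tokens, fits)}
adjusted = dict(zip(tokens, adj))

for t in tokens:
    flag = "marker" if adjusted[t] < 0.05 else "-"
    print(f"{t:8s} slope={slopes[t]:+.3f} adj_p={adjusted[t]:.4g} {flag}")
```

In this toy setup only "whilst" is flagged as a marker for author A. On a real corpus one test would be run per vocabulary item and per author, which is why the correction step matters: with thousands of tokens, uncorrected p-values would flag many spurious associations.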