From Genes to Tokens: a GWAS-inspired Approach for Interpretable Stylometric Analysis

2026-06-08 • Computation and Language


AI summary

The authors present a way to find which words (treated as "gene" tokens) are strongly linked to particular writers (authorship is the "phenotype"). They use a statistical technique called logistic regression to test whether each word is tied to a specific author, and they correct for multiple comparisons so that chance associations are not mistaken for real ones. They tested the method on texts in English, German, and Russian and found words that are statistically distinctive of individual authors. This helps explain how an author's style can be identified through word choice.

stylometry • genome-wide association studies (GWAS) • logistic regression • multiple-comparison correction • lexical markers • authorship attribution • corpora • statistical significance • tokens • phenotype

Authors

Dmitry Pronin, Evgeny Kazartsev

Abstract

This short paper introduces a stylometric interpretation method inspired by genome-wide association studies (GWAS). Each "gene" token's association with "phenotype" authorship is tested using logistic regression with multiple-comparison correction. Applied to English, German, and Russian corpora, the method detects statistically significant lexical markers distinctive of individual authors.
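
To make the per-token testing idea concrete, here is a minimal sketch, not the authors' implementation. The abstract does not say which features, test statistic, or correction procedure the paper uses, so everything specific here is an assumption: each token is represented by a hypothetical per-document rate, the association is tested with a likelihood-ratio test on a one-predictor logistic regression, and the correction is Benjamini-Hochberg. The function names (`lrt_pvalue`, `bh_adjust`, `scan_tokens`) and the toy data are illustrative only.

```python
import math

def _sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp so exp() cannot overflow
    return 1.0 / (1.0 + math.exp(-z))

def _loglik(b0, b1, x, y):
    """Bernoulli log-likelihood of labels y under logit P(y=1) = b0 + b1*x."""
    ll = 0.0
    for xi, yi in zip(x, y):
        p = _sigmoid(b0 + b1 * xi)
        ll += math.log(p) if yi else math.log(1.0 - p)
    return ll

def lrt_pvalue(x, y, iters=50):
    """P-value for H0: slope = 0, via a likelihood-ratio test (assumed choice).

    x: per-document token rates; y: 1 if the document is by the target author.
    Assumes both classes occur in y and the data are not perfectly separated.
    """
    b0 = b1 = 0.0
    for _ in range(iters):  # Newton-Raphson on (b0, b1)
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = _sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        h00 += 1e-8  # tiny damping keeps the 2x2 system solvable
        h11 += 1e-8
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    rate = sum(y) / len(y)
    null_ll = _loglik(math.log(rate / (1.0 - rate)), 0.0, x, y)
    stat = max(0.0, 2.0 * (_loglik(b0, b1, x, y) - null_ll))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running = min(running, pvals[i] * m / rank)
        adjusted[i] = running
    return adjusted

def scan_tokens(rates, is_author):
    """Test every token against the authorship label, then correct jointly."""
    tokens = sorted(rates)
    pvals = [lrt_pvalue(rates[t], is_author) for t in tokens]
    return dict(zip(tokens, bh_adjust(pvals)))

# Toy data (invented for illustration): 8 documents, the last 4 by the target author.
is_author = [0, 0, 0, 0, 1, 1, 1, 1]
rates = {
    "whilst": [0, 1, 2, 3, 2, 3, 4, 5],  # tends to be higher for the target author
    "the": [1, 2, 1, 2, 1, 2, 1, 2],     # same distribution in both groups
}
q_values = scan_tokens(rates, is_author)
```

Tokens whose adjusted value falls below a chosen threshold would be reported as candidate lexical markers. In practice a statistics library would replace the hand-rolled fit, and a real analysis would likely need covariates such as document length or genre, which this sketch omits.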
