rsx: A high-performance streaming toolkit for RAD-seq sex determination
2026-06-04 • Performance
AI summary
The authors created a new software tool called rsx, written in Rust, to improve how scientists find genetic markers linked to sex in species without well-studied genomes using RAD-seq data. Their tool is faster, uses less memory, and provides more detailed statistical evidence compared to previous methods like RADSex. It also offers better integration with programming languages and keeps results consistent across different computers. The authors tested rsx on large datasets and confirmed it matched older results while finding some new hints about sex-linked markers. They made the software available openly with supporting documentation and code interfaces.
RAD-seq, sex-linked markers, Rust programming language, Bayesian statistics, parallel processing, posterior probability, genomic data analysis, beta-binomial model, chi-squared test, API bindings
Authors
Rohit Goswami, Ruhila Goswami
Abstract
Restriction site-associated DNA sequencing (RAD-seq) is widely used to discover sex-linked markers in non-model organisms, but large studies produce marker tables with millions of RAD tags. RADSex provides the reference workflow for building marker-by-individual depth tables and testing sex-biased marker distributions, but its depth, merge, and related table-building commands grow memory-hungry, and its standard output reports frequentist calls with no posterior evidence and no direct Python or C integration. We present rsx, a Rust implementation of the complete RADSex command set that preserves marker-table semantics and command-line compatibility. rsx combines 2-bit DNA keys, parallel ingestion, memory-mapped marker tables, external sorting, bitset group counts, and streamed Gram-matrix PCA so that memory stays bounded by the number of individuals or by explicit buffers. It adds conjugate Beta-Binomial Bayes factors and posterior probabilities under XY and ZW hypotheses, returning strict, posterior-supported, and Bayes-factor-only evidence grades. A portable, libm-independent minimax approximation of the error function keeps the chi-squared tail reproducible across platforms without changing the underlying Yates test. On four real RAD-seq datasets comprising 41.9 billion bases and 29 million markers, rsx reproduced published RADSex v1.2.0 calls, achieved an 8.38-fold geometric-mean speedup across 56 paired timings (2.77-fold for FASTQ processing), and recovered every Bonferroni-significant positive-control marker. In Danio albolineatus, treated as null in the source publication, the posterior layer surfaced 30 W-linked marker hypotheses; in Notothenia rossii it withheld 400 Bayes-factor-only rows compatible with a low-prevalence null. Python bindings, a C API, and a reproducibility archive provide the workflows used for all reported numbers. rsx is released under GPL-3.0-or-later.
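To illustrate what a 2-bit DNA key is, the following minimal sketch packs a short sequence into one 64-bit integer at two bits per base. It is not taken from the rsx source: the function names `pack_2bit` and `unpack_2bit`, the A=0, C=1, G=2, T=3 code assignment, and the single-word limit of 32 bases are all assumptions made for illustration. Real RAD tags are usually longer than 32 bases, so a practical encoder would need several words per key.

```rust
/// Pack a DNA sequence of at most 32 bases into a u64, two bits per base.
/// Hypothetical helper for illustration; the code assignment
/// A=0, C=1, G=2, T=3 is an assumption, not the rsx layout.
/// Returns None if the sequence is too long or contains a non-ACGT byte.
fn pack_2bit(seq: &[u8]) -> Option<u64> {
    if seq.len() > 32 {
        return None;
    }
    let mut key: u64 = 0;
    for &base in seq {
        let code: u64 = match base {
            b'A' | b'a' => 0,
            b'C' | b'c' => 1,
            b'G' | b'g' => 2,
            b'T' | b't' => 3,
            _ => return None, // ambiguous bases such as N have no 2-bit code
        };
        // Shift earlier bases left and append the new one in the low bits.
        key = (key << 2) | code;
    }
    Some(key)
}

/// Recover the sequence from a packed key. The length must be supplied
/// because leading A bases are indistinguishable from zero padding.
fn unpack_2bit(key: u64, len: usize) -> String {
    assert!(len <= 32, "a single u64 holds at most 32 bases");
    const BASES: [char; 4] = ['A', 'C', 'G', 'T'];
    (0..len)
        .rev()
        .map(|i| BASES[((key >> (2 * i)) & 0b11) as usize])
        .collect()
}
```

Under these assumptions, `pack_2bit(b"ACGT")` returns `Some(27)` (binary `00011011`) and `unpack_2bit(27, 4)` returns `"ACGT"`. A packed key takes a quarter of the space of the ASCII sequence and compares as an integer, which is the general appeal of this kind of encoding for large marker tables.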