OncoTraj: a public benchmark for longitudinal resistance prediction in EGFR-mutant non-small-cell lung cancer on osimertinib

2026-06-09

Machine Learning
AI summary

The authors created OncoTraj, a public dataset of 813 patients with EGFR-mutant non-small-cell lung cancer treated with first-line osimertinib, harmonized from three sources. They define three tasks: predicting progression by 12 months, predicting time to first progression, and classifying the dominant resistance mechanism, all from genomic data taken at a single time point. None of the six baseline models beat chance on any task, which the authors attribute to the single-timepoint tissue sequencing input rather than to the choice of model. The benchmark does reproduce a known association: TP53 co-mutation is associated with a rise in the 12-month progression rate from 29% to 59%. The work provides a reproducible baseline and motivates a second version that adds serial blood-based (ctDNA) sampling.

EGFR-mutant NSCLC, osimertinib, clonal evolution, longitudinal patient data, next-generation sequencing (NGS), circulating tumor DNA (ctDNA), progression prediction, TP53 co-mutation, machine learning benchmark, multi-task learning
Authors
Abhijoy Sarkar, Aarchi Singh Thakur
Abstract
Resistance to first-line osimertinib in EGFR-mutant non-small-cell lung cancer (NSCLC) is the canonical example of predictable clonal evolution under therapeutic pressure, yet no public benchmark exists for training or evaluating computational models on the corresponding longitudinal patient trajectories. We introduce OncoTraj, a public benchmark of 813 EGFR-mutant NSCLC patients receiving first-line osimertinib, harmonized from three real-world clinical-genomic sources: MSK-CHORD (672 patients), AACR Project GENIE BPC NSCLC (34 patients), and the FLAURA molecular-resistance supplement (107 patients). OncoTraj defines three locked tasks: (A) binary classification of progression by a fixed 12-month landmark, (B) regression of time-to-first-progression in days, and (C) six-class classification of the dominant resistance mechanism. We release the harmonized dataset, patient-level train/validation/test splits with an audited no-leakage guarantee, an open-source evaluation harness, and six reference baselines spanning a majority-class predictor, logistic regression, random forest, XGBoost, an LSTM, and a multi-task transformer. With v1's single-timepoint snapshot features, no task clears chance on clean within-source evaluation: the uniformity of this ceiling across every model class localizes the limit to the input modality (single-snapshot tissue NGS rather than serial ctDNA), not the algorithm. The benchmark does recover a reproducible literature-consistent association: TP53 co-mutation raises the 12-month progression rate from 29% to 59% cohort-wide. OncoTraj establishes a reproducible, leakage-audited baseline and converts the modality limit into concrete design requirements for a serial-ctDNA-enriched v2.
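To make the task definitions and the patient-level split concrete, here is a minimal sketch rather than the released harness. Everything in it is an assumption for illustration: the `Patient` fields, the 365-day approximation of the 12-month landmark, the rule that patients censored before the landmark get no Task A label, and the hash-based split assignment (a stand-in technique; the abstract does not say how the splits were built or audited).

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

LANDMARK_DAYS = 365  # assumption: 12-month landmark approximated as 365 days


@dataclass
class Patient:
    # All field names are hypothetical, not the OncoTraj schema.
    patient_id: str
    days_to_progression: Optional[int]  # None if no progression was observed
    days_followed: int                  # total follow-up in days


def task_a_label(p: Patient) -> Optional[int]:
    """Task A: 1 if progressed by the landmark, 0 if not.

    Returns None when the patient was censored before the landmark,
    because the outcome at the landmark is then unknown (assumed rule).
    """
    if p.days_to_progression is not None:
        return 1 if p.days_to_progression <= LANDMARK_DAYS else 0
    return 0 if p.days_followed >= LANDMARK_DAYS else None


def task_b_target(p: Patient) -> Optional[int]:
    """Task B: time to first progression in days, None if never observed."""
    return p.days_to_progression


def assign_split(patient_id: str, val_frac: float = 0.15,
                 test_frac: float = 0.15) -> str:
    """Deterministically map a patient ID to train/val/test via a hash."""
    digest = hashlib.sha256(patient_id.encode("utf-8")).hexdigest()
    u = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"


def audit_no_leakage(rows: Iterable[Tuple[str, str]]) -> bool:
    """Return True if no patient ID appears in more than one split.

    `rows` holds (patient_id, split) pairs, possibly several per patient.
    """
    seen = {}
    for patient_id, split in rows:
        if seen.setdefault(patient_id, split) != split:
            return False
    return True
```

Because the split depends only on the patient ID, every record belonging to one patient lands in the same partition, which is the property a no-leakage audit checks. Task C (the six-class resistance mechanism) is omitted here since the abstract does not enumerate the classes.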