Correlation Is Not Enough: Embedding Human Metadata for Individual Causal Discovery

2026-06-08 | Artificial Intelligence

Artificial Intelligence, Computation and Language, Machine Learning, Performance
AI summary

The authors found that pretrained biomedical language models assign high similarity to unrelated things, such as hormone levels and stock-market changes, which can cause errors when a model uses that similarity to link events in a person's life. Every off-the-shelf biomedical encoder they tested scored unrelated cross-domain pairs between 0.76 and 0.92, giving 0% accuracy on cross-domain discrimination. To fix this, the authors applied contrastive training, which improved the model's ability to separate related from unrelated pairs and raised its score on the BIOSSES benchmark. A second pass with hard negatives mined from a biomedical knowledge graph widened the separation further at a small BIOSSES cost. They also cut inference latency using OpenVINO on an Intel Xeon with AMX, and they release their benchmark suite, training corpora, and scripts for others to use.

biomedical language model, cosine similarity, embedding space, contrastive learning, cross-domain discrimination, foundation model, knowledge graph, BIOSSES benchmark, OpenVINO, hardware acceleration
Authors
Suraj Biswas, Saurabh Gupta, Pritam Mukherjee
Abstract
Ask a pretrained biomedical language model whether "cortisol 28 ug/dL" and "stock-market volatility" are related, and it returns a cosine similarity of 0.83 on a scale where 1.0 means identical. The two share no mechanism. This is not a corner case: every off-the-shelf biomedical encoder we tested (BioBERT, PubMedBERT, BioM-ELECTRA) scores unrelated cross-domain pairs between 0.76 and 0.92 when the answer should be near zero. Accuracy on cross-domain discrimination is 0%. Retrieval systems survive this, because a language model downstream filters the noise. A Large Behavioural Model (LBM), a foundation model whose subject is a person rather than a sentence, does not: it reasons over a graph of a user's life and treats embedding proximity as evidence that two events are causally linked. False proximity writes a false causal edge, and everything downstream inherits the error. Here, embedding geometry is not a tuning knob; it is correctness. We report the fix. A contrastive pass over 72,034 pairs raises PubMedBERT BIOSSES correlation from 0.633 to 0.828 and within-vs-across-domain separation from 1.05x to 1.63x. A second pass, BODHI, mines hard negatives from edges absent in a biomedical knowledge graph and lifts separation to 2.30x and the discrimination gap to +0.392, at a 4.5% BIOSSES cost. On an Intel Xeon 6737P with AMX, OpenVINO cuts single-query latency from 1367 ms to 10 ms (133x) and reaches 555 sentences/sec. One finding contradicts standard advice: FP16 beats INT8 on this silicon at every serving batch size, and we explain why. The same model on a no-AMX Ice Lake instance runs 13-27x slower. We release the benchmark suite, training corpora, the BODHI generator, and the OpenVINO scripts.
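
The abstract reports a within-vs-across-domain separation ratio and a discrimination gap but does not define them here. The sketch below shows one plausible reading, assuming the ratio is mean within-domain cosine similarity divided by mean across-domain cosine similarity, and the gap is the difference between those two means. The function names, the toy vectors, and the domain labels are hypothetical illustrations, not the authors' code or data.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def separation_stats(embeddings, domains):
    """Return (ratio, gap) over all unordered pairs of embeddings.

    ASSUMED definitions (not confirmed by the abstract):
      ratio = mean within-domain similarity / mean across-domain similarity
      gap   = mean within-domain similarity - mean across-domain similarity
    """
    within, across = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        sim = cosine(embeddings[i], embeddings[j])
        if domains[i] == domains[j]:
            within.append(sim)
        else:
            across.append(sim)
    mean_within = sum(within) / len(within)
    mean_across = sum(across) / len(across)
    return mean_within / mean_across, mean_within - mean_across


# Toy 2-D vectors, made up for illustration; real inputs would be
# sentence embeddings produced by an encoder such as PubMedBERT.
toy_embeddings = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]
toy_domains = ["bio", "bio", "finance", "finance"]

ratio, gap = separation_stats(toy_embeddings, toy_domains)
print(f"separation ratio: {ratio:.2f}x, discrimination gap: {gap:+.3f}")
```

Under this reading, a ratio near 1.0x means the encoder places cross-domain pairs almost as close together as same-domain pairs, which is the failure the abstract describes for off-the-shelf encoders.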