Segment-driven Structural Induction and Semantic Alignment for Heterogeneous Tabular Representation

2026-06-01 | Machine Learning

AI summary

The authors address the challenge of understanding tables from different sources whose headers differ even though the underlying attributes mean the same thing. They introduce NAVI, a pretraining framework that treats each header-value pair as a single segment, combining schema-level structure with column-level value distributions. Two objectives, Masked Segment Modeling and Entropy-driven Segment Alignment, couple headers with their values and align semantics across both stable and instance-specific attributes. Experiments on heterogeneous in-domain tables showed improved reconstruction, semantic consistency, and downstream utility.

heterogeneous tables, schema-level structure, column-level distribution, header-value pairs, pretraining, masked segment modeling, semantic alignment, entropy-driven alignment, table understanding, representation learning
Authors
Woojun Jung, Susik Yoon
Abstract
Real-world domains often contain heterogeneous tables whose headers vary while their underlying attribute semantics are shared, making it difficult to induce domain-specialized semantics from table-local evidence alone. Existing encoders model parts of this problem, but often underuse column-level value distributions and apply uniform objectives across attributes with different semantic roles. We propose NAVI, a segment-centric pretraining framework that treats each header-value pair as the unit for aggregating schema-level structural evidence and column-level distributional evidence. We realize this design through Masked Segment Modeling and Entropy-driven Segment Alignment, which jointly enforce structured header-value coupling and semantic alignment across stable and instance-specific attributes. Experiments on heterogeneous in-domain tables show improved reconstruction, semantic consistency, and downstream utility across evaluation settings.
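
The abstract does not specify how segments are built, masked, or scored, so the following is only an illustrative sketch of the two ideas it names, not NAVI's actual objectives. It assumes that a "segment" is a (header, value) pair, that masking hides the value while keeping the header, and that the Shannon entropy of a column's empirical value distribution can serve as a rough signal for separating stable attributes (low entropy) from instance-specific ones (high entropy). The function names, the `[MASK]` token, and the toy rows are all hypothetical.

```python
import math
from collections import Counter


def column_entropy(values):
    """Shannon entropy (in bits) of one column's empirical value distribution."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def to_segments(row):
    """Turn one table row (a dict) into a list of (header, value) segments."""
    return [(header, value) for header, value in row.items()]


def mask_segment(segments, index, mask_token="[MASK]"):
    """Hide the value of one segment while keeping its header visible."""
    masked = list(segments)
    header, _ = masked[index]
    masked[index] = (header, mask_token)
    return masked


# Toy rows (hypothetical data): "country" repeats, "order_id" is unique per row.
rows = [
    {"country": "KR", "order_id": "A1"},
    {"country": "KR", "order_id": "B2"},
    {"country": "KR", "order_id": "C3"},
    {"country": "US", "order_id": "D4"},
]

# Gather each column's values, then score the column by its entropy.
columns = {header: [row[header] for row in rows] for header in rows[0]}
entropies = {header: column_entropy(vals) for header, vals in columns.items()}

# Mask the second segment of the first row.
masked_row = mask_segment(to_segments(rows[0]), index=1)
```

In this toy data the repeated `country` column receives a lower entropy than the unique `order_id` column, which is the kind of distinction an entropy-driven objective could use to treat the two attribute types differently. How NAVI actually uses entropy during alignment is not stated in the abstract.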