ALINC: Active Learning for Inductive Node Classification via Graph Sampling

2026-06-03 • Machine Learning

AI summary

The authors point out that active learning for node classification usually selects individual nodes to label within one or a few large graphs, whereas some domains have datasets of thousands of independent graphs. In many of these settings, labeling a single node requires analyzing the whole graph, which yields the labels of the remaining nodes as a by-product, so it makes more sense to select entire graphs for annotation. They introduce ALINC, a framework that lifts node-level utility measures to graph-level selection criteria through different aggregation mechanisms. In a benchmark covering ten strategies, three aggregation methods, and four datasets, CoreSet, TypiClust, and BADGE emerge as the top-performing graph sampling strategies, and the choice of aggregation method substantially affects model performance and annotation cost. They also demonstrate ALINC in two use cases: site-of-metabolism prediction in molecules and design automation of printed circuit board schematics.

active learning, node classification, graph sampling, inductive learning, CoreSet, TypiClust, BADGE, molecular chemistry, printed circuit board design, aggregation methods

Authors

Pascal Plettenberg, Denis Huseljic, André Alcalde, Bernhard Sick, Josephine M. Thomas

Abstract

Active learning (AL) for node classification typically focuses on selecting the most informative nodes for annotation within one or a few large graphs (e.g., in social network analysis). However, in other domains, such as molecular chemistry or electronic design automation, datasets consist of thousands of independent graphs. In many of these inductive settings, annotating an individual node requires a full-graph analysis, which effectively yields the remaining node labels on the fly. Therefore, these scenarios require AL strategies that select entire graphs instead of single nodes, a problem that has not yet been tackled in the literature. We introduce ALINC, an AL framework for inductive node classification via graph sampling. It bridges the existing methodological gap by elevating node-level utility measures to graph-level selection criteria through various aggregation mechanisms. In an extensive benchmark including ten strategies, three aggregation methods, and four datasets, we identify CoreSet, TypiClust, and BADGE as the top-performing graph sampling strategies. Our detailed analysis further reveals that the choice of the aggregation method is pivotal, as it substantially affects model performance and annotation costs. Finally, we demonstrate the effectiveness of ALINC in two use case studies: site-of-metabolism prediction in molecules and design automation of printed circuit board schematics.
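
The idea of elevating node-level utility to a graph-level selection criterion can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not name the three aggregation methods, so mean, max, and sum are stand-ins, the node scores are toy numbers, and the function names `aggregate_scores` and `select_graphs` are hypothetical and not part of ALINC.

```python
import numpy as np


def aggregate_scores(node_scores, method="mean"):
    """Collapse one graph's node-level utility scores into a single graph score.

    The three methods here are illustrative stand-ins, not the ones from the paper.
    """
    scores = np.asarray(node_scores, dtype=float)
    if method == "mean":
        return float(scores.mean())
    if method == "max":
        return float(scores.max())
    if method == "sum":
        return float(scores.sum())
    raise ValueError(f"unknown aggregation method: {method}")


def select_graphs(per_graph_node_scores, budget, method="mean"):
    """Return the indices of the `budget` graphs with the highest aggregated score."""
    graph_scores = np.array(
        [aggregate_scores(s, method) for s in per_graph_node_scores]
    )
    # Sort descending; a stable sort keeps the original order among ties.
    order = np.argsort(-graph_scores, kind="stable")
    return order[:budget].tolist()


# Toy pool of three unlabeled graphs with hypothetical per-node utility scores
# (for example, predictive uncertainty). Graph sizes differ on purpose.
pool = [
    [0.1, 0.2, 0.9],            # graph 0: one very informative node
    [0.5, 0.5],                 # graph 1: small, uniformly informative
    [0.3, 0.3, 0.3, 0.3, 0.3],  # graph 2: large, mildly informative
]

for method in ("mean", "max", "sum"):
    print(method, select_graphs(pool, budget=1, method=method))
```

On this toy pool each aggregation picks a different graph: the mean favors the small, uniformly informative graph, the max favors the graph with a single highly informative node, and the sum favors the largest graph. This is one plausible reading of why the choice of aggregation can affect both which graphs get annotated and the annotation cost.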
