GLIP: Graph and LLM Joint Pretraining for Graph-Level Tasks
2026-06-29 • Machine Learning
AI summary
The authors propose GLIP, a framework that jointly pretrains Graph Neural Networks (GNNs) and Large Language Models (LLMs) for tasks defined on whole graphs, which require capturing more complex structural and feature information than node-level or edge-level tasks. GLIP augments graphs to build positive and negative pairs, selects patches that are informative in both structure and features, enriches those patches with context through a diffusion-based projector, and trains the combined model with an objective that pairs the LLM's semantic judgments with a contrastive alignment loss. After fine-tuning with limited labeled data, GLIP outperforms state-of-the-art methods on graph-level classification and reasoning tasks in the authors' experiments. The authors also made their code publicly available.
Graph Neural Networks (GNNs), Large Language Models (LLMs), Graph-level tasks, Graph augmentation, Contrastive learning, Diffusion-based projector, Semantic alignment, Graph pretraining, Patch selection, Representation learning
Authors
Haoxin Sun, Yiqing Lin, Yajun Huang, Chenhui Dong, Mingjun Li, Zhongzhi Zhang
Abstract
Graphs are widely used to model relational systems, with applications in domains such as social networks, finance, and biomedicine. Graph neural networks (GNNs) have become a mainstream approach for learning graph representations. With the rise of large language models (LLMs), recent studies have attempted to combine GNNs with LLMs. However, most existing works concentrate on node-level and edge-level tasks, while graph-level tasks, which require capturing more complex structural and feature information, remain relatively underexplored. Moreover, although graph pretraining is a widely adopted strategy to alleviate label scarcity, most existing approaches, such as GraphCL, are designed solely for GNNs and leave LLMs uninvolved in the process. To address these limitations, we propose GLIP, a Graph-LLM JoInt Pretraining framework for graph-level tasks. GLIP first performs graph augmentation to construct positive and negative pairs and introduces a multi-token selection strategy to identify patches that are informative in both structure and features. It further leverages a diffusion-based projector to enrich these patches with contextual information, enabling GLIP to capture signals from both global and local perspectives. Finally, GLIP employs a joint objective that integrates the LLM's semantic judgments with a contrastive alignment loss, ensuring consistent supervision at both the semantic and structural levels. After pretraining, GLIP is fine-tuned with limited labeled data for downstream tasks, and extensive experiments show that it outperforms state-of-the-art methods on graph-level classification and reasoning tasks. Our source code is publicly available at https://anonymous.4open.science/r/GLIP.
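The abstract describes the joint objective only at a high level, so the following is a minimal sketch of how a semantic judgment term and a contrastive alignment term could be combined, not the authors' implementation. It assumes an InfoNCE-style contrastive loss over cosine similarities, a binary cross-entropy term on the probability the LLM assigns to "these two graphs are a positive pair", and a scalar weight between the two. The function names, the `weight` and `temperature` parameters, and the choice of both loss forms are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.5):
    """InfoNCE-style loss for one anchor (assumed form of the contrastive term)."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

def binary_cross_entropy(p_same, is_positive_pair):
    """BCE on the LLM's probability that the pair is positive (assumed form)."""
    eps = 1e-12
    p = min(max(p_same, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -math.log(p) if is_positive_pair else -math.log(1.0 - p)

def joint_loss(anchor, positive, negatives, p_same, is_positive_pair,
               weight=1.0, temperature=0.5):
    """Hypothetical joint objective: semantic term + weight * contrastive term."""
    semantic = binary_cross_entropy(p_same, is_positive_pair)
    contrastive = info_nce(anchor, positive, negatives, temperature)
    return semantic + weight * contrastive

# Toy embeddings: the positive view matches the anchor, the negative is orthogonal.
anchor = [1.0, 0.0]
positive = [1.0, 0.0]
negatives = [[0.0, 1.0]]
loss = joint_loss(anchor, positive, negatives, p_same=0.9, is_positive_pair=True)
```

In this sketch the contrastive term shrinks as the two augmented views of a graph align in embedding space, while the semantic term shrinks as the LLM's judgment agrees with the pair label, so minimizing the sum pushes both signals in the same direction.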