Generate with CodeXHug: A Dataset to Enhance Model Cards with Code Usage Patterns

2026-06-22

Software Engineering
AI summary

The authors explore how pre-trained models (PTMs) from the HuggingFace repository are actually used in real software projects on GitHub, not just in experiments or for storage. They created a dataset called CodeXHug that links PTMs with real code examples from GitHub, making it easier for developers to understand how to apply these models. Their dataset includes over 7,000 PTMs and more than 20,000 Python files. They also show how to find common coding patterns for specific models using statistical methods and clustering. This work aims to help newcomers by providing practical code usage alongside model information.

Pre-trained models, HuggingFace, GitHub, Model cards, Code usage patterns, Data curation, Clustering, Statistical analysis, Python, Software engineering
Authors
Stefano Palombo, Claudio Di Sipio, Juri Di Rocco, Davide Di Ruscio
Abstract
Pre-trained models (PTMs) are becoming increasingly popular in the software engineering community. Their usage is facilitated by model repositories, e.g., HuggingFace (HF), which collect, store, and maintain a wide range of PTMs. However, the actual adoption of these models in real-world projects is still an open question: many of them are used only in toy projects or simply as a mirror of the HF repository. In addition, most of the available model cards and textual documents that contain critical information about their usage do not include explanatory code patterns, which increases the difficulty for newcomers. We therefore see the need for a curated codebase related to PTMs to support developers and practitioners who are interested in using them in their projects. In this paper, we present CodeXHug, a curated dataset of HuggingFace PTMs used in the GitHub ecosystem, together with the related code usage patterns. Starting from the latest HF dump, we first conduct a data curation step to collect PTMs that have a tag and a model card. We then query the GitHub platform to find actual usages of the identified PTMs, resulting in 7,325 different models and 20,545 Python files. To demonstrate a concrete application of CodeXHug, we propose a usage scenario focused on extracting representative code usage patterns for specific PTMs through statistical analysis and clustering techniques applied to relevant code snippets.
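
The two collection steps described in the abstract (keeping PTMs that have a tag and a model card, then locating files that use them) can be illustrated with a minimal sketch. This is not the authors' pipeline: the `ModelRecord` schema, the `curate` and `find_usages` helpers, and the in-memory `files` dictionary are all hypothetical, and a plain string-literal match stands in for the GitHub queries, whose exact form the abstract does not specify.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ModelRecord:
    # Hypothetical schema for one entry of a model dump (assumption, not the HF format).
    model_id: str
    tags: List[str]
    card_text: str


def curate(records: List[ModelRecord]) -> List[ModelRecord]:
    """Keep only models that have at least one tag and a non-empty model card."""
    return [r for r in records if r.tags and r.card_text.strip()]


def find_usages(model_ids: List[str], python_files: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each model ID to the files that mention it as a quoted string literal."""
    usages: Dict[str, List[str]] = {}
    for model_id in model_ids:
        # Require quotes on both sides so partial matches inside longer IDs are ignored.
        pattern = re.compile("[\"']" + re.escape(model_id) + "[\"']")
        hits = [path for path, source in python_files.items() if pattern.search(source)]
        if hits:
            usages[model_id] = hits
    return usages


# Illustrative data only; none of these records or files come from CodeXHug.
records = [
    ModelRecord("bert-base-uncased", ["fill-mask"], "# BERT base model"),
    ModelRecord("someuser/untagged-model", [], "# Card without tags"),
    ModelRecord("someuser/no-card-model", ["text-generation"], "   "),
]
files = {
    "app/train.py": 'model = AutoModel.from_pretrained("bert-base-uncased")\n',
    "app/utils.py": "print('hello')\n",
}

curated = curate(records)
usages = find_usages([r.model_id for r in curated], files)
print(usages)
```

In this toy run only the first record survives curation, since the other two lack either a tag or a card, and only `app/train.py` is reported as a usage. A real collection step would replace the `files` dictionary with results retrieved from GitHub.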