Revealing the Technology Development of Natural Language Processing: A Scientific Entity-Centric Perspective

2026-06-29 • Computation and Language

Computation and Language • Computers and Society • Digital Libraries • Information Retrieval

AI summary

The authors studied how technology in Natural Language Processing (NLP) has changed by focusing on the specific technical entities mentioned in research papers: methods, datasets, metrics, and tools. They found that the average number of such entities per paper keeps increasing, which means researchers need more technical background knowledge to keep up. Pre-trained language models, exemplified by BERT and Transformer, have become mainstream in recent years. They also found that the impact of some older technologies, namely the Wikipedia dataset and the BLEU metric, has continued to rise over the long term. Overall, the authors show that new high-impact technologies are becoming popular and being adopted by researchers faster than before.

Natural Language Processing • entity recognition • pre-trained language models • BERT • Transformer • BLEU metric • Wikipedia dataset • technology trends • z-score • co-occurrence network

Authors

Heng Zhang, Chengzhi Zhang, Yuzhuo Wang

Abstract

Most studies on technology development have been conducted from a thematic perspective, but topics are coarse-grained and insufficient to represent technology accurately. The development of automatic entity recognition techniques makes it possible to extract technology-related entities on a large scale. Thus, we perform a more accurate analysis of technology development from an entity-centric perspective. To begin with, we extract technology-related entities such as methods, datasets, metrics, and tools from articles on Natural Language Processing (NLP), and we apply a semi-automatic approach to normalize the entities. Subsequently, we calculate the z-scores of entities based on their co-occurrence networks to measure their impact. We then analyze the development trends of new technologies in the NLP domain since the beginning of the 21st century. The findings of this paper cover three aspects. First, the continued increase in the average number of entities per paper implies a growing burden on researchers to acquire relevant technical background knowledge. However, the emergence of pre-trained language models has injected new vitality into technological innovation in the NLP domain. Second, methods dominate among the 179 high-impact entities. An analysis of the z-score trends of the top 10 entities reveals that pre-trained language models, exemplified by BERT and Transformer, have become mainstream in recent years. Unlike the trends of the other eight entities, all of which are methods, the impact of the Wikipedia dataset and the BLEU metric has continued to rise over the long term. Third, new high-impact technologies have surged in popularity in recent years more than ever before, and their acceptance by researchers has accelerated at an unprecedented speed. Our study provides a new perspective on analyzing technology development in a specific domain.
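The abstract does not spell out how the z-score is computed, so the following is only a minimal sketch under stated assumptions: an entity's impact is taken to be its weighted degree in the co-occurrence network (one link per pair of entities appearing in the same paper), standardized against all entities in the same corpus. The function names `cooccurrence_degrees` and `z_scores` and the toy corpus are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations
from statistics import mean, pstdev


def cooccurrence_degrees(papers):
    """Weighted degree of each entity in the co-occurrence network.

    `papers` is an iterable of entity collections, one per paper. Two
    entities gain one unit of link weight each time they share a paper.
    ASSUMPTION: degree is used as the raw impact signal; the paper may
    define it differently.
    """
    degree = Counter()
    for entities in papers:
        unique = sorted(set(entities))
        for entity in unique:
            degree[entity] += 0  # register isolated entities with degree 0
        for a, b in combinations(unique, 2):
            degree[a] += 1
            degree[b] += 1
    return degree


def z_scores(degree):
    """Standardize degrees: z = (degree - mean) / population std. dev."""
    values = list(degree.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return {entity: 0.0 for entity in degree}
    return {entity: (d - mu) / sigma for entity, d in degree.items()}


# Toy corpus (hypothetical): each set holds the normalized entities of one paper.
papers = [
    {"BERT", "Transformer", "BLEU"},
    {"BERT", "Wikipedia"},
    {"BERT", "BLEU"},
    {"LSTM"},
]
degree = cooccurrence_degrees(papers)
scores = z_scores(degree)
for entity, z in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{entity}: degree={degree[entity]}, z={z:.2f}")
```

Running the same computation separately on the papers of each publication year would give a per-year z-score series for every entity, which is the kind of trend the abstract describes for the top 10 entities.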
