Machine Learning for Coding Retail Product Names to Consumer-Price Categories: A Rule-plus-Bag-of-Words Pipeline with Reliability-Weighted Human-in-the-Loop Labeling

2026-06-01 | Computation and Language

Computation and Language, Machine Learning
AI summary

The authors study how to map short, noisy product descriptions from receipt, scanner, and web-scraped data to standard consumption categories so that prices can be compared. They use a three-step pipeline: normalizing and tokenizing the text, pre-classifying it with a rule-based prefix tree of key phrases and stop phrases, and confirming the tentative category with a per-category binary classifier. In a controlled single-category study, bag-of-words models reach an F1 of about 0.99 with roughly 67 labeled examples; a linear classifier matches a multilayer perceptron, and word-order (n-gram) features add nothing. For combining annotator judgments, a reliability-weighted vote barely beats plain majority voting, while the Dawid-Skene model recovers labels markedly better. The paper closes with practical advice for statistical offices considering transaction data for price measurement.

consumer price measurement, product classification, text normalization, tokenization, prefix-tree (trie), binary classification, bag-of-words, F1 score, human-in-the-loop, Dawid-Skene model
Authors
Vladimir Beskorovainyi
Abstract
Consumer-price measurement increasingly draws on alternative data sources -- scanner, web-scraped, and transaction/receipt data. A recurring obstacle is that product descriptions in such sources are short, noisy, and abbreviated, with no standard product code, so each item must first be mapped to a consumption classification (e.g., the UN COICOP scheme) before prices can be compared. This paper studies that mapping as a general, reproducible method. The pipeline is: (i) text normalization and tokenization of noisy item names; (ii) a prefix-tree (trie) rule-based pre-classifier driven by per-category key-phrases and stop-phrases; and (iii) a per-category binary confirmation model deciding whether an item belongs to a tentatively assigned category. For labels at scale we use a human-in-the-loop protocol in which annotators give a binary valid/reject judgment, aggregated by a dynamically updated reliability weight; the model joins the same rule, enabling continual fine-tuning. Our empirical finding is deflationary: in a controlled, leakage-free study (one category, real positives vs. hard negatives, five seeds), bag-of-words models essentially saturate the task (F1 about 0.99) -- a linear classifier matches a multilayer perceptron, explicit word-order (n-gram) features add nothing, and about 67 labeled examples already suffice. A Monte-Carlo study of the labeling protocol shows the reliability-weighted vote barely beats plain majority (its additive weights saturate) while Dawid-Skene recovers labels markedly better. We also discuss price-level quality control and design lessons for statistical offices considering transaction data. All figures are illustrative; no confidential data, code, or documentation is reproduced.
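Steps (i) and (ii) of the pipeline can be illustrated with a minimal sketch. It is not the paper's implementation: the normalization rule, the `PhraseTrie` class, the `pre_classify` function, the example phrases, and the choice to let a stop-phrase veto its category are all assumptions made for illustration.

```python
import re
from typing import Dict, List, Tuple

def normalize(name: str) -> List[str]:
    """Lowercase, turn every non-alphanumeric run into a space, and split."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).split()

class PhraseTrie:
    """Token-level prefix tree holding key-phrases and stop-phrases (hypothetical)."""

    _END = "$"  # terminal marker; never a real token because normalize() strips "$"

    def __init__(self) -> None:
        self.root: Dict = {}

    def add(self, phrase: str, category: str, kind: str) -> None:
        """Register a phrase for a category; kind is "key" or "stop"."""
        node = self.root
        for tok in normalize(phrase):
            node = node.setdefault(tok, {})
        node.setdefault(self._END, []).append((category, kind))

    def matches(self, tokens: List[str]) -> List[Tuple[str, str]]:
        """Return (category, kind) for every phrase found as a contiguous token run."""
        found: List[Tuple[str, str]] = []
        for start in range(len(tokens)):
            node = self.root
            for tok in tokens[start:]:
                node = node.get(tok)
                if node is None:
                    break
                found.extend(node.get(self._END, []))
        return found

def pre_classify(name: str, trie: PhraseTrie) -> List[str]:
    """Tentative categories: at least one key-phrase hit and no stop-phrase hit."""
    hits = trie.matches(normalize(name))
    vetoed = {cat for cat, kind in hits if kind == "stop"}
    keyed = {cat for cat, kind in hits if kind == "key"}
    return sorted(keyed - vetoed)

# Illustrative phrases only; real phrase lists would be curated per category.
trie = PhraseTrie()
trie.add("milk", "milk", "key")
trie.add("mlk", "milk", "key")
trie.add("milk chocolate", "milk", "stop")
trie.add("chocolate", "chocolate", "key")

print(pre_classify("MLK 2% 1L", trie))
print(pre_classify("Milk Chocolate bar 100g", trie))
```

In this sketch the pre-classifier only proposes tentative categories. Each proposal would then go to the per-category binary confirmation model of step (iii), which is not shown here.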