Leveraging Code-Mixed Product Metadata and User Feedback for Personalized Recommendation on Daraz Bangladesh

2026-06-15 • Information Retrieval

AI summary

The authors studied product reviews from Daraz Bangladesh written in Bengali script, English, and Banglish (Bengali written phonetically in Latin letters). They tested several recommendation models to see how well each works, especially when users have very few reviews. They found that the best-performing model depends on how many reviews are available per user, and that Banglish reviews lower content-based recommendation accuracy because the same word is spelled inconsistently. Their work provides a way to evaluate recommendation systems in mixed-language, low-resource shopping environments.

E-commerce, Code-mixing, Collaborative Filtering, Matrix Factorization, Banglish, TF-IDF, NDCG@10, k-core threshold, Transliteration

Authors

KM Fahim A Bari, Muhammad Abdullah Adnan, Nafis Sadeq

Abstract

Bangladeshi e-commerce platforms host millions of product reviews written in Bengali Unicode, English, and Banglish, in which Bengali is phonetically transcribed in Latin script. However, the impact of code-mixed reviews on recommendation performance remains largely unexplored. We present the first such benchmark on product reviews from Daraz Bangladesh, evaluating six model families under a per-user chronological leave-last-out protocol. To address the severe long-tail sparsity of the dataset, where 59.3% of users have exactly one interaction, we conduct a systematic k-core threshold ablation across five density configurations. The results reveal that Item-based Collaborative Filtering remains stable across settings, Implicit Matrix Factorization degrades sharply with decreasing density, and Explicit Matrix Factorization alone improves at higher thresholds. To characterize the impact of code-mixing on recommendation quality, we perform a language-stratified evaluation of content-based filtering using character n-gram TF-IDF profiles. The results provide empirical evidence that fragmentation of the Banglish vocabulary reduces NDCG@10 by 46.8% relative to Bengali-script users, a degradation traceable to transliteration inconsistency across surface forms. This work establishes a reproducible evaluation foundation for recommendation research in code-mixed, low-resource e-commerce settings. The code is publicly available at https://github.com/os-car-war-thy/daraz-recsys.
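
The evaluation setup named in the abstract (k-core filtering, per-user chronological leave-last-out, NDCG@10) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes interactions arrive as hypothetical (user, item, timestamp) tuples, that the k-core filter prunes both users and items iteratively (the paper may filter differently), and that each user has a single held-out item, so the ideal DCG is 1. The function names are illustrative only.

```python
import math
from collections import Counter


def k_core(interactions, k):
    """Iteratively drop users and items with fewer than k interactions.

    Assumption: pruning applies to both users and items until stable.
    """
    current = list(interactions)
    while True:
        user_counts = Counter(u for u, _, _ in current)
        item_counts = Counter(i for _, i, _ in current)
        kept = [
            (u, i, t)
            for u, i, t in current
            if user_counts[u] >= k and item_counts[i] >= k
        ]
        if len(kept) == len(current):
            return kept
        current = kept


def leave_last_out(interactions):
    """Per user, hold out the chronologically last interaction as the test item."""
    by_user = {}
    for u, i, t in interactions:
        by_user.setdefault(u, []).append((t, i))
    train, test = [], {}
    for u, events in by_user.items():
        events.sort()  # oldest first
        *history, last = events
        train.extend((u, i, t) for t, i in history)
        test[u] = last[1]
    return train, test


def ndcg_at_k(ranked_items, held_out_item, k=10):
    """NDCG@k with one relevant item: 1 / log2(rank + 1) if it is in the top k."""
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item == held_out_item:
            return 1.0 / math.log2(rank + 1)
    return 0.0
```

With a threshold of 2 or more, every surviving user has at least one training interaction left after the split. Single-interaction users, which the abstract reports as 59.3% of the dataset, would otherwise contribute a test item with no history.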
