P3B3: A Multi-Turn Conversational Benchmark for Measuring European and Brazilian Portuguese Variety Bias in LLMs

2026-06-15 • Computation and Language

Computation and Language, Artificial Intelligence, Machine Learning

AI summary

The authors study how Large Language Models (LLMs) handle two varieties of Portuguese: European Portuguese (pt-PT) and Brazilian Portuguese (pt-BR). Brazilian Portuguese dominates the available data, and which variety LLMs prefer has been little studied. To investigate, the authors built P3B3, an expert-curated benchmark of conversational prompts written to be variety-agnostic, together with an evaluation framework for measuring variety bias and controllability. Their experiments show that most of the tested models are strongly biased toward Brazilian Portuguese and that the models differ in controllability, that is, in how well they can be steered toward a given variety. The authors conclude that more balanced representation across language varieties is needed.

Large Language Models, Portuguese language varieties, European Portuguese, Brazilian Portuguese, language bias, benchmark, linguistic variation, model evaluation, controllability

Authors

Rafael Ferreira, Inês Vieira, Inês Calvo, James Furtado, Iago Paulo, Diogo Tavares, Diogo Glória-Silva, David Semedo, João Magalhães

Abstract

As Large Language Models (LLMs) become embedded in everyday communication, capturing regional linguistic variation is essential for reliable and equitable language use. In Portuguese, European (pt-PT) and Brazilian (pt-BR) varieties remain unevenly represented, with pt-BR dominating in data quantity, while LLM preference for Portuguese varieties remains underexplored. To address this gap, we introduce P3B3, an expert-curated, language-variety-agnostic benchmark of conversational prompts, along with an evaluation framework for measuring variety bias and controllability. Experiments on several models show that most LLMs exhibit a strong bias toward pt-BR, with variation in controllability across models. These results highlight the need for more balanced multilingual representation across language varieties.
