Creative Quality Alignment: Expert Tacit Knowledge Transfer via Chain-of-Thought Fine-Tuning
2026-05-25 • Computation and Language
Computation and Language, Artificial Intelligence, Machine Learning
AI summary
The authors test the creative quality metric from their earlier work in practical engineering settings with limited data and a small model. They train on about 100 expert examples. They also find that common alignment datasets are biased toward craft skills and are weak on audience modeling and reality logic. They introduce the term Creative Quality Alignment (CQA) for this class of engineering methods. Additionally, they offer a theoretical observation explaining why a small number of examples can be enough, based on a property of the language model's architecture.
Creative Quality Metric, Calibrated Surprise, Chain-of-Thought Annotations, Large Language Models, Alignment Datasets, Creative Quality Alignment, Conditional Distribution Architecture, LIMA, Architectural Duality, Data Bias
Authors
Bo Zou, Chao Xu
Abstract
This paper provides an empirical implementation of the creative quality metric proposed in Calibrated Surprise (Zou & Xu, 2026a). The question it addresses is: does the mathematical claim behind that metric hold at the engineering level? To make the answer as general as possible, we deliberately choose the strictest engineering conditions: low data cost and a small base model. The training data consists of approximately 100 expert chain-of-thought (CoT) annotations produced with the BC Protocol (Zou & Xu, 2026b). We also identify a data bias: most publicly available alignment datasets are skewed toward craft-related knowledge, while coverage of audience modeling and reality logic is systematically weak. We use the term Creative Quality Alignment (CQA) to describe this class of engineering methods. We also offer a supporting theoretical observation: in an LLM with a single conditional distribution architecture, calibrating the appreciation side automatically transfers to the generation side via architectural duality. This gives a structural reason why roughly 100 CoT examples are sufficient, rather than a purely empirical observation like LIMA (Zhou et al., 2023).
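The duality observation in the abstract can be illustrated with a toy model. The following is a minimal sketch, not the authors' method: it assumes a single categorical distribution over three hypothetical candidate continuations, and the names `score`, `generation_probs`, and `calibrate` are invented for illustration. It shows only that when scoring ("appreciation") and sampling ("generation") read from one shared distribution, raising the score of an expert-preferred candidate necessarily raises its probability of being generated.

```python
import math

def softmax(logits):
    """Turn raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def score(logits, idx):
    """Appreciation side (hypothetical): log-probability assigned to candidate idx."""
    return math.log(softmax(logits)[idx])

def generation_probs(logits):
    """Generation side (hypothetical): the distribution the model would sample from."""
    return softmax(logits)

def calibrate(logits, preferred, lr=0.5, steps=20):
    """Gradient ascent on log p(preferred).

    For a softmax, d/dz_i log p(preferred) = 1[i == preferred] - p_i.
    """
    logits = list(logits)
    for _ in range(steps):
        p = softmax(logits)
        for i in range(len(logits)):
            grad = (1.0 if i == preferred else 0.0) - p[i]
            logits[i] += lr * grad
    return logits

# Toy setup (assumed): three candidates, uniform before calibration,
# with candidate 2 standing in for the expert-preferred output.
before = [0.0, 0.0, 0.0]
after = calibrate(before, preferred=2)
```

Comparing `score(after, 2)` with `score(before, 2)` and `generation_probs(after)[2]` with `generation_probs(before)[2]` shows both quantities moving together, since they are two views of the same parameters. A real LLM is far more complex, so this sketch says nothing about how many examples are needed in practice.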