Chronos: Temporal-Aware Conversational Agents with Structured Event Retrieval for Long-Term Memory

2026-03-17 • Computation and Language


AI summary

The authors present Chronos, a system that helps conversational AI agents remember and use information from long chats spanning weeks or months. Chronos breaks conversations into events, each with dates and details, and stores them in an event calendar alongside the original conversation turns, making it easier to find specific facts over time. For each question, it generates tailored (dynamic) prompts that tell the AI what to retrieve and how to reason when a question requires looking back through the conversation. Their tests show Chronos reaches up to 95.60% accuracy on the LongMemEval_S benchmark, outperforming previous methods, with the largest gain coming from the structured events calendar.

Large Language Models, Conversational AI, Temporal Memory, Event Tuple, Multi-hop Reasoning, Dialogue History, Dynamic Prompting, Information Retrieval, Ablation Study, Long-term Interaction

Authors

Sahil Sen, Elias Lumer, Anmol Gulati, Vamse Kumar Subbiah

Abstract

Recent advances in Large Language Models (LLMs) have enabled conversational AI agents to engage in extended multi-turn interactions spanning weeks or months. However, existing memory systems struggle to reason over temporally grounded facts and preferences that evolve across months of interaction, and they lack effective retrieval strategies for multi-hop, time-sensitive queries over long dialogue histories. We introduce Chronos, a novel temporal-aware memory framework that decomposes raw dialogue into subject-verb-object event tuples with resolved datetime ranges and entity aliases, indexing them in a structured event calendar alongside a turn calendar that preserves full conversational context. At query time, Chronos applies dynamic prompting to generate tailored retrieval guidance for each question, directing the agent on what to retrieve, how to filter across time ranges, and how to approach multi-hop reasoning through an iterative tool-calling loop over both calendars. We evaluate Chronos with 8 LLMs, both open-source and closed-source, on the LongMemEval_S benchmark, which comprises 500 questions spanning six categories of dialogue history tasks. Chronos Low achieves 92.60% and Chronos High achieves 95.60% accuracy, setting a new state of the art with an improvement of 7.67% over the best prior system. Ablation results reveal that the events calendar accounts for a 58.9% gain over the baseline, while all other components yield improvements between 15.5% and 22.3%. Notably, Chronos Low alone surpasses prior approaches evaluated under their strongest model configurations.
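The abstract describes the event calendar and turn calendar only at a high level. As a rough illustration, and not the authors' implementation, the Python sketch below assumes a hypothetical schema (`EventTuple`, `Turn`, `EventCalendar`, `TurnCalendar`, and a `turn_id` field linking each event back to its source turn). It shows how a time-range filter with alias matching over event tuples could hand off to a turn calendar for full conversational context. Datetime resolution, LLM-based tuple extraction, dynamic prompting, and the iterative tool-calling loop are deliberately left out.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class EventTuple:
    """A subject-verb-object event with a resolved date range (hypothetical schema)."""
    subject: str
    verb: str
    obj: str
    start: date                              # resolved start of the event's time range
    end: date                                # resolved end (equal to start for point events)
    aliases: FrozenSet[str] = frozenset()    # alternative names for the entities involved
    turn_id: int = -1                        # hypothetical link back to the source turn


@dataclass(frozen=True)
class Turn:
    turn_id: int
    when: date
    text: str


class EventCalendar:
    def __init__(self) -> None:
        self._events: List[EventTuple] = []

    def add(self, event: EventTuple) -> None:
        self._events.append(event)

    def query(self, start: date, end: date, entity: Optional[str] = None) -> List[EventTuple]:
        """Return events whose range overlaps [start, end], optionally restricted
        to an entity matched (case-insensitively) by subject, object, or alias."""
        hits = []
        for ev in self._events:
            if not (ev.start <= end and ev.end >= start):
                continue  # no overlap with the query window
            if entity is not None:
                names = {ev.subject.lower(), ev.obj.lower()} | {a.lower() for a in ev.aliases}
                if entity.lower() not in names:
                    continue
            hits.append(ev)
        return sorted(hits, key=lambda ev: ev.start)


class TurnCalendar:
    def __init__(self) -> None:
        self._turns: Dict[int, Turn] = {}

    def add(self, turn: Turn) -> None:
        self._turns[turn.turn_id] = turn

    def context_for(self, events: List[EventTuple]) -> List[Turn]:
        """Fetch the full source turns behind retrieved events, in time order."""
        ids = {ev.turn_id for ev in events if ev.turn_id in self._turns}
        return sorted((self._turns[i] for i in ids), key=lambda t: (t.when, t.turn_id))


# Usage with made-up dialogue: "what happened with the dog in Feb-Mar 2024?"
turns = TurnCalendar()
turns.add(Turn(1, date(2024, 1, 5), "I just adopted a dog named Biscuit."))
turns.add(Turn(2, date(2024, 3, 2), "Biscuit started agility classes this week."))

events = EventCalendar()
events.add(EventTuple("user", "adopted", "dog", date(2024, 1, 5), date(2024, 1, 5),
                      frozenset({"Biscuit"}), turn_id=1))
events.add(EventTuple("Biscuit", "started", "agility classes", date(2024, 2, 26),
                      date(2024, 3, 2), frozenset({"dog"}), turn_id=2))

hits = events.query(date(2024, 2, 1), date(2024, 3, 31), entity="dog")
context = turns.context_for(hits)
for turn in context:
    print(turn.when, turn.text)
```

In this sketch the time filter uses range overlap rather than containment, so an event spanning several days is still returned when the query window covers only part of it. The alias set lets a query for "dog" reach an event whose subject is "Biscuit". The abstract does not say which matching rules Chronos itself uses.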
