Self-Conditioned Positional HNSW for Overlap-Aware Retrieval in Chunked-Document RAG Systems: Method and Industrial Evidence-Quality Audit

2026-06-01 • Distributed, Parallel, and Cluster Computing

Distributed, Parallel, and Cluster Computing • Artificial Intelligence • Computation and Language • Databases • Information Retrieval

AI summary

The authors address a problem in retrieval systems that split documents into overlapping chunks and search those chunks with HNSW, an approximate nearest-neighbor index. Because of the overlap, the top results often include near-adjacent chunks that repeat the same evidence and waste prompt budget. They propose SCP-HNSW, which appends a low-dimensional positional code to chunk embeddings and uses a two-pass query procedure, leaving HNSW graph construction and traversal unchanged. They also report a text-evidence audit of reviews and an OCR audit to characterize real-world evidence quality, and they identify the controlled retrieval ablations still needed before causal performance claims can be made.

retrieval-augmented generation, chunked-document retrieval, approximate nearest-neighbor search, HNSW, positional encoding, query procedure, OCR audit, text-evidence audit, document indexing, overlapping chunks

Authors

Nataraj Agaram Sundar, Tejas Morabia

Abstract

Chunked-document retrieval is a common component of retrieval-augmented generation (RAG) systems. Documents are split into overlapping chunks, embedded, and indexed with approximate nearest-neighbor search such as hierarchical navigable small world graphs (HNSW). Overlap improves boundary coverage but induces a practical failure mode: top-k retrieval often returns near-adjacent chunks that repeat evidence and waste prompt budget. We propose Self-Conditioned Positional HNSW (SCP-HNSW), a lightweight modification that appends a low-dimensional positional code to chunk embeddings and uses a two-pass query procedure to estimate and apply a query-specific document-position prior. SCP-HNSW leaves HNSW graph construction and traversal unchanged while adding an auditable minimum-index-gap selector for final context construction. We also integrate industrial review artifacts for generated evidence quality: a 770-review text-evidence audit with 318 fully labeled reviews and a 70-case OCR audit with 350 ratings. The text audit shows that 574 of 770 projected reviews are rated 3/5, only 39 fall in the 1-2 range, and narrative reviewer detail appears much more often than structured issue flags. The OCR audit shows slice-level pass rates from 95% for clean chat screenshots to 45% for handwritten/blurry captures, with moderate to strong agreement. These results motivate overlap-aware, audit-friendly RAG retrieval and identify the remaining controlled retrieval ablations needed for causal performance claims.
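
The abstract does not specify how the minimum-index-gap selector is implemented, so the following is only a minimal sketch of one plausible reading: a greedy pass over candidates in score order that skips any chunk whose index lies within a fixed gap of an already selected chunk from the same document. The function name `select_with_min_gap`, the `(doc_id, chunk_index, score)` candidate layout, and the `min_gap` parameter are assumptions made for illustration, not the authors' API.

```python
from typing import List, Tuple

# Hypothetical candidate layout: (doc_id, chunk_index, similarity score).
Candidate = Tuple[str, int, float]


def select_with_min_gap(
    candidates: List[Candidate], k: int, min_gap: int
) -> List[Candidate]:
    """Greedily keep up to k highest-scoring candidates, skipping any chunk
    whose index is closer than min_gap to an already selected chunk of the
    same document. This is an assumed reading of the selector, not the
    paper's definition."""
    if k <= 0:
        return []
    selected: List[Candidate] = []
    # Visit candidates from highest to lowest score.
    for doc_id, idx, score in sorted(candidates, key=lambda c: c[2], reverse=True):
        too_close = any(
            d == doc_id and abs(idx - i) < min_gap for d, i, _ in selected
        )
        if too_close:
            continue  # near-adjacent chunk: likely repeats the same evidence
        selected.append((doc_id, idx, score))
        if len(selected) == k:
            break
    return selected


# Toy candidates standing in for a retrieval result list (made-up scores).
candidates = [
    ("doc_a", 10, 0.91),
    ("doc_a", 11, 0.90),
    ("doc_a", 12, 0.88),
    ("doc_b", 3, 0.85),
    ("doc_a", 20, 0.80),
]
picked = select_with_min_gap(candidates, k=3, min_gap=3)
print([(d, i) for d, i, _ in picked])
```

In this toy run, chunks 11 and 12 of `doc_a` are skipped because they sit within three positions of chunk 10, so the remaining slots go to `doc_b` chunk 3 and `doc_a` chunk 20. Because every skip is a simple index comparison, the decisions can be logged and reviewed, which is one way to read the abstract's description of the selector as auditable.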
