TSseek: Regular Expression-Based Similarity Search for Distributed Time Series Datasets

2026-06-08 | Databases

AI summary

The authors address the problem of searching large time series datasets for patterns without requiring an exact sequence of values as the query. They introduce TSseek, a system that lets users search with flexible patterns such as trends, value ranges, and wildcard segments. TSseek approximates each time series as a sequence of line segments and expresses queries as regular expressions, which it translates into bounding rectangles in the same space. A distributed spatial index, TSseek-X, supports efficient processing of both whole-series and subsequence matching. In the reported experiments, TSseek returns exact answers efficiently, whereas full-scan, model-based, and SAX-based baselines give up either accuracy or speed.

time series, similarity search, regular expressions, distributed index, line segment approximation, PAA, SAX, subsequence matching, trend detection, pattern search
Authors
Xiaoshuai Li, Khalid Alnuaim, Mohamed Y. Eltabakh, Elke A. Rundensteiner
Abstract
Similarity search is a fundamental operation in time series analysis. Most existing techniques, however, require users to supply a precise sequence of values (typically an entire time series object) as the query input. This rigid requirement limits real-world applications, where users instead want to express patterns, trends, or value ranges. Flexible, pattern-based search has been explored in text retrieval and complex event processing, but remains underexplored for large-scale distributed time series. To close this gap, we propose TSseek, a regular-expression-powered search framework for distributed time series datasets. TSseek's query language enables users to compose patterns encompassing trends, value ranges, and wildcard segments. We show that conventional approximation techniques (e.g., PAA and SAX) and their index structures are ill-suited for such queries because they cannot operate on regular-expression query constructs. In TSseek, we map the time series objects and the query constructs into the same space by approximating time series objects as sequences of line segments that retain both trend (slope direction) and value range, and translating query constructs into bounding rectangles. To support efficient processing, we build TSseek-X, a distributed spatial index over the time series segments. TSseek supports two fundamental query types, namely whole-matching queries (over entire series) and subsequence-matching queries (over arbitrary windows within a series). Across benchmark and real-world datasets, full-scan, model-based, and SAX-based baselines all sacrifice either accuracy or speed, whereas TSseek returns exact answers efficiently. Also, for subsequence workloads, it achieves significant speedups over state-of-the-art subsequence matching engines.
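To make the central idea concrete, the following is a minimal Python sketch of matching a regular expression against a line-segment approximation that keeps both trend and value range. It is an illustration under stated assumptions, not TSseek's query language, index, or algorithm: the fixed-width segmentation, the trend symbols `U`/`D`/`F`, the `flat_eps` threshold, and the names `Segment`, `segment_series`, `trend_string`, and `whole_match` are all hypothetical, and the sketch scans a single in-memory series rather than using a distributed spatial index.

```python
import re
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Segment:
    """One line segment: a trend symbol plus the value range it covers."""
    trend: str   # 'U' (up), 'D' (down), or 'F' (flat); symbols are an assumption
    lo: float    # minimum value inside the segment
    hi: float    # maximum value inside the segment


def segment_series(values: Sequence[float], width: int,
                   flat_eps: float = 0.05) -> List[Segment]:
    """Approximate a series by fixed-width segments (hypothetical scheme).

    Each segment spans `width` steps; its trend is the sign of the slope
    between its two endpoints, with |slope| <= flat_eps treated as flat.
    Adjacent segments share an endpoint; a trailing segment may be shorter.
    """
    segments = []
    for start in range(0, len(values) - 1, width):
        window = values[start:start + width + 1]
        slope = (window[-1] - window[0]) / (len(window) - 1)
        if slope > flat_eps:
            trend = "U"
        elif slope < -flat_eps:
            trend = "D"
        else:
            trend = "F"
        segments.append(Segment(trend, min(window), max(window)))
    return segments


def trend_string(segments: List[Segment]) -> str:
    """Concatenate trend symbols so a regular expression can run over them."""
    return "".join(s.trend for s in segments)


def whole_match(segments: List[Segment], trend_regex: str,
                lo: float, hi: float) -> bool:
    """Whole-series match: the trend string must fully match the regex and
    every segment's value range must lie inside [lo, hi]."""
    if re.fullmatch(trend_regex, trend_string(segments)) is None:
        return False
    return all(lo <= s.lo and s.hi <= hi for s in segments)


# Usage: rise, optional plateau, fall, then anything ('.*' acts as a wildcard).
series = [1.0, 2.0, 3.0, 3.0, 3.0, 2.0, 1.0, 1.5, 2.0]
segs = segment_series(series, width=2)
print(trend_string(segs))
print(whole_match(segs, r"U+F*D+.*", lo=0.0, hi=5.0))
```

In this sketch the value range is a single interval applied to every segment. A per-construct range, which is closer to the bounding-rectangle view described in the abstract, would need a matcher that tracks which segments each part of the pattern consumed.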