Tetris: Tile-level Sampling for Efficient and High-Fidelity Video Object Tracking

2026-05-25 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition • Databases

AI summary

The authors present Tetris, a system that extracts object tracks from stationary video by breaking frames into tiles and grouping the relevant ones, so the detector does less unnecessary work. Instead of running the detector on every full frame, Tetris prunes tiles unlikely to contain objects and packs the remaining ones into canvases that are fed to the detector. In their experiments on 7 stationary-video datasets, Tetris stays within a 5% tracking accuracy loss of a full-frame, every-frame reference pipeline, with up to 68.8x higher throughput than that pipeline and up to 17.4x higher than prior systems.

track materialization, object tracking, stationary video, tile-based decomposition, polyomino, temporal frame sampling, detector calls, integer linear program (ILP), tracking accuracy, spatiotemporal pruning

Authors

Chanwut Kittivorawong, Alena Chao, Charlie Si, Alvin Cheung

Abstract

Track materialization converts raw video into reusable object tracks that downstream queries can run against without rerunning tracking, but extracting those tracks efficiently and with high fidelity remains expensive. Prior systems reduce cost through temporal frame sampling, erasing the inter-frame motion that fine-grained tracking requires. In stationary video, however, large portions of each frame contain no objects of interest, and the remaining regions tolerate different sampling rates. We present Tetris, a track-extraction system that decomposes videos into a tile-based polyomino data model, enabling fine-grained spatiotemporal pruning that reduces detector calls with minimal fidelity loss. Tetris runs three operators upstream of the user-provided detector: a classifier identifies relevant tiles and groups them into polyominoes, an integer linear program (ILP) prunes redundant polyominoes under a user-specified accuracy constraint, and a packer assembles the survivors into canvases that minimize detector calls. Across 7 stationary-video datasets, Tetris stays within a 5% tracking accuracy loss of a full-frame, every-frame reference pipeline, whereas prior systems exceed this bound on 3 of the 7 datasets. At this 5% bound, Tetris achieves up to 17.4x higher throughput than prior systems and up to 68.8x higher than the reference pipeline. The project page is at https://tetris-db.github.io .
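
To make the tile-and-polyomino idea concrete, here is a minimal Python sketch of two of the steps the abstract names: grouping relevant tiles into polyominoes and packing them onto canvases. It is an illustration under stated assumptions, not the authors' implementation. The function names, the 4-connectivity rule, the binary relevance mask (standing in for the classifier output), and the bounding-box shelf packing are all hypothetical choices made for this example. The ILP pruning step is omitted entirely.

```python
from collections import deque


def group_polyominoes(mask):
    """Group relevant tiles (truthy cells) into 4-connected components.

    `mask` is a hypothetical stand-in for the classifier output:
    a list of rows, one cell per tile.
    """
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = set()
    polyominoes = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or (r, c) in seen:
                continue
            # Breadth-first search over edge-adjacent relevant tiles.
            queue = deque([(r, c)])
            seen.add((r, c))
            tiles = []
            while queue:
                y, x = queue.popleft()
                tiles.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            polyominoes.append(sorted(tiles))
    return polyominoes


def bounding_box(tiles):
    """Return (height, width) in tiles of a polyomino's bounding box."""
    ys = [y for y, _ in tiles]
    xs = [x for _, x in tiles]
    return max(ys) - min(ys) + 1, max(xs) - min(xs) + 1


def count_canvases(boxes, canvas_h, canvas_w):
    """Count canvases used by a simple next-fit shelf packing.

    Assumption: each polyomino is packed as its bounding box, which
    wastes space compared with packing the exact shapes.
    """
    canvases = 0
    shelf_y = shelf_h = cursor_x = 0
    for h, w in sorted(boxes, reverse=True):  # tallest boxes first
        if h > canvas_h or w > canvas_w:
            raise ValueError("box does not fit on an empty canvas")
        if canvases == 0:
            canvases = 1
        if cursor_x + w > canvas_w:  # row is full: open a new shelf
            shelf_y += shelf_h
            shelf_h = cursor_x = 0
        if shelf_y + h > canvas_h:  # canvas is full: open a new canvas
            canvases += 1
            shelf_y = shelf_h = cursor_x = 0
        shelf_h = max(shelf_h, h)
        cursor_x += w
    return canvases


# One 4x6 tile grid with two separate groups of relevant tiles.
mask = [
    [1, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
]
polys = group_polyominoes(mask)
boxes = [bounding_box(p) for p in polys]
print(len(polys), count_canvases(boxes, canvas_h=2, canvas_w=3))
```

In this toy grid, the two polyominoes fit side by side on a single 2x3-tile canvas, so one detector call covers a 6-tile canvas rather than the full 24-tile frame. Packing exact polyomino shapes instead of bounding boxes would be tighter, at the cost of a harder packing problem.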
