Towards an End-To-End System for Real-Time Gesture Recognition from Surface Vibrations
2026-05-11 • Hardware Architecture
AI summary
The authors developed a complete system that detects hand gestures by sensing vibrations on a regular office desk using special piezoelectric sensors. They created a flexible process that turns raw vibration data into a format a computer model can understand, then trained a small neural network to recognize different gestures. Their system was tested with 15 people and correctly identified six types of gestures with high accuracy, even when recognizing gestures from people not seen during training. This work shows how everyday surfaces can be used for quiet, built-in gesture control in smart homes.
Keywords: piezoelectric sensors, gesture recognition, signal processing, 1D-CNN, band-pass filtering, cross-validation, pre-processing, user-independent performance, min-max normalization, depthwise separable convolutions
Authors
Florian Hettstedt, Cedric Giese, Tianheng Ling, Keiichi Yasumoto, Gregor Schiele, Andreas Erbslöh
Abstract
Sensing surface vibrations promises unobtrusive interaction for smart home systems by enabling gesture recognition on existing everyday surfaces without disturbing living-space design. Existing approaches typically address only parts of the processing chain, such as sensing hardware or offline gesture recognition, rather than providing an end-to-end system from surface-mounted sensors to the evaluation of the prediction model. This paper presents a custom sensor system and a configurable data-to-model pipeline for gesture recognition on a standard office desk. Our hardware enables low-noise sensing of the vibrations using piezoelectric sensors. Building on a modular signal-processing framework, we model the full chain from continuous recordings through variable pre-processing to a model-ready dataset, and process the resulting data with compact depthwise separable 1D-CNNs. We conduct a joint search over pre-processing and model hyperparameters and identify a configuration with 8,722 parameters that uses band-pass filtering, fixed-length windows, and min-max normalization. On a self-recorded dataset with 15 participants performing six gestures, this configuration achieves high accuracies across different data splitting methods, including strong user-independent performance in a leave-one-subject-out cross-validation.
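The pre-processing steps named in the abstract (band-pass filtering, fixed-length windowing, min-max normalization) can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the sampling rate, cutoff frequencies, and window length here are assumed values, and the paper's actual configuration was found via a joint hyperparameter search.

```python
# Hedged sketch of the pre-processing chain: band-pass filter,
# fixed-length windows, min-max normalization per window.
# fs, low, high, and win_len are illustrative assumptions.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def preprocess(signal, fs=1000.0, low=20.0, high=400.0, win_len=256):
    # Band-pass filter to isolate the vibration band of interest.
    sos = butter(4, [low, high], btype="bandpass", fs=fs, output="sos")
    filtered = sosfiltfilt(sos, signal)
    # Segment into non-overlapping fixed-length windows.
    n_windows = len(filtered) // win_len
    windows = filtered[: n_windows * win_len].reshape(n_windows, win_len)
    # Min-max normalize each window into [0, 1].
    mins = windows.min(axis=1, keepdims=True)
    maxs = windows.max(axis=1, keepdims=True)
    return (windows - mins) / (maxs - mins + 1e-12)

windows = preprocess(np.random.default_rng(0).standard_normal(2048))
print(windows.shape)  # (8, 256)
```

Each window then forms one model-ready input sample for the 1D-CNN.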
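The abstract's emphasis on a compact model (8,722 parameters) follows from the use of depthwise separable convolutions, which factor a standard convolution into a per-channel depthwise filter plus a pointwise channel mixer. The parameter-count comparison below uses illustrative channel counts and kernel size, not the paper's actual layer configuration.

```python
# Why depthwise separable 1D convolutions shrink the model:
# compare learnable parameter counts (weights + biases).
# Channel counts and kernel size below are assumed for illustration.

def standard_conv1d_params(c_in, c_out, k):
    # A standard conv learns one k-tap filter per (input, output) channel pair.
    return c_in * c_out * k + c_out

def depthwise_separable_conv1d_params(c_in, c_out, k):
    # Depthwise stage: one k-tap filter per input channel.
    depthwise = c_in * k + c_in
    # Pointwise stage: a kernel-size-1 conv that mixes channels.
    pointwise = c_in * c_out + c_out
    return depthwise + pointwise

print(standard_conv1d_params(16, 32, 9))             # 4640
print(depthwise_separable_conv1d_params(16, 32, 9))  # 704
```

For these example sizes the separable layer needs roughly 6.6x fewer parameters, which is how a multi-layer 1D-CNN can stay under ten thousand parameters.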