Just Dance with pi! A Poly-modal Inductor for Weakly-supervised Video Anomaly Detection

Snehashis Majhi, Giacomo D'Amicantonio, Antitza Dantcheva, Quan Kong, Lorenzo Garattoni, Gianpiero Francesca, Egor Bondarau, Francois Bremond

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, Academic, peer-review


Abstract

Weakly-supervised methods for video anomaly detection (VAD) have conventionally relied solely on RGB spatio-temporal features, which limits their reliability in real-world scenarios: RGB features are not distinctive enough to set apart categories such as shoplifting from visually similar events. Robust VAD in complex real-world settings therefore requires augmenting RGB spatio-temporal features with additional modalities. Motivated by this, we introduce the Poly-modal Induced framework for VAD, PI-VAD (or π-VAD), a novel approach that augments RGB representations with five additional modalities: sensitivity to fine-grained motion (Pose), three-dimensional scene and entity representation (Depth), surrounding objects (Panoptic masks), global motion (optical flow), and language cues (VLM). Each modality represents an axis of a polygon, streamlined to add salient cues to RGB. π-VAD includes two plug-in modules, the Pseudo-modality Generation module and the Cross Modal Induction module, which generate modality-specific prototypical representations and thereby induce multi-modal information into RGB cues. These modules operate by performing anomaly-aware auxiliary tasks and require the five modality backbones only during training. Notably, π-VAD achieves state-of-the-art accuracy on three prominent VAD datasets encompassing real-world scenarios, without the computational overhead of five modality backbones at inference.
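The key property claimed above is that modality backbones are needed only during training, while inference runs on RGB alone. The sketch below illustrates that general train/inference split under loudly stated assumptions; it is not the authors' Pseudo-modality Generation or Cross Modal Induction module, whose internals the abstract does not describe. It swaps in the simplest possible stand-ins: one hypothetical linear `PseudoModalityGenerator` per modality, trained by gradient descent on a mean-squared error against "real" modality features, and a hypothetical `induce` function that fuses by plain concatenation. The modality features are synthetic random linear functions of the RGB features, standing in for backbone outputs.

```python
import numpy as np

# The five auxiliary modalities listed in the abstract (keys are illustrative only).
MODALITIES = ("pose", "depth", "panoptic", "flow", "text")


class PseudoModalityGenerator:
    """Hypothetical linear generator: RGB snippet features -> one modality's feature space."""

    def __init__(self, d_rgb: int, d_mod: int, rng: np.random.Generator):
        self.W = rng.normal(scale=0.01, size=(d_rgb, d_mod))

    def __call__(self, rgb: np.ndarray) -> np.ndarray:
        return rgb @ self.W

    def train_step(self, rgb: np.ndarray, target: np.ndarray, lr: float = 0.5) -> float:
        """One gradient-descent step on the MSE to the real modality features.

        `target` stands in for the output of a modality backbone; it is only
        needed here, i.e. at training time. Returns the loss before the update.
        """
        residual = self(rgb) - target
        loss = float(np.mean(residual ** 2))
        grad = 2.0 * rgb.T @ residual / residual.size  # exact gradient of the mean loss w.r.t. W
        self.W -= lr * grad
        return loss


def induce(rgb: np.ndarray, generators: dict) -> np.ndarray:
    """Augment RGB features with every generated pseudo-modality feature (no backbones run)."""
    return np.concatenate([rgb] + [g(rgb) for g in generators.values()], axis=1)


rng = np.random.default_rng(0)
n_snippets, d_rgb, d_mod = 64, 16, 8
rgb = rng.normal(size=(n_snippets, d_rgb))

# Synthetic stand-ins for backbone outputs (training only).
targets = {m: rgb @ rng.normal(size=(d_rgb, d_mod)) for m in MODALITIES}
generators = {m: PseudoModalityGenerator(d_rgb, d_mod, rng) for m in MODALITIES}

first_loss, last_loss = {}, {}
for m in MODALITIES:
    losses = [generators[m].train_step(rgb, targets[m]) for _ in range(200)]
    first_loss[m], last_loss[m] = losses[0], losses[-1]

# Inference: only RGB features are required.
fused = induce(rgb, generators)
print(fused.shape)  # (64, 56)
```

Concatenation is used here only because it is the simplest fusion that makes the RGB-only inference path explicit; a real model would replace both the linear generators and the fusion step with learned networks trained jointly with the anomaly-detection objective.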
Original language: English
Title of host publication: Proceedings of the Computer Vision and Pattern Recognition Conference
Publisher: Institute of Electrical and Electronics Engineers
Publication status: Accepted/In press, Jun 2025
Event: IEEE/CVF Conference on Computer Vision and Pattern Recognition Conference 2025, CVPR 2025, Nashville, United States
Duration: 11 Jun 2025 to 15 Jun 2025

Conference

Conference: IEEE/CVF Conference on Computer Vision and Pattern Recognition Conference 2025, CVPR 2025
Abbreviated title: CVPR 2025
Country/Territory: United States
City: Nashville
Period: 11/06/25 to 15/06/25
