AnomalyCLIP

Delving into CLIP latent space for Video Anomaly Recognition

1 University of Trento 2 Fondazione Bruno Kessler
*Indicates Equal Contribution

Abstract

We tackle the complex problem of detecting and recognising anomalies in surveillance videos at the frame level, utilising only video-level supervision. We introduce the novel method AnomalyCLIP, the first to combine Large Language and Vision (LLV) models, such as CLIP, with multiple instance learning for joint video anomaly detection and classification. Our approach specifically involves manipulating the latent CLIP feature space to identify the normal event subspace, which in turn allows us to effectively learn text-driven directions for abnormal events. When anomalous frames are projected onto these directions, they exhibit a large feature magnitude if they belong to a particular class. We also introduce a computationally efficient Transformer architecture to model short- and long-term temporal dependencies between frames, ultimately producing the final anomaly score and class prediction probabilities. We compare AnomalyCLIP against state-of-the-art methods on three major anomaly detection benchmarks, i.e. ShanghaiTech, UCF-Crime, and XD-Violence, and empirically show that it outperforms baselines in recognising video anomalies.

Method Overview

Banner Image

Illustration of our proposed framework AnomalyCLIP. The Selector model learns directions 𝒅 using CoOp [1], and uses them to identify the likelihood of each feature 𝒙 representing an occurrence of the corresponding anomalous class. MIL selection of the top-𝐾 and bottom-𝐾 abnormal segments is performed by considering the distribution of likelihoods along the corresponding direction. A Temporal model performs temporal aggregation of the features to produce the final prediction.
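To make the selection mechanism concrete, below is a minimal sketch (in PyTorch, not the released implementation) of how segment features could be scored against learned class directions and how the top-𝐾 / bottom-𝐾 segments could be picked per class; all names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes (assumptions, not the paper's values).
num_classes, feat_dim, num_segments, k = 13, 512, 32, 3

# Hypothetical learned text-driven directions, one per anomalous class
# (e.g. derived from CoOp-style prompt embeddings), L2-normalised.
directions = F.normalize(torch.randn(num_classes, feat_dim), dim=-1)

# Hypothetical re-centred CLIP features for the segments of one video.
features = torch.randn(num_segments, feat_dim)

# Likelihood of each segment for each class: magnitude of the projection
# of the feature onto the corresponding direction.
likelihood = features @ directions.t()                # (num_segments, num_classes)

# MIL selection: for each class, take the K segments with the largest
# projections (most likely abnormal) and the K with the smallest ones.
top_vals, top_idx = likelihood.topk(k, dim=0)                      # top-K per class
bot_vals, bot_idx = likelihood.topk(k, dim=0, largest=False)       # bottom-K per class
```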

Delving into CLIP latent space

Banner Image

Illustration of the CLIP latent space and the effects of the re-centring transformation on its features. When the space is not re-centred around the normality prototype 𝒎, directions 𝒅′ are similar to each other, making it difficult to discern anomaly types, and feature magnitude is not linked to the degree of anomaly, making it difficult to identify anomalous events. When re-centred, the distribution of the magnitudes of features projected on each 𝒅 identifies the degree of detected anomaly of the corresponding type.
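As a toy illustration of the re-centring idea (an assumption-laden sketch, not the paper's code), one can subtract a normality prototype 𝒎 from a CLIP feature and read the degree of anomaly of each type from the magnitude of its projection onto the corresponding direction 𝒅:

```python
import torch
import torch.nn.functional as F

feat_dim, num_classes = 512, 13                    # illustrative sizes

m = torch.randn(feat_dim)                          # normality prototype (assumed given)
d = F.normalize(torch.randn(num_classes, feat_dim), dim=-1)  # one direction per anomaly type
x = torch.randn(feat_dim)                          # a CLIP frame feature

x_centred = x - m                                  # re-centre the space around the prototype
anomaly_degree = x_centred @ d.t()                 # a large value along d_c means strong
                                                   # evidence for anomaly type c
```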

Experiments

We validate our method against a range of baselines taken from state-of-the-art video anomaly detection (VAD) and action recognition methods, which we adapt to the video anomaly recognition (VAR) task. We evaluate our method on three widely-used VAD datasets, i.e., ShanghaiTech [2], UCF-Crime [3], and XD-Violence [4], and perform comparisons on both the VAD and VAR tasks.

Quantitative Results

VAD Image

Comparison of various anomaly detection methods on the ShanghaiTech, UCF-Crime, and XD-Violence datasets in terms of the area under the curve (AUC) of the receiver operating characteristic (ROC) and the average precision (AP) of the precision-recall curve (PRC). A higher AUC and AP are crucial for video anomaly detection, as they reflect the model's ability to correctly recognise the presence of anomalies.

VAR Image

Comparison of various anomaly recognition methods on the ShanghaiTech, UCF-Crime, and XD-Violence datasets in terms of the mean area under the curve (mAUC) of the receiver operating characteristic (ROC) and the mean average precision (mAP) of the precision-recall curve (PRC), which average the binary AUC ROC and AP PRC values over all anomalous classes, respectively. A higher mAUC and mAP are crucial for video anomaly recognition, as they reflect the model's ability to correctly recognise the abnormal class. Notably, our proposed method, AnomalyCLIP, achieves the highest performance on all datasets, surpassing both state-of-the-art video anomaly detection methods re-purposed for anomaly recognition and CLIP-based video action recognition methods.
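For reference, the sketch below shows one way the per-class mAUC and mAP could be computed with scikit-learn; the function name, array shapes, and the convention that label 0 denotes normal frames are assumptions for illustration, not the paper's released evaluation code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def var_metrics(scores: np.ndarray, labels: np.ndarray) -> tuple[float, float]:
    """scores: (num_frames, num_classes) per-class anomaly scores.
    labels: (num_frames,) integer class id per frame, 0 = normal."""
    aucs, aps = [], []
    num_classes = scores.shape[1]
    for c in range(1, num_classes + 1):             # skip the normal class (id 0)
        y_true = (labels == c).astype(int)
        if y_true.sum() == 0:                       # class absent from the test split
            continue
        y_score = scores[:, c - 1]
        aucs.append(roc_auc_score(y_true, y_score))
        aps.append(average_precision_score(y_true, y_score))
    # mAUC and mAP: mean of the binary metrics over the anomalous classes.
    return float(np.mean(aucs)), float(np.mean(aps))
```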

Qualitative Results

In each of the following illustrations we show a video from the testing set of UCF-Crime, together with the anomaly probability p(A) and the conditional probabilities p(c|A) for each anomalous class predicted by AnomalyCLIP. When p(A) is greater than the threshold maximising (true positive rate - false positive rate), the frame is predicted as anomalous, the bounding box around the video is coloured red, and the bins of the top-3 predictions are highlighted in red. In the bottom part of the figure, red shaded areas denote the temporal ground truth of anomalies, while a red slider indicates the video's time progression.
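A small sketch of how such a threshold could be derived from the ROC curve (Youden's J statistic, i.e. the point maximising TPR - FPR) is given below; the function and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(y_true: np.ndarray, p_anomaly: np.ndarray) -> float:
    """y_true: (num_frames,) binary ground truth; p_anomaly: (num_frames,) p(A)."""
    fpr, tpr, thresholds = roc_curve(y_true, p_anomaly)
    return float(thresholds[np.argmax(tpr - fpr)])

# Frames whose p(A) exceeds this threshold are then marked as anomalous.
```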

The video captures a road accident between two vehicles. The anomaly probability remains low while the scene is normal, but quickly rises when the two vehicles collide. The RoadAccident class is consistently predicted with the highest probability for most of the frames, while some frames also have a high probability for Explosion when smoke can be seen after the crash.

The video depicts a building in flames, where the presence of an anomaly becomes highly probable as soon as smoke emerges, and remains high even where the ground truth does not annotate an anomaly. Notably, the detected anomaly is accurately classified as Explosion.

The video depicts an explosion occurring at a fuel station. Initially, the situation appears normal with a low abnormal probability. However, when the explosion takes place, the conditional probability for the abnormal class increases significantly. Notably, the video loops twice, but the ground truth annotation only pertains to the first loop. Nevertheless, our model accurately detects multiple anomalous events in the video.

The video shows arson. As soon as the flame starts to appear, the probability of anomaly presence increases, and the two anomalies with the highest probability are Arson, which is the ground truth, and Explosion, which is highly correlated.

The video portrays a common scenario in a supermarket. Our AnomalyCLIP model classifies the video as normal for almost its entire duration, with only a few frames exceeding the abnormality threshold. The conditional probabilities do not show any particular action dominating the model's predictions.

The video depicts a road accident between a vehicle and some bikers. The anomaly probability remains low during the normal scene and rises sharply when the impact occurs. AnomalyCLIP accurately identifies the anomalous frames as RoadAccident. Notably, the video loops twice, but our model correctly detects two anomalous events with consistent predictions, while the ground truth annotation only covers the first loop of the video.

The video captures a group of individuals engaged in a physical altercation, which is correctly classified by AnomalyCLIP as Fighting. The model initially predicts the video as normal, but as the situation escalates and the individuals start fighting, the probability of being abnormal increases. The model also assigns a high probability to Assault, but the circumstances make it difficult to distinguish between the two actions.

The video shows the unfolding of a burglary. The anomaly probability is high in the time window in which the anomaly takes place and the anomalous class is correct for the majority of the frames.

The video shows a traffic accident involving some motorcyclists and a van making a U-turn. The anomaly probability remains low until the moment of the accident, when it rises steeply. AnomalyCLIP correctly labels the video as RoadAccident, but during the frames in which the motorbike catches fire, the predicted action is predominantly Explosion.

The video captures a scene of shoplifting. Due to the subtle nature of the action, AnomalyCLIP predicts a consistently low anomaly probability throughout the entire video, failing to surpass the threshold for abnormality.

The video shows a man setting a car on fire. At first, the frames are predicted as anomalous with high probability and the detected anomaly is Stealing. Then the arson is correctly identified, with Explosion as the second most likely class. The anomaly probability also remains high and related to arson due to the presence of firefighters.

The video shows a man being beaten by a group of people. The frames are predicted as anomalous with high probability, and the anomaly class is correctly identified as Assault. In addition, the second most likely anomalous class predicted is Fighting.

The video shows a man setting a car on fire. At first, the frames are predicted as anomalous with high probability and the detected anomaly is Stealing. Then the arson is correctly identified, with Explosion as the second most likely class. The anomaly probability also remains high and related to arson due to the presence of smoke.

The video depicts the arrest of a man with a high anomalous probability, with the predicted anomaly correctly classified as Arrest. However, there are instances where the predicted anomaly is Abuse, which can be attributed to the presence of the man on the ground surrounded by people. In the first part of the video, there are also predictions of Robbery, possibly because the man carries a gun.

The video depicts a man attempting to break a store window, labelled as an act of vandalism. AnomalyCLIP detects abnormal frames starting from the moment the man covers his face with a balaclava, and particularly when he throws an object at the window. In some frames, the action is misclassified as robbery and burglary, but given the context of the video, these predictions are reasonable. Notably, the video loops twice, and AnomalyCLIP correctly identifies both instances of abnormality, even when the man is out of view due to the shattered glass.

References

[1] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. 2022. Learning to prompt for vision-language models. International Journal of Computer Vision (2022).
[2] Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao. 2018. Future frame prediction for anomaly detection – a new baseline. In CVPR.
[3] Waqas Sultani, Chen Chen, and Mubarak Shah. 2018. Real-world anomaly detection in surveillance videos. In CVPR.
[4] Peng Wu, Jing Liu, Yujia Shi, Yujia Sun, Fangtao Shao, Zhaoyang Wu, and Zhiwei Yang. 2020. Not only look, but also listen: Learning multimodal violence detection under weak supervision. In ECCV.