TY - GEN
T1 - Temporal DINO: A Self-supervised Video Strategy to Enhance Action Prediction
T2 - 2023 IEEE/CVF International Conference on Computer Vision Workshops, ICCVW 2023
AU - Teeti, Izzeddin
AU - Bhargav, Rongali Sai
AU - Singh, Vivek
AU - Bradley, Andrew
AU - Banerjee, Biplab
AU - Cuzzolin, Fabio
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
AB - The emerging field of action prediction - the task of forecasting actions in a video sequence - plays a vital role in computer vision applications such as autonomous driving, activity analysis, and human-computer interaction. Despite significant advances, accurately predicting future actions remains challenging due to the high dimensionality, complex dynamics, and uncertainties inherent in video data. Traditional supervised approaches require large amounts of labelled data, which is expensive and time-consuming to obtain. This paper introduces a novel self-supervised video strategy for enhancing action prediction, inspired by DINO (self-distillation with no labels). The approach, named Temporal-DINO, employs two models: a 'student' that processes only past frames, and a 'teacher' that processes both past and future frames, giving it a broader temporal context. During training, the teacher guides the student to capture future context from past frames alone. The strategy is evaluated on the ROAD dataset for the action prediction downstream task using 3D-ResNet, Transformer, and LSTM architectures. The experimental results show significant improvements in prediction performance across these architectures, with an average gain of 9.9 Precision Points (PP), highlighting the method's effectiveness in strengthening the backbones' ability to capture long-term dependencies. Furthermore, the approach is efficient with respect to pretraining dataset size and the number of epochs required. It also overcomes limitations of prior approaches by supporting various backbone architectures, addressing multiple prediction horizons, reducing reliance on hand-crafted augmentations, and streamlining pretraining into a single stage. These findings highlight the potential of the approach for diverse video-based tasks such as activity recognition, motion planning, and scene understanding. Code can be found at https://github.com/IzzeddinTeeti/ssl-pred.
UR - http://www.scopus.com/inward/record.url?scp=85182949169&partnerID=8YFLogxK
U2 - 10.1109/ICCVW60793.2023.00352
DO - 10.1109/ICCVW60793.2023.00352
M3 - Conference contribution
AN - SCOPUS:85182949169
T3 - Proceedings - 2023 IEEE/CVF International Conference on Computer Vision Workshops, ICCVW 2023
SP - 3273
EP - 3283
BT - Proceedings - 2023 IEEE/CVF International Conference on Computer Vision Workshops, ICCVW 2023
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 2 October 2023 through 6 October 2023
ER -
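
The teacher-student scheme the abstract describes can be made concrete with a short sketch. Below is a minimal, hypothetical PyTorch rendering of the idea - not the authors' released code (see https://github.com/IzzeddinTeeti/ssl-pred for that): the student encodes only the past frames, the teacher encodes past plus future frames, and a DINO-style distillation loss with a stop-gradient teacher, temperature sharpening, and output centering pulls the student's representation toward the teacher's. All class names, the (B, C, T, H, W) tensor layout, and the hyperparameters are assumptions.

    # Illustrative sketch of the Temporal-DINO idea; every name and
    # hyperparameter here is hypothetical, not taken from the paper's code.
    import copy
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TemporalDINO(nn.Module):
        def __init__(self, backbone: nn.Module, out_dim: int = 256,
                     teacher_momentum: float = 0.996,
                     student_temp: float = 0.1, teacher_temp: float = 0.04):
            super().__init__()
            # Backbone maps a clip (B, C, T, H, W) to features (B, out_dim),
            # e.g. a 3D-ResNet, Transformer, or LSTM encoder as in the paper.
            self.student = backbone
            self.teacher = copy.deepcopy(backbone)   # EMA copy, never backprops
            for p in self.teacher.parameters():
                p.requires_grad = False
            self.m = teacher_momentum
            self.ts, self.tt = student_temp, teacher_temp
            self.register_buffer("center", torch.zeros(1, out_dim))

        @torch.no_grad()
        def ema_update(self):
            # Teacher weights track the student as an exponential moving average.
            for ps, pt in zip(self.student.parameters(), self.teacher.parameters()):
                pt.data.mul_(self.m).add_(ps.data, alpha=1.0 - self.m)

        def forward(self, past: torch.Tensor, future: torch.Tensor) -> torch.Tensor:
            # Student sees only past frames.
            s_out = self.student(past)
            with torch.no_grad():
                # Teacher sees past + future frames (concatenated on the time axis).
                t_out = self.teacher(torch.cat([past, future], dim=2))
                # Centering plus a sharp temperature is DINO's collapse-avoidance trick.
                t_probs = F.softmax((t_out - self.center) / self.tt, dim=-1)
                self.center.mul_(0.9).add_(t_out.mean(dim=0, keepdim=True), alpha=0.1)
            # Cross-entropy between teacher targets and student predictions.
            return -(t_probs * F.log_softmax(s_out / self.ts, dim=-1)).sum(dim=-1).mean()

In a training loop one would compute loss = model(past, future), backpropagate through the student only, step the optimizer, then call model.ema_update() so the teacher follows the student, as in the original DINO recipe; here the "augmentation" is the temporal split itself rather than hand-crafted image transforms.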