Simplifying Open-Set Video Domain Adaptation with Contrastive Learning

Giacomo Zara; Victor Guilherme Turrisi da Costa; Subhankar Roy; Paolo Rota; Elisa Ricci

doi:10.48550/arXiv.2301.03322

Corresponding author: Giacomo Zara

Research output: Contribution to journal › Article › peer-review

Abstract

In an effort to reduce annotation costs in action recognition, unsupervised video domain adaptation methods have been proposed that aim to adapt a predictive model from a labelled dataset (i.e., source domain) to an unlabelled dataset (i.e., target domain). In this work we address a more realistic scenario, called open-set video domain adaptation (OUVDA), where the target dataset contains "unknown" semantic categories that are not shared with the source. The challenge lies in aligning the shared classes of the two domains while separating the shared classes from the unknown ones. We propose to address OUVDA with a unified contrastive learning framework that learns discriminative and well-clustered features. We also propose a video-oriented temporal contrastive loss that enables our method to better cluster the feature space by exploiting the freely available temporal information in video data. We show that a discriminative feature space facilitates better separation of the unknown classes, and thereby allows us to use a simple similarity-based score to identify them. We conduct a thorough experimental evaluation on multiple OUVDA benchmarks and show the effectiveness of our proposed method against the prior art.

Original language	English
Article number	103953
Number of pages	10
Journal	Computer Vision and Image Understanding
Volume	241
Early online date	16 Feb 2024
DOIs	https://doi.org/10.48550/arXiv.2301.03322 https://doi.org/10.1016/j.cviu.2024.103953
Publication status	Published - 1 Apr 2024
Externally published	Yes

Bibliographical note

We acknowledge the support of the MUR PNRR project FAIR - Future AI Research (PE00000013) funded by the NextGenerationEU. E.R. is partially supported by the PRECRISIS project, funded by the EU Internal Security Fund (ISFP-2022-TFI-AG-PROTECT-02-101100539), the EU project SPRING (No. 871245), and by the PRIN project LEGO-AI (Prot. 2020TA3K9N). The work was carried out in the Vision and Learning joint laboratory of FBK and UNITN, and supported by the Caritro Deep Learning lab of the ProM facility.

Data Availability Statement

The data involved in the experimental evaluation is already publicly available and has been used as provided by the original releasers, according to the declared setting used in previous works.

Keywords

Open-set video domain adaptation
Video Action Recognition
Contrastive learning

Access to Document

10.48550/arXiv.2301.03322 (Licence: CC BY)
10.1016/j.cviu.2024.103953 (Licence: Unspecified)

Cite this

@article{21bd3ecf94d34f60b3babe26dce7a90b,

title = "Simplifying Open-Set Video Domain Adaptation with Contrastive Learning",

abstract = "In an effort to reduce annotation costs in action recognition, unsupervised video domain adaptation methods have been proposed that aim to adapt a predictive model from a labelled dataset (i.e., source domain) to an unlabelled dataset (i.e., target domain). In this work we address a more realistic scenario, called open-set video domain adaptation (OUVDA), where the target dataset contains {"}unknown{"} semantic categories that are not shared with the source. The challenge lies in aligning the shared classes of the two domains while separating the shared classes from the unknown ones. We propose to address OUVDA with a unified contrastive learning framework that learns discriminative and well-clustered features. We also propose a video-oriented temporal contrastive loss that enables our method to better cluster the feature space by exploiting the freely available temporal information in video data. We show that a discriminative feature space facilitates better separation of the unknown classes, and thereby allows us to use a simple similarity-based score to identify them. We conduct a thorough experimental evaluation on multiple OUVDA benchmarks and show the effectiveness of our proposed method against the prior art.",

keywords = "Open-set video domain adaptation, Video Action Recognition, Contrastive learning",

author = "Giacomo Zara and Costa, {Victor Guilherme Turrisi da} and Subhankar Roy and Paolo Rota and Elisa Ricci",

note = "We acknowledge the support of the MUR PNRR project FAIR - Future AI Research (PE00000013) funded by the NextGenerationEU. E.R. is partially supported by the PRECRISIS project, funded by the EU Internal Security Fund (ISFP-2022-TFI-AG-PROTECT-02-101100539), the EU project SPRING (No. 871245), and by the PRIN project LEGO-AI (Prot. 2020TA3K9N). The work was carried out in the Vision and Learning joint laboratory of FBK and UNITN, and supported by the Caritro Deep Learning lab of the ProM facility.",

year = "2024",

month = apr,

day = "1",

doi = "10.48550/arXiv.2301.03322",

language = "English",

volume = "241",

journal = "Computer Vision and Image Understanding",

issn = "1077-3142",

publisher = "Academic Press Inc.",

}

TY - JOUR

T1 - Simplifying Open-Set Video Domain Adaptation with Contrastive Learning

AU - Zara, Giacomo

AU - Costa, Victor Guilherme Turrisi da

AU - Roy, Subhankar

AU - Rota, Paolo

AU - Ricci, Elisa

N1 - We acknowledge the support of the MUR PNRR project FAIR - Future AI Research (PE00000013) funded by the NextGenerationEU. E.R. is partially supported by the PRECRISIS project, funded by the EU Internal Security Fund (ISFP-2022-TFI-AG-PROTECT-02-101100539), the EU project SPRING (No. 871245), and by the PRIN project LEGO-AI (Prot. 2020TA3K9N). The work was carried out in the Vision and Learning joint laboratory of FBK and UNITN, and supported by the Caritro Deep Learning lab of the ProM facility.

PY - 2024/4/1

Y1 - 2024/4/1

N2 - In an effort to reduce annotation costs in action recognition, unsupervised video domain adaptation methods have been proposed that aim to adapt a predictive model from a labelled dataset (i.e., source domain) to an unlabelled dataset (i.e., target domain). In this work we address a more realistic scenario, called open-set video domain adaptation (OUVDA), where the target dataset contains "unknown" semantic categories that are not shared with the source. The challenge lies in aligning the shared classes of the two domains while separating the shared classes from the unknown ones. We propose to address OUVDA with a unified contrastive learning framework that learns discriminative and well-clustered features. We also propose a video-oriented temporal contrastive loss that enables our method to better cluster the feature space by exploiting the freely available temporal information in video data. We show that a discriminative feature space facilitates better separation of the unknown classes, and thereby allows us to use a simple similarity-based score to identify them. We conduct a thorough experimental evaluation on multiple OUVDA benchmarks and show the effectiveness of our proposed method against the prior art.

AB - In an effort to reduce annotation costs in action recognition, unsupervised video domain adaptation methods have been proposed that aim to adapt a predictive model from a labelled dataset (i.e., source domain) to an unlabelled dataset (i.e., target domain). In this work we address a more realistic scenario, called open-set video domain adaptation (OUVDA), where the target dataset contains "unknown" semantic categories that are not shared with the source. The challenge lies in aligning the shared classes of the two domains while separating the shared classes from the unknown ones. We propose to address OUVDA with a unified contrastive learning framework that learns discriminative and well-clustered features. We also propose a video-oriented temporal contrastive loss that enables our method to better cluster the feature space by exploiting the freely available temporal information in video data. We show that a discriminative feature space facilitates better separation of the unknown classes, and thereby allows us to use a simple similarity-based score to identify them. We conduct a thorough experimental evaluation on multiple OUVDA benchmarks and show the effectiveness of our proposed method against the prior art.

KW - Open-set video domain adaptation

KW - Video Action Recognition

KW - Contrastive learning

UR - http://www.scopus.com/inward/record.url?scp=85185552775&partnerID=8YFLogxK

U2 - 10.48550/arXiv.2301.03322

DO - 10.48550/arXiv.2301.03322

M3 - Article

SN - 1077-3142

VL - 241

JO - Computer Vision and Image Understanding

JF - Computer Vision and Image Understanding

M1 - 103953

ER -
