S-JEA: Stacked Joint Embedding Architectures for Self-Supervised Visual Representation Learning

Alzbeta Manova; Aiden Durrant; Georgios Leontidis

doi:10.48550/arXiv.2305.11701

Corresponding author: Georgios Leontidis

Research output: Working paper › Preprint

Abstract

The recent emergence of Self-Supervised Learning (SSL) as a fundamental paradigm for learning image representations has demonstrated, and continues to demonstrate, high empirical success in a variety of tasks. However, most SSL approaches fail to learn embeddings that capture hierarchical semantic concepts that are separable and interpretable. In this work, we aim to learn highly separable semantic hierarchical representations by stacking Joint Embedding Architectures (JEAs), where higher-level JEAs take the representations of lower-level JEAs as input. This results in a representation space that exhibits distinct sub-categories of semantic concepts (e.g., model and colour of vehicles) in higher-level JEAs. We empirically show that representations from stacked JEAs perform on a similar level to traditional JEAs with comparable parameter counts, and we visualise the representation spaces to validate the semantic hierarchies.

Original language	English
Publisher	ArXiv
Pages	1-9
Number of pages	9
DOIs	https://doi.org/10.48550/arXiv.2305.11701
Publication status	Published - 19 May 2023

Keywords

Deep Learning
Self-Supervised Learning
Computer vision

Access to Document

10.48550/arXiv.2305.11701 (Licence: Unspecified)

Leontidis_arxiv_Self-supervised learning (Final published version, 1.54 MB)

Cite this

@techreport{1e1bcb5c1a5b4996a7f360e5fb158c10,
  title = "S-JEA: Stacked Joint Embedding Architectures for Self-Supervised Visual Representation Learning",
  abstract = "The recent emergence of Self-Supervised Learning (SSL) as a fundamental paradigm for learning image representations has, and continues to, demonstrate high empirical success in a variety of tasks. However, most SSL approaches fail to learn embeddings that capture hierarchical semantic concepts that are separable and interpretable. In this work, we aim to learn highly separable semantic hierarchical representations by stacking Joint Embedding Architectures (JEA) where higher-level JEAs are input with representations of lower-level JEA. This results in a representation space that exhibits distinct sub-categories of semantic concepts (e.g., model and colour of vehicles) in higher-level JEAs. We empirically show that representations from stacked JEA perform on a similar level as traditional JEA with comparative parameter counts and visualise the representation spaces to validate the semantic hierarchies.",
  keywords = "Deep Learning, Self-Supervised Learning, Computer vision",
  author = "Alzbeta Manova and Aiden Durrant and Georgios Leontidis",
  year = "2023",
  month = may,
  day = "19",
  doi = "10.48550/arXiv.2305.11701",
  language = "English",
  pages = "1--9",
  publisher = "ArXiv",
  type = "WorkingPaper",
  institution = "ArXiv",
}

TY  - UNPB
T1  - S-JEA
T2  - Stacked Joint Embedding Architectures for Self-Supervised Visual Representation Learning
AU  - Manova, Alzbeta
AU  - Durrant, Aiden
AU  - Leontidis, Georgios
PY  - 2023/5/19
Y1  - 2023/5/19
N2  - The recent emergence of Self-Supervised Learning (SSL) as a fundamental paradigm for learning image representations has, and continues to, demonstrate high empirical success in a variety of tasks. However, most SSL approaches fail to learn embeddings that capture hierarchical semantic concepts that are separable and interpretable. In this work, we aim to learn highly separable semantic hierarchical representations by stacking Joint Embedding Architectures (JEA) where higher-level JEAs are input with representations of lower-level JEA. This results in a representation space that exhibits distinct sub-categories of semantic concepts (e.g., model and colour of vehicles) in higher-level JEAs. We empirically show that representations from stacked JEA perform on a similar level as traditional JEA with comparative parameter counts and visualise the representation spaces to validate the semantic hierarchies.
AB  - The recent emergence of Self-Supervised Learning (SSL) as a fundamental paradigm for learning image representations has, and continues to, demonstrate high empirical success in a variety of tasks. However, most SSL approaches fail to learn embeddings that capture hierarchical semantic concepts that are separable and interpretable. In this work, we aim to learn highly separable semantic hierarchical representations by stacking Joint Embedding Architectures (JEA) where higher-level JEAs are input with representations of lower-level JEA. This results in a representation space that exhibits distinct sub-categories of semantic concepts (e.g., model and colour of vehicles) in higher-level JEAs. We empirically show that representations from stacked JEA perform on a similar level as traditional JEA with comparative parameter counts and visualise the representation spaces to validate the semantic hierarchies.
KW  - Deep Learning
KW  - Self-Supervised Learning
KW  - Computer vision
UR  - https://arxiv.org/pdf/2305.11701.pdf
U2  - 10.48550/arXiv.2305.11701
DO  - 10.48550/arXiv.2305.11701
M3  - Preprint
SP  - 1
EP  - 9
BT  - S-JEA
PB  - ArXiv
ER  -

Equipment

"Maxwell" HPC for Research
