Generalized Product of Experts for Learning Multimodal Representations in Noisy Environments

Abhinav Joshi, Naman Gupta, Jinang Shah, Binod Bhattarai*, Ashutosh Modi, Danail Stoyanov

Research output: Chapter in Book/Report/Conference proceeding › Published conference contribution

Abstract

Real-world applications typically involve interactions between multiple modalities (e.g., video, speech, text). To process multimodal information automatically and use it in an end application, Multimodal Representation Learning (MRL) has emerged as an active area of research. MRL involves learning reliable and robust representations of information from heterogeneous sources and fusing them. In practice, however, the data acquired from different sources are typically noisy. In some extreme cases, noise of large magnitude can completely alter the semantics of the data, leading to inconsistencies in the parallel multimodal data. In this paper, we propose a novel method for multimodal representation learning in a noisy environment via the generalized product of experts technique. In the proposed method, we train a separate network for each modality to assess the credibility of the information coming from that modality, and the contribution of each modality is then varied dynamically while estimating the joint distribution. We evaluate our method on two challenging benchmarks from two diverse domains: multimodal 3D hand-pose estimation and multimodal surgical video segmentation. We attain state-of-the-art performance on both benchmarks. Our extensive quantitative and qualitative evaluations show the advantages of our method compared to previous approaches.
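The fusion step described in the abstract can be illustrated with a small sketch. In a generalized product of experts, each modality contributes a Gaussian expert over the latent variable, and a per-modality credibility weight scales that expert's precision, so unreliable modalities contribute less to the joint estimate. The sketch below is an illustrative implementation under that assumption; the function name `generalized_poe` and the choice to pass credibility weights directly (rather than predicting them with a per-modality network, as the paper does) are ours, not the paper's API.

```python
import numpy as np

def generalized_poe(mus, variances, alphas):
    """Fuse per-modality Gaussian experts via a generalized product of experts.

    Each modality i contributes a diagonal Gaussian N(mu_i, var_i) over the
    latent variable. Raising expert i's density to the power alpha_i (its
    credibility weight) scales its precision by alpha_i, so the joint
    distribution is Gaussian with:
        precision = sum_i alpha_i / var_i
        mean      = joint_var * sum_i (alpha_i / var_i) * mu_i
    """
    mus = np.asarray(mus, dtype=float)            # shape: (n_experts, dim)
    variances = np.asarray(variances, dtype=float)
    alphas = np.asarray(alphas, dtype=float)[:, None]

    precisions = alphas / variances               # weighted per-expert precision
    joint_var = 1.0 / precisions.sum(axis=0)      # joint variance per dimension
    joint_mu = joint_var * (precisions * mus).sum(axis=0)
    return joint_mu, joint_var

# A modality judged non-credible (alpha = 0) is ignored entirely:
mu, var = generalized_poe(
    mus=[[0.0], [10.0]],
    variances=[[1.0], [1.0]],
    alphas=[1.0, 0.0],
)
```

With equal nonzero weights this reduces to the standard product of experts; driving a weight toward zero smoothly removes that modality from the joint estimate, which is the mechanism that lets fusion adapt to noisy inputs.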
Original language: English
Title of host publication: ICMI '22: Proceedings of the 2022 International Conference on Multimodal Interaction
Publisher: ACM
Pages: 83-93
Number of pages: 11
ISBN (Print): 9781450393904
DOIs
Publication status: Published - 7 Nov 2022
Event: 2022 International Conference on Multimodal Interaction - Bengaluru, India
Duration: 7 Nov 2022 - 11 Nov 2022
Conference number: 24
https://icmi.acm.org/2022/

Conference

Conference: 2022 International Conference on Multimodal Interaction
Abbreviated title: ICMI
Country/Territory: India
City: Bengaluru
Period: 7/11/22 - 11/11/22

Bibliographical note

We would like to thank the reviewers for their insightful comments. Ashutosh Modi is supported in part by SERB India (Science and Engineering Research Board) (SRG/2021/000768). Binod Bhattarai and Danail Stoyanov are funded, in whole or in part, by the Wellcome/EPSRC Centre for Interventional and Surgical Sciences (WEISS) (203145/Z/16/Z), the Engineering and Physical Sciences Research Council (EPSRC) (EP/P012841/1), the Royal Academy of Engineering Chair in Emerging Technologies scheme, and the EndoMapper project by Horizon 2020 FET (GA863146).