Evaluating Semantic Representations in Multimodal Word Grounding

Research output: Chapter in Book/Report/Conference proceeding › Conference proceedings published in a book › peer-review


Abstract

Word grounding tasks aim to associate individual words with
corresponding elements in visual scenes, enabling machines to link language with perception for effective human–machine interaction. However,
existing grounding models struggle to generalize to synonyms or unseen
lexical variants, limiting their performance in open-domain scenarios. In
this paper, we present a Bayesian multimodal grounding model that incorporates word embeddings as priors within a probabilistic generative
process to improve robustness under lexical variation. We compare the
effects of static FastText and contextual BERT embeddings on grounding accuracy by conditioning word–visual associations on their semantic
representations. Experiments use CLEVR-generated 3D scenes paired
with structured compositional descriptions to test the grounding of object categories, colors, and spatial relations across lexical shifts. Results
show that contextual embeddings such as BERT consistently outperform static embeddings like FastText in overall grounding accuracy and
in resolving spatial relations. We demonstrate that integrating structured
probabilistic inference with rich semantic embeddings offers a principled
and scalable solution for robust, interpretable word grounding.
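The core mechanism described above, conditioning word–visual associations on semantic embeddings so that unseen synonyms inherit the associations of embedding-similar seen words, can be sketched minimally as follows. The embedding vectors and association counts below are toy, hypothetical values standing in for FastText/BERT vectors and learned statistics; they are not taken from the paper.

```python
import math

# Toy stand-in "embeddings" (hypothetical values, not real FastText/BERT vectors).
EMB = {
    "red":     [1.0, 0.1, 0.0],
    "crimson": [0.9, 0.2, 0.0],   # unseen synonym of "red"
    "cube":    [0.0, 1.0, 0.1],
    "block":   [0.1, 0.9, 0.1],   # unseen synonym of "cube"
}

def cos(u, v):
    """Cosine similarity between two vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

# Word-visual association counts observed for seen words (hypothetical).
COUNTS = {
    "red":  {"color:red": 9, "color:blue": 1},
    "cube": {"shape:cube": 8, "shape:sphere": 2},
}

def ground(word, candidates):
    """Score candidate visual attributes for `word`.

    A seen word uses its own counts; an unseen word borrows counts from
    embedding-similar seen words, weighted by cosine similarity, so the
    embedding acts as a prior over word-visual associations.
    """
    scores = {c: 0.0 for c in candidates}
    for seen, counts in COUNTS.items():
        w = 1.0 if seen == word else max(cos(EMB[word], EMB[seen]), 0.0)
        for c in candidates:
            scores[c] += w * counts.get(c, 0)
    total = sum(scores.values()) or 1.0
    return {c: s / total for c, s in scores.items()}

# "crimson" was never observed, but its embedding is close to "red",
# so it inherits red's color association through the prior.
probs = ground("crimson", ["color:red", "color:blue"])
```

This illustrates only the lexical-generalization intuition; the paper's actual model embeds such priors in a full probabilistic generative process rather than the simple similarity-weighted counts used here.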
Original language: English
Title of host publication: PRICAI Conference 2025
Publisher: Springer
Publication status: Published - 17 Nov 2025
Event: 22nd Pacific Rim International Conference on Artificial Intelligence (PRICAI 2025)
Duration: 17 Nov 2025 – 21 Nov 2025


