Abstract
Word grounding tasks aim to associate individual words with
corresponding elements in visual scenes, enabling machines to link language with perception for effective human–machine interaction. However,
existing grounding models struggle to generalize to synonyms or unseen
lexical variants, limiting their performance in open-domain scenarios. In
this paper, we present a Bayesian multimodal grounding model that incorporates word embeddings as priors within a probabilistic generative
process to improve robustness under lexical variation. We compare the
effects of static FastText and contextual BERT embeddings on grounding accuracy by conditioning word–visual associations on their semantic
representations. Experiments use CLEVR-generated 3D scenes paired
with structured compositional descriptions to test the grounding of object categories, colors, and spatial relations across lexical shifts. Results
show that contextual embeddings such as BERT consistently outperform static embeddings like FastText in overall grounding accuracy and
in resolving spatial relations. We demonstrate that integrating structured
probabilistic inference with rich semantic embeddings offers a principled
and scalable solution for robust, interpretable word grounding.
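The abstract describes conditioning word–visual associations on embedding similarity inside a Bayesian generative process. The paper's actual model is not reproduced here, but the core idea can be illustrated with a minimal sketch: an unseen synonym borrows its prior over visual categories from nearby known words in embedding space, and Bayes' rule combines that prior with a scene likelihood. All vectors, category names, and the temperature value below are hypothetical, chosen only for illustration.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical toy embeddings: the query word "crimson" lies close to the
# grounded anchor word "red", so its prior transfers across the lexical shift.
embeddings = {
    "red":     [0.90, 0.10, 0.00],
    "blue":    [0.00, 0.90, 0.10],
    "crimson": [0.85, 0.15, 0.05],  # unseen synonym of "red"
}
anchors = ["red", "blue"]
categories = ["red_object", "blue_object"]

def prior_from_embedding(word, temperature=10.0):
    # Prior over visual categories from similarity to anchor words
    # (temperature is an illustrative sharpening parameter).
    sims = [cosine(embeddings[word], embeddings[a]) for a in anchors]
    return softmax([s * temperature for s in sims])

def posterior(word, likelihood):
    # Bayes' rule: posterior ∝ embedding-derived prior × scene likelihood.
    prior = prior_from_embedding(word)
    unnorm = [p * l for p, l in zip(prior, likelihood)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Illustrative scene likelihoods for two candidate visual categories.
post = posterior("crimson", likelihood=[0.6, 0.4])
best = categories[post.index(max(post))]
print(best)  # the unseen word "crimson" grounds to the red object
```

The point of the sketch is the factorisation: lexical robustness comes entirely from the embedding-derived prior, so swapping static vectors (FastText-like) for contextual ones (BERT-like) changes only `prior_from_embedding`, leaving the probabilistic inference untouched.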
| Original language | English |
|---|---|
| Title of host publication | PRICAI Conference 2025 |
| Publisher | Springer |
| Publication status | Published - 17 Nov 2025 |
| Event | 22nd Pacific Rim International Conference on Artificial Intelligence (PRICAI 2025), 17 Nov 2025 → 21 Nov 2025 |
Conference
| Conference | 22nd Pacific Rim International Conference on Artificial Intelligence (PRICAI 2025) |
|---|---|
| Period | 17/11/25 → 21/11/25 |