Abstract
For autonomous robots to interact naturally with humans, they must develop language understanding capabilities that connect linguistic expressions to multimodal perception. A key challenge arises when robots encounter lexical variations, such as synonyms or novel phrases not observed during training. In this ongoing work, we present a multimodal word grounding framework that systematically integrates linguistic structures (word indices, part-of-speech tags, semantic word embeddings, and large language model representations) with perceptual features extracted from sensory data, including object geometry, color, and spatial positioning (centroids); spatial relationships are learned through our Bayesian grounding model. We evaluate five experimental cases and demonstrate improved synonym generalization using semantic embeddings. While the framework effectively grounds individual words, it is limited to single-word grounding and cannot handle more complex linguistic structures such as phrases or full sentences. We therefore discuss extending the framework toward compositional language understanding, from the word level to the phrase and sentence levels, with the aim of enabling robots to build linguistic knowledge in an unsupervised, bottom-up manner. This work contributes to advancing robot language understanding and generalization for natural human–robot interaction in dynamic environments.
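The abstract only names the approach, but a minimal sketch can illustrate the kind of Bayesian word grounding it describes. Everything below is an assumption for illustration, not the paper's implementation: the class name, the diagonal-Gaussian likelihood per word over perceptual feature vectors (e.g., concatenated color, geometry, and centroid values), and the cosine-similarity fallback that maps an unseen synonym to the most similar trained word via semantic embeddings.

```python
# Hypothetical sketch of Bayesian word grounding over multimodal features.
# The Gaussian likelihood model and all names here are illustrative
# assumptions, not the framework described in the paper.
import numpy as np


class WordGroundingModel:
    def __init__(self):
        self.params = {}      # word -> (mean, var) over perceptual features
        self.embeddings = {}  # word -> semantic embedding vector

    def fit(self, word, features, embedding):
        """Estimate a per-word diagonal Gaussian over perceptual feature
        vectors (e.g., color, geometry, centroid concatenated)."""
        features = np.asarray(features, dtype=float)
        self.params[word] = (features.mean(axis=0), features.var(axis=0) + 1e-6)
        self.embeddings[word] = np.asarray(embedding, dtype=float)

    def log_likelihood(self, word, x):
        """Log p(x | word) under the per-word diagonal Gaussian."""
        mean, var = self.params[word]
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

    def ground(self, word, embedding, candidates):
        """Return the index of the candidate percept best explained by
        `word`; unseen words fall back to the nearest trained word by
        cosine similarity of semantic embeddings."""
        if word not in self.params:
            e = np.asarray(embedding, dtype=float)
            word = max(
                self.embeddings,
                key=lambda w: np.dot(e, self.embeddings[w])
                / (np.linalg.norm(e) * np.linalg.norm(self.embeddings[w])),
            )
        return max(
            range(len(candidates)),
            key=lambda i: self.log_likelihood(word, np.asarray(candidates[i], dtype=float)),
        )


if __name__ == "__main__":
    model = WordGroundingModel()
    # Train "red" on two color-like feature vectors (hypothetical data).
    model.fit("red", [[0.9, 0.1, 0.1], [0.8, 0.2, 0.1]], embedding=[1.0, 0.0])
    # An unseen synonym ("crimson") is routed to "red" via its embedding.
    print(model.ground("crimson", embedding=[0.95, 0.05],
                       candidates=[[0.1, 0.1, 0.9], [0.85, 0.15, 0.1]]))  # -> 1
```

Under this toy setup, grounding reduces to an arg-max of likelihood over candidate percepts, and the embedding fallback is what would let a trained word answer for an unseen synonym, which is the generalization behavior the abstract reports.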
| Original language | English |
|---|---|
| Number of pages | 4 |
| Publication status | Published - 30 Jun 2025 |
| Event | IEEE International Conference on Robot & Human Interactive Communication (Ro-Man), Eindhoven, Netherlands. Duration: 25 Aug 2025 → 29 Aug 2025. https://www.ro-man2025.org/ |
Conference
| Conference | IEEE International Conference on Robot & Human Interactive Communication (Ro-Man) |
|---|---|
| City | Eindhoven |
| Country/Territory | Netherlands |
| Period | 25/08/25 → 29/08/25 |
| Internet address | https://www.ro-man2025.org/ |