Abstract
Despite recent breakthroughs in face hallucination, video face hallucination remains challenging due to the need for consistency across video frames. The temporal dimension of video makes it difficult to learn facial motion and maintain color uniformity throughout the sequence. To address these challenges, we propose a novel video face hallucination network supported by audio-visual cross-modality cues. The framework learns fine spatiotemporal motion patterns by leveraging the correlation between the movement of facial structures and the associated speech signal. Another significant challenge, generic to face hallucination, is blurriness around key facial regions such as the mouth and lips. These areas exhibit large spatial displacements, making their recovery from low-resolution frames particularly difficult. The proposed approach explicitly defines a lip-reading loss to learn the fine-grained motion in these regions. Furthermore, during training, GANs tend to overfit to narrow frequency bands, which causes hard-to-synthesize frequencies to be missed. As a remedy, we introduce a frequency-based loss function that compels the model to capture salient frequency features. Visual and quantitative comparisons with state-of-the-art methods demonstrate significant improvements in visual quality as well as higher coherence across successive generated frames.
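The frequency-based loss hinted at by the abstract (and the "Fourier transform" keyword below) can be realised in several ways. The following is a minimal PyTorch sketch of one common variant: an L1 penalty on the 2D Fourier spectra of generated and ground-truth frames. The function name, normalization, and choice of penalising the complex spectrum directly are illustrative assumptions, not the paper's confirmed formulation.

```python
import torch

def frequency_loss(sr_frames: torch.Tensor, hr_frames: torch.Tensor) -> torch.Tensor:
    """L1 distance between the 2D Fourier spectra of super-resolved (sr)
    and ground-truth high-resolution (hr) frames, shaped (B, C, H, W).

    Illustrative only: the abstract confirms a frequency-based loss built
    on the Fourier transform, but not this exact formulation.
    """
    # Real-to-complex 2D FFT over the two spatial dimensions.
    sr_spec = torch.fft.rfft2(sr_frames, norm="ortho")
    hr_spec = torch.fft.rfft2(hr_frames, norm="ortho")
    # Magnitude of the complex difference penalises both amplitude
    # and phase mismatches between the two spectra.
    return (sr_spec - hr_spec).abs().mean()
```

In a GAN setup, such a term would typically be added to the generator objective with a small weight alongside the pixel and adversarial losses.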
Original language | English
---|---
Article number | 77
Number of pages | 14
Journal | Machine Vision and Applications
Volume | 36
Issue number | 4
Early online date | 9 May 2025
DOIs |
Publication status | Published - Jul 2025
ASJC Scopus subject areas
- Software
- Hardware and Architecture
- Computer Vision and Pattern Recognition
- Computer Science Applications
Keywords
- Cross-modality
- Face hallucination
- Fourier transform
- Generative adversarial networks
- Speech recognition