Audio-driven talking face generation has received growing interest, particularly for applications requiring expressive
and natural human-avatar interaction. However, most existing emotion-aware methods rely on a single modality (either audio or image)
for emotion embedding, limiting their ability to capture nuanced affective cues. Additionally, most methods condition on a single
reference image, restricting the model’s ability to represent dynamic changes in actions or attributes across time.
To address these issues, we introduce SynchroRaMa, a novel framework that integrates a multi-modal emotion embedding
by combining emotional signals from text (via sentiment analysis) and audio (via speech-based emotion recognition and
audio-derived valence-arousal features), enabling the generation of talking face videos with richer, more authentic
emotional expressiveness and higher fidelity. To ensure natural head motion and accurate lip synchronization, SynchroRaMa
includes an audio-to-motion (A2M) module that generates motion frames aligned with the input audio. Finally, SynchroRaMa
incorporates scene descriptions generated by a Large Language Model (LLM) as additional textual input, enabling it to capture
dynamic actions and high-level semantic attributes. Conditioning the model on both visual and textual cues enhances temporal
consistency and visual realism. Quantitative and qualitative experiments on benchmark datasets demonstrate that SynchroRaMa
outperforms the state-of-the-art, achieving improvements in image quality, expression preservation, and motion realism.
A user study further confirms that SynchroRaMa achieves higher subjective ratings than competing methods in overall
naturalness, motion diversity, and video smoothness. Code and model weights will be released upon acceptance.
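
For concreteness, the sketch below illustrates one way the multi-modal emotion embedding described above could combine text-sentiment, speech-emotion, and valence-arousal features into a single conditioning vector. It is a minimal PyTorch sketch under stated assumptions, not the released implementation: the module name MultiModalEmotionEmbedding, the feature dimensions, and the MLP fusion are illustrative choices of ours.

```python
# Illustrative sketch only (not the authors' code): fusing text-sentiment and
# speech-emotion cues into a single emotion conditioning vector. The module
# name, feature dimensions, and MLP fusion are assumptions for illustration.
import torch
import torch.nn as nn


class MultiModalEmotionEmbedding(nn.Module):
    """Combine a text-sentiment embedding, a speech-emotion embedding,
    and audio-derived valence-arousal scores into one conditioning vector."""

    def __init__(self, text_dim=768, audio_dim=512, va_dim=2, out_dim=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(text_dim + audio_dim + va_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, text_emb, audio_emb, valence_arousal):
        # Concatenate per-modality features, then project to a joint space.
        x = torch.cat([text_emb, audio_emb, valence_arousal], dim=-1)
        return self.fuse(x)


# Usage with random placeholders standing in for the outputs of a sentiment
# analyzer, a speech-emotion recognizer, and a valence-arousal estimator.
if __name__ == "__main__":
    embed = MultiModalEmotionEmbedding()
    e = embed(torch.randn(1, 768), torch.randn(1, 512), torch.randn(1, 2))
    print(e.shape)  # torch.Size([1, 256])
```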