The task of emotion recognition in conversations (ERC) benefits from the availability of multiple modalities, as provided, for example, by the video-based Multimodal EmotionLines Dataset (MELD). However, only a few research approaches use both the acoustic and visual information from the MELD videos. There are two reasons for this: First, label-to-video alignments are noisy, making those videos an unreliable source of emotional...