Fine-grained emotion recognition can model the temporal dynamics of emotions. It is temporally more precise than predicting a single emotion for an entire activity (e.g., watching a video clip). Previous works require large amounts of continuously annotated data to train an accurate model. However, experiments to collect physiological signals are costly and time-consuming. To overcome this challenge, we propose a ...