Instance-level video segmentation requires a solid integration of spatial and temporal information. However, current methods rely mostly on domain-specific information (online learning) to produce accurate instance-level segmentations. We propose a novel approach that relies exclusively on generic spatio-temporal attention cues. Our strategy, named Multi-Attention Instance Network (MAIN), overco...