We propose to explore a new problem called audio-visual segmentation (AVS), in which the goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame. To facilitate this research, we construct the first benchmark of its kind (AVSBench), providing pixel-wise annotations for sounding objects in audible videos. Two settings are studied with this benchmark: 1) semi-supervised AVS with a single sound source and 2) fully-supervised AVS with multiple sound sources.
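To make the task definition concrete, the sketch below illustrates one possible input/output contract for an AVS model: an audio representation for the current clip plus an RGB frame go in, and a per-pixel mask of the sounding object(s) comes out. The names (avs_predict, audio_features) and the feature/frame sizes are illustrative assumptions, not the benchmark's actual API or baseline.

```python
import numpy as np

def avs_predict(audio_features: np.ndarray, frame: np.ndarray) -> np.ndarray:
    """Illustrative AVS interface (assumed, not the paper's model).

    audio_features: 1-D audio embedding for the clip around this frame.
    frame: RGB image of shape (H, W, 3).
    Returns a binary mask of shape (H, W); 1 marks pixels of sounding objects.
    """
    h, w, _ = frame.shape
    # Placeholder prediction: a real model would fuse audio and visual
    # features and decode them into a segmentation map.
    return np.zeros((h, w), dtype=np.uint8)

# Toy usage with a hypothetical 224x224 frame and 128-dim audio embedding.
frame = np.zeros((224, 224, 3), dtype=np.uint8)
audio = np.zeros(128, dtype=np.float32)
mask = avs_predict(audio, frame)
print(mask.shape)  # (224, 224), one label per pixel
```

Because the benchmark provides pixel-wise ground-truth masks, predictions in either setting can be scored with standard region metrics such as intersection-over-union between the predicted and annotated masks.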