Localizing sound sources in a visual scene has many important applications and quite few traditional or learning-based methods have been proposed for this task. Humans the ability to roughly localize within beyond range of vision using their binaural system. However most existing use monaural audio, instead as modality help localization. In addition, prior works usually form object-level boundi...