Image captioning technology has great potential in many scenarios. However, current text-based image captioning methods cannot be applied to approximately half of the world's languages because these languages lack a written form. To solve this problem, the image-to-speech task was recently proposed, which generates spoken descriptions of images, bypassing any text, via an intermediate representation consisting of phonem...