Generating descriptions for videos has many applications including assisting blind people and human-robot interaction. The recent advances in image captioning as well as the release of large-scale movie description datasets such as MPII-MD [28] and M-VAD [31] allow to study this task in more depth. Many of the proposed methods for image captioning rely on pre-trained object classifier CNNs and ...