In this paper, we propose XGPT, a new method of Cross-modal Generative Pre-Training for Image Captioning that is designed to pre-train text-to-image caption generators through three novel generation tasks, including Image-conditioned Masked Language Modeling (IMLM), Image-conditioned Denoising Autoencoding (IDA), and Text-conditioned Image Feature Generation (TIFG). As a result, the pre-trained XGPT ca...