Alignment between image and text has shown promising improvements on patch-level pre-trained document models. However, investigating more effective or finer-grained alignment techniques during pre-training requires a large amount of computation cost and time. Thus, a question naturally arises: Could we fine-tune the models to adapt to downstream tasks with alignment objectives and achieve comparable or better perform...