In this paper, we study tracking by language, which localizes the target box sequence in a video based on a language query. We propose a framework called GTI that decomposes the problem into three sub-tasks: Grounding, Tracking, and Integration. The three sub-task modules operate simultaneously to predict the box sequence frame-by-frame. “Grounding” predicts the referred region directly from the language query. “Tracking” localizes the target based on the history of grounded regions in previous frames. ...