Supplementary material: Spatio-temporal Person Retrieval via Natural Language Queries

Authors

  • Masataka Yamaguchi
  • Kuniaki Saito
  • Yoshitaka Ushiku
  • Tatsuya Harada
Abstract

In this section, we provide further details of the dataset statistics.

The description length. We first analyze the description length (i.e., the number of words in a description). Figure 1 shows the distribution of the number of words in a description. We can see that our dataset contains descriptions of various lengths. The average description length in our dataset is 13.1 words. Table 1 compares the average description length of our dataset with those of other datasets. ReferIt [7] and Google RefExp [12] are datasets of referring expressions, each of which is true of only a single region in an image. The descriptions in VisualGenome [10], MSCOCO [2] and MSR-VTT [20] focus on regions in images, whole images and videos, respectively. Even though a description in our dataset focuses on a single person, the average description length of our dataset is larger than not only those of the datasets whose descriptions focus on regions in images, but also those of the datasets whose descriptions focus on whole images or videos. This implies that the descriptions in our dataset tend to contain more detailed information than those in other datasets.

The number of annotated people in a clip. Figure 2 shows the distribution of the number of people who are annotated with bounding boxes and descriptions in a single clip. While many clips contain only one annotated person, some clips contain multiple annotated people.

The number of occurrences of each high-frequency word. Figure 4 shows the number of occurrences of the most frequently occurring words (stop words are excluded). We can see that the high-frequency words include various types of words, such as colors, actions, clothes and places. Figure 5 compares the frequencies of the words in Figure 4 between our dataset and VisualGenome. While the frequencies of words describing colors (e.g., black, white, blue and red) and people (e.g., man, woman, girl and boy) in our dataset are close to those in VisualGenome, the ...
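The statistics described above (description length in words, its average and distribution, and the most frequent words with stop words excluded) could be computed along the following lines. This is only a minimal sketch, not the authors' procedure: the tokenizer, the `STOP_WORDS` set, the function names and the toy descriptions are all assumptions made for illustration, and the source does not state which tokenization or stop-word list was used.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the source does not say which list was used.
STOP_WORDS = {"a", "an", "the", "is", "in", "on", "and", "of", "with", "to"}


def tokenize(description):
    """Lowercase a description and split it into alphabetic tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z]+", description.lower())


def description_lengths(descriptions):
    """Number of words in each description."""
    return [len(tokenize(d)) for d in descriptions]


def average_length(descriptions):
    """Mean number of words per description (0.0 for an empty list)."""
    lengths = description_lengths(descriptions)
    return sum(lengths) / len(lengths) if lengths else 0.0


def length_histogram(descriptions):
    """Map each description length to the number of descriptions of that length."""
    return Counter(description_lengths(descriptions))


def top_words(descriptions, k=10):
    """The k most frequent words, with stop words excluded."""
    counts = Counter(
        word
        for d in descriptions
        for word in tokenize(d)
        if word not in STOP_WORDS
    )
    return counts.most_common(k)


# Toy descriptions invented for this example; they are not taken from the dataset.
toy = [
    "A man in a black shirt is walking on the street",
    "A woman with a red bag",
    "The man in black is running",
]

if __name__ == "__main__":
    print(length_histogram(toy))
    print(average_length(toy))
    print(top_words(toy, k=2))
```

Reported averages depend on the tokenization rule (for example, whether hyphenated words count as one word or two), so a comparison across datasets like the one in Table 1 is only meaningful when the same rule is applied to every dataset.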


Similar resources

Contextual Media Retrieval Using Natural Language Queries

The 21st century has seen a rapid increase in the abundance of mobile devices with cameras. This, along with the evolution of digital photography and the internet, has presented mankind with a virtual mine of media content. The increasing number of images and videos rich with metadata (timestamps, GPS location, camera orientation etc.) has the potential to act as a collective memory dispersed i...


Spatio-Temporal Querying of Video Content Using SQL for Quantizable Video Databases

Multimedia database modeling and representation play an important role in the efficient storage and retrieval of multimedia. Modeling semantic video content in a way that enables spatiotemporal queries is one of the challenging tasks. A video is called “quantizable” if the instants of a video are enough for a person to imagine the missing scenes properly. A semantic query for quantizable videos can b...


Natural Language Interface on a Video Data Model

Based on a content-based spatio-temporal video data model, a natural language interface is implemented to query the video data. The queries, which are given as English sentences, are parsed using Link Parser, and the semantic representations of the given queries are extracted from their syntactic structures using information extraction techniques. In the last step, the extracted semantic repres...


Video Question Answering via Hierarchical Spatio-Temporal Attention Networks

Open-ended video question answering is a challenging problem in visual information retrieval, in which a natural language answer is automatically generated from the referenced video content according to the question. However, existing visual question answering work focuses only on static images, and may be ineffective when applied to video question answering due to the lack of modeling the tem...


Query-by-Gaming: Interactive spatio-temporal querying and retrieval using gaming controller

Spatio-temporal querying and retrieval is a challenging task due to the lack of simple user interfaces for building queries, despite the availability of powerful indexing structures and querying languages. In this paper, we propose a Query-by-Gaming scheme for spatio-temporal querying that can benefit from a gaming controller for building queries. By using Query-by-Gaming, we introduce our spatio-te...




Publication date: 2017