Cognitive Vision Needs Attention to Link Sensing with Recognition
Abstract
“Cognitive computer vision is concerned with integration and control of vision systems using explicit but not necessarily symbolic models of context, situation and goal-directed behaviour” (Vernon 2003). This paper discusses one small but critical slice of a cognitive computer vision system, that of visual attention. The presentation begins with a brief discussion of a definition for attention, followed by an enumeration of the different ways in which attention should play a role in computer vision, and in cognitive vision systems in particular. The Selective Tuning Model is then overviewed with an emphasis on the components most relevant for cognitive vision, namely winner-take-all processing, the use of distributed saliency, and feature binding as a link to recognition.

1. Towards a Definition of Attention

What is ‘attention’? Is there a computational justification for attentive selection? The obvious answer, given many times, that the brain is not large enough to process all the incoming stimuli is hardly satisfactory (Tsotsos 1987). This answer is not quantitative and provides no constraints on what processing system might be sufficient. Methods from computational complexity theory have formally proved for the first time that purely data-directed visual search in its most general form is an intractable problem in any realization (Tsotsos 1989). There, it is claimed that visual search is ubiquitous in vision, and thus purely data-directed visual processing is also intractable in general. Those analyses provided important constraints on visual processing mechanisms and led to a specific (not necessarily unique or optimal) solution for visual perception. One of those constraints concerned the importance of attentive processing at all stages of analysis: otherwise, the combinatorics of search are too large at each stage.
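To give a rough feel for the combinatorics mentioned above, the following sketch compares two search-space sizes. The first assumes any non-empty subset of image locations could be the target. The second assumes prior knowledge fixes the target to a square window of known size. Both counting rules and both function names are illustrative assumptions, not figures or definitions taken from the cited analyses.

```python
def unbounded_candidates(num_pixels: int) -> int:
    """Count non-empty pixel subsets (assumed model of unguided search:
    every subset of locations is a potential target)."""
    return 2 ** num_pixels - 1


def windowed_candidates(width: int, height: int, win: int) -> int:
    """Count placements of a win x win window in a width x height image
    (assumed model of search bounded by knowledge of the target size)."""
    if win > width or win > height:
        return 0
    return (width - win + 1) * (height - win + 1)


if __name__ == "__main__":
    # Even for a tiny 8 x 8 image the unguided count has 20 decimal digits.
    print(len(str(unbounded_candidates(8 * 8))))
    print(windowed_candidates(8, 8, 3))
```

The point of the comparison is only qualitative: one count grows exponentially with image size, the other polynomially.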
Attentive selection based on task knowledge turns out to be a powerful heuristic to limit search and make the overall problem tractable (Tsotsos 1990). This conclusion leads to the following view of attention: attention is a set of strategies that attempts to reduce the computational cost of the search processes inherent in visual perception. It thus plays a role in all aspects of vision.

Many (the active/animate vision researchers) seem to claim that attention and eye movements are one and the same; certainly none of the biological scientists working on this problem would agree. That one can attend to particular locations in the visual field without eye movements has been known since Helmholtz (1924), but eye movements require visual attention to precede them to their goal (Hoffman 1998 surveys relevant experimental work). Both selection goals are needed, corresponding to the overt and covert attentional fixations described in the perception literature. Active vision, as it has been proposed and used in computer vision, necessarily includes attention as a sub-problem.

3. Attention in Computer Vision

What is it about attention that makes it one of the easiest topics to neglect in computer vision? The task of tracking, or active control of fixation, requires as a first step the detection of the target or focus of attention. How would one go about solving this? With no task knowledge and in a purely data-directed manner, this sub-task of target detection is NP-complete, which makes it appear as if these authors are attempting to solve a problem that includes known intractable sub-problems. What conclusions can be drawn from such proposals? Is the problem thought to be irrelevant or is it somehow assumed away?
Those who build complete vision application systems invoke attentional mechanisms because they must confront and defeat the computational load in order to achieve the goal of real-time processing (there are many examples, two of them being Baluja & Pomerleau 1997 and Dickmanns 1992). But the mainstream of computer vision does not give attentive processes, especially task-directed attention, much consideration. A spectrum of problems requiring attention has appeared (Tsotsos 1992): selection of objects, events, tasks relevant for domain, selection of world model, selection of visual field, selection of detailed sub-regions for analysis, selection of spatial and feature dimensions of interest, selection of operating parameters for low level operations. Take a look at this list and note how most research makes assumptions that reduce or eliminate the need for attention:

• Fixed camera systems negate the need for selection of visual field
• Pre-segmentation eliminates the need to select a region of interest
• 'Clean' backgrounds ameliorate the segmentation problem
• Assumptions about relevant features and the ranges of their values reduce their search ranges
• Knowledge of task domain negates the need to search a stored set of all domains
• Knowledge of which objects appear in scenes negates the need to search a stored set of all objects
• Knowledge of which events are of interest negates the need to search a stored set of all events

The point is that the extent of the search space is seriously reduced before the visual processing takes place, and most often even before the algorithms for solution are designed! However, it is clear that in everyday vision, and certainly in order to understand vision, these assumptions cannot be made. More importantly, the need for attention is broader than simply vision as the above list shows. It touches on the relevant aspects of visual reasoning, recognition, and visual context.
As such, cognitive vision systems should not include these sorts of assumptions and must provide mechanisms that can deal with the realities inherent in real vision.

5. The Selective Tuning Model of Visual Attention

The modeling effort described herein features a theoretical foundation of provable properties based in the theory of computational complexity (Tsotsos 1987, 1989, 1990, 1992). The ‘first principles’ arise because vision is formulated as a search problem (given a specific input, what is the subset of neurons that best represent the content of the image?) and complexity theory is concerned with the cost of achieving solutions to such problems. This foundation suggests a specific biologically plausible architecture as well as its processing stages, as will be briefly described in this article (a more detailed account can be found in Tsotsos 1990 and Tsotsos et al. 1995).

5.1. The Model

The visual processing architecture is pyramidal in structure, with units within this network receiving both feed-forward and feedback connections. When a stimulus is presented to the input layer of the pyramid, it activates in a feed-forward manner all of the units within the pyramid with receptive fields (RFs) mapping to the stimulus location; the result is a diverging cone of activity within the processing pyramid. It is assumed that the response strength of units in the network is a measure of goodness-of-match of the stimulus within the receptive field to the model that determines the selectivity of that unit.

Selection relies on a hierarchy of winner-take-all (WTA) processes. WTA is a parallel algorithm for finding the maximum value in a set. First, a WTA process operates across the entire visual field at the top layer, where it computes the global winner, i.e., the units with the largest response (see Section 4.3 for details). The fact that the first competition is a global one is critical to the method because otherwise no proof could be provided of its convergence properties.
The WTA can accept guidance to favor areas or stimulus qualities if that guidance is available, but operates independently otherwise. The search process then proceeds to the lower levels by activating a hierarchy of WTA processes. The global winner activates a WTA that operates only over its direct inputs to select the strongest responding region within its receptive field. Next, all of the connections in the visual pyramid that do not contribute to the winner are pruned (inhibited). The top layer is not inhibited by this mechanism. However, as a result, the input to the higher-level unit changes and thus its output changes. This refinement of unit responses is an important consequence because one of the main goals of attention is to reduce or eliminate signal interference (Tsotsos 1990). By the end of this refinement process, the output of the attended units at the top layer will be the same as if the attended stimulus appeared on a blank field.

This strategy of finding the winners within successively smaller receptive fields, layer by layer, in the pyramid and then pruning away irrelevant connections through inhibition is applied recursively through the pyramid. The end result is that, starting from a globally strongest response, the cause of that largest response is localized in the sensory field at the earliest levels. The paths remaining may be considered the pass zone of the attended stimulus, while the pruned paths form the inhibitory zone of an attentional beam. The WTA does not violate biological connectivity or relative timing constraints. Figure 1 gives a pictorial representation of this attentional beam.
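The top-down selection described above can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the Selective Tuning Model's actual implementation: the pyramid is one-dimensional, receptive fields are non-overlapping windows of `fan_in` units, each unit's response is the plain sum of its inputs, and the WTA is written as a sequential argmax rather than the parallel process the model specifies. All function names are hypothetical.

```python
from typing import List, Tuple


def build_pyramid(stimulus: List[float], fan_in: int, levels: int) -> List[List[float]]:
    """Feed-forward pass. Assumption: each unit sums `fan_in` adjacent,
    non-overlapping units of the layer below."""
    layers = [list(stimulus)]
    for _ in range(levels):
        below = layers[-1]
        if len(below) % fan_in != 0:
            raise ValueError("layer size must be divisible by fan_in")
        layers.append([sum(below[i:i + fan_in]) for i in range(0, len(below), fan_in)])
    return layers


def wta(values: List[float], candidates: range) -> int:
    """Winner-take-all, simplified here to an argmax over candidate indices."""
    return max(candidates, key=lambda i: values[i])


def attend(layers: List[List[float]], fan_in: int) -> Tuple[List[int], float]:
    """Global WTA at the top layer, then a WTA over each winner's direct inputs.

    Returns the winning index per layer (top layer first) and the refined
    top-layer response once all non-winning connections are pruned.
    """
    top = layers[-1]
    winner = wta(top, range(len(top)))          # global competition
    path = [winner]
    for level in range(len(layers) - 2, -1, -1):
        first = winner * fan_in                 # start of the winner's receptive field
        winner = wta(layers[level], range(first, first + fan_in))
        path.append(winner)
    # With sum pooling and a single surviving input per unit, the pruned
    # top-level response reduces to the attended input value.
    refined = layers[0][path[-1]]
    return path, refined


if __name__ == "__main__":
    layers = build_pyramid([1, 2, 9, 3, 1, 1, 2, 1], fan_in=2, levels=2)
    path, refined = attend(layers, fan_in=2)
    print(path, refined)
```

In this toy setting the surviving indices in `path` play the role of the pass zone, and every other connection belongs to the inhibitory zone. The refined response equals what the same top-layer unit would produce if the attended input were presented alone on an all-zero field, which mirrors the blank-field property stated above.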
Publication date: 2006