An object recognition framework using contextual interactions among objects
Item Usage Stats
Object recognition is one of the fundamental tasks in computer vision. The main endeavor in object recognition research is to devise techniques that make computers understand what they see as precise as human beings. The state of the art recognition methods utilize low-level image features (color, texture, etc.), interest points/regions, filter responses, etc. to find and identify objects in the scene. Although these work well for specific object classes, the results are not satisfactory enough to accept these techniques as universal solutions. Thus, the current trend is to make use of the context embedded in the scene. Context defines the rules for object - object and object - scene interactions. A scene configuration generated by some object recognizers can sometimes be inconsistent with the scene context. For example, observing a car in a kitchen is not likely in terms of the kitchen context. In this case, knowledge of kitchen can be used to correct this inconsistent recognition. Motivated by the benefits of contextual information, we introduce an object recognition framework that utilizes contextual interactions between individually detected objects to improve the overall recognition performance. Our first contribution arises in the object detector design. We define three methods for object detection. Two of these methods, shape based and pixel classification based object detection, mainly use the techniques presented in the literature. However, we also describe another method called surface orientation based object detection. The goal of this novel detection technique is to find objects whose shape, color and texture features are not discriminative while their surface orientations (horizontality or verticality) are consistent across different instances. Wall, table top, and road are typical examples for such objects. The second contribution is a probabilistic contextual interaction model for objects based on their spatial relationships. In order to represent the spatial relationships between objects, we propose three features that encode the relative position/location, scale and orientation of a given object pair. Using these features and our object interaction likelihood model, we achieve to encode the semantic, spatial, and pose context of a scene concurrently. Our third main contribution is a contextual agreement maximization framework that assigns final labels to the detected objects by maximizing a scene probability function that is defined jointly using both the individual object labels and their pairwise contextual interactions. The most consistent scene configuration is obtained by solving the maximization problem using linear optimization. We performed experiments on the LabelMe  and Bilkent data sets by both utilizing and not utilizing the scene type (indoor or outdoor) information. While the average F2 score increased from 0.09 to 0.20 without the scene type assumption, it increased from 0.17 to 0.25 when the scene type is known on the LabelMe dataset. The results are similar for the experiments performed on the Bilkent data set. F2 score increased from 0.16 to 0.36 when the scene type information is not available and it increased from 0.31 to 0.44 when this additional information is used. It is clear that the incorporation of the contextual interactions improves the overall recognition performance.