Browsing by Author "Forsyth, D."
Now showing 1 - 2 of 2
Item (Open Access)
Searching video for complex activities with finite state models (IEEE, 2007-06)
İkizler, Nazlı; Forsyth, D.
We describe a method of representing human activities that allows a collection of motions to be queried without examples, using a simple and effective query language. Our approach is based on units of activity at segments of the body that can be composed across space and across the body to produce complex queries. The presence of search units is inferred automatically by tracking the body, lifting the tracks to 3D, and comparing them to models trained using motion capture data. We show results for a large range of queries applied to a collection of complex motions and activities. Our models of short-time-scale limb behaviour are built using a labelled motion capture set. We compare with discriminative methods applied to tracker data; our method offers significantly improved performance. We show experimental evidence that our method is robust to view direction and is unaffected by changes of clothing. © 2007 IEEE.

Item (Open Access)
Translating images to words for recognizing objects in large image and video collections (Springer, 2006)
Duygulu, P.; Baştan, M.; Forsyth, D.
We present a new approach to the object recognition problem, motivated by the recent availability of large annotated image and video collections. This approach considers object recognition as the translation of visual elements to words, similar to the translation of text from one language to another. The visual elements, represented in feature space, are categorized into a finite set of blobs. The correspondences between the blobs and the words are learned using a method adapted from Statistical Machine Translation. Once learned, these correspondences can be used to predict words corresponding to particular image regions (region naming), to predict words associated with entire images (auto-annotation), or to associate speech transcript text with the correct video frames (video alignment). We present our results on the Corel data set, which consists of annotated images, and on the TRECVID 2004 data set, which consists of video frames associated with speech transcript text and manual annotations.
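The first item composes limb-level activity units into finite-state queries over tracked video. The following is a minimal sketch of that idea only, not the authors' implementation: the limb names, action labels, query encoding, and toy frame sequence are assumptions made for illustration, standing in for labels that tracking and 3D lifting would produce.

```python
# Sketch: matching a finite-state activity query against per-frame limb labels.
# A query is an ordered list of states; each state is a set of (limb, action)
# units that must hold simultaneously in at least one frame, in query order.

def query_matches(frames, states):
    """Return True if the frame sequence satisfies each query state in order.

    frames: list of sets of (limb, action) labels, one set per video frame
    states: list of sets of (limb, action) units required by each query state
    """
    i = 0  # index of the next query state waiting to be satisfied
    for frame_labels in frames:
        if i < len(states) and states[i] <= frame_labels:
            i += 1  # this frame satisfies the current state; advance the machine
    return i == len(states)


# Toy query, composed across the body: walk, then stand while reaching.
query = [
    {("legs", "walk")},
    {("legs", "stand"), ("arms", "reach")},
]

# Hypothetical per-frame limb labels such as a tracker might infer.
frames = [
    {("legs", "walk"), ("arms", "swing")},
    {("legs", "walk"), ("arms", "swing")},
    {("legs", "stand"), ("arms", "reach")},
]

print(query_matches(frames, query))  # True: both states are met in order
```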
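The second item learns blob-to-word correspondences with a method adapted from Statistical Machine Translation. A minimal sketch of that idea, under the assumption of an IBM Model 1 style EM estimator over co-occurring blobs and annotation words, is shown below; the blob ids, words, and tiny corpus are invented for the example and are not the paper's data or code.

```python
# Sketch: EM estimation of p(word | blob) from images annotated with keywords,
# in the spirit of IBM Model 1. The learned table can then name regions by
# picking the most probable word for each blob.

from collections import defaultdict


def train_translation_table(corpus, iterations=20):
    """corpus: list of (blobs, words) pairs, one per annotated image.
    Returns t[word][blob] = p(word | blob), estimated by EM."""
    blobs = {b for bs, _ in corpus for b in bs}
    words = {w for _, ws in corpus for w in ws}
    # Uniform initialisation of the translation table.
    t = {w: {b: 1.0 / len(words) for b in blobs} for w in words}
    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))  # expected co-occurrence counts
        total = defaultdict(float)                       # expected counts per blob
        for bs, ws in corpus:
            for w in ws:
                norm = sum(t[w][b] for b in bs)          # soft alignment mass for w
                for b in bs:
                    c = t[w][b] / norm
                    count[w][b] += c
                    total[b] += c
        # M-step: renormalise expected counts per blob.
        for w in words:
            for b in blobs:
                t[w][b] = count[w][b] / total[b] if total[b] > 0 else 0.0
    return t


# Two toy "annotated images": blob 1 co-occurs with "tiger" in both.
corpus = [([1, 2], ["tiger", "grass"]), ([1, 3], ["tiger", "sky"])]
t = train_translation_table(corpus)
# Region naming: the most probable word for blob 1 comes out as "tiger".
print(max(t, key=lambda w: t[w][1]))
```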