1. Introduction

It is anticipated that the digital libraries, World Wide Web contributions, and other information stores of the future will provide diverse collections of data. The amount of multimedia information in these collections is expected to increase substantially as the cost of storage space decreases. This growth in the number of multimedia "documents" will necessitate the development of systems that allow users to display, search, and browse these data formats effectively. Given these expectations, research on methods of representing and displaying video data for browsing and retrieval is needed. Specifically, the primary goal of the research presented in this paper is to facilitate the design of interfaces for rapid screening of digitized video data.

Many mechanisms for representing video data have been proposed (Elliot, 1993; Davis, 1994; O'Connor, 1985). Key frame surrogates consist of a set of salient still images extracted from a full-length digital video. Video key frames are analogous to the "abstracts" created for textual documents (O'Connor, 1991). In this experiment, key frames are extracted using a technique that automatically segments video clips according to changes in scene. It is maintained that these key frame surrogates characterize the content of videos (Kobla, 1996). The key frames can be viewed by scrolling through the still pictures, as in a set of "flip" cards. Key frames that are "flipped through" conserve screen space; however, the scrolling speed must suit human perception. Ding (1996) concluded that eight frames per second (fps) was an acceptable speed for viewing sets of scrolling key frames. She found that the human ability to recognize objects and comprehend the video began to decrease at speeds faster than 8 fps. The current investigation uses video key frame surrogates that are scrolled through to produce a "flip book" effect.
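
As an illustration only, the following Python sketch shows one generic way to segment a clip at scene changes and to schedule the resulting key frames for flip-book display. It is not the extraction technique used in this experiment: the histogram-difference measure, the `threshold` value, the choice of the first frame of each shot as its key frame, and all function names are assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def normalize(hist: Sequence[float]) -> List[float]:
    """Scale a histogram so its bins sum to 1 (left unchanged if empty/zero)."""
    total = sum(hist)
    return [h / total for h in hist] if total else list(hist)


def histogram_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Half the sum of absolute bin differences; lies in [0, 1] for normalized input."""
    return 0.5 * sum(abs(x - y) for x, y in zip(a, b))


def select_key_frames(histograms: Sequence[Sequence[float]],
                      threshold: float = 0.5) -> List[int]:
    """Return the index of the first frame of each detected shot.

    A cut is assumed wherever consecutive frames differ by more than
    `threshold` (a hypothetical value, not taken from the paper).
    """
    if not histograms:
        return []
    key_frames = [0]
    previous = normalize(histograms[0])
    for index in range(1, len(histograms)):
        current = normalize(histograms[index])
        if histogram_distance(previous, current) > threshold:
            key_frames.append(index)
        previous = current
    return key_frames


def flip_schedule(key_frames: Sequence[int], fps: float) -> List[Tuple[float, int]]:
    """Pair each key frame with its display time in seconds at a given speed."""
    return [(position / fps, frame) for position, frame in enumerate(key_frames)]


# Toy clip: five frames described by three-bin intensity histograms.
toy_clip = [[10, 0, 0], [9, 1, 0], [0, 10, 0], [0, 9, 1], [0, 0, 10]]
key_frames = select_key_frames(toy_clip)   # [0, 2, 4]
schedule = flip_schedule(key_frames, fps=8.0)
```

At 8 fps each key frame in this toy clip stays on screen for 0.125 seconds; passing `fps=1.0` instead gives each frame a full second.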

One tool for creating and displaying video surrogates has been developed at the MIT Media Laboratory (Elliot, 1993; Davis, 1994). The Video Streamer enables users to create multi-layered annotations of streams of video data. The tool is used to mark a selection of video footage that the user will annotate. The idea behind the Video Streamer is that although some aspects of video can be parsed automatically, more detailed representations of video content require user annotations. The video clips produced by the Video Streamer are depicted as three-dimensional video blocks formed by stacking the pictures. The block representation reveals the temporal attributes of the stream. Users may also be able to comprehend actions that take place in the clip from the edges of the images visible on the sides of the block. The Streamer also includes a utility that automatically recognizes cuts between shots. Using this feature, the user can save the top frame of each shot in the stream to create an overview, which can be displayed either as a storyboard or as a QuickTime movie. Systems such as the Video Streamer allow users to manipulate video data for editing and storage purposes, and these techniques make the clips easier to browse and retrieve. However, because humans are involved in selecting the frames, a "Video Streamer" type of representation may or may not make the content of the full-length video easier to comprehend. Our study used an automated procedure for key frame extraction. Although the key frames were selected from the computer-generated output in order to eliminate duplicate pictures and "fuzzy" images, they derive primarily from an automated procedure rather than from human selection of "shots". Computer-generated extraction is used in our experiment so that the results generalize to computerized extractions.
This is because surrogates for large archival databases of video data will most likely be generated by automated procedures. Although it is outside the scope of our experiment, the Video Streamer system provides useful ideas that may contribute to the design of future systems for browsing and retrieving video data. These include methods for representing the length of the entire video and for allowing user annotations of video surrogates.

The Ding (1996) study provided exploratory data on the human ability to perceive scrolling key frame surrogates at specific speeds. That study found that a speed of 1 fps allowed the best user performance and that 12 fps could be a "speed breakpoint". It was postulated that beyond this speed, object identification performance will remain poor. Another interesting finding was that object identification tasks required slower speeds than the comprehension tasks. This suggests that higher speeds can be used when understanding of the content is desired, while lower speeds are necessary for identifying individual objects. It is possible, then, that speeds greater than 12 fps may be used for video content comprehension. These data begin to answer questions about the types of tools that need to be built to support browsing and retrieval from large archives of video information. The present study asks a specific question about limits on the number of video surrogates presented simultaneously. From the data described in this study, we attempt to add to the list of elements required in systems for video browsing.

The design of systems for browsing video data should focus on human perceptual and cognitive abilities rather than on system characteristics. However, psychological theory lacks a standard model of visual processing. The available literature reviews report conflicting data, indicating that there are multiple points along the visual pathway where filtering, semantic identification, and visual identification may occur (LeDoux and Hirst, 1988; Wickens, 1992).

The experiment presented in this article examines human cognitive abilities for simultaneously previewing several sets of key frames. Eighteen key frames are shown at a constant speed (1 fps) to users, who are exposed to one, two, three, or four sets of key frames shown concurrently. The speed of one frame per second is used for two reasons. First, this speed produced the best user performance in the Ding (1996) study of display speed. Second, in the condition with four simultaneous videos, users view four videos at 1 fps each, which may be perceived as viewing at approximately 4 fps altogether. It is hypothesized that, at comparable speeds, increasing the number of simultaneous displays past a certain threshold will result in information overload and, therefore, a disruption in users' gist comprehension and object identification for the video content.
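
The arithmetic behind the four viewing conditions can be made explicit with a short Python sketch. It assumes that the eighteen key frames apply to each set (the text does not say so explicitly) and that the frame rates of concurrent sets simply add; the class and function names are hypothetical and serve only this example.

```python
from dataclasses import dataclass
from typing import List

FRAMES_PER_SET = 18   # from the text; assumed here to apply to each set
FPS_PER_SET = 1.0     # constant display speed used in the experiment


@dataclass(frozen=True)
class Condition:
    """One viewing condition: a number of key frame sets shown concurrently."""
    simultaneous_sets: int
    frames_per_set: int = FRAMES_PER_SET
    fps_per_set: float = FPS_PER_SET

    @property
    def duration_seconds(self) -> float:
        # Every set plays in parallel, so duration does not grow with set count.
        return self.frames_per_set / self.fps_per_set

    @property
    def aggregate_fps(self) -> float:
        # Assumption: the perceived rate is the sum of the per-set rates.
        return self.simultaneous_sets * self.fps_per_set

    @property
    def total_frames(self) -> int:
        return self.simultaneous_sets * self.frames_per_set


def all_conditions() -> List[Condition]:
    """The one-, two-, three-, and four-set conditions described in the text."""
    return [Condition(n) for n in (1, 2, 3, 4)]


for condition in all_conditions():
    print(condition.simultaneous_sets,
          condition.duration_seconds,
          condition.aggregate_fps,
          condition.total_frames)
```

Under these assumptions every condition lasts 18 seconds, while the number of frames the viewer must process rises from 18 to 72 and the aggregate rate from 1 fps to 4 fps.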

The fundamental problem with simultaneous presentation of these key frame sets is that humans have a limited ability to divide attention among stimuli, all of which need to be processed. It is often difficult to maintain several things in working memory. These limits of divided attention correspond to our limited cognitive ability to time-share performance between two tasks.

There are two theories of divided attention: single resource theory and multiple resource theory. Single resource theory (Kahneman, 1973) proposes that there is a single undifferentiated pool of resources available to all tasks and mental activities. As task demands increase, through either harder tasks or more tasks, the supply of resources increases until the increase is insufficient to compensate for the demand, at which point performance declines. Multiple resource theory (Wickens, 1992) states that instead of one pool of resources, there are several different dichotomous dimensions of resources. These dimensions are stages, perceptual modalities, and processing codes. The perceptual modality dimension is of primary interest for this experiment. In this dimension, dual tasks that are cross-modal, that is, split between two different modalities (auditory and visual), produce better performance than tasks that are intra-modal, that is, split within one modality (two visual inputs) (Wickens, Sandry and Vidulich, 1983). However, it is uncertain whether this advantage is due to the cross-modal nature of the tasks or whether peripheral factors disadvantage the intra-modal tasks. Our experiment is intra-modal, so placing the video clip screens far apart may cause difficulty because of the additional visual scanning between them. For this reason, the screens are placed beside each other, forming a square, to allow for easier visual scanning. One possible drawback is that screens placed too close together may cause confusion. In either scenario, given the intra-modal nature of the task and the limits on human resources for dividing attention, the theory leads us to believe that users will have difficulty dividing attention among the simultaneous video displays. What we expect to learn is the point at which resources are no longer available for viewing multiple video screens.
It is hoped that this experiment will not only shed light on optimal system design for browsing video data but also elucidate the psychological aspects.

