Currently, information retrieval systems generally use text-based keywords or standard field searches (e.g., title, director, genre, and actors/actresses) regardless of data type. This is reasonable for small collections of videos and videos with clearly defined, well-known attributes such as genre (e.g., Westerns). However, as more videos are produced and used for purposes other than entertainment (e.g., research, sonograms, documentaries, teleconference calls, or lectures), finding videos relevant to users' needs based on text-based attributes becomes more problematic.
A system-based approach is to index video segments on visual properties such as color, motion, shapes, or brightness data. Algorithms are used to differentiate video segments for indexing based on differences detected in these properties. Users will still need to be able to describe to the system what exactly they are looking for. For example, IBM's Query by Image Content (QBIC) system allows users to create visual queries by drawing the shape of target objects or identifying colors in desired scenes. Other systems such as Carnegie Mellon University's Informedia Project take advantage of non-visual features such as speech recognition of dialog or closed caption text to help identify useful video segments.
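As an illustration of the property-based segmentation described above, the following minimal sketch flags a new segment whenever the brightness histogram changes sharply between consecutive frames. It is not the method of QBIC or Informedia. The frame format (a flat list of 0-255 brightness values), the function names, the bin count, and the threshold are all assumptions made for the example.

```python
from typing import List, Sequence


def histogram(frame: Sequence[int], bins: int = 4, max_value: int = 256) -> List[float]:
    """Normalized brightness histogram of one frame.

    A frame is assumed to be a non-empty flat sequence of 0-255 pixel values.
    """
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // max_value] += 1
    total = len(frame)
    return [count / total for count in counts]


def segment_boundaries(frames: Sequence[Sequence[int]],
                       threshold: float = 0.5,
                       bins: int = 4) -> List[int]:
    """Return each index i at which frame i is taken to start a new segment."""
    if not frames:
        return []
    boundaries = []
    previous = histogram(frames[0], bins)
    for i in range(1, len(frames)):
        current = histogram(frames[i], bins)
        # L1 distance between consecutive histograms (between 0 and 2).
        distance = sum(abs(a - b) for a, b in zip(previous, current))
        if distance > threshold:
            boundaries.append(i)
        previous = current
    return boundaries


# Three dark frames followed by two bright ones (hypothetical data).
dark = [10, 20, 30, 40]
bright = [200, 210, 220, 250]
print(segment_boundaries([dark, dark, dark, bright, bright]))
```

A real system would compare color, motion, or shape features as well; a single brightness histogram is used here only to keep the sketch short.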
A complementary user-based approach, taken in this study, lets the user select the video or segment of interest based on direct inspection. Users have different criteria for determining relevance for their specific needs, and not all of these can be expressed explicitly. By presenting different visual summaries or representations, a system lets users browse a large number of video documents and find those that match their needs directly. [See Plaisant et al. (1995) for a taxonomy of image browsers and different visual tasks.] The problem addressed in this experiment is whether different types of video summaries (storyboard and slide show) are useful for different types of tasks (gist determination and object recognition).

A number of different video surrogates have been proposed. Some are based on direct extraction of stills from the video stream without any modifications. For example, O'Connor (1991) described a technique for automating the extraction of key frames, individual frames taken directly from video which are representative of important events in the video. Theoretically, users would only need to view a limited number of key frames--rather than the 30 frames per second required for full-motion video--to learn about the content of the full video. O'Connor termed these video representations "contour maps" and compared them to reading maps to learn about the terrain of a particular area without actually having to be there. Elliot (1993) took another approach with the Video Streamer. Rather than selecting only a few important stills, images taken from a video are given finite width and stacked on top of each other. This creates a three-dimensional "block" that can be used to identify scene changes and motion along the "edges."
Another type of video representation requires the generation of "new" stills that are composites or collages of images from different scenes. Because these representations are synthesized from discontinuous pieces of video, they cannot actually be found in the original. For example, Yow et al. (1995) created "panoramic reconstructions" to summarize highlights of a soccer match. The effect is that of a multiply exposed image--objects can be viewed in various physical locations and their motion estimated. Another technique uses optical flow computations, in which vectors are calculated and used to represent the motion of objects. Teodosio and Bender (1993) used this technique to create salient stills, images formed by selectively representing objects in motion while keeping the background constant.
While collages and individual key frames can provide some useful information about videos, "higher order" structures such as the storyboard layout emphasize temporal relationships between frames. The storyboard design displays images side-by-side in temporal order, like a film strip. This allows for browsing of key frames like reading a comic strip--each subsequent frame provides the next major event. Viewers mentally fill in the motion and events between frames. Another way to organize key frames is hierarchically. Zhang, Low, and Smoliar (1995) used the storyboard concept but added a "resolution" dimension. At the root level, a single key frame represents the entire video. At lower levels, greater numbers of key frames are revealed. All key frames are presented in storyboard fashion at the lowest level. Thus, a viewer can quickly "zoom in" to view specific key frames by following a particular branch in the hierarchy. The main advantage of this representation is that the key frames themselves serve as indices, while screen real estate is saved by presenting only the stills from the part of the video the user is interested in. Yeung et al. (1995) used a hierarchical scene transition model. Nodes are created based on overall similarity between shots. A single key frame is used to represent each node. Edges drawn between nodes represent temporal relationships. Collections of key frames at each node are inspected storyboard style.
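The hierarchical organization described above can be sketched as a tree in which each node stands for a run of key frames and is displayed as one representative still. This is only an illustration of the "resolution" idea under stated assumptions, not the algorithm of Zhang, Low, and Smoliar (1995): the `Node` class, the even splitting, the `fanout` parameter, and the choice of the first key frame as representative are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    representative: int                 # key frame displayed for this node
    frames: List[int]                   # all key frames covered by this node
    children: List["Node"] = field(default_factory=list)


def build_hierarchy(key_frames: List[int], fanout: int = 2) -> Node:
    """Recursively split a temporally ordered key frame list into a tree."""
    if not key_frames:
        raise ValueError("at least one key frame is required")
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    node = Node(representative=key_frames[0], frames=list(key_frames))
    if len(key_frames) > 1:
        size = -(-len(key_frames) // fanout)  # ceiling division
        for start in range(0, len(key_frames), size):
            node.children.append(build_hierarchy(key_frames[start:start + size], fanout))
    return node


def frames_at_depth(root: Node, depth: int) -> List[int]:
    """Key frames shown when the whole tree is expanded to the given depth."""
    if depth == 0 or not root.children:
        return [root.representative]
    shown: List[int] = []
    for child in root.children:
        shown.extend(frames_at_depth(child, depth - 1))
    return shown


# Hypothetical key frame positions (frame numbers) for one video.
tree = build_hierarchy([0, 120, 300, 450, 610, 800, 930, 1100])
for depth in range(4):
    print(depth, frames_at_depth(tree, depth))
```

At depth 0 a single still stands for the whole video, and each further level doubles the number of stills until every key frame is visible. A browser would normally expand only the branch the viewer selects rather than a whole level.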
The representations discussed thus far are based on still images. These can be collectively called "static" displays. Representations that are in motion or "dynamic" have also been proposed. For example, Wactlar et al. (1996) devised a video skimming technique. Scene significance is determined by using a combination of data streams such as scene changes and breaks and audio level. The skim itself consists of playing frames immediately around a previously identified "important" video event at full-motion speeds (e.g., 30 frames per second). Video skims hence use very short video clips to represent longer scenes.
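The assembly step of a skim, playing the frames immediately around each important event, can be sketched as follows. The sketch assumes the important events have already been identified and are given as frame numbers; how significance is computed is not modeled. The function names, the window size, and the merging of overlapping clips are assumptions for illustration, not details of the published technique.

```python
from typing import List, Sequence, Tuple


def skim_ranges(event_frames: Sequence[int],
                total_frames: int,
                window: int = 15) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) frame ranges to play around each event.

    Ranges are clipped to the video and merged when they overlap or touch.
    """
    ranges: List[Tuple[int, int]] = []
    for event in sorted(event_frames):
        start = max(0, event - window)
        end = min(total_frames - 1, event + window)
        if ranges and start <= ranges[-1][1] + 1:
            # Overlaps or touches the previous clip: extend it.
            ranges[-1] = (ranges[-1][0], max(ranges[-1][1], end))
        else:
            ranges.append((start, end))
    return ranges


def skim_duration_seconds(ranges: Sequence[Tuple[int, int]], fps: float = 30.0) -> float:
    """Playing time of the skim at full-motion speed."""
    return sum(end - start + 1 for start, end in ranges) / fps


# Hypothetical events in a 1000-frame video.
clips = skim_ranges([100, 110, 900], total_frames=1000, window=15)
print(clips, skim_duration_seconds(clips))
```

In this hypothetical case the skim plays 72 of 1000 frames, which illustrates how a few short clips can stand in for much longer scenes.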
Many different and innovative ways to abstract or represent video have been proposed and devised. In theory, each of these techniques saves user time and effort by providing data in highly compact and abbreviated formats. Since each representation should require only a fraction of the time to view, as compared to the full video, many more videos can be considered within the same unit time. In addition, because relevant information is exposed by the representation, users should require little time to decide whether the video needs to be examined in more detail or can be immediately rejected as irrelevant.
This experiment is an attempt to begin the process of systematically dissecting the various factors that might be involved in effective video browsing. Ideally, each of the proposed representations can be compared under controlled conditions with a variety of users and a variety of video requirements. Then each factor could be varied and its effects measured. The result would be a user-task-surrogate table or model that could be used to identify optimal video search conditions.
In this experiment, we compare user performance in completing two simple tasks (gist determination and object recognition) using two video representation designs (storyboard and slide show).
The tasks -- gist determination (GD) and object recognition (OR) -- were selected because they represent two distinctly different user needs. GD represents tasks in which users are trying to learn what the video is about -- objectively and rapidly (e.g., a video abstract). GD users want to obtain an overview of the story line without having to watch the entire video or clip. Their goal is either (1) to determine the relevance of the video to their needs or (2) to learn something about the topic discussed in the video. For example, a television news producer may want to find a clip about the formation of hail to go along with a story that is airing in 20 minutes about a hail storm that caused massive damage in Boulder. She will need to browse through the station's video library to find relevant archival footage. Thus, GD is a goal-oriented task--to learn about the content of the video. OR, on the other hand, is task-oriented. An example of an OR task is finding footage of Apple Macintosh computers being used in an educational environment. The user doesn't need to form an overall understanding of the video; he just needs to identify Macs and students.
The different interface design types were also selected based on their representation of distinct categories of video surrogates. The storyboard (SB) is a static display that relies on users scanning strips of images. The slide show (SS) is dynamic and doesn't require as much visual scanning: users can fix their eyes on a general location where images are displayed. Also, the SS design is much closer, conceptually, to video as moving images. This may help users relate the temporal sequence of images better.
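The structural difference between the two designs can be shown with a small sketch: a storyboard arranges all key frames in space at once, while a slide show arranges them in time at a single screen location. The row length and display interval below are hypothetical values chosen for the example, not the settings used in this experiment.

```python
from typing import List, Sequence, Tuple


def storyboard_rows(key_frames: Sequence[str], per_row: int = 4) -> List[List[str]]:
    """Lay key frames out in temporal order as rows of a film-strip-like grid."""
    return [list(key_frames[i:i + per_row]) for i in range(0, len(key_frames), per_row)]


def slide_show_schedule(key_frames: Sequence[str],
                        seconds_per_frame: float = 0.5) -> List[Tuple[float, str]]:
    """Pair each key frame with the time at which it replaces the previous one."""
    return [(index * seconds_per_frame, frame) for index, frame in enumerate(key_frames)]


# Hypothetical key frame identifiers.
frames = ["f0", "f1", "f2", "f3", "f4", "f5"]
print(storyboard_rows(frames))
print(slide_show_schedule(frames))
```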