1. Introduction

Digital video is becoming more important as network bandwidth and processing power increase and costs decrease. Applications such as digital libraries, medicine, education, video conferencing, and video-on-demand are becoming more common. Consequently, there is a need for efficient video retrieval and management systems.

Currently, information retrieval systems generally use text-based keywords or standard field searches (e.g., title, director, genre, and actors/actresses) regardless of data type. This is reasonable for small collections of videos and videos with clearly defined, well-known attributes such as genre (e.g., Westerns). However, as more videos are produced and used for purposes other than entertainment (e.g., research, sonograms, documentaries, teleconference calls, or lectures), finding videos relevant to users' needs based on text-based attributes becomes more problematic.

A system-based approach is to index video segments on visual properties such as color, motion, shapes, or brightness data. Algorithms are used to differentiate video segments for indexing based on differences detected in these properties. Users will still need to be able to describe to the system what exactly they are looking for. For example, IBM's Query by Image Content (QBIC) system allows users to create visual queries by drawing the shape of target objects or identifying colors in desired scenes. Other systems such as Carnegie Mellon University's Informedia Project take advantage of non-visual features such as speech recognition of dialog or closed caption text to help identify useful video segments.
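As an illustration of how differences in a visual property can be used to separate video segments, the following is a minimal sketch that compares brightness histograms of consecutive frames and marks a boundary wherever the difference exceeds a threshold. It is not the method used by QBIC or Informedia. The function names, the bin count, the threshold value, and the representation of a frame as a flat list of 0-255 intensities are all assumptions made for this example.

```python
from typing import List, Sequence


def brightness_histogram(frame: Sequence[int], bins: int = 8) -> List[float]:
    """Return a normalized histogram of 0-255 pixel intensities for one frame."""
    counts = [0] * bins
    for value in frame:
        # Map an intensity in 0..255 onto a bin index in 0..bins-1.
        counts[min(value * bins // 256, bins - 1)] += 1
    total = len(frame) or 1  # avoid division by zero for an empty frame
    return [count / total for count in counts]


def histogram_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute bin differences (0.0 = identical, 2.0 = disjoint)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def find_segment_boundaries(frames: Sequence[Sequence[int]],
                            threshold: float = 0.5) -> List[int]:
    """Return indices of frames that start a new segment.

    The threshold is a hypothetical value chosen for illustration only.
    """
    boundaries: List[int] = []
    previous = None
    for index, frame in enumerate(frames):
        current = brightness_histogram(frame)
        if previous is not None and histogram_distance(previous, current) > threshold:
            boundaries.append(index)
        previous = current
    return boundaries


# Toy example: three dark frames, two bright frames, one dark frame.
dark = [10, 20, 15, 12]
bright = [240, 250, 245, 230]
print(find_segment_boundaries([dark, dark, dark, bright, bright, dark]))  # [3, 5]
```

A real system would likely combine several properties (color, motion, shape) and tune the threshold per collection; the single-property version here only shows the general idea of indexing on detected differences.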

A complementary user-based approach, taken in this study, lets the user select the video or segment of interest based on direct inspection. Users have different criteria for determining relevance for their specific needs, and not all of these can be expressed explicitly. By presenting different visual summaries or representations, a system can let users browse a large number of video documents and find those that match their needs directly. [See Plaisant et al. (1995) for a taxonomy of image browsers and different visual tasks.] The problem addressed in this experiment is whether different types of video summaries (storyboard and slide show) are useful for different types of tasks (gist determination and object recognition).


1.1. Theory and Review of the Literature: Video Surrogates

Humans and machines have different capabilities. Ideally, user interfaces are designed to take advantage of the strengths of each. Machines are good at repetitive tasks, whereas humans excel at making judgements and planning complex actions (Shneiderman, 1998). In visual searching tasks, humans are much better at finding patterns, recognizing objects, generalizing or inferring information from limited data, and making relevance decisions (Helander, 1988). Machines are much more efficient at measuring and detecting discrete changes in physical properties, organizing and storing large amounts of data, and creating large numbers of video representations. The experiment conducted here assumes that machines will be used to organize and manipulate large amounts of digital video and narrow the number of possible "hits" in response to a user's query for a particular video. The user will be presented with these selections in some rank-ordered manner to browse and decide which ones are interesting enough to pursue further. Thus, the power of the human visual and decision-making systems is combined with the speed and accuracy of computer systems in carrying out large-scale, repetitive actions, allowing users to browse or visualize surrogates representing large numbers of videos. Surrogates are entities that represent a full document. For example, in text-based documents, a title, abstract, table of contents, or card catalog entry may serve as a surrogate; each can be browsed more rapidly than the full document while providing information about the document itself.

A number of different video surrogates have been proposed. Some are based on direct extraction of stills from the video stream without any modifications. For example, O'Connor (1991) described a technique for automating the extraction of key frames, individual frames taken directly from video which are representative of important events in the video. Theoretically, users would only need to view a limited number of key frames--rather than the 30 frames per second required for full-motion video--to learn about the content of the full video. O'Connor termed these video representations "contour maps" and compared them to reading maps to learn about the terrain of a particular area without actually having to be there. Elliot (1993) took another approach with the Video Streamer. Rather than selecting only a few important stills, images taken from a video are given finite width and stacked on top of each other. This creates a three-dimensional "block" that can be used to identify scene changes and motion along the "edges."

Another type of video representation requires the generation of "new" stills that are composites or a collage of images from different scenes. Thus, because these representations are synthesized from discontinuous pieces of video, they cannot actually be found in the original. For example, Yow et al. (1995) created "panoramic reconstructions" to summarize highlights of a soccer match. The effect is that of a multiply exposed image--objects can be viewed in various physical locations and their motion estimated. Another technique uses optical flow computations. Vectors are calculated and used to represent the motion of objects. Teodosio and Bender (1993) used this technique to create salient stills, images formed selectively representing objects in motion while keeping the background constant.

While collages and individual key frames can provide some useful information about videos, "higher order" structures such as the storyboard layout emphasize temporal relationships between frames. The storyboard design displays images side-by-side in temporal order, like a film strip. This allows for browsing of key frames like reading a comic strip--each subsequent frame provides the next major event. Viewers mentally fill in the motion and events between frames. Another way to organize key frames is hierarchically. Zhang, Low, and Smoliar (1995) used the storyboard concept but added an additional "resolution" dimension. At the root level, a single key frame represents the entire video. At lower levels, greater numbers of key frames are revealed. All key frames are presented in storyboard fashion at the lowest level. Thus, a viewer can quickly "zoom in" to view specific key frames by following a particular branch in the hierarchy. The main advantage of this representation is that the key frames themselves serve as indices, while screen real estate is saved by presenting only the stills from the part of the video the user is interested in. Yeung et al. (1995) used a hierarchical scene transition model. Nodes are created based on overall similarity between shots. A single key frame is used to represent each node. Edges drawn between nodes represent temporal relationships. Collections of key frames at each node are inspected storyboard style.
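The hierarchical organization described above (one key frame at the root, more key frames at each lower level, and a "zoom in" along one branch) can be sketched as a simple tree. This is only an illustrative data structure under stated assumptions, not the published implementation: the class and function names, the choice of the middle key frame as each node's representative, and the fixed branching factor are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class KeyFrameNode:
    representative: int                      # frame number shown for this node
    frames: List[int]                        # all key frames covered by this node
    children: List["KeyFrameNode"] = field(default_factory=list)


def build_hierarchy(key_frames: List[int], branching: int = 2) -> KeyFrameNode:
    """Recursively split temporally ordered key frames into a browsing tree."""
    if not key_frames:
        raise ValueError("at least one key frame is required")
    if branching < 2:
        raise ValueError("branching must be at least 2")
    # Assumption: the middle key frame stands in for the whole span.
    node = KeyFrameNode(representative=key_frames[len(key_frames) // 2],
                        frames=list(key_frames))
    if len(key_frames) > 1:
        size = -(-len(key_frames) // branching)  # ceiling division
        for start in range(0, len(key_frames), size):
            node.children.append(
                build_hierarchy(key_frames[start:start + size], branching))
    return node


def storyboard_at(root: KeyFrameNode, path: List[int]) -> List[int]:
    """Follow child indices in `path` and return the key frames shown there."""
    node = root
    for child_index in path:
        node = node.children[child_index]
    # A leaf has no children, so it displays only its own key frame.
    return [child.representative for child in node.children] or [node.representative]


key_frames = [0, 120, 300, 450, 610, 900, 1200, 1500]
root = build_hierarchy(key_frames)
print(root.representative)           # 610
print(storyboard_at(root, []))       # [300, 1200]
print(storyboard_at(root, [1, 0]))   # [610, 900]
```

Only the stills along the chosen branch are displayed at any time, which is the screen-space saving the text attributes to this representation.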

The representations discussed thus far are based on still images. These can be collectively called "static" displays. Representations that are in motion or "dynamic" have also been proposed. For example, Wactlar et al. (1996) devised a video skimming technique. Scene significance is determined by using a combination of data streams such as scene changes and breaks and audio level. The skim itself consists of playing frames immediately around a previously identified "important" video event at full-motion speeds (e.g., 30 frames per second). Video skims hence use very short video clips to represent longer scenes.
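To make the idea of a skim concrete, the sketch below takes frame indices that have already been marked as "important" and returns the short frame ranges that would be played around them. It does not attempt the significance detection itself, and it is not the published technique: the function names and the window size of 45 frames on either side are assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def skim_ranges(event_frames: Sequence[int], total_frames: int,
                half_window: int = 45) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) frame ranges surrounding each event.

    Overlapping or adjacent ranges are merged so no frame is played twice.
    `half_window` is a hypothetical setting, not a value from the literature.
    """
    ranges: List[Tuple[int, int]] = []
    for event in sorted(event_frames):
        start = max(0, event - half_window)
        end = min(total_frames - 1, event + half_window)
        if ranges and start <= ranges[-1][1] + 1:
            # Extend the previous range instead of starting a new one.
            ranges[-1] = (ranges[-1][0], max(ranges[-1][1], end))
        else:
            ranges.append((start, end))
    return ranges


def skim_seconds(ranges: Sequence[Tuple[int, int]], fps: float = 30.0) -> float:
    """Playback time of the skim at full-motion speed."""
    return sum(end - start + 1 for start, end in ranges) / fps


# A 60-second clip (1800 frames at 30 fps) with three important events.
ranges = skim_ranges([100, 130, 900], total_frames=1800)
print(ranges)                          # [(55, 175), (855, 945)]
print(round(skim_seconds(ranges), 2))  # 7.07
```

In this toy case the skim plays about 7 seconds of a 60-second clip, which illustrates how very short clips can stand in for longer scenes.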

Many different and innovative ways to abstract or represent video have been proposed and devised. In theory, each of these techniques saves user time and effort by providing data in highly compact and abbreviated formats. Since each representation should require only a fraction of the time to view, as compared to the full video, many more videos can be considered within the same unit time. In addition, because relevant information is exposed by the representation, users should require little time to decide whether the video needs to be examined in more detail or can be immediately rejected as irrelevant.


1.2. Statement of the Problem

In spite of the number and variety of different theoretical video abstract constructs, there has been surprisingly little experimental research on how effective each of these representations really is in solving user needs. Very few controlled experiments have been conducted to compare the various techniques proposed in the literature. For example, there are no data on which representation is most effective for a particular user need. In fact, it is not known what user capabilities or needs are important factors in determining the usefulness of browsing video abstracts. To our knowledge, there is, at present, no theory or model of information seeking for video.

This experiment is an attempt to begin the process of systematically dissecting the various factors that might be involved in effective video browsing. Ideally, each of the proposed representations can be compared under controlled conditions with a variety of users and a variety of video requirements. Then each factor could be varied and its effects measured. The result would be a user-task-surrogate table or model that could be used to identify optimal video search conditions.

In this experiment, we compare user performance in completing two simple tasks (gist determination and object recognition) using two video representation designs (storyboard and slide show).

The tasks -- gist determination (GD) and object recognition (OR) -- were selected because they represent two distinctly different user needs. GD represents tasks in which users are trying to learn what the video is about -- objectively and rapidly (e.g., a video abstract). GD users want to obtain an overview of the story line without having to watch the entire video or clip. Their goal is either (1) to determine the relevance of the video to their needs or (2) to learn something about the topic discussed in the video. For example, a television news producer may want to find a clip about the formation of hail to go along with a story that is airing in 20 minutes about a hail storm that caused massive damage in Boulder. She will need to browse through the station's video library to find relevant archival footage. Thus, GD is a goal-oriented task--to learn about the content of the video. OR, on the other hand, is task-oriented. An example of an OR task is finding footage of Apple Macintosh computers being used in an educational environment. The user doesn't need to form an overall understanding of the video; he just needs to identify Macs and students.

The different interface design types were also selected based on their representation of distinct categories of video surrogates. The storyboard (SB) is a static display that relies on users scanning strips of images. The slide show (SS) is dynamic and doesn't require as much visual scanning: users can fix their eyes on a general location where images are displayed. Also, the SS design is much closer, conceptually, to video as moving images. This may help users relate the temporal sequence of images better.


1.3. Related Research

A number of researchers have been studying video indexing, retrieval, and abstracting. Prototype systems not already mentioned include:

Commercial video information retrieval systems include: