The primary goal of this research is to explore an aspect of video surrogate display that may be useful for browsing video data. Users who need to quickly locate a particular video clip among the many clips in a digital library may be served by an interface that displays several of these surrogates simultaneously.
Research Question 1.
Before implementing an interface for browsing video data that allows simultaneous display of videos, one must first understand human limitations and abilities to process the information. The first research question addresses this: we wish to determine the threshold for human performance when one to four video clips are displayed at once. There are four conditions in this experiment, and the condition number corresponds to the number of video clips that a subject views simultaneously. Our hypotheses state the expected outcomes.
H1: Subjects in conditions 3 and 4 will identify objects in the video key frames less accurately than subjects in conditions 1 and 2.
H2: Subjects in conditions 3 and 4 will comprehend what is taking place in the videos less well than subjects in conditions 1 and 2. This hypothesis holds for both the sentence writing comprehension task and the multiple choice comprehension task.
These hypotheses reflect the experimenter's belief that subjects in conditions 1 and 2 will understand a clip and recognize its objects better than subjects in conditions 3 and 4, simply because the latter must attend to several videos at once. With multiple screens, attention is divided, making it more difficult to get the gist of the story and to focus on the content of each video.
Research Question 2.
Each subject completed the experiment twice. After the first pass, the subject was allowed to view the videos a second time and then repeat the object recognition and comprehension tasks.
H1: After viewing video key frames a second time, the subjects will improve their accuracy for identifying objects and will have a greater understanding of the gist of the video clip.
Research Question 3.
An evaluation questionnaire was completed by each subject. Each question is rated on a scale from 1 to 7, with descriptive adjectives anchoring the left and right ends of the scale. Note that the "Number" questions are not asked for Condition 1 because only one video is shown. The following questions are asked for the evaluation dependent variables:
About object identification....
"Speed Objects" 1. The speed that the videos were presented was: too slow/too fast
"Number Objects" 2. The number of videos presented was: imperceptible/perceivable
About video comprehension....
"Speed Compreh" 1. The speed that the videos were presented was: too slow/too fast
"Number Compreh" 2. The number of videos presented was: imperceptible/perceivable
H1: For the question "Speed Objects" 1. The speed that the videos were presented was: too slow/too fast, the hypothesis states that condition 1 subjects will rate more towards the "too slow" end of the scale and condition 4 subjects will rate towards "too fast". Subjects in conditions 2 and 3 will rate somewhere in the middle of this scale.
Even though all the video key frames are displayed at the same rate, subjects are expected to perceive the videos as going "too fast" in the conditions with three and four simultaneous videos because they must split their attention between several displays.
H2: For the question "Speed Compreh" 1. The speed that the videos were presented was: too slow/too fast, it is expected that condition 1 will rate more towards the "too slow" end of the scale because those subjects have the easier task of comprehending what is taking place in one video. Condition 4 will rate "too fast" because of the difficulty of comprehending several videos at once. The other conditions will rate somewhere between "too fast" and "too slow".
H3: For the question "Number Objects" 2. The number of videos presented was: imperceptible/perceivable, it is expected that the greater the number of videos presented, the less the subject will be able to identify objects in the videos. Therefore, subjects in the conditions with more videos shown will rate the "number of videos" more towards the "imperceptible" end of the scale.
H4: For the question "Number Compreh" 2. The number of videos presented was: imperceptible/perceivable, it is expected that the greater the number of videos presented, the less the subject will be able to comprehend the gist of the story in the video clips.
A pilot study was conducted, which resulted in several changes to the experimental design before the current study began. In conditions 2-4, the subject must complete object recognition tasks for between two and four separate sets of key frames. In the pilot study, these tasks were given to the subjects one at a time. For example, a subject viewed three videos at once and was then given the object recognition tasks for the three videos one after another. To remove order effects, it was decided that all tasks of the same type (object recognition, comprehension) for a particular condition would be presented at the same time. Subjects in the pilot study then became confused about which video (in the conditions with 2-4 videos) corresponded to the tasks given, so a picture of the screen layout showing the placement of the video in question was added as an indicator.
Twenty-eight undergraduate students (20 males, 8 females) enrolled in introductory psychology at the University of Maryland, College Park participated in this experiment. Subjects took part in the investigation in order to fulfill a course requirement. The subjects chose this experiment voluntarily; it is unknown why so many more males than females signed up. Subjects were required to have prior experience using a computer mouse.
The video key frames used in the experiment were segmented from digitized MPEG video clips. These were extracted from 3-5 minute video clips taken from the following Discovery Channel educational programs: Spirits of Rainforest, Flight Over Equator, Space Shuttle and The Revolutionary War. All the video programs were of the same difficulty level for comprehension of content. They were meant as learning tools for novices and students. Five 3-5 minute video clips were chosen so that each short segment conveyed a meaningful story. Video Clip 1 (Spirits of Rainforest) showed how researchers conduct studies on monkeys in a rain forest. Specifically, it showed how the researchers tagged monkeys for tracking. Video Clip 2 (Flight Over Equator) showed scenes of Singapore's industrialization, its culture and people in daily life. Video Clip 3 (Spirits of Rainforest) showed how a native American tribe makes a living in the jungle. Video Clip 4 (The Revolutionary War) showed enacted scenes of the Battle of Concord during the Revolutionary War. Video Clip 5 (Space Shuttle) was about the Apollo 11 astronauts' training and moon landing activities.
The key frames were created by a color histogram-based segmenting and indexing technique developed at the Center for Automation Research (CFAR) at the University of Maryland. Tests showed that this technique achieved nearly 90% accuracy in key frame extraction, and the resulting key frames have been shown to capture salient objects and overall gist (Kobla et al., 1996). There were five sets of key frames (one for each video clip); each set was composed of eighteen individual key frames.
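The CFAR technique itself is not described in this section. As a rough illustration of the general idea behind histogram-based segmentation, the following Python sketch compares coarse color histograms of consecutive frames and marks a frame as the start of a new shot when the histograms differ sharply. It is a generic sketch, not the method of Kobla et al. (1996): the function names, the frame representation (a list of RGB tuples), the bin count, the 0.5 threshold, and the choice of the first frame of each shot as its key frame are all assumptions made for illustration.

```python
def color_histogram(frame, bins_per_channel=4):
    """Return a normalized coarse RGB histogram.

    frame: non-empty list of (r, g, b) tuples with values in 0-255
    (a hypothetical frame representation used only for this sketch).
    """
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in frame:
        # Map each 0-255 channel value to a bin index in 0..bins_per_channel-1.
        rb = r * bins_per_channel // 256
        gb = g * bins_per_channel // 256
        bb = b * bins_per_channel // 256
        hist[(rb * bins_per_channel + gb) * bins_per_channel + bb] += 1
    n = len(frame)
    return [count / n for count in hist]


def histogram_distance(h1, h2):
    """Half the L1 distance: 0.0 for identical histograms, 1.0 for disjoint ones."""
    return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))


def select_key_frames(frames, threshold=0.5):
    """Return indices of frames whose histogram differs sharply from the previous frame.

    The threshold is an assumed value, not one taken from the study.
    """
    if not frames:
        return []
    key_indices = [0]  # the first frame always starts a shot
    previous = color_histogram(frames[0])
    for i in range(1, len(frames)):
        current = color_histogram(frames[i])
        if histogram_distance(previous, current) > threshold:
            key_indices.append(i)
        previous = current
    return key_indices


# Toy example: two all-red frames, two all-blue frames, one all-red frame.
red = [(255, 0, 0)] * 16
blue = [(0, 0, 255)] * 16
key_frame_indices = select_key_frames([red, red, blue, blue, red])
```

In the toy example, a key frame is selected wherever the dominant color changes. Real video would need a tuned threshold and some handling of gradual transitions, which this sketch ignores.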
The experiment took place in the Academic Information Technology Services (aITs) Teaching Theater on the University of Maryland, College Park campus. The teaching theater is equipped with 25 Gateway(TM) 586 computers running the Windows 95 operating system. Subjects viewed the video key frames using Netscape version 3.0. Monitors used in the experiment displayed 256 colors at a resolution of 800x600 pixels. An html/Javascript file controlled the look of the interface and the rate at which the key frames were shown to the subject (1 fps). The file displaying the video key frames was placed on the hard drive of the machine so that the display rate of the key frames would not be affected by the speed of the server. All other experimental files were placed on a WWW server. Subjects also completed the object recognition, sentence writing, and comprehension tasks and an evaluation of the video display using html/Javascript files in the Netscape version 3.0 browser. The object recognition, sentence writing, comprehension, and evaluation questions are listed in Appendix A.
For the object recognition tasks, there were an equal number of distractor objects and objects that actually appeared in the key frames. The list was developed by equalizing the probability of selecting false positives with the true items listed. In other words, the distractor objects were designed to fit what might be part of a schema for things expected to be in the key frames. For example, although a horse might be expected to be present in a battle during the Revolutionary War, it is never shown in the video key frames presented to the subjects. In the design of the object lists, face validity was an important aspect, and terms were chosen so that they were at the same specificity and difficulty level (Ding, 1996). The sentence writing task simply asked the subjects to write down in one or two sentences what they believed was the gist of the video. A single multiple choice question was presented for each video clip to measure comprehension of the video clip's meaning. The comprehension questions were designed using two principles: the first was to maximize the distinction between choices, and the second was to minimize additional prior knowledge about the videos (Ding, 1996). The evaluation measured the subjects' perception of the speed at which the key frames were displayed and their perception of the number of videos presented at one time. Note that the question regarding the number of videos presented was not asked of subjects in condition 1 because they were shown only one video clip.
The interface used for displaying the video key frames is shown in Appendix A. The place numbers indicate the position of the video clips in the experiment. All subjects were given a practice trial prior to the actual experiment. In the practice trial, the subjects were shown video clip 5 in the upper left corner of the video display interface. The videos were placed in a square-shaped format to consolidate the videos into a single space. This minimized the need for subjects in conditions three and four to move their heads and eyes in order to get an overall view of all the videos at once. The video in the third condition was added to the bottom right-hand corner in anticipation of the fact that subjects would move their eyes in a clockwise motion while viewing the screens, enhancing visual scanning.
Administration
A series of html/Javascript files was used to administer the experiment and gather the data. The subjects were first shown a "Welcome" screen that gave instructions for the experiment. This was read aloud to the subjects so that they understood the directions. Subjects were also told that it was important to complete all tasks in the experiment, including sentence writing. The subjects were randomly assigned to one of four groups (groups 1-4, which correspond to the number of video key frame surrogates presented). The subjects were presented with a practice trial and then completed the practice object recognition and comprehension tasks. The subjects viewed the video surrogates in the experimental condition assigned. The subjects then completed the comprehension and object recognition tasks. The subjects then received a questionnaire to evaluate their satisfaction with the speed of key frame presentation and the number of video displays presented. The subjects then completed the experiment a second time (excluding the practice).
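The random assignment step could be carried out in many ways, and the exact procedure used is not stated. One common approach, sketched below in Python, shuffles the subject list and deals subjects into the four conditions in turn so that group sizes stay equal. The function name, the subject identifiers, the seed, and the assumption of equal group sizes are all hypothetical and are not taken from the study.

```python
import random


def assign_conditions(subject_ids, n_conditions=4, seed=None):
    """Randomly assign subjects to conditions 1..n_conditions with balanced group sizes.

    Balanced groups are an assumption of this sketch; the study only states
    that assignment was random.
    """
    rng = random.Random(seed)  # a local generator keeps the assignment reproducible
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    # Deal the shuffled subjects into conditions in round-robin order.
    return {sid: (i % n_conditions) + 1 for i, sid in enumerate(shuffled)}


# Hypothetical subject identifiers 1..28, matching the number of participants.
assignment = assign_conditions(range(1, 29), seed=0)
group_sizes = [list(assignment.values()).count(c) for c in (1, 2, 3, 4)]
```

With 28 subjects and four conditions, this scheme places seven subjects in each condition.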
Object Recognition
Performance on the object recognition tasks was measured by accuracy scores and accuracy percentage. Accuracy is calculated from two scores. The first is the number of objects that were correctly identified by the subject. The second is the number of distractor objects that were not selected by the subject. These two are added together and divided by 20 (the total number of objects in the list) to obtain the subject's overall accuracy percentage. The number of wrong items checked was also recorded and percentages obtained.
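The scoring rule described above can be sketched in a few lines of Python. This is an illustration, not the scoring script used in the study: the function name, the item labels, and the example responses are hypothetical. It assumes a 20-item checklist made up of 10 objects that appeared in the key frames and 10 distractors, which follows from the equal split of true and distractor items.

```python
def recognition_accuracy(checked, targets, distractors):
    """Score one object recognition checklist.

    checked:     items the subject ticked
    targets:     items that actually appeared in the key frames
    distractors: items that did not appear
    Returns (hits, correct_rejections, wrong_items, accuracy).
    """
    checked, targets, distractors = set(checked), set(targets), set(distractors)
    hits = len(checked & targets)               # true objects correctly identified
    wrong_items = len(checked & distractors)    # distractors wrongly ticked
    correct_rejections = len(distractors) - wrong_items
    total_items = len(targets) + len(distractors)   # 20 in the study
    accuracy = (hits + correct_rejections) / total_items
    return hits, correct_rejections, wrong_items, accuracy


# Hypothetical checklist: 10 true objects and 10 distractors.
targets = [f"object_{i}" for i in range(10)]
distractors = [f"distractor_{i}" for i in range(10)]

# Hypothetical subject who ticks 8 true objects and 3 distractors.
checked = targets[:8] + distractors[:3]
hits, correct_rejections, wrong_items, accuracy = recognition_accuracy(
    checked, targets, distractors
)
# (8 + 7) / 20 = 0.75, i.e. 75% accuracy
```

Counting unselected distractors alongside correctly identified objects means that a subject who ticks every item scores 50%, the same as a subject who ticks nothing, so indiscriminate checking is not rewarded.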
Sentence Comprehension
A content analysis was performed on the sentences provided by the subjects. Key words were derived from the sentences and placed into one of the following categories: People, Objects, Actions/Concepts, and Places.
Multiple Choice Comprehension
For each subject, the answer to each multiple choice question was marked as correct or incorrect. The percentage of correct answers was obtained for conditions 2, 3, and 4.
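As an illustration of this tabulation, the short Python sketch below computes the percentage of correct multiple choice answers per condition from a list of (condition, correct) records. The function name and the example records are hypothetical and are not data from the study.

```python
from collections import defaultdict


def percent_correct_by_condition(responses):
    """Return {condition: percentage of correct answers}.

    responses: iterable of (condition, is_correct) pairs, one per answered question.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for condition, is_correct in responses:
        totals[condition] += 1
        if is_correct:
            correct[condition] += 1
    return {c: 100.0 * correct[c] / totals[c] for c in sorted(totals)}


# Hypothetical records, for illustration only.
example_responses = [
    (2, True), (2, False),
    (3, True), (3, True),
    (4, False), (4, False), (4, True), (4, True),
]
percentages = percent_correct_by_condition(example_responses)
# percentages == {2: 50.0, 3: 100.0, 4: 50.0}
```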