Shore '00: Student HCI Online Research Experiments

University of Maryland


Comparison of Telephone Menu Interfaces

Experiment

Preface

In attempting to attribute the satisfaction of users of two email-by-phone services to specific system characteristics, we must take into account the menu structure of the interactive voice response (IVR) application, the numeric telephone-pad entries required to operate the system, and the acceptability of the synthesized speech.

Determining the depth, breadth, and length of menus has been the subject of many papers and conference presentations [9]. However, specific quantitative guidelines have not been set, and for good reason. Decisions about depth, breadth, and length of menus are heavily application-dependent and almost always involve a number of tradeoffs related to the logical task structure of the application, limitations of human memory, and ease of application navigation. In the early 1990s, many applications failed or irritated callers because of 1) navigation problems created by menus that were too deep; 2) menus that were too broad (creating the perception of wasting the listener's time); and 3) memory burdens due to overly long menus. Those who offered specific guidance [10,11] to help developers avoid the most common errors were criticized, to some extent, for overgeneralizing. Generally, it is recommended that menus consist of four or fewer items.

Other research has attempted to determine the lowest error rate that might reasonably be expected when people use an IVR system and the sources of those errors [12]. The simplest task people perform with the touch-tone pad is to dial a telephone number to place a call. The touch-tone telephone has been widely studied and improved since the earliest human factors design research was conducted [13,14], and most people have had years of experience with it. People should be well past the rapidly changing part of the learning curve, having achieved major gains in speed through practice. Near perfect accuracy might be expected. More complex uses of touch-tone dialing such as IVR applications can only be expected to increase errors or slow performance relative to the simple, well-practiced dialing task.

Finally, the utility of text-to-speech systems depends on two aspects of their performance: intelligibility and naturalness. Evaluation of these aspects of synthetic speech provides important information about the performance of a speech synthesizer in comparison to competing products. It can be important for developers to understand where the relative strengths and weaknesses of a particular synthesizer are so they can assist in development as well as marketing. Diagnostic evaluation can help with the development effort by pinpointing specific problems in synthesis that can be solved by engineering solutions [15].

Traditionally, developers of TTS systems have distinguished between three basic measures of performance: acceptability (or preference), intelligibility, and naturalness [16]. In the context of laboratory evaluations of the quality of speech produced by a TTS system, these measures are intended as an indication of how successful the system will be at producing speech that 1) is generally usable in some overall sense, 2) transmits intended messages clearly and effectively, and 3) sounds like natural human speech when used in its intended application.

Before discussing these measures individually, it is important to consider some of the factors related to their nature and use. Measures of acceptability, intelligibility, and naturalness are often used in setting performance benchmarks for development or for selection of a specific TTS system for a particular application. However, this can dangerously oversimplify the evaluation process. Rather than simply accepting these three measures as external, absolute scales for setting performance standards, it is important to understand what they actually measure, and how different components of a TTS system interact with listeners and task demands to affect performance on these measures [17,18].

Although they are often treated as measuring distinct aspects of a synthesizer, these qualities are not completely independent of one another. For example, since acceptability is a global measure of overall quality of speech, it depends on all the more specific perceptual qualities of the speech. Thus, if intelligibility or naturalness is low, then acceptability is not likely to be very high. Similarly, in many cases speech with low naturalness is not as intelligible as more natural-sounding speech, while speech that is low in intelligibility is almost invariably perceived as unnatural. This correlation of naturalness and intelligibility has generally been assumed to be the case for all speech.

Besides being interdependent, these three measures are all complex, in the sense that they simultaneously reflect diverse characteristics of the synthesizer, linguistic properties of the text being spoken, and the momentary and characteristic states and properties of the human listener [15]. Because they represent the interaction of many different aspects of the synthesis process, the underlying factors which affect acceptability, naturalness and intelligibility are not all objectively recoverable from the speech signal alone.

Furthermore, because laboratory testing conditions can seldom reflect all relevant aspects of actual application conditions perfectly, test results cannot be taken as absolute measures. That is, just because speech produced by a particular TTS system is identified 92% correctly on a particular test of intelligibility does not mean that people using that synthesizer in its intended context will always recognize 92% of the words it produces. When comparing two synthesizers, it is often most informative to simply compare the rank ordering of each synthesizer relative to the other, and to that of a natural human voice presented in the same testing conditions.

Hypothesis

If two systems are comparable in most respects, the most important factor in user satisfaction is the time needed to complete the task, or even just the user's sense of completing the task more quickly (hence the need for better menu and TTS design to reduce errors and frustration while increasing message comprehension).

Keeping the above in mind, we believe that Shoutmail, while it may not differ from Coolemail by a statistically significant amount in running time, will be rated much more favorably by users in subjective satisfaction.

Independent variables:   Email Systems
      Treatments:   Coolemail.com and Shoutmail.com

Dependent variables:
      Running time
      User Satisfaction
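
Because each subject used both services, running time can be compared within subjects. The following is a minimal sketch of a paired t statistic for such a comparison. It assumes a paired analysis, which this section does not state was the test actually used; the function name `paired_t` and the sample values are illustrative only and are not data from this experiment.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(times_a, times_b):
    """Return (mean difference, t statistic, degrees of freedom)
    for two equal-length lists of per-subject running times."""
    if len(times_a) != len(times_b) or len(times_a) < 2:
        raise ValueError("need two equal-length lists with at least 2 subjects")
    # One difference per subject: time on system A minus time on system B.
    diffs = [a - b for a, b in zip(times_a, times_b)]
    mean_diff = mean(diffs)
    # Standard error of the mean difference (sample standard deviation).
    # Note: identical differences give a zero standard error and an error here.
    std_err = stdev(diffs) / sqrt(len(diffs))
    return mean_diff, mean_diff / std_err, len(diffs) - 1


if __name__ == "__main__":
    # Made-up placeholder values in seconds, NOT results from the study.
    system_a = [10, 12, 14]
    system_b = [8, 9, 13]
    print(paired_t(system_a, system_b))
```

The resulting t value would be compared against a t distribution with the returned degrees of freedom to judge significance.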

Subjects

Twenty-three subjects representing typical young telephone users were recruited for the experiment by offering extra credit points, with the cooperation of the instructor, to students in a University of Maryland Computer Science class. Four of the subjects were dropped from the experiment because of incomplete data and other factors, leaving nineteen subjects (10 male, 9 female). The subjects represented an age range of 18 to 24, a variety of racial and ethnic backgrounds, and a college educational level. The students were unpaid for their participation.

Materials

A digital telephone set was connected to a double-deck tape recorder with a recording adapter purchased from Radio Shack. The adapter was connected between the coiled cable of the handset and the telephone unit. A 1/8-inch audio output plug was then taken from the adapter and connected to the microphone input of the tape recorder. It is in this manner that any sound samples of the subjects completing their tasks were obtained.

The telephone base contained a keypad with numbers arranged in a standard 3 x 4 layout and was located approximately 50 cm from the subjects. The written instructions, printed in 12-point font, were also located 50 cm from the subjects.

Procedure

At the beginning of the session, the experimenter provided the introductory instructions outlining the procedure to be followed. To help motivate the subjects to a fast solution, instructions were provided in the form of a real-life scenario:

You're on your way to an interview and thought you could remember how to get to the company - you were wrong. The company had included directions to their location in an email to you approximately a week ago with a subject line that was something like: "interview confirmation"...

Your task is to find those directions as quickly as possible on your cell phone before you miss an exit and are late to this important interview. Luckily you find the information you need to connect to the phone-accessible email service.

The subjects were then provided with the phone numbers for Shoutmail and Coolemail along with user IDs and passwords for the accounts they were to access. Timing began after subjects dialed the number for the service and ended when they had listened to the part of the message giving directions in its entirety. At the end of the session, subjects were asked to fill out a questionnaire (see the Appendix).

In an attempt to maintain an even balance of test cases, five males and females were tested in Shoutmail-Coolemail order while five males and females were tested in Coolemail-Shoutmail order.
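
One way to produce this kind of balance is to alternate the two presentation orders within each gender group as subjects arrive. The sketch below illustrates that idea only; the function name `assign_orders` and the subject IDs are hypothetical, and the source does not describe how the assignment was actually carried out.

```python
from itertools import cycle

# The two presentation orders used in the experiment.
ORDERS = (
    ("Shoutmail", "Coolemail"),
    ("Coolemail", "Shoutmail"),
)


def assign_orders(subjects):
    """Map each subject ID to a presentation order.

    subjects: iterable of (subject_id, gender) pairs in arrival order.
    Orders alternate separately within each gender, so each gender
    ends up split as evenly as possible between the two orders.
    """
    rotation_by_gender = {}
    plan = {}
    for subject_id, gender in subjects:
        # Each gender gets its own independent alternation of orders.
        rotation = rotation_by_gender.setdefault(gender, cycle(ORDERS))
        plan[subject_id] = next(rotation)
    return plan


if __name__ == "__main__":
    # Hypothetical subject IDs for illustration.
    roster = [("s01", "M"), ("s02", "F"), ("s03", "M"), ("s04", "F")]
    for subject_id, order in assign_orders(roster).items():
        print(subject_id, " then ".join(order))
```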

Pilot Study Results

We found that subjects in our pilot studies had difficulty entering the numbers needed to gain access to their email. When the accounts were first set up, user IDs and passwords were both created using the template "cmsc434##", where ## was a number ranging from 00 to 19. Translating "cmsc" to its keypad equivalent, "2672", proved too frustrating and time-consuming for impatient subjects. Furthermore, this impatience fueled our belief that differences in quality between systems could be leveled if the lower-quality system proved to be faster. Following suggestions made by the aforementioned research into low-error IVR system design, we made all user IDs and passwords of the form "1121314151", "2232425262", and so on. This way, subjects were able to detect the pattern, which translated easily to the telephone keypad, and move on to the task of retrieving email.
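
The letter-to-digit translation that subjects had to perform mentally can be sketched as follows. This assumes the standard telephone keypad letter layout (2 = ABC through 9 = WXYZ); the function name `to_keypad` is illustrative.

```python
# Standard telephone keypad letter groups (assumed layout).
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}

# Invert the table so each letter maps to the key it is printed on.
LETTER_TO_DIGIT = {
    letter: digit for digit, letters in KEYPAD.items() for letter in letters
}


def to_keypad(text):
    """Translate letters to keypad digits; other characters pass through."""
    return "".join(LETTER_TO_DIGIT.get(ch.lower(), ch) for ch in text)


if __name__ == "__main__":
    print(to_keypad("cmsc"))       # 2672
    print(to_keypad("cmsc43407"))  # 267243407
```

An all-digit ID such as "1121314151" passes through this mapping unchanged, which is why the revised IDs required no mental translation.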

Another problem that arose during pilot testing was one of communication between the experimenters and subjects. We found that trying to verbally relate the task to users was useless, as many failed to pay attention. By providing written instructions supplemented by verbal clarification, subjects were more likely to begin the experiment without questions or doubts. Furthermore, the need arose to specifically limit the task to finding the specified email and listening only to the part of the email giving directions. Some subjects insisted on listening to entire emails that were irrelevant to the task. Others became interested in all the available features of a system and decided to try sending voicemails.

In the interest of extracting cleaner and more informative data from our subjects, we shortened our questionnaire. We decided that since subjects are impatient, a shorter questionnaire would invite more hand-written comments.

Problems during the experiment

Several problems arose during the experiment. The first subject was uncertain as to how the first task ended and continued listening to the email, trying to remember the exact directions being read. We believed, initially, that our pilot results had given an incorrect representation of the average time it would take to complete the task. After discussing this with the subject, we decided to explain the tasks differently, telling subjects that they could end the call simply by hanging up the phone.

Another problem was system-related. It seemed that some of the accounts had not received the emails that were forwarded to them. The experiment had to be postponed for a short period while we re-sent email to the accounts. As a result, the subject who had used the empty account had to be excluded from the experiment results.


Department of Computer Science: Direct questions and comments to the student editorial team
