Results

Statistical Treatment of Results

The null hypothesis, H0, is that there is no difference between the scores on the AMLCD and the CRT; i.e., the difference between search times for each subject will have zero mean. The alternative hypothesis, H1, is that the scores for the AMLCD are significantly better or worse than those for the CRT.

We measured search times under a variety of conditions and computed a mean for each case. Before we can test the significance of the difference in means, we must determine whether the variances of these samples are homogeneous. The null hypothesis for this question is that each sample variance is drawn from the same population. If the Analysis of Variance (ANOVA) accepts this null hypothesis, we may proceed to an analysis of the difference between means using Student’s t test. Despite our attempt to minimize the cognitive workload of the subjects’ tasks, variation between subjects may still have inflated the standard deviations obtained. We attempt to reduce this effect by using a paired t test analysis.

The results of the ANOVA for the means are given in Table 1.

Table 1. ANOVA results

                  Sum of squares   Degrees of freedom   Mean square
within samples           5766.71                44.00        131.06
between samples            20.16                 3.00          6.72
total                    5786.87                47.00        123.12

The result of the F test, F(3,44) = 0.05, is much less than the tabular value of 2.83 at the 0.05 significance level. This result allows us to accept the ANOVA null hypothesis for the data obtained.
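The arithmetic behind this F ratio can be checked directly from the sums of squares and degrees of freedom in Table 1. The following is a minimal sketch using only those published figures; the variable names are ours, and the critical value is the tabular value quoted in the text rather than one computed here.

```python
# Sums of squares and degrees of freedom as reported in Table 1.
ss_within, df_within = 5766.71, 44
ss_between, df_between = 20.16, 3

# Mean square = sum of squares / degrees of freedom.
ms_within = ss_within / df_within
ms_between = ss_between / df_between

# F ratio = between-samples mean square / within-samples mean square.
f_ratio = ms_between / ms_within

# Tabular critical value at the 0.05 level, as quoted in the text.
F_CRITICAL = 2.83
reject_null = f_ratio > F_CRITICAL

print(f"MS within  = {ms_within:.2f}")
print(f"MS between = {ms_between:.2f}")
print(f"F({df_between},{df_within}) = {f_ratio:.2f}")
print("reject ANOVA null hypothesis:", reject_null)
```

Because the between-samples mean square is far smaller than the within-samples mean square, the ratio falls well below the critical value.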

Proceeding now to the analysis of the difference in means, the mean search times (T) for all subjects on the LCD and CRT displays are given in Table 2.

The difference may be expressed in terms of standard deviations. Because the ANOVA allows the standard deviation to be represented by a single homogeneous value, we represent the deviation of the total population by an average of our measurements, and we use this averaged standard deviation below.

The probability that these values (D above) could come from a normal distribution with mean zero and standard deviation equal to the averaged standard deviation is given in Table 3.

John Tukey (1953) defines an “Honest Statistical Difference” as a difference in means greater than half a standard deviation at the 0.05 level. In terms of the averaged standard deviation, we have

None of these results meet Tukey’s definition of an “Honest Statistical Difference”.

In attempting to meet Tukey’s definition of an “Honest Statistical Difference”, a paired t test analysis can be used. Since each subject performed a similar task on both types of display and in both the static and dynamic cases, we may be able to use the distribution of the individual differences in search times to eliminate variation among subjects in technique, time of day, etc. This analysis yielded the following means and standard deviations:

Note that the D’s are not necessarily the same as those determined using the standard t test. The standard test treats the entire population (all the subjects) as a whole, whereas the paired t test analysis treats each subject individually. Because not all subjects completed all the experimental conditions, what was formerly an absence of data in one cell now invalidates the data in neighboring cells.
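The paired analysis described above can be sketched as follows. This is an illustration under stated assumptions, not the computation used for the tables: the function name `paired_summary` is ours, the search times are hypothetical values rather than data from this experiment, and a subject with a missing condition is simply dropped from the pairing, which is one way of handling the empty cells mentioned above.

```python
import math
from statistics import mean, stdev

def paired_summary(x, y):
    """Return (mean difference, sd of differences, t statistic, degrees of freedom).

    x and y hold one search time per subject for the two conditions.
    A subject with a missing value (None) in either list is excluded.
    """
    diffs = [a - b for a, b in zip(x, y) if a is not None and b is not None]
    n = len(diffs)
    d_bar = mean(diffs)
    s = stdev(diffs)                 # sample standard deviation (n - 1 divisor)
    t = d_bar / (s / math.sqrt(n))   # paired t statistic
    return d_bar, s, t, n - 1

# Hypothetical search times in seconds (NOT data from this study).
lcd = [41.2, 38.5, 52.0, None, 35.9, 44.1]
crt = [39.8, 40.1, 49.5, 48.0, 33.2, 45.6]

d_bar, s, t, df = paired_summary(lcd, crt)
print(f"mean difference = {d_bar:.2f} s, s = {s:.2f} s")
print(f"difference in standard deviation units = {d_bar / s:.2f}")
print(f"t({df}) = {t:.2f}")
```

Dividing the mean difference by s expresses it in standard deviation units, the form in which a half standard deviation criterion is stated.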

The probability that the mean differences in Tables 5.1 and 5.2 could come from a normal distribution with mean zero and standard deviation equal to the s values in Tables 5.1 and 5.2 is given in Table 6.

Using the calculated standard deviation (s) for each case, the results of the paired t test analysis are given in Table 7.

The difference in means between the static and dynamic cases on the LCD display is -0.82 sLCD, which meets Tukey’s criterion for an “Honest Statistical Difference”. This result suggests that performance on the LCD decreased in the dynamic case. Since we saw no similar decrease in performance for the CRT, one possible explanation is that the test confirms the observation of Holzel regarding the quality of changing (“video”) data on the LCD.

We can ask how large a sample size would be necessary to reduce the standard deviation to the point where this difference would be statistically significant. Then we could confidently accept or reject the null hypothesis.

This sample size can be determined by using the following formula

Runyon (1984)

where µa and µb are the normal variables (z scores) corresponding to a and b (the probabilities of falsely rejecting and falsely accepting the null hypothesis, respectively), and D is the difference to be detected, expressed in standard deviation units.

If a and b are both set at 0.05 and we wish to detect a difference between the means of half a standard deviation, then

and hence at least 87 subjects should be tested. Requiring such a large subject pool could be very expensive and quite unattainable under time constraints.
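The figure of 87 subjects can be reproduced with a short calculation. The sketch below assumes one common form of the sample-size formula, n = 2(µa + µb)^2 / D^2 with one-tailed z scores; this form is our assumption, chosen because it is consistent with the definitions and the result given here, and the function name `required_subjects` is ours.

```python
import math
from statistics import NormalDist

def required_subjects(a, b, d):
    """Sample size for detecting a difference of d standard deviations.

    Assumed form: n = 2 * (z_a + z_b)**2 / d**2, with one-tailed z scores.
    """
    z_a = NormalDist().inv_cdf(1 - a)   # z score for a = 0.05 is about 1.645
    z_b = NormalDist().inv_cdf(1 - b)
    return math.ceil(2 * (z_a + z_b) ** 2 / d ** 2)

n = required_subjects(a=0.05, b=0.05, d=0.5)
print(n)  # 87
```

Because n varies with 1/D^2, relaxing the detectable difference from half a standard deviation to a full standard deviation would cut the requirement to roughly a quarter under the same assumption.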

In order to reduce the number of subjects required, we can use sequential analysis for the testing. The main feature of sequential analysis is that the sample size is not determined in advance; instead, the validity of the null hypothesis is tested after each set of results has been collected. It should be noted that sequential analysis may not reveal correlations, clusters, gaps and outliers related to variations in the time domain.

Sequential analysis techniques were used successfully for the analysis of the effectiveness of child-proof packaging in the United Kingdom.

The following is a sequential analysis technique (Barnard, 1946). A normal curve has two parameters: mean and standard deviation. We assume that the four test cases come from distributions with the same standard deviation. The distributions differ only in their means, which is what we are testing. Note that this standard deviation must be known in advance, perhaps determined from the results of a pilot study.

Let a, the probability of falsely rejecting the null hypothesis, and b, the probability of falsely accepting the null hypothesis, both be set at the 0.05 level.

Let d, the difference in means that it is considered important to detect, be defined as half of one standard deviation.

Define the following parameters for the sequential analysis (Barnard, 1946). Note that these conditions correspond to Tukey’s “Honest Statistical Difference”.

and, as each subject is tested, for the current sample size, define

and calculate T, the sum of the difference scores. Conclusions are made based on the following conditions (Barnard, 1946):


  • if T < T0, accept null hypothesis
  • if T > T1, accept alternative hypothesis
  • if T0 < T < T1, no conclusion can be made and testing must continue
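The three-way rule above can be sketched in code. The decision function follows the conditions as listed, but the limits T0 and T1 are computed here with the standard Wald sequential probability ratio boundaries for a normal mean with known standard deviation. That choice is our assumption and may differ from the parameter definitions of Barnard (1946) used for the tables. It is also one-sided, testing a mean of zero against a mean of +d. All function names are ours.

```python
import math

def boundaries(n, sigma, d, a=0.05, b=0.05):
    """Wald-style limits (T0, T1) for the running sum T after n subjects.

    Assumption: H0 is mean 0, H1 is mean d, and sigma is known in advance.
    """
    drift = n * d / 2
    t0 = (sigma ** 2 / d) * math.log(b / (1 - a)) + drift
    t1 = (sigma ** 2 / d) * math.log((1 - b) / a) + drift
    return t0, t1

def decide(t, t0, t1):
    """Apply the three conditions listed above."""
    if t < t0:
        return "accept null hypothesis"
    if t > t1:
        return "accept alternative hypothesis"
    return "continue testing"

def run(differences, sigma, d):
    """Re-test after each subject; stop as soon as a conclusion is reached."""
    total = 0.0
    for n, diff in enumerate(differences, start=1):
        total += diff                      # T, the sum of the difference scores
        verdict = decide(total, *boundaries(n, sigma, d))
        if verdict != "continue testing":
            return n, verdict
    return len(differences), "continue testing"

# Hypothetical difference scores in units of sigma (NOT data from this study).
print(run([0.0] * 30, sigma=1.0, d=0.5))
print(run([1.0] * 30, sigma=1.0, d=0.5))
```

Under these assumed boundaries the limits move upward by d/2 with each additional subject, so a running sum that stays near zero eventually falls below T0, while a sum that grows faster than d/2 per subject eventually exceeds T1.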

    Table 8.1 (below) gives the results of the sequential analysis of the static case for the LCD and CRT. No conclusion can be drawn from the T values calculated, so testing must continue. The results for the dynamic case, given in Table 8.2, indicate that testing must also continue.

    Table 8.1 Sequential Analysis for Static Case

    Table 8.2 Sequential Analysis for Dynamic Case

    Barnard’s U Test is another sequential analysis test that can be used to compare search times for a Test and a Reference product. In addition to the distribution parameters already defined, we define (Barnard, 1946)

    and

    Using the data from Table 9.0, Critical Values for Barnard’s Test, we generate a table for Barnard’s U Test, with the critical values U0 and U1 calculated for a difference of one half standard deviation and statistical significance at the 0.05 level. Conclusions are made based on the following conditions (Barnard, 1946):


  • if T < T0, accept null hypothesis
  • if T > T1, accept alternative hypothesis
  • if T0 < T < T1, no conclusion can be made and testing must continue

    The results of the Barnard’s U analysis, given in Table 9.1 and Table 9.2 (below), also show that testing must continue. Although no conclusion could be reached using sequential analysis, the technique can still be quite useful. Comparing the trend in the T values calculated in Table 9.1 with the table of critical values for Barnard’s test suggests that only 25 to 30 subjects might be needed to obtain a significant result and accept the null hypothesis for the static case. For the dynamic case, the data obtained do not support a prediction.

    Table 9.1 Barnard’s U Analysis for Static Case

    Table 9.2 Barnard’s U Analysis for Dynamic Case

    The Subjects

    The results of the subjective satisfaction survey administered following the experiment favored the AMLCD display. Most users felt the AMLCD was a much better “quality” display than the CRT. This is not too surprising, since the images displayed by an AMLCD are sharper and have more contrast than those of the CRT. Users equated this with quality, and some of them openly expressed their preference for the AMLCD during the experiment. However, when they judged the “comfort” of using the screens, the AMLCD was not as unanimous a winner. The “softer” image produced by the CRT has less contrast than the AMLCD screen and can seem less harsh to some users. While most subjects still preferred the AMLCD, a significant proportion felt the CRT was more comfortable to use.

    After a subject had completed the tests and surveys, we removed the boxes and revealed the displays. Most were very impressed by the look and narrow profile of the AMLCD. Based on our observations, the AMLCD was attractive to just about everyone who participated in this experiment.

    During the experiment, the administrator recorded observations of the subjects as they performed the tasks. Many subjects showed signs of frustration, especially those who were required to perform numerous trials. When some subjects came within ±10% of the total number of targets, they showed signs of satisfaction and acceptance; when those subjects did not reach the threshold, they showed signs of annoyance and dissatisfaction. This observation may help validate the researchers’ attempt to determine what a successful trial would be. Subjects also showed signs of fatigue. Many commented that the tasks were tedious and repetitive. To help reduce fatigue, we suggest that those who continue this research require subjects to take a short break between runs. The subjects who were required to do an excessive number of trials showed no signs of giving up. All subjects completed what was asked of them. We appreciate the patience of all the subjects who participated in this experiment.

    This work parallels the work of Roufs and Boschman (1991) comparing the perceptual quality of CRT displays.

