51st Session of the International Statistical Institute, Istanbul Aug 18-26 1997

Survo as an environment for statistical research and teaching

Seppo Mustonen and Kimmo Vehkalahti
Department of Statistics
P.O.Box 54
00014 University of Helsinki
Finland
Seppo.Mustonen@survo.fi
Kimmo.Vehkalahti@helsinki.fi

Statistical computing has a long tradition in Finland, although this tradition is not widely known internationally. The various forms of the Survo system are an essential part of it. The first Survo systems were created in the sixties (Alanko, Mustonen, Tienari 1968) and seventies (Mustonen 1981). SURVO 76 was one of the first interactive statistical packages. All this happened before the current microcomputer age.

The present Survo (SURVO 84C) was started about 13 years ago from new ideas that had already been developed and tested in rudimentary form in SURVO 76. These ideas are based on the concept of an editorial approach. In this approach all functions of the system are controlled by a specific text editor which distributes the tasks among various independent program modules. Thus SURVO 84C is not a single huge program but a large family of small program modules. The Survo editor is the center of all activities, and the system as a whole is a general environment for a variety of tasks, not only those related to statistical analysis and computing. In fact, we have extended the functions of Survo to many areas which are important in statistical research and in the teaching of statistics.

Fundamentally, using Survo is like working with a combined word processor and spreadsheet program whose capabilities have been extended in various directions. Thus the whole statistical research process can be carried out within Survo. For this purpose Survo includes functions for data input and screening, general data management, statistical graphics and analysis, matrix computations, report generation in printable form, desktop publishing, etc.

Survo also includes a powerful macro language which makes it possible to build expert applications by combining ready-made functions of the system automatically and conditionally. The same technique can be used for creating teaching programs on any topic, for example, on certain statistical methods. Survo macros are called sucros. Plenty of teaching programs on topics related to statistics have been made as sucros.

Survo also provides means for making hypertext applications. For example, a basic course on multivariate statistical methods has been made by the first author (in Finnish). It has been published as a textbook, but all the text, formulas, numerical examples, data sets, and sucros related to the topic are also available in electronic form when using Survo. Thus, when giving the course, both text and examples can be shown on the screen in the classroom. The teacher can easily modify the examples and repeat various stages of analyses and simulations during the lecture.

In a short paper it is impossible to give representative examples of how Survo is used in real research and teaching situations. However, we hope that the following tiny application, related to problems in factor analysis (FA), illustrates something essential.

In his textbook on multivariate analysis, Seber (1984, pp. 222-235) cites critical comments of Francis (1973) based on simulation experiments. Francis studied various artificial factor structures of ten variables and 2 or 3 factors by creating samples, typically of size 50, and trying to detect the original factor pattern by standard methods, for example, by maximum likelihood factorization and varimax rotation. On the basis of these experiments Francis, as quoted by Seber, arrived at very negative conclusions. One of Seber's conclusions is "Even if the postulated model is true - and this is a very strong assumption - the chance of its recovery by present methods does not seem very great".

We think that such claims are exaggerated since many of the experiments of Francis are misleading. To make this evident, we shall re-evaluate one typical experiment (Model V) of Francis. In the sequel we simply show what the user has typed in the edit field and what Survo has given as results (given here in gray shading). The commands activated by the user are displayed as white text on a black background.

Basically, everything in Survo is carried out in an edit field which corresponds to a spreadsheet but also has the capabilities of a word processor. The user types text and commands in this working area. When a command is activated, the editor program passes the task to a suitable program module. The results are automatically written partly in the same edit field (in legible form) and partly into files (numerical results in double precision). The user may edit his/her own text and the results, and type and activate more commands. Thus the following example should be seen as a pale projection of a dynamic process where the actions of the user and the computer are efficiently interlinked.

1 1 SURVO 84C EDITOR Thu Mar 27 11:17:19 1997 D:\ISI\ 200 100 0

  1 *_
  2 * MODEL V (Francis 1973)
  3 *MATRIX G
  4 *///    F1   F2   F3  PSI
  5 *X1     10    7    4   15   / First the original factor pattern is
  6 *X2     10    7    4   15   / given as a matrix G consisting of
  7 *X3     10    7    4   15   / loadings of 3 common factors (F) and
  8 *X4     10    7    4   15   / standard deviations of the unique
  9 *X5     10    7    0   15   / factors (PSI). Thus it is assumed that
 10 *X6     10    7    0   20   / a multivariate normal vector X has the
 11 *X7     10    7    0   20   / covariance structure S = FF' + PSI^2 .
 12 *X8     10    0    0   20
 13 *X9     10    0    0   20
 14 *X10    10    0    0   20
 15 *
 16 *Matrix G is saved in a matrix file of Survo and the covariance matrix
 17 *S is computed by commands of the matrix interpreter of Survo:
 18 *MAT SAVE G
 19 *MAT F=G(*,1:3)                 / F: three first columns of G
 20 *MAT PSI=G(*,4)                 / PSI: fourth column of G
 21 *MAT TRANSFORM PSI BY X#*X#     / Elements of vector PSI are squared.
 22 *MAT S=MMT(F)                   / S=F*F'
 23 *MAT PSI^2=DV(PSI)              / Vector converted to a diagonal matrix
 24 *MAT S=S+PSI^2                  / S=F*F'+PSI^2
 25 *The main failure in reasoning of both Francis and Seber is a belief that
 26 *the given pattern is a `simple structure' which should be reproduced if
 27 *FA really works. On the contrary, it is no `simple structure' at all in
 28 *the true spirit of FA. It is also wrong to speak about 3 factors in this
 29 *example since two last factors are negligible as shown here:
 30 *MAT D=VD(S)                    / D is the diagonal of S as a vector
 31 *MAT TRANSFORM D BY 1/sqrt(X#)  / inverses of standard deviations
 32 *MAT D=DV(D)                    / D is diagonal matrix of these inverses.
 33 *MAT F2=D*F                     / F2 is the rescaled factor matrix.
 34 *MATRUN SUM2,F2,##.##           / Display F2 with sums of squares.

 35 *
 36 *"F2_with_sums_of_squares_by_rows_and_columns"
 37 *///       F1    F2    F3  sumsq
 38 *X1      0.51  0.35  0.20  0.42  / The original factor matrix is now
 39 *X2      0.51  0.35  0.20  0.42  / given in a rescaled form telling
 40 *X3      0.51  0.35  0.20  0.42  / the correlations between variables
 41 *X4      0.51  0.35  0.20  0.42  / and common factors.
 42 *X5      0.52  0.36  0.00  0.40  / The two last factors account only
 43 *X6      0.43  0.30  0.00  0.27  / for 100*(0.81+0.16)/10=9.7 per cent
 44 *X7      0.43  0.30  0.00  0.27  / of the total variation.
 45 *X8      0.45  0.00  0.00  0.20  / The last factor gives only
 46 *X9      0.45  0.00  0.00  0.20  / 100*0.16/10=1.6 per cent of the
 47 *X10     0.45  0.00  0.00  0.20  / total variation.
 48 *sumsqr  2.26  0.81  0.16  3.23
 49 *
 50 *Thus in practice one could speak about a one factor case only and there
 51 *is no hope of detecting the two last factors in small samples. We would
 52 *like to show now that - in spite of these deficiencies in the original
 53 *pattern - FA really works fine to the extent that one may expect. By
 54 *taking a sample large enough the very weak `signals' of two last factors
 55 *will be detected from a high `background noise'.
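For readers without access to Survo, the matrix steps on edit lines 18-34 can be retraced with a minimal sketch in plain Python. It is not part of the original edit field; the function names `covariance` and `rescale` are chosen here for illustration only. The sketch forms S = FF' + PSI^2, divides each row of F by the standard deviation of its variable, and accumulates the sums of squares by rows and columns.

```python
import math

# Model V as typed in the edit field: three loadings and the unique
# standard deviation PSI for each of the ten variables.
G = ([[10, 7, 4, 15] for _ in range(4)]      # X1-X4
     + [[10, 7, 0, 15]]                      # X5
     + [[10, 7, 0, 20] for _ in range(2)]    # X6-X7
     + [[10, 0, 0, 20] for _ in range(3)])   # X8-X10

F = [row[:3] for row in G]    # common factor loadings (MAT F=G(*,1:3))
psi = [row[3] for row in G]   # unique standard deviations (MAT PSI=G(*,4))


def covariance(F, psi):
    """Return S = F F' + diag(psi^2) as a list of rows."""
    p = len(F)
    S = [[sum(a * b for a, b in zip(F[i], F[j])) for j in range(p)]
         for i in range(p)]
    for i in range(p):
        S[i][i] += psi[i] ** 2
    return S


def rescale(F, S):
    """Divide each row of F by the standard deviation of its variable."""
    return [[f / math.sqrt(S[i][i]) for f in row] for i, row in enumerate(F)]


S = covariance(F, psi)
F2 = rescale(F, S)

communalities = [sum(f * f for f in row) for row in F2]           # row sums
column_sumsq = [sum(row[k] ** 2 for row in F2) for k in range(3)]  # column sums

for i, (row, h2) in enumerate(zip(F2, communalities), start=1):
    print(f"X{i:<3}", " ".join(f"{f:5.2f}" for f in row), f"{h2:5.2f}")
print("sumsq", " ".join(f"{c:5.2f}" for c in column_sumsq),
      f"{sum(column_sumsq):5.2f}")
```

The printed values agree with the table on edit lines 37-48 to two decimals. The percentages quoted beside that table are computed from the rounded column sums.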
 56 * We generate a multivariate normal sample SAMPLE from N(0,S), compute
 57 *correlations and the ML solution with 3 factors:
 58 *MNSIMUL S,-,SAMPLE,1000,0  / Generating 1000 observations
 59 *CORR SAMPLE                / Computing correlation matrix CORR.M
 60 *FACTA CORR.M,3,CUR+1       / ML solution one line below (CUR+1)
 61 *Factor analysis: Maximum Likelihood (ML) solution
 62 *Factor matrix (saved as matrix file FACT.M)
 63 *          F1      F2      F3     h^2
 64 *X1      0.660  -0.125   0.029   0.452
 65 *X2      0.689  -0.159  -0.166   0.528
 66 *X3      0.648  -0.086   0.041   0.430
 67 *X4      0.647  -0.008  -0.065   0.423
 68 *X5      0.636   0.135   0.129   0.440
 69 *X6      0.469  -0.033   0.148   0.243
 70 *X7      0.472  -0.008   0.137   0.241
 71 *X8      0.336   0.298  -0.046   0.204
 72 *X9      0.385   0.213  -0.055   0.196
 73 *X10     0.362   0.290  -0.095   0.224
 74 *
 75 *To compare the simulated pattern (FACT.M) to the original one (F) a fair
 76 *method for comparing these patterns is to make a linear transformation
 77 *(rotation) L that minimizes the norm of E = FL - A .
 78 *This is achieved in Survo simply by the following sucro command
 79 */TRAN-LEASTSQR A,FACT.M
 80 *MATRIX L.M
 81 *Transformation_matrix
 82 *///       F1      F2      F3
 83 *F1      0.036   0.027  -0.007
 84 *F2      0.024  -0.034   0.029
 85 *F3      0.034  -0.032  -0.044
 86 *
 87 *MATRIX E.M
 88 *Residual_matrix
 89 *///       F1      F2      F3
 90 *X1      0.001   0.031  -0.070
 91 *X2     -0.028   0.064   0.126
 92 *X3      0.013  -0.009  -0.081
 93 *X4      0.014  -0.086   0.025
 94 *X5     -0.111  -0.104   0.009
 95 *X6      0.057   0.065  -0.010
 96 *X7      0.054   0.040   0.001
 97 *X8      0.025  -0.031  -0.019
 98 *X9     -0.024   0.054  -0.010
 99 *X10    -0.001  -0.023   0.030
100 *
101 *Since all residuals are pretty close to 0, it is seen that the original
102 *factor structure can be restored. However, no standard rotation method
103 *can find it just in the form given by Francis since that form does
104 *not meet the conditions of `simple structure'.
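The least squares transformation can be retraced outside Survo with a short sketch in plain Python. This is not the TRAN-LEASTSQR sucro itself, whose internals are not shown here. It assumes that the original pattern is the unscaled loading matrix (the first three columns of G) and that the residuals are computed as original pattern times L minus FACT.M, a reading under which the printed L.M and E.M are reproduced up to rounding. The helper names `transpose`, `matmul`, `solve`, and `least_squares_transformation` are illustrative.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(A, B):
    cols = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def solve(M, B):
    """Solve M X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [list(M[i]) + list(B[i]) for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                factor = aug[r][c]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def least_squares_transformation(F, B):
    """Return L minimizing ||F L - B|| and the residual matrix E = F L - B."""
    Ft = transpose(F)
    L = solve(matmul(Ft, F), matmul(Ft, B))   # normal equations F'F L = F'B
    FL = matmul(F, L)
    E = [[x - y for x, y in zip(r1, r2)] for r1, r2 in zip(FL, B)]
    return L, E


# Original (unscaled) loadings of Model V.
F = ([[10, 7, 4] for _ in range(4)]
     + [[10, 7, 0] for _ in range(3)]
     + [[10, 0, 0] for _ in range(3)])

# ML factor matrix FACT.M as printed on edit lines 64-73.
FACT = [
    [0.660, -0.125, 0.029], [0.689, -0.159, -0.166],
    [0.648, -0.086, 0.041], [0.647, -0.008, -0.065],
    [0.636, 0.135, 0.129], [0.469, -0.033, 0.148],
    [0.472, -0.008, 0.137], [0.336, 0.298, -0.046],
    [0.385, 0.213, -0.055], [0.362, 0.290, -0.095],
]

L, E = least_squares_transformation(F, FACT)
for row in L:
    print(" ".join(f"{v:7.3f}" for v in row))
for i, row in enumerate(E, start=1):
    print(f"X{i:<3}", " ".join(f"{v:7.3f}" for v in row))
```

Because the sketch starts from the three-decimal values of FACT.M rather than from the double precision matrix file, its output can differ from the printed L.M and E.M in the last decimal.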
105 * If one desires to see what is `simple structure' in this example,
106 *the best possible one is achieved by using the cosine rotation
107 *originally presented by Ahmavaara (1954) and later refined by the first
108 *author of this paper.
109 * This oblique rotation when applied to the original rescaled factor
110 *matrix gives the factor matrix:

111 *ROTATE F2,3,CUR+1 / METHOD=COS,0.19
112 *Rotated factor matrix AFACT.M=F2*inv(TFACT.M)'
113 *          F1      F2      F3   Sumsqr
114 *X1     -0.000   0.000   0.650   0.423
115 *X2     -0.000   0.000   0.650   0.423
116 *X3     -0.000   0.000   0.650   0.423
117 *X4     -0.000   0.000   0.650   0.423
118 *X5      0.631   0.000   0.000   0.398
119 *X6      0.521  -0.000   0.000   0.271
120 *X7      0.521  -0.000   0.000   0.271
121 *X8      0.000   0.447   0.000   0.200
122 *X9      0.000   0.447   0.000   0.200
123 *X10     0.000   0.447   0.000   0.200
124 *Sumsqr  0.941   0.600   1.692   3.234
125 *
126 *In this pattern each variable has non-zero loading on one factor only.
127 *This represents `simple structure' in its strictest form but at the
128 *expense of high factor correlations.
129 * We can also study the distribution of the elements of the residual
130 *matrix E.M by simulation. The next sucro command generates 100
131 *multivariate normal samples of 1000 observations from the factor
132 *structure given by the rescaled factor matrix F2.
133 *
134 */TRAN-SYMTRES F2,*,1000,*,FRES,100,19970001
135 *Simulated residuals in Survo data file FRES.SVO
136 *MAT LOAD FRES,##.###,END+2 / Standard errors of residuals
137 *
138 *MATRIX FRES
139 *Standard_errors_of_residuals_(N=100)
140 *///       F1      F2      F3
141 *X1      0.031   0.057   0.173
142 *X2      0.033   0.055   0.155
143 *X3      0.040   0.066   0.171
144 *X4      0.032   0.059   0.141
145 *X5      0.036   0.103   0.151
146 *X6      0.041   0.112   0.167
147 *X7      0.037   0.105   0.155
148 *X8      0.138   0.182   0.080
149 *X9      0.124   0.175   0.088
150 *X10     0.111   0.158   0.104
151 *
152 *As a summary, standard errors of residuals are given.
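The remark on edit lines 126-128 about high factor correlations can be made concrete with a small sketch in plain Python. It does not reimplement the cosine rotation of Survo. It rests on an assumption that is suggested by the printed loadings but not stated in the text: that the three oblique axes pass through the three distinct row directions of the factor matrix, namely (10,7,4), (10,7,0), and (10,0,0). Under that assumption each non-zero loading equals the length of the corresponding row of F2, and the factor correlations are the cosines of the angles between the three directions. The names `groups`, `cosine`, and `loading` are illustrative.

```python
import math

# Distinct row directions of the loading matrix in Model V and the
# printed factor on which each group of variables loads.
groups = {
    "F3 (X1-X4)": (10, 7, 4),
    "F1 (X5-X7)": (10, 7, 0),
    "F2 (X8-X10)": (10, 0, 0),
}


def norm(u):
    return math.sqrt(sum(a * a for a in u))


def cosine(u, v):
    """Cosine of the angle between two loading vectors."""
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))


def loading(row, psi):
    """Length of the rescaled row: sqrt(communality / total variance)."""
    common = sum(a * a for a in row)
    return math.sqrt(common / (common + psi ** 2))


# Non-zero loadings implied by the assumption (compare edit lines 114-123).
print(f"X1  {loading((10, 7, 4), 15):.3f}")
print(f"X5  {loading((10, 7, 0), 15):.3f}")
print(f"X6  {loading((10, 7, 0), 20):.3f}")
print(f"X8  {loading((10, 0, 0), 20):.3f}")

# Implied correlations between the oblique factors.
names = list(groups)
factor_corr = {}
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        r = cosine(groups[names[i]], groups[names[j]])
        factor_corr[(names[i], names[j])] = r
        print(f"{names[i]} vs {names[j]}: {r:.3f}")
```

Under this reading all three implied factor correlations exceed 0.75, which is in line with the remark that the strict simple structure is obtained only at the expense of high factor correlations.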

More information can be obtained from <URL:http://www.helsinki.fi/survo/> .

BIBLIOGRAPHY

Ahmavaara, Y. (1954). Transformation analysis of factorial data. Ann.Acad.Sci.Fenn., B88,2.
Alanko, T., Mustonen, S., Tienari, M. (1968). A Statistical Programming Language SURVO 66. BIT, 8, 69-85.
Mustonen, S. (1981). On Interactive Statistical Data Processing. Scand J Statist, 8, 129-136.
Mustonen, S. (1992). Survo, an Integrated Environment for Statistical Computing and Related Areas. Survo Systems, Helsinki.
Seber, G.A.F. (1984). Multivariate Observations. Wiley.

SUMMARY

General features of the statistical software package Survo are briefly described. As an integrated environment, Survo is suitable for various research and teaching applications. As an example, some problems related to factor analysis are considered. This example shows how a work process is documented by combining statistical computations and text editing.

RÉSUMÉ

Les traits généraux du progiciel statistique Survo sont présentés brièvement. Comme environnement intégré, Survo convient à plusieurs applications de recherche et d'enseignement. Comme exemple, l'on considère quelques problèmes liés à l'analyse factorielle. Cet exemple montre comment l'on documente un processus de travail en combinant des opérations statistiques et le traitement de texte.