51st Session of the International Statistical Institute, Istanbul Aug 18-26 1997

Survo as an environment for statistical research and teaching


Seppo Mustonen and Kimmo Vehkalahti
Department of Statistics
P.O.Box 54
00014 University of Helsinki
Finland
Seppo.Mustonen@survo.fi
Kimmo.Vehkalahti@helsinki.fi

Statistical computing has a long tradition in Finland, although it is not widely known internationally. Various forms of the Survo system are essential parts of this tradition. The first Survo systems were created in the sixties (Alanko, Mustonen, Tienari 1968) and seventies (Mustonen 1981). SURVO 76 was one of the first interactive statistical packages. All this happened before the current microcomputer age.

The present Survo (SURVO 84C) was started about 13 years ago from new ideas developed and tested in a rudimentary form already in SURVO 76. These ideas are based on the concept of an editorial approach. In this approach all functions of the system are controlled by a specific text editor which distributes the tasks between various independent program modules. Thus SURVO 84C is not a single huge program but a large family of small program modules. The Survo editor is the center of all activities, and the system as a whole is a general environment for various tasks, not only those related to statistical analysis and computing. In fact, we have extended the functions of Survo to many areas which are important in statistical research and in the teaching of statistics.

Fundamentally, using Survo is like working with a combined word processor and spreadsheet program whose capabilities have been extended in various directions. Thus the whole statistical research process can be carried out within Survo. For this purpose Survo includes functions for data input and screening, general data management, statistical graphics and analysis, matrix computations, making reports in printable form, desktop publishing, etc.

Survo also includes a powerful macro language which makes it possible to build various expert applications by combining ready-made functions of the system automatically and conditionally. The same technique can be used for creating teaching programs on any topic, for example, on certain statistical methods. Survo macros are called sucros. Plenty of teaching programs have been made as sucros on topics related to statistics.

Survo also provides means for making hypertext applications. For example, a basic course on multivariate statistical methods has been made by the first author (in Finnish). It has been published as a textbook, but all the text, formulas, numerical examples, data sets, and sucros related to the topic are available in electronic form when using Survo. Thus when giving the course both text and examples can be shown on the screen in the classroom. The teacher can easily modify the examples and repeat various stages of analyses and simulations during the lecture.

In a short paper it is impossible to give representative examples of how Survo is used in real research and teaching situations. However, we hope that the following tiny application related to problems in factor analysis (FA) illustrates something essential.

In his textbook on multivariate analysis, Seber (1984, pp. 222-235) cites critical comments of Francis (1973) based on simulation experiments. Francis studied various artificial factor structures of ten variables and 2 or 3 factors by creating samples, typically of size 50, and trying to detect the original factor pattern by standard methods, for example, by maximum likelihood factorization and varimax rotation. On the basis of these experiments Francis, as quoted by Seber, reached very negative conclusions. One of Seber's conclusions is "Even if the postulated model is true - and this is a very strong assumption - the chance of its recovery by present methods does not seem very great".

We think that such claims are exaggerated since many of the experiments of Francis are misleading. To make this evident, we shall re-evaluate one typical experiment (Model V) of Francis. In the sequel we simply show what the user has typed in the edit field and what Survo has given as results (given here in gray shading). The commands activated by the user are displayed as white text on black background.

Basically, everything in Survo is carried out in an edit field which corresponds to a spreadsheet but also has the capabilities of a word processor. The user types text and commands in this working area. When a command is activated, the editor program passes the task to a suitable program module. The results are automatically written partly in the same edit field (in legible form) and partly into files (numerical results in double precision). The user may edit his/her own text and results, and type and activate more commands. Thus the following example should be seen as a pale projection of a dynamic process where the actions of the user and the computer are efficiently interlinked.

   1  1 SURVO 84C EDITOR Thu Mar 27 11:17:19 1997             D:\ISI\ 200 100 0
   1 *_
   2 *                             MODEL V (Francis 1973)
   3 *MATRIX G
   4 *///   F1    F2    F3  PSI
   5 *X1    10     7     4   15  / First the original factor pattern is
   6 *X2    10     7     4   15  / given as a matrix G consisting of
   7 *X3    10     7     4   15  / loadings of 3 common factors (F) and
   8 *X4    10     7     4   15  / standard deviations of the unique
   9 *X5    10     7     0   15  / factors (PSI). Thus it is assumed that
  10 *X6    10     7     0   20  / a multivariate normal vector X has the
  11 *X7    10     7     0   20  / covariance structure S = FF' + PSI^2 .
  12 *X8    10     0     0   20
  13 *X9    10     0     0   20
  14 *X10   10     0     0   20
  15 *
  16 *Matrix G is saved in a matrix file of Survo and the covariance matrix
  17 *S is computed by commands of the matrix interpreter of Survo:
  18 *MAT SAVE G 
  19 *MAT F=G(*,1:3)             / F: three first columns of G
  20 *MAT PSI=G(*,4)             / PSI: fourth column of G
  21 *MAT TRANSFORM PSI BY X#*X# / Elements of vector PSI are squared.
  22 *MAT S=MMT(F)               / S=F*F'
  23 *MAT PSI^2=DV(PSI)          / Vector converted to a diagonal matrix
  24 *MAT S=S+PSI^2              / S=F*F'+PSI^2
  25 *The main failure in reasoning of both Francis and Seber is a belief that
  26 *the given pattern is a `simple structure' which should be reproduced if
  27 *FA really works. On the contrary, it is no `simple structure' at all in
  28 *the true spirit of FA. It is also wrong to speak about 3 factors in this
  29 *example since the last two factors are negligible as shown here:
  30 *MAT D=VD(S)                / D is the diagonal of S as a vector
  31 *MAT TRANSFORM D BY 1/sqrt(X#) / inverses of standard deviations
  32 *MAT D=DV(D)                / D is diagonal matrix of these inverses.
  33 *MAT F2=D*F                 / F2 is the rescaled factor matrix.
  34 *MATRUN SUM2,F2,##.##       / Display F2 with sums of squares.
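
For readers without access to Survo, the matrix steps on editor lines 18-34 can be sketched in Python with NumPy. This is only an illustrative translation under the stated model S = FF' + PSI^2; it is not Survo code, and all variable names are assumptions made for this illustration.

```python
import numpy as np

# Matrix G as typed on editor lines 4-14: loadings F1..F3 and the
# standard deviations PSI of the unique factors.
G = np.array([
    [10, 7, 4, 15],
    [10, 7, 4, 15],
    [10, 7, 4, 15],
    [10, 7, 4, 15],
    [10, 7, 0, 15],
    [10, 7, 0, 20],
    [10, 7, 0, 20],
    [10, 0, 0, 20],
    [10, 0, 0, 20],
    [10, 0, 0, 20],
], dtype=float)

F = G[:, :3]                       # three first columns of G
psi = G[:, 3]                      # fourth column of G

# Covariance structure S = F F' + diag(psi^2)
S = F @ F.T + np.diag(psi ** 2)

# Rescale by the inverses of the standard deviations, giving the
# correlations between the variables and the common factors.
d = 1.0 / np.sqrt(np.diag(S))
F2 = d[:, None] * F

communalities = (F2 ** 2).sum(axis=1)   # row sums of squares
col_sumsq = (F2 ** 2).sum(axis=0)       # column sums of squares
share = 100 * col_sumsq / F.shape[0]    # per cent of total variation

print(np.round(F2, 2))
print(np.round(col_sumsq, 2))
print(np.round(share, 1))
```

The column sums of squares come out close to 2.26, 0.81 and 0.16, so the first factor dominates and the last two together carry just under 10 per cent of the total variation.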

   1  1 SURVO 84C EDITOR Thu Mar 27 11:17:19 1997             D:\ISI\ 200 100 0
  35 *
  36 *"F2_with_sums_of_squares_by_rows_and_columns"
  37 *///         F1    F2    F3 sumsq
  38 *X1        0.51  0.35  0.20  0.42  / The original factor matrix is now
  39 *X2        0.51  0.35  0.20  0.42  / given in a rescaled form telling
  40 *X3        0.51  0.35  0.20  0.42  / the correlations between variables
  41 *X4        0.51  0.35  0.20  0.42  / and common factors.
  42 *X5        0.52  0.36  0.00  0.40  / The two last factors account only
  43 *X6        0.43  0.30  0.00  0.27  / for 100*(0.81+0.16)/10=9.7 per cent
  44 *X7        0.43  0.30  0.00  0.27  / of the total variation.
  45 *X8        0.45  0.00  0.00  0.20  / The last factor gives only
  46 *X9        0.45  0.00  0.00  0.20  / 100*0.16/10=1.6 per cent of the
  47 *X10       0.45  0.00  0.00  0.20  / total variation.
  48 *sumsqr    2.26  0.81  0.16  3.23
  49 *
  50 *Thus in practice one could speak about a one factor case only and there
  51 *is no hope of detecting the two last factors in small samples. We would
  52 *like to show now that - in spite of these deficiencies in the original
  53 *pattern - FA really works fine to the extent that one may expect. By
  54 *taking a sample large enough the very weak `signals' of two last factors
  55 *will be detected from a high `background noise'.
  56 *    We generate a multivariate normal sample SAMPLE from N(0,S), compute
  57 *correlations and the ML solution with 3 factors:
  58 *MNSIMUL S,-,SAMPLE,1000,0         / Generating 1000 observations
  59 *CORR SAMPLE                       / Computing correlation matrix CORR.M
  60 *FACTA CORR.M,3,CUR+1              / ML solution one line below (CUR+1)
  61 *Factor analysis: Maximum Likelihood (ML) solution
  62 *Factor matrix (saved as matrix file FACT.M)
  63 *             F1     F2     F3    h^2
  64 *X1        0.660 -0.125  0.029  0.452
  65 *X2        0.689 -0.159 -0.166  0.528
  66 *X3        0.648 -0.086  0.041  0.430
  67 *X4        0.647 -0.008 -0.065  0.423
  68 *X5        0.636  0.135  0.129  0.440
  69 *X6        0.469 -0.033  0.148  0.243
  70 *X7        0.472 -0.008  0.137  0.241
  71 *X8        0.336  0.298 -0.046  0.204
  72 *X9        0.385  0.213 -0.055  0.196
  73 *X10       0.362  0.290 -0.095  0.224
  74 *
  75 *A fair method for comparing the simulated pattern (FACT.M) to the
  76 *original one (F) is to make a linear transformation
  77 *(rotation) L that minimizes the norm of E = FL - A .
  78 *This is achieved in Survo simply by the following sucro command
  79 */TRAN-LEASTSQR A,FACT.M
  80 *MATRIX L.M
  81 *Transformation_matrix
  82 *///          F1     F2     F3
  83 *F1        0.036  0.027 -0.007
  84 *F2        0.024 -0.034  0.029
  85 *F3        0.034 -0.032 -0.044
  86 *
  87 *MATRIX E.M
  88 *Residual_matrix
  89 *///          F1     F2     F3
  90 *X1        0.001  0.031 -0.070
  91 *X2       -0.028  0.064  0.126
  92 *X3        0.013 -0.009 -0.081
  93 *X4        0.014 -0.086  0.025
  94 *X5       -0.111 -0.104  0.009
  95 *X6        0.057  0.065 -0.010
  96 *X7        0.054  0.040  0.001
  97 *X8        0.025 -0.031 -0.019
  98 *X9       -0.024  0.054 -0.010
  99 *X10      -0.001 -0.023  0.030
 100 *
 101 *Since all residuals are pretty close to 0, it is seen that the original
 102 *factor structure can be restored. However, no standard rotation method
 103 *can find it just in the form given by Francis since that form does
 104 *not meet the conditions of `simple structure'.
 105 *    If one desires to see what is `simple structure' in this example,
 106 *the best possible one is achieved by using the cosine rotation
 107 *originally presented by Ahmavaara (1954) and later refined by the first
 108 *author of this paper.
 109 *    This oblique rotation when applied to the original rescaled factor
 110 *matrix gives the factor matrix:
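
The least squares comparison on editor lines 75-99 can be sketched in Python with NumPy. The maximum likelihood factoring itself is not reproduced here; the FACT.M values are simply copied from editor lines 64-73. It is an assumption of this sketch that the matrix being transformed is the unscaled loading matrix of editor line 19 and that the residual is the transformed matrix minus FACT.M. The variable names are hypothetical, and `numpy.linalg.lstsq` stands in for whatever the sucro does internally.

```python
import numpy as np

# Original (unscaled) loadings, as on editor lines 5-14.
A = np.array([[10, 7, 4]] * 4 + [[10, 7, 0]] * 3 + [[10, 0, 0]] * 3,
             dtype=float)

# ML factor matrix FACT.M, copied from editor lines 64-73.
B = np.array([
    [0.660, -0.125,  0.029],
    [0.689, -0.159, -0.166],
    [0.648, -0.086,  0.041],
    [0.647, -0.008, -0.065],
    [0.636,  0.135,  0.129],
    [0.469, -0.033,  0.148],
    [0.472, -0.008,  0.137],
    [0.336,  0.298, -0.046],
    [0.385,  0.213, -0.055],
    [0.362,  0.290, -0.095],
])

# Find L minimizing the Frobenius norm of A @ L - B.
L, *_ = np.linalg.lstsq(A, B, rcond=None)

# Residual matrix under the assumed sign convention.
E = A @ L - B

print(np.round(L, 3))
print(np.round(E, 3))
```

Under these assumptions the computed L and E agree with the printed L.M and E.M to within about 0.001, the remaining differences being due to the three-decimal rounding of FACT.M.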

   1  1 SURVO 84C EDITOR Thu Mar 27 11:17:19 1997             D:\ISI\ 200 100 0
 111 *ROTATE F2,3,CUR+1 / METHOD=COS,0.19
 112 *Rotated factor matrix AFACT.M=F2*inv(TFACT.M)'
 113 *             F1     F2     F3 Sumsqr
 114 *X1       -0.000  0.000  0.650  0.423
 115 *X2       -0.000  0.000  0.650  0.423
 116 *X3       -0.000  0.000  0.650  0.423
 117 *X4       -0.000  0.000  0.650  0.423
 118 *X5        0.631  0.000  0.000  0.398
 119 *X6        0.521 -0.000  0.000  0.271
 120 *X7        0.521 -0.000  0.000  0.271
 121 *X8        0.000  0.447  0.000  0.200
 122 *X9        0.000  0.447  0.000  0.200
 123 *X10       0.000  0.447  0.000  0.200
 124 *Sumsqr    0.941  0.600  1.692  3.234
 125 *
 126 *In this pattern each variable has non-zero loading on one factor only.
 127 *This represents `simple structure' in its strictest form but at the
 128 *expense of high factor correlations.
 129 *    We can also study the distribution of the elements of the residual
 130 *matrix E.M by simulation. The next sucro command generates 100
 131 *multivariate normal samples of 1000 observations from the factor
 132 *structure given by the rescaled factor matrix F2.
 133 *
 134 */TRAN-SYMTRES F2,*,1000,*,FRES,100,19970001
 135 *Simulated residuals in Survo data file FRES.SVO
 136 *MAT LOAD FRES,##.###,END+2 / Standard errors of residuals
 137 *
 138 *MATRIX FRES
 139 *Standard_errors_of_residuals_(N=100)
 140 *///          F1     F2     F3
 141 *X1        0.031  0.057  0.173
 142 *X2        0.033  0.055  0.155
 143 *X3        0.040  0.066  0.171
 144 *X4        0.032  0.059  0.141
 145 *X5        0.036  0.103  0.151
 146 *X6        0.041  0.112  0.167
 147 *X7        0.037  0.105  0.155
 148 *X8        0.138  0.182  0.080
 149 *X9        0.124  0.175  0.088
 150 *X10       0.111  0.158  0.104
 151 *
 152 *As a summary, standard errors of residuals are given.
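
The rotated factor matrix on editor lines 113-124 can be checked with a small NumPy sketch. This is not the cosine rotation algorithm of Survo. It is a hand-made construction that works here only because the rescaled pattern has just three distinct row directions, which are taken as the oblique factor axes. All names are assumptions made for this illustration.

```python
import numpy as np

# Rescaled factor matrix F2, rebuilt from the original loadings and
# the unique standard deviations (editor lines 5-14).
F = np.array([[10, 7, 4]] * 4 + [[10, 7, 0]] * 3 + [[10, 0, 0]] * 3,
             dtype=float)
psi = np.array([15] * 5 + [20] * 5, dtype=float)
F2 = F / np.sqrt((F ** 2).sum(axis=1) + psi ** 2)[:, None]

# Oblique factor axes: unit vectors along the rows of X5, X8 and X1,
# ordered to match the columns F1, F2, F3 of the printed matrix.
T = np.array([F2[4], F2[7], F2[0]])
T /= np.linalg.norm(T, axis=1, keepdims=True)

# Pattern matrix A satisfying F2 = A @ T, and factor correlations Phi.
A = F2 @ np.linalg.inv(T)
Phi = T @ T.T

print(np.round(A, 3))
print(np.round((A ** 2).sum(axis=0), 3))   # column sums of squares
print(np.round(Phi, 3))                    # correlations between factors
```

In this construction each variable loads on one factor only, and A @ Phi @ A.T equals F2 @ F2.T, so the common part of the correlation structure is unchanged. The off-diagonal elements of Phi are all above 0.7, which illustrates the price paid in high factor correlations.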

More information can be obtained from <URL:http://www.helsinki.fi/survo/> .

BIBLIOGRAPHY

Ahmavaara, Y. (1954).
Transformation analysis of factorial data. Ann.Acad.Sci.Fenn., B88,2.
Alanko, T., Mustonen, S., Tienari, M. (1968).
A Statistical Programming Language SURVO 66, BIT, 8, 69-85.
Mustonen, S. (1981).
On Interactive Statistical Data Processing, Scand J Statist, 8, 129-136.
Mustonen, S. (1992).
Survo, an Integrated Environment for Statistical Computing and Related Areas, Survo Systems, Helsinki.
Seber, G.A.F. (1984).
Multivariate Observations, Wiley.

SUMMARY

General features of the statistical software package Survo are briefly described. As an integrated environment Survo is suitable for various research and teaching applications. As an example some problems related to factor analysis are considered. This example shows how a work process is documented by combining statistical computations and text editing.

RÉSUMÉ

Les traits généraux du progiciel statistique Survo sont présentés brièvement. Comme environnement intégré, Survo convient à plusieurs applications de recherche et d'enseignement. Comme exemple, l'on considère quelques problèmes liés à l'analyse factorielle. Cet exemple montre comment l'on documente un processus de travail en combinant des opérations statistiques et le traitement de texte.