51st Session of the International Statistical Institute, Istanbul Aug 18-26 1997
Survo as an environment for statistical research and teaching
Statistical computing has a long tradition in Finland, although it
is not widely known internationally. Various forms of the Survo
system are an essential part of this tradition. The first Survo
systems were created in the sixties (Alanko, Mustonen, Tienari 1968)
and seventies (Mustonen 1981). SURVO 76 was one of the first
interactive statistical packages. All this happened before the
current microcomputer age.
The present Survo (SURVO 84C) was started about 13 years
ago from new ideas that had already been developed and tested in
rudimentary form in SURVO 76. These ideas are based on the concept
of an editorial approach. In this approach all functions of the system
are controlled by a specific text editor which distributes the
tasks among various independent program modules. Thus SURVO 84C
is not a single huge program but a large family of small program
modules. The Survo editor is the center of all activities, and the
system as a whole is a general environment for various tasks
that are not limited to statistical analysis and computing. In fact,
we have extended the functions of Survo to many areas which are
important in statistical research and the teaching of statistics.
Fundamentally, using Survo is like working with a combined
word processor and spreadsheet program with capabilities extended
in various directions. Thus the whole statistical research process
can be carried out within Survo. For this purpose Survo includes
functions for data input and screening, general data management,
statistical graphics and analysis, matrix computations, making
reports in printable form, desktop publishing, etc.
Survo also includes a powerful macro language which makes it
possible to build various expert applications by combining
ready-made functions of the system automatically and conditionally.
The same technique can be used for creating teaching programs on any
topic, for example, on certain statistical methods. Survo macros
are called sucros. Plenty of teaching programs have been made as
sucros on topics related to statistics.
Survo also provides means for making hypertext
applications. For example, a basic course on multivariate
statistical methods has been made by the first author (in
Finnish). It has been published as a textbook, but all the text,
formulas, numerical examples, data sets, and sucros related to
the topic are also available in electronic form within Survo. Thus,
when the course is given, both text and examples can be shown on the
screen in the classroom. The teacher can easily modify the
examples and repeat various stages of analyses and simulations
during the lecture.
In a short paper it is impossible to give
representative examples of how Survo is used in real research
and teaching situations. However, we hope that the following tiny
application related to problems in factor analysis (FA)
illustrates something essential.
In his textbook on multivariate analysis (Seber 1984, pp.
222-235) the author cites critical comments of Francis (1973)
based on simulation experiments. Francis studied various
artificial factor structures of 10 variables and 2 or 3 factors
by creating samples, typically of size 50, and trying to detect the
original factor pattern by standard methods, for example, by
maximum likelihood factorization and varimax rotation. On the
basis of these experiments Francis, as quoted by Seber, came to
very negative conclusions. One of Seber's conclusions is "Even if the
postulated model is true - and this is a very strong assumption -
the chance of its recovery by present methods does not seem very
great".
We think that such claims are exaggerated since many of the
experiments of Francis are misleading. To make this evident, we
shall re-evaluate one typical experiment (Model V) of Francis. In
the sequel we simply show what the user has typed in the edit
field and what Survo has given as results (given here in gray
shading). The commands activated by the user are displayed as
white text on black background.
Basically, everything in Survo is carried out in an edit
field which corresponds to a spreadsheet but also has the
capabilities of a word processor. The user types text and
commands in this working area. When a command is activated, the
editor program passes the task to a suitable program module. The
results are automatically written partly in the same edit field
(in legible form) and partly into files (numerical results in
double precision). The user may edit his/her own text and results
and type and activate more commands. Thus the following example
should be seen as a pale projection of a dynamic process where
the actions of the user and the computer are efficiently
interlinked.
|
1 *_
2 * MODEL V (Francis 1973)
3 *MATRIX G
4 */// F1 F2 F3 PSI
5 *X1 10 7 4 15 / First the original factor pattern is
6 *X2 10 7 4 15 / given as a matrix G consisting of
7 *X3 10 7 4 15 / loadings of 3 common factors (F) and
8 *X4 10 7 4 15 / standard deviations of the unique
9 *X5 10 7 0 15 / factors (PSI). Thus it is assumed that
10 *X6 10 7 0 20 / a multivariate normal vector X has the
11 *X7 10 7 0 20 / covariance structure S = FF' + PSI^2 .
12 *X8 10 0 0 20
13 *X9 10 0 0 20
14 *X10 10 0 0 20
15 *
16 *Matrix G is saved in a matrix file of Survo and the covariance matrix
17 *S is computed by commands of the matrix interpreter of Survo:
18 *MAT SAVE G
19 *MAT F=G(*,1:3) / F: three first columns of G
20 *MAT PSI=G(*,4) / PSI: fourth column of G
21 *MAT TRANSFORM PSI BY X#*X# / Elements of vector PSI are squared.
22 *MAT S=MMT(F) / S=F*F'
23 *MAT PSI^2=DV(PSI) / Vector converted to a diagonal matrix
24 *MAT S=S+PSI^2 / S=F*F'+PSI^2
25 *The main flaw in the reasoning of both Francis and Seber is a belief that
26 *the given pattern is a `simple structure' which should be reproduced if
27 *FA really works. On the contrary, it is no `simple structure' at all in
28 *the true spirit of FA. It is also wrong to speak about 3 factors in this
29 *example since the last two factors are negligible as shown here:
30 *MAT D=VD(S) / D is the diagonal of S as a vector
31 *MAT TRANSFORM D BY 1/sqrt(X#) / inverses of standard deviations
32 *MAT D=DV(D) / D is diagonal matrix of these inverses.
33 *MAT F2=D*F / F2 is the rescaled factor matrix.
34 *MATRUN SUM2,F2,##.## / Display F2 with sums of squares.
|
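The matrix steps in the edit field above can be mirrored outside Survo. The following is a minimal sketch in Python with NumPy (an assumption; the paper itself uses only Survo's MAT commands), with variable names chosen to follow the listing. It rebuilds S = FF' + PSI^2 for Model V, rescales F by the inverse standard deviations, and sums the squared loadings by column.

```python
import numpy as np

# Model V (Francis 1973) as typed in the edit field:
# columns F1, F2, F3 are loadings, the last column (PSI) holds the
# standard deviations of the unique factors.
G = np.array([
    [10, 7, 4, 15],  # X1
    [10, 7, 4, 15],  # X2
    [10, 7, 4, 15],  # X3
    [10, 7, 4, 15],  # X4
    [10, 7, 0, 15],  # X5
    [10, 7, 0, 20],  # X6
    [10, 7, 0, 20],  # X7
    [10, 0, 0, 20],  # X8
    [10, 0, 0, 20],  # X9
    [10, 0, 0, 20],  # X10
], dtype=float)

F = G[:, :3]                      # MAT F=G(*,1:3)
psi = G[:, 3]                     # MAT PSI=G(*,4)
S = F @ F.T + np.diag(psi ** 2)   # S = F*F' + PSI^2

d = 1.0 / np.sqrt(np.diag(S))     # inverses of standard deviations
F2 = d[:, None] * F               # same result as diag(d) @ F

col_sumsq = (F2 ** 2).sum(axis=0)  # variation explained by each factor
row_sumsq = (F2 ** 2).sum(axis=1)  # communality of each variable

# Share of total variation (10 standardized variables) due to factors 2 and 3.
share_last_two = 100 * col_sumsq[1:].sum() / F2.shape[0]

print(np.round(F2, 2))
print(np.round(col_sumsq, 2))
print(round(share_last_two, 2))
```

Computed from unrounded sums of squares, the share of the last two factors comes out slightly below 9.8 per cent; the 9.7 per cent quoted in the edit field follows from the rounded values 0.81 and 0.16.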
|
35 *
36 *"F2_with_sums_of_squares_by_rows_and_columns"
37 */// F1 F2 F3 sumsq
38 *X1 0.51 0.35 0.20 0.42 / The original factor matrix is now
39 *X2 0.51 0.35 0.20 0.42 / given in a rescaled form telling
40 *X3 0.51 0.35 0.20 0.42 / the correlations between variables
41 *X4 0.51 0.35 0.20 0.42 / and common factors.
42 *X5 0.52 0.36 0.00 0.40 / The two last factors account only
43 *X6 0.43 0.30 0.00 0.27 / for 100*(0.81+0.16)/10=9.7 per cent
44 *X7 0.43 0.30 0.00 0.27 / of the total variation.
45 *X8 0.45 0.00 0.00 0.20 / The last factor gives only
46 *X9 0.45 0.00 0.00 0.20 / 100*0.16/10=1.6 per cent of the
47 *X10 0.45 0.00 0.00 0.20 / total variation.
48 *sumsqr 2.26 0.81 0.16 3.23
49 *
50 *Thus in practice one could speak about a one-factor case only and there
51 *is no hope of detecting the last two factors in small samples. We would
52 *now like to show that - in spite of these deficiencies in the original
53 *pattern - FA really works as well as one may expect. By
54 *taking a large enough sample the very weak `signals' of the last two
55 *factors will be detected from a high `background noise'.
56 * We generate a multivariate normal sample SAMPLE from N(0,S), compute
57 *correlations and the ML solution with 3 factors:
58 *MNSIMUL S,-,SAMPLE,1000,0 / Generating 1000 observations
59 *CORR SAMPLE / Computing correlation matrix CORR.M
60 *FACTA CORR.M,3,CUR+1 / ML solution one line below (CUR+1)
61 *Factor analysis: Maximum Likelihood (ML) solution
62 *Factor matrix (saved as matrix file FACT.M)
63 * F1 F2 F3 h^2
64 *X1 0.660 -0.125 0.029 0.452
65 *X2 0.689 -0.159 -0.166 0.528
66 *X3 0.648 -0.086 0.041 0.430
67 *X4 0.647 -0.008 -0.065 0.423
68 *X5 0.636 0.135 0.129 0.440
69 *X6 0.469 -0.033 0.148 0.243
70 *X7 0.472 -0.008 0.137 0.241
71 *X8 0.336 0.298 -0.046 0.204
72 *X9 0.385 0.213 -0.055 0.196
73 *X10 0.362 0.290 -0.095 0.224
74 *
75 *A fair method for comparing the simulated pattern (FACT.M) to the
76 *original one (F) is to make a linear transformation
77 *(rotation) L that minimizes the norm of E = FL - A .
78 *This is achieved in Survo simply by the following sucro command:
79 */TRAN-LEASTSQR A,FACT.M
80 *MATRIX L.M
81 *Transformation_matrix
82 */// F1 F2 F3
83 *F1 0.036 0.027 -0.007
84 *F2 0.024 -0.034 0.029
85 *F3 0.034 -0.032 -0.044
86 *
87 *MATRIX E.M
88 *Residual_matrix
89 */// F1 F2 F3
90 *X1 0.001 0.031 -0.070
91 *X2 -0.028 0.064 0.126
92 *X3 0.013 -0.009 -0.081
93 *X4 0.014 -0.086 0.025
94 *X5 -0.111 -0.104 0.009
95 *X6 0.057 0.065 -0.010
96 *X7 0.054 0.040 0.001
97 *X8 0.025 -0.031 -0.019
98 *X9 -0.024 0.054 -0.010
99 *X10 -0.001 -0.023 0.030
100 *
101 *Since all residuals are small (at most 0.126 in absolute value), it is
102 *seen that the original factor structure can be recovered. However, no
103 *standard rotation method can find it exactly in the form given by Francis
104 *since that form does not meet the conditions of `simple structure'.
105 * If one desires to see what `simple structure' is in this example,
106 *the best possible one is achieved by using the cosine rotation
107 *originally presented by Ahmavaara (1954) and later refined by the first
108 *author of this paper.
109 * This oblique rotation when applied to the original rescaled factor
110 *matrix gives the factor matrix:
|
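The least-squares comparison of the two patterns can be sketched as follows in Python with NumPy (an assumption; the internals of the sucro /TRAN-LEASTSQR are not shown in the paper). Here A holds the original unscaled loadings of Model V, B is the ML factor matrix FACT.M copied from the listing, and the residual is taken as A L - B. This labeling is an assumption, chosen because it appears to reproduce the printed L.M and E.M up to the rounding of the three-decimal input.

```python
import numpy as np

# Original (unscaled) loadings of Model V: X1-X4, X5-X7, X8-X10.
A = np.array([[10, 7, 4]] * 4 + [[10, 7, 0]] * 3 + [[10, 0, 0]] * 3,
             dtype=float)

# ML factor matrix FACT.M, copied from the listing (three decimals).
B = np.array([
    [0.660, -0.125,  0.029],  # X1
    [0.689, -0.159, -0.166],  # X2
    [0.648, -0.086,  0.041],  # X3
    [0.647, -0.008, -0.065],  # X4
    [0.636,  0.135,  0.129],  # X5
    [0.469, -0.033,  0.148],  # X6
    [0.472, -0.008,  0.137],  # X7
    [0.336,  0.298, -0.046],  # X8
    [0.385,  0.213, -0.055],  # X9
    [0.362,  0.290, -0.095],  # X10
])

# Transformation L minimizing the Frobenius norm of A @ L - B.
L, *_ = np.linalg.lstsq(A, B, rcond=None)

# Residual matrix under the assumed sign convention.
E = A @ L - B

print(np.round(L, 3))
print(np.round(E, 3))
print(round(float(np.abs(E).max()), 3))
```

Because the rows of A take only three distinct values, the fitted rows are simply the group means of B over X1-X4, X5-X7 and X8-X10, so the residuals measure how far each variable's estimated loadings lie from its group average.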
|
111 *ROTATE F2,3,CUR+1 / METHOD=COS,0.19
112 *Rotated factor matrix AFACT.M=F2*inv(TFACT.M)'
113 * F1 F2 F3 Sumsqr
114 *X1 -0.000 0.000 0.650 0.423
115 *X2 -0.000 0.000 0.650 0.423
116 *X3 -0.000 0.000 0.650 0.423
117 *X4 -0.000 0.000 0.650 0.423
118 *X5 0.631 0.000 0.000 0.398
119 *X6 0.521 -0.000 0.000 0.271
120 *X7 0.521 -0.000 0.000 0.271
121 *X8 0.000 0.447 0.000 0.200
122 *X9 0.000 0.447 0.000 0.200
123 *X10 0.000 0.447 0.000 0.200
124 *Sumsqr 0.941 0.600 1.692 3.234
125 *
126 *In this pattern each variable has a non-zero loading on one factor only.
127 *This represents `simple structure' in its strictest form but at the
128 *expense of high factor correlations.
129 * We can also study the distribution of the elements of the residual
130 *matrix E.M by simulation. The next sucro command generates 100
131 *multivariate normal samples of 1000 observations from the factor
132 *structure given by the rescaled factor matrix F2.
133 *
134 */TRAN-SYMTRES F2,*,1000,*,FRES,100,19970001
135 *Simulated residuals in Survo data file FRES.SVO
136 *MAT LOAD FRES,##.###,END+2 / Standard errors of residuals
137 *
138 *MATRIX FRES
139 *Standard_errors_of_residuals_(N=100)
140 */// F1 F2 F3
141 *X1 0.031 0.057 0.173
142 *X2 0.033 0.055 0.155
143 *X3 0.040 0.066 0.171
144 *X4 0.032 0.059 0.141
145 *X5 0.036 0.103 0.151
146 *X6 0.041 0.112 0.167
147 *X7 0.037 0.105 0.155
148 *X8 0.138 0.182 0.080
149 *X9 0.124 0.175 0.088
150 *X10 0.111 0.158 0.104
151 *
152 *As a summary, standard errors of residuals are given.
|
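The remark about high factor correlations can be checked numerically. The sketch below, in Python with NumPy (an assumption), is not the cosine rotation algorithm itself. It only verifies that placing oblique axes along the three distinct row directions of the original pattern reproduces the printed rotated matrix, and it computes the cosines between those axes, which under this construction are the factor correlations. The names T, P and phi are illustrative and do not come from Survo.

```python
import numpy as np

# Original loadings and unique standard deviations of Model V.
F = np.array([[10, 7, 4]] * 4 + [[10, 7, 0]] * 3 + [[10, 0, 0]] * 3,
             dtype=float)
psi = np.array([15.0] * 5 + [20.0] * 5)

# Rescaled factor matrix F2 (loadings divided by standard deviations).
sd = np.sqrt((F ** 2).sum(axis=1) + psi ** 2)
F2 = F / sd[:, None]

# Unit vectors along the three distinct row directions of F, ordered to
# match the printed columns (X5-X7, then X8-X10, then X1-X4).
axes = np.array([[10, 7, 0], [10, 0, 0], [10, 7, 4]], dtype=float)
T = axes / np.linalg.norm(axes, axis=1, keepdims=True)

# Pattern loadings P satisfy F2 = P @ T.
P = F2 @ np.linalg.inv(T)

# Cosines between the oblique axes, i.e. the implied factor correlations.
phi = T @ T.T

print(np.round(P, 3))
print(np.round(phi, 3))
```

Under these assumptions every off-diagonal entry of phi exceeds 0.75, which illustrates the price paid for the strict one-loading-per-variable pattern.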
More information can be obtained from
<URL:http://www.helsinki.fi/survo/> .
BIBLIOGRAPHY
- Ahmavaara, Y. (1954). Transformation analysis of factorial data. Ann. Acad. Sci. Fenn., B88,2.
- Alanko, T., Mustonen, S., Tienari, M. (1968). A Statistical Programming Language SURVO 66. BIT, 8, 69-85.
- Mustonen, S. (1981). On Interactive Statistical Data Processing. Scand J Statist, 8, 129-136.
- Mustonen, S. (1992). Survo, an Integrated Environment for Statistical Computing and Related Areas. Survo Systems, Helsinki.
- Seber, G.A.F. (1984). Multivariate Observations. Wiley.
SUMMARY
General features of the statistical software package Survo are
briefly described. As an integrated environment, Survo is suitable
for various research and teaching applications. As an example,
some problems related to factor analysis are considered. This
example shows how a work process is documented by combining
statistical computations and text editing.
RÉSUMÉ
Les traits généraux du progiciel statistique Survo sont présentés
brièvement. Comme environnement intégré, Survo convient à
plusieurs applications de recherche et d'enseignement. Comme
exemple, l'on considère quelques problèmes liés à l'analyse
factorielle. Cet exemple montre comment l'on documente un processus
de travail en combinant des opérations statistiques et le
traitement de texte.