CLUSTER <data_file>,L
performs cluster analysis on the selected variables and observations of the Survo <data_file>.

The clustering criterion is Wilks' lambda. The stepwise procedure for efficient computation of lambda values is based on the algorithm presented by Pekka Korhonen in his doctoral dissertation "A stepwise procedure for multivariate clustering", Computing Centre, University of Helsinki, Research Reports N:o 7, Helsinki 1979. The CLUSTER module applies the dual procedure of Korhonen's stepwise method.

For general information on cluster analysis, see e.g. M.R.Anderberg: "Cluster Analysis for Applications", Academic Press, New York and London, 1973.

The active observations of <data_file> are defined by IND and CASES specifications. The variables used in the analysis are all active variables in <data_file>, except those activated by 'G' or 'I'.

The stepwise clustering procedure always starts from an initial grouping of observations. The number of clusters (g) is given by the GROUPS=g specification. GROUPS=2 is the default.

The initial grouping is given by a variable activated by 'I', and the values of this variable must be integers 1,2,...,g. If no initial grouping is given (no mask 'I' exists), a random initial grouping based on the uniform distribution over 1,2,...,g is applied automatically.

The initial grouping (defined by the 'I' mask variable) can also be incomplete (with missing values or values outside the permitted ones 1,2,...,g). In this case it is assumed that the user has indicated at least one observation in each group. The initial grouping is then completed on this basis by using the "nearest neighbour" principle in the standardized data matrix X (with the property X'X=I).

The main result of CLUSTER is the optimal clustering according to the Wilks' lambda criterion. It is saved in the first variable of <data_file> activated by 'G'.
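The two ways of obtaining an initial grouping described above can be illustrated with a short Python sketch. It is not the code used by CLUSTER. The function names are hypothetical, and the nearest-neighbour step uses plain Euclidean distance on the coordinates given, whereas CLUSTER works in the standardized data matrix X; the exact assignment rule of CLUSTER is not spelled out in this text, so the rule below is an assumption.

```python
import random


def random_grouping(n, g, seed=None):
    """Draw a group label uniformly from 1..g for each of n observations."""
    rng = random.Random(seed)  # seed=None lets Python pick a varying seed
    return [rng.randint(1, g) for _ in range(n)]


def complete_grouping(points, labels, g):
    """Fill in missing or out-of-range labels from the nearest labelled point.

    Assumption: "nearest" means smallest squared Euclidean distance in the
    coordinates passed in (standardize them beforehand if required).
    """
    def valid(v):
        return isinstance(v, int) and 1 <= v <= g

    known = [(p, v) for p, v in zip(points, labels) if valid(v)]
    if {v for _, v in known} != set(range(1, g + 1)):
        raise ValueError("each group needs at least one labelled observation")

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    completed = []
    for p, v in zip(points, labels):
        if valid(v):
            completed.append(v)
        else:
            # Copy the label of the closest observation with a valid label.
            nearest = min(known, key=lambda pv: sqdist(p, pv[0]))
            completed.append(nearest[1])
    return completed


# Usage: None and 0 are treated as missing labels.
points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (4.8, 5.1)]
labels = [1, None, 2, 0]
completed = complete_grouping(points, labels, g=2)

# A fixed seed makes the random grouping repeatable.
first = random_grouping(10, 3, seed=2)
second = random_grouping(10, 3, seed=2)
```

With a fixed seed the two random groupings are identical, which is the same idea as repeating an experiment with a fixed generator.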
If more 'G' variables exist, the CLUSTER module saves as many of the best solutions found, provided that a specification TRIALS=n with n>1 is given. Running several trials is important in more complicated cases where different initial groupings may lead to different solutions.

Other options in CLUSTER:

There are no limits for the size of the data file. The maximum number of variables and groups depends on the available memory. However, it is seldom reasonable to use more than 10-20 variables in one cluster analysis.

To speed up the iterative process, where the data values are scanned several times, CLUSTER saves the active part of the data set in a special file SURVO.CLU on the path of the temporary files (defined by the line tempdisk in SURVO.APU). This file (path) can be replaced by another (on a RAM disk, for example) by giving a specification TEMPFILE=<filename>.

In randomizations for initial groupings, the seed of the random number generator is selected according to the current time. To use a fixed seed (in order to be able to repeat an experiment), a specification of the form SEED=<integer> can be given.

Example: Two samples from a bivariate normal distribution with different means but the same covariance matrix are generated:
................................................................................
FILE CREATE N2,32,10,64,7,100
FIELDS:
1 N 4 X
2 N 4 Y
END

VAR X,Y TO N2
X=if(ORDER<51)then(X1)else(X2)
Y=if(ORDER<51)then(Y1)else(Y2)
X1=Z1   Y1=r*Z1+s*Z2     r=0.8   s=sqrt(1-r*r)
X2=Z1+2 Y2=r*Z1+s*Z2-2
Z1=probit(rnd(2)) Z2=probit(rnd(2))
................................................................................
VAR G1:1,G2:1,G3:1 TO N2
G1=0 G2=0 G3=0
................................................................................
The CLUSTER operation with 10 trials, 2 groups, and seed 2 for the random number generator gives two different solutions:

MASK=AAGGG TRIALS=10 GROUPS=2 SEED=2
CLUSTER N2,CUR+1
Stepwise cluster analysis by Wilks' Lambda criterion
Data N2 N=100
Variables: X, Y
Best clusterings found in 10 trials are saved as follows:
 Lambda     freq  Grouping var
 0.04496     6    G1
 0.14945     4    G2
................................................................................
The result can be checked by plotting the graph:

GPLOT N2,X,Y / HEADER=Samples_from_bivariate_normal_distributions
POINT=G1     (G2 gives the inferior clustering)
................................................................................
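To make the Lambda column of the example more concrete, the following Python sketch evaluates Wilks' lambda for a given grouping using the standard definition det(W)/det(T), where W is the pooled within-group and T the total sum-of-squares-and-products matrix. It is only an illustration: it does not implement Korhonen's stepwise procedure, the function names are hypothetical, and Python's random number generator differs from Survo's rnd(2), so the values will not reproduce 0.04496. The data are generated in the same way as in the example above, with rng.gauss standing in for probit(rnd(2)).

```python
import random


def scatter(points):
    """Sums of squares and products about the mean: (sxx, sxy, syy)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points)
    syy = sum((p[1] - my) ** 2 for p in points)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points)
    return sxx, sxy, syy


def wilks_lambda(points, groups):
    """det(W) / det(T) for two variables and any number of groups."""
    wxx = wxy = wyy = 0.0
    for g in set(groups):
        members = [p for p, k in zip(points, groups) if k == g]
        sxx, sxy, syy = scatter(members)
        wxx += sxx
        wxy += sxy
        wyy += syy
    txx, txy, tyy = scatter(points)
    return (wxx * wyy - wxy ** 2) / (txx * tyy - txy ** 2)


rng = random.Random(2)  # fixed seed so that the run can be repeated
r = 0.8
s = (1 - r * r) ** 0.5

points, true_groups = [], []
for order in range(1, 101):
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    if order < 51:
        points.append((z1, r * z1 + s * z2))          # X1, Y1
        true_groups.append(1)
    else:
        points.append((z1 + 2, r * z1 + s * z2 - 2))  # X2, Y2
        true_groups.append(2)

# A random grouping, as used for the initial state when no 'I' variable exists.
random_groups = [rng.randint(1, 2) for _ in points]

lambda_true = wilks_lambda(points, true_groups)
lambda_random = wilks_lambda(points, random_groups)
print(lambda_true, lambda_random)
```

The grouping by sample of origin gives a lambda far below that of the random grouping, since small values indicate well-separated groups. CLUSTER searches for the grouping that minimizes lambda, so its best solution need not coincide with the grouping by sample of origin.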