Help System (web edition)

CLUSTER <data_file>,L
performs cluster analysis on the selected variables and observations
of the Survo <data_file>.
The clustering criterion is Wilks' lambda and the stepwise procedure
for efficient computation of lambda values is based on the algorithm
presented by Pekka Korhonen in his doctoral dissertation "A stepwise
procedure for multivariate clustering", Computing Centre, University
of Helsinki, Research Reports N:o 7, Helsinki 1979.
In the CLUSTER module, the dual procedure of Korhonen's stepwise
method is applied.
For general information on cluster analysis, see e.g.
M.R.Anderberg: "Cluster Analysis for Applications", Academic Press,
New York and London, 1973.
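
As a reminder of the criterion: Wilks' lambda for a grouping is
det(W)/det(T), where W is the pooled within-groups matrix and T the total
matrix of sums of squares and cross products about the means. Smaller
values indicate better separated clusters. The Python sketch below
evaluates lambda directly for a given grouping. It is only an
illustration and not the stepwise updating scheme used by CLUSTER; the
function names (sscp, det, wilks_lambda) and the data are made up for
this example.

```python
def sscp(rows):
    """Sums of squares and cross products about the column means."""
    n, p = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(p)]
    m = [[0.0] * p for _ in range(p)]
    for r in rows:
        d = [r[j] - mean[j] for j in range(p)]
        for a in range(p):
            for b in range(p):
                m[a][b] += d[a] * d[b]
    return m


def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    value = 1.0
    for i in range(n):
        pivot = max(range(i, n), key=lambda k: abs(a[k][i]))
        if a[pivot][i] == 0.0:
            return 0.0
        if pivot != i:
            a[i], a[pivot] = a[pivot], a[i]
            value = -value
        value *= a[i][i]
        for k in range(i + 1, n):
            factor = a[k][i] / a[i][i]
            for j in range(i, n):
                a[k][j] -= factor * a[i][j]
    return value


def wilks_lambda(rows, groups):
    """det(W) / det(T) for observations `rows` and group labels `groups`."""
    p = len(rows[0])
    within = [[0.0] * p for _ in range(p)]
    for g in set(groups):
        members = [r for r, lab in zip(rows, groups) if lab == g]
        w = sscp(members)
        for a in range(p):
            for b in range(p):
                within[a][b] += w[a][b]
    return det(within) / det(sscp(rows))


# Two well separated squares of four points each (made-up data).
data = [(0, 0), (1, 0), (0, 1), (1, 1),
        (10, 10), (11, 10), (10, 11), (11, 11)]
good = wilks_lambda(data, [1, 1, 1, 1, 2, 2, 2, 2])
bad = wilks_lambda(data, [1, 2, 1, 2, 1, 2, 1, 2])
print(good < bad)  # the grouping that matches the squares has the smaller lambda
```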

The active observations of <data_file> are defined by IND and CASES
specifications.
The variables used in the analysis are all active variables in
<data_file>, except those activated by 'G' or 'I'.

The stepwise clustering procedure is always based on some initial
grouping of observations. The user has to give the number (g) of clusters
by the GROUPS=g specification. GROUPS=2 is the default.
The initial grouping is given by a variable activated by 'I' and the
values of this variable must be integers 1,2,...,g. If the initial
grouping of observations is not given (no mask 'I' exists), a random
initial grouping based on uniform distribution over 1,2,...,g is applied
automatically.
The initial grouping (defined by the 'I' mask variable) can also be
incomplete (with missing values or values outside the permitted ones
1,2,...,g). In this case it is assumed that the user has assigned
at least one observation to each group. The remaining observations
are then assigned to groups by the "nearest neighbour"
principle in the standardized data matrix X (with the property X'X=I).
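
The two ways of obtaining an initial grouping can be sketched in Python
as follows. This is a hedged illustration and not the CLUSTER source: it
assumes that the data are centered before standardization, that X'X=I is
achieved through a Cholesky factor of the cross-product matrix, and that
each unlabelled observation simply takes the group of its nearest
labelled observation. All function names and the data are hypothetical.

```python
import math
import random


def random_grouping(n, g, seed=None):
    """Random initial grouping, uniform over 1,2,...,g."""
    rng = random.Random(seed)
    return [rng.randint(1, g) for _ in range(n)]


def cholesky(s):
    """Lower triangular L with L L' = s (s must be positive definite)."""
    p = len(s)
    low = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            acc = s[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(acc) if i == j else acc / low[j][j]
    return low


def standardize(rows):
    """Center the columns and transform so that X'X = I (assumed scheme)."""
    n, p = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - mean[j] for j in range(p)] for r in rows]
    s = [[sum(r[a] * r[b] for r in centered) for b in range(p)]
         for a in range(p)]
    low = cholesky(s)
    out = []
    for r in centered:
        z = []  # forward substitution: solve low * z = r
        for i in range(p):
            z.append((r[i] - sum(low[i][k] * z[k] for k in range(i)))
                     / low[i][i])
        out.append(z)
    return out


def complete_grouping(rows, labels, g):
    """Fill in labels that are missing or outside 1..g by nearest neighbour."""
    z = standardize(rows)
    known = {i for i, lab in enumerate(labels) if lab in range(1, g + 1)}
    result = []
    for i, lab in enumerate(labels):
        if i in known:
            result.append(lab)
            continue
        nearest = min(known, key=lambda j: sum((a - b) ** 2
                                               for a, b in zip(z[i], z[j])))
        result.append(labels[nearest])
    return result


# Made-up data: one labelled observation per group, the rest missing or invalid.
points = [(0, 0), (1, 0), (0, 1), (1, 1),
          (10, 10), (11, 10), (10, 11), (11, 11)]
partial = [1, None, None, None, 2, None, 0, None]
print(complete_grouping(points, partial, 2))
```

With GROUPS=2 and no 'I' variable, random_grouping(n, 2) plays the role
of the automatic random start; passing a fixed seed makes it repeatable.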

The main result of CLUSTER is the optimal clustering according to
the Wilks' lambda criterion. It is saved in the first variable
of <data_file> activated by 'G'.
If several 'G' variables exist, CLUSTER saves as many of the best
solutions found as there are 'G' variables, provided that a
specification TRIALS=n with n>1 is given.
The possibility for several trials is important in more complicated
cases where different initial groupings may lead to different solutions.
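
The bookkeeping over several trials can be pictured with the following
Python sketch. It assumes that every trial yields a pair (lambda,
grouping), that two groupings differing only in the numbering of the
groups count as the same solution, and that solutions are ranked by
ascending lambda. How CLUSTER itself compares solutions is not
documented here, so the functions canonical and best_solutions and the
trial values are purely illustrative.

```python
from collections import Counter


def canonical(grouping):
    """Renumber groups in order of first appearance: (2,2,1) -> (1,1,2)."""
    relabel = {}
    out = []
    for lab in grouping:
        if lab not in relabel:
            relabel[lab] = len(relabel) + 1
        out.append(relabel[lab])
    return tuple(out)


def best_solutions(trials, n_saved):
    """Return up to n_saved distinct solutions as (lambda, freq, grouping)."""
    freq = Counter()
    lam_of = {}
    for lam, grouping in trials:
        key = canonical(grouping)
        freq[key] += 1
        lam_of[key] = lam
    ranked = sorted(freq, key=lambda k: lam_of[k])  # smallest lambda first
    return [(lam_of[k], freq[k], list(k)) for k in ranked[:n_saved]]


# Five hypothetical trials on four observations; three 'G' variables available.
trials = [
    (0.5, [1, 1, 2, 2]),
    (0.2, [1, 2, 1, 2]),
    (0.5, [2, 2, 1, 1]),  # same as the first trial up to relabelling
    (0.2, [2, 1, 2, 1]),
    (0.2, [1, 2, 1, 2]),
]
for lam, count, grouping in best_solutions(trials, 3):
    print(lam, count, grouping)
```

Only two distinct solutions occur in these made-up trials, so only two
of the three available result variables would be filled.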

Other options in CLUSTER:
There is no limit on the size of the data file. The maximum number
of variables and groups depends on the available memory.
However, it is seldom reasonable to use more than 10-20 variables in one
cluster analysis.
To speed up the iterative process where the data values are scanned
several times, CLUSTER saves the active part of the data set in a
special file SURVO.CLU on the path of the temporary files (defined
by the line tempdisk in SURVO.APU).
This file (path) can be replaced by another (on a RAM disk, for example)
by giving a specification TEMPFILE=<filename>.
In randomizations for initial groupings, the seed of the random
number generator is selected according to the current time.
To use a fixed seed (so that an experiment can be repeated),
a specification of the form SEED=<integer> can be given.

Example:
  Two samples from a bivariate normal distribution with different means
  but the same covariance matrix are generated:
................................................................................
FILE CREATE N2,32,10,64,7,100 
FIELDS:
1 N 4 X
2 N 4 Y
END

VAR X,Y TO N2 
X=if(ORDER<51)then(X1)else(X2) Y=if(ORDER<51)then(Y1)else(Y2)
X1=Z1     Y1=r*Z1+s*Z2      r=0.8 s=sqrt(1-r*r)
X2=Z1+2   Y2=r*Z1+s*Z2-2
Z1=probit(rnd(2)) Z2=probit(rnd(2))
................................................................................
VAR G1:1,G2:1,G3:1 TO N2 
  G1=0 G2=0 G3=0
................................................................................
  The CLUSTER operation with 10 trials, 2 groups, and random number
  seed 2 gives two different solutions:

MASK=AAGGG    TRIALS=10  GROUPS=2 SEED=2
CLUSTER N2,CUR+1 
Stepwise cluster analysis by Wilks' Lambda criterion
Data N2  N=100
Variables: X, Y
Best clusterings found in 10 trials are saved as follows:
 Lambda     freq  Grouping var
 0.04496      6   G1
 0.14945      4   G2

................................................................................
The result can be checked by plotting the graph:
GPLOT N2,X,Y / HEADER=Samples_from_bivariate_normal_distributions
POINT=G1  (G2 gives the inferior clustering)
................................................................................