Merkkijonomuuttujien muunnokset

[viesti Survo-keskustelupalstalla (2001-2013)]

Kirjoittaja: Kimmo Vehkalahti
Sähköposti:    -
Päiväys: 29.5.2004 16:12

Survo-Käyttäjäyhdistyksen uuden hallituksen kokouksessa 26.5.2004,
jossa minulla oli ilo olla mukana kutsuttuna ulkopuolisena jäsenenä,
keskusteltiin monista kiinnostavista aiheista. Yhdessä niistä puhuttiin
merkkijonomuuttujien muunnoksista ja käsittelystä yleensäkin.

Survon help-systeemissä on kuvattu VAR-operaation merkkijonokäsittelyt
varsin tiiviissä muodossa, joka ehkä aukenisi monille paremmin jos eri
kohdat avaisi ja selostaisi yksityiskohtaisemmin. Päätinpä siis pohtia
millaisen |EXAMPLE|-sukron tuosta kohdasta voisi saada.

Tulin sitten aloittaneeksi sukron hahmottelun toimituskenttään tehdyllä
tarinalla. Muutaman tunnin kirjoitettuani huomasin että siitä tuli aika
pitkä juttu. Paloittelin jutun sopiviin osiin, mutta ennen kuin alan
muuntaa sitä lainkaan sukromuotoon, ajattelin laittaa sen kaikkien
kiinnostuneiden luettavaksi ja kokeiltavaksi. Kaikki kommentit ovat
tervetulleita! Pidemmittä puheitta, let's go:

 Transformation of string variables
 ----------------------------------

String variables can contain any characters (numbers, letters, spaces).
If a string variable holds numerical information, it can be treated
like numerical variables in numerical transformations (by the usual VAR
operations). Sometimes it is advantageous to store numbers in string
variables, because they might take less space in the data file, and
then all kinds of transformations can be applied according to needs.

In some applications it is important to keep the leading zeros in the
numbers, e.g., "00980". If these kind of codes are stored in a
numerical variable, the leading zeros will disappear ("980"). A good
solution is to store the codes in a string variable.

In the following, we demonstrate the essentials of transforming string
variables by seven short examples.

Let us first create a small data file consisting of two string
variables, Code1 and Code2:

FILE CREATE CODES
FIELDS:
 1 S 6 Code1
 2 S 8 Code2
END

Code1 will include dates in the form DDMMYY, where DD refers to days,
MM to months, and YY to years, each of them with two digits.
Code2 will include years presented with four digits and preceded by
a sequence of 1-3 capital letters and a slash ('/').
Here we have four records of information:

DATA C
Code1   Code2
281037  AB/1967
170667  A/1978
120659  ACD/1977
190939  D/1989

We now copy these data to the data file created above:

FILE COPY C TO CODES

Our aim is to make various transformations for the variables by using
VAR operations designed especially for the string variables.

As said above, a string variable can be treated like a numerical
variable, if it holds numerical information. (Please note that many
other statistical programs are more stringent in this respect, as
they do not allow any numerical operations on string variables.)

In Survo, we could compute logarithms of the Code1 variable, without
the need of copying the codes to a numerical variable:

VAR LCode=log(Code1)

The transformation is successful (although the transformed values have
of course no reasonable interpretation in this case):

FILE LOAD CODES
DATA CODES*,A,B,C
 Code1  Code2          LCode
 281037 AB/1967       12.546
 170667 A/1978        12.047
 120659 ACD/1977      11.701
 190939 D/1989        12.160

The new variable LCode is of numerical type. However, we do not need
it later, and hence using the command

FILE REDUCE CODES,2

we remove it from the data. After that, we expand the data file by

FILE EXPAND CODES

to allow more room for new variables needed in the examples below.

Example 1: putting strings together
-----------------------------------

The first actual challenge is to extract the years from Code1 and
represent them with four digits. We assume that the years refer to
the 20th century, i.e., they should be 1937, 1967, 1959, and 1939.
Hence, we need a four-digit string variable, which must be first
created. A good convention is to initialize a new variable (either
string or numerical) with missing values before transformations
(this applies to numerical variables as well):

VAR Year:S4=MISSING

The transformation is then obtained as

VAR str(Year)=str(Code1,5)

where "str()" denotes a transformation of a string variable. Here we
have str() on both sides of the definition. The left side merely refers
to the newly created variable Year, but the expression on the right
side also includes a position (5), from where the variable (Code1) is
to be considered. Now, only the digits 5 and 6 of Code1 will be taken
into account, since Code1 includes a total of 6 digits, or characters.
Those digits are namely the years 37, 67, 59, and 39. We check the
result of the above transformation:

FILE LOAD CODES
DATA CODES*,D,E,F
 Code1  Code2    Year
 281037 AB/1967  37
 170667 A/1978   67
 120659 ACD/1977 59
 190939 D/1989   39

Note that the values of the Year variable are not aligned to the right
like numerical variables (remember that Year includes four characters).
To represent the years with four digits, we add a constant string "19"
to each year by modifying the above created Year variable:

VAR str(Year)="19"&str(Year)

The constant strings are given in quotes. They can not include spaces
or commas. If they are needed, they must be first replaced by other
characters, e.g., spaces are often replaced by underscores ('_').

The character '&' concatenates (puts together) two strings. They can
be either constant strings (as "19" above) or values of variables
(as "str(Year)" above). We check the result:

FILE LOAD CODES
DATA CODES*,G,H,I
 Code1  Code2    Year
 281037 AB/1967  1937
 170667 A/1978   1967
 120659 ACD/1977 1959
 190939 D/1989   1939

We could have done the transformation in one step by

VAR str(Year)="19"&str(Code1,5)

which simply combines the two steps shown above.

Example 2: combining partial strings from specified positions
-------------------------------------------------------------

In the following example, we are going to extract the years from the
variable Code2, together with the capital letters but changing both
the order and the delimiting character. For example, we would like to
change the value "AB/1967" to "1967:AB".

Let us begin by creating a new variable Code3 similarly than above:

VAR Code3:S8=MISSING

We remind that the data we are transforming now have this form:

FILE LOAD CODES
DATA CODES*,J,K,L
 Code1  Code2    Year Code3
 281037 AB/1967  1937 -
 170667 A/1978   1967 -
 120659 ACD/1977 1959 -
 190939 D/1989   1939 -

The years in Code2 do not begin from the same position, as the number
of letters varies between the records. Luckily, there is a delimiter
between the letters and the years: the slash character ('/'). We must
find the position of the slash for each record. Then we can point the
beginnings of the years and the ends of the letters. In other words,
we can split Code2 into two parts and reorder them as required.

Finding the position is achieved by an auxiliary function pos(), which
is best used by defining a short name for the expression. In this case,
we simply define slash=pos(Code2,/) and write the transformation as

VAR str(Code3)=str(Code2,slash+1,4)&":"&str(Code2,1,slash-1)

FILE LOAD CODES / the result is seen immediately:
DATA CODES*,M,N,O
 Code1  Code2    Year Code3
 281037 AB/1967  1937 1967:AB
 170667 A/1978   1967 1978:A
 120659 ACD/1977 1959 1977:ACD
 190939 D/1989   1939 1989:D

Here, we apply the str() functions in a more general form, where also
the lengths of the partial strings are specified. Without the length,
the rest of the string will be copied. In the first part (the year)
the length is 4. If it were not specified, extra space would be copied
to Code3 for all records where the length of Code2 is less than 8 (see
Example 3 below.)

The new delimiter colon (':') is put in place by a constant string ":".
The letters are concatenated to the strings by another str() function,
where the length is now offered by slash-1, i.e., one less than the
position of the slash, since the letters begin from the first position.

Example 3: removing trailing blanks from the strings
----------------------------------------------------

In this example we briefly demonstrate the effect of the trailing
blanks in the strings by removing the length parameter (4) from the
above transformation (see Example 2):

VAR str(Code3)=str(Code2,slash+1,4)&":"&str(Code2,1,slash-1)
is now changed to:
VAR str(Code3)=str(Code2,slash+1)&":"&str(Code2,1,slash-1)

The result is unacceptable, because the trailing blanks of Code2 appear
in the middle of Code3 making it even more difficult to handle in any
subsequent operations:

FILE LOAD CODES
DATA CODES*,P,Q,R
 Code1  Code2    Year Code3
 281037 AB/1967  1937 1967 :AB
 170667 A/1978   1967 1978  :A
 120659 ACD/1977 1959 1977:ACD
 190939 D/1989   1939 1989  :D

In addition to using the length parameter (as in Example 2), there is
another way to solve the problem of trailing blanks. The solution is
to use a different character for the concatenating. The character (or
connector) '|' (instead of '&') will remove the trailing blanks:

VAR str(Code3)=str(Code2,slash+1)&":"&str(Code2,1,slash-1)
is now changed to:
VAR str(Code3)=str(Code2,slash+1)|":"&str(Code2,1,slash-1)

But, as we see from the data, something goes wrong:

FILE LOAD CODES
DATA CODES*,S,T,U
 Code1  Code2    Year Code3
 281037 AB/1967  1937 1967:ABB
 170667 A/1978   1967 1978:A:A
 120659 ACD/1977 1959 1977:ACD
 190939 D/1989   1939 1989:D:D

There are too many delimiters and letters in Code3! But, the error is
not in the expression itself. It is caused by the old values of the
Code3 variable. It is important to clear the result variable before any
new transformations; otherwise the old values will retain persistently.
Hence, let us re-initialize Code3 and make the transformation again:

VAR Code3=MISSING
VAR str(Code3)=str(Code2,slash+1)|":"&str(Code2,1,slash-1)

FILE LOAD CODES
DATA CODES*,V,W,X
 Code1  Code2    Year Code3
 281037 AB/1967  1937 1967:AB
 170667 A/1978   1967 1978:A
 120659 ACD/1977 1959 1977:ACD
 190939 D/1989   1939 1989:D

Thus, we have the same result as in the previous example. Sometimes
the length of a partial string might be difficult to specify, but
using the connector '|' might help. Most often '|' is used to put
together several strings to form a new string without any spaces.

Example 4: computing lengths of strings
---------------------------------------

We can compute the length of a string as a new variable in the data
by using the pos() function (introduced in Example 2) as

VAR len:1=pos(Code3,sp)-1

where sp (or space) is the notation for the 'space' character.
We now have the data as follows:

FILE LOAD CODES
DATA CODES*,Y,Z,[
 Code1  Code2    Year Code3    len
 281037 AB/1967  1937 1967:AB    7
 170667 A/1978   1967 1978:A     6
 120659 ACD/1977 1959 1977:ACD   8
 190939 D/1989   1939 1989:D     6

Similarly the comma (',') is denoted by word comma.

See also the specifications STR_SPACE and STR_COMMA in the FILE LOAD
operation (FLOAD?).

Example 5: conditional statements in string variables
-----------------------------------------------------

We now turn into conditional statements. Our task is to find records
where Code2 begins with the letter 'D' and indicate those records with
a new variable. Let us call this variable D with values 0 (no D in the
first position) and 1 (D in the first position). The data are here:

FILE LOAD CODES
DATA CODES*,\,],^
 Code1  Code2    Year Code3    len
 281037 AB/1967  1937 1967:AB    7
 170667 A/1978   1967 1978:A     6
 120659 ACD/1977 1959 1977:ACD   8
 190939 D/1989   1939 1989:D     6

Using str() function, we write a conditional statement:

VAR D:1=if(str(Code2,1,1)="D")then(1)else(0)

(We can write the transformation without a separate initialization of
the new variable, because D is of numerical type, and each record will
definitely get a value of 0 or 1.)

Obviously, the only record which fulfills the condition is the last
one, and thus we have the following result:

FILE LOAD CODES
DATA CODES*,`,a,b
 Code1  Code2    Year Code3    len   D
 281037 AB/1967  1937 1967:AB    7   0
 170667 A/1978   1967 1978:A     6   0
 120659 ACD/1977 1959 1977:ACD   8   0
 190939 D/1989   1939 1989:D     6   1

Another conditional statement would be, for example

VAR A:1=if(str(Code1,6,1)<=str(Code3,4,1))then(1)else(0)

where the sixth character of Code1 is compared with the fourth
character of Code3 in each record. Here, they happen to be numbers,
but also letters can be compared. All records except the third one
fulfill the condition, and thus we have:

FILE LOAD CODES
DATA CODES*,c,d,e
 Code1  Code2    Year Code3    len   D   A
 281037 AB/1967  1937 1967:AB    7   0   1
 170667 A/1978   1967 1978:A     6   0   1
 120659 ACD/1977 1959 1977:ACD   8   0   0
 190939 D/1989   1939 1989:D     6   1   1

To express the results as character strings according to the values
of A, we could use, for instance

VAR B:S2=MISSING / with
................
VAR str(B)="le" / IND=A,1 ("le" means "less or equal")
VAR str(B)="gt" / IND=A,0 ("gt" means "greater")
.................................................
(The border lines are necessary because of the IND specifications.)

FILE LOAD CODES
DATA CODES*,f,g,h
 Code1  Code2    Year Code3    len   D   A B
 281037 AB/1967  1937 1967:AB    7   0   1 le
 170667 A/1978   1967 1978:A     6   0   1 le
 120659 ACD/1977 1959 1977:ACD   8   0   0 gt
 190939 D/1989   1939 1989:D     6   1   1 le

The general tool for classification of both numerical and string
variables is the CLASSIFY operation (see CLASSIFY?).

Example 6: lags and leads in string variables
---------------------------------------------

Finally, we consider lags and leads, i.e., the values of the records
before of after the current one. The notation [-1], [+1], [+2], ...
is the same as in all VAR operations. To proceed, we add one new empty
observation to the data by

FILE INIT CODES 1

and then create a new string variable LY for two-digit values of years:

VAR LY:S2=MISSING

Lags and leads can be used in the str() function on both sides of the
definition. For example,

VAR str(LY[+1])=str(Year,3,2)

generates the following:

FILE LOAD CODES
DATA CODES*,i,j,k
 Code1  Code2    Year Code3    len   D   A B  LY
 281037 AB/1967  1937 1967:AB    7   0   1 le -
 170667 A/1978   1967 1978:A     6   0   1 le 37
 120659 ACD/1977 1959 1977:ACD   8   0   0 gt 67
 190939 D/1989   1939 1989:D     6   1   1 le 59
 -      -        -    -          -   -   - -  39

An equivalent way to achieve the result is

VAR str(LY)=str(Year[-1],3,2)

which is seen from

FILE LOAD CODES
DATA CODES*,l,m,n
 Code1  Code2    Year Code3    len   D   A B  LY
 281037 AB/1967  1937 1967:AB    7   0   1 le -
 170667 A/1978   1967 1978:A     6   0   1 le 37
 120659 ACD/1977 1959 1977:ACD   8   0   0 gt 67
 190939 D/1989   1939 1989:D     6   1   1 le 59
 -      -        -    -          -   -   - -  39

.....................................................................

Example 7: transformations of dates by the DATE operation
---------------------------------------------------------

In the previous examples, Code1 included dates in the format DDMMYY.
Although the dates can be transformed by str() and pos() functions,
there is also an operation DATE for manipulating dates in various
formats by number of ways. We close this demonstration with a brief
example of its usage. For more information, see DATE?

For example, to find the weekdays of the dates in Code1, and reformat
the dates completely, we could proceed as follows. First, we expand
the datafile again by

FILE EXPAND CODES

to make some more room, and then create two new variables for the
results, both of string type:

VAR Day:S3,Date:S11 / Day=MISSING Date=MISSING

We specify the roles of the variables by suitable masks given in
the VARS specification: VARS=Code1(D),Day(a),Date(d)
(of course, MASK specification could be applied as well).
The mask letters are explained in the help system, see DATE?

Then, we specify the input and output formats of the dates with
specifications IDATE=DDMMYY and ODATE=YYYYMMMDD together with the
desired output date delimiters, say ODEL1=: and ODEL2=- .
Finally, since we have assumed that the dates refer to the 20th
century, we must set CENTURY=1900 (by default it is equal to 2000).

All specifications needed in this case are now set, and we can
activate the DATE operation simply by

DATE CODES

and look at the results:

FILE LOAD CODES CUR+2 / VARS=ALL

DATA CODES*,o,p,q
 Code1  Code2    Year Code3    len   D   A B  LY Day Date
 281037 AB/1967  1937 1967:AB    7   0   1 le -  Thu 1937:Oct-28
 170667 A/1978   1967 1978:A     6   0   1 le 37 Sat 1967:Jun-17
 120659 ACD/1977 1959 1977:ACD   8   0   0 gt 67 Fri 1959:Jun-12
 190939 D/1989   1939 1989:D     6   1   1 le 59 Tue 1939:Sep-19
 -      -        -    -          -   -   - -  39 -   -


That was the end of the examples of transformation of string variables.

DATE / Saturday May 29 2004 16:10:09 Week=22 Day=150

Vastaukset:
[ei vastauksia]

Survo-keskustelupalstan (2001-2013) viestit arkistoitiin aika ajoin sukrolla, joka automaattisesti rakensi viesteistä (yli 1600 kpl) HTML-muotoisen sivukokonaisuuden. Vuoden 2013 alusta Survo-keskustelua on jatkettu entistäkin aktiivisemmin osoitteessa forum.survo.fi. Tervetuloa mukaan!

Etusivu  |  Keskustelu
Copyright © Survo Systems 2001-2013. All rights reserved.
Updated 2013-06-15.