[viesti Survo-keskustelupalstalla (2001-2013)]
Kirjoittaja: | Kimmo Vehkalahti |
---|---|
Sähköposti: | - |
Päiväys: | 29.5.2004 16:12 |
Survo-Käyttäjäyhdistyksen uuden hallituksen kokouksessa 26.5.2004, jossa minulla oli ilo olla mukana kutsuttuna ulkopuolisena jäsenenä, keskusteltiin monista kiinnostavista aiheista. Yhdessä niistä puhuttiin merkkijonomuuttujien muunnoksista ja käsittelystä yleensäkin. Survon help-systeemissä on kuvattu VAR-operaation merkkijonokäsittelyt varsin tiiviissä muodossa, joka ehkä aukenisi monille paremmin jos eri kohdat avaisi ja selostaisi yksityiskohtaisemmin. Päätinpä siis pohtia millaisen |EXAMPLE|-sukron tuosta kohdasta voisi saada. Tulin sitten aloittaneeksi sukron hahmottelun toimituskenttään tehdyllä tarinalla. Muutaman tunnin kirjoitettuani huomasin että siitä tuli aika pitkä juttu. Paloittelin jutun sopiviin osiin, mutta ennen kuin alan muuntaa sitä lainkaan sukromuotoon, ajattelin laittaa sen kaikkien kiinnostuneiden luettavaksi ja kokeiltavaksi. Kaikki kommentit ovat tervetulleita! Pidemmittä puheitta, let's go: Transformation of string variables ---------------------------------- String variables can contain any characters (numbers, letters, spaces). If a string variable holds numerical information, it can be treated like numerical variables in numerical transformations (by the usual VAR operations). Sometimes it is advantageous to store numbers in string variables, because they might take less space in the data file, and then all kinds of transformations can be applied according to needs. In some applications it is important to keep the leading zeros in the numbers, e.g., "00980". If these kind of codes are stored in a numerical variable, the leading zeros will disappear ("980"). A good solution is to store the codes in a string variable. In the following, we demonstrate the essentials of transforming string variables by seven short examples. Let us first create a small data file consisting of two string variables, Code1 and Code2: FILE CREATE CODES FIELDS: 1 S 6 Code1 2 S 8 Code2 END Code1 will include dates in the form DDMMYY, where DD refers to days, MM to months, and YY to years, each of them with two digits. Code2 will include years presented with four digits and preceded by a sequence of 1-3 capital letters and a slash ('/'). Here we have four records of information: DATA C Code1 Code2 281037 AB/1967 170667 A/1978 120659 ACD/1977 190939 D/1989 We now copy these data to the data file created above: FILE COPY C TO CODES Our aim is to make various transformations for the variables by using VAR operations designed especially for the string variables. As said above, a string variable can be treated like a numerical variable, if it holds numerical information. (Please note that many other statistical programs are more stringent in this respect, as they do not allow any numerical operations on string variables.) In Survo, we could compute logarithms of the Code1 variable, without the need of copying the codes to a numerical variable: VAR LCode=log(Code1) The transformation is successful (although the transformed values have of course no reasonable interpretation in this case): FILE LOAD CODES DATA CODES*,A,B,C Code1 Code2 LCode 281037 AB/1967 12.546 170667 A/1978 12.047 120659 ACD/1977 11.701 190939 D/1989 12.160 The new variable LCode is of numerical type. However, we do not need it later, and hence using the command FILE REDUCE CODES,2 we remove it from the data. After that, we expand the data file by FILE EXPAND CODES to allow more room for new variables needed in the examples below. Example 1: putting strings together ----------------------------------- The first actual challenge is to extract the years from Code1 and represent them with four digits. We assume that the years refer to the 20th century, i.e., they should be 1937, 1967, 1959, and 1939. Hence, we need a four-digit string variable, which must be first created. A good convention is to initialize a new variable (either string or numerical) with missing values before transformations (this applies to numerical variables as well): VAR Year:S4=MISSING The transformation is then obtained as VAR str(Year)=str(Code1,5) where "str()" denotes a transformation of a string variable. Here we have str() on both sides of the definition. The left side merely refers to the newly created variable Year, but the expression on the right side also includes a position (5), from where the variable (Code1) is to be considered. Now, only the digits 5 and 6 of Code1 will be taken into account, since Code1 includes a total of 6 digits, or characters. Those digits are namely the years 37, 67, 59, and 39. We check the result of the above transformation: FILE LOAD CODES DATA CODES*,D,E,F Code1 Code2 Year 281037 AB/1967 37 170667 A/1978 67 120659 ACD/1977 59 190939 D/1989 39 Note that the values of the Year variable are not aligned to the right like numerical variables (remember that Year includes four characters). To represent the years with four digits, we add a constant string "19" to each year by modifying the above created Year variable: VAR str(Year)="19"&str(Year) The constant strings are given in quotes. They can not include spaces or commas. If they are needed, they must be first replaced by other characters, e.g., spaces are often replaced by underscores ('_'). The character '&' concatenates (puts together) two strings. They can be either constant strings (as "19" above) or values of variables (as "str(Year)" above). We check the result: FILE LOAD CODES DATA CODES*,G,H,I Code1 Code2 Year 281037 AB/1967 1937 170667 A/1978 1967 120659 ACD/1977 1959 190939 D/1989 1939 We could have done the transformation in one step by VAR str(Year)="19"&str(Code1,5) which simply combines the two steps shown above. Example 2: combining partial strings from specified positions ------------------------------------------------------------- In the following example, we are going to extract the years from the variable Code2, together with the capital letters but changing both the order and the delimiting character. For example, we would like to change the value "AB/1967" to "1967:AB". Let us begin by creating a new variable Code3 similarly than above: VAR Code3:S8=MISSING We remind that the data we are transforming now have this form: FILE LOAD CODES DATA CODES*,J,K,L Code1 Code2 Year Code3 281037 AB/1967 1937 - 170667 A/1978 1967 - 120659 ACD/1977 1959 - 190939 D/1989 1939 - The years in Code2 do not begin from the same position, as the number of letters varies between the records. Luckily, there is a delimiter between the letters and the years: the slash character ('/'). We must find the position of the slash for each record. Then we can point the beginnings of the years and the ends of the letters. In other words, we can split Code2 into two parts and reorder them as required. Finding the position is achieved by an auxiliary function pos(), which is best used by defining a short name for the expression. In this case, we simply define slash=pos(Code2,/) and write the transformation as VAR str(Code3)=str(Code2,slash+1,4)&":"&str(Code2,1,slash-1) FILE LOAD CODES / the result is seen immediately: DATA CODES*,M,N,O Code1 Code2 Year Code3 281037 AB/1967 1937 1967:AB 170667 A/1978 1967 1978:A 120659 ACD/1977 1959 1977:ACD 190939 D/1989 1939 1989:D Here, we apply the str() functions in a more general form, where also the lengths of the partial strings are specified. Without the length, the rest of the string will be copied. In the first part (the year) the length is 4. If it were not specified, extra space would be copied to Code3 for all records where the length of Code2 is less than 8 (see Example 3 below.) The new delimiter colon (':') is put in place by a constant string ":". The letters are concatenated to the strings by another str() function, where the length is now offered by slash-1, i.e., one less than the position of the slash, since the letters begin from the first position. Example 3: removing trailing blanks from the strings ---------------------------------------------------- In this example we briefly demonstrate the effect of the trailing blanks in the strings by removing the length parameter (4) from the above transformation (see Example 2): VAR str(Code3)=str(Code2,slash+1,4)&":"&str(Code2,1,slash-1) is now changed to: VAR str(Code3)=str(Code2,slash+1)&":"&str(Code2,1,slash-1) The result is unacceptable, because the trailing blanks of Code2 appear in the middle of Code3 making it even more difficult to handle in any subsequent operations: FILE LOAD CODES DATA CODES*,P,Q,R Code1 Code2 Year Code3 281037 AB/1967 1937 1967 :AB 170667 A/1978 1967 1978 :A 120659 ACD/1977 1959 1977:ACD 190939 D/1989 1939 1989 :D In addition to using the length parameter (as in Example 2), there is another way to solve the problem of trailing blanks. The solution is to use a different character for the concatenating. The character (or connector) '|' (instead of '&') will remove the trailing blanks: VAR str(Code3)=str(Code2,slash+1)&":"&str(Code2,1,slash-1) is now changed to: VAR str(Code3)=str(Code2,slash+1)|":"&str(Code2,1,slash-1) But, as we see from the data, something goes wrong: FILE LOAD CODES DATA CODES*,S,T,U Code1 Code2 Year Code3 281037 AB/1967 1937 1967:ABB 170667 A/1978 1967 1978:A:A 120659 ACD/1977 1959 1977:ACD 190939 D/1989 1939 1989:D:D There are too many delimiters and letters in Code3! But, the error is not in the expression itself. It is caused by the old values of the Code3 variable. It is important to clear the result variable before any new transformations; otherwise the old values will retain persistently. Hence, let us re-initialize Code3 and make the transformation again: VAR Code3=MISSING VAR str(Code3)=str(Code2,slash+1)|":"&str(Code2,1,slash-1) FILE LOAD CODES DATA CODES*,V,W,X Code1 Code2 Year Code3 281037 AB/1967 1937 1967:AB 170667 A/1978 1967 1978:A 120659 ACD/1977 1959 1977:ACD 190939 D/1989 1939 1989:D Thus, we have the same result as in the previous example. Sometimes the length of a partial string might be difficult to specify, but using the connector '|' might help. Most often '|' is used to put together several strings to form a new string without any spaces. Example 4: computing lengths of strings --------------------------------------- We can compute the length of a string as a new variable in the data by using the pos() function (introduced in Example 2) as VAR len:1=pos(Code3,sp)-1 where sp (or space) is the notation for the 'space' character. We now have the data as follows: FILE LOAD CODES DATA CODES*,Y,Z,[ Code1 Code2 Year Code3 len 281037 AB/1967 1937 1967:AB 7 170667 A/1978 1967 1978:A 6 120659 ACD/1977 1959 1977:ACD 8 190939 D/1989 1939 1989:D 6 Similarly the comma (',') is denoted by word comma. See also the specifications STR_SPACE and STR_COMMA in the FILE LOAD operation (FLOAD?). Example 5: conditional statements in string variables ----------------------------------------------------- We now turn into conditional statements. Our task is to find records where Code2 begins with the letter 'D' and indicate those records with a new variable. Let us call this variable D with values 0 (no D in the first position) and 1 (D in the first position). The data are here: FILE LOAD CODES DATA CODES*,\,],^ Code1 Code2 Year Code3 len 281037 AB/1967 1937 1967:AB 7 170667 A/1978 1967 1978:A 6 120659 ACD/1977 1959 1977:ACD 8 190939 D/1989 1939 1989:D 6 Using str() function, we write a conditional statement: VAR D:1=if(str(Code2,1,1)="D")then(1)else(0) (We can write the transformation without a separate initialization of the new variable, because D is of numerical type, and each record will definitely get a value of 0 or 1.) Obviously, the only record which fulfills the condition is the last one, and thus we have the following result: FILE LOAD CODES DATA CODES*,`,a,b Code1 Code2 Year Code3 len D 281037 AB/1967 1937 1967:AB 7 0 170667 A/1978 1967 1978:A 6 0 120659 ACD/1977 1959 1977:ACD 8 0 190939 D/1989 1939 1989:D 6 1 Another conditional statement would be, for example VAR A:1=if(str(Code1,6,1)<=str(Code3,4,1))then(1)else(0) where the sixth character of Code1 is compared with the fourth character of Code3 in each record. Here, they happen to be numbers, but also letters can be compared. All records except the third one fulfill the condition, and thus we have: FILE LOAD CODES DATA CODES*,c,d,e Code1 Code2 Year Code3 len D A 281037 AB/1967 1937 1967:AB 7 0 1 170667 A/1978 1967 1978:A 6 0 1 120659 ACD/1977 1959 1977:ACD 8 0 0 190939 D/1989 1939 1989:D 6 1 1 To express the results as character strings according to the values of A, we could use, for instance VAR B:S2=MISSING / with ................ VAR str(B)="le" / IND=A,1 ("le" means "less or equal") VAR str(B)="gt" / IND=A,0 ("gt" means "greater") ................................................. (The border lines are necessary because of the IND specifications.) FILE LOAD CODES DATA CODES*,f,g,h Code1 Code2 Year Code3 len D A B 281037 AB/1967 1937 1967:AB 7 0 1 le 170667 A/1978 1967 1978:A 6 0 1 le 120659 ACD/1977 1959 1977:ACD 8 0 0 gt 190939 D/1989 1939 1989:D 6 1 1 le The general tool for classification of both numerical and string variables is the CLASSIFY operation (see CLASSIFY?). Example 6: lags and leads in string variables --------------------------------------------- Finally, we consider lags and leads, i.e., the values of the records before of after the current one. The notation [-1], [+1], [+2], ... is the same as in all VAR operations. To proceed, we add one new empty observation to the data by FILE INIT CODES 1 and then create a new string variable LY for two-digit values of years: VAR LY:S2=MISSING Lags and leads can be used in the str() function on both sides of the definition. For example, VAR str(LY[+1])=str(Year,3,2) generates the following: FILE LOAD CODES DATA CODES*,i,j,k Code1 Code2 Year Code3 len D A B LY 281037 AB/1967 1937 1967:AB 7 0 1 le - 170667 A/1978 1967 1978:A 6 0 1 le 37 120659 ACD/1977 1959 1977:ACD 8 0 0 gt 67 190939 D/1989 1939 1989:D 6 1 1 le 59 - - - - - - - - 39 An equivalent way to achieve the result is VAR str(LY)=str(Year[-1],3,2) which is seen from FILE LOAD CODES DATA CODES*,l,m,n Code1 Code2 Year Code3 len D A B LY 281037 AB/1967 1937 1967:AB 7 0 1 le - 170667 A/1978 1967 1978:A 6 0 1 le 37 120659 ACD/1977 1959 1977:ACD 8 0 0 gt 67 190939 D/1989 1939 1989:D 6 1 1 le 59 - - - - - - - - 39 ..................................................................... Example 7: transformations of dates by the DATE operation --------------------------------------------------------- In the previous examples, Code1 included dates in the format DDMMYY. Although the dates can be transformed by str() and pos() functions, there is also an operation DATE for manipulating dates in various formats by number of ways. We close this demonstration with a brief example of its usage. For more information, see DATE? For example, to find the weekdays of the dates in Code1, and reformat the dates completely, we could proceed as follows. First, we expand the datafile again by FILE EXPAND CODES to make some more room, and then create two new variables for the results, both of string type: VAR Day:S3,Date:S11 / Day=MISSING Date=MISSING We specify the roles of the variables by suitable masks given in the VARS specification: VARS=Code1(D),Day(a),Date(d) (of course, MASK specification could be applied as well). The mask letters are explained in the help system, see DATE? Then, we specify the input and output formats of the dates with specifications IDATE=DDMMYY and ODATE=YYYYMMMDD together with the desired output date delimiters, say ODEL1=: and ODEL2=- . Finally, since we have assumed that the dates refer to the 20th century, we must set CENTURY=1900 (by default it is equal to 2000). All specifications needed in this case are now set, and we can activate the DATE operation simply by DATE CODES and look at the results: FILE LOAD CODES CUR+2 / VARS=ALL DATA CODES*,o,p,q Code1 Code2 Year Code3 len D A B LY Day Date 281037 AB/1967 1937 1967:AB 7 0 1 le - Thu 1937:Oct-28 170667 A/1978 1967 1978:A 6 0 1 le 37 Sat 1967:Jun-17 120659 ACD/1977 1959 1977:ACD 8 0 0 gt 67 Fri 1959:Jun-12 190939 D/1989 1939 1989:D 6 1 1 le 59 Tue 1939:Sep-19 - - - - - - - - 39 - - That was the end of the examples of transformation of string variables. DATE / Saturday May 29 2004 16:10:09 Week=22 Day=150
Vastaukset: |
---|
Survo-keskustelupalstan (2001-2013) viestit arkistoitiin aika ajoin sukrolla, joka automaattisesti rakensi viesteistä (yli 1600 kpl) HTML-muotoisen sivukokonaisuuden. Vuoden 2013 alusta Survo-keskustelua on jatkettu entistäkin aktiivisemmin osoitteessa forum.survo.fi. Tervetuloa mukaan!