Transform Data

3.3 Transform Data

A standard procedure in data analysis is to transform the values of a variable into different values according to a specified formula. A transformation may transform values of an existing variable into a new variable, or overwrite an existing variable. These transformations are the topic of this section.

3.3.1 Transformations by Formula

For example, the values of Salary may be expressed in dollars, but for purposes of display such as in graphs that display the results of the analysis, the values are to be expressed in thousands of dollars. For example, a value of $64,000 becomes $64. Or a variable measured in hours can be expressed in terms of minutes. Or the logarithm of a variable can be analyzed instead of the original measurements. To transform the values of a variable use the lessR function Transform, also expressed in terms of its abbreviation, trans.

56 Edit Data

Transform function: Transform the values of an existing variable.

> mydata <- Transform(Variable=expression with Existing_Variable)

If the variable name on the left side of the equals sign, =, has the same name as the variable on the right side, then the transformed variable replaces the existing variable. If the name of the transformed variable is different, a new variable is created. If the variable to be transformed is not in the mydata data frame, then specify the data frame with the data option.

data option, Section 1.3.5, p. 13
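The difference between creating a new variable and overwriting an existing one can be sketched in base R. This is a minimal illustration only: it uses the base R function transform() rather than the lessR Transform, and the data frame d, its variable names, and its values are made up for the example.

```r
# Hypothetical data, not taken from the Employee data set
d <- data.frame(Hours = c(1.5, 2, 8))

# Different name on the left of the equals sign: a new variable is added
d_new <- transform(d, Minutes = Hours * 60)

# Same name on both sides: the existing variable is overwritten
d_over <- transform(d, Hours = Hours * 60)

names(d_new)    # "Hours" "Minutes"
d_over$Hours    # 90 120 480
```

In both cases the original data frame d is left unchanged until the result is assigned back to it.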

Arithmetic Operators

To construct the expression that defines the transformation, the usual arithmetic operators apply: +, -, *, and /, for addition, subtraction, multiplication, and division, respectively. R uses the caret symbol, ^, for exponentiation. Accordingly, for variable x, x*60 transforms the values of x by multiplying each by 60. And x^2 squares the values of variable x.

As an example, again consider the Employee data.

Employee data set, Figure 1.7, p. 21
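As a quick check of the operators described above, the following sketch applies them to a small vector in base R. The values of x are hypothetical.

```r
x <- c(1, 2, 3)   # hypothetical values

x * 60       # multiply each value by 60: 60 120 180
x ^ 2        # square each value: 1 4 9
x / 2        # divide each value by 2: 0.5 1.0 1.5
x + 10 - 1   # add 10, then subtract 1: 10 11 12
```

Each operator is applied to every value of x in turn, which is why a single expression can transform an entire column of data.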

Scenario: New variable from arithmetic
An existing variable in the Employee data table is Years, the number of years employed at the company. Define a new variable, the number of months employed at the company.

Use the Transform function to create a new variable, the number of months worked.

lessR Input: Arithmetic transformation
> mydata <- Transform(Months=Years*12)

The output includes some data from the data table before and after the transformation to facilitate comparison. First, however, is a brief description of the data, in Listing 3.1.

Number of variables of mydata to transform: 1
Number of cases (rows) of mydata: 37

Listing 3.1 One variable to be transformed over 37 rows of data.

The data values listed before the transformation are just the first six rows of data of the Employee data set. Multiple transformations can be specified in a call to Transform, so a summary of the requested transformations is presented next, here only one in Listing 3.2.

Transformation Summary
----------------------
create new variable: Months = Years * 12

Listing 3.2 The requested transformation.

Then, the first five rows of just the transformed data in the data frame, mydata, are displayed, as shown in Listing 3.3. Now there is a new variable in mydata, Months. For the second row of data the value of Years is missing, so this value is also necessarily missing for Months.

After, First five rows of transformed data for data frame: mydata
-----------------------------------------------------------------
                  Months
Ritchie, Darnell
Wu, James             NA
Hoang, Binh
Jones, Alissa
Downs, Deborah

Listing 3.3 The first five rows of the transformed data for the new variable Months.
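The way a missing value carries through an arithmetic transformation can be reproduced in base R. The sketch below assumes a made-up data frame, emp, with hypothetical values of Years; it does not use the Employee data.

```r
# Hypothetical years of employment; the second value is missing
emp <- data.frame(Years = c(3, NA, 10))

# Arithmetic on a missing value yields a missing value
emp$Months <- emp$Years * 12

emp$Months    # 36 NA 120
```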

Mathematical Functions

Transformations can also be defined from mathematical functions such as standardizing the values of a variable or calculating the logarithms of the values. Common functions for transformations are listed in Table 3.1.

Table 3.1 Some R mathematical functions applied here to the variable x.

Operation                    Usage
round to n decimal digits    round(x, digits=n)
standard or z-score          scale(x)
natural logarithm            log(x)
square root                  sqrt(x)
absolute value               abs(x)
cosine                       cos(x)
sine                         sin(x)
tangent                      tan(x)
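Several of the functions in Table 3.1 can be tried directly at the R prompt. The values below are hypothetical and chosen only so that the results are easy to verify by hand.

```r
x <- c(-4, 0.25, 9)      # hypothetical values

round(2.71828, digits=2) # 2.72
abs(x)                   # 4.00 0.25 9.00
sqrt(abs(x))             # 2.0 0.5 3.0
log(exp(1))              # 1, because log() is the natural logarithm
cos(0)                   # 1 (angles are expressed in radians)
sin(0)                   # 0
tan(0)                   # 0
```

The square root is taken of abs(x) here because sqrt() of a negative number returns NaN with a warning.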

As an example, consider the responses to the 20-item Mach IV scale. As we see in Chapter 11, there are some advantages to considering subsets of Mach IV, such as an Honesty subscale that consists of Items m06, m07, m09 and m10.

Honesty subscale, Section 11.10, p. 273

Standardization of the values of a distribution yields a transformed distribution with a specified mean and a standard deviation. Usually the mean and standard deviation of the standardized distribution are 0 and 1, respectively, yielding what are called z-scores. With a common mean and standard deviation, the transformed values of each of the four variables are more comparable. In the standardized distribution of z-scores, each data value is no longer expressed on an absolute scale from Strongly Disagree to Strongly Agree, but instead in terms of how many standard deviations it is from its own mean.

standardization: Transformation of a variable to have data values with a specified mean and standard deviation, usually 0 and 1.
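To make the definition concrete, a z-score is the distance of a value from its mean, divided by the standard deviation: z = (x - mean) / sd. The sketch below computes z-scores by hand and compares them with the result of the R function scale. The responses in m are hypothetical.

```r
m <- c(0, 1, 2, 3, 4, 5, 5, 4)   # hypothetical item responses

# z-scores from the definition
z_by_hand <- (m - mean(m)) / sd(m)

# scale() returns a one-column matrix, so convert it to a plain vector
z_scale <- as.numeric(scale(m))

all.equal(z_by_hand, z_scale)   # TRUE
mean(z_scale)                   # 0, within rounding error
sd(z_scale)                     # 1
```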


Scenario: Standardize variables
An analyst wishes to standardize the items before summing them, so that the responses to the different items are more comparable before the summation. This is required when the items are from different response scales, but sometimes done even when all the items are answered on the same scale.

Accomplish standardization in R with the R function scale. In the following, all four standardization statements for the four variables are included in a single Transform statement, with successive statements separated by a comma. The names of the newly created standardized variables were chosen to all begin with the letter “z”.

scale function: Standardize the values of a variable.

lessR Input: Transformation with functions
> mydata <- Transform(z06=scale(m06), z07=scale(m07),
                      z09=scale(m09), z10=scale(m10))

Listing 3.4 shows the first four rows of the transformed data.

After, First four rows of transformed data for data frame: mydata
-----------------------------------------------------------------
          z06        z07        z09          z10
1 -1.30348450 -0.8317862 -0.6707109 -0.007512965
2  0.05013402 -0.1528165  0.1948021 -0.007512965
3 -0.62667524  1.8840925  1.0603151 -0.007512965
4 -0.62667524 -0.8317862 -0.6707109  1.750520786

Listing 3.4 Newly created standardized variables.

For the values of a normally distributed variable, approximately 95% of the values are within two standard deviations of the mean. Because a z-score indicates how many standard deviations the original value is from its mean, most z-scores from normal distributions are between −2 and 2. Distributions of responses on a 0 to 5 scale for each item are not necessarily normal, but a general rule is that most z-scores look like the values in Listing 3.4. There are usually few such values larger than 2, or especially 3, or smaller than −2 or −3.

With the z-scores defined, the new scale scores on the Honesty subscale can now be calculated for each of the 351 respondents. A limitation of the Transform statement, however, is that although multiple transformations can be specified in a single call to Transform, a new variable cannot be computed from other variables created in that same call. Accomplish the calculation of the subscale score from the newly created z-score variables in a separate statement.

> mydata <- Transform(Honesty=z06+z07+z09+z10)

All of these transformations only apply to the specified data frame, mydata. Once the R session ends, so does each data frame. To access the transformed data for future R sessions, one possibility is to write the data frame to your computer’s file system as a native R file, with file type .rda, such as Mach4.rda.

> wrt.r("Mach4")

Write function, Section 2.5.1, p. 48

In a subsequent new analysis session, use Read() to browse for the new file and read it into R for additional analysis.

Save R code, Section 1.5, p. 19
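The workflow of this subsection (standardize, sum in a separate step, then save the data frame) can be sketched with base R alone. The following is an illustration under stated assumptions, not the lessR implementation: the data frame items and its values are hypothetical, base R transform() stands in for Transform, and base R save() and load() stand in for wrt.r and Read. Base R transform() evaluates every expression against the original data frame, so it also needs the second, separate call.

```r
# Hypothetical responses of five people to the four items
items <- data.frame(m06 = c(0, 1, 2, 4, 5),
                    m07 = c(1, 1, 3, 4, 5),
                    m09 = c(0, 2, 2, 3, 5),
                    m10 = c(2, 2, 2, 2, 5))

# Step 1: create the standardized variables in one call
items <- transform(items,
                   z06 = as.numeric(scale(m06)), z07 = as.numeric(scale(m07)),
                   z09 = as.numeric(scale(m09)), z10 = as.numeric(scale(m10)))

# Step 2: a separate call, now that the z variables exist in the data frame
items <- transform(items, Honesty = z06 + z07 + z09 + z10)

# Write the data frame to a native R file (a temporary file, for illustration)
f <- file.path(tempdir(), "items.rda")
save(items, file = f)

# Read it back into a separate environment, as a later session might
e <- new.env()
load(f, envir = e)
identical(e$items, items)   # TRUE
```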

3.3.2 Define Categorical Variables as Factors

A potential confusion in data analysis exists between variable type and data storage type. The issue is that the values of a categorical variable, the non-numeric, discrete categories, can be represented in the computer as numeric digits, usually integers. It is often just as easy, and likely less confusing, to use mnemonic alphabetic characters to represent these non-numeric categories. Encoding Gender with an M and an F instead of a 0 and 1, for example, avoids confusion regarding the meaning of each code, and no one will try to compute the mean of a column of M’s and F’s.

data storage type, Section 1.6.3, p. 22
variable type, Section 1.6.3, p. 22

Convert an Integer Variable to a Factor

A common data analysis scenario is to recognize some integer coded variables as categorical variables and then convert these variables to the explicit R storage type for categorical variables, factors.

factor, Section 2.2.2 , p. 34

Scenario: Recognize integer codes as non-numeric categories
The values of Gender in a data file are coded as 0 for Male and 1 for Female. Create a new variable explicitly defined as a categorical variable and provide the labels of Male and Female in place of 0 and 1.

To transform a variable to the R representation of a categorical variable, a factor, the factor function relies upon two different parameters, levels and labels. The levels option specifies the values of the variable before the transformation. The labels option specifies the usually non-numeric names of the levels. To accomplish the creation of the new factor in the data frame of interest, usually mydata, embed the factor statement inside the Transform statement.

factor function: Create or define the properties of a factor variable.
levels option: The original values of a categorical variable.
labels option: The labels of the variable after transformation.

Consider the 0/1 coding of Gender in the Mach IV data set, with 0 for Male and 1 for Female. The goal is to create a factor variable with value labels Male and Female instead of the integers 0 and 1, the levels in this context. The labels are Male and Female.

We have no need to retain the 0/1 coding for subsequent analyses, so the following Transform statement uses the same variable, Gender, on both sides of the equals sign, =, which overwrites the data values. To improve readability and better keep track of the multiple parentheses, write this Transform statement on multiple lines.

lessR Input: Add value labels to a numerically coded categorical variable
> mydata <- Transform(
      Gender=factor(Gender, levels=c(0,1), labels=c("Male","Female"))
  )


The result of this transformation appears in Listing 3.5. The first four values of Gender were originally the integer values 0, 0, 1, 1. Now they are the factor levels, Male, Male, Female, Female. All subsequent analyses will display the new factor levels, Male and Female, on the resulting output, instead of 0 and 1. Accordingly, the factor levels can be thought of as value labels, which label the corresponding categorical values as part of the output of any subsequent data analysis.

value labels: The values of a categorical variable.

After, First four rows of transformed data for data frame: mydata
-----------------------------------------------------------------
  Gender
1   Male
2   Male
3 Female
4 Female

Listing 3.5 Transformed Gender as a factor with new levels Male and Female.
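The same conversion can be tried outside of a data frame with the base R factor function alone. In this sketch the vector gender_code is hypothetical, chosen to mirror the four values described for Listing 3.5.

```r
gender_code <- c(0, 0, 1, 1)   # hypothetical integer codes

# levels: the values before the transformation; labels: their names afterward
gender <- factor(gender_code, levels = c(0, 1), labels = c("Male", "Female"))

as.character(gender)   # "Male" "Male" "Female" "Female"
levels(gender)         # "Male" "Female"
is.numeric(gender)     # FALSE, so R no longer treats the variable as numeric
```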

On subsequent output for data analysis, R displays the output ordered by the levels in the factor statement. For example, the bars of a bar graph would now be displayed with Male listed before Female because the level for Male is listed before that of Female in the previous factor statement. Change this order with the factor function by changing the order of the listed levels in the levels specification.

To define Gender as a factor with Female listed first in the subsequent data analyses, switch the order for both the levels and the labels arguments in the factor statement. With the factor function, the labels must be listed in the same order as their levels. In this situation, the levels are the corresponding integer codes as read from the data file.

> mydata <- Transform(
      Gender=factor(Gender, levels=c(1,0), labels=c("Female","Male"))
  )

On a bar graph, for example, the bar for Female now displays before the bar for Male. These value labels both enhance the interpretability of the output of subsequent analyses, and inform the R data analysis procedures that the variable to be analyzed is categorical and not numeric.

bar chart, Section 4.2.1, p. 79
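The effect of the order of the levels on output can be seen with a frequency table in base R, which lists the categories in the order of the factor levels. The codes in gender_code are hypothetical.

```r
gender_code <- c(0, 0, 1, 1, 1)   # hypothetical integer codes

# Male first, then Female
g1 <- factor(gender_code, levels = c(0, 1), labels = c("Male", "Female"))

# Female first: levels and labels are both switched, in matching order
g2 <- factor(gender_code, levels = c(1, 0), labels = c("Female", "Male"))

table(g1)   # Male 2, Female 3
table(g2)   # Female 3, Male 2

# The data values are the same; only the order of the levels differs
identical(as.character(g1), as.character(g2))   # TRUE
```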

Convert Nominal Data to Ordinal Data

When R reads a data table into an R data frame, any variables with non-numeric data values are automatically converted to factors. This conversion is usually appropriate because unlike integer categorical variables, variables with non-numeric values cannot be numerical. R goes as far as it can to infer characteristics of the data from the way that the data values are encoded.

The issue of ordering the levels of a categorical variable goes beyond the ordering of the levels of the output for a data analysis. This more fundamental distinction refers to two types of data that consist of categories, nominal and ordinal. Nominal data are unordered, discrete categories. Ordinal data are ordered, discrete categories or rankings. For example, the data for Gender are nominal as Male is neither less than nor greater than Female.

nominal data, Section 1.6.3, p. 23
ordinal data, Section 1.6.3, p. 24

Consider the data for the Satisfaction variable in the Employee data set with three categorical levels: low, med, and high. Although the levels are not coded on a numeric scale, the levels are ordered because they reflect different locations along a continuum of Satisfaction. The levels are ranked as: low < med < high. That is, the continuous variable of Satisfaction is coded with ordinal data. Unless this ordering is accounted for, R would misleadingly label subsequent output with the level high listed first because it alphabetically precedes med and low.

Employee data set, Section 1.6.1, p. 20

Data analysis routines can use this additional information inherent in ordinal data, the ordering of the data values. For example, when presented with an ordinal variable for analysis, the lessR function BarChart, abbreviated bc, presents the proper order of the categories, and then displays the bars in a graded shade of the same hue, from light to dark, to indicate the underlying ordering. Again, the structure of the data in the data frame should align with the structure of the data as it is conceptually defined.

BarChart function, Section 4.2.1, p. 79
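The alphabetical default ordering described above is easy to confirm in base R. The responses in sat are hypothetical.

```r
sat <- c("low", "high", "med", "low")   # hypothetical responses

# Default: levels are sorted alphabetically, so high comes first
levels(factor(sat))                                     # "high" "low" "med"

# Explicit levels give the conceptually correct order
levels(factor(sat, levels = c("low", "med", "high")))   # "low" "med" "high"
```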

Scenario: Provide order to unordered categories
R initially recognizes Satisfaction as categorical, but with unordered categories, that is, nominal data. The categories, however, should be ordered as low, med, and high. Redefine the Satisfaction variable within R as ordinal, with ordered, non-numeric categories.

Specify ordinal data with the factor function. Invoke the levels option to obtain the desired order of the levels displayed on subsequent output. This specification only changes the output order, leaving the status of the categorical variable as nominal. To define the variable as ordinal, also specify the ordered=TRUE option.

ordered option: Specify ordinal data.

lessR Input: Specify the order of the value labels and define as ordinal
> mydata <- Transform(Satisfaction=factor(Satisfaction,
                      levels=c("low", "med", "high"), ordered=TRUE))

The output of the Transform function provides the first four rows of the data both before and after the transformation. However, there is no change in the values of Satisfaction in either listing. Instead, the preceding transformation changes a property of the variable, the ordering of its levels, not the data values themselves.
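The distinction between specifying only the order of the levels and declaring the variable ordinal can be sketched in base R. The responses in sat are hypothetical.

```r
sat <- c("low", "high", "med", "low")   # hypothetical responses

# Levels in the desired order, but the factor is still nominal
nominal <- factor(sat, levels = c("low", "med", "high"))

# ordered=TRUE additionally declares the factor as ordinal
ordinal <- factor(sat, levels = c("low", "med", "high"), ordered = TRUE)

is.ordered(nominal)   # FALSE
is.ordered(ordinal)   # TRUE

# Comparisons are meaningful only for the ordered factor
ordinal < "high"      # TRUE FALSE TRUE TRUE

# The data values themselves are unchanged
identical(as.character(nominal), as.character(ordinal))   # TRUE
```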

Likert Data Defined as Ordinal

The responses to the 20 Mach IV items are coded as integers on a 6-point scale from 0 to 5. Many researchers, your author included, analyze these data on a numeric scale. The resulting numerical data are subject to a wide variety of numerical analyses such as means, standard deviations, correlations, and factor analysis. These analyses are not meaningful when applied to categorical data.

Mach IV data, Listing 1.8, p. 27

There is no guarantee, however, that the respondents interpret the distances between scale points as equal. To interpret Likert data as numerical is to imply that the data values are interval data with the perception by the respondents of equally spaced categories on the underlying continuum of Disagree/Agree. If Likert data are interval, then the psychological perception of the distance between Strongly Agree and Agree is the same as between Agree and Slightly Agree. These psychological perceptions are presumed equal because the corresponding integer encoding of 5 (Strongly Agree) and 4 (Agree) and then 4 (Agree) and 3 (Slightly Agree) specify equal distances, 1, between each of the two pairs of response categories.

equal interval data, Section 1.6.3, p. 23

Some researchers prefer to analyze Likert data as ordinal data, as ordered categories without the underlying specification of a numerical scale.

Scenario: Convert integer Likert data to ordinal categories
Respondents provided answers to Likert attitude items, here on a 6-point scale from Strongly Disagree to Strongly Agree. The six possible responses to each item were coded as integers from 0 to 5, with 5 representing Strongly Agree. Convert these integer-coded responses to each item to ordinal categories with each category appropriately labeled.

Mach IV data, Listing 1.8, p. 27

Consider the six response categories with which respondents answered each Mach IV item. To simplify these expressions, save the Likert response category names into an object we choose to call LikertCats. The assignment operator, <-, indicates to insert whatever is on the right of the expression into whatever is on the left. The object LikertCats contains the six category names.

assignment operator, Section 1.3.3, p. 9

LikertCats <- c("Strongly Disagree", "Disagree", "Slightly Disagree",

"Slightly Agree", "Agree", "Strongly Agree")

Instead of individually listing all six integers, use the colon notation to generate them. Now apply the factor transformation to each item of interest. Create factor versions for each of the four items on the Mach IV Honesty subscale.

colon notation, Section 1.3.6, p. 15

> mydata <- Transform(
      m06.f = factor(m06, levels=0:5, labels=LikertCats),
      m07.f = factor(m07, levels=0:5, labels=LikertCats),
      m09.f = factor(m09, levels=0:5, labels=LikertCats),
      m10.f = factor(m10, levels=0:5, labels=LikertCats)
  )

After these transformations are run, subsequent data analysis routines will recognize the new variables such as m06.f as categorical variables with the specified value labels. By not replacing the original variables, such as m06 , both the original integer variables and the newly created factor versions of the item responses remain available for analysis.
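The conversion of integer Likert responses can also be sketched in base R. The data frame resp and its values are hypothetical, and only two items are shown. The loop and the ordered=TRUE argument are assumptions added for this illustration: the loop avoids repeating the factor call for each item, and ordered=TRUE declares the categories as ordinal, in keeping with the scenario.

```r
LikertCats <- c("Strongly Disagree", "Disagree", "Slightly Disagree",
                "Slightly Agree", "Agree", "Strongly Agree")

# Hypothetical responses of three people to two items
resp <- data.frame(m06 = c(0, 5, 3), m07 = c(2, 2, 4))

0:5   # colon notation generates 0 1 2 3 4 5

# Add a factor version of each item, keeping the integer original
for (v in c("m06", "m07")) {
  resp[[paste0(v, ".f")]] <- factor(resp[[v]], levels = 0:5,
                                    labels = LikertCats, ordered = TRUE)
}

as.character(resp$m06.f)   # "Strongly Disagree" "Strongly Agree" "Slightly Agree"
resp$m06                   # 0 5 3, the original integers remain available
```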