3.6 Subset Data

3.6.1 Select Rows and/or Columns

The analysis of interest may be directed towards only part of the original data table. Perhaps only some of the rows of data are to be analyzed, such as only the data for Females. Or, perhaps one or more of the variables are not needed for the analysis, and can be discarded. Or, maybe the goal is simply to identify and list some subset of the data frame.

Scenario Retain only the rows of data for specific data values
For some analyses, analyze the data for men and women separately. From the primary data table create new data tables for men and for women.

To create a subset of the original data frame, use the lessR function Subset. If the data frame of interest is not mydata, then specify with the data option. To locate rows of data but leave the original data frame, usually mydata, unmodified, leave off the mydata <- assignment at the beginning of the Subset statement. Or, assign the output from Subset to any desired data frame.

Subset function: Create a subset of the original data frame.

data option, Section 1.3.5, p. 13

As is true of all functions defined in R, if the order of the arguments in the function call matches the order in the function definition, then the argument names can be omitted. To list the arguments that can be specified in a function call, either refer to the manual for that function, here ?Subset, or pass the function name to the R function args, as in Listing 3.12. The first two arguments to the Subset function are rows and then columns. So, if the rows to be retained or deleted are listed as the first argument, then the rows= specification can be omitted.

args function: List the arguments of a function.

rows argument: Specify the rows to retain or discard.

columns argument: Specify the columns to retain or discard.

> args(Subset)
function (rows, columns, data=mydata, brief=FALSE, holdout=FALSE, ...)

Listing 3.12 The arguments of the Subset function.
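The same rule about argument order applies to any R function. The following minimal sketch illustrates it with the base R function sample rather than Subset, so that it runs without lessR installed. The seed value and the variable names by_position and by_name are chosen only for this illustration.

```r
# List the arguments of the base R function sample, in definition order
args(sample)

# Arguments given in definition order: the names x= and size= can be omitted
set.seed(1)                      # arbitrary seed so both calls draw the same values
by_position <- sample(1:10, 3)

# Arguments given out of order: the names are then required
set.seed(1)
by_name <- sample(size = 3, x = 1:10)

# Both calls specify the same request, so the results match
identical(by_position, by_name)
```

Naming the arguments is never wrong. Omitting the names is only a convenience when the order of the call matches the order of the definition.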

To specify the subset, use the double equals sign, ==, which indicates a test for equality. For example, reduce the Employee data table to include only data for Females.

==: Test two expressions for logical equality.

lessR Input Select and then retain only the rows of data from women
> mydata <- Subset(Gender=="F")

The function Subset lists the first five rows of data before the subset, then the first four rows of data after the subset. The function also reports the number of rows and columns in the data frame before and after the subset procedure. As can be seen in Listing 3.13, the first rows after the subset have a Gender value of "F".

Number of variables in mydata: 6
Number of cases (rows) in mydata: 19

First four rows of data for data frame: mydata
--------------------------------------------------------------------
                Years Gender Dept   Salary Satisfaction HealthPlan
Jones, Alissa       5      F <NA> 43772.58         <NA>
Downs, Deborah      7      F FINC 47139.90         high
Afshari, Anbar      6      F ADMN 59441.93         high
Kimball, Claire     8      F MKTG 51356.69         high

Listing 3.13 First four rows of data after the subset that specifies only women be included in the revised data frame.
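For comparison, the same kind of row selection can be sketched with base R alone. This is not the lessR implementation, only an equivalent way to express the test Gender=="F". The small data frame below is an illustrative stand-in assembled from the rows printed in Listings 3.13 and 3.14, limited to four of the variables. It is not the full Employee data table, and the names women and women_idx are chosen only for this example.

```r
# Illustrative stand-in for the Employee data: five rows, four variables
mydata <- data.frame(
  Years  = c(5, 7, 6, 8, 13),
  Gender = c("F", "F", "F", "F", "M"),
  Dept   = c(NA, "FINC", "ADMN", "MKTG", "SALE"),
  Salary = c(43772.58, 47139.90, 59441.93, 51356.69, 77785.51),
  row.names = c("Jones, Alissa", "Downs, Deborah", "Afshari, Anbar",
                "Kimball, Claire", "Fulton, Scott")
)

# Base R subset function: keep only the rows that pass the equality test
women <- subset(mydata, Gender == "F")

# The same selection with bracket indexing: rows before the comma
women_idx <- mydata[mydata$Gender == "F", ]

print(women)
nrow(women)    # number of rows retained
```

As with Subset, the original data frame is left unmodified unless the result is assigned back to mydata.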

The expression for the rows to be retained can specify multiple criteria, using the logical operators in Table 3.2. To locate some data without modifying the original data frame, do not assign the output of the function to a data frame. For example, in the Employee data set the following expression locates only women with more than 10 years of employment, and then displays the resulting data.

Table 3.2 Logical operators.

!=   not equals
>    greater than
<    less than
&    and
|    or

lessR Input List selected rows of data but do not change the data frame
> Subset(Gender=="F" & Years>10)
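The logical operators combine in the same way in base R. The sketch below uses the base R subset function rather than lessR, applied to an illustrative stand-in for the Employee data built from the rows printed in Listings 3.13 and 3.14. Because no woman in these five rows has more than 10 years of employment, the example uses a threshold of 6 years instead. That threshold and the result names are assumptions made only for this illustration.

```r
# Illustrative stand-in for the Employee data: five rows, four variables
mydata <- data.frame(
  Years  = c(5, 7, 6, 8, 13),
  Gender = c("F", "F", "F", "F", "M"),
  Dept   = c(NA, "FINC", "ADMN", "MKTG", "SALE"),
  Salary = c(43772.58, 47139.90, 59441.93, 51356.69, 77785.51),
  row.names = c("Jones, Alissa", "Downs, Deborah", "Afshari, Anbar",
                "Kimball, Claire", "Fulton, Scott")
)

# and: both conditions must be true for a row to be listed
women_over_6 <- subset(mydata, Gender == "F" & Years > 6)

# not equals: every row in which Gender is something other than "F"
not_women <- subset(mydata, Gender != "F")

# or: a row is listed if either condition is true
short_or_long <- subset(mydata, Years < 6 | Years > 10)

# None of the results is assigned to mydata, so mydata is unchanged
print(women_over_6)
print(not_women)
print(short_or_long)
```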

A row of data can also be located with the R row.names function, such as the data for employee Scott Fulton.

row.names function: Identify a single row of the data frame.

> Subset(row.names(mydata)=="Fulton, Scott")

Note that the name of the data frame, usually mydata, must also be specified in the call to row.names. The function row.names is an R function, and hence does not default to mydata for the input data frame as do the lessR functions. Listing 3.14 displays the row of data for Scott Fulton.

              Years Gender Dept   Salary Satisfaction HealthPlan
Fulton, Scott    13      M SALE 77785.51          low

Listing 3.14 Locate a row of data based on the row ID, in this case the employee’s name.
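Locating a row by its row name can also be sketched in base R without lessR. The data frame below is an illustrative stand-in built from the rows printed in Listings 3.13 and 3.14, limited to four of the variables. The names by_test and by_name are chosen only for this example.

```r
# Illustrative stand-in for the Employee data: five rows, four variables
mydata <- data.frame(
  Years  = c(5, 7, 6, 8, 13),
  Gender = c("F", "F", "F", "F", "M"),
  Dept   = c(NA, "FINC", "ADMN", "MKTG", "SALE"),
  Salary = c(43772.58, 47139.90, 59441.93, 51356.69, 77785.51),
  row.names = c("Jones, Alissa", "Downs, Deborah", "Afshari, Anbar",
                "Kimball, Claire", "Fulton, Scott")
)

# Logical test on the row names, the same expression passed to Subset
by_test <- mydata[row.names(mydata) == "Fulton, Scott", ]

# Bracket indexing also accepts the row name directly
by_name <- mydata["Fulton, Scott", ]

print(by_test)
```

In both forms the data frame is named explicitly, because base R functions have no default data frame.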

The columns option is the second parameter in the definition of the Subset function, as seen from entering ?Subset. If columns is the second argument in the call to Subset, then the columns= specification can be omitted.

For example, the following retains all of the rows of data, but only the data for the columns that contain the data values for the variables Years and Salary.

lessR Input Select and retain only the specified variables
> Subset(columns=c(Years, Salary))

Columns can also be deleted from the data frame of interest. To indicate that data for all variables are retained except the variables Years and Salary, put a minus sign, -, in front of the c for the combine function. Or, if there is only a single variable and no combine function, put the minus sign directly in front of the variable name.

c function, Section 1.3.6, p. 15

> mydata <- Subset(columns=-c(Years, Salary))
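The base R subset function follows a similar convention through its select argument, including the minus sign to drop variables. The sketch below shows that base R analogue, not the lessR function, on an illustrative stand-in for the Employee data built from the rows printed in Listings 3.13 and 3.14. The result names are assumptions made for this example.

```r
# Illustrative stand-in for the Employee data: five rows, four variables
mydata <- data.frame(
  Years  = c(5, 7, 6, 8, 13),
  Gender = c("F", "F", "F", "F", "M"),
  Dept   = c(NA, "FINC", "ADMN", "MKTG", "SALE"),
  Salary = c(43772.58, 47139.90, 59441.93, 51356.69, 77785.51),
  row.names = c("Jones, Alissa", "Downs, Deborah", "Afshari, Anbar",
                "Kimball, Claire", "Fulton, Scott")
)

# Retain all rows, but only the columns Years and Salary
kept <- subset(mydata, select = c(Years, Salary))

# Retain every column except Years and Salary: minus sign in front of c
dropped <- subset(mydata, select = -c(Years, Salary))

# Drop a single column: minus sign directly in front of the variable name
dropped_one <- subset(mydata, select = -Dept)

names(kept)
names(dropped)
names(dropped_one)
```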

The rows and columns arguments can also work together.

> mydata <- Subset(Gender=="F" & Years>10, columns=c(Years, Salary))

Here rows of data are obtained only for women with more than 10 years of work experience, and only for two variables, Years and Salary.
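Selecting rows and columns in a single step corresponds to base R bracket indexing, with the row test before the comma and the column names after it. The sketch below is a base R analogue, not the lessR function. It uses an illustrative stand-in for the Employee data built from the rows printed in Listings 3.13 and 3.14, and a threshold of 6 years instead of 10 because no woman in these five rows exceeds 10 years. The threshold and the name women_pay are assumptions made for this example.

```r
# Illustrative stand-in for the Employee data: five rows, four variables
mydata <- data.frame(
  Years  = c(5, 7, 6, 8, 13),
  Gender = c("F", "F", "F", "F", "M"),
  Dept   = c(NA, "FINC", "ADMN", "MKTG", "SALE"),
  Salary = c(43772.58, 47139.90, 59441.93, 51356.69, 77785.51),
  row.names = c("Jones, Alissa", "Downs, Deborah", "Afshari, Anbar",
                "Kimball, Claire", "Fulton, Scott")
)

# Rows: women with more than 6 years; columns: Years and Salary only
women_pay <- mydata[mydata$Gender == "F" & mydata$Years > 6,
                    c("Years", "Salary")]

print(women_pay)
dim(women_pay)    # rows, then columns
```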

3.6.2 Randomly Select Rows

The previous uses of the Subset function generated subsets based on logical criteria, such as selecting only those rows of data for the subset in which Gender is equal to "F". Another possibility is to have Subset randomly select the rows of data to retain, such as to evaluate the stability of a statistical result by doing the same analysis on a different data set. If the original sample is sufficiently large, this dual analysis can be performed first on a random subset of the original data, and then again on the remaining rows of data not included in that subset.

Scenario Random selection of rows of data
Create a data set that consists of 60% of the original data and then a second data set of the remaining 40% of the rows of data.

To create these data sets, specify the rows argument either as an integer to indicate the number of rows to retain, or as a proportion to indicate the proportion of rows to obtain. For example, the following generates a randomly selected subset of 60% of the rows of data in the default mydata data frame. Leave the original data table mydata unmodified by directing the subset of the data to a data frame called mydata.sub.

rows argument: Specify the number or proportion of rows to retain.

lessR Input Randomly select a percentage of the rows of mydata data table
> mydata.sub <- Subset(.6, holdout=TRUE)

To analyze the data in this subset in a subsequent analysis, specify data=mydata.sub, the name of the data frame chosen to assign the output of the Subset function. The output of this random selection process in Listing 3.15 includes the usual Subset output previously described. Also included is a brief description of the selection process, which results from applying the previous statement to the Mach IV data.

Mach IV data, Listing 1.8, p. 27

Rows of data randomly extracted
------------------------------------------
Proportion of randomly retained rows: 0.6
Number of randomly retained rows: 211

Listing 3.15 Rows retained in the random selection.
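The arithmetic behind Listing 3.15 can be sketched in base R. The data frame below is a synthetic placeholder with 351 rows, the same number of rows as the Mach IV data but not the Mach IV data itself. The column names id and score, the seed, and the use of round to convert the proportion to a row count are assumptions made for this illustration. Rounding reproduces the 211 rows reported in the listing, but how Subset computes the count internally is not confirmed here.

```r
set.seed(123)    # arbitrary seed so the random selection is repeatable

# Synthetic placeholder with the same number of rows as the Mach IV data
mydata <- data.frame(id = 1:351, score = rnorm(351))

# Convert the proportion to a number of rows: 0.6 * 351 = 210.6, rounded
n_keep <- round(0.6 * nrow(mydata))

# Draw that many distinct row positions at random, then restore row order
keep <- sort(sample(nrow(mydata), n_keep))

# Leave mydata unmodified by assigning the subset to a new data frame
mydata.sub <- mydata[keep, ]

n_keep
nrow(mydata.sub)
```

Because sample draws without replacement by default, no row appears twice in mydata.sub.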

Subset also provides the code to construct a second data frame, known as a hold-out sample. The name of this constructed data frame is the name of the original data frame with the characters .hold appended to the name. The resulting name for the usual mydata data frame is mydata.hold. After randomly extracting 60% of the rows of the 351 rows of data in the Mach IV data set, the output in Listing 3.16 shows how to construct a data frame for the remaining 40% of the original data.

hold-out sample: A percentage of the original data table extracted and then retained for later analysis.

mydata.hold <- Subset(
  row.names(mydata)=="3" | row.names(mydata)=="5" |
  row.names(mydata)=="6" | row.names(mydata)=="8"
  ...
  | row.names(mydata)=="349" | row.names(mydata)=="351" )

Listing 3.16 Excerpt of the code to create from the mydata data frame a hold-out sample called mydata.hold.
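The generated code in Listing 3.16 selects each hold-out row by name. The same idea can be sketched more compactly in base R by keeping every row whose name does not appear in the random subset. The data frame is a synthetic placeholder with 351 rows, not the Mach IV data, and the column names, the seed, and the name in_sub are assumptions made for this illustration.

```r
set.seed(123)    # arbitrary seed so the random selection is repeatable

# Synthetic placeholder with the same number of rows as the Mach IV data
mydata <- data.frame(id = 1:351, score = rnorm(351))

# Random subset of 60% of the rows, with the original row names kept
keep <- sort(sample(nrow(mydata), round(0.6 * nrow(mydata))))
mydata.sub <- mydata[keep, ]

# Hold-out sample: the rows whose names are not in the random subset
in_sub <- row.names(mydata) %in% row.names(mydata.sub)
mydata.hold <- mydata[!in_sub, ]

nrow(mydata.sub)
nrow(mydata.hold)
```

The two data frames share no rows and together account for every row of the original, which is what makes the second analysis a check on the first.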

To create the hold-out sample named mydata.hold, run this code on the original data frame, usually mydata. Or, if the original mydata has been overwritten, re-read the data to reconstruct the original data table before extracting the hold-out sample. To analyze this hold-out sample with the lessR data analysis routines, specify data=mydata.hold in the subsequent calls to the corresponding functions.