3.5 Sort Data

One way to organize data is to sort the rows of the data frame by some criterion based on the data values.

3.5.1 Sort by Variables

Consider first sorting the data by the values of one or more variables. When multiple variables are specified, the second variable is sorted within each level of the first variable, and so forth. Each variable can be sorted in ascending or descending order, that is, from smallest to largest or largest to smallest.

Scenario: Sort the data according to the values of one or more variables. To facilitate a visual inspection of the data, sort the rows of data according to the values of Gender. List all the data rows for the women first, where the value of Gender is "F", and then again for the men, where Gender is "M". Within the data rows for each value of Gender, sort by Salary in descending order, listing first the women who make the most money, and then the men.

Sort function: Sort the rows of a data frame by specified variables.

To sort the rows of data in the data frame, usually mydata, use the lessR function Sort, abbreviated srt. Invoke the data option to specify a data frame with a different name. For example, to sort the data by Gender, enter the following.

data option, Section 1.3.5, p. 13

Edit Data 67

lessR Input: Sort the data frame by the specified variable

> mydata <- Sort(Gender)

or

> mydata <- srt(Gender)

Consider a sort first by Gender and then, within each level of Gender, by Salary. To sort by multiple criteria such as these, list the variables in the order of their sort. Unless otherwise specified, the variables are sorted in the default ascending order, from smallest to largest values. Because there is more than one specified variable, that is, a list of variables, use the combine function, c, to combine the multiple variables into a single list.

c function, Section 1.3.6, p. 15

> mydata <- Sort(c(Gender, Salary))

To sort at least one of the specified variables in descending order, use the direction option for each variable to be sorted. To use this option, list the order of the sort for each variable. A "+" indicates an ascending sort, from smallest to largest values. A "-" indicates a descending sort, from largest to smallest values. For example, to sort by Gender in ascending order, and then Salary in descending order, enter the following.

direction option: Specify the direction of the sort for a variable.

lessR Input: Sort the data frame by specified variables and directions

> mydata <- Sort(c(Gender, Salary), direction=c("+", "-"))

The output of this sort instruction begins in Listing 3.10. The first output is a specification of the sort.

Sort Specification
-------------------------------
Gender --> ascending
Salary --> descending
-------------------------------

Listing 3.10 Sort specification for sorting Gender in ascending order followed by Salary in descending order.

Listing 3.11 displays the first four rows of the sorted data. All four rows are data from women, and the salaries are listed in descending order, beginning with the highest salary among the women, $112,563.38.

After the Sort, first four rows of data for data frame: mydata
-------------------------------------------------------------------
               Years Gender Dept    Salary Satisfaction HealthPlan
James, Leslie     18      F ADMN 112563.38          low
Kralik, Laura     10      F SALE  82681.19          med
Skrotzki, Sara    18      F MKTG  81352.33          med
Billing, Susan     4      F ADMN  62675.26          med

Listing 3.11 Employee data sorted by Gender in ascending order and then Salary in descending order.
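For comparison, the same ordering can be sketched in base R with the order function. This is only an illustration of the idea, not a description of how Sort works internally. The data frame d is a hypothetical miniature of the employee data: the four women's names and salaries come from Listing 3.11, but the two rows for men are invented for the example.

```r
# Hypothetical miniature of the employee data. The four rows for women use
# the names and salaries from Listing 3.11; the two rows for men are
# invented for illustration only.
d <- data.frame(
  Gender = c("M", "F", "F", "M", "F", "F"),
  Salary = c(50000.00, 82681.19, 112563.38, 70000.00, 62675.26, 81352.33),
  row.names = c("Doe, John", "Kralik, Laura", "James, Leslie",
                "Roe, Richard", "Billing, Susan", "Skrotzki, Sara"),
  stringsAsFactors = FALSE
)

# order() returns the row positions in sorted order: Gender ascending, then
# Salary descending within each Gender (negating a numeric key reverses it).
sorted <- d[order(d$Gender, -d$Salary), ]
print(sorted)
```

Negating the key works only for numeric variables such as Salary. The first sort key takes precedence, and later keys only break ties, which is the "sorted within each level" behavior described above.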


3.5.2 Sort by Other Criteria

In R the row names of a data frame are conceptually distinct from the variables. Unlike the variables, the row names are not subject to statistical analysis, such as computing a mean. Instead their purpose is to identify each unique row and to appear on the output to facilitate interpretation, such as to label individual points in a graph.

The Sort function provides a way to sort by row names, the names of the employees in the Employee data set. To do this, specify row.names as the criterion by which to sort.

row.names option: A criterion for sorting the data frame by row names.

> mydata <- Sort(row.names)

The value of each row name occurs only once. When row names are to be sorted there is no reason to first sort by values of the variables within each row, so if row.names is specified as the sort criterion then no variables are specified. The direction of the sort, however, can be specified. The default is ascending. If a descending sort by row names is desired, include the direction="-" option.

lessR Input: Sort the data frame by row names in descending order

> mydata <- Sort(row.names, direction="-")
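A descending sort by row names can also be sketched in base R, shown here only to make the idea concrete and not as a description of the lessR internals. The small data frame d is hypothetical, reusing three names and salaries from Listing 3.11.

```r
# Hypothetical three-row data frame; names and salaries from Listing 3.11.
d <- data.frame(
  Salary = c(82681.19, 62675.26, 112563.38),
  row.names = c("Kralik, Laura", "Billing, Susan", "James, Leslie")
)

# Order the row names from Z to A, then index the rows in that order.
# drop = FALSE keeps the result a data frame even with a single column.
by_name_desc <- d[order(rownames(d), decreasing = TRUE), , drop = FALSE]
print(by_name_desc)
```

Because each row name occurs only once, there are no ties to break, which is why no further sort variables are needed.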

The Sort function also has an option to randomly shuffle the rows of data. To do so, specify random as the criterion for the sort.

random option: A criterion for sorting the data frame randomly.

> mydata <- Sort(random)

This option is useful if the data have been previously sorted by some criterion that is no longer relevant.
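As a rough base R counterpart, a random shuffle amounts to indexing the rows by a random permutation of the row numbers. The sketch below assumes a hypothetical data frame d, and the call to set.seed is an added assumption so that the shuffle can be reproduced; neither comes from the lessR example above.

```r
# Assumption: a fixed seed, used only so that the shuffle is reproducible.
set.seed(1)

# Hypothetical data frame that is currently sorted by Gender.
d <- data.frame(id = 1:6, Gender = rep(c("F", "M"), each = 3))

# sample(n) returns a random permutation of 1..n; using it as a row index
# reorders the rows without adding or dropping any of them.
shuffled <- d[sample(nrow(d)), ]
print(shuffled)
```

Every original row appears exactly once in the result, so only the order changes.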