3.7 Merge Data

A merge of two data sets combines both data sets into one. Within R the data table is called a data frame. The two basic ways to merge data frames are with a horizontal merge or vertical merge.

3.7.1 Horizontal Merge

A horizontal merge creates a new data frame by combining the variables from the two data frames to merge. In this situation the two data frames generally contain different variables, but the data values are for the same people. The two data frames also share a common variable, usually a row identifier, an ID field. A horizontal merge yields a new data frame with the variables of both of the input data frames.

horizontal merge: Join two data frames by variables.
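As a minimal sketch of the idea, the following base R code joins two small data frames on a shared ID variable. The data frames d1 and d2, the variable ID, and all of the data values are hypothetical, invented only for this illustration. The base R function merge is used here, not a lessR function.

```r
# Hypothetical data: two sources describe the same three people
d1 <- data.frame(ID = c(101, 102, 103),
                 Years = c(7, 15, 5))
d2 <- data.frame(ID = c(103, 101, 102),   # same people, different row order
                 Satisfaction = c("high", "med", "low"))

# Horizontal merge: match the rows on the shared ID variable,
# then place the variables of both data frames side by side
both <- merge(d1, d2, by = "ID")
print(both)
```

The rows are matched on the value of ID, not on their position, so the two sources do not need to list the people in the same order.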

Scenario: Merge data horizontally

One source of employee data provided the data values for the first four variables: Years, Gender, Dept and Salary. The second source provided the values for Satisfaction and HealthPlan. Merge these data sets into a single data frame for subsequent analysis.

Edit Data 73

In this situation, neither of the two data files to be read into R is the data file of primary interest. The data table for subsequent analysis is the merged data frame, typically named mydata because that is the default name used by the lessR data analysis routines.

We wish to read data into R, but for the first time not directly into the mydata data frame. The two data frames created by reading the data are here called Emp1a and Emp1b. To keep the amount of data manageable for this example, only data from the first four rows of the Employee data set is included.

> Emp1a <- Read("http://lessRstats.com/data/Emp1a.csv", row.names=1)
> Emp1b <- Read("http://lessRstats.com/data/Emp1b.csv", row.names=1)

The resulting data frames appear in Listings 3.17 and 3.18.

> Emp1a
                 Years Gender Dept    Salary
Ritchie, Darnell     7      M ADMN  43788.26
Wu, James           NA      M SALE  84494.58
Hoang, Binh         15      M SALE 101074.86
Jones, Alissa        5      F <NA>  43772.58

Listing 3.17 Data frame with the first four variables of the Employee data set.

> Emp1b
                 Satisfaction HealthPlan
Ritchie, Darnell          med          1
Wu, James                 low          1
Hoang, Binh               low          3
Jones, Alissa            <NA>          1

Listing 3.18 Data frame with the last two variables of the Employee data set.

The goal is to merge horizontally the data frames Emp1a and Emp1b to create the primary data frame of interest. To accomplish this merge, use the lessR function Merge. The required by argument for a horizontal merge provides the ID field that dictates how the columns of data values are matched. Here the match is according to the row.names. A variable name for a variable that both input data frames have in common could also be specified.

Merge function: Merge two data frames.

by option: Specify the variable by which to join two data frames with a horizontal merge.

lessR Input: Horizontal merge that combines columns of data

> mydata <- Merge(Emp1a, Emp1b, by="row.names")

The merged data is written to the data frame mydata. The contents of mydata appear in Listing 3.19.
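For readers without access to the data files, the same kind of join can be sketched with the base R function merge. The data values below are typed in from the listings instead of being read from the web, and the name merged is chosen only for this illustration. The sketch assumes that the lessR function Merge behaves like merge for a horizontal merge; consult the lessR documentation to confirm.

```r
# Rebuild the two small data frames inline, with employee names as row names
emp_names <- c("Ritchie, Darnell", "Wu, James", "Hoang, Binh", "Jones, Alissa")
Emp1a <- data.frame(Years  = c(7, NA, 15, 5),
                    Gender = c("M", "M", "M", "F"),
                    Dept   = c("ADMN", "SALE", "SALE", NA),
                    Salary = c(43788.26, 84494.58, 101074.86, 43772.58),
                    row.names = emp_names)
Emp1b <- data.frame(Satisfaction = c("med", "low", "low", NA),
                    HealthPlan   = c(1, 1, 3, 1),
                    row.names = emp_names)

# Base R horizontal merge: by = "row.names" matches rows on their row names
merged <- merge(Emp1a, Emp1b, by = "row.names")
print(merged)
```

With by set to "row.names", merge stores the matched row names in a new first column named Row.names and numbers the rows of the result.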

> mydata
         Row.names Years Gender Dept    Salary Satisfaction HealthPlan
1      Hoang, Binh    15      M SALE 101074.86          low          3
2    Jones, Alissa     5      F <NA>  43772.58         <NA>          1
3 Ritchie, Darnell     7      M ADMN  43788.26          med          1
4        Wu, James    NA      M SALE  84494.58          low          1

Listing 3.19 The horizontally merged data frame.

3.7.2 Vertical Merge

A vertical merge combines the rows of data from the two data frames to create a new data frame in R.

vertical merge: Join two data frames by rows.

Scenario: Merge data vertically

The variables in the Employee data set are Years, Gender, Dept, Salary, Satisfaction, and HealthPlan. Two data frames contain data for these variables, but for two different groups of employees. Merge the two data frames into one.

First, the two data frames must be created in R by reading the data for each. The merged data frame is the primary data frame of interest, to be analyzed by the subsequent data analysis routines. For convenience, give the merged data frame the default name for the lessR data analysis routines, mydata.

The two data frames for the data to be merged are named Emp2a and Emp2b, respectively.

> Emp2a <- Read("http://lessRstats.com/data/Emp2a.csv", row.names=1)
> Emp2b <- Read("http://lessRstats.com/data/Emp2b.csv", row.names=1)

The resulting data frames appear in Listings 3.20 and 3.21. For purposes of illustration, each of these data frames is limited to only four employees.

> Emp2a
                 Years Gender Dept    Salary Satisfaction HealthPlan
Ritchie, Darnell     7      M ADMN  43788.26          med          1
Wu, James           NA      M SALE  84494.58          low          1
Hoang, Binh         15      M SALE 101074.86          low          3
Jones, Alissa        5      F <NA>  43772.58         <NA>          1

Listing 3.20 A data frame for four employees of the Employee data table.

> Emp2b
                 Years Gender Dept   Salary Satisfaction HealthPlan
Knox, Michael       18      M MKTG 89062.66          med          3
Campagna, Justin     8      M SALE 62321.36          low          1
Kimball, Claire      8      F MKTG 51356.69         high          2
Cooper, Lindsay      4      F MKTG 46772.95         high

Listing 3.21 A data frame for another four employees of the Employee data set.

Vertically merge the data frames Emp2a and Emp2b to create the primary data frame of interest. Again use the lessR function Merge, but for the vertical merge do not specify a by variable. The merged data is written to the data frame mydata.


lessR Input: Vertical merge that combines rows of data

> mydata <- Merge(Emp2a, Emp2b)

The contents of the merged data frame, mydata, appear in Listing 3.22.

> mydata
                 Years Gender Dept    Salary Satisfaction HealthPlan
Ritchie, Darnell     7      M ADMN  43788.26          med          1
Wu, James           NA      M SALE  84494.58          low          1
Hoang, Binh         15      M SALE 101074.86          low          3
Jones, Alissa        5      F <NA>  43772.58         <NA>          1
Knox, Michael       18      M MKTG  89062.66          med          3
Campagna, Justin     8      M SALE  62321.36          low          1
Kimball, Claire      8      F MKTG  51356.69         high          2
Cooper, Lindsay      4      F MKTG  46772.95         high

Listing 3.22 The vertically merged data frame.

Now the merged data, in the mydata data frame, is ready for analysis.
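A vertical merge can also be sketched with the base R function rbind, which stacks the rows of one data frame under those of another. The data values below are typed in from Listings 3.20 and 3.21, and only three of the six variables are included to keep the sketch short. The name stacked is chosen only for this illustration. The sketch assumes that the lessR function Merge, called without a by argument, behaves like rbind; consult the lessR documentation to confirm.

```r
# Two groups of employees measured on the same variables
Emp2a <- data.frame(Years  = c(7, NA, 15, 5),
                    Gender = c("M", "M", "M", "F"),
                    Salary = c(43788.26, 84494.58, 101074.86, 43772.58),
                    row.names = c("Ritchie, Darnell", "Wu, James",
                                  "Hoang, Binh", "Jones, Alissa"))
Emp2b <- data.frame(Years  = c(18, 8, 8, 4),
                    Gender = c("M", "M", "F", "F"),
                    Salary = c(89062.66, 62321.36, 51356.69, 46772.95),
                    row.names = c("Knox, Michael", "Campagna, Justin",
                                  "Kimball, Claire", "Cooper, Lindsay"))

# Check that both data frames have the same variables before stacking
stopifnot(identical(names(Emp2a), names(Emp2b)))

# Base R vertical merge: the rows of Emp2b are placed under the rows of Emp2a
stacked <- rbind(Emp2a, Emp2b)
print(stacked)
```

Because the row names of the two data frames do not overlap, rbind keeps them unchanged in the stacked result.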

Worked Problems

1 The three values of HealthPlan coded in the data file – 1, 2, and 3 – correspond to three health plans, respectively named GoodHealth, GetWell, and BestCare.

(a) Is HealthPlan a continuous or categorical variable? Why?

(b) How is HealthPlan stored within the R data frame?

(c) Provide the R statement that transforms HealthPlan to a factor.

The Cars93 data set contains information on 93 car models from 1993. One variable is Source, with two values: 0 for a foreign car and 1 for a car manufactured in the USA. Enter ?dataCars93 for more information.

> mydata <- Read("Cars93", format="lessR")

2 The variable Type is stored in the data file with non-numeric values.

(a) How is the data stored within the corresponding R data frame when the data file is read into R?

(b) Order the values of Type appropriately.

3 In the data file the variable Airbag is integer coded with values of 0, 1, and 2, which correspond to "none", "driver", and "driver+". The meaning of "driver" is driver only, and "driver+" means driver and passenger air bags, so there is an ordered progression across the three levels of Airbag.

(a) What is the formal name for this type of variable?

(b) Create the appropriate representation of these data in the R data frame.


4 Examine the values of horsepower, HP.

(a) Sort the data by HP.

(b) List the 10 most powerful cars in terms of horsepower.

5 The various modifications of the Cars93 data set in the previous problems prepare the data for subsequent analysis, but the changes are only temporary to the R session in which the changes are made. Chapter 2 presents two strategies for making these changes available in a subsequent R session.

(a) Discuss one strategy.

(b) Discuss another strategy.