Write Data

2.5 Write Data

The function Read reads data from an external file into a specified R data frame. The matching function Write writes a specified data frame within R, usually mydata, to an external file. One output format is the native R format, which is readable only by R but is relatively fast to read and also includes any variable labels present in the data frame. For compatibility with other systems, the data can also be exported as a csv text file.

2.5.1 Write a Data Frame in Native R Format

The contents of a data frame necessarily include the data values read into R. The contents also include related information such as the storage type of each variable. The complete contents of a data frame can be written to an external file on your computer system as a virtually literal copy of the data frame as it is stored within an R session. The data table can then be read back into R for later analysis. The format of the resulting file is called a native R data file, which can only be read by the R system.

One advantage of saving the complete contents of a data frame is that, especially for large data files, re-reading a previously saved data frame in the form of a native R data file is much faster than reading a text file. The issue is that without prior information R spends a relatively large amount of time reading a text file column by column while attempting to interpret the type of data contained within each column.³ Plus, the format of the saved R data frame is more compact than an equivalent text file, such as a csv file. Modern computers are fast, but particularly for large data sets, the time difference is noticeable. The larger the file, the greater the benefit from reading a native R data file instead of a text data file.

For example, consider a relatively large csv file of 69.8MB with over 4.2 million rows of data and 6 integer variables. This file was stored on your author’s MacBook Pro with an Intel i5 processor and a solid state drive. The total elapsed time to read the more than 4 million rows of data was just under 31 seconds. The data were then written in native R format as an .rda file. The total time to read the same data in this format was reduced to just under 11 seconds. Further, the size of the written file was reduced to 16.5MB.
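The comparison can be tried on a smaller scale. What follows is a minimal sketch in base R, not the lessR Read and Write functions and not your author's benchmark. It assumes the base R functions write.csv, save, read.csv, and load as stand-ins, and it uses synthetic data of 100,000 rows and 6 integer variables. The timings and file sizes it reports depend on the computer.

```r
# Base R sketch: write the same synthetic data as csv and as .rda,
# then compare file size and elapsed read time for the two formats.
set.seed(1)
n <- 100000
mydata <- data.frame(matrix(sample.int(100L, n * 6, replace = TRUE), ncol = 6))

csv_file <- tempfile(fileext = ".csv")
rda_file <- tempfile(fileext = ".rda")

write.csv(mydata, csv_file, row.names = FALSE)  # text file
save(mydata, file = rda_file)                   # native R file

# Time each read; load() restores the saved object into rda_env
csv_time <- system.time(from_csv <- read.csv(csv_file))[["elapsed"]]
rda_env <- new.env()
rda_time <- system.time(load(rda_file, envir = rda_env))[["elapsed"]]
from_rda <- rda_env$mydata

cat("csv:", file.size(csv_file), "bytes, read in", csv_time, "seconds\n")
cat("rda:", file.size(rda_file), "bytes, read in", rda_time, "seconds\n")
```

Both reads return the same data values. Only the time and the storage space differ.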

After reading the data it is often worthwhile to perform one or more transformations of the existing data, the topic of the next chapter (Chapter 3, p. 53). For example, with data editing, convert a variable from measurements in inches to measurements in centimeters, or analyze the logarithm of a variable. These subsequent modifications of the data values are saved when the updated data frame is saved. The alternative is to re-read the initial text file and then re-do the transformations.

Scenario: Write the contents of a data frame in native R format

Data and the variable labels were read into an R data frame and edited, with several transformations applied to the data. Write the complete contents of the edited data frame in R native format.

To write the complete contents of the mydata data frame to an external file, including all internal R-specific formatting, use the lessR function Write with the option format="R". To simplify the input, the abbreviation wrt.r automatically sets this option. Specify a file name, to which Write automatically appends the file type .rda for R data if the file type is not explicitly specified.

Write function: Write contents of a data frame to an external data file.

format="R" option: Write the data frame as an R native file.

lessR Input: Write an R native data file called MyGoodData

> Write("MyGoodData", format="R")

or

> wrt.r("MyGoodData")

Where does R write the data file? The answer is what R calls the current working directory, the directory to which output files are written. In Windows, the default location is your Documents folder. In Macintosh and Linux systems the default location is the top level of your home folder. For example, the following call to Write was done on your author’s Macintosh. Here the contents of the mydata data frame are written to the file called MyGoodData.rda in the gerbing folder, as indicated by the output of the Write function. Note that the file type, .rda, is not part of the information entered to the Write function, but is automatically appended to the file name.

current working directory: The location where R writes files.

> wrt.r("MyGoodData")
The mydata contents was written at the current working directory.
MyGoodData.rda in: /Users/gerbing

Listing 2.12 Input and output for the Write function for the mydata data frame.
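A native R data file keeps more than the data values. The following sketch uses the base R functions save and load as a rough analogue of Write with format="R". How lessR implements Write internally is not stated here, so treat the correspondence as an assumption. The variables Gender, Hired, and Salary are made up for illustration.

```r
# Base R sketch: storage types survive a round trip through an .rda file.
mydata <- data.frame(
  Gender = factor(c("M", "F", "F"), levels = c("M", "F")),
  Hired  = as.Date(c("2020-01-15", "2019-06-01", "2021-03-30")),
  Salary = c(52000.5, 61000, NA)
)

rda_file <- file.path(tempdir(), "MyGoodData.rda")
save(mydata, file = rda_file)   # write the complete data frame

rm(mydata)                      # remove the data frame from the session
load(rda_file)                  # restore it under its original name, mydata

str(mydata)                     # factor, Date, and numeric types are intact
```

Because load restores the object under the name it had when saved, the data frame reappears as mydata without an assignment.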


Then, just move the new data file to the desired location on your computer’s file system, such as by dragging the file’s icon to the desired folder. Optionally, the current working directory can also be changed so that output files can be directly written to the desired location from R. In Windows go to the File menu and choose Change dir…, and on a Macintosh go to the Misc menu and choose Change Working Directory…. The R function to do this from the command line, applicable to all R users, is setwd, for set working directory. Enter ?setwd for more information.

setwd function: Set the working directory where files are written.
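As an illustration of changing the working directory from the command line, the following base R sketch reports the current working directory with getwd, switches to another folder with setwd, writes a file there, and then switches back. The folder name analysis_output is hypothetical and is created inside R's temporary directory so that nothing else on the computer is touched.

```r
# Base R sketch: write a file to a chosen folder by changing the working directory.
old_dir <- getwd()                                   # remember where R started
out_dir <- file.path(tempdir(), "analysis_output")   # hypothetical target folder
dir.create(out_dir, showWarnings = FALSE)

setwd(out_dir)                                       # output files now go here
mydata <- data.frame(x = 1:3)
save(mydata, file = "MyGoodData.rda")                # relative name: written to out_dir
written_to <- file.path(getwd(), "MyGoodData.rda")

setwd(old_dir)                                       # restore the original directory
written_to
```

Restoring the original directory at the end keeps later Read and Write calls pointed at the location you expect.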

2.5.2 Write a Data Frame in csv Format

The contents of a data frame can also be written to an external file in csv format as a text file. Here use the Write function with format="csv", the default value. Obtain the same effect by simply omitting the option. One advantage of writing to the csv format is that data can be read into R, modified, written as a csv file, and then read into another application such as a worksheet.

lessR Input: Write a csv data file called MyGoodData

> Write("MyGoodData")

or

> wrt("MyGoodData")

The Read and Write functions allow data to freely flow into and out of R.

The Write function by default also writes to the csv formatted file the row name for each row of data. Particularly for data tables read into R that already have row identifiers, such as the Employee data table, this default is generally appropriate. However, for data files without an explicit row ID contained in the data file, R assigns a row ID, which is just an integer from 1 to the number of rows. By default, then, this row number will be written to the output data file even though it was not included in the data file originally read into R. To suppress the writing of the row names, add the row.names=FALSE option to the Write statement.

row.names option: Suppress the writing of row names by setting to FALSE.

One other issue that affects how well an output text file of data matches the input text file is missing data. As discussed, missing data by default in an input text file of data is represented as literally missing: for a csv text file there are two commas with nothing in between, and for a fwd text file there is simply an empty space in the corresponding column. Once read, internally R represents this missing data as NA. When writing data to an external text file, R retains these missing data codes, which are written to the output data file instead of literally being missing. To read this text file back into R, make sure to include the missing=NA option to the Read statement so that R will, once again, interpret these values as missing.
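Both issues, the row names and the NA codes, can be seen in a small example. The sketch below uses the base R functions write.csv and read.csv rather than the lessR Write and Read functions, so the option names differ: the base R na.strings argument is assumed here to play the role of the missing option of Read. The data values are made up.

```r
# Base R sketch: row names and missing data codes in a written csv file.
mydata <- data.frame(Weight = c(150, NA, 200), Height = c(66, 70, NA))
csv_file <- tempfile(fileext = ".csv")

# Default: the assigned row numbers 1, 2, 3 are written as an extra first column
write.csv(mydata, csv_file)
with_row_names <- readLines(csv_file)
print(with_row_names)

# Suppress the row names; missing values are still written as the code NA
write.csv(mydata, csv_file, row.names = FALSE)
without_row_names <- readLines(csv_file)
print(without_row_names)

# Read the file back, telling R that the text NA marks a missing value
back <- read.csv(csv_file, na.strings = "NA")
print(back)
```

In base R, write.csv also accepts na="" to write missing values as literally missing, which reproduces the two-commas convention of the original input file.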

Worked Problems

1 Consider the data in Figure 2.3, randomly selected from a data file of the body measurements of thousands of motorcyclists.

Figure 2.3 Gender, Weight, and Height of eight motorcyclists. (Worksheet columns: Gender, Weight, Height.)

(a) Enter these values into a worksheet and create a csv file of these data. (See problem #4 for Chapter 1.)
(b) Read the data into R.
(c) Confirm that the data were read correctly.
(d) List the data within R.
(e) Write the data to an R native data file.

2 Variable labels.

(a) In a worksheet application, construct a variable labels file for the data in Figure 2.3. For example, include the units of measurement along with the variable name.
(b) Convert the labels file to a csv file.
(c) Read the labels file into R.
(d) List the labels within R.

3 Suppose the data from Figure 2.3 were stored in the fwd format, as shown in Listing 2.13.

Listing 2.13 Data for eight motorcyclists in fixed width format. This data set is also available on the web at: http://lessRstats.com/data/HtWtEg.fwd

(a) Read the data into R directly from the web. (Hint: To read the data, treat the blank space in front of each number as part of its field width.)
(b) Create the data file on your computer system.
(c) Confirm that the data were read correctly.
(d) Write a csv version of the data file, without the imputed row names, 1 through 8.

4 The following web address (URL) specifies a file in fixed width format with 351 rows of data, the data for the Hunter, Gerbing, and Boster (1982) analysis.

http://lessRstats.com/data/Mach4Plus.fwd


The codebook for the data follows.

ID, 4 columns
Gender, 0 for Male, 1 for Female, 1 column
Mach IV, 20 items, 1 column each
Dogmatism, 20 items, 1 column each
Self-esteem, 10 items, 1 column each
Internal locus of control, 8 items, 1 column each
External locus of control, Powerful others, 8 items, 1 column each
External locus of control, Chance, 8 items, 1 column each

(a) Read the data into an R data frame.
(b) List the variable names and the first 6 rows of the data frame.
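A codebook such as the one above translates directly into a vector of column widths. The following sketch shows one way that translation might look with the base R function read.fwf, not the lessR Read function. The variable name prefixes (Mach, Dog, SE, IL, EP, EC) are hypothetical, and the three lines of data are randomly generated stand-ins, not the actual Mach4Plus data.

```r
# Base R sketch: build fixed-width column widths and names from a codebook.
# Number of one-column items per scale; the prefixes are hypothetical names.
n_items <- c(Mach = 20, Dog = 20, SE = 10, IL = 8, EP = 8, EC = 8)

widths <- c(4, 1, rep(1, sum(n_items)))   # ID, Gender, then one column per item
item_names <- unlist(lapply(names(n_items),
                            function(p) paste0(p, seq_len(n_items[[p]]))))
col_names <- c("ID", "Gender", item_names)

# Made-up data lines in the same layout: 4-column ID, Gender, 74 item responses
set.seed(2)
fake_line <- function(id) {
  paste0(sprintf("%4d", id),
         sample(0:1, 1),
         paste(sample(0:5, sum(n_items), replace = TRUE), collapse = ""))
}
fwd_file <- tempfile(fileext = ".fwd")
writeLines(vapply(1:3, fake_line, character(1)), fwd_file)

mydata <- read.fwf(fwd_file, widths = widths, col.names = col_names)
names(mydata)[1:6]
head(mydata[, 1:6])
```

The widths sum to 79, the number of characters on each line, which is a quick check that the codebook was transcribed correctly.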