2.3 More Data Formats

Saving a data table stored as a worksheet file, such as from Microsoft Excel or LibreOffice Calc, into a csv formatted file with variable names in the first row provides a straightforward means of preparing data for entry into R. Sometimes, however, data are available in other formats that Read can also access.

2.3.1 Tab Delimited Data

delimiter: Character that separates adjacent data values.

For a text file in which the data values for each variable do not occupy a pre-defined number of columns, a character called the delimiter separates adjacent data values. For a csv file the delimiter is a comma. Another common delimiter is a tab character. The Read function recognizes both of these file formats by default.

sep option: Specify the character that separates adjacent data values.

Any standard character, however, can also serve as a delimiter according to the R separator option, sep. The tab character, for example, can be explicitly specified as the delimiter. To browse for and then read from a data file of standard text with tab-delimited data values, invoke the following.

lessR Input Read tab delimited data
> mydata <- Read(sep="\t")

The tab character, itself invisible, is represented by the backward slash and the letter t in the function call. To read a file from the web, specify the first parameter option as the web address enclosed in quotes. Then include the sep option as the second option. More generally, set sep="" to indicate any white space for a delimiter, such as a space or other invisible character.
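The same sep option can be tried without lessR. The following is a minimal sketch that uses the standard R function read.table on a small made-up file written to a temporary location; the file contents and the variable names are illustrative assumptions, not data from this book.

```r
# Write a small tab-delimited file to a temporary location (made-up data)
path <- tempfile(fileext = ".txt")
writeLines(c("Name\tYears\tSalary",
             "Ann\t7\t53788.26",
             "Bob\t12\t94494.58"), path)

# Explicit tab delimiter; header=TRUE takes variable names from the first row
d <- read.table(path, header = TRUE, sep = "\t")
print(d)

# sep="" splits on any run of white space, which includes tabs
d2 <- read.table(path, header = TRUE, sep = "")
print(isTRUE(all.equal(d, d2)))
```

With sep="" a data value that itself contains a space would be split into two values, so the explicit "\t" is the safer choice for tab-delimited files.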

2.3.2 Decimal Comma Instead of a Decimal Point

decimal separator: The character in a numeric string that indicates where decimal digits begin.

Another Read option based directly on the options provided by standard R is to change the character that indicates the decimal digits in a number. Most English speaking countries, plus all of North America and China, use a period for the decimal separator, called the decimal point in this context. Another tradition, favored by Europe, Russia, and all of South America, uses a comma for the same purpose, and then, perhaps, a semi-colon instead of a comma to delimit adjacent data values. Use the dec option to specify the character that indicates decimal digits, possibly in combination with the sep option.

Read2 function: Read data with a comma for a decimal point.

lessR Input Read data with commas for the decimal separator
> mydata <- Read(sep=";", dec=",")

or

> mydata <- Read2()

To simplify reading csv data files in this format, lessR includes the Read2 function, which behaves just as Read, but with sep=";" and dec="," preset. When reading data with Read2 the display of the numbers in the R output still includes the decimal point as a period. To change the display on the output, invoke the OutDec option for the standard R options function.
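As a rough illustration of the sep and dec options in standard R, the sketch below reads a few made-up lines in the semi-colon and decimal-comma format. It uses read.table and read.csv2 from base R rather than the lessR functions; the data values are invented for the example.

```r
# Made-up data: semi-colon delimiter, comma as the decimal separator
txt <- c("Name;Salary",
         "Ann;53788,26",
         "Bob;94494,58")

# Explicit form: state both the delimiter and the decimal character
d1 <- read.table(text = txt, header = TRUE, sep = ";", dec = ",")

# read.csv2 is the standard R shortcut with sep=";" and dec="," preset
d2 <- read.csv2(text = txt)

print(d1$Salary)              # stored as numbers, displayed with a period
print(is.numeric(d2$Salary))
```

Without dec="," the Salary values would be read as character strings rather than as numbers.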

OutDec option: Specify character for decimal point on R output.

lessR Input Inform R to display a decimal separator as a comma
> options(OutDec=",")

There are many other options that can also be set with the options function. Reference the help file with ?options to view these options.
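Because options changes a setting for the rest of the R session, it can be useful to save the previous value and restore it afterward. The following sketch shows one way to do so; the object names old, with_comma, and with_point are chosen only for this illustration, and the last line assumes the session started with the default period.

```r
# options() returns the previous values, so they can be restored later
old <- options(OutDec = ",")
with_comma <- format(3.75)   # formatted using the comma
print(with_comma)

options(old)                 # restore the earlier setting
with_point <- format(3.75)   # formatted using the session's original separator
print(with_point)
```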

2.3.3 Skip Beginning Lines of Data File

skip option: Skip the specified number of lines at the beginning of a data file.

Sometimes a data file begins with comment lines that describe the purpose of the data and each of the variables contained in it. To read the actual data, skip the first specified number of lines in the data file with the skip option.

lessR Input Skip the specified number of lines that begin a data file
> mydata <- Read(skip=6)

In this example, the actual reading of the data begins on Line 7 of the data file. If the first line of information in the file beyond the comments contains the names of the variables, then that is what is read beginning on Line 7.
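The behavior of skip can be checked with a small made-up file. The sketch below uses the standard R function read.csv, which accepts the same skip option; the three comment lines and the data values are invented for the example.

```r
# Made-up csv file that begins with three descriptive lines
lines <- c("Study: made-up example",
           "Years: years employed at the company",
           "Salary: annual salary",
           "Years,Salary",
           "7,53788.26",
           "12,94494.58")
path <- tempfile(fileext = ".csv")
writeLines(lines, path)

# Skip the 3 descriptive lines; reading begins on Line 4, the variable names
d <- read.csv(path, skip = 3)
print(d)
```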

2.3.4 Fixed Width Data

csv data format, Section 1.6.4, p. 24

text file, p. 19

fixed width format: All data values for a variable occur in specified columns.

A csv file, where a comma or other specified character delimits adjacent data values, is a text file, a file of just plain alphabetical characters and the digits, and common punctuation characters such as a comma. A text file is a universal format, accessible to virtually all computer applications. Another kind of text file is a fwd file, the fixed width format, where the data values for each variable conform to a specified fixed number of columns in the data file.

Mach IV data, Listing 1.8, p. 27

The previously introduced example of a fwd formatted text file is the Mach IV data set. This file contains 25 digits for each row, the data for a single respondent. These data values for each row consist of a four-digit ID, a Gender column, and then 20 digits, the responses of each respondent to the 20 items on the Mach IV scale. Unlike a csv file, no delimiter separates the adjacent data values. There also are no variable names in the first row of the data table because the names do not fit into the allocated column or columns for the corresponding data.

Scenario Read fixed width formatted data into R
The Mach IV data consist of responses to each of the 20 Mach IV items plus Gender, stored in a fixed width format, one column per response. Read this data table into R.

widths option: Specify the widths of the columns for the data values in each row of data.

The primary tool to specify the widths of the fields for fwd formatted data is the R widths option, which, like most of the options for R read functions such as read.table, also applies to the lessR function Read. We will also use other functions to reduce the work needed to accomplish this read of the Mach IV data.

rep function: Create a string of characters by repeating a specified set of characters.

First consider the standard R function, rep, for repetition. There are 20 columns for each row of the Mach IV responses, each response of width 1. To read the Mach IV data specify 20 1’s to indicate the 20 column widths, a task best accomplished by rep with two arguments: the number to be repeated, 1, and the number of repetitions, 20, as shown in Listing 2.10. Wherever R encounters rep(1,20), it essentially processes 20 1’s as if they were entered manually, one after the other.

> rep(1,20) [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Listing 2.10 Illustration of the R function rep.
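As a quick check on the arithmetic, rep can be combined with the widths of the four-digit ID and the one-column Gender described earlier. The sketch below uses only standard R; the name widths is chosen for this illustration.

```r
# Field widths: four-digit ID, one-column Gender, then 20 one-column items
widths <- c(4, 1, rep(1, 20))

print(length(widths))  # 22 fields in each row
print(sum(widths))     # 25 columns, matching the 25 digits in each row
```

If the sum of the widths did not equal the number of characters in each row, some data values would be read from the wrong columns.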

to function: Name a sequential set of variables with consecutive numbers prefixed by a given character string.

A second function to assist reading the item responses on a multi-item scale such as the Mach IV is the lessR function to. This function simplifies naming a sequence of consecutive items. In this situation, we may name the first variable, or item, m01, the second item m02, and so forth until m20. The to function specifies this sequence without actually having to enter the name of each item, as shown in Listing 2.11. To use to, specify the prefix of each item within quotes and then the number of the last item. Optionally provide a third argument, the beginning value, otherwise assumed to be one.

> to("m",20)
 [1] "m01" "m02" "m03" "m04" "m05" "m06" "m07" "m08" "m09" "m10" "m11"
[12] "m12" "m13" "m14" "m15" "m16" "m17" "m18" "m19" "m20"

Listing 2.11 Illustration of the lessR function to, which names a sequence of consecutive items.
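For readers curious about how such names can be generated, the following sketch produces the same sequence with standard R functions only. The helper item_names is hypothetical, written for this illustration; it is not the lessR implementation of to and may differ from it in details.

```r
# Hypothetical helper: a prefix followed by zero-padded consecutive numbers
item_names <- function(prefix, until, from = 1) {
  n <- from:until
  width <- nchar(as.character(until))   # pad to the digits of the last item
  paste0(prefix, formatC(n, width = width, flag = "0"))
}

print(item_names("m", 20))
print(item_names("m", 20, from = 18))
```

The zero padding keeps the names in the expected order when sorted alphabetically, with m02 before m10.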

col.names option: Specify the variable names.

The next consideration for reading the data in a fwd formatted file is to name the variables as these names are not included in the data file. Use the R col.names option to specify the corresponding list of variable names, again specified in order of occurrence in the data file. As always, when presenting a list of multiple values with the values in the list separated by commas, use the combine function c to group or combine the values together.

c function, Section 1.3.6, p. 15

The first argument listed in this example is the widths option. The four column ID takes up the first four columns, Gender the next column, and then one column each for the variables m01 to m20. The variables are named with col.names.

lessR Input Read fixed width formatted data with numbered variable names
> mydata <- Read(widths=c(4,1,rep(1,20)),
       col.names=c("ID", "Gender", to("m",20)))

To specify the location of the file directly, such as a web address, this information appears as the first argument passed to Read. For example, insert the specified location before the widths option in the previous example.

> mydata <- Read("http://lessRstats.com/data/Mach4.fwd",
       widths=c(4,1,rep(1,20)),
       col.names=c("ID", "Gender", to("m",20)))

variable labels, Section 2.4, p. 45
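The widths and col.names options can also be tried with the standard R function read.fwf, without lessR and without a web connection. The sketch below reads two made-up rows laid out like the Mach IV file, a four-digit ID, one Gender digit, and 20 item responses; the digits are invented for the example and the item names are built with sprintf rather than with to.

```r
# Two made-up rows: 4-digit ID, 1 Gender digit, 20 one-digit responses
rows <- c(paste0("0100", "0", "45300444214533002112"),
          paste0("0101", "1", "20344525543142035103"))
path <- tempfile(fileext = ".fwd")
writeLines(rows, path)

# Item names m01 to m20, built here with standard R only
items <- sprintf("m%02d", 1:20)

d <- read.fwf(path,
              widths = c(4, 1, rep(1, 20)),
              col.names = c("ID", "Gender", items))
print(dim(d))                       # rows and columns of the data frame
print(d[, c("ID", "Gender", "m01", "m20")])
```

Read this way, ID is treated as a number, so the leading zero of 0100 is not retained.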

2.3.5 Read Data Files Included in lessR

Employee data set, Section 1.6.1, p. 20

The primary purpose of lessR is to provide functions that simplify R data analysis. Also provided are some data sets for analysis so that data are always available on any computer on which lessR has been downloaded. These data files can be read into R with the Read option format set to "lessR".

format="lessR" option: Read a data set included in lessR.

Each lessR data file name begins with the prefix data, followed by the descriptive name that identifies the file. For example, consider the Employee data set dataEmployee. This data set can be read with the format option set to "lessR".

lessR Input Read data internally from lessR
> mydata <- Read("Employee", format="lessR")

The same format applies to any of the other internal lessR data sets, such as the Machiavellianism data set Mach4. These built-in data files were created by first reading the corresponding text data files into R and then writing the resulting data frames as native R files with the rda file type. The format="lessR" option instructs the Read function to read the corresponding rda file stored within the lessR package into the current mydata data frame. The usual Read output that provides information regarding the data table is available, and because mydata is the data frame name, the data table is automatically accessed by the lessR data analysis routines for analysis.
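The write-then-read cycle for a native R file can be sketched with the standard R functions save and load. This is only an illustration of the rda file type under the assumption of a small made-up data frame and a temporary file; it is not the procedure used to build the lessR package itself.

```r
# Made-up data frame standing in for a data table read from a text file
dataExample <- data.frame(Years = c(7, 12),
                          Salary = c(53788.26, 94494.58))

# Write the data frame as a native R file with the rda file type
path <- tempfile(fileext = ".rda")
save(dataExample, file = path)

# Read it back into a separate environment; load returns the object names
e <- new.env()
restored <- load(path, envir = e)
mydata <- e[[restored]]

print(restored)
print(mydata)
```

An rda file stores the data frame exactly as R holds it, including variable types, so no sep, dec, or header options are needed when it is read back.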

2.3.6 Read SPSS Native Data

.sav filetype: A native SPSS data file.

The function Read can read data files written from the statistical package SPSS in their native format. If the data file’s file type is the usual .sav for an SPSS file, then Read automatically detects this attribute and sets the format option to SPSS. This option can also be set manually if the data file is a native SPSS file without the usual .sav file type.

All of the data sets analyzed in this book, including those included in lessR, are available in SPSS format at the lessR website.

http://lessRstats.com/data/SPSS

With these data files an SPSS user can run parallel analyses in SPSS and R on the same data to compare the ease of use and output from SPSS and R/lessR. The data frame mydata created from reading the SPSS data file also contains any variable labels that are present in the native SPSS .sav file. The concept of variable labels is discussed in a following section.

2.3.7 Read Data Directly from an Excel File

Worksheet representation of data, Section 1.6.2, p. 21

csv data file, Section 1.6.4, p. 24

install.packages, Section 1.2.3, p. 5

library function, Section 1.3.1, p. 6

A worksheet application such as Excel, as previously discussed, is an excellent way to enter and store data into a data table. To analyze this data, one strategy is to save the data as a csv text file. Another option is to read the data directly from the Excel file using, for example, the read.xlsx function from the xlsx package (Dragulescu, 2012). The reason why this option has not been promoted as the primary option is that there is an additional issue that must be addressed when reading data directly from Excel: some software in addition to R must be installed. The read.xlsx function relies upon java. So if java is properly installed, and the xlsx package is installed and then loaded with the library function, the following example reads data from the first worksheet from the specified Excel file stored on the lessR web server.

> read.xlsx("http://lessRstats.com/data/Employee.xlsx", sheetIndex=1)

With read.xlsx there is no need to first convert the data to the csv format before reading into R. The problem is that the installation of java is not necessarily straightforward. One concern is that many consider java to be a security risk. Another issue is that there are 32-bit and 64-bit versions of java, just as there are 32-bit and 64-bit versions of R, and these architectures must be aligned. For example, if there is only a 32-bit version of java installed, and the 64-bit version of R is run, then the xlsx package will not work correctly. A related problem is that some applications other than R that depend on one version of java may cease to work correctly after installation of the second version. Yet another issue is that the company that provides the java software makes money when a particular search toolbar is installed along with the java software. There is considerable effort made to encourage the user to install this toolbar, and only careful reading of the prompts during the installation process will avoid the usually unwanted installation.

details function, Section 2.2.3, p. 34

If these issues are addressed, then read.xlsx works well in reading data directly from an Excel file. The worksheet must be specified explicitly even if the Excel file contains only one worksheet. To do this, specify either the sheetIndex parameter, as in the previous example, or the sheetName parameter that specifies the name of the desired worksheet enclosed in quotes. To obtain the same feedback that the Read function provides regarding the newly created data frame, manually invoke a call to the lessR function details.