
2.2 Read Data

As a data analysis project progresses through various stages over time, the data table can simultaneously exist in several different forms. The data can be stored indefinitely on the computer in many different file formats, such as a universally readable text data file in csv format. The data can exist in a worksheet format specific to an application such as from MS Excel or LibreOffice Calc. The data table can also exist within an R session, what is called a data frame, and which can be saved as a computer file in native R format. As shown throughout this chapter, the data table can be easily transferred back and forth between these various forms.

csv format, Section 1.6.4, p. 24

data frame: A data table within an R session, ready for analysis.

2.2.1 Read Text, R or SPSS Data Files

To begin data analysis, usually first read the data from a data file into R. Except for two specific file types, the default format of the external data file is a text file with adjacent data values separated by either commas (csv) or tabs (tab-delimited). Data files created from within R are identified by the usual .rda file type for R data, and data files from the SPSS data analysis system by the usual .sav file type. Other file types and data formats are also accommodated when specific options are activated, as explained throughout this chapter.
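The file formats named above can be tried out with standard R functions. The following is a minimal sketch that uses base R rather than the lessR function Read. The data values, the object name emp, and the temporary file names are made up for illustration. Reading an SPSS .sav file requires an add-on package and is not shown here.

```r
# A small, made-up data table (not the Employee data)
emp <- data.frame(Years = c(7L, 2L), Dept = c("ADMN", "SALE"))

# Temporary files for each format
csv_file <- tempfile(fileext = ".csv")
tab_file <- tempfile(fileext = ".txt")
rda_file <- tempfile(fileext = ".rda")

# Write the same table as csv, tab-delimited text, and native R format
write.csv(emp, csv_file, row.names = FALSE)
write.table(emp, tab_file, sep = "\t", row.names = FALSE)
save(emp, file = rda_file)

# Read each file back into R
from_csv <- read.csv(csv_file)     # comma-separated values
from_tab <- read.delim(tab_file)   # tab-delimited values

# load() restores the saved object under its original name, here emp;
# a separate environment keeps it apart from the emp defined above
restored <- new.env()
load(rda_file, envir = restored)

from_csv
from_tab
restored$emp
```

All three reads return the same two-row data table, which is the sense in which the data can be transferred back and forth between forms.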

Scenario: Locate and read the data into R
Browse your file system to locate a comma or tab-delimited text data file, or R or SPSS data file, then read the data in the file into R for subsequent analysis. Display relevant characteristics of the data to help guide this analysis.

Read/Write Data 33

Invoke the lessR function Read with the simplest possible function call, that is, with no arguments, to browse interactively for a file and then read the corresponding data. To browse for a file means that the usual window provided by your operating system automatically opens that allows you to locate the file within a specific directory (folder) on your computer or network. Then either double-click on the file name or click on the file name and then click on the Open button.

browse for a file: Navigate the file system to locate a file.

Once the file is identified, the Read function reads the data from the file. The data values that are read from an external file are usually directed into a designated internal R storage container for a data table, what R calls a data frame. Any valid R name for the data frame can be specified, though mydata is the usual choice. The R assignment expression, <-, indicates the object into which the data values are stored.

assignment statement, Section 1.3.3, p. 9

Read function: Read data from a file into an R data frame.

lessR Input  Default read of csv or SPSS or R data from a local file
> mydata <- Read()
or
> mydata <- rd()

Most lessR functions can also be referenced with an abbreviated name, such as rd for the function Read. Either form can be used: the full name to explicitly indicate the task performed by the function, or an abbreviation to minimize keystrokes. Also, R pays attention to capitalization, so be careful to do so as well. There are standard R functions based on the spelling read with a lowercase r, but these refer to something other than the lessR function Read with the uppercase R.

Alternatively, locate the data file with the file’s path name or full web address. List this name as the first argument within quotes in the call to the Read function. Here employee.csv is a csv data file stored in the data directory at the lessR website.

lessR Input  Read data from a file on the web
> mydata <- Read("http://lessRstats.com/data/employee.csv")

For this particular file, another possibility is to read the employee.csv data directly from within lessR itself as it is included in the lessR installation.

read data from lessR, Section 1.3.3, p. 10

Each lessR data analysis routine assumes, by default, that the data frame for which to apply the corresponding analysis is named mydata. So mydata does not need to be explicitly designated by a lessR data analysis function (by providing a value for the data option). All of the many, many standard R data analysis routines are also available to analyze the data in the mydata data frame. These procedures require the explicit statement of the name mydata, such as with the $ notation, because they make no assumption regarding the data frame name.

data option, Section 1.3.3, p. 13

$ notation, Section 1.3.5, p. 14
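To illustrate the explicit reference that standard R functions require, here is a minimal sketch with a small, made-up data frame. The variable names follow the Employee data, but the values are invented for this example.

```r
# Made-up values; in practice mydata would come from a call to Read
mydata <- data.frame(
  Years  = c(7L, 2L, 15L),
  Salary = c(41000.50, 52250.00, 60500.25)
)

# Standard R functions need the data frame named explicitly,
# here with the $ notation: dataframe$variable
mean_salary <- mean(mydata$Salary)
max_years   <- max(mydata$Years)

mean_salary
max_years
```

A call such as mean(Salary), without the data frame name, would fail because R would look for a free-standing object named Salary.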

2.2.2 Types of Variables in R

The data analyst has a conceptual understanding of how the data values for a variable are structured. Are they numeric or non-numeric? What is their valid response range? The analysis of the data for the variables is done on the computer, so how the data are conceptually defined should align with the way in which they are stored within the computer, here within an R data frame.


Accordingly, when the data values are read into any data analysis application such as R, their structure and content as read by R should be examined and verified before their analysis begins. The analyst should confirm that the data were read correctly and are properly represented in the resulting data frame. Many things can go wrong. Perhaps errors occurred as the data values were entered into the data file. Perhaps the data values were not correctly read into R. Perhaps there is too much missing data to permit a meaningful analysis.

The data storage type is how the computer physically represents a data value in its memory. The storage type should match the conceptual definition of the variable. The common R data storage types for numeric variables are type integer, for numbers without decimal digits, and what R calls numeric, for numbers with decimal digits. 1

data storage type: How the data values of a variable are physically stored in the computer.

Categorical variables usually have fewer than 10 or so unique values, and perhaps as few as two, such as for Gender. For a categorical variable, different data storage types could be used both in the data file that stores the data and in the way that R represents the data within a data processing session, its data frame. For example, the data values can be integers with a different integer assigned to each level, that is, value, of the categorical variable, such as 0 for Male and 1 for Female. Or, the values could be stored as non-numeric characters, what R calls type character, such as M for Male and F for Female. R, however, has a separate data storage type expressly dedicated to represent categorical variables. This storage type is a factor, which combines these two representations such that the data are internally stored as integers but displayed in terms of descriptive labels.

factor: R data storage type for categorical variables.

When R reads the data from an external file it scans the characters that form the data values for each variable. If it finds only digits for the data values of a variable, R defines the variable as integer. If only digits are found and at least one data value has a decimal point, the variable is defined as numeric. Based on the data in Figure 1.7 on p. 21, after R reads the data values, the variable Years is represented in the data frame as integer and Salary as numeric.

When R reads the data and finds any non-digit character except a decimal point in any data value for a variable, then R by default concludes that the variable cannot be numeric and so must be categorical. Accordingly, R defines the variable as a factor. From Figure 1.7 on p. 21, the data values for Gender, Dept and Satisfaction consist of non-numeric characters, and so each is interpreted as a factor. 2
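These rules can be checked on a few lines of text. The following is a minimal sketch that uses base R read.csv instead of the lessR function Read, with made-up data values. One assumption needs to be stated plainly: since R 4.0.0, read.csv leaves non-numeric columns as type character unless stringsAsFactors = TRUE is set, so that option is set here to reproduce the conversion to a factor described above.

```r
# Made-up rows; only the column names follow the Employee data
csv_text <- "Years,Salary,Gender
7,41000.50,M
2,52250.00,F
15,60500.25,M"

# stringsAsFactors = TRUE converts non-numeric columns to factors
emp <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Storage type assigned to each variable
types <- sapply(emp, class)
types

# A factor is stored as integer codes and displayed with its labels
levels(emp$Gender)
as.integer(emp$Gender)
```

Years contains only digits and is read as integer, Salary contains decimal points and is read as numeric, and Gender contains letters and becomes a factor.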

These coding interpretations imply that numeric data values in a data file should not include commas or dollar signs. Otherwise R would interpret the resulting variable as a factor, presuming an underlying categorical variable. Then its data values would not be amenable to any numerical statistical operation such as the computation of their mean. If the data values are in an Excel or other worksheet, remove the commas and dollar signs in the formatting before saving the data as a csv data file for analysis with R.

save a csv file, Section 1.6.4, p. 24
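The effect of such formatting can be seen directly. The sketch below uses base R with made-up values and, as an alternative to fixing the worksheet, strips the offending characters after the read with the standard R function gsub. This cleanup step is an illustration only and is not part of the procedure described in the text.

```r
# Made-up values formatted as currency, as a worksheet might export them
csv_text <- 'Name,Salary
"A","$41,000.50"
"B","$52,250.00"'

raw <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# The dollar signs and commas prevent a numeric interpretation
salary_class <- class(raw$Salary)
salary_class

# Remove every $ and comma, then convert the remaining text to numbers
raw$Salary <- as.numeric(gsub("[$,]", "", as.character(raw$Salary)))

mean(raw$Salary)
```

Removing the formatting in the worksheet before saving the csv file, as recommended above, avoids the need for this extra step.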

2.2.3 Output

Unless the option quiet is set to TRUE, the function Read displays a summary of the data frame before proceeding with the subsequent data analysis. The output from Read is from an internal call to the lessR function details. This function can be manually called at any time after a data table has been read into R with the specified data frame name. The default data frame is mydata.

details function: Provide many details about a data table.

The first part of the output, shown in Listing 2.1, displays the dimensions of the data table. The next section of output is a summary of the row names, presented here in Listing 2.2.

Basics
-------------------------------------------------------
Number of Variables in mydata: 7
Number of Rows of Data in mydata: 37

Listing 2.1 Initial output of Read.

Every data frame, the R container for the data table, always has a unique identifier or ID for each row in the data table, a row name. This ID can be assigned from a column of the data table. Or, the ID can be assigned by R, in which case the IDs are just the integers from 1 to the last row of data, as in Listing 2.2.

ID field, Section 2.2.6, p. 37

Row Names
---------------------------------------
First two row names: 1 2
Last two row names: 36 37

Listing 2.2 The row names of the data frame are the consecutive integers from 1 to 37.

A description follows of each of the variables that have been read into R, including the data storage type. A brief dictionary of common data storage types is first presented, illustrated in Listing 2.3.

Variable Names and Types of the Values Read
-------------------------------------------------------------------
factor: Non-numeric categories, which, as read here, are unordered
integer: The values are numeric and integers
numeric: The values are numeric with decimal digits

Listing 2.3 Some interpretations R makes of a variable.

Read then displays the variable names, as shown in the first column of Listing 2.4. Pay particular attention to the variable names, including the pattern of capitalization, because it is by its name that the variable is referenced in any subsequent data analysis. These names appear in the first column, under the heading Variable.

variable name: The reference for a variable in subsequent data analysis.

To help validate that the data table was read correctly, Read lists some of the first and last values of each variable. It is recommended to match some of these data values against the contents of the file from which the data were read. Before beginning the subsequent data analyses, make sure that you are analyzing the data you intend to analyze.

                               Missing  Unique
Variable       Type     Values  Values  Values   First and last values
-----------------------------------------------------------------------------
Years          integer      36       1      16   7 NA
Gender         factor       37       0       2   M M M ... F F M
Dept           factor       36       1       5   ADMN SALE ... SALE FINC
Salary         numeric
Satisfaction   factor       35       2       3   med low ... low high
HealthPlan     integer

Listing 2.4 The summary of the variables read into the data frame mydata with the variable names listed in the first column.


The data value of Years for the second row of data, the data for James Wu, is NA, R's missing data code for numeric data, which, for non-numeric data, is displayed as <NA>. The NA means "Not Available".

NA: Data value that is missing.

2.2.4 Display an R Object

To display the contents of any R object in its entirety, simply enter the name of the object in response to the command prompt.

R Input  Display the contents of an object such as the data frame mydata
> mydata

Enter the name of the data frame, mydata, to display all the data. Entering just the name of the R object, such as a data frame, calls the print function. Explicitly invoke the print function to achieve more control of the listing.

print function: Display the contents of an R object at the console.

> print(mydata)

See ?print for more specifics.

List just the first or last lines of the data frame with the R head and tail functions.

head, tail functions: List the first or last lines of the specified object.

R Input  List the contents of the first and last six rows of the data frame
> head(mydata)
> tail(mydata)

The R head and tail functions are convenient for checking the form of the data without listing all rows of the data frame. The default number of lines listed can be changed from 6 with the n option.

n option: Specify the number of lines to display with the head and tail functions.
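As a brief illustration of the n option, the following sketch builds a made-up data frame with 37 rows, the same number of rows as the Employee data, and lists only a few of them. The variable names Row and Score are invented for this example.

```r
# Made-up data frame with 37 rows
mydata <- data.frame(Row = 1:37, Score = 101:137)

# First three rows instead of the default six
first_rows <- head(mydata, n = 3)

# Last two rows instead of the default six
last_rows <- tail(mydata, n = 2)

first_rows
last_rows
```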

2.2.5 Missing Values

Another issue is a consideration of missing data values. By default, Read interprets a data value as missing when it is actually missing, corresponding to a blank cell in a worksheet. Many if not most data sets have at least some missing data values. Via the details function, Read provides a comprehensive analysis of missing values, which includes how many values are missing for each row of data or case (observation), and how many are missing for each variable. The variable summary provides the number of missing data values for each variable, illustrated in Listing 2.4, under the column Missing Values. The complementary information, the number of non-missing values, is presented under the previous column, Values. The sum of these two numbers for any variable is the total number of rows of data, here 37. For example, the variable Satisfaction has two missing data values and 35 values that exist.

missing data, Section 1.6.5, p. 25


Read also reports the number of missing values for each case, also referred to as an observation. Identify each observation by its row name. Find this report for the Employee data set in Listing 2.5.

case or observation, Section 1.6.1, p. 21

Missing Data Analysis
--------------------------------------------------
n.miss Observation

Total number of cells in data table: 259
Total number of cells with the value missing: 4

Listing 2.5 Number of missing data values for each row of data, that is, case.

From this analysis we see that the employee with the most missing data is the person listed in the fourth row of the data table. Here we see an advantage of more appropriately identifying the names of each person as an ID field. Then each person's name would be visible instead of his or her row number in this output.

row names, Section 2.2.6, p. 37
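The same kinds of counts can be obtained with standard R functions. The following is a minimal sketch in base R with a made-up data frame. It is not the code that Read or details runs internally, only an illustration of what the counts mean.

```r
# Made-up data with three missing values
mydata <- data.frame(
  Years        = c(7L, NA, 15L, 2L),
  Dept         = c("ADMN", "SALE", NA, "FINC"),
  Satisfaction = c("med", "low", NA, "high")
)

# is.na() marks each cell as missing (TRUE) or not (FALSE)
miss_by_variable <- colSums(is.na(mydata))   # count for each variable
miss_by_row      <- rowSums(is.na(mydata))   # count for each row (case)

total_cells   <- nrow(mydata) * ncol(mydata)
total_missing <- sum(is.na(mydata))

miss_by_variable
miss_by_row[miss_by_row > 0]   # list only rows with missing data
total_cells
total_missing
```

Here the third row has the most missing data, with two of its three values missing.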

R's code for missing data is NA, which indicates Not Available. Each NA code in the R data frame corresponds to a blank cell in a corresponding worksheet. R also makes it possible for missing data in the data file to be represented by codes instead of just leaving the cell blank.

example worksheet with missing data, Figure 1.7, p. 21

Scenario: Missing value codes in the data file
The data file of interest stored on the computer system represents missing data with one or more designated codes, usually of data values that would not naturally occur, such as -99 for a numerical data value that only has positive values and XX for character data values. Read the data into R and interpret these codes to represent missing data.

Use the Read missing option to inform R as to what values define a missing value.

missing option: Designated data value to indicate missing data.

lessR Input  Read data with both specified character and numeric missing data codes
> mydata <- Read(missing=c("XX",-99))

When presenting a list of multiple values with the values in the list separated by commas, always use the combine function c to group the values together. The result of this specification is that every XX and every -99 that exists in the data file read into R is replaced with an NA in the resulting R data frame, here mydata.

c function, Section 1.3.6, p. 15
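For comparison, base R offers a similar facility. The sketch below uses the na.strings option of the standard R function read.csv, which plays a role comparable to the missing option of Read. The data values are made up, and this is an illustration with base R, not a description of how Read works internally.

```r
# Made-up data in which -99 and XX stand for missing values
csv_text <- "Name,Years,Dept
A,7,ADMN
B,-99,SALE
C,15,XX"

# Every field that matches one of the listed codes is read as NA
mydata <- read.csv(text = csv_text, na.strings = c("XX", "-99"))

mydata
sum(is.na(mydata))   # number of cells converted to NA
```

With the code -99 converted to NA, the variable Years is still read as integer, so its mean and other summaries are not distorted by the code.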

2.2.6 Row Names

By default R numbers each row of the data frame with a row number. An alternative is to have R label each specific row of data, a case, with a unique label from that row of data. Refer back to the Employee data table. The names in the first column are not values to be analyzed, but rather ID values that uniquely identify each row. Many different R analyses identify data by its row identifier, such as labeling points in a scatterplot, where an ID value from the data provides a more meaningful label than a row number.

Employee data table, Figure 1.7, p. 21

The Read function provides some guidance regarding the implementation of row names, as shown in Listing 2.6. If Read detects in the data file a column of non-numeric data with unique values, then that column is noted as a potential ID column.

For the following 'variable', each row of data is unique. Perhaps
these values specify a unique ID for each row. To implement, re-read
with the following setting added to your Read statement: row.names=1
----------------------------------------------------------------------
Name

Listing 2.6 A note from Read suggesting a possible ID column.

Read provides useful information when it recognizes a potential ID field.

Scenario: Read the data and identify a column of data as the row names
One column in the data table of interest consists of unique names, that is, each name uniquely identifies a specific row. Read the data into R and identify this column as row names to identify specific data values in the R output.

Use the row.names option in the Read statement to indicate the column number that contains the IDs. To interactively browse for this data file on the local file system or network, indicated by not specifying a file name, include here only the row.names option.

lessR Input  Read data and assign the first column as row IDs
> mydata <- Read(row.names=1)

The following example applies when the row names are in the first column of the data file and the file name and location are specified in the call to Read.

> mydata <- Read("http://lessRstats.com/data/employee.csv", row.names=1)

The result of specifying the row names in the first column is that R, correctly, no longer regards the first column of information as a variable, as shown in Listing 2.7. As opposed to the previous Read output, now only 6 variables are recognized in the analysis instead of the previously listed value of 7 from Listing 2.1.

Basics
--------------------------------------------------------
Name of data frame that contains the data: mydata
Number of Variables in mydata: 6
Number of Rows of Data in mydata: 37

Listing 2.7 Initial output of Read, here after the first column of the data file is specified as the row names.


If a column of the data table is not a variable but an ID field, then R should be properly informed of this structure. R will then use this information to enhance the quality of its output, as in Listing 2.8. The name of each employee is now listed instead of the row number.

Missing Data Analysis
----------------------------------------
n.miss Observation
     1 Wu, James
     2 Jones, Alissa
     1 Korhalkar, Jessica

Listing 2.8 Row-wise missing value analysis after the first column of the data file is specified as the row names.

One meaningful constraint regarding the row names is that they should be unique for each row of data. If not, the read is not successful. Nor should the read be allowed to be successful without unique IDs because confusion would result from linking two or more rows of data to the same ID.
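Both the effect of row names and the uniqueness constraint can be demonstrated with base R. The following is a minimal sketch that uses the standard R function read.csv, which has its own row.names option. The names and data values are made up for this example.

```r
# Made-up data with a column of unique names in the first column
csv_text <- 'Name,Years,Dept
"Doe, Jane",7,ADMN
"Roe, Sam",2,SALE
"Poe, Lee",15,FINC'

# The first column becomes the row names, not a variable
mydata <- read.csv(text = csv_text, row.names = 1)
names(mydata)      # only Years and Dept remain as variables
rownames(mydata)

# A row can now be referenced by its ID
mydata["Roe, Sam", "Years"]

# The same name in two rows: the read fails with an error
dup_text <- 'Name,Years
"Doe, Jane",7
"Doe, Jane",2'
result <- tryCatch(
  read.csv(text = dup_text, row.names = 1),
  error = function(e) e   # capture the error instead of stopping
)
inherits(result, "error")
```

The second read does not produce a data frame because the two rows share the same ID.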

2.2.7 Categorical Variables

A primary distinction among variables is the distinction between continuous and categorical variables. A categorical variable has relatively few unique data values, called levels. For example, the variable Gender typically has two values, Male and Female.

variable types, Section 1.6.3, p. 22

levels: Values of a categorical variable.

Another consideration is how the variable is stored on the computer. A potential confusion is that the values of categorical variables are not numeric, yet they can be stored as numbers, usually integers, on the computer. A common example is that the values of Gender may be coded numerically, such as a 0 and 1, as in the Mach IV data set. Or, for the Employee data set, each employee's choice of a health plan is coded as a 1, 2, or 3 instead of the actual names of the health plans.

computer storage, Section 2.2.2, p. 33

Mach IV data, Listing 1.8, p. 27

The computer program, R or anything else, cannot know without additional information if the numeric values are measurements of a continuous variable, or if they are numeric codes for non-numeric categories. Fortunately, there is a clue that suggests that a variable is categorical: only a small number of integer values. Read can check for this criterion, with a system parameter defined by lessR called n.cat, an abbreviation for "number of categories", which defines the maximum number of unique integer values for which to consider a variable as categorical.

Employee data set, Section 1.6.1, p. 20

n.cat option: The maximum number of unique values to consider a numeric variable as categorical.

By default, n.cat is turned off, set to 0, but can be specified to any value in one of two ways. A value of n.cat such as 4 can be passed to a relevant lessR data analysis function, such as for summary statistics, so that the value is applicable just for that specific analysis. An analysis of summary statistics, for example, of an integer variable in this situation with 4 or fewer unique values yields a frequency table rather than numeric summaries such as the mean. Or, a value can be set for all subsequent analyses with the lessR function set.

set function, Section 1.4.1, p. 16

lessR Input  Set maximum number of unique values to interpret as categorical
> set(n.cat=4)


For example, as shown in Listing 2.9, having set n.cat at the value of 4 defines, for all subsequent analyses, all numerical variables with 4 or fewer unique values to be interpreted as categorical.

Each of these variables is numeric, but has less than or equal 4 unique values.
If these variables are categorical consider to transform each variable into a
factor with the Transform and factor functions. To see examples enter: > ?trans
Or, specify a value for n.cat, such as: > set(n.cat=4)
------------------------------------------------------------------------
HealthPlan

Listing 2.9 The variable HealthPlan is likely categorical even though coded numerically.
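The criterion itself, a numeric variable with only a few unique integer values, is simple to express. The following is a minimal sketch in base R. The function looks_categorical and the object n_cat are hypothetical names introduced for this illustration, the data values are made up, and the code is not taken from lessR.

```r
# Hypothetical helper: TRUE if x is numeric, holds only whole numbers,
# and has no more than n_cat unique non-missing values
looks_categorical <- function(x, n_cat) {
  values <- unique(x[!is.na(x)])
  is.numeric(x) && all(values == round(values)) && length(values) <= n_cat
}

# Made-up data; the variable names follow the Employee data
mydata <- data.frame(
  HealthPlan = c(1L, 3L, 2L, 1L, 2L, 3L),
  Years      = c(7L, 2L, 15L, 4L, 10L, 1L),
  Salary     = c(41000.50, 52250.00, 60500.25, 38000.00, 47250.75, 55000.00)
)

# Apply the check to every variable with a threshold of 4
flags <- sapply(mydata, looks_categorical, n_cat = 4)
flags
```

Only HealthPlan, with its three unique integer values, satisfies the criterion in this example.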

A more formal solution to the issue of a categorical variable represented as type integer is to invoke the R function factor to define the variable as an R factor, R's variable type specifically designed for categorical variables.

factor function, Section 1.6.3, p. 22
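A minimal sketch of this conversion follows, using the standard R function factor with its levels and labels options. The data values are made up, and the labels PlanA, PlanB and PlanC are hypothetical placeholders because the actual names of the health plans are not given here.

```r
# Made-up integer codes for the health plan
mydata <- data.frame(HealthPlan = c(1L, 3L, 2L, 1L))

# Convert the integer codes to a factor with descriptive labels;
# the labels are placeholders chosen for this illustration
mydata$HealthPlan <- factor(
  mydata$HealthPlan,
  levels = 1:3,
  labels = c("PlanA", "PlanB", "PlanC")
)

class(mydata$HealthPlan)
table(mydata$HealthPlan)   # frequencies listed by label
```

Once defined as a factor, the variable is treated as categorical by R functions regardless of any setting for n.cat.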

2.2.8 No Text Output

Presumably the output of the Read function to the console is generally useful. However, if this information is not needed, such as when a data set is re-read at a later time for re-analysis, then the console output can be suppressed.

quiet option, Section 1.3.5, p. 14

lessR Input  Suppress all output from Read
> mydata <- Read(quiet=TRUE)

This option applies to all the lessR functions that provide text feedback at the console, often where the primary task is to generate output elsewhere, usually to a graphics window or to a data structure such as mydata. The value of quiet can also be set at the system level with the lessR function set.

set function, Section 1.4.1, p. 16