Software Tools

1.8 Software Tools

There are many software tools for statistical analysis, covering a broad spectrum of possibilities. At one end we find “closed” products where the user can only perform menu operations. SPSS and STATISTICA are examples of “closed” products. At the other end we find “open” products allowing the user to program any arbitrarily complex sequence of statistical analysis operations. MATLAB and R are examples of “open” products providing both a programming language and an environment for statistical and graphic operations.

This book explains how to apply SPSS, STATISTICA, MATLAB or R to solving statistical problems. The explanation is guided by solved examples, where we usually use one of the software products and indicate (in specific “Commands” frames) how to use the others. We use the releases SPSS STATISTICA 7.0, MATLAB 7.1 with the Statistics Toolbox and R 2.2.1 for the Windows operating system; there is usually no significant difference when using another release of these products (especially a more recent one), or when running them on non-Windows platforms. All book figures obtained with these software products are presented in greyscale, therefore sacrificing some of the original display quality.

The reader must bear in mind that the present book is not intended as a substitute for the user manuals or on-line help of SPSS, STATISTICA, MATLAB and R. However, we do provide the essential information and guidance on how to use these software products, so that the reader will be able to run the examples and follow the material with minimum effort. As a matter of fact, our experience in using this book as a teaching aid is that these explanations are usually sufficient for solving most practical problems. Besides user manuals and on-line help, readers interested in deepening their knowledge of particular topics may also find it profitable to consult the specific bibliography on these software products mentioned in the References. In this section we limit ourselves to describing a few basic aspects that are essential for a first hands-on session.

1.8.1 SPSS and STATISTICA

SPSS from SPSS Inc. and STATISTICA from StatSoft Inc. are important and popular menu-driven software products for window environments, with user-friendly, interactive facilities for data editing, representation and graphical support. Both products require minimal time for familiarization and allow the user to easily perform statistical analyses using a spreadsheet-based philosophy for operating with the data.

The two products have much in common, starting with the menu bars shown in Figures 1.8 and 1.9, which offer individual options to manage files, to edit the data spreadsheets, to manage graphs, to perform data operations and to apply statistical analysis procedures.

Concerning flexibility, both SPSS and STATISTICA provide command language and macro construction facilities. As a matter of fact STATISTICA is close to an “open” product type, since it provides advanced programming facilities such as the use of external code (DLLs) and application programming interfaces (API), as well as the possibility of developing specific routines in a Basic-like programming language.

In the following we use courier type font for denoting SPSS and STATISTICA commands.

1.8.1.1 SPSS

The menu bar of the SPSS user interface is shown in Figure 1.8 (with the data file Meteo.sav in current operation). The contents of the menu options (besides the obvious Window and Help), are as follows:

File: Operations with data files (*.sav), syntax files (*.sps), output files (*.spo), print operations, etc.

Edit: Spreadsheet edition.

View: View configuration of spreadsheets, namely of value labels and gridlines.

Data: Insertion and deletion of variables and cases, and operations with the data, namely sorting and transposition.

Transform: More operations with data, such as recoding and computation of new variables.

Analyze: Statistical analysis tools.

Graphs: Operations with graphs.

Utilities: Variable definition reports, running scripts, etc.

Besides the menu options there are alternative ways to perform some operations using icons.

Figure 1.8. Menu bar of SPSS user interface (the dataset being currently operated is Meteo.sav).

1.8.1.2 STATISTICA

The menu bar of the STATISTICA user interface is shown in Figure 1.9 (with the data file Meteo.sta in current operation). The contents of the menu options (besides the obvious Window and Help) are as follows:

File: Operations with data files (*.sta), scrollsheet files (*.scr), graphic files (*.stg), print operations, etc.

Edit: Spreadsheet edition, screen catching.

View: View configuration of spreadsheets, namely of headers, text labels and case names.

Insert: Insertion and copy of variables and cases.

Format: Format specifications of spreadsheet cells, variables and cases.

Statistics: Statistical analysis tools and STATISTICA Visual Basic.

Graphs: Operations with graphs.

Tools: Selection conditions, macros, user options, etc.

Data: Several operations with the data, namely sorting, recalculation and recoding of data.

Besides the menu options there are alternative ways to perform a given operation using icons and key combinations (using underlined characters).

Figure 1.9. Menu bar of STATISTICA user interface (the dataset being currently operated is Meteo.sta).

1.8.2 MATLAB and R

MATLAB, a mathematical software product from The MathWorks, Inc., and R (R: A Language and Environment for Statistical Computing) from the R Development Core Team (R Foundation for Statistical Computing, Vienna, Austria, ISBN 3-900051-07-0), a free software product for statistical computing, are popular examples of “open” products. R can be downloaded from the Internet URL http://www.r-project.org/. This site explains the R history and indicates a set of URLs (the so-called CRAN mirrors) that can be used for downloading R. It also explains the relation of the R programming language to other statistical processing languages such as S and S-Plus.

Performing statistical analysis with MATLAB and R gives the user complete freedom to implement specific algorithms and perform complex custom-tailored operations. MATLAB and R are also especially useful when the statistical operations are part of a larger project. For instance, when developing a signal or image classification project one may have to first compute signal or image features using specific MATLAB or R toolboxes, followed by the application of appropriate statistical classification procedures. The penalty to be paid for this flexibility is that the user must learn how to program with the MATLAB or R language. In this book we restrict ourselves to presenting the essentials of MATLAB and R command-driven operations and will not enter into programming topics.

We use courier type font for denoting MATLAB and R commands. When needed, we will clarify the correspondence between the mathematical and the software symbols. For instance MATLAB or R matrix x will often correspond to the mathematical matrix X.

1.8.2.1 MATLAB

MATLAB command lines are written with appropriate arguments following the prompt, », in a MATLAB console as shown in Figure 1.10. This same Figure illustrates that after writing down the command help stats (ending with the “Return” or the “Enter” key), one obtains a list of all available commands (functions) of the MATLAB Statistics Toolbox. One could go on and write, for instance, help betafit, getting help about the betafit function.

Figure 1.10. The command window of MATLAB showing the list of available statistical functions (obtained with the help command).

Note that MATLAB is case-sensitive. For instance, Betafit is not the same as betafit. The basic data type in MATLAB, and the one that we will use most often, is the matrix. Matrix values can be directly typed in the MATLAB console. For instance, the following command defines a 2×2 matrix x with the typed-in values:

» x=[1 2
3 4];

The “=” symbol is the assignment operator. The symbol “x” is the matrix identifier. Object identifiers in MATLAB can be arbitrary strings not starting with a digit, with the exception of reserved MATLAB words.

Indexing in MATLAB is straightforward, using parentheses as index qualifier. Thus, for example, x(2,1) is the element in the second row and first column of x, with value 3.

A vector is just a special matrix that can be thought of as a 1×n (row vector) or as an n×1 (column vector) matrix.

MATLAB allows the definition of character vectors (e.g. c=['abc']) and also of vectors of strings. In this last case one must use the so-called “cell array”, which is simply an array whose elements can hold arbitrary objects. Consider the following sequence of commands:

» c=cell(1,3);
» c(1,1)={'Pmax'};
» c(1,2)={'T80'};
» c(1,3)={'T82'};
» c
c =
    'Pmax'    'T80'    'T82'

The first command uses function cell to define a cell array with 1×3 objects. These are afterwards assigned some string values (delimited with '). When printing the c values one gets the confirmation that c is a row vector with the three strings (e.g., c(1,2) is 'T80').

When specifying matrices in MATLAB one may use a comma to separate column values and a semicolon to separate row values, as in:

» x=[1, 2 ; 3, 4];

Matrices can also be used to define other matrices. Thus, the previous matrix x could also be defined as:

» x=[[1 2] ; [3 4]];
» x=[[1; 3], [2; 4]];

One can confirm that the matrix has been defined as intended, by typing x after the prompt, and obtaining:

x =
     1     2
     3     4

The same result could be obtained by removing the semicolon terminating the previous command. In MATLAB a semicolon inhibits the production of screen output. Also, MATLAB commands can be used either in a procedure-like manner, producing output (as “answers”, denoted ans), or in a function-like manner, producing a value assigned to a variable (considered to be a matrix). This is illustrated next, with the command that computes the mean of a sequence of values structured as a row vector:

» v=[1 2 3 4 5 6];
» mean(v)
ans =
    3.5000
» y=mean(v)
y =
    3.5000

Whenever needed one may know which objects (e.g. matrices) are currently in the console environment by issuing who. Object removal is performed by writing clear followed by the name of the object. For instance, clear x removes matrix x from the environment; it will no longer be available. The use of clear without arguments removes all objects from the environment.

On-line help about general or specific topics of MATLAB can be obtained from the Help menu option. On-line help about a specific function can be obtained by just typing it after the help command, as seen above.

1.8.2.2 R

R command lines are written with appropriate arguments following the R prompt, >, in the R Gui interface (R console) as shown in Figure 1.11. As in MATLAB, command lines must be terminated with the “Return” or the “Enter” key.

Data is represented in R by means of vectors, matrices and data frames. The basic data representation in R is a column vector but for statistical analyses one mostly uses data frames. Let us start with vectors. The command

> x <- c(1,2,3,4,5,6)

defines a column vector named x containing the list of values between parentheses. The “<-” symbol is the assignment operator. The “c” function fills the vector with the list of values. The symbol “x” is the vector identifier. Object identifiers in R can be arbitrary strings not starting with a digit, with the exception of reserved R words.
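As a minimal sketch of the points just made (the identifiers n, first and last are ours, chosen only for illustration), the following lines define the same vector, count its elements and pick out single values with square-bracket indexing, which starts at 1 in R:

```r
# Define a numeric vector with the c function and assign it to x.
x <- c(1, 2, 3, 4, 5, 6)

# length returns the number of elements of the vector.
n <- length(x)

# Square brackets select elements by position (the first position is 1).
first <- x[1]
last <- x[n]

print(n)
print(c(first, last))
```

Note that, unlike the bare identifier typed at the prompt, print is needed to display values when these lines are run as a script file.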

Figure 1.11. The R Gui showing the definition of a vector.

We may list the contents of x just by issuing it as a command:

> x
[1] 1 2 3 4 5 6

The [1] indicates that the displayed line starts with the first element of x. For instance,

> y <- rnorm(12)
> y
[1] -0.1354 -0.2519  0.5716  0.6845 -1.5148 -0.1190
[7]  0.7328 -1.0274  0.3319 -0.3468 -1.2619  0.7146

generates and lists a vector with 12 normally distributed random numbers. The 1st and 7th elements are indicated. (The numbers are represented here with four digits after the decimal point because of page width constraints. In R the representation is with seven digits.) One could also obtain the previous list by just issuing: > rnorm(12). Most R functions also behave as procedures in that way, displaying lists of values in the R console.

A vector can be filled with strings (delimited with "), as in v <- c("Pmax", "T80", "T82"). Now v is a vector containing three strings. The second vector element, v[2], is "T80".

R also provides a function, named seq, to define evenly spaced number sequences, as in the following example:

> seq(-1,1,0.2)

[1] -1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0
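The third argument in the call above is the step between consecutive values. As a hedged illustration, the sketch below builds the same sequence in two ways, using the named arguments by (step) and length.out (number of values) of seq; the variable names s1 and s2 are ours.

```r
# Sequence from -1 to 1 in steps of 0.2.
s1 <- seq(-1, 1, by = 0.2)

# The same sequence, specified by the number of values instead of the step.
s2 <- seq(-1, 1, length.out = 11)

print(length(s1))

# all.equal compares numeric values up to a small tolerance, which is the
# appropriate comparison for values affected by rounding.
print(isTRUE(all.equal(s1, s2)))
```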

A matrix can be obtained in R by suitably transforming a vector. For instance,

> dim(x) <- c(2,3)
> x
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

transforms (through the dim function) the previous vector x into a matrix of 2×3 elements. Note the display of row and column numbers.
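Assigning dimensions fills the matrix column by column. A matrix with the same contents can also be built in one step with the matrix function of base R; the sketch below compares both routes (the name m2 is ours, for illustration only).

```r
# Route 1: reshape an existing vector by assigning its dimensions.
x <- c(1, 2, 3, 4, 5, 6)
dim(x) <- c(2, 3)   # 2 rows, 3 columns, filled column by column

# Route 2: build the matrix directly; matrix also fills by columns by default.
m2 <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)

print(x)
print(all(x == m2))   # both routes give the same values
print(x[1, 2])        # first row, second column
```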

One can also aggregate vectors into a matrix by using the function cbind (“column binding”) or rbind (“row binding”) as in the following example:

> u <- c(1,2,3)
> v <- c(-1,-2,-3)
> m <- cbind(u,v)
> m
     u  v
[1,] 1 -1
[2,] 2 -2
[3,] 3 -3

Matrix indexing in R uses square brackets as index qualifier. As an example, m[2,2] has the value -2.
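To make the difference between the two binding functions concrete, here is a small sketch under the same definitions of u and v (the names mc and mr are ours). It also shows that rows or columns carrying names can be selected by name as well as by position.

```r
u <- c(1, 2, 3)
v <- c(-1, -2, -3)

mc <- cbind(u, v)   # 3 x 2 matrix: u and v become the columns
mr <- rbind(u, v)   # 2 x 3 matrix: u and v become the rows

print(dim(mc))
print(dim(mr))

print(mc[2, 2])     # second row, second column
print(mc[, "v"])    # a whole column, selected by its name
print(mr["u", 3])   # row selected by name, column by position
```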

Note that R is case-sensitive. For instance, Cbind cannot be used as a replacement for cbind.

Figure 1.12. An illustration of R on-line help of function mean. The “Help on ‘mean’” is displayed in a specific window.

An R data frame is a container for a list of objects. We mostly use data frames that are simply data matrices with appropriate column names, as in the above matrix m.
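A minimal sketch of this idea follows, assuming the vectors u and v defined earlier; the names df and df2 are ours. It uses the base R functions as.data.frame and data.frame to obtain a data frame from a matrix or directly from the vectors.

```r
u <- c(1, 2, 3)
v <- c(-1, -2, -3)
m <- cbind(u, v)

# Convert the matrix; the column names u and v are kept.
df <- as.data.frame(m)

# Alternatively, build the data frame directly from the vectors.
df2 <- data.frame(u = u, v = v)

print(names(df))
print(df$v)          # a column accessed by name with the $ operator
print(df[2, "v"])    # matrix-like indexing also works on data frames
```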

Operations on data are obtained by using suitable R functions. For instance,

> mean(x)
[1] 3.5

displays the mean value of the x vector on the console. Of course one could also assign this mean value to a new variable, say mu, by issuing the command mu <- mean(x).
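Since x was given dimensions earlier in this section, it may help to see that mean treats a matrix as a single set of values. The sketch below, with names of our own choosing (mu_matrix, col_mu), contrasts this with the base R function colMeans, which returns one mean per column.

```r
x <- c(1, 2, 3, 4, 5, 6)
mu <- mean(x)           # assigned to mu, so nothing is displayed

dim(x) <- c(2, 3)       # x is now a 2 x 3 matrix
mu_matrix <- mean(x)    # mean over all six elements
col_mu <- colMeans(x)   # one mean per column

print(mu)
print(mu_matrix)
print(col_mu)
```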

Whenever needed one may obtain the information on which objects are currently in the console environment by using ls() (“list”). (Be sure to include the parentheses; otherwise R will interpret the command as a request to display the code of the ls function.) Object removal is performed by applying the function rm (“remove”) to a list of object identifiers. For instance, rm(x) removes matrix x from the environment; it will no longer be available.
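The following sketch illustrates both functions on two throwaway objects (x and y here are redefined only for the purpose of the example). It checks membership in the list returned by ls() before and after the removal.

```r
x <- c(1, 2, 3)
y <- "abc"

# ls() returns the names of the objects in the current environment
# as a vector of strings.
before <- c("x", "y") %in% ls()
print(before)

# Remove x; y is left untouched.
rm(x)

after <- c("x", "y") %in% ls()
print(after)
```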

On-line help about general topics of R, namely command constructs and available functions, can be obtained from the Help menu option of the R Gui. On-line help about a specific function can be obtained using the R help function as illustrated in Figure 1.12.

Figure 1.13. A partial view of the R “Package Index”.

The functions available in R are collected in so-called packages (somewhat resembling the MATLAB toolboxes; an important difference is that R packages may also include datasets). One can inspect which packages are currently loaded by issuing the search() command (with no arguments). Suppose that you have done that and obtained:

> search()
[1] ".GlobalEnv"        "package:methods"   "package:stats"
[4] "package:graphics"  "package:grDevices" "package:utils"
[7] "package:datasets"  "Autoloads"         "package:base"

We will often use functions of the stats package. In order to find out which functions are available in the stats package one may issue the help.start() command. A browser window pops up, where one clicks on “Packages” and obtains the “Package Index” window partially shown in Figure 1.13.

By clicking on stats of the “Package Index” one obtains a complete list of the available stats functions. The same procedure can be followed to obtain function (and dataset) lists of other packages.
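As a complement to the browser-based route, the list of objects of an attached package can also be obtained in the console by applying ls to the package entry shown by search(). The sketch below is only an illustration (the names attached and stats_fns are ours); it also shows that mean itself belongs to the base package rather than to stats.

```r
# Make sure the stats package is attached (it normally is by default).
library(stats)

attached <- search()
print("package:stats" %in% attached)

# ls applied to an attached package lists the objects it makes available.
stats_fns <- ls("package:stats")
print(c("median", "sd", "rnorm") %in% stats_fns)

# mean is provided by the base package.
print("mean" %in% ls("package:base"))
```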

The command library() lists the packages installed at one’s site. One of the listed packages is the boot package. In order to load it one should issue library(boot). A subsequent search() would display:

> search()
 [1] ".GlobalEnv"        "package:boot"      "package:methods"
 [4] "package:stats"     "package:graphics"  "package:grDevices"
 [7] "package:utils"     "package:datasets"  "Autoloads"
[10] "package:base"