Livestock Production Science, Vol. 66, Issue 1, September 2000

www.elsevier.com/locate/livprodsci

The relation between breeding management and 305-day milk production, determined via principal components regression and partial least squares

C.W. Rougoor a,b,*, R. Sundaram c, J.A.M. van Arendonk d

a Research Station for Cattle, Sheep and Horse Husbandry, Runderweg 6, NL-8219PK Lelystad, The Netherlands

b Department of Economics & Management, Wageningen Agricultural University, Wageningen, The Netherlands

c Department of Statistics and Probability, Michigan State University, East Lansing, MI, USA

d Animal Breeding and Genetics Group, Wageningen Institute of Animal Sciences, Wageningen Agricultural University, Wageningen, The Netherlands

Received 20 November 1998; received in revised form 19 July 1999; accepted 13 December 1999

Abstract

A field study was set up to investigate the relation between breeding management and 305-day milk production. A second goal of the study was to investigate advantages and disadvantages of principal components regression (PCR) and partial least squares (PLS) for livestock management research. Multicollinearity was present in the data set and the number of variables was high compared to the number of observations. Out of 70 variables related to breeding management and technical results at dairy farms, 19 were selected for PLS and PCR, based on a correlation of ≥0.25 or ≤−0.25 with 305-day milk production. Five principal components (PCs) were selected for PC-regression with 305-day milk production being the goal variable. Related variables were combined into one so-called synthetic factor. All synthetic variables were used in a path-analysis. The same path-analysis was worked out with PLS. PLS forms synthetic factors capturing most of the information for the independent X-variables that is useful for predicting the dependent Y-variable(s) while reducing the dimensionality. Both methodologies showed that milk production per cow is related to critical success factors of the producer, farm size, breeding value for production and conformation. Milk production per cow was the result of the attitude of the farmer as well as the genetic capacity of the cow. It was found that at high producing farms the producer put relatively much emphasis on the quality of the udder and less on the kg of milk. Advantages of PLS are the optimization towards the Y-variable, resulting in a higher R², and the possibility to include more than one Y-variable. Advantages of PCR are that hypothesis testing can be performed, and that complete optimisation is used in determining the PCs. It is concluded that PLS is a good alternative for PCR when relations are complex and the number of observations is small. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: Dairy cattle; Principal component; Partial least squares; Management; Milk production

1. Introduction

*Corresponding author. CLM, P.O. Box 10015, 3505 AA Utrecht. Tel.: +31-30-2441-301; fax: +31-30-2441-318.


[…] results on dairy farms are complex. The current paper is part of a project with the purpose of getting a better insight into relationships under practical conditions. The major focus was on operational management (the day-to-day decision making) and its relation with 305-day milk production. This kind of livestock management study often faces the problem that multicollinearity is present in the data set, and that the number of variables is large compared to the number of observations. In case of multicollinearity, standard statistical techniques, such as linear regression, will give unstable estimates of the regression coefficients, which hinders their interpretation. In addition, a large number of variables compared to the number of observations will decrease the degrees of freedom of the residual variance dramatically.

Multicollinearity is often difficult to detect. Afifi and Clark (1984) state that a simple way to check for multicollinearity is to examine the correlations among the independent variables. When a priori information is available on relationships between variables and complex relations have to be determined, path-analysis can be a useful tool (Rougoor et al., 1997). This methodology, however, can only be used when the number of observations is large compared to the number of variables. Bigras-Poulin (1985), Lafi and Kaneene (1992a), Ferguson et al. (1994), and Webster et al. (1997) used Principal Components Analysis and Principal Components Regression (PCR) to reduce the number of independent variables (i.e. to reduce dimensionality) and to avoid problems regarding multicollinearity. Faye et al. (1997) used canonical correspondence analysis, which is a generalisation of the principal component analysis. Steenkamp and Van Trijp (1996) were facing the same kind of problems, but used Partial Least Squares (PLS) to reduce dimensionality and multicollinearity. PLS is considered to be useful for describing complex relationships (Fornell and Bookstein, 1982; Fornell et al., 1990). PLS has proved to be successful for forming prediction equations to relate a substance’s chemical composition to its near-infrared spectra (Garthwaite, 1994). However, PLS has hardly been used in livestock management research.

The goal of the current study is twofold:

1. To investigate the relationships between breeding management and 305-day milk production;
2. To investigate the advantages and disadvantages of PLS and PCR for use in livestock management research.

2. Materials and methods

2.1. Data

The data used in the current paper were part of a bigger project, which aimed at determining the relation between management and milk production and gross margin. Therefore, management is divided into management in the areas of grassland, feeding, animal health, fertility and breeding. Thirty-nine Dutch dairy farms were included in the field study. Farms were selected on the basis of gross margin from May 1993 till May 1994 (for farms using a non-calendar year accounting period) or January 1994 to January 1995 (for farms with a calendar year accounting period), and the 305-day milk production between August 1993 and August 1994. Farms had to meet the following two criteria:

1. A gross margin below Dfl. 77.40 or above 78.40 per 100 kg of milk;
2. A 305-day milk production below 7270 kg per cow, or above 7450 kg.

These cut-off values reflect farms selected out of the ‘tails’ of the normal distribution. This way, a data set was created with four groups of farms which could be used to investigate the relation between variation in management and milk production and gross margin. Unfortunately, the differences in gross margin between the four groups had become much smaller over the past few years. Therefore, the analyses were done on individual data, without differentiating between the groups. During the period of data collection one farm dropped out, so analyses were based on data of 38 farms. From May 1996 to May 1997 the farms were visited monthly to collect data. To get insight into the breeding management of the producers, a management questionnaire on breeding decisions was developed and during one of the


farm visits presented to the producer. Questions focused on the breeding goal of the producer, the sire selection and the use of natural service sires. The producer was also asked to indicate what the critical success factors (CSFs) were at his farm regarding production and breeding. Milk production data of the farms, as well as data on breeding values of cows at the herds, were made available by the Royal Dutch Cattle Syndicate (NRS). A first selection of variables was based on simple linear correlation of ≥0.25 or ≤−0.25 with farm average 305-day milk production. Table 1 gives an overview of the 19 variables that were selected for the multivariate analyses out of a total of 70 variables. The CSFs are measured on a 6-point scale from ‘not important’ to ‘very important’. Strictly speaking, these variables are ordinal variables. However, an increasing score indicates ‘more important’. Therefore, the variables were assumed to be continuous. To indicate whether multicollinearity was likely to be present in the data set, simple linear correlations between the 19 variables were calculated (Afifi and Clark, 1984).

Table 1

Description of variables used in the multivariate analyses and their average value for the 38 farms

| Synthetic factor | Variable | Description of variable | Avg. value |
|---|---|---|---|
| Critical success factors (CSF) | Production | CSF ‘milk production per cow’ on a 0 (not mentioned) to 5 (most important) scale | 2.20 |
| | Culling | CSF ‘culling policy’ on a 0 to 5 scale | 0.69 |
| | Winter milk | CSF ‘% of milk produced in winter’ on a 0 to 5 scale | 0.58 |
| Breeding goal (BG) producer | Kg milk | % of points producer gives to ‘kg milk’ as a breeding goal at his farm | 14.56 |
| | Udder | % of points producer gives to ‘udder’ | 10.35 |
| Farm size | No. inseminated | No. of inseminated cows | 13.5 |
| | Total no. of cows | Total number of cows at the farm | 65.1 |
| | Avg. no. of mc | Average no. of cows that are not dried off | 55.4 |
| Use natural service | Cows% | % of cows inseminated with natural service sires | 3% |
| Breeding value production | BV Milk | Avg. breeding value of cows for kg of milk | 213 kg |
| | BV Fat | Avg. breeding value of cows for kg of fat | 6.8 kg |
| | BV INET | Avg. breeding value of cows for INET | 81.3 |
| Breeding value conformation | Development | Avg. breeding value of cows for ‘development’ | 100.2 |
| | Type | Avg. breeding value of cows for ‘type’ | 100.1 |
| | Udder | Avg. breeding value of cows for ‘udder’ | 100.1 |
| | Legs | Avg. breeding value of cows for ‘legs’ | 100.7 |
| | Total | Avg. breeding value of cows for ‘total conformation’ | 100.2 |
| Age at calving | Age heifers | Expected age of calving of heifers | 787 days |
| | Calving Age | Average age of dairy cows at calving | 1485 days |
| Milk production | 305-day | Farm average 305-day milk production | 8342 kg |

Breeding goal: the producer is asked to assign 100 points to the different genetic aspects that he takes into account for the breeding of his cows.

INET = weighted average of the breeding values for kg milk, kg fat and kg protein, based on the price paid for these different components.

The number of variables was large compared to the number of farms (our unit of observation). Therefore, the variables were grouped into so-called ‘synthetic factors’. The calculation of synthetic variables from the underlying variables differed between PLS and PCR. This will be discussed when these methodologies are discussed.

Fig. 1 gives the null-path model for the path-analysis. For both methodologies (PLS and PCR) the researcher has to use prior knowledge and intuition to define the synthetic factors and the null-path model. The specification of the synthetic factors was based upon a logical separation of different parts and levels of breeding management. The design of the null-path model was based on the framework as described by Rougoor et al. (1998). The decision-making process (business goals and CSFs) influences biological and technical aspects and processes (breeding value, use of natural service and age at calving), which in turn influence the 305-day milk production. Farm size, in turn, might have influenced the average breeding value of cows on the farm. The path diagram was analysed by PLS and PCR. To get comparable results, for both methodologies the rule was applied that only arrows with a standardised path coefficient larger than 0.20 were kept in the model.

2.2. Principal Components Regression (PCR)

Principal component analysis, a statistical technique originated by Hotelling (1933), is performed in order to simplify the description of a set of interrelated variables (Afifi and Clark, 1984). It allows the transformation of a set of correlated explanatory X-variables into an equal number of uncorrelated variables. These new variables, the so-called principal components (PCs), are all linear combinations of the original correlated X-variables. The PCs are arranged in decreasing order of contribution to variance. Dimensionality can be reduced by selecting only a couple of PCs with a high contribution to variance. The number of PCs selected may be determined by examining the proportion of total variance explained by each component, or by the cumulative proportion of total variance explained. A rule of thumb adopted by many investigators is to select only the PCs explaining at least 100/P percent of the total variance, with P being the total number of variables (Afifi and Clark, 1984). This selection criterion was also used in the current paper. Besides the percentage of variance explained, the eigen values of the PCs can be of use to decide how many PCs to include in the PCR of the PCs on the Y-variable (the 305-day milk production). The eigen value is the variance of that PC. When an eigen value of a PC is close to zero, it means that multicollinearity is present among the original variables. In that case that PC can be excluded from the regression. These two selection criteria (both using a so-called top-down approach) do not include PCs with small contribution to variance in the regression. This results in a reliable estimate of the regression parameters. The selected PCs were utilised as


uncorrelated explanatory variables in the regression model. Parameter estimates were generated by the equation:

\[
\text{305-day milk production} = a + b_1 \, \mathrm{PC}_1 + b_2 \, \mathrm{PC}_2 + \dots + b_n \, \mathrm{PC}_n + e \tag{1}
\]

where $a$ is the intercept term, $b_i$ is the regression coefficient, $\mathrm{PC}_i$ is the principal component $i$, $n$ is the number of PCs included in the regression, and $e$ is the residual (error) term. These estimates of the regression coefficients were used to reconstitute regression coefficients for the explanatory variables, as was done by Lafi and Kaneene (1992b):

\[
\mathrm{RC}_{var(j)} = \mathrm{loadPC}_{1,var(j)} \, b_1 + \mathrm{loadPC}_{2,var(j)} \, b_2 + \dots + \mathrm{loadPC}_{n,var(j)} \, b_n \tag{2}
\]

where $\mathrm{RC}_{var(j)}$ is the standardized reconstituted regression coefficient of variable $j$, $\mathrm{loadPC}_{i,var(j)}$ is the loading of variable $j$ on $\mathrm{PC}_i$, and $b_i$ is the regression coefficient as was estimated in Eq. (1). Due to these transformations, these explanatory variables (the PCs) are corrected in such a way as to minimize the effect of multicollinearity. The reconstituted regression coefficients were used to construct the synthetic factors. This way, dimensionality could be reduced without losing much of the information. Besides that, interpretability will be increased (Afifi and Clark, 1984). The synthetic variables were used in a multivariate path-analysis. Standardized path-coefficients were calculated as described by Rougoor et al. (1997). The procedures PCP and MODEL of the statistical package Genstat (Payne et al., 1995) were used to do the calculations.

2.3. Partial Least Squares (PLS)

PLS is a methodology that can be used for theory confirmation, but can also be used to suggest where relationships might or might not exist and to suggest propositions for later testing. It intends to form so-called ‘latent variables’ (in our case these are the synthetic factors, for instance ‘Breeding Goal Producer’) that capture most of the information for the independent X-variables (i.e. the two breeding goals ‘Kg milk’ and ‘Udder’) that is useful for predicting the dependent Y-variables (in our case the ‘305-day milk production’). At the same time, PLS reduces the dimensionality of the regression problem by using fewer synthetic factors than the number of X-variables. The major difference between PCR and PLS is that with PLS the data values of both the X- and Y-variables influence the construction of the synthetic factors. In the previous paragraph it was explained that the PCs in a PCR are determined without taking into account the Y-variable (Garthwaite, 1994). Another difference between the two methodologies is that PLS can take into account more than one Y-variable at the same time (however, this option will not be used in the current paper).

The inputs of the PLS-model are the raw data, the set-up of the synthetic factors and the set-up of the null-path model. PLS estimates the relations between these data and factors. It distinguishes between different components of the path model. The relationships between the synthetic factors are the so-called inner relations, for instance the relation between the synthetic factors ‘CSF’ and ‘Breeding Goal Producer’. These are given by the inner path coefficients, ranging from −1 (a strong negative relationship) to +1 (a strong positive relationship). Relations between the variables and the synthetic factors are the outer relations, for instance the relation between the breeding goal ‘Kg milk’ and the synthetic factor ‘Breeding Goal Producer’. These are given by the factor loadings. Factor loadings can vary between −1 (indicating a very strong negative relationship; all variance of that variable is captured in that synthetic factor) and +1 (a very strong positive relationship). These are estimated in such a way that the model is optimal in the inner part (i.e. between the synthetic factors) as well as the outer part (i.e. towards the X- and Y-variables). PLS seeks values for the factor loadings and structural parameters that minimize residual variance for the synthetic factors and the X- and Y-variables. This way, a synthetic factor is estimated to be the best predictable variable of its X-variables as well as the best predictor of subsequent dependent synthetic variables or Y-variables (Steenkamp and Van Trijp, 1996).

The PLS algorithm proceeds in three stages. The first stage gives estimates of the case values of the synthetic variables. The second stage of the PLS


algorithm uses the estimates of the synthetic factors in the first stage to estimate the inner and outer relations, without location parameters. The third step of the algorithm estimates the location parameters of the synthetic factors and the structural relations estimated in the first two stages (Wold, 1982). A detailed overview of these three steps is given by Wold (1985).

No distributional assumptions are made in PLS (Fornell and Cha, 1994). Therefore, the traditional statistical testing methods are not well suited. The variance extracted measures the amount of variance of the X- or Y-variable that is captured by the synthetic factor. This variable can vary from 0 to +1. The average variance extracted (AVE) is the average of the variances extracted of all X- or Y-variables of one specific synthetic factor. A high AVE indicates that the amount of variance captured by the synthetic factor is big compared with the amount of unexplained variance of the X- or Y-variables. It is a measure to evaluate the relationship between the synthetic factor and its X-variables: the outer model. This can be used to evaluate the goodness of the measurement model, that is, the reliability of the synthetic factors (Fornell and Cha, 1994). The R² measures the explanatory power of the relations between the different synthetic factors. It shows how well a synthetic factor is predicted by other synthetic factors. This value is dependent upon the set-up of the path-model. The predictive value of the model can be shown by the Stone–Geisser test or by jack knifing. The Stone–Geisser test calculates a criterion Q² that indicates how well the observed values can be reconstructed by the model. It is evaluated as an R² in Ordinary Least Squares (OLS) without loss of degrees of freedom. The general form of the Q² is Q² = 1 − E/O, where E is the sum of squares of the prediction errors and O is the sum of squares of the errors from the prediction given by the mean of the remaining data points. When Q² > 0 it indicates that there is predictive relevance of the model, whereas Q² < 0 suggests lack of relevance. Jack knifing can be used to obtain standard deviations of the parameter estimates (Miller, 1974). This is done by estimating the parameters N times in a data set with N observations, each time cutting off just one observation. The different estimates for the same parameter, […] parameter. Jack knifing provides information about the precision of the parameter estimates. The PLS-model was estimated with the LVPLS 1.8 program (Lohmoller, 1987).

3. Results

3.1. Correlation between variables

The correlations between variables within the synthetic variable ‘Breeding Value Conformation’ varied between 0.65 and 0.94. The correlations between variables within the synthetic variable ‘Farm Size’ varied between 0.73 and 0.93. These examples show that the correlations between variables within a synthetic variable can be high. So, multicollinearity is likely to exist. Afifi and Clark (1984) stated that when two variables are highly correlated (greater than 0.95), it may be simplest to use only one of them, since one variable conveys essentially all of the information contained in the other. However, all correlations were smaller than 0.95 in this case. Besides that, the presence of these big correlations might emphasise differences between PLS and PCR, so all variables were retained in the analysis.

3.2. Principal Components Regression (PCR)

The percentage of variance explained by the 19 PCs and the eigen values of these PCs are shown in Table 2. These results also showed that multicollinearity is present in the dataset, because component nineteen had an eigen value close to zero (0.01). When the rule of thumb was used that a PC has to explain at least 100/P% of the variance to be included in the regression, the percentage of variance explained by one PC has to be at least 100/19 = 5.26%. Only the first five of the original 19 PCs could satisfy this criterion (see Table 2). These five PCs together explained 74.14% of the variance in the data set. These five PCs were used in a linear regression. The coefficients were then transformed back to the original variables on a standardized scale and on their original scale. These regression coefficients are shown in Table 3. Because the regression coefficients were reconstituted, no significance values were available for these variables. The standardized regression coefficients were used to compare the outcome with the outcome of the PLS-modelling. The regression coefficients based on the original scale could be used to interpret the results. For instance, the regression coefficient on the percentage of use of natural services indicates that at farms with a 1% higher use of natural services the 305-day milk production is expected to be 436 kg lower.

Table 2

Percentage of variance explained and the eigen values of the 19 principal components

| Principal component | % variation | Eigen value |
|---|---|---|
| PC1 | 37.17 | 7.06 |
| PC2 | 15.77 | 3.00 |
| PC3 | 9.23 | 1.75 |
| PC4 | 6.23 | 1.18 |
| PC5 | 5.74 | 1.09 |
| PC6 | 5.18 | 0.98 |
| PC7 | 4.60 | 0.88 |
| PC8 | 3.96 | 0.75 |
| PC9 | 3.26 | 0.62 |
| PC10 | 2.38 | 0.45 |
| PC11 | 2.06 | 0.39 |
| PC12 | 1.62 | 0.31 |
| PC13 | 0.82 | 0.16 |
| PC14 | 0.74 | 0.14 |
| PC15 | 0.48 | 0.09 |
| PC16 | 0.31 | 0.06 |
| PC17 | 0.24 | 0.05 |
| PC18 | 0.19 | 0.04 |
| PC19 | 0.03 | 0.01 |

The regression coefficients on the original scale were used to calculate the synthetic variables, which were used in a multivariate path-analysis. Fig. 2 shows the outcome of this path-analysis. The R² showed that the model could explain 36% of the differences in milk production. The synthetic factor

Table 3

Results of principal components regression on 305-day milk production with five PCs included

| Variable | Regression coefficients on standardized scale | Regression coefficients on original scale |
|---|---|---|
| CSF-Production | 0.155 | 64.75 |
| CSF-Culling | 0.064 | 35.30 |
| CSF-Winter milk | −0.126 | −77.73 |
| BG-Kg milk | −0.046 | −3.05 |
| BG-Udder | 0.025 | 2.52 |
| Farm Size-No. inseminated | −0.155 | −16.69 |
| Farm Size-Total no. of cows | −0.093 | −3.74 |
| Farm Size-Avg. no. of mc | −0.090 | −4.21 |
| Use natural service-Cow% | −0.056 | −435.51 |
| BV Milk | 0.059 | 0.29 |
| BV Fat | 0.061 | 8.05 |
| BV INET | 0.107 | 1.63 |
| BV-Development | 0.001 | 0.68 |
| BV-Type | 0.039 | 15.87 |
| BV-Udder | 0.014 | 5.35 |
| BV-Legs | 0.030 | 18.14 |
| BV-Total | 0.019 | 6.60 |
| Age heifers | −0.185 | −4.25 |
| Calving Age | −0.091 | −0.48 |

Notes on the original-scale coefficients:

Change in farm average 305-day milk production per point change in CSF.

Ditto per percent change in breeding goal.

Ditto per extra cow.

Ditto per percent change in use of natural service sires.

Ditto per kg change in breeding value.

Ditto per point change in INET.

Ditto per point change in breeding value.
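The choice of five PCs behind Table 3 follows the 100/P rule of thumb described in Section 2.2. As a minimal sketch (the function name `select_pcs` is hypothetical and not part of the original analysis, which was run in Genstat), the rule can be applied to the percentages reported in Table 2:

```python
# Percentage of variance explained by each of the 19 PCs, copied from Table 2.
PCT_VARIANCE = [
    37.17, 15.77, 9.23, 6.23, 5.74, 5.18, 4.60, 3.96, 3.26, 2.38,
    2.06, 1.62, 0.82, 0.74, 0.48, 0.31, 0.24, 0.19, 0.03,
]


def select_pcs(pct_variance):
    """Return the 1-based indices of PCs explaining at least 100/P percent.

    P is the number of original variables, which equals the number of PCs.
    """
    threshold = 100.0 / len(pct_variance)  # 100/19, about 5.26% here
    return [i for i, pct in enumerate(pct_variance, start=1) if pct >= threshold]


selected = select_pcs(PCT_VARIANCE)
cumulative = sum(PCT_VARIANCE[i - 1] for i in selected)

print(selected)              # [1, 2, 3, 4, 5]
print(round(cumulative, 2))  # 74.14
```

PC6 explains 5.18%, just under the 5.26% threshold, which is why the cut falls after the fifth component.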


Fig. 2. Path coefficients for PCR-modelling. NS = not significant; * = P < 0.05; ** = P < 0.01.
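The reconstituted coefficients that feed this path-analysis come from Eq. (2). The sketch below illustrates that step only. All numbers are hypothetical, and the back-transformation to the original scale (multiplying by sd(y)/sd(x_j)) is the usual textbook conversion, assumed here because the paper does not spell out its formula.

```python
def reconstitute(loadings, b):
    """Eq. (2): RC_j = sum over i of loadPC_{i,j} * b_i.

    loadings[i][j] is the loading of variable j on PC i;
    b[i] is the regression coefficient of PC i from Eq. (1).
    """
    n_vars = len(loadings[0])
    return [
        sum(loadings[i][j] * b[i] for i in range(len(b)))
        for j in range(n_vars)
    ]


def to_original_scale(rc_std, sd_x, sd_y):
    """Assumed conversion: standardized coefficient * sd(y) / sd(x_j)."""
    return [rc * sd_y / sx for rc, sx in zip(rc_std, sd_x)]


# Hypothetical example: 2 retained PCs, 3 explanatory variables.
loadings = [
    [0.6, 0.6, 0.5],   # loadings of the three variables on PC1
    [0.7, -0.7, 0.1],  # loadings of the three variables on PC2
]
b = [0.2, -0.1]        # PC regression coefficients from Eq. (1)

rc = reconstitute(loadings, b)
original = to_original_scale(rc, sd_x=[2.0, 0.5, 10.0], sd_y=600.0)

print([round(v, 3) for v in rc])
print([round(v, 2) for v in original])
```

Because each reconstituted coefficient is a sum over the retained PCs only, dropping low-variance PCs changes every coefficient at once, which is how the method dampens the effect of multicollinearity.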

Table 4

Measurement part of the PLS-model

| Synthetic factor | Variable | Factor loading | Mult. R² | (Average) variance extracted |
|---|---|---|---|---|
| Critical success factor | | | NA | 0.45 |
| | Milk production | 0.39 | | 0.15 |
| | Culling | 0.60 | | 0.35 |
| | Winter milk | −0.91 | | 0.83 |
| Breeding goal producer | | | 0.06 | 0.59 |
| | Kg milk | −0.81 | | 0.65 |
| | Udder | 0.72 | | 0.52 |
| Use natural service sires | | | 0.16 | NA |
| | Cow% | −1.00 | | |
| Farm size | | | NA | 0.87 |
| | No. inseminated | −0.90 | | 0.80 |
| | Total no. of cows | −0.94 | | 0.89 |
| | Avg. no. milking cows | −0.95 | | 0.90 |
| Breeding value production | | | 0.36 | 0.91 |
| | Milk | 0.94 | | 0.88 |
| | Fat | 0.96 | | 0.92 |
| | INET | 0.97 | | 0.95 |
| Breeding value conformation | | | NA | 1.00 |
| | Development | 1.00 | | 1.00 |
| | Type | 1.00 | | 1.00 |
| | Udder | 1.00 | | 1.00 |
| | Legs | 1.00 | | 1.00 |
| | Total | 1.00 | | 1.00 |
| Milk production | | | 0.47 | NA |
| | 305-day milk production | 1.00 | | |

NA = not available; this synthetic factor was not predicted by any other synthetic factor.
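The variance extracted and AVE columns of Table 4 can be related to the factor loadings. The sketch below assumes that, for standardized variables, the variance extracted of a variable equals its squared factor loading. This is a common convention rather than something the paper states, but it reproduces the tabulated AVE values of the three factors shown to two decimals; the last digit can differ for other rows because the published loadings are rounded.

```python
# Factor loadings copied from Table 4 for three synthetic factors.
LOADINGS = {
    "Critical success factor": [0.39, 0.60, -0.91],
    "Breeding goal producer": [-0.81, 0.72],
    "Farm size": [-0.90, -0.94, -0.95],
}


def variance_extracted(loading):
    """Assumed definition: share of a variable's variance captured by the factor."""
    return loading ** 2


def average_variance_extracted(loadings):
    """AVE: mean of the variances extracted over all variables of one factor."""
    return sum(variance_extracted(l) for l in loadings) / len(loadings)


ave = {name: average_variance_extracted(l) for name, l in LOADINGS.items()}
for name, value in ave.items():
    print(f"{name}: AVE = {value:.2f}")
```

Note that the sign of a loading does not affect the variance extracted, so ‘Winter milk’ (loading −0.91) contributes most to the AVE of its factor.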


‘Age at Calving’ was not used by the model because all path coefficients to and from this factor were smaller than 0.20. Table 3 and Fig. 2 show that milk production was higher on farms with managers who thought that ‘milk production per cow’ was a CSF for their farm. At these farms the breeding value for conformation was higher. The breeding goal of the producer indicated, however, that these producers put relatively much emphasis on the quality of the udder and less on the kg of milk.

3.3. Partial Least Squares (PLS)

Table 4 provides the factor loadings for each of the measures. The R² of each synthetic factor, the variance extracted for each variable, and the average variance extracted for each synthetic factor are given. The factor loadings show that the variable ‘Winter milk’ is the most important variable of the synthetic variable ‘Critical Success Factors’. The positive and negative signs of the two variables in the synthetic variable ‘Breeding Goal Producer’ show that a farmer who has a high score on this synthetic factor has said that the udder is an important breeding goal at his farm, whereas kg of milk is not. In this model, the age at calving was also not used, because the path coefficient was here also lower than 0.20. The R² of the synthetic factor ‘Milk Production’ shows that the model explained 47% of the differences in milk production.

Fig. 3 gives a graphical representation of the PLS-model with the inner path coefficients. Contrary to the PCR-model, no significance values are given here, because traditional statistical testing methods are not well suited. The Stone–Geisser test criterion Q² was used as an alternative method to evaluate the model. It had a value of 0.31, indicating that the model had predictive relevance, because it was bigger than zero. The same main results as with PCR were found with PLS. Small differences were found in the relation between the synthetic factors ‘Natural Service Sires’ and ‘Breeding Value Conformation’. PCR found a path coefficient of 0.25, whereas in the PLS-model it was smaller than 0.20 and therefore deleted. This indicates that a high percentage of natural services at the farm has a relatively strong negative effect on the breeding value for production and a smaller negative effect on breeding value conformation. Besides that, in the PLS-model, direct effects of the synthetic factor ‘Critical Success Factors’ on ‘Breeding Value Production’ and ‘Natural Service Sires’ were found, whereas in the PCR-model these path coefficients were too small.

4. Discussion

4.1. Breeding management

The path coefficient diagrams (Figs. 2 and 3) showed the same main effects. Milk production per cow was inversely related to farm size (the regression coefficients and loadings were negative for this synthetic factor). Milk production per cow was


directly positively related to breeding value for conformation, and to breeding value for production. These variables, in turn, were related to goals and CSFs of the producer, indicating that milk production is not only related to technical parameters, but also to the attitude of the producer. So, with respect to the aim of the data collection to determine the relationship between breeding management and 305-day milk production, it can be concluded that the producers’ breeding management was related to the 305-day milk production. Surprisingly, it was found that farmers who stated that they focused mainly on ‘kg of milk’ as a breeding goal had a lower breeding value for milk production and realised a lower 305-day production than producers who stated that they also took ‘udder’ into account in their breeding strategy. A second aspect that comes forward is the use of natural service sires. Table 1 shows that natural services were rarely used by the producers in the research group: only 3% of the cows were inseminated with natural service sires. However, it still was related to the breeding value. Producers who made more use of artificial insemination had cows with a higher breeding value and, related to that, a higher 305-day milk production. The CSFs of the producer were related to the breeding goal of the producer, which in turn was related to the breeding value for production through the selection decision. Differences between PCR and PLS came out for the synthetic factor ‘Breeding Value Conformation’. The underlying variables of this factor were highly related to each other (correlations between 0.65 and 0.94). PLS deals with that by making one synthetic factor out of it, which has a high loading on all these variables. PCR, in turn, tries to minimize multicollinearity by taking one variable more into account than the other one. Table 3 shows that especially ‘Breeding Value Legs’ and ‘Breeding Value Type’ were included in this factor in the PCR. Because the factor ‘Breeding Value Conformation’ was built up differently in the two models, the relationships towards the other synthetic factors were also different. The positive relation between ‘CSF’ and ‘Breeding Value Conformation’ in the PCR-model indicated that farmers who stated that milk production per cow was a major critical success factor for the farm had a higher breeding value for type and udder.

4.2. Comparing the methods

Wold (1985) states that PLS is useful when the main focus of the study shifts from individual variables and parameters to packages of variables and aggregate parameters. He stated that ‘in large, complex models with latent variables PLS is virtually without competition’. Rossa (1982) showed a map of statistical methods with regard to the complexity of the problem and their degree of prior information and concluded that PCR and PLS are both useful for complex problems. However, for PLS-modelling more prior information is needed, because the researcher has to design a path diagram with expected relationships beforehand.

Helland and Almøy (1994) compared PCR and PLS and concluded that there is not one method that dominates the other, and that the difference between the methods is typically small when the number of observations is large. PCR does well when the eigen values from the irrelevant components are extremely small or extremely large. PLS does well for intermediate irrelevant eigen values (Helland and Almøy, 1994). In case of multicollinearity, the eigen values might not be dominating ones. In that case PLS becomes closer to ordinary least squares, which is a desirable property of PLS. Garthwaite (1994) compared PLS with four other methods, including PCR, and concluded that PLS is a useful method for forming prediction equations when there are a large number of explanatory variables.

The R² of the milk production models differed considerably between the two methodologies: 0.36 for the PCR-model and 0.47 for the PLS-model. This can be explained by differences in the optimizing techniques employed in deriving the synthetic factors. PLS forms the synthetic factors by using the covariance between the X- and Y-variables, whereas with PCR the PCs are formed based on the X-variables only. As a result, the synthetic factors in PLS explain differences in the Y-variable better than PCR can do. In the current PCR 14 PCs were eliminated, based on their low eigen value. Another option is to eliminate components that have low correlation with the response variable. This results in a larger R² (0.45 in this case when five PCs with the highest correlation with 305-day milk production were selected). However, the elimination


Table 5
Requirements and (dis)advantages of Principal Component Regression (PCR) and Partial Least Squares (PLS)

                                          PCR                    PLS
Requirements
  Possibilities complexity path-analysis  Not complex            Very complex
  Degree of prior-information required    Not much               Much
  No. of cases : No. of variables         #cases >> #variables   #cases <, =, or > #variables
  Assumption on distribution variables    Normal distribution    Distribution-free
  Number of Y-variables                   = 1                    ≥ 1
(Dis)advantages
  Multicollinearity                       Accounted for          Accounted for
  Analysis                                Complete               Partial
  Y-variable included in optimisation     No                     Yes
  Calculation P-values                    Possible               Not possible
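The row ‘Y-variable included in optimisation’ can be illustrated with a minimal sketch. It does not use the data of this study: the data below are simulated and purely hypothetical, the variables are only centred, and only one synthetic factor is extracted. The first principal component is found by power iteration on X'X (X-variables only), whereas the first PLS weight vector is taken proportional to X'y (covariance with the Y-variable). All function and variable names are illustrative assumptions.

```python
import random

def center(col):
    # Subtract the mean so that sums of products act as (co)variances.
    m = sum(col) / len(col)
    return [v - m for v in col]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def scores(X, w):
    # Project each observation (row of X) onto the weight vector w.
    return [dot(row, w) for row in X]

def r_squared(t, y):
    # R^2 of a regression of centred y on a single centred score t.
    return dot(t, y) ** 2 / (dot(t, t) * dot(y, y))

def first_pc(X, iters=500):
    # Power iteration on X'X: the weights depend on the X-variables only.
    p = len(X[0])
    w = [1.0] * p
    for _ in range(iters):
        t = scores(X, w)
        w = [dot([row[j] for row in X], t) for j in range(p)]  # X'(Xw)
        norm = dot(w, w) ** 0.5
        w = [v / norm for v in w]
    return w

def first_pls_weight(X, y):
    # Weights proportional to X'y: the Y-variable enters from the start.
    p = len(X[0])
    w = [dot([row[j] for row in X], y) for j in range(p)]
    norm = dot(w, w) ** 0.5
    return [v / norm for v in w]

# Hypothetical data: x1 and x2 are collinear with large variance,
# x3 has small variance but drives most of the variation in y.
rng = random.Random(0)
n = 60
z = [rng.gauss(0, 3) for _ in range(n)]
x1 = [v + rng.gauss(0, 0.3) for v in z]
x2 = [v + rng.gauss(0, 0.3) for v in z]
x3 = [rng.gauss(0, 1) for _ in range(n)]
y = [2 * c + 0.2 * a + rng.gauss(0, 1) for a, c in zip(z, x3)]

cols = [center(c) for c in (x1, x2, x3)]
X = [list(row) for row in zip(*cols)]
yc = center(y)

w_pc = first_pc(X)
r2_pcr = r_squared(scores(X, w_pc), yc)
r2_pls = r_squared(scores(X, first_pls_weight(X, yc)), yc)
print(f"R^2 with first PC: {r2_pcr:.2f}; with first PLS factor: {r2_pls:.2f}")
```

With a single factor, the PLS score cannot explain less of the Y-variable than the first principal component does. The simulated numbers themselves say nothing about the dairy data.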

procedure that was used in this study guarantees variance reduction in the X-variables, but using the alternative method does not (Mason and Gunst, 1985), and the alternative method gives less stable results (Xie and Kalivas, 1997).

The results of the two analyses showed some advantages and disadvantages of both methodologies. PLS has the clear advantage that it optimizes towards the Y-variable right from the beginning, whereas with PCR some variance in the data set might be left out that still has a reasonable effect on the Y-variable. As a result, the percentage of variance that can be explained with the model is bigger for PLS. PCR, on the contrary, has a well-developed theory, which makes it possible to estimate P-values within the model. This makes the model statistically more attractive than PLS, which lacks a good statistical inferential base. This could probably be overcome by using data permutation to generate distributions under the null hypothesis (Churchill and Doerge, 1994). Besides that, the regression coefficients of PCR on the original scale can be interpreted more easily. In PCR the synthetic factors were based on regression coefficients on milk production. In the path-analysis, however, some synthetic factors were not related to milk production straightforwardly. In that case, it is not logical to calculate the synthetic factors this way. PLS is then more suitable; the synthetic factors were formed based on their surrounding synthetic factors. Due to the way the analysis was set up, PLS can be generalized to a multivariate set-up very easily.

certain data set, other aspects of PLS and PCR have to be compared as well. Advantages that did not come out of the current analyses but which are useful to take into account are that in PLS the investigator is free to define more than one Y-variable, that the number of variables can be large compared to the number of observations, and that no distributional assumptions are made. This last aspect makes more data sets suitable for PLS-analysis. However, at the same time it implies the disadvantage that significance values cannot be calculated. A disadvantage of PLS is that it is a partial procedure in the sense that each step of the estimation minimizes a residual variance with respect to a subset of X-, Y- and synthetic variables (Steenkamp and Van Trijp, 1996). So, there is no total residual variance or other overall optimum criterion that is strictly optimized (Jöreskog and Wold, 1982). The requirements, advantages and disadvantages of both methodologies are summarized in Table 5.

5. Conclusions

Regarding the first goal of the study, to investigate relationships between breeding management and milk production, the following conclusions can be drawn:

• 305-day milk production per cow is the result of the attitude of the farmer as well as of the genetic capacity (the breeding value) of the cow;


• A high percentage of natural services at the farm has negative effects on breeding values for production and conformation;

• Farmers who state that milk production per cow is a major critical success factor for their farm have herds with higher breeding values for type and udder;

• At high producing farms the farmer puts relatively more emphasis on the quality of the udder and less on the kg of milk.

Regarding the second goal of the study, it can be concluded that PLS is a useful alternative to PCR in livestock management research, to explain complex relationships, but the advantages over PCR are not large. This implies that it is not possible to give general advice on what methodology to use. For each data set the researcher has to decide what methodology fits best. To decide which methodology to use, the researcher has to take into account the requirements of both methodologies. This includes the complexity of the path-analysis, the degree of prior-information that is available, the number of variables compared to the number of observations, the distribution of the variables, and the number of Y-variables.

Acknowledgements

The costs of this study were partly covered by financial support from the Netherlands Organization for Scientific Research (NWO).

References

Afifi, A.A., Clark, V., 1984. Computer-aided multivariate analysis. Lifetime Learning Publications, Belmont, California, 458 pp.

Bigras-Poulin, M., 1985. Interrelationships among calving events, health problems, disposal, death and milk production in Ontario Holstein cows. PhD-thesis. University of Guelph, Ontario.

Churchill, G.A., Doerge, R.W., 1994. Empirical threshold values for quantitative trait mapping. Genetics 138, 963–971.

Faye, B., Lexcourret, F., Dorr, N., Tillard, E., MacDermott, B., McDermott, J., 1997. Interrelationships between herd management practices and udder health status using canonical correspondence analysis. Prev. Vet. Med. 32, 171–192.

Ferguson, J.D., Galligan, D.T., Thomsen, N., 1994. Principal descriptors of body condition score in Holstein cows. J. Dairy Sci. 77, 2695.

Fornell, C., Bookstein, F., 1982. Two structural equation models: LISREL and PLS applied to consumer exit-voice theory. J. Marketing Res. 19, 440–452.

Fornell, C., Cha, J., 1994. Partial Least Squares. In: Bagozzi, R.P. (Ed.), Advanced methods of marketing research. Blackwell, Cambridge, MA, pp. 52–78.

Fornell, C., Lorange, P., Roos, J., 1990. The cooperative venture formation process: a latent variable structural modeling approach. Management Science 36 (10), 1246–1255.

Garthwaite, P., 1994. An interpretation of Partial Least Squares. J. Am. Stat. Assoc. 89, 122–127.

Helland, I.S., Almøy, T., 1994. Comparison of prediction methods when only a few components are relevant. J. Am. Stat. Assoc. 89 (426), 583–591.

Hotelling, H., 1933. Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 24, 417–441, 498–520.

Jöreskog, K.G., Wold, H., 1982. The ML and PLS techniques for modeling with latent variables. Historical and comparative aspects. In: Jöreskog, K.G., Wold, H. (Eds.), Systems under indirect observations. Causality, Structure, Prediction. Part I, pp. 263–270, Amsterdam.

Lafi, S.Q., Kaneene, J.B., 1992a. Epidemiologic and economic study of the repeat breeder syndrome in Michigan dairy cattle. I. Epidemiological modeling. Prev. Vet. Med. 14, 87–98.

Lafi, S.Q., Kaneene, J.B., 1992b. An explanation of the use of principal-components analysis to detect and correct for multicollinearity. Prev. Vet. Med. 13, 261–275.

Lohmöller, J.B., 1987. LVPLS 1.8 Program Manual: Latent Variable path analysis with Partial Least Squares estimation. Zentralarchiv für Empirische Sozialforschung der Universität zu Köln, Köln.

Mason, R.L., Gunst, R.F., 1985. Selection principal components in regression. Stat. Probab. Lett. 3, 299–301.

Miller, R.G., 1974. The jackknife – a review. Biometrika 61, 1–15.

Payne, R.W., Harding, S.A., Arnold, G.M., 1995. Genstat 5: Procedure library manual: release 3[3]. Lawes Agricultural Trust, 383 pp.

Rossa, P.J., 1982. Explaining international political behavior and conflict through partial least squares modeling. In: Jöreskog, K.G., Wold, H. (Eds.), Systems under indirect observation. Causality, structure, prediction. Part II, pp. 131–159, Amsterdam.

Rougoor, C.W., Dijkhuizen, A.A., Huirne, R.B.M., Mandersloot, F., Schukken, Y.H., 1997. Relationships between technical, economic and environmental results on dairy farms: an explanatory study. Livest. Prod. Sci. 47, 235–244.

Rougoor, C.W., Trip, G., Huirne, R.B.M., Renkema, J.A., 1998. How to define and study farmers’ management capacity: theory and use in agricultural economics. Agric. Econ. 18, 261–272.

Steenkamp, J.B.E.M., Van Trijp, H.C.M., 1996. Quality guidance: A consumer-based approach to food quality improvement using partial least squares. Eur. Rev. Agric. Econ. 23, 195–215.

Webster, F.B., Lean, I.J., Curtis, M.A., 1997. A case-control study to identify farm factors affecting fertility of dairy herds. Multivariate description of factors. Austr. Vet. J., 262–265.

Wold, H., 1982. Soft modeling. The basic design and some extensions. In: Jöreskog, K.G., Wold, H. (Eds.), Systems under indirect observation. Causality, structure, prediction. Part II, pp. 1–54, Amsterdam.

Wold, H., 1985. Partial Least Squares. In: Kotz, S., Johnson, N.L. (Eds.), Encyclopedia of Statistical Sciences, Vol. 6. Wiley, New York, pp. 581–591.

Xie, Y.L., Kalivas, J.H., 1997. Evaluation of principal component selection methods to form a global prediction model by principal component regression. Analytica Chimica Acta 348, 19–27.


Fig. 2. Path coefficients for PCR-modelling. NS = not significant; * = P < .05; ** = P < .01.

Table 4
Measurement part of the PLS-model

Synthetic factor               Factor loading   Mult. R²   (Average) variance
  variable                                                 extracted

Critical success factor                         NA         0.45
  Milk production               0.39                       0.15
  Culling                       0.60                       0.35
  Winter milk                  -0.91                       0.83
Breeding goal producer                          0.06       0.59
  Kg milk                      -0.81                       0.65
  Udder                         0.72                       0.52
Use natural service sires                       0.16       NA
  Cow%                         -1.00
Farm size                                       NA         0.87
  No. inseminated              -0.90                       0.80
  Total no. of cows            -0.94                       0.89
  Avg. no. milking cows        -0.95                       0.90
Breeding value production                       0.36       0.91
  Milk                          0.94                       0.88
  Fat                           0.96                       0.92
  INET                          0.97                       0.95
Breeding value conformation                     NA         1.00
  Development                   1.00                       1.00
  Type                          1.00                       1.00
  Udder                         1.00                       1.00
  Legs                          1.00                       1.00
  Total                         1.00                       1.00
Milk production                                 0.47       NA
  305-day milk production       1.00

a NA = not available; this synthetic factor was not predicted by any other synthetic factor.
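The variance-extracted columns of Table 4 can be checked with a short calculation. The sketch below assumes, as is common for standardized variables in PLS path models but is not stated in the text, that the variance extracted for a variable is its squared factor loading and that the average variance extracted of a synthetic factor is the mean over its variables. The loadings are copied from Table 4 for two synthetic factors (their sign does not affect the result); the function names are illustrative.

```python
# Factor loadings copied from Table 4 for two of the synthetic factors.
loadings = {
    "Breeding goal producer": {"Kg milk": -0.81, "Udder": 0.72},
    "Farm size": {
        "No. inseminated": -0.90,
        "Total no. of cows": -0.94,
        "Avg. no. milking cows": -0.95,
    },
}

def variance_extracted(loading):
    # Assumption: share of a standardized variable's variance captured
    # by its synthetic factor equals the squared loading.
    return loading ** 2

def average_variance_extracted(factor_loadings):
    # Mean of the variance extracted over all variables of one factor.
    values = [variance_extracted(l) for l in factor_loadings.values()]
    return sum(values) / len(values)

ave = {name: average_variance_extracted(ls) for name, ls in loadings.items()}
for name, value in ave.items():
    print(f"{name}: average variance extracted = {value:.2f}")
```

Rounded to two decimals this gives 0.59 and 0.87, in line with the values listed for these two factors. Small deviations for individual variables are to be expected, because the loadings in the table are themselves rounded.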


‘Age at Calving’ was not used by the model because all path coefficients to and from this factor were smaller than 0.20. Table 3 and Fig. 2 show that milk production was higher on farms with managers who thought that ‘milk production per cow’ was a CSF for their farm. At these farms the breeding value for conformation was higher. The breeding goal of the producer indicated, however, that these producers put relatively much emphasis on the quality of the udder and less on the kg of milk.

3.3. Partial Least Squares (PLS)

Table 4 provides the factor loadings for each of the measures. The R² of each synthetic factor, the variance extracted for each variable, and the average variance extracted for each synthetic factor are given. The factor loadings show that the variable ‘Winter milk’ is the most important variable of the synthetic variable ‘Critical Success Factors’. The positive and negative signs of the two variables in the synthetic variable ‘Breeding Goal Producer’ show that a farmer who has a high score on this synthetic factor has said that the udder is an important breeding goal at his farm, whereas kg of milk is not. In this model, the age at calving was also not used, because here too the path coefficient was lower than 0.20. The R² of the synthetic factor ‘Milk Production’ shows that the model explained 47% of the differences in milk production.

Fig. 3 gives a graphical representation of the PLS-model with the inner path coefficients. Contrary to the PCR-model, no significance values are given here, because traditional statistical testing methods are not well suited. The Stone–Geisser test criterion Q² was used as an alternative method to evaluate the model. It had a value of 0.31, indicating that the model had predictive relevance, because it was bigger than zero. The same main results as with PCR were found with PLS. Small differences were found in the relation between the synthetic factors ‘Natural Service Sires’ and ‘Breeding Value Conformation’. PCR found a path coefficient of 0.25, whereas in the PLS-model it was smaller than 0.20 and therefore deleted. This indicates that a high percentage of natural services at the farm has a relatively strong negative effect on the breeding value for production and a smaller negative effect on breeding value conformation. Besides that, in the PLS-model, direct effects of the synthetic factor ‘Critical Success Factors’ on ‘Breeding Value Production’ and ‘Natural Service Sires’ were found, whereas in the PCR-model these path coefficients were too small.

4. Discussion

4.1. Breeding management

The path coefficient diagrams (Figs. 2 and 3) showed the same main effects. Milk production per cow was inversely related to farm size (the regression coefficients and loadings were negative for this synthetic factor). Milk production per cow was


directly positively related to breeding value for conformation, and to breeding value for production. These variables, in turn, were related to goals and CSFs of the producer, indicating that milk production is not only related to technical parameters, but also to the attitude of the producer. So, with respect to the aim of the data collection, to determine the relationship between breeding management and 305-day milk production, it can be concluded that the producers’ breeding management was related to the 305-day milk production. Surprisingly, it was found that farmers who stated that they focused mainly on ‘kg of milk’ as a breeding goal had a lower breeding value for milk production and realised a lower 305-day production than producers

4.2. Comparing the methods

Wold (1985) states that PLS is useful when the main focus of the study shifts from individual variables and parameters to packages of variables and aggregate parameters. He stated that ‘in large, complex models with latent variables PLS is virtually without competition’. Rossa (1982) showed a map of statistical methods with regard to the complexity of the problem and their degree of prior information and concluded that PCR and PLS are both useful for complex problems. However, for PLS-modelling more prior information is needed, because the researcher has to design a path diagram with expected relationships beforehand.


