
Livestock Production Science 66 (2000) 71–83
www.elsevier.com/locate/livprodsci

The relation between breeding management and 305-day milk production, determined via principal components regression and partial least squares

C.W. Rougoor (a,b,*), R. Sundaram (c), J.A.M. van Arendonk (d)

a Research Station for Cattle, Sheep and Horse Husbandry, Runderweg 6, NL-8219PK Lelystad, The Netherlands
b Department of Economics and Management, Wageningen Agricultural University, Wageningen, The Netherlands
c Department of Statistics and Probability, Michigan State University, East Lansing, MI, USA
d Animal Breeding and Genetics Group, Wageningen Institute of Animal Sciences, Wageningen Agricultural University, Wageningen, The Netherlands

Received 20 November 1998; received in revised form 19 July 1999; accepted 13 December 1999

Abstract

A field study was set up to investigate the relation between breeding management and 305-day milk production. The second goal of the study was to investigate the advantages and disadvantages of principal components regression (PCR) and partial least squares (PLS) for livestock management research. Multicollinearity was present in the data set and the number of variables was high compared to the number of observations. Out of 70 variables related to breeding management and technical results at dairy farms, 19 were selected for PLS and PCR, based on a correlation of ≥ 0.25 or ≤ −0.25 with 305-day milk production. Five principal components (PCs) were selected for PC-regression with 305-day milk production being the goal variable. Related variables were combined into one so-called synthetic factor. All synthetic variables were used in a path-analysis. The same path-analysis was worked out with PLS. PLS forms synthetic factors capturing most of the information for the independent X-variables that is useful for predicting the dependent Y-variables while reducing the dimensionality. Both methodologies showed that milk production per cow is related to critical success factors of the producer, farm size, breeding value for production and conformation. Milk production per cow was the result of the attitude of the farmer as well as the genetic capacity of the cow. It was found that at high producing farms the producer placed relatively strong emphasis on the quality of the udder and less on the kg of milk. Advantages of PLS are the optimization towards the Y-variable, resulting in a higher R², and the possibility to include more than one Y-variable. Advantages of PCR are that hypothesis testing can be performed, and that complete optimisation is used in determining the PCs. It is concluded that PLS is a good alternative for PCR when relations are complex and the number of observations is small. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: Dairy cattle; Principal component; Partial least squares; Management; Milk production

1. Introduction

Relationships between management and technical results on dairy farms are complex. The current paper is part of a project with the purpose of getting a better insight into relationships under practical conditions. The major focus was on operational management (the day-to-day decision making) and its relation with 305-day milk production. Livestock management studies of this kind often face the problem that multicollinearity is present in the data set, and that the number of variables is large compared to the number of observations. In case of multicollinearity, standard statistical techniques, such as linear regression, will give unstable estimates of the regression coefficients, which hinders their interpretation. In addition, a large number of variables compared to the number of observations will decrease the degrees of freedom of the residual variance dramatically.

Multicollinearity is often difficult to detect. Afifi and Clark (1984) state that a simple way to check for multicollinearity is to examine the correlations among the independent variables. When a priori information is available on relationships between variables and complex relations have to be determined, path-analysis can be a useful tool (Rougoor et al., 1997). This methodology, however, can only be used when the number of observations is large compared to the number of variables. Bigras-Poulin (1985), Lafi and Kaneene (1992a), Ferguson et al. (1994), and Webster et al. (1997) used Principal Components Analysis and Principal Components Regression (PCR) to reduce the number of independent variables (i.e. to reduce dimensionality) and to avoid problems regarding multicollinearity. Faye et al. (1997) used canonical correspondence analysis, which is a generalisation of the principal component analysis. Steenkamp and Van Trijp (1996) were facing the same kind of problems, but used Partial Least Squares (PLS) to reduce dimensionality and multicollinearity. PLS is considered to be useful for describing complex relationships (Fornell and Bookstein, 1982; Fornell et al., 1990). PLS has proved to be successful for forming prediction equations to relate a substance's chemical composition to its near-infrared spectra (Garthwaite, 1994). However, PLS has hardly been used in livestock management research.

The goal of the current study is twofold:

1. To investigate the relationships between breeding management and 305-day milk production;
2. To investigate the advantages and disadvantages of PLS and PCR for use in livestock management research.

* Corresponding author. CLM, P.O. Box 10015, 3505 AA Utrecht. Tel.: +31-30-2441-301; fax: +31-30-2441-318. E-mail address: crougoor@clm.nl (C.W. Rougoor).
0301-6226/00 – see front matter © 2000 Elsevier Science B.V. All rights reserved. PII: S0301-6226(00)00156-1

2. Materials and methods

2.1. Data

The data used in the current paper were part of a bigger project, which aimed at determining the relation between management and milk production and gross margin. Therefore, management is divided into management in the areas of grassland, feeding, animal health, fertility and breeding. Thirty-nine Dutch dairy farms were included in the field study. Farms were selected on the basis of gross margin (from May 1993 till May 1994 for farms using a non-calendar year accounting period, or January 1994 to January 1995 for farms with a calendar year accounting period), and the 305-day milk production between August 1993 and August 1994. Farms had to meet the following two criteria:

1. A gross margin below Dfl. 77.40 or above 78.40 per 100 kg of milk;
2. A 305-day milk production below 7270 kg per cow, or above 7450 kg.

These cut-off values reflect farms selected out of the ‘tails’ of the normal distribution. This way, a data set was created with four groups of farms which could be used to investigate the relation between variation in management and milk production and gross margin. Unfortunately, the differences in gross margin between the four groups had become much smaller over the past few years. Therefore, the analyses were done on individual data, without differentiating between the groups. During the period of data collection one farm dropped out, so analyses were based on data of 38 farms.

From May 1996 to May 1997 the farms were visited monthly to collect data. To get insight into the breeding management of the producers, a management questionnaire on breeding decisions was developed and presented to the producer during one of the farm visits. Questions focused on the breeding goal of the producer, the sire selection and the use of natural service sires. The producer was also asked to indicate what the critical success factors (CSFs) were at his farm regarding production and breeding. Milk production data of the farms, as well as data on breeding values of cows at the herds, were made available by the Royal Dutch Cattle Syndicate (NRS). A first selection of variables was based on a simple linear correlation of ≥ 0.25 or ≤ −0.25 with farm average 305-day milk production. Table 1 gives an overview of the 19 variables that were selected for the multivariate analyses out of a total of 70 variables. The CSFs are measured on a 6-point scale from ‘not important’ to ‘very important’. Strictly speaking, these variables are ordinal variables. However, an increasing score indicates ‘more important’. Therefore, the variables were assumed to be continuous. To indicate whether multicollinearity was likely to be present in the data set, simple linear correlations between the 19 variables were calculated (Afifi and Clark, 1984).

Table 1
Description of variables used in the multivariate analyses and their average value for the 38 farms

Synthetic factor | Variable | Description of variable | Avg. value
Critical success factors (CSF) | Production | CSF ‘milk production per cow’ on a 0 (not mentioned) to 5 (most important) scale | 2.20
 | Culling | CSF ‘culling policy’ on a 0 to 5 scale | 0.69
 | Winter milk | CSF ‘% of milk produced in winter’ on a 0 to 5 scale | 0.58
Breeding goal producer (BG) | Kg milk | % of points producer gives to ‘kg milk’ as a breeding goal at his farm (a) | 14.56
 | Udder | % of points producer gives to ‘udder’ | 10.35
Farm size | No. Inseminated | No. of inseminated cows | 13.5
 | Total no. of cows | Total number of cows at the farm | 65.1
 | Avg. no of mc | Average no. of cows that are not dried off | 55.4
Use natural service | % Cows | % of cows inseminated with natural service sires | 3
Breeding value production | BV Milk | Avg. breeding value of cows for kg of milk | 213 kg
 | BV Fat | Avg. breeding value of cows for kg of fat | 6.8 kg
 | BV INET (b) | Avg. breeding value of cows for INET | 81.3
Breeding value conformation | Development | Avg. breeding value of cows for ‘development’ | 100.2
 | Type | Avg. breeding value of cows for ‘type’ | 100.1
 | Udder | Avg. breeding value of cows for ‘udder’ | 100.1
 | Legs | Avg. breeding value of cows for ‘legs’ | 100.7
 | Total | Avg. breeding value of cows for ‘total conformation’ | 100.2
Age at calving | Age heifers | Expected age of calving of heifers | 787 days
 | Calving Age | Average age of dairy cows at calving | 1485 days
Milk production | 305-day | Farm average 305-day milk production | 8342 kg

(a) Producer is asked to assign 100 points to the different genetic aspects that he takes into account for the breeding of his cows.
(b) INET = weighted average of the breeding values for kg milk, kg fat and kg protein, based on the price paid for these different components.

The number of variables was large compared to the number of farms (our unit of observation). Therefore, the variables were grouped into so-called ‘synthetic factors’. The calculation of synthetic variables from the underlying variables differed between PLS and PCR. This will be discussed when these methodologies are discussed.

Fig. 1 gives the null-path model for the path-analysis. For both methodologies (PLS and PCR) the researcher has to use prior knowledge and intuition to define the synthetic factors and the null-path model. The specification of the synthetic factors was based upon a logical separation of different parts and levels of breeding management. The design of the null-path model was based on the framework as described by Rougoor et al. (1998). The decision-making process (business goals and CSFs) influences biological and technical aspects and processes (breeding value, use of natural service and age at calving), which in turn influence the 305-day milk production. Farm size, in turn, might have influenced the average breeding value of cows on the farm. The path diagram was analysed by PLS and PCR. To get comparable results, for both methodologies the rule was applied that only arrows with a standardised path coefficient larger than 0.20 were kept in the model.

Fig. 1. Null-path model of relation between breeding management and 305-day milk production.

2.2. Principal Components Regression (PCR)

Principal component analysis, a statistical technique originated by Hotelling (1933), is performed in order to simplify the description of a set of interrelated variables (Afifi and Clark, 1984). It allows the transformation of a set of correlated explanatory X-variables into an equal number of uncorrelated variables. These new variables, the so-called principal components (PCs), are all linear combinations of the original correlated X-variables. The PCs are arranged in decreasing order of contribution to variance. Dimensionality can be reduced by selecting only a couple of PCs with a high contribution to variance. The number of PCs selected may be determined by examining the proportion of total variance explained by each component, or by the cumulative proportion of total variance explained. A rule of thumb adopted by many investigators is to select only the PCs explaining at least 100/P percent of the total variance, with P being the total number of variables (Afifi and Clark, 1984). This selection criterion was also used in the current paper.

Besides the percentage of variance explained, the eigenvalues of the PCs can be of use to decide how many PCs to include in the regression of the Y-variable (the 305-day milk production) on the PCs. The eigenvalue is the variance of that PC. When an eigenvalue of a PC is close to zero, it means that multicollinearity is present among the original variables. In that case that PC can be excluded from the regression. These two selection criteria (both using a so-called top-down approach) do not include PCs with a small contribution to variance in the regression. This results in a reliable estimate of the regression parameters. The selected PCs were utilised as uncorrelated explanatory variables in the regression model. Parameter estimates were generated by the equation:

\[
\text{305-day milk production} = a + b_1 PC_1 + b_2 PC_2 + \dots + b_n PC_n + e \tag{1}
\]

where $a$ is the intercept term, $b_i$ is the regression coefficient, $PC_i$ is the principal component $i$, $n$ is the number of PCs included in the regression, and $e$ is the residual error term. These estimates of the regression coefficients were used to reconstitute regression coefficients for the explanatory variables, as was done by Lafi and Kaneene (1992b):

\[
RC_{\text{var } j} = \text{load}_{PC_1,\,\text{var } j}\, b_1 + \text{load}_{PC_2,\,\text{var } j}\, b_2 + \dots + \text{load}_{PC_n,\,\text{var } j}\, b_n \tag{2}
\]

where $RC_{\text{var } j}$ is the standardized reconstituted regression coefficient of variable $j$, $\text{load}_{PC_i,\,\text{var } j}$ is the loading of variable $j$ on $PC_i$, and $b_i$ is the regression coefficient as was estimated in Eq. (1). Due to these transformations, these explanatory variables (the PCs) are corrected in such a way as to minimize the effect of multicollinearity. The reconstituted regression coefficients were used to construct the synthetic factors. This way, dimensionality could be reduced without losing much of the information. Besides that, interpretability will be increased (Afifi and Clark, 1984). The synthetic variables were used in a multivariate path-analysis. Standardized path-coefficients were calculated as described by Rougoor et al. (1997). The procedures PCP and MODEL of the statistical package Genstat (Payne et al., 1995) were used to do the calculations.

2.3. Partial Least Squares (PLS)

PLS is a methodology that can be used for theory confirmation, but can also be used to suggest where relationships might or might not exist and to suggest propositions for later testing. It intends to form so-called ‘latent variables’ (in our case these are the synthetic factors, for instance ‘Breeding Goal Producer’) that capture most of the information for the independent X-variables (i.e. the two breeding goals ‘Kg milk’ and ‘Udder’) that is useful for predicting the dependent Y-variables (in our case the ‘305-day milk production’). At the same time, PLS reduces the dimensionality of the regression problem by using fewer synthetic factors than the number of X-variables. The major difference between PCR and PLS is that with PLS the data values of both the X- and Y-variables influence the construction of the synthetic factors. In the previous paragraph it was explained that the PCs in a PCR are determined without taking into account the Y-variable (Garthwaite, 1994). Another difference between the two methodologies is that PLS can take into account more than one Y-variable at the same time (however, this option will not be used in the current paper).

The inputs of the PLS-model are the raw data, the set-up of the synthetic factors and the set-up of the null-path model. PLS estimates the relations between these data and factors. It distinguishes between different components of the path model. The relationships between the synthetic factors are the so-called inner relations, for instance the relation between the synthetic factors ‘CSF’ and ‘Breeding Goal Producer’. These are given by the inner path coefficients, ranging from −1 (a strong negative relationship) to +1 (a strong positive relationship). Relations between the variables and the synthetic factors are the outer relations, for instance the relation between the breeding goal ‘Kg milk’ and the synthetic factor ‘Breeding Goal Producer’. These are given by the factor loadings. Factor loadings can vary between −1 (indicating a very strong negative relationship; all variance of that variable is captured in that synthetic factor) and +1 (a very strong positive relationship). These are estimated in such a way that the model is optimal in the inner part (i.e. between the synthetic factors) as well as the outer part (i.e. towards the X- and Y-variables). PLS seeks values for the factor loadings and structural parameters that minimize residual variance for the synthetic factors and the X- and Y-variables. This way, a synthetic factor is estimated to be the best predictable variable of its X-variables as well as the best predictor of subsequent dependent synthetic variables or Y-variables (Steenkamp and Van Trijp, 1996).

The PLS algorithm proceeds in three stages. The first stage gives estimates of the case values of the synthetic variables. The second stage of the PLS algorithm uses the estimates of the synthetic factors in the first stage to estimate the inner and outer relations, without location parameters. The third step of the algorithm estimates the location parameters of the synthetic factors and the structural relations estimated in the first two stages (Wold, 1982). A detailed overview of these three steps is given by [...]

[...] parameter. Jackknifing provides information about the precision of the parameter estimates. The PLS-model was estimated with the LVPLS 1.8 program (Lohmöller, 1987).

3. Results