Livestock Production Science 66 (2000) 71–83, www.elsevier.com/locate/livprodsci
The relation between breeding management and 305-day milk production, determined via principal components regression and
partial least squares
C.W. Rougoor a,b,*, R. Sundaram c, J.A.M. van Arendonk d
a Research Station for Cattle, Sheep and Horse Husbandry, Runderweg 6, NL-8219PK Lelystad, The Netherlands
b Department of Economics and Management, Wageningen Agricultural University, Wageningen, The Netherlands
c Department of Statistics and Probability, Michigan State University, East Lansing, MI, USA
d Animal Breeding and Genetics Group, Wageningen Institute of Animal Sciences, Wageningen Agricultural University, Wageningen, The Netherlands
Received 20 November 1998; received in revised form 19 July 1999; accepted 13 December 1999
Abstract
A field study was set up to investigate the relation between breeding management and 305-day milk production. A second goal of the study was to investigate the advantages and disadvantages of principal components regression (PCR) and partial least squares (PLS) for livestock management research. Multicollinearity was present in the data set and the number of variables was high compared to the number of observations. Out of 70 variables related to breeding management and technical results at dairy farms, 19 were selected for PLS and PCR, based on a correlation of ≥ 0.25 or ≤ −0.25 with 305-day milk production. Five principal components (PCs) were selected for PC-regression with 305-day milk production being the goal variable. Related variables were combined into one so-called synthetic factor. All synthetic variables were used in a path-analysis. The same path-analysis was worked out with PLS. PLS forms synthetic factors capturing most of the information for the independent X-variables that is useful for predicting the dependent Y-variables while reducing the dimensionality. Both methodologies showed that milk production per cow is related to critical success factors of the producer, farm size, breeding value for production and conformation. Milk production per cow was the result of the attitude of the farmer as well as the genetic capacity of the cow. It was found that at high producing farms the producer put relatively much emphasis on the quality of the udder and less on the kg of milk. Advantages of PLS are the optimization towards the Y-variable, resulting in a higher R², and the possibility to include more than one Y-variable. Advantages of PCR are that hypothesis testing can be performed, and that complete optimisation is used in determining the PCs. It is concluded that PLS is a good alternative for PCR when relations are complex and the number of observations is small.
© 2000 Elsevier Science B.V. All rights reserved.
Keywords: Dairy cattle; Principal component; Partial least squares; Management; Milk production
* Corresponding author. CLM, P.O. Box 10015, 3505 AA Utrecht. Tel.: +31-30-2441-301; fax: +31-30-2441-318.
E-mail address: crougoor@clm.nl (C.W. Rougoor).
0301-6226/00/$ – see front matter © 2000 Elsevier Science B.V. All rights reserved. PII: S0301-6226(00)00156-1
1. Introduction
Relationships between management and technical
results on dairy farms are complex. The current paper is part of a project with the purpose of getting a better insight into relationships under practical conditions. The major focus was on operational management (the day-to-day decision making) and its relation with 305-day milk production. Livestock management studies of this kind often face the problem that multicollinearity is present in the data set, and that the number of variables is large compared to the number of observations. In case of multicollinearity, standard statistical techniques, such as linear regression, will give unstable estimates of the regression coefficients, which hinders their interpretation. In addition, a large number of variables compared to the number of observations will dramatically decrease the degrees of freedom of the residual variance.
Multicollinearity is often difficult to detect. Afifi and Clark (1984) state that a simple way to check for multicollinearity is to examine the correlations among the independent variables. When a priori information is available on relationships between variables and complex relations have to be determined, path-analysis can be a useful tool (Rougoor et al., 1997). This methodology, however, can only be used when the number of observations is large compared to the number of variables. Bigras-Poulin (1985), Lafi and Kaneene (1992a), Ferguson et al. (1994), and Webster et al. (1997) used Principal Components Analysis and Principal Components Regression (PCR) to reduce the number of independent variables (i.e. to reduce dimensionality) and to avoid problems regarding multicollinearity. Faye et al. (1997) used canonical correspondence analysis, which is a generalisation of principal component analysis. Steenkamp and Van Trijp (1996) were facing the same kind of problems, but used Partial Least Squares (PLS) to reduce dimensionality and multicollinearity. PLS is considered to be useful for describing complex relationships (Fornell and Bookstein, 1982; Fornell et al., 1990). PLS has proved to be successful for forming prediction equations to relate a substance’s chemical composition to its near-infrared spectra (Garthwaite, 1994). However, PLS has hardly been used in livestock management research.
The goal of the current study is twofold:
1. To investigate the relationships between breeding management and 305-day milk production;
2. To investigate the advantages and disadvantages of PLS and PCR for use in livestock management research.
2. Materials and methods
2.1. Data
The data used in the current paper were part of a bigger project, which aimed at determining the relation between management and milk production and gross margin. Therefore, management was divided into the areas of grassland, feeding, animal health, fertility and breeding. Thirty-nine Dutch dairy farms were included in the field study. Farms were selected on the basis of gross margin (from May 1993 till May 1994 for farms using a non-calendar year accounting period, or January 1994 to January 1995 for farms with a calendar year accounting period), and the 305-day milk production between August 1993 and August 1994. Farms had to meet the following two criteria:
1. A gross margin below Dfl. 77.40 or above 78.40 per 100 kg of milk;
2. A 305-day milk production below 7270 kg per cow, or above 7450 kg.
These cut-off values reflect farms selected out of the ‘tails’ of the normal distribution. This way, a data set was created with four groups of farms which could be used to investigate the relation between variation in management and milk production and gross margin. Unfortunately, the differences in gross margin between the four groups had become much smaller over the past few years. Therefore, the analyses were done on individual data, without differentiating between the groups. During the period of data collection one farm dropped out, so analyses were based on data of 38 farms. From May 1996 to May 1997 the farms were visited monthly to collect data. To get insight into the breeding management of the producers, a management questionnaire on breeding decisions was developed and during one of the
farm visits presented to the producer. Questions focused on the breeding goal of the producer, the sire selection and the use of natural service sires. The producer was also asked to indicate what the critical success factors (CSFs) were at his farm regarding production and breeding. Milk production data of the farms, as well as data on breeding values of cows in the herds, were made available by the Royal Dutch Cattle Syndicate (NRS). A first selection of variables was based on a simple linear correlation of ≥ 0.25 or ≤ −0.25 with farm average 305-day milk production. Table 1 gives an overview of the 19 variables that were selected for the multivariate analyses out of a total of 70 variables. The CSFs are measured on a 6-point scale from ‘not important’ to ‘very important’. Strictly speaking, these variables are ordinal. However, an increasing score indicates ‘more important’. Therefore, the variables were assumed to be continuous. To indicate whether multicollinearity was likely to be present in the data set, simple linear correlations between the 19 variables were calculated (Afifi and Clark, 1984).
The number of variables was large compared to the number of farms (our unit of observation). Therefore, the variables were grouped into so-called ‘synthetic factors’. The calculation of synthetic variables from the underlying variables differed between PLS and PCR. This will be discussed when these methodologies are discussed.
Table 1. Description of variables used in the multivariate analyses and their average value for the 38 farms
Synthetic factor | Variable | Description of variable | Avg. value
Critical success factors (CSF) | Production | CSF ‘milk production per cow’ on a 0 (not mentioned) to 5 (most important) scale | 2.20
Critical success factors (CSF) | Culling | CSF ‘culling policy’ on a 0 to 5 scale | 0.69
Critical success factors (CSF) | Winter milk | CSF ‘% of milk produced in winter’ on a 0 to 5 scale | 0.58
Breeding goal (BG) producer | Kg milk | % of points producer gives to ‘kg milk’ as a breeding goal at his farm (a) | 14.56
Breeding goal (BG) producer | Udder | % of points producer gives to ‘udder’ | 10.35
Farm size | No. inseminated | No. of inseminated cows | 13.5
Farm size | Total no. of cows | Total number of cows at the farm | 65.1
Farm size | Avg. no. of mc | Average no. of cows that are not dried off | 55.4
Use natural service | % cows | % of cows inseminated with natural service sires | 3
Breeding value production | BV Milk | Avg. breeding value of cows for kg of milk | 213 kg
Breeding value production | BV Fat | Avg. breeding value of cows for kg of fat | 6.8 kg
Breeding value production | BV INET (b) | Avg. breeding value of cows for INET | 81.3
Breeding value conformation | Development | Avg. breeding value of cows for ‘development’ | 100.2
Breeding value conformation | Type | Avg. breeding value of cows for ‘type’ | 100.1
Breeding value conformation | Udder | Avg. breeding value of cows for ‘udder’ | 100.1
Breeding value conformation | Legs | Avg. breeding value of cows for ‘legs’ | 100.7
Breeding value conformation | Total | Avg. breeding value of cows for ‘total conformation’ | 100.2
Age at calving | Age heifers | Expected age at calving of heifers | 787 days
Age at calving | Calving age | Average age of dairy cows at calving | 1485 days
Milk production | 305-day | Farm average 305-day milk production | 8342 kg
(a) The producer is asked to assign 100 points to the different genetic aspects that he takes into account in the breeding of his cows.
(b) INET = weighted average of the breeding values for kg milk, kg fat and kg protein, based on the price paid for these different components.
Fig. 1 gives the null-path model for the path-analysis. For both methodologies (PLS and PCR) the researcher has to use prior knowledge and intuition to define the synthetic factors and the null-path model. The specification of the synthetic factors was based upon a logical separation of different parts and levels of breeding management. The design of the null-path model was based on the framework as described by Rougoor et al. (1998). The decision-making process (business goals and CSFs) influences biological and technical aspects and processes (breeding value, use of natural service and age at calving), which in turn influence the 305-day milk production. Farm size, in turn, might have influenced the average breeding value of cows on the farm. The path diagram was analysed by PLS and PCR. To get comparable results, for both methodologies the rule was applied that only arrows with a standardised path coefficient larger than 0.20 were kept in the model.
Fig. 1. Null-path model of relation between breeding management and 305-day milk production.
2.2. Principal Components Regression (PCR)
Principal component analysis, a statistical technique originated by Hotelling (1933), is performed in order to simplify the description of a set of interrelated variables (Afifi and Clark, 1984). It allows the transformation of a set of correlated explanatory X-variables into an equal number of uncorrelated variables. These new variables, the so-called principal components (PCs), are all linear combinations of the original correlated X-variables. The PCs are arranged in decreasing order of contribution to variance. Dimensionality can be reduced by selecting only a couple of PCs with a high contribution to variance. The number of PCs selected may be determined by examining the proportion of total variance explained by each component, or by the cumulative proportion of total variance explained. A rule of thumb adopted by many investigators is to select only the PCs explaining at least 100/P percent of the total variance, with P being the total number of variables (Afifi and Clark, 1984). This selection criterion was also used in the current paper. Besides the percentage of variance explained, the eigenvalues of the PCs can be of use to decide how many PCs to include in the regression of the Y-variable (the 305-day milk production) on the PCs. The eigenvalue of a PC is the variance of that PC. When the eigenvalue of a PC is close to zero, it means that multicollinearity is present among the original variables. In that case that PC can be excluded from the regression. These two selection criteria (both using a so-called top-down approach) do not include PCs with a small contribution to variance in the regression. This results in a reliable estimate of the regression parameters. The selected PCs were utilised as uncorrelated explanatory variables in the regression model. Parameter estimates were generated by the equation:
$$\text{305-day milk production} = a + b_1\,PC_1 + b_2\,PC_2 + \ldots + b_n\,PC_n + e \qquad (1)$$
where $a$ is the intercept term, $b_i$ is the regression coefficient, $PC_i$ is the principal component $i$, $n$ is the number of PCs included in the regression, and $e$ is the residual error term. These estimates of the regression coefficients were used to reconstitute regression coefficients for the explanatory variables, as was done by Lafi and Kaneene (1992b):
$$RC\mathrm{var}_j = \mathrm{loadPC}_{1,\mathrm{var}\,j}\,b_1 + \mathrm{loadPC}_{2,\mathrm{var}\,j}\,b_2 + \ldots + \mathrm{loadPC}_{n,\mathrm{var}\,j}\,b_n \qquad (2)$$
where $RC\mathrm{var}_j$ is the standardized reconstituted regression coefficient of variable $j$, $\mathrm{loadPC}_{i,\mathrm{var}\,j}$ is the loading of variable $j$ on $PC_i$, and $b_i$ is the regression coefficient as was estimated in Eq. (1). Due to these transformations, these explanatory variables (the PCs) are corrected in such a way as to minimize the effect of multicollinearity. The reconstituted regression coefficients were used to construct the synthetic factors. This way, dimensionality could be reduced without losing much of the information. Besides that, interpretability will be increased (Afifi and Clark, 1984). The synthetic variables were used in a multivariate path-analysis. Standardized path-coefficients were calculated as described by Rougoor et al. (1997). The procedures PCP and MODEL of the statistical package Genstat (Payne et al., 1995) were used to do the calculations.
2.3. Partial Least Squares (PLS)
PLS is a methodology that can be used for theory confirmation, but can also be used to suggest where relationships might or might not exist and to suggest propositions for later testing. It intends to form so-called ‘latent variables’ (in our case these are the synthetic factors, for instance ‘Breeding Goal Producer’) that capture most of the information for the independent X-variables (i.e. the two breeding goals ‘Kg milk’ and ‘Udder’) that is useful for predicting the dependent Y-variables (in our case the ‘305-day milk production’). At the same time, PLS reduces the dimensionality of the regression problem by using fewer synthetic factors than the number of X-variables. The major difference between PCR and PLS is that with PLS the data values of both the X- and Y-variables influence the construction of the synthetic factors. In the previous paragraph it was explained that the PCs in a PCR are determined without taking into account the Y-variable (Garthwaite, 1994). Another difference between the two methodologies is that PLS can take into account more than one Y-variable at the same time (however, this option will not be used in the current paper).
The input of the PLS-model consists of the raw data, the set-up of the synthetic factors and the set-up of the null-path model. PLS estimates the relations between these data and factors. It distinguishes between different components of the path model. The relationships between the synthetic factors are the so-called inner relations, for instance the relation between the synthetic factors ‘CSF’ and ‘Breeding Goal Producer’. These are given by the inner path coefficients, ranging from −1 (a strong negative relationship) to +1 (a strong positive relationship). Relations between the variables and the synthetic factors are the outer relations, for instance the relation between the breeding goal ‘Kg milk’ and the synthetic factor ‘Breeding Goal Producer’. These are given by the factor loadings. Factor loadings can vary between −1 (indicating a very strong negative relationship; all variance of that variable is captured in that synthetic factor) and +1 (a very strong positive relationship). These are estimated in such a way that the model is optimal in the inner part (i.e. between the synthetic factors) as well as the outer part (i.e. towards the X- and Y-variables). PLS seeks values for the factor loadings and structural parameters that minimize residual variance for the synthetic factors and the X- and Y-variables. This way, a synthetic factor is estimated to be the best predictable variable of its X-variables as well as the best predictor of subsequent dependent synthetic variables or Y-variables (Steenkamp and Van Trijp, 1996).
The PLS algorithm proceeds in three stages. The first stage gives estimates of the case values of the synthetic variables. The second stage of the PLS
algorithm uses the estimates of the synthetic factors from the first stage to estimate the inner and outer relations, without location parameters. The third stage of the algorithm estimates the location parameters of the synthetic factors and the structural relations estimated in the first two stages (Wold, 1982). A detailed overview of these three steps is given by
parameter. Jackknifing provides information about the precision of the parameter estimates. The PLS-model was estimated with the LVPLS 1.8 program (Lohmöller, 1987).
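As a hedged illustration of how jackknifing gives information about the precision of a parameter estimate, the following minimal Python sketch computes leave-one-out estimates, a bias-corrected estimate and a jackknife standard error. It is not the LVPLS procedure used in the study. The statistic is a plain Pearson correlation standing in for a PLS parameter, and the 38 farm records, the names `bv_milk` and `milk_305`, and all numerical values are simulated assumptions.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation; stands in here for a PLS parameter (assumption)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def jackknife(xs, ys, statistic):
    """Return (full-sample estimate, bias-corrected estimate, standard error)."""
    n = len(xs)
    full = statistic(xs, ys)
    # Re-estimate the statistic n times, each time leaving one farm out.
    loo = [statistic(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]) for i in range(n)]
    loo_mean = sum(loo) / n
    # Standard jackknife variance: (n - 1) / n * sum of squared deviations.
    se = math.sqrt((n - 1) / n * sum((t - loo_mean) ** 2 for t in loo))
    corrected = n * full - (n - 1) * loo_mean
    return full, corrected, se


# Simulated data for 38 farms (hypothetical values, not the study data).
random.seed(0)
n_farms = 38
bv_milk = [random.gauss(200.0, 80.0) for _ in range(n_farms)]
milk_305 = [8000.0 + 2.0 * bv + random.gauss(0.0, 300.0) for bv in bv_milk]

estimate, corrected, se = jackknife(bv_milk, milk_305, pearson_r)
print(f"r = {estimate:.3f}, bias-corrected = {corrected:.3f}, SE = {se:.3f}")
```

With only 38 observations, each leave-one-out estimate can shift noticeably, so the jackknife standard error is a useful guard against over-interpreting a single coefficient. For the sample mean the same formula reduces to the usual standard error of the mean, which is a convenient check on the implementation.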
3. Results