
Information and Software Technology 54 (2012) 41–59


Systematic literature review of machine learning based software development effort estimation models

Jianfeng Wen ᵃ,*, Shixian Li ᵃ, Zhiyong Lin ᵇ, Yong Hu ᶜ, Changqin Huang ᵈ

ᵃ Department of Computer Science, Sun Yat-sen University, Guangzhou, China
ᵇ Department of Computer Science, Guangdong Polytechnic Normal University, Guangzhou, China
ᶜ Institute of Business Intelligence and Knowledge Discovery, Department of E-commerce, Guangdong University of Foreign Studies, Sun Yat-sen University, Guangzhou, China
ᵈ Engineering Research Center of Computer Network and Information Systems, South China Normal University, Guangzhou, China

ARTICLE INFO

Article history:
Received 28 October 2010
Received in revised form 8 August 2011
Accepted 8 September 2011
Available online 16 September 2011

Keywords:
Software effort estimation
Machine learning
Systematic literature review
ABSTRACT

Context: Software development effort estimation (SDEE) is the process of predicting the effort required to develop a software system. In order to improve estimation accuracy, many researchers have proposed machine learning (ML) based SDEE models (ML models) since the 1990s. However, there has been no attempt to analyze the empirical evidence on ML models in a systematic way.
Objective: This research aims to systematically analyze ML models from four aspects: type of ML technique, estimation accuracy, model comparison, and estimation context.
Method: We performed a systematic literature review of empirical studies on ML models published in the last two decades (1991–2010).
Results: We identified 84 primary studies relevant to the objective of this research. After investigating these studies, we found that eight types of ML techniques have been employed in SDEE models. Overall, the estimation accuracy of these ML models is close to the acceptable level and is better than that of non-ML models. Furthermore, different ML models have different strengths and weaknesses and thus favor different estimation contexts.
Conclusion: ML models are promising in the field of SDEE. However, the application of ML models in industry is still limited, so more effort and incentives are needed to facilitate their application. To this end, based on the findings of this review, we provide recommendations for researchers as well as guidelines for practitioners.

© 2011 Elsevier B.V. All rights reserved.

Contents

1. Introduction
2. Method
   2.1. Research questions
   2.2. Search strategy
        2.2.1. Search terms
        2.2.2. Literature resources
        2.2.3. Search process
   2.3. Study selection
   2.4. Study quality assessment
   2.5. Data extraction
   2.6. Data synthesis
   2.7. Threats to validity
3. Results and discussion
   3.1. Overview of selected studies
   3.2. Types of ML techniques (RQ1)
   3.3. Estimation accuracy of ML models (RQ2)
   3.4. ML models vs. non-ML models (RQ3)
   3.5. ML models vs. other ML models (RQ4)
   3.6. Favorable estimation contexts of ML models (RQ5)
4. Implications for research and practice
5. Limitations of this review
6. Conclusion and future work
Acknowledgments
Appendix A
Appendix B
Appendix C
References

* Corresponding author. Tel.: +86 20 34022695. E-mail address: [email protected] (J. Wen).
doi:10.1016/j.infsof.2011.09.002

1. Introduction
Software development effort estimation (SDEE) is the process of predicting the effort required to develop a software system. Generally speaking, SDEE can be considered a sub-domain of software effort estimation, which includes the prediction of not only software development effort but also software maintenance effort. The term software cost estimation is often used interchangeably with software effort estimation in the research community, although effort forms only the major part of software cost. Estimating development effort accurately in the early stages of the software life cycle plays a crucial role in effective project management. Since the 1980s, many estimation methods have been proposed in the SDEE domain. Jørgensen and Shepperd [1] conducted a review that identified up to 11 estimation methods used for SDEE. Among them, the regression-based method was found to dominate, while the use of expert judgment and analogy methods is increasing. In recent years, machine learning (ML) based methods have been receiving increasing attention in SDEE research. Some researchers [2–4] even view the ML based method as one of the three major categories of estimation methods (the other two being expert judgment and algorithmic models). Boehm and Sullivan [5] considered the learning-oriented technique to be one of the six categories of software effort estimation techniques. Zhang and Tsai [6] summarized the applications of various ML techniques in the SDEE domain, including case-based reasoning, decision trees, artificial neural networks, and genetic algorithms.
Despite the large number of empirical studies on ML models,¹ inconsistent results have been reported regarding the estimation accuracy of ML models, the comparisons between ML models and non-ML models, and the comparisons between different ML models. For example, it has been reported that the estimation accuracy of the same ML model varies when it is constructed with different historical project data sets [7–10] or different experimental designs [11]. As for the comparison between ML models and regression models, the studies in [12–14] reported that the ML model is superior to the regression model, while the studies in [2,15,16] concluded that the regression model outperforms the ML model. As for the comparison between different ML models (e.g., artificial neural networks and case-based reasoning), the study in [17] suggested that the former outperforms the latter, while the study in [18] reported the opposite result.
The divergence in the existing empirical studies on ML models, the causes of which are still not fully understood, may prevent practitioners from adopting ML models in practice. Compared with other domains in which ML techniques have been applied successfully, the SDEE domain poses many challenges, such as small training data sets, information uncertainty, qualitative metrics, and human factors. Furthermore, the theory of ML techniques is much more complicated than that of conventional estimation techniques.

¹ We use the term "ML model" in this review to denote "ML based SDEE method". This term may cause confusion since one ML technique identified in this review, i.e., case-based reasoning (CBR), is a lazy learning technique and thus builds no explicit model. Despite that, we still use this term for descriptive convenience.


Although research on ML models is increasing in academia, recent surveys [1,19,20] have revealed that expert judgment (a non-ML method) is still the dominant method for SDEE in industry. To facilitate the application of ML techniques in the SDEE domain, it is crucial to systematically summarize the empirical evidence on ML models in current research and practice.
Existing literature reviews of SDEE can be divided into two categories: traditional literature reviews and systematic literature reviews (SLRs). The traditional reviews [5,21–26] mainly cover the state-of-the-art and research trends, whereas the SLRs [1,19,20,27–31] aim to answer various research questions pertaining to SDEE. To the best of the authors' knowledge, there is no existing SLR that focuses on ML models, which motivates our work in this paper. Specifically, we performed an SLR on ML models published in the period from 1 January 1991 to 31 December 2010. The purpose of this SLR is to summarize and clarify the available evidence regarding (1) the ML techniques for constructing SDEE models, (2) the estimation accuracy of ML models, (3) the comparisons of estimation accuracy between ML models and non-ML models, (4) the comparisons of estimation accuracy between different ML models, and (5) the favorable estimation contexts of ML models. We further provide practitioners with guidelines on the application of ML models in SDEE practice.
The rest of this paper is organized as follows. Section 2 describes the methodology used in this review. Section 3 presents and discusses the review results. Section 4 provides the implications for research and practice. Section 5 discusses the limitations of this review. Conclusion and future work are presented in Section 6.
2. Method
We planned, conducted, and reported the review by following
the SLR process suggested by Kitchenham and Charters [32]. We
developed the review protocol at the planning phase of this SLR.
The review protocol mainly includes six stages: research questions
definition, search strategy design, study selection, quality assessment, data extraction, and data synthesis. Fig. 1 outlines the six
stages of the review protocol.
In the first stage, we raised a set of research questions based on the objective of this SLR. Then, in the second stage, aiming at the research questions, we designed a search strategy to find the studies relevant to the research questions. This stage involves both the determination of search terms and the selection of literature resources, which are necessary for the subsequent search process. In the third stage, we defined study selection criteria to identify the relevant studies that can really contribute to addressing the research questions. As part of this stage, a pilot study selection was employed to further refine the selection criteria. Next, the relevant studies underwent a quality assessment process in which we devised a number of quality checklists to facilitate the assessment. The remaining two stages involve data extraction and data synthesis, respectively. In the data extraction stage, we initially devised a data extraction form and subsequently refined it through pilot data extraction. Finally, in the data synthesis stage, we determined the proper methodologies for synthesizing the extracted data based on the types of the data and the research questions the data addressed.
A review protocol is of critical importance for an SLR. To ensure the rigorousness and repeatability of this SLR, and to reduce researcher bias, we carefully developed the review protocol by frequently holding group discussion meetings on protocol design. Sections 2.1 through 2.6 present the details of the review protocol. At the end of this section, we analyze the threats to the validity of the review protocol.
2.1. Research questions
This SLR aims to summarize and clarify the empirical evidence
on ML based SDEE models. Towards this aim, five research questions (RQs) were raised as follows.
(1) RQ1: Which ML techniques have been used for SDEE?
RQ1 aims at identifying the ML techniques that have been
used to estimate software development effort. Practitioners
can take the identified ML techniques as candidate solutions
in their practice. For ML techniques that have not yet been
employed in SDEE, researchers can explore the possibility
of using them as potentially feasible solutions.
(2) RQ2: What is the overall estimation accuracy of ML models?
RQ2 is concerned with the estimation accuracy of ML models.
Estimation accuracy is the primary performance metric for ML
models. This question focuses on the following four aspects of
estimation accuracy: accuracy metric, accuracy value, data set
for model construction, and model validation method.
(3) RQ3: Do ML models outperform non-ML models?
In most of the existing studies, the proposed ML models are
compared with conventional non-ML models in terms of
estimation accuracy. RQ3 therefore aims to verify whether
ML models are superior to non-ML models.
(4) RQ4: Are there any ML models that distinctly outperform other
ML models?
The evidence from comparisons between different ML models can be synthesized to determine which ML models consistently outperform others. Thus, RQ4 aims to identify the ML models with consistently superior performance.
(5) RQ5: What are the favorable estimation contexts of ML models?
RQ5 aims at identifying the strengths and weaknesses of different ML models. By fully understanding the characteristics of the candidate ML models, practitioners can make a rational decision in choosing the ML models that suit their estimation contexts.

Fig. 1. Stages of review protocol.

2.2. Search strategy

The search strategy comprises search terms, literature resources, and the search process, which are detailed one by one as follows.

2.2.1. Search terms

The following steps were used to construct the search terms [30]:

(a) Derive major terms from the research questions.
(b) Identify alternative spellings and synonyms for major terms.
(c) Check the keywords in relevant papers or books.
(d) Use the Boolean OR to incorporate alternative spellings and synonyms.
(e) Use the Boolean AND to link the major terms.

The resulting complete search terms are as follows. Note that the terms for ML techniques mainly come from textbooks on machine learning [33–35]:

software AND (effort OR cost OR costs) AND (estimat* OR predict*) AND (learning OR "data mining" OR "artificial intelligence" OR "pattern recognition" OR analogy OR "case based reasoning" OR "nearest neighbo*" OR "decision tree*" OR "regression tree*" OR "classification tree*" OR "neural net*" OR "genetic programming" OR "genetic algorithm*" OR "bayesian belief network*" OR "bayesian net*" OR "association rule*" OR "support vector machine*" OR "support vector regression")

2.2.2. Literature resources

The literature resources we used to search for primary studies include six electronic databases (IEEE Xplore, ACM Digital Library, ScienceDirect, Web of Science, EI Compendex, and Google Scholar) and one online bibliographic library (BESTweb). Some other important resources such as DBLP, CiteSeer, and The Collection of Computer Science Bibliographies were not considered, since they are almost entirely covered by the selected literature resources.

The online bibliographic library BESTweb (http://www.simula.no/BESTweb), which is maintained by Simula Research Laboratory, supplies both journal papers and conference papers on software effort estimation. At the time of conducting this review, the library contained up to 1365 papers published in the period 1927–2009. More details about BESTweb can be found in [1].

The search terms constructed above were used to search for journal and conference papers in the six electronic databases. The search terms were adjusted to accommodate different databases, since the search engines of different databases use different search-string syntax. The search in the first five databases covered title, abstract, and keywords. For Google Scholar, only the title was searched, since a full-text search would return millions of irrelevant records.

We restricted the search to the period from 1 January 1991 to 31 December 2010, because the application of ML techniques in the SDEE domain began only in the early 1990s. For example, Mukhopadhyay et al. [36] used the case-based reasoning technique and Briand et al. [37] used the decision trees technique, both in 1992.

2.2.3. Search process

An SLR requires a comprehensive search of all relevant sources. For this reason, we defined the search process and divided it into the following two phases (note that the relevant papers are those papers that meet the selection criteria defined in the next section):

• Search phase 1: Search the six electronic databases separately and then gather the returned papers together with those from BESTweb to form a set of candidate papers.
• Search phase 2: Scan the reference lists of the relevant papers to find extra relevant papers and then, if any, add them into the set.

The software package EndNote (http://www.endnote.com) was used to store and manage the search results. We identified 179 relevant papers according to the search process. The detailed search process and the number of papers identified at each phase are shown in Fig. 2.

Fig. 2. Search and selection process. (Steps: a. Remove duplicate papers; b. Apply selection criteria; c. Derive references from relevant papers; d. Search additional relevant papers; e. Apply quality assessment criteria.)

2.3. Study selection

Search phase 1 resulted in 2191 candidate papers (see Fig. 2). Since many of the candidate papers provide no information useful for addressing the research questions raised by this review, further filtering is needed to identify the relevant papers. This is exactly what study selection aims to do. Specifically, as illustrated in Fig. 2, the study selection process consists of the following two phases. (Note that in each selection phase, two researchers of this review conducted the selection independently. If there were any disagreements between them, a group meeting involving all researchers was held to resolve the disagreements.)

• Selection phase 1: Apply the inclusion and exclusion criteria (defined below) to the candidate papers so as to identify the relevant papers, which provide potential data for answering the research questions.
• Selection phase 2: Apply the quality assessment criteria (defined in the next section) to the relevant papers so as to select the papers with acceptable quality, which are eventually used for data extraction.

We defined the following inclusion and exclusion criteria, which had been refined through pilot selection. We carried out the study selection by reading the titles, abstracts, or full text of the papers.

Inclusion criteria:
• Using an ML technique to estimate development effort.
• Using an ML technique to preprocess modeling data.
• Using a hybrid model that employs at least two ML techniques or combines an ML technique with a non-ML technique (e.g., a statistical method, fuzzy sets, or rough sets) to estimate development effort.
• Comparative studies that compare different ML models or compare an ML model with a non-ML model.
• For a study that has both a conference version and a journal version, only the journal version will be included.
• For duplicate publications of the same study, only the most complete and newest one will be included.

Exclusion criteria:
• Estimating software size, schedule, or time only, without estimating effort.
• Estimating maintenance effort or testing effort.
• Addressing software project control and planning issues (e.g., scheduling, staff allocation, development process).
• Review papers.

By applying the selection criteria in selection phase 1, we identified 171 relevant papers. Then, by scanning the references in these relevant papers, we identified 8 extra relevant papers that were missed in the initial search. Therefore, we identified in total 179 relevant papers.² After applying the quality assessment criteria in selection phase 2, we ultimately identified 84 papers as the final selected studies, which were then used for data extraction. The details of the quality assessment are described in the next section. The complete list of these 84 selected papers can be found in Table A.8 in Appendix A.

2.4. Study quality assessment
The quality assessment (QA) of selected studies was originally used as the basis for weighting the retrieved quantitative data in meta-analysis [38], which is an important data synthesis strategy. However, since the data retrieved in this review were produced by different experimental designs, and the amount of data is relatively small, meta-analysis is unsuitable in this case. For this reason, we did not use the quality assessment results to weight the retrieved data. Instead, we used the quality assessment results to guide the interpretation of review findings and to indicate the strength of inferences. Moreover, the quality assessment results also served as an additional criterion for study selection (see selection phase 2 in Fig. 2).
We devised a number of quality assessment questions to assess the rigorousness, credibility, and relevance of the relevant studies. These questions are presented in Table 1. Some of them (i.e., QA1, QA7, QA8, and QA10) are derived from [39]. Each question has only three optional answers: "Yes", "Partly", or "No", scored as follows: "Yes" = 1, "Partly" = 0.5, and "No" = 0. For a given study, its quality score is computed by summing up the scores of the answers to the QA questions, as the sketch below illustrates.
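For illustration, the following minimal Python sketch applies the scoring scheme just described, together with the score > 5 selection threshold used in selection phase 2; the study's answers here are hypothetical.

# Minimal sketch of the quality scoring scheme (Table 1); answers are hypothetical.
SCORES = {"Yes": 1.0, "Partly": 0.5, "No": 0.0}

def quality_score(answers):
    """Sum the scores of a study's answers to the ten QA questions."""
    return sum(SCORES[a] for a in answers.values())

answers = {f"QA{i}": "Yes" for i in range(1, 8)}   # QA1-QA7 answered "Yes"
answers.update({"QA8": "Partly", "QA9": "No", "QA10": "Yes"})

score = quality_score(answers)   # 8.5 -> "very high" level in Table 4
selected = score > 5             # threshold applied in selection phase 2
print(score, selected)           # 8.5 True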
Two researchers of this review performed the quality assessment of the relevant studies individually. All disagreements on the quality assessment results were discussed among all researchers, and a consensus was eventually reached. To ensure the reliability of the findings of this review, we considered only the relevant studies with acceptable quality, i.e., with a quality score greater than 5 (50% of the perfect score), for the subsequent data extraction and data synthesis. Accordingly, we further dropped 95 relevant papers with quality scores not more than 5 in selection phase 2 (see Fig. 2). The quality scores of the remaining studies are presented in Table B.9 in Appendix B.

² Please contact the authors to obtain the full list of the relevant papers.

Table 1. Quality assessment questions.

No.   | Question
QA1   | Are the aims of the research clearly defined?
QA2   | Is the estimation context adequately described?
QA3   | Are the estimation methods well defined and deliberate?
QA4   | Is the experimental design appropriate and justifiable?
QA5ᵃ  | Is the experiment applied on sufficient project data sets?
QA6   | Is the estimation accuracy measured and reported?
QA7   | Is the proposed estimation method compared with other methods?
QA8   | Are the findings of the study clearly stated and supported by the reported results?
QA9   | Are the limitations of the study analyzed explicitly?
QA10  | Does the study add value to the academic or industrial community?

ᵃ "Yes" (sufficient): two or more data sets; "Partly" (partly sufficient): only one data set; "No" (insufficient): no data set.
2.5. Data extraction
Through data extraction, we exploited the selected studies to collect the data that contribute to addressing the research questions of this review. We devised cards in the form of Table 2 to facilitate data extraction. This form had been refined through pilot data extraction with several selected studies. For ease of the subsequent data synthesis, the items in Table 2 are grouped according to the associated research questions.

Table 2 shows that the data extracted for addressing RQ2, RQ3, and RQ4 relate to the experiments conducted by the selected studies. We noticed that the term "experiment" is used with varying meanings in the software engineering community, and most of the selected studies used this term without a clear definition. Hence, to avoid confusion, in this review we explicitly define the term experiment as a process in which a technique or an algorithm is evaluated with appropriate performance metrics on a particular data set. It should be pointed out that the term experiment defined above is different from the term controlled experiment in software engineering as defined by Sjøberg et al. [40].
We used the data extraction cards to collect data from the selected studies. One researcher of this review extracted the data
and filled them into the cards. Another researcher double-checked
the extraction results. The checker discussed disagreements (if
any) with the extractor. If they failed to reach a consensus, other
researchers would be involved to discuss and resolve the disagreements. The verified extracted data were documented into a file,
which would be used in the subsequent data synthesis.
During the data extraction, some data could not be extracted directly from the selected studies. Nevertheless, we could obtain them indirectly by appropriately processing the available data. For example, the estimation accuracy values of the same model may vary due to different model configurations or different samplings of the data set. For the former case, we chose the accuracy value generated by the optimal configuration; for the latter case, we calculated the mean of the accuracy values (see the sketch below). Moreover, to ensure the reliability of the extracted estimation accuracy of ML models, we did not extract accuracy data from experiments that were conducted on student project data sets or simulated project data sets. That is, we extracted only the accuracy data from experiments conducted on real-world project data sets.
Another issue we encountered during data extraction was that some studies used different terminologies for the same ML technique. For example, some studies used analogy as a synonym of case-based reasoning. Similarly, some studies used regression trees while others used classification and regression trees; both techniques belong to the category of decision trees. Also, optimized set reduction [37] is essentially a form of decision trees. To avoid ambiguity, we adopted the terms case-based reasoning and decision trees uniformly in data extraction and throughout this review.

Not every selected study provides answers to all five research questions. For ease of tracing the extracted data, we explicitly labeled each study with the IDs of the research questions to which the study can provide the corresponding answers. See Table A.8 in Appendix A for the details.

Table 2. The form of the data extraction card.

Data extractor
Data checker
Study identifier
Year of publication
Name of authors
Source
Article title
Type of study (experiment, case study, or survey)
RQ1: Which ML techniques have been used for SDEE?
  ML techniques used to estimate software development effort
RQ2: What is the overall estimation accuracy of ML models?
  Data sets used in experiments
  Validation methods used in experiments
  Metrics used to measure estimation accuracy
  Estimation accuracy values
RQ3: Do ML models outperform non-ML models?
  Non-ML models that this ML model compares with
  Rank of the ML and non-ML models regarding estimation accuracy
  Degree of improvement in estimation accuracy
RQ4: Are there any ML models that distinctly outperform other ML models?
  Other ML models that this ML model compares with
  Rank of the ML models regarding estimation accuracy
  Degree of improvement in estimation accuracy
RQ5: What are the favorable estimation contexts of ML models?
  Strengths of the ML technique in terms of SDEE
  Weaknesses of the ML technique in terms of SDEE
2.6. Data synthesis
The goal of data synthesis is to aggregate evidence from the selected studies to answer the research questions. A single piece of evidence might carry little force on its own, but the aggregation of many pieces can make a point stronger [41]. The data extracted in this review include both quantitative data (e.g., values of estimation accuracy) and qualitative data (e.g., strengths and weaknesses of ML techniques). For the quantitative data, the ideal synthesis methodology is canonical meta-analysis [38], due to its solid theoretical background. However, the quantitative data in this review were obtained from varying experimental designs, and the number of selected studies is relatively small. It is therefore unsuitable to employ standard meta-analysis directly in such a situation [42–44].
We employed different strategies to synthesize the extracted
data pertaining to different kinds of research questions. The synthesis strategies are explained in detail as follows.
For the data pertaining to RQ1 and RQ2, we used the narrative synthesis method. That is, the data were tabulated in a manner consistent with the questions. Visualization tools, including bar charts, pie charts, and box plots, were also used to enhance the presentation of the distribution of ML techniques and their estimation accuracy data.
For the data pertaining to RQ3 and RQ4, which focus on the comparison of estimation accuracy between different estimation models, we used the vote counting method. That is, suppose we want to synthesize the comparisons between model A and model B. We simply count the number of experiments in which model A is reported to outperform model B and the number of experiments in which model B is reported to outperform model A. The two numbers are then used as the basis for analyzing the overall comparison between model A and model B. In this way, we can obtain a general idea about whether one estimation technique outperforms another, as the sketch below illustrates.
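The following minimal Python sketch shows vote counting over a set of hypothetical pairwise comparison records; each record names the winner of one experiment.

from collections import Counter

# Minimal sketch of the vote counting method; the records are hypothetical.
# Each tuple is (winner, loser) for one experiment comparing two models.
comparisons = [("ANN", "CBR"), ("ANN", "CBR"), ("CBR", "ANN"),
               ("ANN", "CBR"), ("CBR", "ANN")]

votes = Counter(winner for winner, _ in comparisons)
# Counter({'ANN': 3, 'CBR': 2}): ANN outperformed CBR in 3 of 5 experiments,
# which serves as the basis for the overall comparison of the two models.
print(votes)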
For the data pertaining to RQ5, we used the reciprocal translation method [45], which is one of the meta-ethnography techniques for synthesizing qualitative data. As Kitchenham and Charters suggest [32], "when studies are about similar things and researchers are attempting to provide an additive summary, synthesis can be achieved by 'translating' each case into each of the other cases". In this review, the reciprocal translation method is applicable when some strengths or weaknesses of an ML technique retrieved from different studies have similar meanings. We translate these strengths or weaknesses into a uniform description of the strength or weakness. For instance, suppose we have identified three different but similar descriptions of the strength of an ML technique from three studies: (1) "can deal with problems where there are complex relationships between inputs and outputs", (2) "can automatically approximate any continuous function", and (3) "has the ability of modeling complex non-linear relationships and is capable of approximating any measurable function". By applying reciprocal translation, we may unify the above three descriptions into the same one: "can learn complex (non-linear) functions".
2.7. Threats to validity
The main threats to validity of this review protocol are analyzed
from the following three aspects: study selection bias, publication
bias, and possible inaccuracy in data extraction.
The selection of studies depends on the search strategy, the literature sources, the selection criteria, and the quality criteria. We constructed the search terms to match the research questions and used them to retrieve the relevant studies from the six electronic databases. However, some desired relevant studies may not use the terms related to the research questions in their titles, abstracts, or keywords. As a result, we run a high risk of leaving out these studies in the automatic search. To address this threat, we manually searched BESTweb [1], a bibliographic library with abundant papers on software effort estimation, as a supplementary way to find the relevant studies that were missed in the automatic search. Besides this, we also manually scanned the reference list of each relevant study to look for extra relevant studies that were not covered by the search of the six electronic databases and the BESTweb library. In addition, we defined selection criteria that strictly comply with the research questions, to prevent the desired studies from being incorrectly excluded. The final decision on study selection was made through double confirmation, i.e., separate selections by two researchers at first, followed by disagreement resolution among all researchers. Nevertheless, it is still possible that some relevant studies have been missed. If such studies do exist, we believe that their number would be small.
Publication bias is possibly another threat to the validity of this review. Given the likelihood of publication bias, positive results on ML models are more likely to be published than negative results, and researchers may tend to claim that their methods outperform others'. As a consequence, this would lead to an overestimation of the performance of ML models. Fortunately, one inclusion criterion (i.e., the fourth inclusion criterion) in this review may mitigate this threat. This inclusion criterion aims to identify the studies that do not propose any new ML model but just conduct comparisons between ML models and other models. These comparative studies have no bias towards any estimation model; they report the comparison results in an objective way. Therefore, in these studies both positive and negative comparison results are likely to be reported, which helps alleviate publication bias to some extent. Nevertheless, since this review did not consider gray literature (i.e., theses, technical reports, white papers, and work in progress) due to the difficulties in obtaining it, it must be admitted that publication bias unavoidably exists.
To reduce the threat of inaccurate data extraction, we elaborated specialized cards for data extraction. In addition, all disagreements between extractor and checker were resolved by discussion among all researchers. However, since the data extraction strategy was designed to extract only the optimal one from the multiple estimation accuracy values produced by different model configurations, an accuracy bias could still occur. Even so, we believe that this accuracy bias is limited, as the optimal configuration of a model is usually the one preferred in practice.

3. Results and discussion
This section presents and discusses the findings of this review.
First, we present an overview of the selected studies. Second, we
report and discuss the review findings according to the research
questions, one by one in the separate subsections. During the discussion, we interpret the review results not only within the context of the research questions, but also in a broader context
closely related to the research questions. For justification, some related works are also provided to support the findings.

3.1. Overview of selected studies
We identified 84 studies (see Table A.8 in Appendix A) in the field of ML based SDEE. These papers were published during the time period 1991–2010. Among them, 59 (70%) papers were published in journals, 24 (29%) appeared in conference proceedings, and one came from a book chapter. The publication venues and distribution of the selected studies are shown in Table 3. The publication venues mainly include Information and Software Technology (IST), IEEE Transactions on Software Engineering (TSE), Journal of Systems and Software (JSS), and Empirical Software Engineering (EMSE). These four journals are prestigious in the software engineering domain [46], and 44 (52%) of the studies in this review were retrieved from them.

Regarding the types of the selected studies, all belong to experiment research except for one case study (S9), and no survey research was found. Although most of the selected studies used at least one project data set from industry to validate ML models, it does not follow that the validation results sufficiently reflect the real situations in industry. In fact, the lack of case studies and surveys from industry may imply that the application of ML techniques in SDEE practice is still immature.

Regarding the quality of the selected studies, since the study selection strategy required the quality score of a study to exceed 5 (the perfect quality assessment score being 10), we believe that the selected studies are of high quality. In fact, as shown in Table 4, about 70% (58 of 84) of the selected studies are at a high or very high quality level.

Table 3. Publication venues and distribution of selected studies.

Publication venue                                                                  | Type       | # of studies | Percent
Information and Software Technology (IST)                                          | Journal    | 14           | 16
IEEE Transactions on Software Engineering (TSE)                                    | Journal    | 12           | 14
Journal of Systems and Software (JSS)                                              | Journal    | 9            | 11
Empirical Software Engineering (EMSE)                                              | Journal    | 9            | 11
Expert Systems with Applications                                                   | Journal    | 4            | 5
International Conference on Predictive Models in Software Engineering (PROMISE)    | Conference | 4            | 5
International Software Metrics Symposium (METRICS)                                 | Conference | 4            | 5
International Symposium on Empirical Software Engineering and Measurement (ESEM)   | Conference | 3            | 4
International Conference on Tools with Artificial Intelligence (ICTAI)             | Conference | 3            | 4
International Conference on Software Engineering (ICSE)                            | Conference | 2            | 2
Journal of Software Maintenance and Evolution: Research and Practice               | Journal    | 2            | 2
Others                                                                             |            | 18           | 21
Total                                                                              |            | 84           | 100

Table 4. Quality levels of relevant studies.

Quality level                  | # of studies | Percent
Very high (8.5 ≤ score ≤ 10)   | 12           | 6.7
High (7 ≤ score ≤ 8)           | 46           | 25.7
Medium (5.5 ≤ score ≤ 6.5)     | 26           | 14.5
Low (3 ≤ score ≤ 5)            | 70           | 39.1
Very low (0 ≤ score ≤ 2.5)     | 25           | 14.0
Total                          | 179          | 100

3.2. Types of ML techniques (RQ1)


From the selected studies, we identified eight types of ML techniques that had been applied to estimate software development effort. They are listed as follows:

• Case-Based Reasoning (CBR)
• Artificial Neural Networks (ANN)
• Decision Trees (DT)
• Bayesian Networks (BN)
• Support Vector Regression (SVR)
• Genetic Algorithms (GA)
• Genetic Programming (GP)
• Association Rules (AR)

Among the above listed ML techniques, CBR, ANN, and DT are the three most frequently used; together they were adopted by 80% of the selected studies, as illustrated in Fig. 3. Detailed information about which ML techniques were used in each selected study can be found in Table C.10 in Appendix C. Fig. 3 presents only the total amount of research attention that each type of ML technique has received during the past two decades; as a complement to Fig. 3, Fig. 4 presents the distribution of research attention in each publication year. As shown in Fig. 4, on the one hand, an obvious publication peak appears around the year 2008; on the other hand, compared to other ML techniques, CBR and ANN seem to have received dominant research attention in most years. Note that some studies contain more than one ML technique.

Fig. 3. Distribution of the studies over type of ML technique: CBR 43 (37%), ANN 30 (26%), DT 19 (17%), BN 7 (6%), SVR 6 (5%), GA 6 (5%), GP 3 (3%), AR 1 (1%).

Fig. 4. Distribution of the studies over publication year (1991–2010, number of studies per year by ML technique).
The identified ML techniques were usually used to estimate software development effort in one of two forms: alone or in combination. The combination form may be obtained by combining two or more ML techniques or by combining ML techniques with non-ML techniques. The ML technique and the non-ML technique most often combined with other ML techniques are GA and fuzzy logic, respectively. As for GA, according to the selected studies, it was found to be used only in combination form. The studies reporting the use of GA in combination with other ML techniques are, for example, S28 (with CBR), S74 (with ANN), and S65 (with SVR). In these combined models, GA was used only for feature weighting (S28, S74) or feature selection (S65). As for fuzzy logic, it was widely used to deal with uncertain and imprecise information. In particular, it was advocated to combine fuzzy logic with ML techniques such as CBR (S5), ANN (S29), and DT (S31) to improve model performance by preprocessing the inputs of the models. A minimal sketch of a weighted CBR estimator of the kind used in such combined models is given below.
What we found about the ML techniques used in the SDEE domain is highly consistent with the findings of several other relevant reviews. For instance, Zhang and Tsai [6] identified five ML techniques (CBR, ANN, DT, BN, GA) used for software effort estimation, all of which have also been identified in this review. Jørgensen and Shepperd [1] summarized the distribution of studies according to estimation methods. Specifically, they identified up to 11 estimation methods, of which four are ML methods: CBR (or analogy), ANN, DT, and BN (ranked in descending order of the number of studies). These four ML methods have also been identified in this review, and even their ranking order in terms of the number of studies is the same.


3.3. Estimation accuracy of ML models (RQ2)
Since ML models are data-driven, both model construction and model validation rely heavily on historical project data. Therefore, when evaluating the estimation accuracy of an ML model, we should take into account the historical project data set on which the model is constructed and validated, as well as the validation method employed.
Various historical project data sets were used to construct and
validate the ML models identified in this review. The most frequently used data sets together with their relevant information
are shown in Table 5. The relevant information includes the type
of the data set (within-company or cross-company), the number
and percentage of the selected studies that use the data set, the
number of projects in the data set, and the source of the data set.
With respect to validation methods, Holdout, Leave-One-Out
Cross-Validation (LOOCV), and n-fold Cross-Validation (n > 1) are
the three dominant ones used in the selected studies. Specifically,
the numbers (percentages) of the studies that used these three validation methods are 32 (38%) for Holdout, 31 (37%) for LOOCV, and
16 (19%) for n-fold Cross-Validation.
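As a concrete illustration of the most common of these methods, the following Python sketch performs LOOCV over a list of (features, actual effort) pairs; the stub estimator that predicts the mean effort of the training projects is a hypothetical placeholder for any of the ML models discussed above.

# Minimal sketch of Leave-One-Out Cross-Validation (LOOCV) for an SDEE model.
def loocv(projects):
    """Hold out each project once, train on the rest, collect (actual, predicted)."""
    results = []
    for i, (features, actual) in enumerate(projects):
        train = projects[:i] + projects[i + 1:]
        # Stub estimator: predict the mean effort of the training projects.
        predicted = sum(effort for _, effort in train) / len(train)
        results.append((actual, predicted))
    return results

projects = [([12.0, 4], 30.0), ([45.0, 9], 110.0), ([15.0, 5], 38.0)]  # hypothetical
print(loocv(projects))

Holdout would instead split the data once into a training set and a test set, while n-fold cross-validation partitions the data into n folds and holds out one fold at a time.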
Besides data set and validation method, accuracy metric should
also be considered in evaluating the ML models. Effort estimation
accuracy can be measured with various metrics, and different metrics measure the accuracy from different aspects. Thus, to answer
RQ2 properly, it is crucial to adopt appropriate accuracy metrics
in evaluating estimation accuracy. It is found in the selected studies that MMRE (Mean Magnitude of Relative Error), Pred (25) (Percentage of predictions that are within 25% of the actual value), and
MdMRE (Median Magnitude of Relative Error) are the three most
popular accuracy metrics. Specifically, the numbers (percentages)
of the studies that used these three metrics are 75 (89%) for MMRE,
55 (65%) for Pred (25), and 31 (37%) for MdMRE. In view of the
dominance of MMRE, Pred (25), and MdMRE metrics, we adopted
them in this review to evaluate the estimation accuracy of ML
models.
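All three metrics derive from the magnitude of relative error, MRE = |actual − predicted| / actual. The following Python sketch computes MMRE, Pred (25), and MdMRE from hypothetical (actual, predicted) effort pairs, such as those collected by the LOOCV sketch above.

from statistics import mean, median

def mre(actual, predicted):
    """Magnitude of relative error of one estimate."""
    return abs(actual - predicted) / actual

def mmre(pairs):
    """Mean MRE over all (actual, predicted) pairs."""
    return mean(mre(a, p) for a, p in pairs)

def mdmre(pairs):
    """Median MRE over all (actual, predicted) pairs."""
    return median(mre(a, p) for a, p in pairs)

def pred(pairs, level=0.25):
    """Fraction of estimates whose MRE is within `level` (25% by default)."""
    return sum(1 for a, p in pairs if mre(a, p) <= level) / len(pairs)

pairs = [(100, 90), (80, 120), (60, 58)]        # hypothetical efforts
print(mmre(pairs), pred(pairs), mdmre(pairs))   # ~0.211, ~0.667, 0.1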
The original values of MMRE, Pred (25), and MdMRE extracted
from the selected studies are summarized in Table C.10 in Appendix C. Notice that, since GA was found to be used only for feature
selection and feature weighting in combined ML models, no estimation accuracy data of GA are provided in Table C.10. To analyze
the estimation accuracy graphically, we used box plots to demonstrate the distributions of MMRE, Pred (25), and MdMRE values of
ML models in Fig. 5a–c, respectively. Since the observations of
MMRE and Pred (25) for AR are few, we did not plot them in the
corresponding figures to avoid the meaningless presentation of
only one or two data points in a box plot. For the same reason,
box plots of MdMRE for SVR, GP, and AR are also not presented
in Fig. 5c.
Table 5. Data sets used for ML model construction and validation (W = within-company, C = cross-company).

Data set   | Type | # of studies | Percent | # of projects | Source
Desharnais | W    | 24           | 29      | 81            | [47]
COCOMO     | C    | 19           | 23      | 63            | [48]
ISBSG      | C    | 17           | 20      | >1000ᵃ        | [49]
Albrecht   | W    | 12           | 14      | 24            | [50]
Kemerer    | W    | 11           | 13      | 15            | [51]
NASA       | W    | 9            | 11      | 18            | [52]
Tukutuku   | C    | 7            | 8       | >100ᵃ         | [53]

ᵃ The number of projects depends on the version of the data set.

A lower MMRE and MdMRE, or a higher Pred (25), indicates a more accurate estimate. On the performance of ML models measured in MMRE and Pred (25) (see Fig. 5a and b), ANN and SVR are the most accurate ones (with median MMRE around 35% and median Pred (25) around 70%), followed by CBR, DT, and GP (with both median MMRE and median Pred (25) around 50%), whereas BN has the worst accuracy (with median MMRE around 100% and median Pred (25) around 30%). On the performance of ML models measured in MdMRE (see Fig. 5c), both CBR and ANN have their median MdMRE around 30%, whereas those of BN and DT are around 45%. Apart from the above mentioned observations