
DATA MINING WITH SOFTWARE INDUSTRY PROJECT
DATA: A CASE STUDY
Topi Haapio and Tim Menzies
Lane Department of Computer Science, West Virginia University
Morgantown, WV 26506-610, USA.

ABSTRACT
Increasingly, data mining is used to improve an organization’s software process quality, e.g. effort estimation. Data is collected from projects, and data miners are used to discover beneficial knowledge. Software project management data suitable for mining can, however, be difficult to collect, and the collection frequently results in a small data set. This paper addresses the challenges of such a small data set and how we overcame them. The paper reports, as a case study, a data mining experiment that both failed and succeeded. While the data did not support answers to the questions that prompted the experiment, we could find answers to other related and important business questions. We offer two conclusions. Firstly, it is important to control research expectations when setting up such a study, since not all questions are supported by the available data. Secondly, it may be necessary to tune the questions to the data, and not the other way around. We offer this second conclusion cautiously since it runs counter to previous empirical software engineering recommendations. Nevertheless, we believe it may be a useful approach when studying real-world software engineering data that may be limited in size, noisy, or skewed by local factors.
KEYWORDS

Case study, software industry, business intelligence, data mining, small data set.

1. INTRODUCTION
In the software industry, business intelligence (BI) information is produced for corporate management to improve software process quality, for instance the quality of cost or effort estimations. One popular process for producing BI information is data mining. Data mining tools are employed to model the data (Pyle 1999), e.g. by regression or with trees. The models of data behavior are generated with data mining learners, using appropriate data. The data can be gathered either manually or automatically. Whereas automated data gathering processes can produce vast amounts of data in some business areas (Pyle 1999), manual data gathering usually results in smaller and noisier data sets. Moreover, whereas some areas of software engineering (SE), e.g. defect analysis, show success in data collection (Menzies et al 2007), others struggle. In fact, it can be quite difficult both to access and to find usable project management data. For example, after 26 years of trying, fewer than 200 sample projects have been collected for the COCOMO database (Boehm et al 2000), and the NASA project repository na60, with 60 NASA projects, took the last two decades to collect (Chen et al 2005). The reason is quite simple: whereas there are many program modules available for defect analysis, those modules belong to only a few projects.
Another reason for differences in data set sizes and quality is that in non-commercial organizations government funding can enable more research-motivated and extensive data collection, whereas in the software industry data collection is driven more by customer needs and by the process maturity models to which the companies are committed. For example, for the last five years we have been collecting project data from a large North European software company on software project activity and effort distribution,
resulting in 32 projects from years 1999-2007. The collected project data does not comprise a large
repository; in fact, it can be presented with a single Excel spreadsheet. This paper describes a case study of a
data mining experiment from that small data set. In our experiment, we attempted to develop statistical
models to predict software project activity effort. Our data mining experiment results were two-fold: while we could not find evidence for what we were primarily tasked to find, we could find evidence for other factors significant for the business. While the detailed results of the data mining are the topic of another publication, in this paper we focus on the other main contribution of our study: addressing the reality and the challenges software companies face in collecting and utilizing project data, and how to overcome them. We argue that such an analysis is an open and urgent question in many SE fields, since suitable real software project data requires substantial collection effort.
The rest of this paper is structured as follows. Section 2 describes the case study presenting our data
mining experiment on effort predictors in software projects. Section 3 offers five conclusions in which the research results and their implications are discussed.


2. CASE STUDY
2.1 Research Methodology, Site and Background
In this study we applied the case study method. Here, the case study methodology is understood and applied in a broader context than that of Yin (1994). We chose an exploratory approach rather than an explanatory one for pragmatic reasons, to obtain answers to the questions “what?” and “what can we do about it?” rather than to the questions “why?” or “how?”; i.e. we attempt to point out challenges relevant to the software industry and provide a solution to overcome these challenges, rather than to understand the phenomena behind the problem.
The case study is based on the data collection practices of Tieto Corporation during 1999-2008. Tieto is one of the largest IT services companies in Northern Europe, with 16,000 employees.
In 2003, the quality executives at Tieto were concerned about the influence that software project activities other than the actual software construction and project management activities have on software project effort and its estimation accuracy; i.e. these general software project activities had not received the attention they might have required. Management hypothesized that focusing only on software construction and project management in effort estimation, and neglecting or undervaluing the other project activities, can result in inaccurate estimates.
In (Haapio 2004), we noted that much of the effort estimation work focuses on the first two of the
following three parts of a typical software project work breakdown structure (WBS):
1. Software construction involves the effort needed for the actual software construction in the
project’s life-cycle frame, such as analysis, design, implementation, and testing. Without this
effort, the software cannot be constructed, verified and validated for the customer hand-over.

2. Project management involves activities that are conducted solely by the project manager, such as
project planning and monitoring, administrative tasks, and steering group meetings.
3. All the activities in the project’s life-cycle frame that do not belong to the other two main
categories can be called non-construction activities. In (Haapio 2006), the category of non-construction activities was further decomposed into seven individual, but generic, software
project activities: configuration management, customer-related activities, documentation,
orientation, project-related activities, quality management, and miscellaneous activities.
Accordingly, we explored how non-construction activities affect effort estimates. Our research (Haapio 2004) showed that the amount of non-construction activity effort in software projects is not only significant (median 20.4% of total project effort) but also varied remarkably between projects. Could this, in fact, be the reason for the high effort estimation inaccuracy back then (the median and mean magnitudes of relative error, MdMRE and MMRE, were reported to be 0.34 and 0.36, respectively (Haapio 2004))? We experimented with data mining to find out if this was indeed the case.
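To make the accuracy measure concrete, the following minimal Python sketch (illustrative only, not part of the original study) computes the magnitude of relative error (MRE) per project and the MMRE and MdMRE summaries; the effort figures are hypothetical.

```python
# Illustrative sketch: MRE per project, plus its mean (MMRE) and median (MdMRE).
# The effort figures below are hypothetical and not taken from the study.
import statistics

def mre(actual_hours, estimated_hours):
    """Magnitude of relative error for a single project."""
    return abs(actual_hours - estimated_hours) / actual_hours

actuals = [1200.0, 560.0, 3400.0, 880.0]     # hypothetical actual efforts (hours)
estimates = [900.0, 600.0, 2500.0, 1000.0]   # hypothetical estimated efforts (hours)

mres = [mre(a, e) for a, e in zip(actuals, estimates)]
print(f"MMRE={statistics.mean(mres):.2f}, MdMRE={statistics.median(mres):.2f}")
```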

2.2 Data Mining Experiment
In the case study, we performed a data mining experiment in which data mining techniques were used in
assessing the impact of different software project activities and in finding predictors for the effort of different
project activities. The learners used in the data mining come from the WEKA toolkit (Holmes et al 1994). In this paper, the focus is on the challenges the software industry can encounter in its software process improvement (SPI) work with data mining. Therefore, we give only a concise description of our data
mining experiment and its generalized results in the following. Details of the experiment and its results will
be published separately.
Data mining starts with data preparation. We applied the data preparation guidelines given by Pyle (1999, 2003) whenever possible. However, Pyle’s guidelines for choosing data are, to our mind, intended for a large data source and for choosing sample records from that large source, not for a small data set like ours. In our case, we collected every record (project data) that we had access to during 2003-2008 and that contained sufficient and relevant data. Although there are hundreds of projects ongoing in the company in question, most of the project data is inaccessible or not usable for data mining purposes.
The gathered data set consisted of 32 custom software development and enhancement projects which took
place in 1999-2006. These projects were delivered to five different Nordic customers who operate mainly in
the telecommunication business domain. The delivered software systems were based on different technical
solutions. The duration of the projects was between 1.9 and 52.8 months. The projects required effort
between 276.5 and 51,426.6 hours. The normal work iteration was included in the effort data. Effort caused
by change management, however, was excluded from the data because change management effort and costs
are not included in the original effort estimation and the realized costs. The effort estimation information was
gathered for this research from the tender proposal documents, the contract documents, and the final report documents.
Pyle (1999, 2003) notes that the collected records should be distributed into three representative data sets: training, testing, and evaluation. Due to our small data set this requirement was unreasonable. Instead, we performed the data mining with one single data set, using a 2/3-1/3 train/test cross-validation procedure repeated ten times.
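The evaluation scheme can be illustrated with the following Python/scikit-learn sketch, which repeats a 2/3-1/3 train/test split ten times. The study itself was carried out in Weka; the data, learner, and score here are placeholders for illustration.

```python
# Sketch of a 2/3-1/3 train/test procedure repeated ten times on a single small
# data set. The study used Weka; this scikit-learn version only shows the scheme.
import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 11))   # placeholder: 32 projects, 11 predictors
y = rng.normal(size=32)         # placeholder: one continuous response

splitter = ShuffleSplit(n_splits=10, test_size=1/3, random_state=0)
scores = []
for train_idx, test_idx in splitter.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    # Correlation coefficient between predicted and actual test values.
    scores.append(np.corrcoef(y[test_idx], model.predict(X[test_idx]))[0, 1])
print("mean correlation over 10 repeats:", np.mean(scores))
```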
For the 32 projects in question, we gathered available variables common and related to most software
projects. Our 11 predictors (also called independent variables or inputs) include project, organization, customer, size (effort), and non-construction activity related variables. Our response variables (also called dependent or
output variables) were originally a set of continuous classes representing the effort proportions of the three
major software project categories (software construction, project management, non-construction activities)
and six further decomposed non-construction activities. We excluded the ‘miscellaneous activities’ response class for two reasons: first, these activities appeared in relatively few projects (28.1%), and second, ‘miscellaneous activities’, having no common denominator, is a ‘dump’ category.
Pyle (1999, 2003) also gives instructions on missing values and data translations. We manipulated our data as little as possible: e.g. all missing values were left missing and were not replaced by a mean or median value. The effort estimate values in hours were log-transformed, which is a usual action (Chen et al 2005), and further derived into two predictors.
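A minimal sketch of this step, assuming a pandas DataFrame with a hypothetical effort column, is shown below; note that missing values stay missing rather than being imputed.

```python
# Sketch of the effort log transform. The column name and rows are hypothetical.
import numpy as np
import pandas as pd

projects = pd.DataFrame({
    "effort_estimate_hours": [276.5, 1500.0, None, 51426.6],
})
# np.log propagates NaN, so missing values remain missing (no mean/median imputation).
projects["log_effort_estimate"] = np.log(projects["effort_estimate_hours"])
print(projects)
```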
Following the guidelines by Pyle (1999, 2003), we visualized the data (the predictors with respect to each response) using Weka’s visualization features, including cognitive nets, also referred to as cognitive maps (Pyle 1999). However, no prominent predictor for the responses could be found by applying these visualization methods.
After data preparation we, encouraged by the results in other studies (e.g. Hall and Holmes 2003; Chen et
al 2005), employed Feature Subset Selection (FSS) to analyze the predictors before classifying them. We
applied Weka’s Wrapper feature subset selector, based on Kohavi and John’s (1997) Wrapper algorithm, since experiments by other researchers strongly suggest that it is superior to many other variable pruning methods.
Our FSS used ten-fold cross-validation for accuracy estimation. The selection search was used to produce
a list of attributes to be used for classification. Attribute selection was performed using the training data and
the corresponding learning schemes for each machine learner. As a result, the Wrapper performed better on
discrete classes than on continuous classes.
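The study used Weka’s Wrapper feature subset selector; as a rough analogue, the scikit-learn sketch below runs a greedy forward search in which candidate feature subsets are scored by a learner’s 10-fold cross-validated accuracy. The learner, the number of features to keep, and the data are placeholders.

```python
# Wrapper-style feature subset selection, sketched with scikit-learn. A greedy
# forward search is scored by the wrapped learner's 10-fold cross-validated
# accuracy, approximating the Kohavi-and-John wrapper idea used via Weka.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(1)
X = rng.normal(size=(32, 11))   # placeholder predictors
y = np.array([0, 1] * 16)       # placeholder discrete response ('under'/'over')

selector = SequentialFeatureSelector(
    GaussianNB(), n_features_to_select=3, direction="forward", cv=10
)
selector.fit(X, y)
print("selected predictor indices:", np.flatnonzero(selector.get_support()))
```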
After FSS, we applied a range of learners to find predictors for software project activity effort. First, we
applied three learners provided by Weka for linear analyses:
• Function-based Multilayer Perceptron (Quinlan 1992a; Ali and Smith 2006) and Linear
Regression.
• Tree-based M5-Prime, M5P by Quinlan (1992b).
The linear analysis results, in terms of correlation coefficient, were obtained either with the full data set or a pruned one. We pruned our data set by removing all columns that were selected in fewer than 50% of the folds in a 3-fold cross-validation FSS (3-fold because our data set was small). This FSS was made with the Wrapper using each learner. We used column pruning to gain better results. Although FSS improved our results slightly, none of the response classes reached acceptable coefficients. Since continuous methods were apparently not helpful in this domain, we moved on to discretizing the continuous space, dividing all our output variables into two discretized groups (‘under’ and ‘over’) according to their median values.
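The median discretization can be sketched as follows; the response column and its values are hypothetical.

```python
# Sketch of median discretization: a continuous response is split into
# 'under'/'over' at its median. Column name and values are hypothetical.
import pandas as pd

responses = pd.DataFrame({
    "quality_mgmt_effort_pct": [1.2, 3.5, 0.8, 4.1, 2.0, 2.9],
})
median = responses["quality_mgmt_effort_pct"].median()
responses["quality_mgmt_class"] = responses["quality_mgmt_effort_pct"].apply(
    lambda v: "over" if v > median else "under"
)
print(responses)
```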
For the discretized response classes, we applied a range of learners provided by Weka:
• Bayes-based Naïve Bayes (Hall and Holmes 2003; Ali and Smith 2006; Menzies et al 2007) and
AODE (Witten and Frank 2005).
• Function-based Multilayer Perceptron.
• Tree-based J48, a Java implementation of Quinlan’s (1992a) C4.5 algorithm.
• Rule-based JRip, a Java implementation of Cohen’s (1995) RIPPER rule learner, and Holte’s
(1993) OneR (Nevill-Manning et al 1995; Ali and Smith 2006).
Results were averages across a 2/3-1/3 cross-validation study, repeated ten times for each of the learners and for each of the nine possible target variables (in each of the nine cases, one output variable was selected and the other eight were removed from the data set). The prediction outcomes of the discretized classes were presented with confusion matrices. An ideal result has no false positive or false negative values. In practice, such an ideal outcome happens very rarely, due to data idiosyncrasies. Hence, we were satisfied with an ‘acceptable result’ of most results on the diagonal with only a few off-diagonal entries.
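The discrete-class evaluation can be sketched as follows, with scikit-learn’s DecisionTreeClassifier standing in for Weka’s J48 and the confusion matrices summed over ten repeated 2/3-1/3 splits; the data is a placeholder.

```python
# Sketch of the discrete-class evaluation: train on a 2/3 split, build a
# confusion matrix on the remaining 1/3, and sum the matrices over ten repeats.
# DecisionTreeClassifier stands in for Weka's J48; the data is a placeholder.
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(2)
X = rng.normal(size=(32, 11))
y = np.array(["under", "over"] * 16)   # placeholder discretized response

splitter = StratifiedShuffleSplit(n_splits=10, test_size=1/3, random_state=0)
total = np.zeros((2, 2), dtype=int)
for train_idx, test_idx in splitter.split(X, y):
    clf = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    total += confusion_matrix(y[test_idx], clf.predict(X[test_idx]),
                              labels=["under", "over"])
print(total)  # an 'acceptable result' keeps most counts on the diagonal
```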

An ‘acceptable result’ appears for only one response class, i.e. when the learners are predicting the level of one of the non-construction activities, namely quality management, apparent in a project; no ‘acceptable result’ appears for the non-construction activities category as a whole. The data mining experiment result, interesting for both the software business and quality management in general, was the following prediction: if a project is estimated to have a smaller effort (than the median effort), the relative quality management effort will realize over its median value, and vice versa, if a project is estimated to have a larger effort (than the median effort), the relative quality management effort will realize under its median value. We will return to the details of the experiment and its results in another publication.

2.3 Analysis on Data Collection Practices at Research Site
During the decade from which the data sample was collected (1999-2008), the fast pace of company acquisitions led to a shortage of good-quality project management data at the research site. The acquired companies had their own project management data systems and their own practices for how the data was structured. Also, most acquired companies remained independent subsidiaries, with no or very limited visibility into a company's project management data from outside the specific subsidiary.
The work breakdown structures created for the projects were based more on invoicing than on effort data utilization purposes, and in many cases the structures were proposed by the customer to assist in their budgeting and cost control needs. Recently, investments in data utilization have increased as the company strives for the highest CMMI maturity levels.


3. CONCLUSION
Based on our experience, we offer the following five conclusions. Firstly, based on our case study research site analysis, we recommend that software companies strive to remove the barriers related to project management data. In particular, the use of one corporate-wide project management data system, conformity of data structures within the system, and transparent data promote the success of data mining. In practice, however, all barriers might be difficult to remove.
Secondly, a general result for industry is that we were able to draw conclusions despite a severe shortage in the amount of available data. Our pre-experimental concern was that we lacked sufficient data to draw useful conclusions. For this study we collected project data over the last five years from a large North European software company. Even after an elaborate historical data collection covering the period 1999-2007, we could only find data on 32 projects. Nevertheless, even this small amount of data was sufficient to learn a useful effort predictor, beneficial BI knowledge for software business and quality management.
Thirdly, our data mining experiment suggests that standard continuous models can be outperformed by qualitative discrete models. Specifically, in our case, we achieved a result with median discretization pre-processing, i.e. by dividing both the predictor and response classes according to their median values. We found a
signal in the qualitative space that was invisible in the quantitative space. In fact, it can be easier to find a
dense target than a diffuse one. Median discretization batches up diffuse signals into a small number of
buckets, in our case two. Hence, we recommend that if continuous modeling fails, discrete modeling be tried, as it can turn out to be successful.
Fourthly, the expectations for the research need to be carefully managed at the start of a data mining
experiment. If we cannot find the answers the stakeholder that commissioned the research wants to hear, but
we can find other important factors, we do not want disappointment over the former to blind our
commissioner to the value of the latter.
Fifthly, it can be useful to allow for a redirection halfway through a study. Just because a data set does not support answers to question X does not mean it cannot offer useful information about question Y. Hence,
we advise tuning the question to the data and not the other way around. This advice is somewhat at odds with
standard empirical SE theory and literature, which advises tuning data collection according to the goals of the
study (van Solingen and Berghout 1999). For example, in Goal/Question/Metric (GQM) paradigm (Basili
and Rombach 1988), data collection is designed as follows (van Solingen and Berghout 1999):
1. Conceptual level (goal): A goal is defined for an object for a variety of reasons, with respect to
various models of quality, from various points of view and relative to a particular environment.
2. Operational level (question): A set of questions is used to define models of the object of study
and then focuses on that object to characterize the assessment or achievement of a specific goal.
3. Quantitative level (metric): A set of metrics, based on the models, is associated with every
question in order to answer it in a measurable way.
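Purely as an illustration, such a mapping can be written down as a simple goal-questions-metrics structure; the goal, questions, and metrics below are hypothetical examples, not those of this study.

```python
# Hypothetical GQM mapping: one goal, refined into questions, each answered by metrics.
gqm = {
    "goal": "Improve effort estimation accuracy from the project manager's viewpoint",
    "questions": {
        "How accurate are the current effort estimates?": ["MMRE", "MdMRE"],
        "How large is the non-construction share of effort?": [
            "non-construction effort / total project effort",
        ],
    },
}

for question, metrics in gqm["questions"].items():
    print(question, "->", ", ".join(metrics))
```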
In an ideal case, we can follow the above three steps. However, the pace of change in both the software industry and SE organizations in general can make this impractical, as it did at the NASA Software Engineering Laboratory (SEL), for instance. With the shift from in-house production to outsourced production by external contractors, and without an owner of the software development process, each project could adopt its own structure. The SEL's earlier learning-organization experimentation became difficult since there was no longer a central model to build on (Basili et al 2002). The factors that led to the demise of the SEL are still active. SE practices are quite diverse, with no widely accepted or widely adopted definition of ‘best’ practices. The SEL failed because it could not provide added value when faced with software projects that did not fit its preconceived model of how a software project should be conducted. In the 21st century, we should not expect to find large uniform data sets where fields are well-defined and filled in by the projects. Rather, we need to find ways for data mining to be a value-added service, despite idiosyncrasies in the data. Consequently, we recommend GQM when data collection can be designed before a project starts its work. Otherwise, as done in this paper, we recommend using data mining as a microscope to closely examine the data. While it is useful to start such a study with a goal in mind, an experimenter should be open to stumbling over other hypotheses.
To summarize, the main conclusion of our study is that the question must be tuned to the data. In the modern agile and outsourced SE world, many of the premises of prior SE research no longer hold. We cannot assume a static, well-structured domain where consistent data can be collected from multiple projects over many years for research and other purposes. We need to recognize that data collection has its limits in organizations driven by cost minimization and in which customers are the main stakeholders behind data collection. Thus, we conclude with a recommendation to analyze the initial starting point for data mining, and then either continue with GQM or, as in our case, use data mining for data examination.

ACKNOWLEDGEMENT
The authors would like to thank Prof. Anne Eerola, University of Kuopio, for commenting on the draft paper.


REFERENCES
Ali, S. and Smith, K., 2006. On learning algorithm selection for classification. In Applied Soft Computing, Vol. 6, No. 2,
pp. 119-138.
Basili, V. and Rombach, H., 1988. The TAME Project: Towards Improvement-Oriented Software Environments. In IEEE
Transactions on Software Engineering, Vol. 14, No. 6, pp. 758-773.
Basili, V. et al, 2002. Lessons learned from 25 years of process improvement: the rise and fall of the NASA software
engineering laboratory. Proceedings of the 24th International Conference on Software Engineering (ICSE’02),
Orlando, USA, pp. 69-79.
Boehm, B. et al, 2000. Software Cost Estimation with COCOMO II. Prentice-Hall, Upper Saddle River, USA.
Chen, Z. et al, 2005. Finding the Right Data for Software Cost Modeling. In IEEE Software, Vol. 22, No. 6, pp. 38-46.
Cohen, W., 1995. Fast Effective Rule Induction. Proceedings of the 12th International Conference on Machine Learning,
Tahoe City, USA, pp. 115-123.
Haapio, T., 2004. The Effects of Non-Construction Activities on Effort Estimation. Proceedings of the 27th Information
Systems Research in Scandinavia (IRIS’27), Falkenberg, Sweden, [13].
Haapio, T., 2006. Generating a Work Breakdown Structure: A Case Study on the General Software Project Activities.
Proceedings of the 13th European Conference on European Systems & Software Process Improvement and
Innovation (EuroSPI’2006), Joensuu, Finland, pp. 11.1-11.
Hall, M. and Holmes, G., 2003. Benchmarking attribute selection techniques for discrete class data mining. In IEEE
Transactions on Knowledge and Data Engineering, Vol. 15, No. 6, pp. 1437-1447.
Holmes, G. et al, 1994. WEKA: A Machine Learning Workbench. Proceedings of the 1994 Second Australian and New
Zealand Conference on Intelligent Information Systems, Brisbane, Australia, pp. 357-361.
Holte, R., 1993. Very simple classification rules perform well on most commonly used datasets. In Machine Learning,
Vol. 11, pp. 63-91.
Kohavi, R. and John, G., 1997. Wrappers for feature subset selection. In Artificial Intelligence, Vol. 97, No. 1-2, pp. 273-324.
Menzies, T. et al, 2007. Data Mining Static Code Attributes to Learn Defect Predictors. In IEEE Transactions on
Software Engineering, Vol. 33, No. 1, pp. 2-13.
Nevill-Manning, C. et al, 1995. The Development of Holte’s 1R Classifier. Proceedings of the Second New Zealand
International Two-Stream Conference on Artificial Neural Networks and Expert Systems, Dunedin, New Zealand, pp.
239-242.
Pyle, D., 1999. Data Preparation for Data Mining. Morgan Kaufmann Publishers, Inc., San Francisco, USA.
Pyle, D., 2003. Data Collection, Preparation, Quality, and Visualization. In The Handbook of Data Mining, pp. 365-391,
Lawrence Erlbaum Associates, Inc., Mahwah, USA.
Quinlan, R., 1992a. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, USA.
Quinlan, R., 1992b. Learning with Continuous Classes. Proceedings of the 5th Australian Joint Conference on Artificial
Intelligence, Hobart, Tasmania, pp. 343-348.
van Solingen, R. and Berghout, E., 1999. The Goal/Question/Metric Method. McGraw-Hill Education, London, UK.
Witten, I. and Frank, E., 2005. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, Los Altos, USA.
Yin, R., 1994. Case Study Research: Design and Methods. Sage Publications, Thousand Oaks, USA.
