
Int. J. Data Analysis Techniques and Strategies, Vol. X, No. Y, xxxx

Supervised learning approaches and feature
selection – a case study in diabetes
Yugowati Praharsi
Department of Industrial and System Engineering,
Chung Yuan Christian University,
Chung Li, 32023, Taiwan
and
Department of Information Technology,
Satya Wacana Christian University,
Salatiga, 50711, Indonesia
E-mail: [email protected]

Shaou-Gang Miaou
Department of Electronic Engineering,
Chung Yuan Christian University,
Chung Li, 32023, Taiwan
E-mail: [email protected]

Hui-Ming Wee*

Department of Industrial and System Engineering,
Chung Yuan Christian University,
No. 200, Chung Pei Rd., Chungli, 32023, Taiwan
Fax: +886-3-2654499
E-mail: [email protected]
*Corresponding author
Abstract: Data description and classification are important tasks in supervised
learning. In this study, three supervised learning methods, namely k-nearest
neighbour (k-NN), support vector data description (SVDD) and support vector
machine (SVM), are considered because they do not suffer from the problem of
introducing a new class. The data sample chosen is the Pima Indians diabetes
dataset. The results show that feature selection based on mean information gain
and a standard deviation threshold can be considered as a substitute for forward
selection. This indicates that data variation captured by information gain is an
important factor that must be considered in selecting a feature subset. Finally,
among the eight candidate features, glucose level is the most prominent feature
for diabetes detection in all classifiers and feature selection methods under
consideration. Relevancy measurement with information gain can sort features
from the most important to the least significant. This can be very useful in
medical applications such as defining feature prioritisation for symptom
recognition.
Keywords: supervised learning; k-nearest neighbour; k-NN; support vector
data description; SVDD; support vector machine; SVM; classification; feature
selection; diabetes.
Copyright © 200x Inderscience Enterprises Ltd.

Reference to this paper should be made as follows: Praharsi, Y., Miaou, S-G.
and Wee, H-M. (xxxx) ‘Supervised learning approaches and feature selection –
a case study in diabetes’, Int. J. Data Analysis Techniques and Strategies,
Vol. X, No. Y, pp.000–000.
Biographical notes: Yugowati Praharsi is a PhD student in the Department of
Industrial and Systems Engineering at Chung Yuan Christian University in
Taiwan. She received her BSc in Mathematics from Satya Wacana Christian
University, Indonesia, and her MSc in Electronic Engineering from Chung Yuan
Christian University, Taiwan. Her research interests are in the fields of
mathematical modelling, operations research, and supply chain management.
Shaou-Gang Miaou is a Professor in the Department of Electronic Engineering at
Chung Yuan Christian University in Taiwan. He received his BS in Electronic
Engineering from Chung Yuan Christian University, Taiwan, and his MS and PhD in
Electrical Engineering from the University of Florida, USA. His research interests
are in the fields of image processing, biomedical signal processing and pattern
recognition. His publications include four patents, ten books and over 120
journal and conference papers. He is a senior member of IEEE.
Hui-Ming Wee is a Professor in the Department of Industrial and Systems
Engineering at Chung Yuan Christian University in Taiwan. He received
his BSc (Hons.) in Electrical and Electronic Engineering from Strathclyde
University, UK, his MEng in Industrial Engineering and Management from the Asian
Institute of Technology (AIT), and his PhD in Industrial Engineering from
Cleveland State University, Ohio, USA. His research interests are in the fields of
production/inventory control, optimisation and supply chain management. His
publications include four books and over 200 refereed journal papers.

1 Introduction


The rapid development of technology has led to an ever-increasing accumulation of data,
yielding valuable collections of facts and information. As a result, data storage has become
the typical means of preserving important facts and information. However, to obtain
valuable information from these data, an effective learning approach for exploring them
must be applied.
Learning approaches for data exploration can be categorised into supervised and
unsupervised. Supervised learning classifies data based on the labels of the input data,
whereas classifying data without input labels is called unsupervised learning. Several
methods have been proposed for these learning tasks, such as Naive Bayes, k-means,
principal component analysis (PCA), k-nearest neighbour (k-NN), support vector data
description (SVDD), support vector machine (SVM), and artificial neural networks. These
methods involve class labels and features of the input data
(Duda et al., 2000; Ji et al., 2008; Smith, 2009).
Since no single classifier gives the best recognition result on all data, and there are
many classifiers to choose from, this study compares only three supervised learning
methods: k-NN (NN with k = 1, 3, 5, 7, 9, 11), SVDD and SVM. These three classifiers have
their own characteristics and are representative of their own kind. Furthermore, they do
not suffer from the problem of introducing a new class, because they either require no
retraining or only small-scale retraining. These three classifiers are also extended by
implementing feature selection methods for higher classification accuracy.
The features used to describe data are not equally important for a given problem. Two
main properties of features influence the performance of classifiers: redundancy and
relevancy. Minimising the inter-correlation among features avoids redundancy, while
measuring the correlation between a feature and the class yields the relevancy of each
feature to the class. In this study, a feature subset is selected using forward selection
and a correlation method, which are based on the wrapper approach and the filter approach
of feature selection theory, respectively. As introduced by Kittler (1978) and mentioned by
Hall (1998), the former begins with an empty set, and features are added one by one until
no feature can produce higher accuracy. The latter measures feature-feature
intercorrelations and feature-class correlations. The correlation between features and the
class label is measured by entropy and information gain, while feature-feature
intercorrelation uses the Pearson correlation.
The remaining parts of this paper are organised as follows. SVM, SVDD, and NN are
described in Sections 2, 3, and 4, respectively. Feature selection is described in Section 5
and diabetes is presented in Section 6. The experimental design is given in Section 7.
Finally, experimental results and conclusions are summarised in Sections 8 and 9,
respectively.

2 Support vector machine

The basic idea of an SVM is to construct a hyperplane such that the margin of separation,
ρ = 2/√(wᵀw), is maximised when classifying the data into positive and negative classes.
The SVM looks for the optimal separating hyperplane (OSH) so that it can classify the data
correctly. The construction of the optimal hyperplane is preceded by a non-linear mapping
into a high-dimensional feature space, so that the mapped data can be separated linearly.
In addition, the relationship between the mapped data and the input data is non-linear
(Cortes and Vapnik, 1995; Haykin, 1999; Huang and Wang, 2006).
For an unseen/test datum z, its class can be obtained by the decision function

D(z) = sign(wᵀφ(z) + b)            (1)

with the following decision rules:

•  if D(z) > 0, φ(z) belongs to the positive class
•  if D(z) = 0, φ(z) is misclassified
•  if D(z) < 0, φ(z) belongs to the negative class

where the variables are:

w      weight vector
b      bias of the separating hyperplane
φ(z)   non-linear mapping from the input space to the feature space.
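As an informal illustration (not the authors' MATLAB implementation), the sketch below applies the decision rule of equation (1) using scikit-learn's SVC with an RBF kernel. The synthetic data, and the use of the gamma parameter as a stand-in for the kernel width s, are assumptions made purely for illustration.

```python
# Minimal sketch of the SVM decision rule in equation (1), assuming scikit-learn.
# The toy data below are synthetic, not the Pima Indians dataset.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))                   # 100 samples, 8 features
y = (X[:, 1] + 0.5 * X[:, 5] > 0).astype(int)   # synthetic binary labels

clf = SVC(C=0.6, kernel="rbf", gamma=1.0)       # gamma stands in for the width s
clf.fit(X, y)

z = rng.normal(size=(1, 8))                     # an unseen/test datum
d = clf.decision_function(z)[0]                 # plays the role of w^T phi(z) + b

if d > 0:
    print("positive class")
elif d < 0:
    print("negative class")
else:
    print("on the decision boundary")
```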

3 Support vector data description

The basic idea of SVDD is to create a description, a closed boundary that contains the
training data, and then to detect whether new data have the same nature as the training
data. The purpose of data description is to provide a compact closed boundary around the
training dataset, called a hyper-sphere. The sphere has a centre a and radius R > 0. The
main idea is to minimise the volume of the sphere by minimising R².
A test datum z is accepted if:

||φ(z) − a||² = K(z, z) − 2 Σ_{i=1}^{N} α_i K(z, x_i) + Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j K(x_i, x_j) ≤ R²            (2)

where α_i is a Lagrange multiplier and K(x_i, x_j) is an inner-product kernel (Lee
et al., 2007; Tax and Duin, 1999, 2004).
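To make the acceptance test concrete, the sketch below evaluates equation (2) directly, assuming that the Lagrange multipliers, the corresponding training vectors and the radius R have already been produced by SVDD training (which is not shown); the Gaussian kernel and its width s are illustrative choices.

```python
# Minimal sketch of the SVDD acceptance test in equation (2); alpha, X and R are
# assumed to come from a trained SVDD model (training itself is not shown here).
import numpy as np

def rbf_kernel(a, b, s=1.0):
    """Gaussian inner-product kernel K(a, b) with width parameter s."""
    return np.exp(-np.sum((a - b) ** 2) / (s ** 2))

def svdd_accepts(z, X, alpha, R, s=1.0):
    """Return True if ||phi(z) - a||^2 <= R^2, i.e. z lies inside the hyper-sphere."""
    k_zz = rbf_kernel(z, z, s)
    cross = sum(a_i * rbf_kernel(z, x_i, s) for a_i, x_i in zip(alpha, X))
    centre = sum(a_i * a_j * rbf_kernel(x_i, x_j, s)
                 for a_i, x_i in zip(alpha, X)
                 for a_j, x_j in zip(alpha, X))
    dist_sq = k_zz - 2.0 * cross + centre   # ||phi(z) - a||^2 as in equation (2)
    return dist_sq <= R ** 2
```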


4 Nearest neighbour

The nearest neighbour classifier assigns data to the class of their nearest neighbours. In
k-NN, a datum is classified by a majority vote of its k nearest neighbours. k-NN is a
simple and basic classification method and can be used as a first step in learning
classification when there is no prior knowledge about the data distribution. The k-NN
classifier is based on the Euclidean distance between the test datum and the training data
(Cunningham and Delany, 2007; Peterson, 2009).
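The majority-vote rule described above can be written in a few lines, as in the following sketch; the function name and the toy data are illustrative only.

```python
# Minimal k-NN sketch: Euclidean distances plus a majority vote of the k neighbours.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, z, k=5):
    """Classify test point z by the majority class of its k nearest neighbours."""
    dists = np.linalg.norm(X_train - z, axis=1)   # Euclidean distance to each training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.05]), k=3))   # prints 1
```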

5 Feature selection

Feature selection is the process of selecting a subset of features from a pre-existing set
of features. The selected features should carry good generalisation capabilities for
designing a classifier (Liu and Yu, 2005; Theodoridis and Koutroumbas, 1999). Feature
selection is an optimisation technique that evaluates which features are relevant to the
class, so as to improve accuracy and reduce the feature dimensionality by removing
features that have high mutual correlation (Chen and Cheng, 2009). There are several
evaluation criteria for feature selection methods, such as information gain (InfoGain),
gain ratio, and the correlation-based feature selector (Cfs).

5.1 Correlation-based feature selection
Cfs is a simple filter algorithm that ranks feature subsets according to the magnitude of
their correlation values and is one of the heuristic approaches. Cfs's feature subset
evaluation is described by Ghiselli (1964), as mentioned in Hall (1998):

M_S = k r̄_cf / √(k + k(k − 1) r̄_ff)            (3)

where M_S is the magnitude (merit) of a feature subset S containing k features, r̄_cf is
the mean class-feature correlation, and r̄_ff is the mean feature-feature intercorrelation.
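A direct transcription of equation (3) might look as follows; the function name and the example values of k, r̄_cf and r̄_ff are assumptions chosen only to illustrate the calculation.

```python
# Minimal sketch of the Cfs merit in equation (3).
import math

def cfs_merit(k, mean_rcf, mean_rff):
    """Merit M_S of a subset S of k features, given the mean feature-class
    correlation and the mean feature-feature intercorrelation."""
    return (k * mean_rcf) / math.sqrt(k + k * (k - 1) * mean_rff)

# Example: 4 features with average relevance 0.3 and average intercorrelation 0.2
print(cfs_merit(4, 0.3, 0.2))   # about 0.47
```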

5.2 Entropy
Uncertainty in a system due to randomness is often measured by entropy. The use of
entropy from information theory for this purpose was introduced by Quinlan (1993), as
mentioned in Hall (1998). The entropy of a feature Y is described as:

H(Y) = − Σ_{y∈Y} p(y) log₂ p(y)            (4)

The entropy of Y after partitioning on another feature X is given by:

H(Y|X) = − Σ_{x∈X} p(x) Σ_{y∈Y} p(y|x) log₂ p(y|x)            (5)

The difference between the entropy of Y before and after partitioning is called the
information gain, which can be formulated as:

Information gain = H(Y) − H(Y|X)            (6)
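As a worked illustration of equations (4) to (6), the sketch below computes the entropy of a nominal class variable, its conditional entropy given a discretised feature, and the resulting information gain; the sample values are invented for illustration.

```python
# Minimal sketch of entropy, conditional entropy and information gain,
# following equations (4)-(6) for nominal-valued variables.
import math
from collections import Counter

def entropy(values):
    """H(Y) = -sum_y p(y) log2 p(y), equation (4)."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(y, x):
    """H(Y|X), equation (5): entropy of Y averaged over the partitions induced by X."""
    n = len(y)
    total = 0.0
    for x_val, count in Counter(x).items():
        subset = [yi for yi, xi in zip(y, x) if xi == x_val]
        total += (count / n) * entropy(subset)
    return total

def information_gain(y, x):
    """Equation (6): H(Y) - H(Y|X)."""
    return entropy(y) - conditional_entropy(y, x)

y = [1, 1, 0, 0, 1, 0, 0, 0]                                      # class labels
x = ["high", "high", "low", "low", "high", "low", "high", "low"]  # discretised feature
print(information_gain(y, x))
```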

6 Diabetes

A blood sugar/glucose level that is too high leads to diabetes. The body needs glucose to
produce energy, and the blood provides it. Glucose comes from the food we eat and is also
produced by the liver and muscles. The blood carries glucose to the body's cells, and
insulin helps the glucose to be absorbed into those cells. Insulin is a hormone produced
by the pancreas. If the body is unable to produce enough insulin, or the insulin does not
work properly, glucose cannot be absorbed into the body's cells. Consequently, the level
of blood glucose increases; when glucose levels persistently exceed the normal limits, the
result is diabetes.
There are two types of diabetes, namely type one and type two. Type one usually affects
children, adolescents, or young adults. It is characterised by the destruction of beta
cells due to an autoimmune process, so that the pancreas can no longer produce insulin.
Type two can affect people at any age. It is characterised by an insulin disorder and
obesity, in which the liver and other cells do not use insulin properly (Alberti and
Zimmet, 1998; What Diabetes Is).
There is a significant probability that women with diabetes were previously diagnosed with
gestational diabetes mellitus (GDM) in the first half of pregnancy. GDM is a common
disorder during pregnancy, and women who are diagnosed with GDM have a high risk of
developing type 2 diabetes later in life (Ben-Haroush et al., 2004; Cheung and Byth, 2003;
Lauenborg et al., 2004).

7 Experiment design

The dataset used in this study is the Pima Indians dataset from the UCI Machine Learning
Repository. It contains only females at least 21 years old. There are 768 instances, of
which 268 are diabetes patients and 500 are non-diabetes patients. Each instance contains
eight attributes/features and one class attribute. The information relating to the dataset
is summarised in Table 1.
Table 1  Dataset of Pima Indians

Num. of classes               2
Num. of features              8
Num. of data in each class    268 (diabetes), 500 (normal)
Num. of total data            768

The attributes/features are the number of times pregnant, the oral glucose tolerance test
(OGTT) result, diastolic blood pressure, triceps skin fold (TSF) thickness, 2-hour serum
insulin, body mass index (BMI), diabetes pedigree function/heredity, and age. For the
class attribute, a value of 1 denotes a diabetes patient and a value of 0 denotes a
non-diabetes patient (Sigillito, 2008).
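As an illustrative sketch that is not part of the original study, the dataset could be loaded as follows, assuming it has been downloaded from the UCI repository to a local CSV file; the file name and the short column names are assumptions.

```python
# Minimal sketch of loading the Pima Indians dataset from a local CSV file
# (the file name/path is an assumption for illustration).
import pandas as pd

columns = ["pregnant", "glucose", "diastolic", "tsf",
           "insulin", "bmi", "pedigree", "age", "class"]
data = pd.read_csv("pima-indians-diabetes.csv", header=None, names=columns)

X = data[columns[:-1]].values    # the eight features
y = data["class"].values         # 1 = diabetes patient, 0 = non-diabetes patient
print(X.shape, int(y.sum()))     # expect (768, 8) and 268 positive instances
```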

7.1 Performance evaluation measure
The performance of each classifier is evaluated using the confusion matrix given in
Figure 1. The following evaluation criteria are used in this study:
Figure 1  Confusion matrix

                            Predicted class
                            P          N
Actual class    P           TP         FN
                N           FP         TN

TP rate = TP / (TP + FN)                                                          (7)

TN rate = TN / (TN + FP)                                                          (8)

Precision = TP / (TP + FP)                                                        (9)

Recall = TP / (TP + FN)                                                           (10)

Accuracy = (TP + TN) / (TP + TN + FP + FN)                                        (11)

F-score = (2 × Recall × Precision) / (Recall + Precision)                         (12)

Kappa value = (Observed agreement − Chance agreement) / (1 − Chance agreement)    (13)

G-mean = √(TN rate × TP rate)                                                     (14)

All the evaluation criteria above are used because each of them has its own strengths and
weaknesses; no single criterion works best on all given data. Their definitions are as
follows. TP rate, or recall, is the proportion of positive data that were correctly
identified. TN rate is the proportion of negative data that were classified correctly.
Precision is the proportion of the predicted positive data that were correct. Accuracy is
the proportion of the total number of predictions that were correct. F-score measures the
balance between precision and recall. The kappa value is an index that compares the
observed agreement against the agreement that would be expected to occur by chance. The
geometric mean (g-mean) is a type of mean that indicates the central tendency of a set of
numbers. In Kubat and Matwin (1997), the geometric mean of two quantities (TP rate and
TN rate) is used as an extra criterion; here, the g-mean is used as an extra criterion for
imbalanced training data, besides accuracy, because it considers the number of correct
predictions for both positive and negative data.
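The criteria in equations (7) to (14) can be computed directly from the four confusion-matrix counts, as in the sketch below; the example counts are illustrative and are not results reported in this paper.

```python
# Minimal sketch computing equations (7)-(14) from the confusion matrix of Figure 1.
import math

def evaluation_measures(tp, fn, fp, tn):
    n = tp + tn + fp + fn
    tp_rate = tp / (tp + fn)                                  # equation (7), also recall
    tn_rate = tn / (tn + fp)                                  # equation (8)
    precision = tp / (tp + fp)                                # equation (9)
    recall = tp_rate                                          # equation (10)
    accuracy = (tp + tn) / n                                  # equation (11)
    f_score = 2 * recall * precision / (recall + precision)   # equation (12)
    chance = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / (n * n)
    kappa = (accuracy - chance) / (1 - chance)                # equation (13)
    g_mean = math.sqrt(tn_rate * tp_rate)                     # equation (14)
    return {"TP rate": tp_rate, "TN rate": tn_rate, "precision": precision,
            "recall": recall, "accuracy": accuracy, "F-score": f_score,
            "kappa": kappa, "g-means": g_mean}

print(evaluation_measures(tp=140, fn=128, fp=41, tn=459))     # illustrative counts
```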
In this study, evaluation is done using 10-fold cross-validation. In 10-fold
cross-validation, the dataset is divided into ten subsets of equal size. Each subset is
held out in turn for testing while the classifier is trained on the remaining subsets,
until all ten subsets have been used. Therefore, every datum in the entire dataset is
predicted exactly once, and the cross-validation accuracy is the percentage of correctly
classified data. A grid search on C (the penalty weight of errors) and s (the RBF kernel
parameter) is performed using cross-validation. This work used MATLAB 7.0 to run the
programme.
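The sketch below shows how such a grid search over C and the RBF kernel parameter can be combined with 10-fold cross-validation using scikit-learn rather than the authors' MATLAB programme; the synthetic data and the grid values are assumptions for illustration.

```python
# Minimal sketch of a 10-fold cross-validated grid search over C and the RBF
# kernel parameter (scikit-learn's gamma plays the role of the width s).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))            # synthetic stand-in for the Pima data
y = (X[:, 1] > 0).astype(int)

param_grid = {"C": [0.1, 0.3, 0.6, 1.0, 3.0],
              "gamma": [0.001, 0.01, 0.1, 1.0]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```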

7.2 Classifiers
Three classifiers are compared in this research: nearest neighbour (NN with k = 1, 3, 5, 7,
9, 11), SVDD and SVM. All classifier performance results were obtained through 10-fold
cross-validation to minimise the effects of data dependency and to prevent over-fitting.
The grid algorithm for all classifiers is given in Figure 2.
In addition to the flow chart shown in Figure 2, the initialisation of the parameters C
and s applies only to the SVM and SVDD classifiers. For the SVDD classifier, the target
data are the diabetes patients and the testing data are both diabetes and non-diabetes
patients. For the NN and SVM classifiers, the target/training data are the same as the
testing data, i.e., diabetes and non-diabetes patients.

Figure 2  A flow chart showing the grid algorithm for SVM, NN, and SVDD classifiers
(training and testing sets are formed, C and s are initialised, the classifiers are
trained with 10-fold cross-validation, and a grid search is repeated until the termination
criteria are met, yielding the optimised C and s)

7.3 Feature selection methods
In order to improve the performance of the classifiers, feature selection methods are
applied. In this study, two feature selection methods are used: forward selection search
and a correlation approach. The purpose of feature selection is to obtain a feature subset
that has a strong correlation with the class while its features are uncorrelated with each
other.
The feature correlation measures used here are the Pearson correlation and entropy. The
Pearson correlation is used to measure the feature-feature correlation because it
indicates a linear relationship; features are similar if they have a strong relationship.
Entropy is used to measure the feature-class correlation because it performs well for data
with nominal class values. The flow chart for all classifiers with feature selection is
given in Figure 3.
The best subset is obtained using several methods:

1  Forward selection heuristic search. This uses a greedy method to obtain the best
   subset: beginning with the empty set, features are added one by one until no feature
   can produce higher accuracy. The possible number of subsets examined by this method is

   number of subsets = Σ_{i=1}^{n} C(n, i) = 2ⁿ − 1

   where n is the number of candidate features.

2  Correlation-based merit. The best subset is obtained based on the best merit, which is
   calculated using equation (3); r̄_ff is derived from the Pearson correlation and r̄_cf
   from the information gain.

3  Thresholding on information gain. Subsets consist of the features whose information
   gain is above a threshold. The following two thresholds are considered:

   a  threshold = mean information gain
   b  threshold = mean information gain − 0.5 * standard deviation.

   Two thresholds are tested in order to see which one is better. The basis threshold is
   the mean information gain; the first threshold considers only this basis, while the
   second also takes the variability of the data (the standard deviation) into account, as
   illustrated in the sketch after this list.
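The following sketch applies the two thresholding rules to the information gain values reported later in Table 11; the abbreviated feature names and the use of the sample standard deviation are assumptions made for illustration.

```python
# Minimal sketch of thresholding on information gain, reusing the values of Table 11.
import statistics

info_gain = {"glucose": 0.1686, "bmi": 0.0822, "age": 0.0488, "pregnant": 0.0469,
             "heredity": 0.0225, "insulin": 0.0197, "diastolic": 0.0160, "tsf": 0.0094}

mean_ig = statistics.mean(info_gain.values())
std_ig = statistics.stdev(info_gain.values())

threshold_a = mean_ig                      # threshold (a): mean information gain
threshold_b = mean_ig - 0.5 * std_ig       # threshold (b): mean minus half a std. dev.

subset_a = [f for f, g in info_gain.items() if g > threshold_a]
subset_b = [f for f, g in info_gain.items() if g > threshold_b]
print(subset_a)   # ['glucose', 'bmi']
print(subset_b)   # ['glucose', 'bmi', 'age', 'pregnant'], as in Table 10
```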
Figure 3  A flow chart of the main programme for SVDD, SVM, and NN classifiers with
feature selection (from the positive and negative training data and the parameters C and
s, the best feature subset is selected by forward selection, merit, or thresholding on the
mean information gain with/without the standard deviation; each classifier is then
evaluated with 10-fold cross-validation, yielding accuracy, error, precision, recall,
F-score, TP rate, TN rate, FP rate, kappa and g-means)

8 Results and discussion

8.1 Performances of supervised learning approaches without feature selection
The performance evaluation measures without feature selection are summarised in Table 2.
According to Table 2, the TN rate (the proportion of non-diabetes patients that were
classified correctly) and the TP rate (the proportion of diabetes patients that were
correctly identified) are not balanced. This is due to the imbalanced training datasets
(Liu et al., 2006). In this study, the accuracy of the SVM classifier (78.3%) outperforms
the accuracy of the other two classifiers and of the work by Bacauskiene et al. (2009)
(76.9%). The optimal parameters for SVM and SVDD are C = 0.6 and s = 1, and C = 0.3 and
s = 750, respectively. After testing k = 1, 3, 5, 7, 9, 11, the optimal k for k-NN is
k = 5. The performance of 1-NN, or simply NN, is included here for comparison due to its
widespread use and simplicity. The table shows that SVDD has the lowest accuracy compared
to the other classifiers. Consequently, SVDD is not well suited to the two-class Pima
Indians diabetes database.
Table 2  Performance evaluation measure for supervised learning without feature selection

Supervised methods   TP rate   TN rate   Accuracy   Precision   Recall   F-score   Kappa value   g-means
SVM                  0.523     0.918     0.783      0.785       0.523    0.619     0.477         0.689
SVDD                 0.692     0.392     0.495      0.372       0.692    0.484     0.070         0.519
1-NN                 0.512     0.8       0.701      0.570       0.512    0.535     0.318         0.634
5-NN                 0.573     0.856     0.759      0.693       0.573    0.621     0.447         0.698

Table 3 provides the computational time of the training and testing stages of each
classifier without feature selection. The total number of data used in this study is 768.
The times were generated using a personal computer (PC) with an Intel Pentium 1.6 GHz CPU
and 512 MB of RAM. This work used MATLAB Version 7.0 Release 14 to run the programme. As
expected, all classifiers need more time in the training phase than in the testing phase.
Moreover, SVM consumes much more time than the other classifiers.
Table 3  Computational time for the classifier without feature selection

8.2 Performances of supervised learning approaches with feature selection
As can be seen from Table 4, the classification accuracy obtained from classifiers trained
on the feature sets selected by forward selection and by the mean InfoGain with standard
deviation threshold is higher than that of the other feature selection methods. This
indicates that data variation is an important factor that must be considered in selecting
a feature subset. Forward feature selection with SVM here also outperforms the genetic algorithm
(GA) driven SVM proposed by Bacauskiene et al. (2009) (77.6%). However, the
improvement in classification accuracy obtained for the data presented in Table 4 is
rather marginal. Other performance evaluation measures are described in Tables 5 to 8.
In addition, using many evaluation criteria increases confidence in the results.
Table 4  Accuracy performance for supervised learning with feature selection

Methods                                                  1-NN    5-NN    SVDD    SVM
Without feature selection                                70.1%   75.9%   49.5%   78.3%
Feature selection:
  Forward selection                                      71.1%   76.2%   67.5%   78.3%
  Correlation (merit)                                    62.5%   68.4%   32.1%   74.3%
  Thresholding on mean information gain                  69.9%   74.1%   26.3%   75.9%
  Thresholding on mean InfoGain – 0.5 * standardDev      71.2%   74.3%   57.1%   77%

Table 5  SVM performance with feature selection

                      TP rate   TN rate   Accuracy   Precision   Recall   F-score   Kappa value   g-means
Forward               0.523     0.918     0.783      0.785       0.523    0.619     0.477         0.689
Merit                 0.392     0.926     0.743      0.694       0.378    0.486     0.359         0.601
Mean info             0.458     0.916     0.759      0.70        0.441    0.535     0.409         0.643
Mean info and stdev   0.512     0.904     0.77       0.70        0.493    0.574     0.447         0.677

Table 6  SVDD performance with feature selection

                      TP rate   TN rate   Accuracy   Precision   Recall   F-score   Kappa value   g-means
Forward               0.55      0.74      0.675      0.524       0.55     0.534     0.285         0.634
Merit                 0.588     0.182     0.321      0.281       0.588    0.377     –0.171        0.294
Mean info             0.573     0.102     0.263      0.249       0.573    0.347     –0.248        0.236
Mean info and stdev   0.592     0.56      0.571      0.414       0.592    0.483     0.138         0.566

Table 7  1-NN performance with feature selection

                      TP rate   TN rate   Accuracy   Precision   Recall   F-score   Kappa value   g-means
Forward               0.423     0.86      0.711      0.611       0.423    0.5       0.306         0.603
Merit                 0.569     0.654     0.625      0.463       0.569    0.508     0.211         0.607
Mean info             0.577     0.762     0.699      0.561       0.577    0.566     0.336         0.661
Mean info and stdev   0.561     0.79      0.712      0.579       0.562    0.566     0.352         0.661


Table 8  5-NN performance with feature selection

                      TP rate   TN rate   Accuracy   Precision   Recall   F-score   Kappa value   g-means
Forward               0.562     0.866     0.762      0.693       0.562    0.616     0.447         0.694
Merit                 0.435     0.814     0.684      0.564       0.435    0.486     0.264         0.593
Mean info             0.519     0.856     0.741      0.662       0.519    0.575     0.394         0.663
Mean info and stdev   0.573     0.832     0.743      0.651       0.573    0.602     0.416         0.686

Table 9  Computational time (in seconds) with feature selection

Methods                                                  1-NN        5-NN        SVDD        SVM
Without feature selection                                52.14       57.35       47.82       961.22
Feature selection:
  Forward selection                                      14,167.12   17,632.11   11,463.77   377,638.1
  Correlation (merit)                                    50.52       56.39       45.57       999.95
  Thresholding on mean information gain                  50.75       58.10       47.41       943.19
  Thresholding on mean InfoGain – 0.5 * standardDev      51.36       57.63       46.61       952.14

Table 10  The best feature subsets selected

Forward selection:
  NN:   11000001 (5-NN): pregnant, glucose, age; 11101111 (1-NN): pregnant, glucose,
        diastolic, insulin, BMI, pedigree, age
  SVDD: 00000001: age
  SVM:  11111111: pregnant, glucose, diastolic, TSF, insulin, BMI, pedigree, age
Correlation (merit):
  NN:   01000000 (1-NN and 5-NN): glucose
  SVDD: 01000000: glucose
  SVM:  01000000: glucose
Thresholding on mean info gain:
  NN:   01000100 (1-NN and 5-NN): glucose, BMI
  SVDD: 01000100: glucose, BMI
  SVM:  01000100: glucose, BMI
Thresholding on mean InfoGain – 0.5 * standardDev:
  NN:   11000101 (1-NN and 5-NN): pregnant, glucose, BMI, age
  SVDD: 11000101: pregnant, glucose, BMI, age
  SVM:  11000101: pregnant, glucose, BMI, age


Table 9 provides the total computational time of both the training and testing stages. The
computer specification is the same as that used to generate Table 3, except that the
results for SVM and 5-NN forward selection were generated using a PC with an Intel Pentium
2.81 GHz CPU and 1 GB of RAM. Comparing columns, SVDD is the fastest, because SVDD is a
one-class classifier. Comparing rows, thresholding on the mean information gain with the
standard deviation is much more efficient than forward selection.
The best feature subset (Table 10) is the one with the highest accuracy for each feature
selection method and classifier. According to the results given in Tables 5 to 8, forward
selection is the best feature selection method for all classifiers, except for 1-NN, where
the mean information gain and standard deviation threshold is best. 1-NN uses four
relevant features (pregnant, glucose, BMI and age), while 5-NN uses seven features
(pregnant, glucose, diastolic, insulin, BMI, pedigree, and age). SVDD uses a single
feature (age), which is the most relevant for describing its structure, while SVM uses all
eight features.
Table 11  Ranking in relevance of each feature to the class

Feature                      Information gain
Glucose                      0.1686
BMI                          0.0822
Age                          0.0488
Pregnant                     0.0469
Heredity                     0.0225
Serum insulin                0.0197
Diastolic blood pressure     0.0160
Triceps skin fold            0.0094

According to Table 11, glucose has the highest information gain, which means that glucose
has the highest relevance to the class. Note also that the glucose feature appears in
every entry of Table 10, except for SVDD with forward selection.

9 Conclusions and future work

In this study, we have discussed three supervised learning methods and their extension
with feature selection in the diabetes case. Based on the highest accuracy, forward
feature selection is the best feature subset selection method for the SVM and 5-NN
classifiers. At the same time, the mean information gain and standard deviation threshold,
which is the best feature selection method for the 1-NN classifier, can be considered as a
substitute for forward selection, because it is computationally efficient and the accuracy
does not decrease significantly for SVM and 5-NN compared to forward selection. This
indicates that data variation captured by information gain is an important factor that
must be considered in selecting a feature subset. In this study, glucose/OGTT is a
prominent feature in all classifiers and feature selection methods. Meanwhile, SVDD uses
only the single age feature as most relevant for describing its data structure. However,
the age feature does not have the highest relevance to the class based on its information
gain value. Therefore,

SVDD cannot be applied to this dataset. Relevancy measurement using information gain can
be used to rank features from the most important to the least important, which can be
useful in medical applications such as defining feature prioritisation. In the future,
this work can be extended by applying an imbalanced SVM (ISVM) to handle the imbalanced
training datasets.

References
Alberti, K.G.M.M. and Zimmet, P.Z. (1998) ‘Definition, diagnosis and classification of diabetes
mellitus and its complications part 1: diagnosis and classification of diabetes mellitus
provisional report of WHO consultation’, Diabetic Medicine, Vol. 15, No. 7, pp.539–553.
Bacauskiene, M., Verikas, A., Gelzinis, A. and Valincius, D. (2009) ‘A feature selection technique
for generation of classification committees and its application to categorization of laryngeal
images’, Pattern Recognition, Vol. 42, No. 5, pp.645–654.
Ben-Haroush, A., Yogev, Y. and Hod, M. (2004) ‘Epidemiology of gestational diabetes mellitus and
its association with type 2 diabetes’, Diabetic Medicine, Vol. 21, No. 2, pp.103–113.
Chen, Y-S. and Cheng, C-H. (2009) ‘Evaluating industry performance using extracted RGR rules
based on feature selection and rough sets classifiers’, Expert Systems with Applications,
Vol. 36, No. 5, pp.9448–9456.
Cheung, N.W. and Byth, K. (2003) ‘Population health significance of gestational diabetes’,
Diabetes Care, Vol. 26, No. 7, pp.2005–2009.
Cortes, C. and Vapnik, V. (1995) ‘Support vector networks’, Machine Learning, Vol. 20, No. 3,
pp.273–297.
Cunningham, P. and Delany, S.J. (2007) k-Nearest Neighbour Classifiers, Artificial Intelligence
Group, Department of Computer Science, Trinity College, Dublin.
Duda, R.O., Hart, P.E. and Stork, D.G. (2000) Pattern Classification, pp.526–527, John Wiley &
Sons, Inc.
Ghiselli, E.E. (1964) Theory of Psychological Measurement, McGraw Hill.
Hall, M.A. (1998) Correlation-based Feature Selection for Machine Learning, The University of
Waikato, Hamilton, New Zealand.
Haykin, S. (1999) ‘Support vector machine’, in Neural Network: A Comprehensive Foundation,
pp.318–350, Prentice-Hall, New Jersey.
Huang, C.L. and Wang, C.J. (2006) ‘A GA-based feature selection and parameters optimization
for support vector machines’, in Expert Systems with Applications Journal, Vol. 31, No. 2,
pp.231–240.
Ji, R., Liu, D., Wu, M. and Liu, J. (2008) ‘The application of SVDD in gene expression data
clustering’, Paper presented at the Proc. of the 2nd Int. Conf. on Bioinformatics and
Biomedical Engineering.
Kittler, J. (1978) ‘Feature set search algorithms’, in C.H. Chen (Ed.): Pattern Recognition and Signal
Processing, the Netherlands.
Kubat, M. and Matwin, S. (1997) ‘Addressing the curse of imbalanced training sets: one-sided
selection’, Paper presented at the Proc. of the 14th Int. Conf. on Machine Learning.
Lauenborg, J., Hansen, T., Jensen, D.M., Vestergaard, H., Molsted-Pedersen, L., Hornnes, P. et al.
(2004) ‘Increasing incidence of diabetes after gestational diabetes’, Diabetes Care, Vol. 27,
No. 5, pp.1194–1199.
Lee, K.Y., Kim, D.W., Lee, K.H. and Lee, D. (2007) ‘Density-induced support vector data
description’, IEEE Trans. on Neural Networks, Vol. 18, No. 1, pp.284–289.
Liu, H. and Yu, L. (2005) ‘Toward integrating feature selection algorithms for classification and
clustering’, IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 4,
pp.491–502.

Supervised learning approaches and feature selection

15

Liu, Y-H., Chen, Y-T. and Lu, S-S. (2006) ‘Face detection using kernel PCA and imbalanced
SVM’, in Advances in Natural Computation, Vol. 4221, pp.351–360, Springer, Berlin
Heidelberg New York.
Peterson, L.E. (2009) k-Nearest Neighbor, available at http://www.scholarpedia.org/article/Knearest_neighbor.
Quinlan, J.R. (1993) C4.5: Programs for Machine Learning, Morgan Kaufmann.
Sigillito, V. (2008) Pima Indians Diabetes Data Set, 12 December, 2009, available at
http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes.
Smith, L.I. (2009) Tutorial on Principal Component Analysis, 27 June, available at
http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf (accessed on
26 February 2002).
Tax, D.M.J. and Duin, R.P.W. (1999) ‘Support vector domain description’, Pattern Recognition
Letter, Vol. 20, Nos. 11–13, pp.1191–1199.
Tax, D.M.J. and Duin, R.P.W. (2004) ‘Support vector data description’, Machine Learning
Journal, Vol. 54, No. 1, pp.45–66.
Theodoridis, S. and Koutroumbas, K. (1999) Pattern Recognition, Academic Press, USA.
What Diabetes Is, available at http://diabetes.niddk.nih.gov/.
