R.D.L.N. Karisma, A. Kormitson, H. Kuswanto
SWUP
MA.27
The controlled risk factors used as variables in this case are alcohol habit, smoking habit, physical activity, body mass index, dietary habit, weight, and height. Algorithms such as classification trees and regression trees are used by researchers for classification. CART (Classification and Regression Trees), one of the classification methods in data mining, was used by Takahashi et al. (2006) to identify four mortality groups among intracerebral haemorrhage (ICH) patients. Random forest was used by Rajagopal et al. (2013) for enhancer identification from chromatin state, concluding that random forest is an informative classifier because it can predict enhancers accurately genome-wide based on chromatin modifications. The problems to be solved in this study are to characterize ischemic and hemorrhagic patients and then to classify ischemic and hemorrhagic patients according to modifiable risk factors. The sample consists of ischemic and hemorrhagic patient data from 2009 to 2014, which is expected to represent ischemic and hemorrhagic patients in general. Hospital data in Estonia for cardiovascular disease show 3,245 cases per 100,000 population in 2001, increasing to 3,327 per 100,000 in 2009 (Nichols et al., 2014).
2. Materials and methods
CART (Classification and Regression Trees)
CART, which was developed by Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone around the 1980s, is a method with a nonparametric model approach developed for classification analysis with continuous and categorical response variables (Breiman, 1984). There are three elements in the tree-building concept of the CART method, as follows.
1) Forming the Classification Tree
The process of forming the classification tree consists of three stages:
a) Choosing a Split
In this stage, the data used is a heterogeneous learning (training) sample ℒ. The sample is partitioned based on a classification rule and a goodness-of-split criterion. The heterogeneity of the classes at a given node $t$ of the classification tree is measured with an impurity measure $i(t)$, which helps to find the optimal split. The heterogeneity function generally used is the Gini index, which has the benefit of a simple, relatively fast, and easy calculation and is suitable for application in many cases [6]. The Gini index function is
$$i(t) = \sum_{j \neq i} p(j|t)\, p(i|t),$$
where $p(j|t)$ denotes the proportion of class $j$ at node $t$ and $p(i|t)$ denotes the proportion of class $i$ at node $t$.
Goodness of split is the evaluation of splitting the data by split $s$ at node $t$. The quality of split $s$ in separating the data by class is measured by the decrease in class heterogeneity, defined as
$$\phi(s, t) = \Delta i(s, t) = i(t) - p_L\, i(t_L) - p_R\, i(t_R),$$
where
$i(t)$: heterogeneity function at node $t$,
$p_L$: proportion of observations in the left child node,
$p_R$: proportion of observations in the right child node,
$i(t_L)$: heterogeneity function at the left child node,
$i(t_R)$: heterogeneity function at the right child node.
Random forest of modified risk factor on ischemic and hemorrhagic
Case study: Medicum Clinic, Tallinn, Estonia
The split that produces the highest $\phi(s, t)$ is the best split, because it reduces heterogeneity the most. $t_L$ and $t_R$ are the partition of node $t$ into two disjoint subsets, where $p_L$ and $p_R$ are the proportions of observations falling into each node. Because $t_L \cup t_R = t$, the value $\Delta i(s, t)$ represents the change in heterogeneity at node $t$ induced by split $s$. If a node still contains a mixture of classes, the same procedure is repeated until the classification tree reaches a specified configuration; at each node the chosen split $s^*$ satisfies
$$\Delta i(s^*, t) = \max_{s \in S} \Delta i(s, t).$$
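As an illustration, the Gini impurity and the goodness-of-split search above can be sketched in a few lines of Python. This is a toy sketch, not the authors' implementation; the data values and function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity i(t); equals the sum over j != i of p(j|t) p(i|t)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Goodness of split: phi(s, t) = i(t) - p_L i(t_L) - p_R i(t_R)."""
    n = len(parent)
    return gini(parent) - len(left) / n * gini(left) - len(right) / n * gini(right)

def best_split(xs, labels):
    """Return (threshold, decrease) maximizing Delta i(s, t) over candidate
    thresholds on a single numeric predictor."""
    best_thr, best_dec = None, -1.0
    for thr in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, labels) if x <= thr]
        right = [y for x, y in zip(xs, labels) if x > thr]
        dec = impurity_decrease(labels, left, right)
        if dec > best_dec:
            best_thr, best_dec = thr, dec
    return best_thr, best_dec

# Hypothetical BMI values with stroke-type labels (1 = ischemic, 2 = hemorrhagic)
xs = [20, 22, 24, 31, 33, 35]
ys = [1, 1, 1, 2, 2, 2]
thr, dec = best_split(xs, ys)  # perfect split at 24 reduces impurity 0.5 -> 0.0
```

The split at 24 separates the two classes completely, so the impurity decrease equals the parent impurity.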
b) Determination of Terminal Nodes
A node $t$ is declared terminal, and is not split further, when there is no significant decrease in heterogeneity or when a minimum size limit is reached, for example when there is only one observation in a child node. In general, once the minimum number of cases in a terminal node is reached, tree development stops (Breiman, 1984).
c) Assigning Class Labels
Class labels are assigned to terminal nodes by majority rule:
$$p(j_0|t) = \max_j p(j|t) = \max_j \frac{N_j(t)}{N(t)},$$
where
$N_j(t)$: number of observations of class $j$ at node $t$,
$N(t)$: number of observations at node $t$.
The class label of terminal node $t$ is $j_0$, which gives the smallest estimated misclassification rate at node $t$,
$$r(t) = 1 - \max_j p(j|t).$$
The tree-growing process stops when there is only one observation in each child node (or a minimum size limit is reached), when all observations in a child node are identical, or when a maximum tree depth is reached.
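The majority rule and the node misclassification estimate above can be sketched directly. A minimal sketch; the function name and example labels are hypothetical.

```python
from collections import Counter

def node_label_and_risk(labels):
    """Majority-rule label j0 and node misclassification r(t) = 1 - max_j p(j|t)."""
    j0, n_j0 = Counter(labels).most_common(1)[0]
    return j0, 1.0 - n_j0 / len(labels)

# A terminal node holding three class-1 and one class-2 observations
label, risk = node_label_and_risk([1, 1, 1, 2])  # label 1, r(t) = 0.25
```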
2) Pruning the Classification Tree
Pruning removes unimportant parts of the tree in order to obtain the optimal tree. Minimum cost complexity is the pruning measure used to obtain the optimal tree (Breiman, 1984):
$$R_\alpha(T) = R(T) + \alpha |\tilde{T}|,$$
where
$R(T)$: resubstitution estimate of the error proportion of subtree $T$,
$\alpha$: complexity parameter,
$|\tilde{T}|$: number of terminal nodes of tree $T$.
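The cost-complexity trade-off can be sketched as follows. The subtree error/size pairs and the value of alpha here are hypothetical, chosen only to show how the penalty selects a mid-sized subtree.

```python
def cost_complexity(resub_error, n_terminal, alpha):
    """R_alpha(T) = R(T) + alpha * |T~| (number of terminal nodes)."""
    return resub_error + alpha * n_terminal

# Hypothetical nested subtrees: (resubstitution error R(T), terminal nodes |T~|)
subtrees = [(0.10, 8), (0.14, 4), (0.25, 1)]
alpha = 0.02
best = min(subtrees, key=lambda t: cost_complexity(t[0], t[1], alpha))
# With alpha = 0.02 the 4-leaf subtree wins: 0.14 + 0.02 * 4 = 0.22
```

A larger alpha penalizes tree size more heavily and pushes the choice toward smaller subtrees.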
3) Determining the Optimum Classification Tree
A large classification tree has a high cost complexity because the data structure it describes tends to be complex, so it is necessary to choose an optimal tree of modest size that still provides a sufficiently good estimate. If the resubstitution estimate $R(T)$ is chosen as the best estimator, larger trees tend to be selected, because a larger tree makes the value of $R(T)$ smaller. There are two types of estimators for obtaining the optimal classification tree:
a) Test Sample Estimate
The test sample estimate is used for large data sets. The procedure starts by dividing the cases into two parts, $\mathcal{L}_1$ and $\mathcal{L}_2$. Note that $\mathcal{L}_1$ is used to build the tree $T$ from the learning sample, while $\mathcal{L}_2$ is used to estimate $R^{ts}(T)$; hence the estimator is
$$R^{ts}(T) = \frac{1}{N_2} \sum_{(x_n, j_n) \in \mathcal{L}_2} I\big(d(x_n) \neq j_n\big),$$
where $N_2$ is the number of observations in $\mathcal{L}_2$ and $I(\cdot)$ takes the value 0 if the statement in parentheses is false and 1 if it is true. The optimum classification tree $T^*$ is selected with
$$R^{ts}(T^*) = \min_T R^{ts}(T).$$
b) V-Fold Cross-Validation Estimate
The V-fold cross-validation estimate is used when the number of observations is not very large. The observations in $\mathcal{L}$ are divided randomly into $V$ disjoint parts $\mathcal{L}_v$ of approximately equal size within each class. A tree $T^{(v)}$ is derived from the learning (training) sample $\mathcal{L} - \mathcal{L}_v$, with $v = 1, 2, \ldots, V$. Using $\mathcal{L}_v$ as the test sample, the test sample estimator for $T^{(v)}$ is
$$R^{ts}(T^{(v)}) = \frac{1}{N_v} \sum_{(x_n, j_n) \in \mathcal{L}_v} I\big(d^{(v)}(x_n) \neq j_n\big),$$
where $N_v \approx N/V$ is the number of observations in $\mathcal{L}_v$. The same procedure is then applied using all observations in $\mathcal{L}$, forming a sequence of trees. Hence, the V-fold cross-validation estimator for $T$ is
$$R^{CV}(T) = \frac{1}{V} \sum_{v=1}^{V} R^{ts}(T^{(v)}).$$
The optimum classification tree is $T^*$ with
$$R^{CV}(T^*) = \min_T R^{CV}(T).$$
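The V-fold procedure can be sketched generically: partition the data into V folds, fit on the complement of each fold, and average the fold errors. A toy sketch; the single-threshold "classifier" standing in for a tree, and the data values, are hypothetical.

```python
import random

def v_fold_cv_error(xs, ys, fit, predict, v=5, seed=0):
    """R^CV = (1/V) * sum over v of R^ts(T^(v)): split L into V disjoint folds,
    fit on L - L_v, evaluate the test-sample error on L_v, then average."""
    idx = list(range(len(ys)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::v] for i in range(v)]
    errors = []
    for fold in folds:
        train = [i for i in idx if i not in fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        wrong = sum(1 for i in fold if predict(model, xs[i]) != ys[i])
        errors.append(wrong / len(fold))
    return sum(errors) / v

# Hypothetical stand-in "tree": a single threshold learned on the training part
def fit_threshold(xs, ys):
    return sum(xs) / len(xs)  # split at the training mean

def predict_class(thr, x):
    return 1 if x <= thr else 2

xs = [20, 21, 22, 23, 31, 32, 33, 34, 24, 30]
ys = [1, 1, 1, 1, 2, 2, 2, 2, 1, 2]
err = v_fold_cv_error(xs, ys, fit_threshold, predict_class, v=5)  # 0.0 here
```

Because the two classes are perfectly separated by any training-mean threshold, every fold error is zero in this toy example.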
Random forest methods
An ensemble is a classification method in which individual decisions are combined, by weighting or weighted voting, to classify new examples (Dietterich, 2000). The most important point is that an ensemble has higher accuracy than an individual classifier. One example of an ensemble method is random forest. Random forest is a classification method developed by Breiman. A random forest is a classifier formed from a set of tree-structured classifiers, where each tree is grown from an independent, identically distributed random vector and each tree casts a unit vote for the best class (Breiman, 2001). Bagging is a repeated random resampling process that produces different data sets, each followed by a different tree (Sartono & Syafitri, 2010). Put simply, a random forest can be formed using the CART methodology, growing trees to maximum size without pruning. The advantages of random forest are:
1) It can be used for large-scale data with high accuracy.
2) It can provide estimates of variable importance for classification.
3) It can stabilize the error on imbalanced population data sets.
4) It can generate an unbiased internal estimate of the generalization error as the forest building progresses.
Random forest algorithm
A random forest is built using a training set of size $n$ consisting of $M$ predictor variables. In brief, the random forest algorithm is as follows.
1) Draw a sample from the original data set by bootstrap resampling with replacement.
2) For each bootstrap resample, grow an unpruned classification tree, choosing the best split at each node from $m$ randomly selected predictor variables. The number of random variables can be computed as $m = \log_2 M + 1$, where $M$ is the number of predictors [8], or $m = \sqrt{M}$ (Hastie, 2008).
3) Predict new sample data based on the classification tree.
4) Repeat steps 1 to 3 $k$ times.
5) Combine the estimates from the $k$ trees, using majority vote for classification cases or the average for regression.
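The steps above can be sketched as a miniature random forest. This is a pedagogical sketch, not the authors' implementation: each "tree" is kept to a single randomized split (a stump) for brevity, and the two-predictor data set is hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels, m, rng):
    """One 'tree' kept to a single best split for brevity, chosen over m
    randomly selected predictors (step 2 of the algorithm)."""
    best = None  # (decrease, feature, threshold, left label, right label)
    for f in rng.sample(range(len(rows[0])), m):
        for thr in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= thr]
            right = [y for r, y in zip(rows, labels) if r[f] > thr]
            n = len(labels)
            dec = gini(labels) - len(left) / n * gini(left) - len(right) / n * gini(right)
            if best is None or dec > best[0]:
                best = (dec, f, thr, majority(left), majority(right))
    if best is None:  # degenerate bootstrap sample: the predictor was constant
        best = (0.0, 0, float("inf"), majority(labels), majority(labels))
    return best

def fit_forest(rows, labels, k=25, m=1, seed=0):
    """Steps 1-4: k bootstrap resamples, one randomized stump per resample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(k):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap, with replacement
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], m, rng))
    return forest

def rf_predict(forest, row):
    """Step 5: majority vote over the k trees."""
    return majority([t[3] if row[t[1]] <= t[2] else t[4] for t in forest])

# Hypothetical two-predictor data (e.g. BMI, smoking habit), labels 1/2
rows = [(20, 0), (22, 0), (24, 0), (31, 1), (33, 1), (35, 1)]
labels = [1, 1, 1, 2, 2, 2]
forest = fit_forest(rows, labels, k=25, m=1)
```

Full random forests grow each tree to maximum depth rather than a single split, but the bootstrap-resample, random-subspace, and majority-vote machinery is the same.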
The result is a set of trees of different sizes. The expected result is a set of trees with small between-tree correlation, because small correlation produces a random forest with small variance (Hastie, 2008).
The bootstrap was introduced by Efron to estimate a sampling distribution by a procedure of resampling the original data with replacement (Efron, 1985). The bootstrap algorithm is as follows.
a) Construct the empirical distribution $\hat{F}$ from a sample, giving probability $1/n$ to each $x_i$, $i = 1, 2, \ldots, n$.
b) Draw a random bootstrap sample with replacement from the empirical distribution $\hat{F}$, called the bootstrap sample $x^*$.
c) Compute the statistic $\hat{\theta}$ from the bootstrap sample $x^*$, called $\hat{\theta}^*$.
d) Repeat steps (b) and (c) $B$ times, obtaining $\hat{\theta}^*_1, \hat{\theta}^*_2, \ldots, \hat{\theta}^*_B$.
e) Build a probability distribution from the $\hat{\theta}^*_b$ by giving probability $1/B$ to each of $\hat{\theta}^*_1, \hat{\theta}^*_2, \ldots, \hat{\theta}^*_B$; this is the bootstrap estimator of the sampling distribution of $\hat{\theta}$, called $\hat{F}^*$.
f) The bootstrap estimate is approximated by
$$\hat{\theta}^* = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^*_b.$$
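Steps (a)-(f) can be sketched directly in Python. The sample values and function names below are hypothetical.

```python
import random

def bootstrap_estimate(sample, statistic, B=500, seed=0):
    """Efron's bootstrap (steps a-f): draw B resamples with replacement from
    the empirical distribution (probability 1/n on each x_i), compute the
    statistic on each, and average the B replicates."""
    rng = random.Random(seed)
    n = len(sample)
    thetas = []
    for _ in range(B):
        resample = [sample[rng.randrange(n)] for _ in range(n)]  # step (b)
        thetas.append(statistic(resample))                       # step (c)
    return sum(thetas) / B, thetas  # step (f), plus the bootstrap distribution (e)

sample = [2.1, 2.9, 3.4, 3.8, 4.0, 4.6, 5.2]
mean = lambda xs: sum(xs) / len(xs)
theta_star, dist = bootstrap_estimate(sample, mean)
# theta_star lies close to the sample mean (about 3.71)
```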
Out-of-bag estimates
According to Tibshirani (1996), the out-of-bag (OOB) estimate is a good tool for estimating the error, especially for estimating the variance of classifiers whose decisions change under resampling. Breiman [13] showed empirically that the OOB estimate is needed so that the error estimate is unbiased. OOB data are the data not contained in a bootstrap sample. On average, an observation from the original data will be out of bag in about 37% of the trees, hence each observation in the original data will be predicted by about one-third of the trees (Breiman, 1996). Breiman (2001) showed that the random forest has a bounded error: as the number of trees increases, the error converges. Hence, the higher the number of trees $k$, the closer the error is to convergence, which is why random forest does not overfit.
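The 37% figure follows from the bootstrap itself: the chance an observation is missed by a size-$n$ resample is $(1 - 1/n)^n \to e^{-1} \approx 0.368$. A quick simulation sketch (hypothetical parameters):

```python
import random

def oob_fraction(n, k=200, seed=0):
    """Average fraction of original observations that are out of bag in a
    bootstrap sample of size n; theoretically (1 - 1/n)^n -> e^-1 ~ 0.368."""
    rng = random.Random(seed)
    out = 0
    for _ in range(k):
        drawn = {rng.randrange(n) for _ in range(n)}  # indices that are in-bag
        out += n - len(drawn)                          # everyone else is out of bag
    return out / (k * n)

frac = oob_fraction(n=100, k=200)  # close to the ~37% quoted above
```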
Classification accuracy
A measure of how well a sample classification function will perform on future samples is the Apparent Error Rate (APER). APER is a performance measure that can be calculated for any classification procedure and does not depend on the form of the parent population. It is defined as the fraction of observations in the training set that are misclassified by the classification function (Johnson & Wichern, 1998). When the training data are extremely imbalanced, however, overall classification accuracy is often not an appropriate measure of performance. The ROC (Receiver Operating Characteristic) curve is then preferred and more informative than APER as a measure of classification accuracy (Agresti, 2007). The ROC curve involves sensitivity and specificity: sensitivity is the accuracy on the positive class, while specificity is the accuracy on the negative class, so ROC is a satisfactory accuracy measure for imbalanced data. In the ROC curve, the area under the curve (AUC) reflects the goodness of the classification method: a larger area under the curve shows that the method is capable of measuring sensitivity and specificity well. The G-mean (geometric mean) is an accuracy measure that combines sensitivity and specificity; its value was used by Kubat et al. (1997) to assess the performance of their methods. A strong classification method should be capable of measuring both sensitivity and specificity.
$$\text{APER} = \frac{FN + FP}{TN + FN + FP + TP} \times 100\%$$
$$\text{Sensitivity} = \frac{TP}{TP + FN}$$
$$\text{Specificity} = \frac{TN}{TN + FP}$$
$$\text{G-mean} = \sqrt{\text{Sensitivity} \times \text{Specificity}}$$
Table 1. Apparent error rate table.

                              Predicted membership
                              Positive (1)            Negative (2)
Actual       Positive (1)     TN (True Negative)      FN (False Negative)
membership   Negative (2)     FP (False Positive)     TP (True Positive)

Note: TN = number of observations from class 1 correctly classified as class 1;
FP = number of observations from class 2 incorrectly classified as class 1;
FN = number of observations from class 1 incorrectly classified as class 2;
TP = number of observations from class 2 correctly classified as class 2.
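The four measures above can be computed from the counts of Table 1 as follows. The confusion counts in the example are hypothetical.

```python
import math

def accuracy_metrics(tn, fn, fp, tp):
    """APER, sensitivity, specificity and G-mean from the counts of Table 1."""
    aper = (fn + fp) / (tn + fn + fp + tp) * 100  # percent misclassified
    sensitivity = tp / (tp + fn)                  # accuracy on the positive class
    specificity = tn / (tn + fp)                  # accuracy on the negative class
    g_mean = math.sqrt(sensitivity * specificity)
    return aper, sensitivity, specificity, g_mean

# Hypothetical confusion counts for a 100-patient test set
aper, sens, spec, g = accuracy_metrics(tn=40, fn=10, fp=5, tp=45)
# aper = 15.0; sensitivity = 45/55; specificity = 40/45
```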
Ischemic and hemorrhagic
More than 80% of cardiovascular deaths occur in low- and middle-income countries. It is predicted that in 2030, 23.30 million people will die of cardiovascular disease. In Estonia in 2010, 786.5 deaths per 100,000 population were recorded (European Commission, 2010). Stroke (cerebrovascular disease) is one of the most dangerous types of cardiovascular disease. Deaths due to stroke occur mostly in the Baltic states, Central Europe, and Eastern Europe; it is reported that more than 350 deaths per 100,000 population are caused by this disease (OECD, 2014). Stroke is the second biggest disease in Estonia (Estonian Statistics, 2015).
Based on its cause, stroke is divided into two kinds, ischemic and hemorrhagic. In Europe, 75% of stroke victims are ischemic and 15% are hemorrhagic (European Cardiovascular Statistics, 2012). Ischemic stroke is caused by the clogging of blood flow to the brain, while hemorrhagic stroke occurs when a weakened blood vessel ruptures (European Cardiovascular Statistics, 2012). The brain depends on the arteries to bring fresh blood containing oxygen and nutrients to the brain and to carry away carbon dioxide. If an artery is clogged or ruptured, neurons cannot make enough energy and will eventually stop working, which causes brain cells to die. The following illustration shows ischemic and hemorrhagic stroke.
The risk factors for stroke are divided into two groups: modifiable risk factors and risk factors that cannot be modified. The risk factors that cannot be modified are as follows.
a) Age
Stroke can strike anyone at any age; as people age, the risk of stroke becomes about two times greater.
b) Sex
Women have a bigger chance of having a stroke in any given year than men, especially women in middle age. Women are less aware that they have a higher risk of stroke and have little knowledge about the risk of stroke. Stroke kills twice as many women as breast cancer every year. At a young age, men have a bigger chance of having a stroke.
c) Race and ethnicity
African Americans have about twice the risk of stroke of Asians, Pacific Islanders, and Caucasians, because they have a higher prevalence of high blood pressure, diabetes, and obesity. A person with a family history of stroke has a bigger chance of having a stroke at an early age.
The risk factors that can be modified are as follows.
a) Obesity
Obesity is defined as the accumulation of excess fat that poses a risk to health. Obesity is measured by the Body Mass Index (BMI): a person's weight in kilograms divided by the square of their height in meters (WHO, 2012).
Table 2. Body Mass Index for Europeans.

Status           BMI (kg/m²)
Underweight      < 18.5
Normal weight    18.5 ≤ BMI ≤ 24.99
Overweight       25.0 ≤ BMI ≤ 29.99
Obesity          ≥ 30
b) Dietary habit
A healthy diet decreases the risk of chronic disease, develops immunity, and helps achieve an ideal weight. Choosing food carefully and consuming fruit and vegetables stabilizes calorie intake.
c) Physical activity
Physical activity, or any activity that makes the body move at least once a week, decreases the risk of having a stroke. Regular physical activity increases immunity and can prevent chronic disease.
d) Tobacco use and smoking habit
A smoker has a higher risk of having a stroke than a non-smoker. Smoking can cause blood clotting and coagulation and increases the buildup of plaque in the arteries.
e) Alcohol use
Alcohol users have a higher risk than non-drinkers. Alcohol can increase blood pressure and raise triglyceride (cholesterol) levels, which harden in the arteries.
Data source
This study used data on ischemic and hemorrhagic patients at the Medicum Clinic, Punane 61, Tallinn, Estonia. The sample used in this study consists of 420 patients with diagnosis dates from 2009 until April 2015.
Identification of variables
The variables described comprise 1 response variable and 5 predictor variables. The response variable is the stroke patient type with 2 categories, namely ischemic patients and hemorrhagic patients. Of the predictor variables, 4 variables have a nominal category and 1 variable has a ratio category. Table 3 shows the data set used in the observation.
Table 3. Variable description.

No.  Variable                 Category  Information
1    Y                        Nominal   Response variable: 1 = Ischemic patients, 2 = Hemorrhagic patients
2    X1 (Alcohol habit)       Nominal   Patients with a drinking habit of at least 60 grams within a period of at least 30 days (WHO, 2014): 1 = No alcohol, 2 = Yes
3    X2 (Dietary habit)       Nominal   Weekly consumption of vegetables and fruits, or diet control from a doctor (WHO, 1999): 1 = No, 2 = Yes
4    X3 (Smoking habit)       Nominal   Daily smoking habit continuing for at least 1 year (Estonian Health Interview, 2006): 1 = No, 2 = Yes
5    X4 (Physical activity)   Nominal   Physical activity at least once a week for a minimum of 60 minutes: 1 = No, 2 = Yes
Methods
The main analysis started after data preparation was completed. The main analysis used random forest to identify the important variables that influence ischemic and hemorrhagic patients based on modifiable risk factors. Since the response variable in this study is categorical, random forest was used for classification.
1) Code the data according to the decided categories.
2) Divide the data into two parts, training data and testing data, using training:testing proportions of 75:25, 80:20, 85:15, 90:10, and 95:5.
3) Analyze with random forest:
a) Draw a bootstrap sample with replacement from the original data.
b) For each bootstrap sample, grow a classification or regression tree without pruning the tree.
c) Choose the best split among a random sample of m predictor variables, where M is the number of predictors and m = log₂(M) + 1 or m = √M.
d) Grow the classification tree.
e) Repeat steps (b) to (d) for k replications.
f) Take a majority vote over the k trees for the predicted classification.
g) Compute the misclassification rate on the training data.
h) Compute the misclassification rate on the testing data.
Repeat steps (b) to (h) with other numbers of trees k.
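The data-splitting part of the pipeline (steps 1-2 and the error computations in steps g-h) can be sketched as follows. This is a sketch only: the classifier standing in for the fitted forest simply predicts the training majority class, and the data are hypothetical.

```python
import random
from collections import Counter

def split_data(rows, labels, train_frac, seed=0):
    """Step 2: shuffle and divide the data into training and testing parts,
    e.g. train_frac = 0.75 for the 75:25 combination."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_train = int(round(train_frac * len(idx)))
    take = lambda ids: ([rows[i] for i in ids], [labels[i] for i in ids])
    return take(idx[:n_train]), take(idx[n_train:])

def misclassification(predictor, rows, labels):
    """Steps (g)-(h): fraction of misclassified observations."""
    return sum(1 for r, y in zip(rows, labels) if predictor(r) != y) / len(labels)

# Hypothetical one-predictor data with imbalanced labels (25 vs 15)
rows = [(x,) for x in range(40)]
labels = [1] * 25 + [2] * 15
results = {}
for frac in (0.75, 0.80, 0.85, 0.90, 0.95):
    (xtr, ytr), (xte, yte) = split_data(rows, labels, frac)
    majority_class = Counter(ytr).most_common(1)[0][0]  # stand-in for the forest
    results[frac] = misclassification(lambda r: majority_class, xte, yte)
```

In the real analysis the majority-class stand-in is replaced by a fitted random forest, and the loop over proportions is additionally repeated for different numbers of trees k.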
3. Results and discussion