

Random forest of modified risk factors for ischemic and hemorrhagic stroke

(Case study: Medicum Clinic, Tallinn, Estonia)

Ria Dhea Layla Nur Karisma*a, Alexandr Kormitsõnb, Heri Kuswantoc

a,c Sepuluh Nopember Institute of Technology, Jl. Arief Rahman Hakim, Surabaya 60117, Indonesia; b Tallinn University of Technology, Ehitaja Tee 5, Tallinn 19086, Estonia

Abstract

Estonia is a European Union country whose capital city is Tallinn. It is one of the Baltic states, with a population of 1,312,300, and it faces health problems such as stroke (cerebrovascular disease), the second biggest cardiovascular cause of death. The aim of this study is to classify the modified risk factors of Ischemic and Hemorrhagic patients using an ensemble method, random forest. A random forest is a classifier formed from a set of tree-structured classifiers, where each tree is grown from an independent, identically distributed random vector and each split comes from the best candidate. In general, the method has better accuracy than an individual classifier. The observations are 420 patients (including missing data) from the Medicum clinic, Tallinn, Estonia, and the predictor variables are the modified risk factors of Ischemic and Hemorrhagic patients: alcohol habit, dietary habit, smoking habit, physical activity, and body mass index. The proportion of training to testing data is 85%:15%, preserving the class proportions of the original data set. Bootstrap resampling with replacement was used with 300 replications and 3 randomly selected predictor variables per split, giving a 1.7% misclassification rate. The important modified risk factors are dietary habit and alcohol habit. The variables that influence Ischemic classification are smoking habit, dietary habit, and physical activity, while Hemorrhagic classification is influenced by dietary habit. Because the response variable is imbalanced, appropriate accuracy is measured by sensitivity and specificity. The accuracy of the prediction model is 98.32% and the validation accuracy is 95.23%; sensitivity and specificity are 98.6% and 97.2%, respectively.

Keywords: stroke, ischemic, hemorrhagic, modified risk factor, ensemble method, random forest

1. Introduction

Estonia is a European Union country located in Northern Europe whose capital city is Tallinn. It is one of the Baltic states, together with Latvia and Lithuania. Based on statistics from January 2015, the population of Estonia was 1,312,300, about 3,600 fewer than the year before (Estonian Statistics, 2015). Health problems are among the factors causing the population decrease in Estonia. Cardiovascular disease is one of the major causes of death in Estonia: 786.5 deaths per 100,000 people in 2010 (European Commission, 2010). Stroke is the second biggest cardiovascular cause of death in Estonia (Estonian Statistics, 2015). Based on its cause, stroke is divided into two types, Ischemic and Hemorrhagic. This study discusses the classification of Ischemic and Hemorrhagic stroke based on controlled risk factors.



SWUP

The controlled risk factors used as variables in this case are alcohol habit, smoking habit, physical activity, body mass index, dietary habit, weight, and height. Algorithms such as classification trees and regression trees are used by researchers for classification. CART (Classification and Regression Trees), one of the classification methods in data mining, was used by Takahashi et al. (2006) to identify four mortality groups among intracerebral haemorrhage (ICH) patients. Random forest was used by Rajagopal et al. (2013) for enhancer identification from chromatin state, concluding that random forest is an informative classifier because it can predict enhancers accurately genome-wide based on chromatin modifications. The problems addressed in this case study are to characterize Ischemic and Hemorrhagic patients and then to classify Ischemic and Hemorrhagic patients by their modified risk factors. The data are a sample of Ischemic and Hemorrhagic patients from 2009 to 2014, which is expected to represent such patients. Hospital data in Estonia for cardiovascular disease show 3,245 cases in 2001, increasing to 3,327 per 100,000 population in 2009 (Nichols et al., 2014).

2. Materials and methods

CART (Classification and Regression Tree)

CART, developed by Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone around the 1980s, is a nonparametric method developed for classification analysis with continuous predictors and a categorical response (Breiman, 1984). There are three elements in the concept of tree building in the CART method, as follows.

1) Forming the Classification Tree

The process of forming the classification tree consists of three stages:

a) Choosing the Classifier

In this stage, the data used are heterogeneous learning/training sample data (ℒ). The sample is classified based on a classification rule and a goodness-of-split criterion. The heterogeneity of each class at a given node t of the classification tree is measured with an impurity measure i(t), which helps find the optimal split. The impurity function generally used is the Gini index, which has a simple and relatively fast calculation and is suitable for many cases [6]. The Gini index function is

i(t) = ∑_{i≠j} p(i|t) p(j|t),

where p(j|t) denotes the proportion of class j at node t and p(i|t) denotes the proportion of class i at node t.
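The Gini index above can be sketched in a few lines of Python. This is an illustrative helper, not code from the paper; the function name and example labels are ours.

```python
from collections import Counter

def gini_index(labels):
    """Gini impurity i(t) = sum_{i != j} p(i|t) p(j|t) = 1 - sum_j p(j|t)^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

# A pure node has impurity 0; an evenly mixed binary node has the maximum, 0.5.
print(gini_index(["ischemic"] * 10))                       # 0.0
print(gini_index(["ischemic"] * 5 + ["hemorrhagic"] * 5))  # 0.5
```

The two prints illustrate the extremes the splitter works between: pure nodes are terminal candidates, while mixed nodes invite further splitting.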

Goodness of split is the evaluation of splitting by classifier s at node t. It measures how well classifier s separates the data by class through the decrease in class heterogeneity, defined as

φ(s, t) = Δi(s, t) = i(t) − p_L i(t_L) − p_R i(t_R),

where

i(t) : heterogeneity function at node t
p_L : proportion of observations in the left child node
p_R : proportion of observations in the right child node
i(t_L) : heterogeneity function at the left child node
i(t_R) : heterogeneity function at the right child node


The split that produces a higher φ(s, t) is the best classifier because it reduces heterogeneity most. t_L and t_R are the partition of node t into two disjoint subsets, and p_L and p_R are the proportions of observations sent to each child node. Because t_L ∪ t_R = t, the value Δi(s, t) represents the change in heterogeneity at node t produced by splitter s. If a node is not class-homogeneous, the same procedure is repeated until the classification tree reaches a specified configuration, i.e.,

Δi(s*, t) = max_{s∈S} Δi(s, t).
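The search for s* maximizing Δi(s, t) can be sketched for a single numeric predictor. This is a minimal illustration with hypothetical toy data, not the paper's implementation.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def best_split(x, y):
    """Return the threshold s* maximizing the impurity decrease
    phi(s, t) = i(t) - p_L * i(t_L) - p_R * i(t_R)."""
    parent = gini(y)
    best = (None, 0.0)
    for s in sorted(set(x))[:-1]:                      # candidate split points
        left = [yi for xi, yi in zip(x, y) if xi <= s]
        right = [yi for xi, yi in zip(x, y) if xi > s]
        p_l, p_r = len(left) / len(y), len(right) / len(y)
        decrease = parent - p_l * gini(left) - p_r * gini(right)
        if decrease > best[1]:
            best = (s, decrease)
    return best

# Perfectly separable toy data: splitting at 2 removes all impurity.
s_star, delta = best_split([1, 2, 3, 4], ["a", "a", "b", "b"])
print(s_star, delta)  # 2 0.5
```

The maximal decrease equals the parent impurity exactly when the split yields two pure children, which is the best case the criterion can achieve.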

b) Determination of Terminal Nodes

A node t is declared terminal, and is not split again, when there is no significant decrease in heterogeneity or when a minimum size limit is reached, for example when only one observation remains in the child node. Generally there is a minimum number of cases in a terminal node; once that is reached, tree development stops (Breiman, 1984).

c) Marking the Class Label

The class label of a terminal node is assigned by the majority rule, as follows:

p(j₀|t) = max_j p(j|t) = max_j N_j(t)/N(t),

where

N_j(t) : number of class-j observations at node t
N(t) : number of observations at node t

The class label of terminal node t is j₀, which gives the estimated misclassification at node t of r(t) = 1 − max_j p(j|t). The process of forming the classification tree stops when there is only one observation in each child node or the minimum size limit is reached, when all observations in a child node are identical, or when the maximum tree depth is reached.
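The majority rule and the node misclassification estimate r(t) above can be sketched as follows; the helper name and toy labels are ours, for illustration only.

```python
from collections import Counter

def node_label(labels):
    """Majority rule: label j0 = argmax_j N_j(t) / N(t),
    with node misclassification estimate r(t) = 1 - max_j p(j|t)."""
    n = len(labels)
    (label, count), = Counter(labels).most_common(1)
    return label, 1 - count / n

lbl, r = node_label(["ischemic"] * 8 + ["hemorrhagic"] * 2)
print(lbl, round(r, 2))  # ischemic 0.2
```

A node with 8 of 10 observations in one class is labeled with that class and carries an estimated misclassification of 0.2.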

2) Pruning the Classification Tree

Pruning removes unimportant parts of the tree to obtain the optimal tree. Minimum cost-complexity is the pruning measure used to get the optimal tree (Breiman, 1984):

R_α(T) = R(T) + α|T̃|,

where

R(T) : error proportion of the subtree (resubstitution estimate)
α : complexity parameter
|T̃| : number of terminal nodes of tree T

3) Determining the Optimum Classification Tree

A large classification tree gives high cost complexity because the data structure described tends to be complex, so it is necessary to choose an optimal tree of modest size that still provides a sufficient estimate. If R(T) were chosen as the best estimator, a bigger tree would tend to be chosen, because a bigger tree makes R(T) smaller. There are two estimator types for obtaining the optimal classification tree:

a) Test Sample Estimate

The test sample estimate is used for large data sets. The procedure starts by dividing the cases into two parts, ℒ₁ and ℒ₂. Note that ℒ₁ is used for growing the tree T from the learning sample, while ℒ₂ is used to estimate R^ts(T_t):

R^ts(T_t) = (1/N₂) ∑_{(x_n, j_n)∈ℒ₂} 1(d(x_n) ≠ j_n),



where N₂ is the number of observations in ℒ₂ and the indicator 1(·) is 0 if the statement in parentheses is false and 1 if it is true. The optimum classification tree T* is selected with R^ts(T*) = min_t R^ts(T_t).

b) Cross Validation V-Fold Estimate

The cross-validation V-fold estimate is used when the number of observations is not very large. The observations in ℒ are divided randomly into V disjoint parts of approximately equal size in each class. A tree T^(v) is grown from the learning/training sample ℒ − ℒ_v, for v = 1, 2, …, V. If d^(v)(x) is the resulting classifier, then the test sample estimator for T^(v) is

R^ts(T^(v)) = (1/N_v) ∑_{(x_n, j_n)∈ℒ_v} 1(d^(v)(x_n) ≠ j_n),

with N_v the number of observations in ℒ_v. The same procedure, using all observations in ℒ, forms the sequence of trees T_t. Hence, the V-fold cross-validation estimator for T_t is

R^CV(T_t) = (1/V) ∑_{v=1}^{V} R^ts(T_t^(v)).

The optimum classification tree is T* with R^CV(T*) = min_t R^CV(T_t).
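The V-fold estimate can be sketched generically. The fold construction below is standard; the "classifier" is a deliberately trivial majority-class model of our own, standing in for the tree grown on each ℒ − ℒ_v.

```python
import random

def v_fold_cv_error(x, y, fit, predict, v=5, seed=0):
    """R^CV = (1/V) * sum_v R^ts(T^(v)): average test error over V disjoint folds."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::v] for i in range(v)]            # V roughly equal disjoint parts
    errors = []
    for fold in folds:
        train = [i for i in idx if i not in fold]    # learning sample L - L_v
        model = fit([x[i] for i in train], [y[i] for i in train])
        wrong = sum(predict(model, x[i]) != y[i] for i in fold)
        errors.append(wrong / len(fold))             # R^ts on held-out fold L_v
    return sum(errors) / v

# Toy classifier: always predict the majority class of the training labels.
fit = lambda xs, ys: max(set(ys), key=ys.count)
predict = lambda model, xi: model

x = list(range(20))
y = ["a"] * 15 + ["b"] * 5
print(v_fold_cv_error(x, y, fit, predict))  # 0.25
```

Because the majority classifier always predicts "a", the cross-validated error is exactly the minority-class proportion, 5/20 = 0.25, regardless of how the folds fall.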

Random forest methods

An ensemble classifies a new example by combining individual decisions with weighted voting (Dietterich, 2000). The key point is that an ensemble has higher accuracy than an individual classifier. One example of an ensemble method is random forest, a classification method developed by Breiman. A random forest is a classifier formed from a set of tree-structured classifiers, where each tree is grown from an independent, identically distributed random vector and each split comes from the best candidate (Breiman, 2001). Bagging is a repeated random resampling process that produces different data sets, each followed by a different tree (Sartono & Syafitri, 2010). Simply put, a random forest can be formed using the CART methodology, growing trees to maximum size without pruning. The advantages of random forest are:

1) It can be used for large-scale data with high accuracy.
2) It can estimate variable importance in classification.
3) It can stabilize the error on imbalanced population data sets.
4) It can generate an unbiased internal error estimate as the forest building progresses.

Random forest algorithm

A random forest is built using a training set of size n consisting of p predictor variables. Simply put, the random forest algorithm is as follows.

1) Draw a sample from the original data set by bootstrap resampling with replacement.
2) For each bootstrap resample, grow an un-pruned classification tree, choosing the best split among m randomly selected predictor variables. The number of random variables is computed by m = log₂(p) + 1, where p is the number of predictors [8], or m = √p (Hastie, 2008).
3) Predict new sample data based on the classification tree.
4) Repeat stages 1 to 3 u times.



5) Combine the estimates from the u trees, using a majority vote for classification or the average for regression.

The result is a set of trees of different sizes. The expected result is a set of trees with small correlation between trees, because small correlation produces small variance in the random forest (Hastie, 2008).
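The algorithm above can be sketched from scratch. To keep the sketch short, each tree is reduced to a one-split "stump" rather than a full un-pruned CART tree; the function names and toy data are ours, and the sketch only illustrates the bootstrap, the random predictor subset of size m, and the majority vote.

```python
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def majority(labels, fallback):
    return Counter(labels).most_common(1)[0][0] if labels else fallback

def fit_stump(X, y, feat_ids):
    """One-split tree: the best (feature, threshold) among a random feature subset."""
    best = None
    for f in feat_ids:
        for s in {row[f] for row in X}:
            left = [yi for row, yi in zip(X, y) if row[f] <= s]
            right = [yi for row, yi in zip(X, y) if row[f] > s]
            score = len(left) / len(y) * gini(left) + len(right) / len(y) * gini(right)
            if best is None or score < best[0]:
                best = (score, f, s, majority(left, y[0]), majority(right, y[0]))
    return best[1:]  # (feature, threshold, left label, right label)

def fit_forest(X, y, u=25, m=1, seed=0):
    rng, n, p = random.Random(seed), len(y), len(X[0])
    forest = []
    for _ in range(u):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap with replacement
        feats = rng.sample(range(p), m)              # random predictor subset of size m
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict(forest, row):
    votes = [l if row[f] <= s else r for f, s, l, r in forest]
    return Counter(votes).most_common(1)[0][0]       # majority vote over u trees

# Toy data in which both (hypothetical) features separate the two classes.
X = [[0, 0], [1, 1], [2, 1], [8, 9], [9, 8], [10, 10]]
y = ["low", "low", "low", "high", "high", "high"]
forest = fit_forest(X, y)
print(predict(forest, [0, 0]), predict(forest, [10, 10]))  # low high
```

Even though each individual stump sees only a bootstrap sample and one random feature, the vote over 25 stumps recovers the correct labels, which is the ensemble effect described above.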

The bootstrap was introduced by Efron for estimating a sampling distribution by a resampling procedure with replacement from the original data (Efron, 1985). The bootstrap algorithm is as follows.

a) Construct the empirical distribution F̂ from a sample, giving probability 1/n to each x_i, i = 1, 2, …, n.
b) Draw a random bootstrap sample with replacement from the empirical distribution F̂, called x*.
c) Compute the statistic θ̂ from the bootstrap sample x*, called θ̂*.
d) Repeat steps b) and c) B times to obtain θ̂*₁, θ̂*₂, …, θ̂*_B.
e) Build a probability distribution from the θ̂*_b by giving probability 1/B to each; this is the bootstrap estimator of the sampling distribution of θ̂, called F̂*.
f) The bootstrap point estimate is θ̂* = (1/B) ∑_{b=1}^{B} θ̂*_b.
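Steps a) to f) can be sketched directly; the data values below are a hypothetical small sample invented for illustration, not the paper's data.

```python
import random

def bootstrap_estimate(data, stat, B=2000, seed=0):
    """Efron's bootstrap: resample with replacement B times, compute the
    statistic on each resample, and average the B replicates."""
    rng = random.Random(seed)
    n = len(data)
    reps = [stat([data[rng.randrange(n)] for _ in range(n)]) for _ in range(B)]
    return sum(reps) / B, reps

mean = lambda xs: sum(xs) / len(xs)
data = [22.9, 30.1, 41.5, 52.5, 60.0, 86.2]   # hypothetical small sample of BMI values
theta_star, reps = bootstrap_estimate(data, mean)
print(round(theta_star, 1))   # close to the sample mean of 48.9
```

The list `reps` is the bootstrap sampling distribution F̂*; its spread can also be used to estimate the standard error of the statistic.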

Out of bag estimates

According to Tibshirani (1996), the out-of-bag (OOB) estimate is a good tool for estimating error, especially for estimating the variance of unstable classification decisions. Breiman [13] showed empirically that the OOB estimate is needed so that the error estimate is unbiased. OOB data are the data not contained in a bootstrap sample. On average, each original observation is out-of-bag for about 37% of the trees, hence each original observation is predicted by about one-third of the trees (Breiman, 1996). Breiman (2001) showed that the random forest error is bounded: as the number of trees increases, the error converges. Hence, with a higher number of trees (u), the error converges, which is why random forest does not overfit.
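The ~37% figure comes from the probability that an observation is never drawn in a bootstrap sample of size n, (1 − 1/n)^n → e⁻¹ ≈ 0.368. A quick check of both the formula and one simulated bootstrap draw (our illustration):

```python
import random

n = 1000
rng = random.Random(0)
# Indices drawn into one bootstrap sample of size n; the rest are out-of-bag.
in_bag = {rng.randrange(n) for _ in range(n)}
oob_fraction = 1 - len(in_bag) / n

print(round((1 - 1 / n) ** n, 3))   # theoretical OOB probability: 0.368 (about e^-1)
print(round(oob_fraction, 2))       # empirical fraction, close to 0.37
```

Averaged over many bootstrap draws, the empirical OOB fraction settles on the theoretical value, matching the one-third figure cited from Breiman (1996).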

Classification accuracy

One method used to assess how a sample classification function will perform on future samples is the Apparent Error Rate (APER). APER is a performance measure that can be calculated for any classification procedure and does not depend on the form of the parent population; it is defined as the fraction of observations in the training set that are misclassified by the classification function (Johnson & Wichern, 1998). When the training data are extremely imbalanced, overall classification accuracy is often not an appropriate performance measure. The ROC (Receiver Operating Characteristic) curve is considered more informative than APER as a measure of classification accuracy (Agresti, 2007). The ROC curve involves sensitivity and specificity: sensitivity is the accuracy on the positive class, and specificity is the accuracy on the negative class, so ROC is a satisfactory accuracy measure for imbalanced data. The area under the ROC curve (AUC) reflects the goodness of the classification method: a larger area under the curve shows that the method measures sensitivity and specificity well. The G-mean (geometric mean) is an accuracy measure combining sensitivity and specificity; its value was used by Kubat et al. (1997) to assess the performance of their methods. A strong classification



method should be capable of measuring sensitivity and specificity.

1 − APER (%) = (TP + TN) / (TP + TN + FP + FN) × 100%

Sensitivity = TP / (TP + FN)

Specificity = TN / (TN + FP)

G-mean = √(Sensitivity × Specificity)

Table 1. Apparent error rate table.

                        Predicted membership
Actual membership       Positive (1)            Negative (2)
Positive (1)            TN (True Negative)      FN (False Negative)
Negative (2)            FP (False Positive)     TP (True Positive)

Note:
TN = number of observations from class 1 correctly classified as class 1
FP = number of observations from class 2 incorrectly classified as class 1
FN = number of observations from class 1 incorrectly classified as class 2
TP = number of observations from class 2 correctly classified as class 2
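The four measures above can be computed from a 2×2 table in a few lines. The counts below are hypothetical, chosen only so the numbers are easy to check; the function name is ours.

```python
from math import sqrt

def classification_metrics(tp, tn, fp, fn):
    """Accuracy (1 - APER), sensitivity, specificity, and G-mean from a 2x2 table."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # accuracy on the positive class
        "specificity": tn / (tn + fp),   # accuracy on the negative class
        "g_mean": sqrt(tp / (tp + fn) * (tn / (tn + fp))),
    }

# Hypothetical counts for illustration.
m = classification_metrics(tp=90, tn=80, fp=20, fn=10)
print({k: round(v, 3) for k, v in m.items()})
# {'accuracy': 0.85, 'sensitivity': 0.9, 'specificity': 0.8, 'g_mean': 0.849}
```

Note how a single overall accuracy of 0.85 hides the asymmetry between the two classes that sensitivity (0.9) and specificity (0.8) expose, which is exactly the point made about imbalanced data.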

Ischemic and hemorrhagic

More than 80% of cardiovascular deaths occur in low- and middle-income countries. It is predicted that in 2030, 23.30 million people will die of cardiovascular disease. In Estonia in 2010, there were 786.5 deaths per 100,000 population (European Commission, 2010). Stroke (cerebrovascular disease) is one of the most dangerous types of cardiovascular disease. Deaths due to stroke mostly occur in the Baltic states, Central Europe, and Eastern Europe; more than 350 deaths per 100,000 population are reported to be caused by this disease (OECD, 2014). Stroke is the second biggest cause of death in Estonia (Estonian Statistics, 2015).

Based on its cause, stroke is divided into two kinds: Ischemic and Hemorrhagic. In Europe, 75% of stroke victims are Ischemic and 15% Hemorrhagic (European Cardiovascular Statistics, 2012). Ischemic stroke is caused by blood clots blocking flow to the brain, while hemorrhagic stroke occurs when a weakened blood vessel ruptures (European Cardiovascular Statistics, 2012). The brain depends on the arteries to bring fresh blood containing oxygen and nutrients to the brain and to carry away carbon dioxide. If an artery is clogged or broken, neurons cannot make enough energy and eventually stop working, which causes brain cells to die. The following illustration shows ischemic and hemorrhagic stroke.

The risk factors for stroke are divided into two groups: modifiable risk factors and risk factors that cannot be modified. The risk factors that cannot be modified are as follows.

a) Age
Stroke can affect anyone at any age; with aging, the risk of stroke becomes two times greater.

b) Sex
Women have a bigger chance of having a stroke each year than men, especially women in middle age. Women are less aware that they have a higher risk of stroke and have



little knowledge about the risk of stroke. Stroke kills twice as many women as breast cancer every year. At a young age, men have a bigger chance of having a stroke.

c) Race and Ethnicity
African Americans have twice the stroke risk of Asians/Pacific Islanders and Caucasians, because of higher rates of high blood pressure, diabetes, and obesity. A person with stroke in the family history has a bigger chance of having a stroke at an early age.

The risk factors that can be modified are as follows.

a) Obesity
Obesity is an accumulation of excess fat that is a risk factor for health. Obesity is measured by the Body Mass Index (BMI): a person's weight (in kilograms) divided by the square of their height (in meters) (WHO, 2012).

Table 2. Body mass index for Europeans.

Status           BMI (kg/m2)
Underweight      < 18.5
Normal weight    18.5–24.99
Overweight       25.0–29.99
Obesity          ≥ 30

b) Dietary Habit
A healthy diet decreases chronic disease, builds immunity, and helps reach an ideal weight. Choosing food carefully and consuming fruit and vegetables stabilizes calorie intake.

c) Physical Activity
Physical activity, or any activity that makes the body move at least once a week, decreases the risk of having a stroke. Regular physical activity increases immunity and can prevent chronic disease.

d) Tobacco Use and Smoking Habit
A smoker has a higher risk of having a stroke than a non-smoker. Smoking can cause blood clotting and coagulation and increase the buildup of plaque in the arteries.

e) Alcohol Use
Alcohol users have a higher risk than non-alcoholic persons. Alcohol can increase blood pressure and increase triglyceride (cholesterol) levels, which harden the arteries.
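The BMI definition and the Table 2 cut-offs above can be sketched as follows; the function names and example weight/height are ours, for illustration.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight (kg) divided by the square of height (m)."""
    return weight_kg / height_m ** 2

def bmi_status(value):
    """Categories from Table 2 (European cut-offs)."""
    if value < 18.5:
        return "Underweight"
    if value < 25.0:
        return "Normal weight"
    if value < 30.0:
        return "Overweight"
    return "Obesity"

print(round(bmi(80, 1.75), 2), bmi_status(bmi(80, 1.75)))  # 26.12 Overweight
```

This is the categorization implicitly applied to the BMI predictor (X variable on a ratio scale) described in the data section below.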

Data source

The data in this study are Ischemic and Hemorrhagic patients at the Medicum Clinic, Punane 61, Tallinn, Estonia. The sample comprises 420 patients with diagnosis dates from 2009 until April 2015.

Identification of variables

The description uses 1 response variable and 5 predictor variables. The response variable is stroke patients with 2 categories: Ischemic patients and Hemorrhagic patients. Four of the predictor variables are nominal and 1 is on a ratio scale. Table 3 shows the data set in the observation.



Table 3. Variable description.

Number  Variable  Category  Information
1       Y         Nominal   Response variable. 1: Ischemic patients; 2: Hemorrhagic patients
2       X1        Nominal   Alcohol habit: drinking at least 60 grams within a period of at least 30 days (WHO, 2014). 1: No alcohol; 2: Yes
3       X2        Nominal   Dietary habit: weekly consumption of vegetables and fruits (diet control from doctor) (WHO, 1999). 1: No; 2: Yes
4       X3        Nominal   Smoking habit: daily smoking over 1 year (Estonian Health Interview, 2006). 1: No; 2: Yes
5       X4        Nominal   Physical activity: at least 1 time (minimum 60 minutes) per week. 1: No; 2: Yes

Methods

The main analysis starts when data preparation is complete. The main analysis uses random forest to find the important variables that influence Ischemic and Hemorrhagic patients based on modified risk factors. The response variable in this study is categorical, hence the random forest is used for classification.

1) Code the data based on the decided categories.
2) Divide the data into two parts, training data and testing data, with proportion combinations of 75%:25%, 80%:20%, 85%:15%, 90%:10%, and 95%:5%.
3) Analyze with random forest:
a) Draw a bootstrap sample with replacement from the original data.
b) On each bootstrap sample, grow a classification or regression tree without pruning.
c) Choose the best split among m randomly sampled predictor variables, with m = log₂(p) + 1 or m = √p, where p is the number of predictors.
d) Grow the classification tree.
e) Repeat steps b) to d) for u replications.
f) Take a majority vote over the u trees for the predicted classification.
g) Compute the misclassification on the training data.
h) Compute the misclassification on the testing data.

Repeat b) to h) with other numbers of trees (u).
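Step 2, the train/test division, can be sketched as follows; the class counts mirror the data set (344 Ischemic, 76 Hemorrhagic), but the labels are synthetic stand-ins and the helper name is ours.

```python
import random

def split(data, train_frac, seed=0):
    """Randomly divide data into training and testing parts with the given proportion."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    cut = round(train_frac * len(data))
    return [data[i] for i in idx[:cut]], [data[i] for i in idx[cut:]]

# Synthetic labels standing in for the 420-patient data set.
data = ["ischemic"] * 344 + ["hemorrhagic"] * 76

for train_frac in (0.75, 0.80, 0.85, 0.90, 0.95):
    train, test = split(data, train_frac)
    pct = round(train_frac * 100)
    # ... here one would fit the random forest on `train` and compute the
    # misclassification on both `train` and `test` for this combination.
    print(f"{pct}%:{100 - pct}% -> {len(train)} training, {len(test)} testing")
```

The 85%:15% combination yields 357 training and 63 testing observations, the same training size that appears in Table 12.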

3. Results and discussion

Descriptive statistics

Table 4 presents the data on alcohol habit for Ischemic and Hemorrhagic patients. 254 Ischemic patients do not consume alcohol, while 90 Ischemic patients do. The number of Hemorrhagic patients who do



not consume alcohol is 17, while 59 do. The total number of Ischemic patients in the alcohol-habit table is 344, and the total number of Hemorrhagic patients is 76.

Table 4. Ischemic and hemorrhagic patients based on alcohol habit.

Diagnosis      Alcohol habit        Total
               No       Yes
Ischemic       254      90          344
Hemorrhagic    17       59          76
Total          271      149         420

The second modified risk factor is dietary habit, meaning that Ischemic and Hemorrhagic patients received a dietary suggestion from a doctor because of their disease. The dietary suggestion variable is shown in Table 5.

Table 5. Ischemic and hemorrhagic patients based on dietary habit.

Diagnosis      Diet habit           Total
               No       Yes
Ischemic       211      133         344
Hemorrhagic    71       5           76
Total          282      138         420

211 Ischemic patients do not have a dietary habit, while 133 do; the total number of Ischemic patients, after imputation, is 344. Among Hemorrhagic patients, 71 do not have a dietary habit and 5 do; the total number of Hemorrhagic patients, after imputation, is 76.

The smoking habit of Ischemic and Hemorrhagic patients, after imputation, is presented in Table 6.

Table 6. Ischemic and hemorrhagic patients based on smoking habit.

Diagnosis      Smoking habit        Total
               No       Yes
Ischemic       198      146         344
Hemorrhagic    44       32          76
Total          242      178         420

146 Ischemic patients have a smoking habit; more Ischemic patients do not smoke than smoke. The total number of Hemorrhagic patients in the smoking-habit table is 76, of whom 32 have a smoking habit. In both groups, more patients do not smoke than smoke.

Regular physical activity is presented in Table 7. 266 Ischemic patients have regular physical activity, while 78 do not. The total number of Ischemic patients, after multiple imputation, is 344.



53 Hemorrhagic patients have regular physical activity, while 23 do not. The total number of Hemorrhagic patients, after multiple imputation, is 76. In both groups, more patients have regular physical activity than not.

Table 7. Ischemic and hemorrhagic patients based on physical activity.

Diagnosis      Physical activity    Total
               No       Yes
Ischemic       78       266         344
Hemorrhagic    23       53          76
Total          101      319         420

Body mass index is presented in Table 8. The minimum body mass index of patients in the Medicum Clinic is 22.9 kg/m2, so some of them have normal weight after multiple imputation. On average, however, the patients are obese or overweight, with a mean of 52.53 kg/m2, and the maximum body mass index is 86.21 kg/m2, indicating severe overweight.

Table 8. Body mass index (kg/m2) of ischemic and hemorrhagic patients.

Min.    Mean     Max.
22.9    52.53    86.21

CART

The classification method in this study is random forest, an ensemble method whose advantage is increasing the accuracy of a single unstable classifier by combining classifiers of an identical method (CART) through a voting process. Hence, the random forest classification result comes from repeated CART runs. CART is a machine learning method, so the data must be divided into two parts: training data, used for modeling, and testing data, used for model validation. This study tried several training:testing combinations: 95%:5%, 90%:10%, and 85%:15%. The best combination has high values for both training and testing while keeping the two balanced.

The best combination is 85% training data and 15% testing data; it has the best and most balanced values compared with the others. Therefore, the random forest analysis uses the 85%:15% combination.

Table 9. Combinations of training and testing data.

Combination (%)           Classification accuracy (%)
Training    Testing       Training    Testing
95          5             94.7        100
90          10            94.7        97.6
85          15            96.10       98.4

Table 10 shows that all predictor variables contributed to the classification tree. Based on the predictor variable scores, dietary habit is the important variable and the main splitter in the classification, because it has the highest score, 100%. The other contributing variables range from alcohol habit at 48.58% down to physical activity at 22.80%.



Table 10. Predictor variable scores in the maximum classification tree.

Variable            Score (%)
Diet habit          100
Alcohol habit       48.58
Smoking habit       38.98
Body mass index     32.12
Physical activity   22.80

Figure 1 presents the maximum classification tree grown from the observed data based on the chosen important variables. The maximum classification tree is deep and relatively large.

Figure 1. Maximum classification tree.

A terminal node is the endpoint of the splitter selection process and cannot be selected in the next process; it is a node with homogeneous observations that is finally assigned to a specific class. This study has 7 terminal nodes (see Figure 2): 4 terminal nodes specific to Ischemic patients and 3 specific to Hemorrhagic patients.

Figure 2. Optimum classification tree for ischemic and hemorrhagic patients.

Random forest analysis

Following the random forest steps described in the methods, the value of m is determined by log₂(p) + 1 or √p, where p is the total number of predictor variables in this observation. The total number of

[Figure 2 detail: terminal nodes 1–7 with N = 159, 38, 10, 13, 7, 80, and 50; internal nodes split on ALCH_HBT (N = 357), SMK_HBT (N = 227), DIET_HBT (N = 68 and N = 130), PHSY_ACT (N = 30), and BMI (N = 23).]



predictor variables is 5, so the values m = 2 and m = 3 were used. The next step is determining the value of u, the number of replications. Generally, u = 50 gives a satisfactory classification (Agresti, 2007), but Sutton [25] suggested that u ≥ 100 is more satisfactory than u = 50, because the misclassification becomes constant. Therefore, u in this study is 50, 100, and 200. The expectation is that m should give small correlation between trees, because correlation is connected with the strength (s) of each tree (Tibshirani, 1996): if s is large, the predicted accuracy is higher. Small correlation gives small variance in the random forest; Hastie et al. (2008) suggested trying √p. Hence, m plays an important role in determining the best random forest.
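The two candidate values of m used here follow directly from the formulas in the methods; taking the floor of the non-integer results is our assumption about how they were rounded.

```python
from math import floor, log2, sqrt

p = 5  # number of predictor variables in this study

m_breiman = floor(log2(p) + 1)  # log2(p) + 1 rule
m_hastie = floor(sqrt(p))       # sqrt(p) rule

print(m_breiman, m_hastie)  # 3 2
```

These are exactly the m = 3 and m = 2 values compared in Table 11.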

Table 11 shows the misclassification level by u and m. The misclassification is small when m = 3: with u = 50 it is already 1.9%, and it reaches its minimum of 1.7% with m = 3 and u = 200, so m = 3 gives the optimum value. Hence, the classification trees are grown with m = 3 and u = 200.

Table 11. Misclassification level by u and m.

m     u     Misclassification (%)
2     50    2.5
2     100   2.5
2     200   2.2
3     50    1.9
3     100   2.2
3*    200   1.7
3     300   1.7

* used in this case

The out-of-bag (OOB) estimate from the sample is 3.36%, meaning the proportion of original observations misclassified by the random forest OOB predictions is about 2.8%. In other words, it estimates the generalization error for arbitrary classifiers. The OOB error is good for optimizing m; its value is constant during forest growing when m = 3. Hence, the OOB error is a suitable error estimator. Using 200 trees is appropriate for this study: it affects the correlation between trees and the strength of each tree, and the error becomes constant when m = 3. Based on Table 11, the random forest run with m = 3 and u = 200 gives the smallest misclassification and a constant OOB error.

Figure 3. Out-of-bag (OOB) estimate for different levels of m.

Figure 4 shows the important variables of the study in the Medicum clinic based on the Gini index. The Gini index characteristically separates the class that has the biggest share of observations, or the important



class in the node. The classes are the categories of the response variable: Ischemic patients and Hemorrhagic patients. The important variables are dietary habit and alcohol habit: dietary habit has a mean Gini index of 42.6%, while alcohol habit has 31.6%. In the random forest analysis, the parent node is alcohol habit, the variable with the biggest probability as a splitter.

Figure 4. Variable importance for ischemic and hemorrhagic patients in the Medicum clinic.

The other variables, smoking habit, physical activity, and body mass index, have small Gini indices. Figure 4 plots the Gini index values in decreasing order, and Figure 5 shows the resulting classification tree from the random forest: the variables classifying Ischemic patients are smoking habit, dietary habit, and physical activity, while dietary habit classifies Hemorrhagic patients. The main splitter is alcohol habit.

Figure 5. Classification tree from the random forest for ischemic and hemorrhagic patients.

Body mass index has a Gini index of 11.59%, smoking habit 9.47%, and physical activity 4.97%. These variables are less important in this study, but that does not mean they are unimportant.

The main splitter of the random forest for the Ischemic and Hemorrhagic patient data is alcohol habit. Smoking habit, dietary habit, and physical activity classify Ischemic patients, while dietary habit classifies Hemorrhagic patients. The training and testing proportions used in this



(14)

SWUP

research are 85%:15%, then Table 12 presents accuracy in training data. Generally, accuracy prediction formed from 1-APER, whether 1-APER has higher value that means prediction accuracy more satisfying. However, in this research data Ischemic and Hemorrhagic patient have imbalance data then we are considered for appropriate accuracy that showed by sensitivity and specificity. Whether sensitivity has higher value it means that prediction accuracy in Ischemic class is satisfied, while specificity has higher value it means that prediction accuracy in Hemorrhagic class is satisfied. These values summarized in G-mean and spacious area under ROC curve (AUC), weather it has higher value it means that they are satisfied.

Table 12. Combination training data (85%:15%).

Actual class Prediction class Total APER (%) 1-APER (%) Ischemic Hemorrhagic

Training data

Ischemic 292 4 296

1.8 98.32

Hemorrhagic 2 59 61

Total 292 65 357

Sensitivity =/0/"01 = 22=0.986

Specificity =/ "/10=ò2ò"2 = 0.972

G− mean = Ý , ¹ × ¹ × ,- -¹ =√0.986 × 0.972 = 0.965

Table 12 presents classification accuracy tree in training data, it has 98.32% that means tree classification that formed in observation as appropriate 98.32%. It value is showing that accuracy in random forest is satisfying along misclassification accuracy 1.8%. Then, sensitivity in random forest has 0.986 it means that 98.6% data Ischemic patient appropriate in classified as Ischemic. Specificity is 0.972, it value is showing that 97.2% data Hemorrhagic patient appropriate in classified as Hemorrhagic. G-mean can be measured for balance prediction in each class, because classification method tend appropriated class that has more sample data but it is dissatisfactory for a bit sample data. In this research, G-mean value has 96.5% hence, random forest method is satisfying to predict sample data.

We can see the ability classification method to predict imbalance data using spacious area under ROC (AUC) whether it has spacious area that mean classification method has satisfied to predict imbalance data. Figure 6 presents ROC curve whereas it has almost perfect rate 93.5% in true positive rate and 6.5% false positive rate. Perfect occurs at the upper left-hand corner of the graph hence it is closer in the graph comes to it corner, then it gives that better at classification. The diagonal line represents random guess, hence the distance of the graph over the diagonal line has characterized how much better random guess that we did.

Table 13 presents validation to determine the tree classification is proper or not. Validation has done with filling in 63 testing data on the model classification tree from combination that we did. Accuracy classification tree in testing data is 95.23% that means it formed in observation as appropriate 95.23%. It value represents that accuracy in random forest is satisfying along misclassification accuracy 4.77%.


(15)

Figure 6. ROC curve.

Table 13. Combination testing data (85%:15%).

Actual class Prediction class Total APER (%) 1-APER (%) Ischemic Hemorrhagic

Testing data

Ischemic 50 3 53

4.77 95.23

Hemorrhagic 0 10 10

Total 50 13 63

4.

Conclusion and remarks

Classification in Ischemic patient is smoking habit, diet habit, and physical activity meanwhile, classification in Hemorrhagic patient is diet habit. Accuracy tree is 98.32%, it showed tree classification that formed in observation as appropriate 98.32% with sensitivity 0.986 then 98.6% data Ischemic patient appropriate in classified as Ischemic. Specificity is 0.972, it value showed that 97.2% data Hemorrhagic patient appropriate in classified as Hemorrhagic.

Acknowledgement

We thank to Medicum Clinic, Tallinn, Estonia for the kindness information. Kerti Sönmez as International Office and Siyi Ma as Erasmus+ coordinator Tallinn University of Technology, Tallinn, Estoni.

References

Agresti, A. (2007). An introduction to categorical data analysis. John Wiley and Sons: New York. Breiman, L. (1996). Out of bag estimate [Online]. Available:

https://www.stat.berkeley.edu/~breiman/OOBestimation.pdf Breiman, L. (2001). Random forest. Machine Learning, 45, 5-32

Breiman, L., Friedman, J.H., Olshen R.A., & Stone, C.J. (1984). Classification and regression tree. New York, NY: Chapman and Hall

Dietterich, T.G. (2000) Ensemble methods in machine learning. USA: Oregon State University

Efron, B., & Tibsihirani, R. (1985). The bootstrap method for assessing statistical accuracy. California: Stanford University


(16)

SWUP

Estonian Statistics (2015). Population decline is slowing down [Online]. Available: www.stat.ee/population

European Cardiovascular Statistics. (2012). European cardiovascular disease [Online]. Available: http://www.escardio.org/about/documents/eu-cardiovascular-disease-statistics-2012.pdf European Commission (2010). Eurostat: statistics explained [Online]. Available:

http://ec.europa.eu/eurostat/statistics-explained/index.php

Hestie, T.J., Tibshirani, R.J., & Friedman, J.H. (2008). The elements of statistical learning: Data-mining, inference and prediction. Second Edition. New York: Springer-Verlag

Johnson, N., & Wichern, D., (1998). Applied multivariate statistical analysis. Prentice-Hall, Englewood Cliffs, N.J.

Kubat, M., Holte, R., & Matwin, S. (1997). Adaptive fraud detection. Data Mining and Knowledge Discovery, 1, 291–316.

Nichols, M., Townsend, N., Scarborough, P., & Rayner, M. (2014). Cardiovascular diseases in Europe 2014: Epidemiological update. European Heart Journal, DOI: 10.1093/eurheartj/ehu299

OECD. (2014). Health at a glance: Europe 2014. ISSN: 2305-6088.

Rajagopal, N., Xie, W., Li, Y., Wagner U., Wang, W., Stamatoyannopoulos, J., Ernst, J., Kellis, M., & Ren, B. (2013). A random forest based algorithm for enhancer identification from chromatin state. doi:10.1371/journal.pcbi.1002968

Sartono, B., and Syafitri, D.U., (2010). Ensemble tree: An alternative toward simple classification and regression tree. ISSN: 0853-8115

Takahashi, O., Cook, E.F., Nakamura, T., Saito, J., Ikawa, F., & Fukui, T. (2006). Risk stratification for in-hospital mortality in spontaneous intracerebral hemorrhage: A Classification and Regression Tree Analysis. Oxford University Press. doi: 10.1093/qjmed/hcl107

Tibshirani, R. (1996). Bias, variance, and prediction error for classification rules, Technical Report, Statistics Department, University of Toronto

Truelsen, T., Begg, S., & Mathers, C.D. (2000). The global burden of cerebrovascular disease. GBD 200 Working Paper, World Health Organization, Geneva. [Online]. Available: http://www.who.int/healthinfo /statistics/bod_cerebrovasculardiseasestroke.pdf

WHO (World Health Organization). (2012). Modified and unmodified risk factors [Online]. Available: http://www.who.int/


Table 10. Scores of the predictor variables in the maximum classification tree.

Variable            Score (%)
Diet habit          100
Alcohol habit       48.58
Smoking habit       38.98
Body mass index     32.12
Physical activity   22.80

Figure 1 presents the maximum classification tree grown from the observed data using the selected important variables. The maximum tree is deep and relatively large.

Figure 1. Maximum classification tree.

A terminal node is the endpoint of the splitting process and is not considered for further splits. A terminal node therefore contains homogeneous observations and is finally assigned to a specific class. This study has 7 terminal nodes (see Figure 2): 4 terminal nodes specific to Ischemic patients and 3 specific to Hemorrhagic patients.

Figure 2. Optimum classification tree for ischemic and hemorrhagic patients.

Random forest analysis

Following the random forest steps in Section 3, the value of m is determined by m = log₂(p) + 1 or m = √p, where p is the total number of predictor variables in the observation.

[Figure 2 structure: root node ALCH_HBT (N = 357) splits into SMK_HBT (N = 227) and DIET_HBT (N = 130); subsequent splits on DIET_HBT, PHSY_ACT, and BMI yield the 7 terminal nodes.]

The total number of predictor variables is p = 5, so m = 2 and m = 3 were used. The next step is to determine u, the number of replications (trees grown). In general, u = 50 already gives satisfactory classification results (Agresti, 2007), but Sutton [25] suggested that u ≥ 100 gives better results than u = 50 because the misclassification rate becomes constant. Therefore, u in this study was set to 50, 100, and 200. The expectation is that m gives trees with small mutual correlation, because correlation is connected with the strength s (accuracy) of each tree (Tibshirani, 1996): a larger s gives higher prediction accuracy, and small correlation gives small variance in the random forest. Hastie et al. (2008) suggested also trying m = √p. Hence, m plays an important role in determining the best random forest.

Table 11 shows the misclassification level for each combination of u and m. The misclassification is consistently smaller with m = 3, and the smallest value (1.7%) occurs at m = 3 with u = 200 (and remains 1.7% at u = 300); hence m = 3 gives the optimum value, and the classification trees were grown with m = 3 and u = 200.

Table 11. Misclassification level by u and m.

m    u     Misclassification (%)
2    50    2.5
2    100   2.5
2    200   2.2
3    50    1.9
3    100   2.2
3*   200   1.7
3    300   1.7

*used in this case
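Choosing the pair (m, u) from Table 11 is a simple grid search over the misclassification rates; on ties in error, the smaller forest is preferred. A minimal sketch of this selection step (not part of the original analysis):

```python
# Table 11: (m, u) -> misclassification rate (%)
grid = {(2, 50): 2.5, (2, 100): 2.5, (2, 200): 2.2,
        (3, 50): 1.9, (3, 100): 2.2, (3, 200): 1.7, (3, 300): 1.7}

# Pick the smallest error; break ties toward fewer trees (smaller u).
best_m, best_u = min(grid, key=lambda mu: (grid[mu], mu[1]))
print(best_m, best_u, grid[(best_m, best_u)])  # 3 200 1.7
```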

The out-of-bag (OOB) estimate from the sample is 3.36%; the misclassification proportion of the random forest prediction on the original data set is 2.8%. In other words, the OOB estimate is used to estimate the generalization error of an arbitrary classifier, and the OOB error is useful for optimizing m. The OOB error stays constant as the forest grows when m = 3, so it is a suitable estimator of the prediction error, and 200 trees are adequate: the correlation between trees and the strength of each tree are stable at m = 3. Based on Table 11, running the random forest with m = 3 and u = 200 gives the smallest misclassification and a constant OOB error.
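The OOB idea rests on the fact that a bootstrap sample drawn with replacement leaves out, on average, about 1 − (1 − 1/n)ⁿ ≈ e⁻¹ ≈ 36.8% of the observations; those left-out cases form the out-of-bag set on which each tree's error is estimated. A quick simulation with n = 357, the training size used here, illustrates this (an illustrative sketch, not the study's computation):

```python
import random

random.seed(0)
n, trials = 357, 2000

# Average fraction of observations NOT drawn into a bootstrap sample.
oob_fracs = []
for _ in range(trials):
    in_bag = {random.randrange(n) for _ in range(n)}  # sample with replacement
    oob_fracs.append(1 - len(in_bag) / n)

mean_oob = sum(oob_fracs) / trials
print(round(mean_oob, 3))  # close to 1/e ~ 0.368
```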

Figure 3. Out-of-bag (OOB) estimate at different levels of m.

Figure 4 shows the important variables of the study in the Medicum clinic based on the Gini index. The Gini index characterizes the split that best separates the class with the most observations, i.e., the most important class in the node.



The class is the category of the response variable, here Ischemic patient or Hemorrhagic patient. The most important variables are diet habit and alcohol habit: diet habit has a mean Gini index of 42.6%, while alcohol habit has 31.6%. In the random forest analysis, the parent (root) node is alcohol habit, since this variable has the highest probability of being chosen as the classifier.
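The Gini index that produces this ranking is based on node impurity, 1 − Σ pₖ², and a variable's importance accumulates the impurity decrease over the splits that use it. A minimal sketch using the root-node class counts of this data set (296 Ischemic and 61 Hemorrhagic patients):

```python
def gini(counts):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

# Root node of the tree: 296 Ischemic and 61 Hemorrhagic patients.
print(round(gini([296, 61]), 3))   # impurity before any split
print(gini([100, 0]))              # a pure (terminal) node has impurity 0.0
```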

Figure 4. Variable importance for ischemic and hemorrhagic patients in the Medicum clinic.

The remaining variables have smaller Gini indices: body mass index (11.59%), smoking habit (9.47%), and physical activity (4.97%). These variables are less important in this study, though not unimportant. Figure 4 plots the variables in decreasing order of Gini index.

Figure 5. Classification tree from the random forest for ischemic and hemorrhagic patients.

The main splitter of the random forest grown on the Ischemic and Hemorrhagic patient data is alcohol habit. Ischemic patients are classified by smoking habit, diet habit, and physical activity, while Hemorrhagic patients are classified by diet habit. The training:testing proportion used in this research is 85%:15%, and Table 12 presents the accuracy on the training data. In general, prediction accuracy is 1 − APER; a higher 1 − APER means more satisfactory prediction accuracy. However, since the Ischemic and Hemorrhagic classes are imbalanced, we also consider sensitivity and specificity: higher sensitivity means the Ischemic class is predicted accurately, and higher specificity means the Hemorrhagic class is predicted accurately. These values are summarized by the G-mean and the area under the ROC curve (AUC); higher values indicate better performance.

Table 12. Confusion matrix for the training data (85%:15% split).

Actual class    Predicted Ischemic    Predicted Hemorrhagic    Total
Ischemic        292                   4                        296
Hemorrhagic     2                     59                       61
Total           294                   63                       357

APER = 1.68%, 1 − APER = 98.32%

Sensitivity = TP / (TP + FN) = 292/296 = 0.986

Specificity = TN / (TN + FP) = 0.972

G-mean = √(Sensitivity × Specificity) = 0.965

Table 12 presents the classification accuracy on the training data: 98.32% of the training observations are classified correctly, showing that the random forest accuracy is satisfactory, with a misclassification rate of 1.68%. The sensitivity of 0.986 means that 98.6% of Ischemic patients are correctly classified as Ischemic; the specificity of 0.972 means that 97.2% of Hemorrhagic patients are correctly classified as Hemorrhagic. The G-mean measures the balance of prediction across classes, because classification methods tend to favor the class with more samples and perform poorly on the minority class. In this research the G-mean is 96.5%, so the random forest predicts both classes satisfactorily.
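These measures can be recomputed directly from the Table 12 counts, treating Ischemic as the positive class (small rounding differences from the reported figures are possible):

```python
import math

# Training confusion matrix from Table 12 (Ischemic = positive class).
TP, FN = 292, 4   # actual Ischemic: predicted Ischemic / Hemorrhagic
FP, TN = 2, 59    # actual Hemorrhagic: predicted Ischemic / Hemorrhagic

total = TP + FN + FP + TN
accuracy = (TP + TN) / total                  # 1 - APER
sensitivity = TP / (TP + FN)                  # Ischemic-class accuracy
specificity = TN / (TN + FP)                  # Hemorrhagic-class accuracy
g_mean = math.sqrt(sensitivity * specificity)

print(round(accuracy, 4), round(sensitivity, 3),
      round(specificity, 3), round(g_mean, 3))
```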

The ability of the classification method to predict imbalanced data can also be assessed by the area under the ROC curve (AUC): a larger area means the classifier handles the imbalanced data well. Figure 6 presents the ROC curve, which is nearly perfect, with a 93.5% true positive rate and a 6.5% false positive rate. A perfect classifier sits at the upper left-hand corner of the graph, so the closer the curve comes to that corner, the better the classification. The diagonal line represents random guessing; the distance of the curve above the diagonal shows how much better the classifier performs than a random guess.

Table 13 presents the validation used to determine whether the classification tree is adequate. Validation was done by feeding the 63 testing observations into the classification tree model. The accuracy on the testing data is 95.23%, i.e., 95.23% of the testing observations are classified correctly, which is again satisfactory, with a misclassification rate of 4.77%.
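The validation step is the same APER calculation applied to the Table 13 counts: 3 of the 63 held-out patients are misclassified (3/63 ≈ 4.76%, within rounding of the reported 4.77%). A small sketch:

```python
# Testing confusion matrix from Table 13: (actual, predicted) -> count.
confusion = {("Ischemic", "Ischemic"): 50, ("Ischemic", "Hemorrhagic"): 3,
             ("Hemorrhagic", "Ischemic"): 0, ("Hemorrhagic", "Hemorrhagic"): 10}

total = sum(confusion.values())
errors = sum(c for (actual, pred), c in confusion.items() if actual != pred)
aper = 100 * errors / total
print(round(aper, 2), round(100 - aper, 2))  # APER and 1-APER in percent
```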


Figure 6. ROC curve.

Table 13. Confusion matrix for the testing data (85%:15% split).

Actual class    Predicted Ischemic    Predicted Hemorrhagic    Total
Ischemic        50                    3                        53
Hemorrhagic     0                     10                       10
Total           50                    13                       63

APER = 4.77%, 1 − APER = 95.23%

4. Conclusion and remarks

Ischemic patients are classified by smoking habit, diet habit, and physical activity, while Hemorrhagic patients are classified by diet habit. The tree accuracy is 98.32%, meaning that 98.32% of the observations are classified correctly, with a sensitivity of 0.986 (98.6% of Ischemic patients correctly classified as Ischemic) and a specificity of 0.972 (97.2% of Hemorrhagic patients correctly classified as Hemorrhagic).

Acknowledgement

We thank the Medicum Clinic, Tallinn, Estonia, for kindly providing the data. We also thank Kerti Sönmez of the International Office and Siyi Ma, Erasmus+ coordinator, Tallinn University of Technology, Tallinn, Estonia.

References

Agresti, A. (2007). An introduction to categorical data analysis. New York: John Wiley and Sons.

Breiman, L. (1996). Out of bag estimation [Online]. Available: https://www.stat.berkeley.edu/~breiman/OOBestimation.pdf

Breiman, L. (2001). Random forests. Machine Learning, 45, 5-32.

Breiman, L., Friedman, J.H., Olshen R.A., & Stone, C.J. (1984). Classification and regression tree. New York, NY: Chapman and Hall

Dietterich, T.G. (2000) Ensemble methods in machine learning. USA: Oregon State University

Efron, B., & Tibshirani, R. (1985). The bootstrap method for assessing statistical accuracy. California: Stanford University



Estonian Statistics (2015). Population decline is slowing down [Online]. Available: www.stat.ee/population

European Cardiovascular Statistics. (2012). European cardiovascular disease [Online]. Available: http://www.escardio.org/about/documents/eu-cardiovascular-disease-statistics-2012.pdf

European Commission (2010). Eurostat: statistics explained [Online]. Available: http://ec.europa.eu/eurostat/statistics-explained/index.php

Hastie, T.J., Tibshirani, R.J., & Friedman, J.H. (2008). The elements of statistical learning: Data mining, inference and prediction. Second Edition. New York: Springer-Verlag

Johnson, N., & Wichern, D., (1998). Applied multivariate statistical analysis. Prentice-Hall, Englewood Cliffs, N.J.

Kubat, M., Holte, R., & Matwin, S. (1997). Adaptive fraud detection. Data Mining and Knowledge Discovery, 1, 291–316.

Nichols, M., Townsend, N., Scarborough, P., & Rayner, M. (2014). Cardiovascular diseases in Europe 2014: Epidemiological update. European Heart Journal, DOI: 10.1093/eurheartj/ehu299

OECD. (2014). Health at a glance: Europe 2014. ISSN: 2305-6088.

Rajagopal, N., Xie, W., Li, Y., Wagner U., Wang, W., Stamatoyannopoulos, J., Ernst, J., Kellis, M., & Ren, B. (2013). A random forest based algorithm for enhancer identification from chromatin state. doi:10.1371/journal.pcbi.1002968

Sartono, B., and Syafitri, D.U., (2010). Ensemble tree: An alternative toward simple classification and regression tree. ISSN: 0853-8115

Takahashi, O., Cook, E.F., Nakamura, T., Saito, J., Ikawa, F., & Fukui, T. (2006). Risk stratification for in-hospital mortality in spontaneous intracerebral hemorrhage: A Classification and Regression Tree Analysis. Oxford University Press. doi: 10.1093/qjmed/hcl107

Tibshirani, R. (1996). Bias, variance, and prediction error for classification rules, Technical Report, Statistics Department, University of Toronto

Truelsen, T., Begg, S., & Mathers, C.D. (2000). The global burden of cerebrovascular disease. GBD 2000 Working Paper, World Health Organization, Geneva. [Online]. Available: http://www.who.int/healthinfo/statistics/bod_cerebrovasculardiseasestroke.pdf

WHO (World Health Organization). (2012). Modified and unmodified risk factors [Online]. Available: http://www.who.int/