
The modifiable (controllable) risk factors considered as variables in this case are alcohol habit, smoking habit, physical activity, body mass index, dietary habit, weight, and height. Algorithms such as classification trees and regression trees are widely used for classification. CART (Classification and Regression Tree), one of the classification methods in data mining, was used by Takahashi et al. (2006) to identify four mortality groups among intracerebral haemorrhage (ICH) patients. Random forest was used by Rajagopal et al. (2013) for enhancer identification from chromatin state; they concluded that random forest provides informative features for classification, since it can predict enhancers accurately genome-wide based on chromatin modifications. The aims of this study are to describe the characteristics of ischemic and hemorrhagic patients and to classify ischemic and hemorrhagic patients with respect to the modifiable risk factors. The sample consists of ischemic and hemorrhagic patient data from 2009 to 2014, which is expected to be representative of ischemic and hemorrhagic patients. Hospital records in Estonia show 3,245 cardiovascular disease cases per 100,000 population in 2001, increasing to 3,327 per 100,000 population in 2009 (Nichols et al., 2014).

2. Materials and methods

CART (Classification and Regression Tree), developed by Leo Breiman, Jerome H. Friedman, Richard A. Olshen and Charles J. Stone around the 1980s, is a nonparametric method developed for classification analysis with continuous predictor variables and a categorical response (Breiman, 1984). Tree construction in the CART method involves three elements, as follows.

(1) Forming the classification tree. The process of forming the classification tree consists of three stages:

(a) Choosing the splitter. At this stage a heterogeneous learning (training) sample $\mathcal{L}$ is used. The sample is split according to a splitting rule and a goodness-of-split criterion. The class heterogeneity within a given node $t$ of the classification tree is measured with an impurity measure $i(t)$, which helps in finding the optimal splitting function. The most commonly used impurity function is the Gini index, which is simple and fast to compute and suitable for many applications [6]. The Gini index is

$i(t) = \sum_{i \neq j} p(i|t)\,p(j|t)$,

where $p(j|t)$ denotes the proportion of class $j$ in node $t$ and $p(i|t)$ the proportion of class $i$ in node $t$. The goodness of split evaluates a split $s$ at node $t$ by the decrease in class heterogeneity it produces, defined as

$\phi(s,t) = \Delta i(s,t) = i(t) - p_L\,i(t_L) - p_R\,i(t_R)$,

where $i(t)$ is the impurity of node $t$, $p_L$ and $p_R$ are the proportions of observations sent to the left and right child nodes, and $i(t_L)$ and $i(t_R)$ are the impurities of the left and right child nodes (a numerical sketch of this criterion is given after stage (c) below). The split that produces the largest $\phi(s,t)$ is the best splitter, because it reduces heterogeneity the most. The child nodes $t_L$ and $t_R$ partition node $t$ into two disjoint subsets with proportions $p_L$ and $p_R$. Because $t_L \cup t_R = t$, the value $\Delta i(s,t)$ represents the change in heterogeneity of node $t$ produced by split $s$. If a node still contains a non-homogeneous mix of classes, the same procedure is repeated until the classification tree reaches a specified configuration, i.e.

$\Delta i(s^{*},t) = \max_{s \in S} \Delta i(s,t)$.

(b) Determining terminal nodes. A node $t$ is declared terminal, and is not split further, when there is no significant decrease in heterogeneity or when a minimum size limit is reached, e.g. only one observation remains in a child node. In general, once the minimum number of cases in a terminal node is reached, tree development stops (Breiman, 1984).

(c) Assigning class labels. The class label of a terminal node is assigned by the majority rule:

$p(j_0|t) = \max_j p(j|t) = \max_j \dfrac{N_j(t)}{N(t)}$,

where $N_j(t)$ is the number of observations of class $j$ in node $t$ and $N(t)$ is the total number of observations in node $t$. The class label of terminal node $t$ is $j_0$, which gives the estimated misclassification rate of node $t$ as $r(t) = 1 - \max_j p(j|t)$. The tree-growing process stops when each child node contains only one observation or reaches the minimum size limit, when all observations in a child node are identical, or when the maximum tree depth is reached.
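To make the splitting criterion of stage (a) concrete, the following is a minimal sketch in Python (not the authors' code; the class labels and the candidate risk factor are made up for illustration). It uses the equivalent form of the Gini index, $i(t) = 1 - \sum_j p(j|t)^2$, and computes the impurity decrease $\phi(s,t)$ for one candidate split.

import numpy as np

def gini(labels):
    # Gini impurity i(t) = 1 - sum_j p(j|t)^2 of the labels reaching a node.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gain(labels, go_left):
    # Goodness of split: phi(s,t) = i(t) - p_L * i(t_L) - p_R * i(t_R).
    left, right = labels[go_left], labels[~go_left]
    p_left, p_right = len(left) / len(labels), len(right) / len(labels)
    return gini(labels) - p_left * gini(left) - p_right * gini(right)

# Ten hypothetical patients (1 = ischemic, 2 = hemorrhagic),
# split on a made-up binary risk factor.
y = np.array([1, 1, 1, 2, 2, 1, 2, 2, 1, 2])
x = np.array([0, 0, 0, 1, 1, 0, 1, 1, 0, 1])
print(split_gain(y, x == 0))

The candidate split with the largest value of split_gain reduces node heterogeneity the most and would be chosen as the splitter $s^{*}$.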
(2) Pruning the classification tree. Pruning removes unimportant parts of the tree in order to obtain the optimal tree. Minimum cost complexity is the pruning measure used to obtain the optimal tree (Breiman, 1984):

$R_\alpha(T) = R(T) + \alpha\,|\tilde{T}|$,

where $R(T)$ is the resubstitution estimate of the misclassification rate of subtree $T$, $\alpha$ is the complexity parameter, and $|\tilde{T}|$ is the number of terminal nodes of tree $T$.

(3) Determining the optimum classification tree. A very large classification tree has high cost complexity because it describes the data structure in a complex way, so it is necessary to choose an optimal tree of modest size that still provides a sufficiently good error estimate. If $R(T)$ alone were chosen as the criterion, the larger tree would always tend to be chosen, because a larger tree makes $R(T)$ smaller. Two types of estimators are used to obtain the optimal classification tree:

(a) Test sample estimate. The test sample estimate is used for large data sets. The procedure starts by dividing the cases into two parts, $\mathcal{L}_1$ and $\mathcal{L}_2$. $\mathcal{L}_1$ is used to grow the tree $T$, while $\mathcal{L}_2$ is used to estimate $R^{ts}(T)$:

$R^{ts}(T) = \dfrac{1}{N_2}\sum_{(x_n, j_n)\in\mathcal{L}_2} I\big(d(x_n) \neq j_n\big)$,

where $N_2$ is the number of observations in $\mathcal{L}_2$ and $I(\cdot)$ equals 0 if the statement in parentheses is false and 1 if it is true. The optimum classification tree is $T^{*}$ with $R^{ts}(T^{*}) = \min_k R^{ts}(T_k)$.

(b) V-fold cross-validation estimate. The V-fold cross-validation estimate is used when the number of observations is not very large. The observations in $\mathcal{L}$ are divided randomly into $V$ disjoint parts of approximately equal size within each class. A tree $T^{(v)}$ is grown on the learning (training) sample $\mathcal{L} - \mathcal{L}_v$, $v = 1, 2, \ldots, V$, and the held-out part $\mathcal{L}_v$ is used as the test sample, so that

$R^{ts}\big(T^{(v)}\big) = \dfrac{1}{N_v}\sum_{(x_n, j_n)\in\mathcal{L}_v} I\big(d^{(v)}(x_n) \neq j_n\big)$,

with $N_v \cong N/V$ the number of observations in $\mathcal{L}_v$. The same procedure, applied to all observations in $\mathcal{L}$, produces the sequence of trees $T_k$. The V-fold cross-validation estimator for $T_k$ is then

$R^{CV}(T_k) = \dfrac{1}{V}\sum_{v=1}^{V} R^{ts}\big(T_k^{(v)}\big)$.

The optimum classification tree is $T^{*}$ with $R^{CV}(T^{*}) = \min_k R^{CV}(T_k)$ (a sketch of pruning and cross-validated tree selection is given below).
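As an illustration only of the pruning and tree-selection steps above (a sketch assuming scikit-learn, not the authors' implementation; the data are random placeholders), the following grows a tree, computes the cost-complexity pruning path over $\alpha$, and uses 10-fold cross-validation to select the subtree with the smallest estimated error.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 4))   # four binary risk factors (placeholder)
y = rng.integers(1, 3, size=200)        # 1 = ischemic, 2 = hemorrhagic (placeholder)

# Each alpha on the path R_alpha(T) = R(T) + alpha * |T~| defines one pruned subtree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

cv_error = []
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    # 10-fold estimate of R^CV(T_alpha); the optimal subtree minimises it.
    cv_error.append(1.0 - cross_val_score(tree, X, y, cv=10).mean())

best_alpha = path.ccp_alphas[int(np.argmin(cv_error))]
print("complexity parameter of the optimal subtree:", best_alpha)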
Random forest methods
An ensemble is a classification method in which individual decisions are combined, by weighting or weighted voting, to classify new examples (Dietterich, 2000). The key point is that an ensemble has higher accuracy than an individual classifier. One example of an ensemble method is random forest. Random forest is a classification method developed by Breiman. A random forest is a classifier formed from a collection of tree-structured classifiers, where each tree is grown from an independent, identically distributed random vector and each tree casts a unit vote for the most popular class (Breiman, 2001). Bagging is the repeated process of random resampling, which produces different data sets and therefore different trees (Sartono & Syafitri, 2010). In short, a random forest can be formed using the CART methodology, with each tree grown to maximum size without pruning. The advantages of random forest are: (1) it can be used for large-scale data with high accuracy; (2) it can estimate the importance of the variables in the classification; (3) it can be used to stabilize the error on imbalanced data sets; and (4) it generates an unbiased internal estimate of the generalization error as the forest is built.

Random forest algorithm
A random forest is built from a training set of size $N$ consisting of $p$ predictor variables. In short, the random forest algorithm is as follows:
(1) Draw a sample from the original data set by bootstrap resampling with replacement.
(2) For each bootstrap sample, grow an unpruned classification tree, choosing the best split at each node among $m$ randomly selected predictor variables, where $m = \log_2(p) + 1$ [8] or $m = \sqrt{p}$ (Hastie, 2008), with $p$ the number of predictor variables.
(3) Predict new sample data based on the classification tree.
(4) Repeat steps (1) to (3) $k$ times.
(5) Combine the $k$ tree estimates, using the majority vote for classification or the average for regression.

The result is an ensemble of trees of different sizes. The desired result is an ensemble with small correlation between trees, because small correlation produces a random forest with small variance (Hastie, 2008).

The bootstrap was described by Efron (1985) for estimating a sampling distribution through a resampling-with-replacement procedure on the original data. The bootstrap algorithm is as follows:
(a) Construct the empirical distribution $\hat{F}_n$ from the sample by giving probability $1/n$ to each $x_i$, $i = 1, 2, \ldots, n$.
(b) Draw a random bootstrap sample of size $n$ with replacement from the empirical distribution $\hat{F}_n$; call it the bootstrap sample $x^{*}$.
(c) Compute the statistic $\hat{\theta}$ from the bootstrap sample $x^{*}$; call it $\hat{\theta}^{*}$.
(d) Repeat steps (b) and (c) $B$ times, obtaining $\hat{\theta}^{*}_1, \hat{\theta}^{*}_2, \ldots, \hat{\theta}^{*}_B$.
(e) Build a probability distribution from the $\hat{\theta}^{*}_b$ by giving probability $1/B$ to each of $\hat{\theta}^{*}_1, \hat{\theta}^{*}_2, \ldots, \hat{\theta}^{*}_B$; this is the bootstrap estimator of the sampling distribution of $\hat{\theta}$, denoted $\hat{F}^{*}$.
(f) The bootstrap point estimate is approximated by $\hat{\theta}^{*} = \frac{1}{B}\sum_{b=1}^{B}\hat{\theta}^{*}_b$.

Out-of-bag estimates
According to Tibshirani (1996), the out-of-bag (OOB) estimate is a good tool for estimating error, especially the variance of classifiers whose decisions change under resampling. Breiman [13] showed empirically that the OOB estimate is needed so that the error estimate is unbiased. OOB data are the data not contained in a given bootstrap sample. On average, each observation of the original data set is out of bag for about 37% of the trees, so each original observation is predicted by roughly one third of the trees (Breiman, 1996). Breiman (2001) showed that the random forest error is bounded: as the number of trees increases, the error converges. Hence, the larger the number of trees $k$, the more the error converges, which is why random forests do not overfit.
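As a purely illustrative example of the forest construction and the out-of-bag error, the following sketch assumes scikit-learn and random placeholder data; it is not the authors' implementation.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(300, 4))   # placeholder predictors
y = rng.integers(1, 3, size=300)        # placeholder classes (1 / 2)

forest = RandomForestClassifier(
    n_estimators=500,       # k trees; the error converges as k grows
    max_features="sqrt",    # m = sqrt(p) candidate predictors per split
    bootstrap=True,         # each tree grown on a bootstrap resample
    oob_score=True,         # score each case only with trees that did not see it
    random_state=1,
).fit(X, y)

print("OOB error estimate:", 1.0 - forest.oob_score_)
print("variable importances:", forest.feature_importances_)
print("new case (majority vote):", forest.predict([[1, 0, 1, 0]]))

Each case is out of bag for roughly 37% of the trees, so the OOB error printed above is the internal error estimate discussed in the preceding paragraph; feature_importances_ corresponds to the variable-importance estimation listed among the advantages of the method.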
Classification accuracy
A measure of how well a sample classification function will perform on future samples is the apparent error rate (APER). APER is a performance measure that can be calculated for any classification procedure and does not depend on the form of the parent population. It is defined as the fraction of observations in the training set that are misclassified by the classification function (Johnson & Wichern, 1998). When the training data are extremely imbalanced, the overall classification accuracy is often not an appropriate performance measure. The ROC (receiver operating characteristic) curve is considered more informative than APER as a measure of classification accuracy (Agresti, 2007). The ROC curve is based on sensitivity and specificity: sensitivity is the accuracy on the positive class and specificity is the accuracy on the negative class, which makes ROC a satisfactory accuracy measure for imbalanced data. For the ROC curve, the area under the curve (AUC) reflects the goodness of the classification method: the larger the area under the curve, the better the method captures both sensitivity and specificity. The G-mean (geometric mean) is an accuracy measure based on sensitivity and specificity; it was used by Kubat et al. (1997) to assess the performance of their methods, since a classification method should be able to perform well on both sensitivity and specificity. The measures are defined as

$\text{APER} = \dfrac{FN + FP}{TN + FN + FP + TP} \times 100\%$,

$\text{Sensitivity} = \dfrac{TP}{TP + FN}$,

$\text{Specificity} = \dfrac{TN}{TN + FP}$,

$\text{G-mean} = \sqrt{\text{Sensitivity} \times \text{Specificity}}$.

Table 1. Apparent error rate table.

                        Predicted membership
Actual membership       Positive (1)              Negative (2)
Positive (1)            TN (True Negative)        FN (False Negative)
Negative (2)            FP (False Positive)       TP (True Positive)

Note: TN = number of observations from class 1 correctly classified as class 1; FP = number of observations from class 2 incorrectly classified as class 1; FN = number of observations from class 1 incorrectly classified as class 2; TP = number of observations from class 2 correctly classified as class 2.
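The following minimal sketch (not the authors' code; the confusion-matrix counts are made-up placeholders) computes the measures above from counts laid out as in Table 1.

import math

TN, FN, FP, TP = 70, 10, 15, 25   # hypothetical counts, laid out as in Table 1

aper        = (FN + FP) / (TN + FN + FP + TP) * 100   # apparent error rate (%)
sensitivity = TP / (TP + FN)                          # accuracy on the positive class
specificity = TN / (TN + FP)                          # accuracy on the negative class
g_mean      = math.sqrt(sensitivity * specificity)

print(f"APER = {aper:.1f}%  sensitivity = {sensitivity:.3f}  "
      f"specificity = {specificity:.3f}  G-mean = {g_mean:.3f}")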
Ischemic and hemorrhagic
More than 80% of deaths in low- and middle-income countries are caused by cardiovascular disease, and it is predicted that in 2030, 23.30 million people will die from it. In Estonia in 2010, 786.5 deaths per 100,000 population were recorded (European Commission, 2010). Stroke (cerebrovascular disease) is one of the most dangerous types of cardiovascular disease. Most deaths due to stroke occur in the Baltic states, Central Europe, and Eastern Europe, where more than 350 deaths per 100,000 population are caused by this disease (OECD, 2014). Stroke is the second most common disease in Estonia (Estonian Statistics, 2015). Based on its cause, stroke is divided into two kinds: ischemic and hemorrhagic. In Europe, 75% of stroke victims have ischemic stroke and 15% hemorrhagic stroke (European Cardiovascular Statistics, 2012). Ischemic stroke is caused by a blood clot blocking blood flow to the brain, while hemorrhagic stroke occurs when a weakened blood vessel ruptures (European Cardiovascular Statistics, 2012). The brain depends on the arteries to bring fresh blood containing oxygen and nutrients to the brain and to remove carbon dioxide. If an artery is clogged or ruptured, the neurons cannot produce enough energy and eventually stop working, causing brain cells to die.

The risk factors for stroke are divided into two groups: modifiable risk factors and risk factors that cannot be modified.

Risk factors that cannot be modified:
(a) Age. Stroke can affect anyone at any age, but with aging the risk of stroke becomes twice as great.
(b) Sex. Women have a greater chance of having a stroke each year than men, especially in middle age. Women are less aware that they have a higher risk of stroke and have little knowledge about stroke risk; stroke kills twice as many women as breast cancer every year. At young ages, men have a greater chance of having a stroke.
(c) Race and ethnicity. African Americans have twice the risk of Asian/Pacific Islanders and Caucasians, because they have a higher prevalence of high blood pressure, diabetes, and obesity.
(d) Family history. A person with a family history of stroke has a greater chance of having a stroke at an early age.

Risk factors that can be modified are as follows:
(a) Obesity. Obesity is defined as an accumulation of excess fat that presents a health risk. Obesity is measured with the body mass index (BMI): a person's weight in kilograms divided by the square of their height in meters (WHO, 2012).

Table 2. Body mass index for Europeans.
Status           BMI (kg/m^2)
Underweight      < 18.5
Normal weight    18.5 <= BMI < 24.99
Overweight       25.0 <= BMI < 29.99
Obesity          >= 30

(b) Dietary habit. A healthy diet decreases the risk of chronic disease, strengthens immunity, and helps achieve an ideal weight. Choosing appropriate food and consuming fruit and vegetables stabilizes calorie intake.
(c) Physical activity. Physical activity, or any activity that makes the body move at least once a week, decreases the risk of having a stroke. Regular physical activity increases immunity and can prevent chronic disease.
(d) Tobacco use and smoking habit. A smoker has a higher risk of stroke than a non-smoker. Smoking can cause blood clotting and coagulation and increases the buildup of plaque in the arteries.
(e) Alcohol use. Alcohol users have a higher risk than non-drinkers. Alcohol can increase blood pressure and raise triglyceride (cholesterol) levels, which harden the arteries.

Data source
This study uses data on ischemic and hemorrhagic patients at the Medicum Clinic, Punane 61, Tallinn, Estonia. The sample consists of 420 patients with a diagnosis date between 2009 and April 2015.

Identification of variables
One response variable and five predictor variables are used. The response variable is the stroke patient type, with two categories: ischemic patients and hemorrhagic patients. Four of the predictor variables are nominal and one is on a ratio scale. Table 3 describes the variables in the observation data set.

Table 3. Variable description.
No.  Variable  Scale    Information                                                                                        Category
1    Y         Nominal  Response variable: stroke type                                                                     1: Ischemic patients; 2: Hemorrhagic patients
2    X1        Nominal  Alcohol habit: drinking at least 60 grams within a period of at least 30 days (WHO, 2014)          1: No alcohol; 2: Yes
3    X2        Nominal  Dietary habit: weekly consumption of vegetables and fruits, diet controlled by a doctor (WHO, 1999) 1: No; 2: Yes
4    X3        Nominal  Smoking habit: daily smoking over the course of one year (Estonian Health Interview, 2006)         1: No; 2: Yes
5    X4        Nominal  Physical activity: at least once a week for a minimum of 60 minutes                                1: No; 2: Yes
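As an illustration of the 1/2 coding in Table 3 (the column names and records are hypothetical; this is not the clinic data), a short sketch assuming pandas:

import pandas as pd

# Hypothetical raw records using the categories of Table 3.
records = pd.DataFrame({
    "stroke_type":       ["Ischemic", "Hemorrhagic", "Ischemic"],
    "alcohol_habit":     ["No", "Yes", "No"],
    "dietary_habit":     ["Yes", "Yes", "No"],
    "smoking_habit":     ["No", "Yes", "Yes"],
    "physical_activity": ["Yes", "No", "No"],
})

# 1/2 coding as in Table 3: 1 = Ischemic / No, 2 = Hemorrhagic / Yes.
coding = {"Ischemic": 1, "Hemorrhagic": 2, "No": 1, "Yes": 2}
coded = records.apply(lambda col: col.map(coding))
print(coded)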
Methods
The main analysis starts when data preparation is complete. The main analysis uses random forest analysis to identify the important variables that influence ischemic and hemorrhagic patients based on the modifiable risk factors. Because the response variable in this study is categorical, the random forest is used for classification. The steps are as follows (a workflow sketch is given after this list):
(1) Code the data according to the categories defined above.
(2) Divide the data into two parts, training data and testing data, with training:testing proportions of 75:25, 80:20, 85:15, 90:10, and 95:5.
(3) Analyze with random forest analysis:
(a) Draw a bootstrap sample with replacement from the original data.
(b) For each bootstrap sample, grow a classification or regression tree without pruning the tree.
(c) Choose the best split among the predictor variables from a random sample of $m$ predictors, where $m = \log_2(p) + 1$ or $m = \sqrt{p}$ and $p$ is the number of predictor variables.
(d) Grow the classification tree.
(e) Repeat steps (b) to (d) for $k$ replications.
(f) Use the majority vote over the $k$ trees for the predicted classification.
(g) Compute the misclassification rate on the training data.
(h) Compute the misclassification rate on the testing data.
Repeat steps (b) to (h) with other numbers of trees $k$.
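The following sketch (assuming scikit-learn; the coded data are random placeholders, not the clinic data) runs through the workflow above: it loops over the five training/testing proportions, fits an unpruned random forest with a random subset of predictors at each split, and reports the misclassification rate on both parts.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X_coded = rng.integers(1, 3, size=(420, 4))   # placeholder for the coded predictors
y_coded = rng.integers(1, 3, size=420)        # placeholder for the coded response

for test_share in (0.25, 0.20, 0.15, 0.10, 0.05):   # 75:25, 80:20, 85:15, 90:10, 95:5
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_coded, y_coded, test_size=test_share, random_state=2)
    forest = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                                    random_state=2).fit(X_tr, y_tr)
    err_train = 1.0 - forest.score(X_tr, y_tr)   # misclassification on training data
    err_test  = 1.0 - forest.score(X_te, y_te)   # misclassification on testing data
    print(f"{round((1 - test_share) * 100)}:{round(test_share * 100)}  "
          f"train error = {err_train:.3f}  test error = {err_test:.3f}")

In practice the loop would also be repeated for several numbers of trees (the n_estimators value), as described in the final step above.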

3. Results and discussion