ALCS 2013-14 Main Report (English), 22 December 2015


VI.3 Model selection: cross-validation

We selected carefully from the long list of possible variables because a model including all available variables can lead to ‘overfitting’, where data ‘noise’ impairs the ability to understand causal relationships between dependent and independent variables.72 The variables we selected across surveys include demographic information (household size, dependency ratio, proportion of females in the household), the household head’s characteristics (age, education, employment), household assets (including land, livestock and dwelling) and household access to basic services (water, sanitation and electricity). We also included some subjective measures of well-being and UN proxy measures for district-level conflict and insecurity.

We used a 10-fold cross-validation approach to check for over-fitting bias; that is, we randomly divided the household survey data into 10 folds (parts), using nine folds as ‘training data’ and the remaining fold as ‘testing data’. The consumption model is estimated on the nine ‘training’ folds using a stepwise Ordinary Least Squares (OLS) regression, an iterative process that selects variables based on their correlation with household consumption and their predictive power. We repeated this analysis 10 times,73 each time using a different nine folds as ‘training data’ and the remaining fold as ‘testing data’, and each time testing the model’s performance against the actual – surveyed – poverty rates and on mean squared errors (MSEs).74 Figure VI.1 illustrates the validation approach.

72 While a model may perform well within the sample data used to create the model, ‘over-fitting’ may cause it to perform poorly on new data.
73 The SWIFT method creates 10 folds; however, a test may indicate any number of folds.
74 In projecting (simulating) household expenditure, we assume that the error and regression coefficients follow normal distributions.
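The fold-splitting and testing logic described above can be sketched as follows. This is a minimal illustration on synthetic data, using plain OLS in place of the stepwise variable selection; the predictors, the poverty line, and all numbers are made up for the example and are not taken from the survey.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the household survey: X = candidate predictors,
# y = log consumption per capita (all values are illustrative only).
n, k = 500, 5
X = rng.normal(size=(n, k))
y = X @ rng.normal(size=k) + rng.normal(scale=0.5, size=n)
poverty_line = np.quantile(y, 0.35)  # hypothetical poverty line

# Randomly assign each household to one of 10 folds.
folds = rng.integers(0, 10, size=n)

mses, rate_gaps = [], []
for f in range(10):
    train, test = folds != f, folds == f
    # Estimate the consumption model on the nine training folds
    # (the stepwise selection step is omitted in this sketch).
    Xtr = np.column_stack([np.ones(train.sum()), X[train]])
    beta, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)
    Xte = np.column_stack([np.ones(test.sum()), X[test]])
    pred = Xte @ beta
    # Test on the held-out fold: MSE, and the gap between the actual
    # and the model-projected poverty headcount rate.
    mses.append(np.mean((y[test] - pred) ** 2))
    rate_gaps.append(abs(np.mean(y[test] < poverty_line)
                         - np.mean(pred < poverty_line)))

print(f"mean MSE across folds: {np.mean(mses):.3f}")
print(f"mean |actual - projected| poverty rate: {np.mean(rate_gaps):.3f}")
```

In the actual procedure this loop would be rerun for each candidate p-value of the stepwise regression, which is what Figure VI.2 summarises.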
The simulation process is repeated for all households, typically twenty times, using Stata’s ‘mi impute regress’ command. A poverty headcount rate is calculated by comparing the simulated household expenditure or income with a poverty line for each of the twenty simulation rounds. The average poverty rate across the simulations is used as the poverty estimate.

Figure VI.1: Illustration of cross-validation
Step 1: Randomly split the data into three folds (C refers to consumption data; X refers to non-consumption data).
Step 2: Select two folds as training data, develop a model there, and test the model’s performance on the testing data.
Step 3: Repeat the above procedure three times, changing the testing data each time.
Source: Adapted from Yoshida et al. (2015)

This cross-validation exercise determines the optimal p-value for the subsequent stepwise regressions – that is, the p-value that minimises the difference between actual and model-projected poverty rates. To do this, we repeat the exercise for a range of p-values between 0.1 percent and 10 percent, also examining the mean squared error to check for over-fitting. We illustrate this second cross-validation exercise in Figure VI.2.

Figure VI.2: Results of the cross-validation exercise
[Two panels plot the absolute difference between actual and projected poverty rates, and the MSE, against p-values from 0.02 to 0.1.]
Source: Authors’ calculation based on NRVA 2011-12 data. Calculation excludes data from Helmand and Khost provinces.

As seen in Figure VI.2, although the absolute value of the difference between actual and projected poverty rates fluctuates, it clearly increases above a p-value of 6 percent. Therefore, we chose 6 percent as the optimal p-value for the subsequent stepwise OLS regression on the full sample to estimate a national model, as outlined in Table VI.5.
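The simulation step can be sketched as below. This mimics what the multiple-imputation draw does – sampling the regression coefficients and the error term from their estimated normal distributions, as assumed in footnote 74 – but it is not Stata’s implementation, and the fitted coefficients, covariance, residual variance and poverty line are all hypothetical values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative model estimates from the consumption regression:
# point estimates, their covariance, and the residual standard error.
beta_hat = np.array([1.0, 0.4, -0.2])
beta_cov = 0.01 * np.eye(3)
sigma = 0.5
poverty_line = 1.0  # hypothetical, on the log-expenditure scale

# Non-consumption data (with an intercept column) for the households
# whose expenditure is being simulated.
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])

# Repeat the simulation twenty times, each round drawing new
# coefficients (normal around beta_hat) and a fresh error term
# for every household.
rates = []
for _ in range(20):
    beta = rng.multivariate_normal(beta_hat, beta_cov)
    eps = rng.normal(scale=sigma, size=n)
    log_exp = X @ beta + eps
    # Headcount rate for this round: share of simulated expenditure
    # falling below the poverty line.
    rates.append(np.mean(log_exp < poverty_line))

# The poverty estimate is the average headcount over the twenty rounds.
print(f"estimated poverty rate: {np.mean(rates):.3f}")
```

Averaging over the rounds, rather than taking a single draw, smooths out the simulation noise introduced by the random coefficient and error draws.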

VI.4 Simulation and estimation of poverty rates