VI.3 Model selection: cross-validation
We selected carefully from the long list of possible variables because creating a model that includes all available variables can lead to ‘overfitting’, where data ‘noise’ impairs the ability to understand causal relationships between dependent and independent variables.72
The variables we selected across surveys include demographic information (household size, dependency ratio, proportion of females in the household), the household head’s characteristics (age, education, employment), household assets (including land, livestock and dwelling) and household access to basic services (water, sanitation and electricity). We also included some subjective measures of well-being and UN proxy measures for district-level conflict and insecurity. We used a 10-fold validation approach to check for over-fitting bias; that is, we randomly divided the household survey data into 10 folds (parts), using nine folds as ‘training data’ and the remaining fold as ‘testing data’. The consumption model is estimated on the nine ‘training’ folds using a stepwise Ordinary Least Squares (OLS) regression, an iterative process that selects variables based on their correlation with household consumption and their predictive power. We repeated this analysis 10 times,73 each time using a different nine folds as ‘training data’ and the remaining fold as ‘testing data’, and each time testing the model’s performance against the actual – surveyed – poverty rates and on mean squared errors (MSEs).74
Figure VI.1 illustrates the validation approach.
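To make the procedure concrete, the fold-splitting and stepwise estimation described above can be sketched in Python. This is a simplified illustration, not the SWIFT implementation: the function names are hypothetical, and a normal-approximation p-value stands in for the exact t-test used by standard stepwise routines.

```python
import math
import numpy as np

def forward_stepwise_ols(X, y, p_enter=0.05):
    """Greedy forward selection: repeatedly add the candidate variable
    with the smallest entry p-value, stopping when none is below p_enter."""
    n, k = X.shape
    selected = []
    while True:
        best_p, best_j = 1.0, None
        for j in range(k):
            if j in selected:
                continue
            A = np.column_stack([np.ones(n), X[:, selected + [j]]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            resid = y - A @ beta
            sigma2 = resid @ resid / (n - A.shape[1])
            cov = sigma2 * np.linalg.inv(A.T @ A)
            t = beta[-1] / math.sqrt(cov[-1, -1])
            p = math.erfc(abs(t) / math.sqrt(2))  # two-sided normal p-value
            if p < best_p:
                best_p, best_j = p, j
        if best_j is None or best_p >= p_enter:
            break
        selected.append(best_j)
    return selected

def cross_validated_mse(X, y, n_folds=10, p_enter=0.05, seed=0):
    """Train a stepwise model on nine folds, evaluate on the held-out
    fold, and average the out-of-sample MSE over the 10 repetitions."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    mses = []
    for i in range(n_folds):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        sel = forward_stepwise_ols(X[train], y[train], p_enter)
        A_tr = np.column_stack([np.ones(len(train)), X[train][:, sel]])
        beta, *_ = np.linalg.lstsq(A_tr, y[train], rcond=None)
        A_te = np.column_stack([np.ones(len(test)), X[test][:, sel]])
        mses.append(float(np.mean((y[test] - A_te @ beta) ** 2)))
    return float(np.mean(mses))
```

A cross-validated MSE well above the in-sample MSE would signal the over-fitting that footnote 72 warns about.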
72 While a model may perform well within the sample data used to create the model, ‘over-fitting’ may cause the model to perform poorly on new data.
73 The SWIFT method creates 10 folds, although in principle any number of folds may be used.
74 In simulating household expenditure, we assume that the errors and regression coefficients follow normal distributions. The simulation process is repeated for all households, typically twenty times, using STATA’s ‘mi impute regress’ command. A poverty headcount rate is calculated by comparing the simulated household expenditure or income with a poverty line in each of the twenty simulation rounds. The average poverty rate across the simulations is used as the poverty estimate.
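The simulation step in this footnote can be illustrated with a short sketch: under the stated normality assumptions, coefficients and household errors are drawn repeatedly, a headcount is computed in each round against the (log) poverty line, and the rounds are averaged. The function name and arguments below are hypothetical; in practice STATA’s ‘mi impute regress’ performs the imputation.

```python
import numpy as np

def simulated_poverty_rate(X, beta_hat, beta_cov, sigma, poverty_line,
                           n_sims=20, seed=0):
    """Draw regression coefficients and household errors from normal
    distributions, simulate log consumption for every household, and
    average the poverty headcount over the simulation rounds."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    rates = []
    for _ in range(n_sims):
        beta = rng.multivariate_normal(beta_hat, beta_cov)  # coefficient draw
        eps = rng.normal(scale=sigma, size=n)               # household error draw
        log_cons = X @ beta + eps                           # simulated log consumption
        rates.append(np.mean(log_cons < np.log(poverty_line)))
    return float(np.mean(rates))
```

Averaging over the draws, rather than using point predictions alone, propagates the model’s estimation and residual uncertainty into the poverty estimate.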
Figure VI.1: Illustration of cross-validation
Step 1: Randomly split data into three folds (C refers to consumption; X refers to non-consumption data)
Step 2: Select two folds as training data, develop a model there, and test model performance in the testing data
Step 3: Repeat the above procedure three times by changing the testing data
Source: Adapted from Yoshida et al. 2015
This cross-validation exercise also determines the optimal p-value for the subsequent stepwise regressions – that is, the p-value that minimises the difference between actual and model-projected poverty rates. To do this, we repeat the exercise for a range of p-values between 0.1 percent and 10 percent, and also examine the mean squared error to check for over-fitting. We illustrate this second cross-validation exercise in Figure VI.2.
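The search over candidate p-values can be sketched as a simple grid search. The sketch below is deliberately simplified: marginal-correlation screening stands in for a full stepwise regression, headcounts are projected from point predictions rather than the simulation described in footnote 74, and all names are hypothetical.

```python
import math
import numpy as np

def poverty_gap(X_tr, y_tr, X_te, y_te, p_max, log_line):
    """For an entry threshold p_max, keep regressors whose marginal
    t-test p-value is below p_max (a simplified stand-in for stepwise
    entry), fit OLS on the training data, and return the absolute gap
    between actual and projected poverty headcounts on the test data."""
    n = len(y_tr)
    sel = []
    for j in range(X_tr.shape[1]):
        r = np.corrcoef(X_tr[:, j], y_tr)[0, 1]
        t = r * math.sqrt((n - 2) / (1 - r * r))
        if math.erfc(abs(t) / math.sqrt(2)) < p_max:
            sel.append(j)
    A_tr = np.column_stack([np.ones(n), X_tr[:, sel]])
    beta, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    A_te = np.column_stack([np.ones(len(y_te)), X_te[:, sel]])
    projected = np.mean(A_te @ beta < log_line)   # projected headcount
    actual = np.mean(y_te < log_line)             # actual (surveyed) headcount
    return abs(projected - actual)

def choose_p_value(X_tr, y_tr, X_te, y_te, log_line, grid):
    """Pick the threshold in `grid` minimising the poverty-rate gap."""
    gaps = [poverty_gap(X_tr, y_tr, X_te, y_te, p, log_line) for p in grid]
    return grid[int(np.argmin(gaps))]
```

In the text, the grid runs from 0.1 percent to 10 percent and the gap is averaged across the 10 folds rather than computed on a single split as here.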
Figure VI.2: Results of the cross-validation exercise
Source: Author’s calculation based on NRVA 2011-12 data. Calculation excludes data from Helmand and Khost
provinces.
[Figure panels omitted: the absolute difference between actual and projected poverty rates and the mean squared error, each plotted against p-values from 0.02 to 0.1]
As seen in Figure VI.2, the absolute value of the difference between actual and projected poverty rates fluctuates, but it clearly increases above a p-value of 6 percent. We therefore chose 6 percent as the optimal p-value for the subsequent stepwise OLS regression on the full sample to estimate a national model, as outlined in Table VI.5.
VI.4 Simulation and estimation of poverty rates