01b SOC681 Data Preparation and Screening
DATA PREPARATION AND SCREENING
James G. Anderson, Ph.D.
Importance
• Model specification
• Failure of model fitting
• Problems with parameter estimates
• Problems with tests of significance
Categories of Problems
• Case-related Issues
– Missing Observations
– Outliers
• Distributional/Relational Issues
– Normality
– Linearity
– Homoscedasticity
Missing Data
• Missing Completely at Random (MCAR) – The
missing data is entirely unrelated statistically to the
values that would have been observed.
• Missing at Random (MAR) – Missingness and the data
values are statistically unrelated, conditional on a set of
predictors or stratifying variables.
• Nonignorable Missing Data (NMD) – The missing
data convey probabilistic information about the
values that would have been observed, beyond the
information provided in the observed data.
Methods for Dealing with Missing
Data
• Listwise deletion
• Pairwise deletion
• Mean replacement
• Regression replacement
• Pattern matching
• Maximum likelihood
Listwise Deletion (LD)
• Eliminates observations where there is any
data value missing.
• Limitations:
– Discards other information that the respondent
provided
– Reduces sample size significantly
Pairwise Deletion (PD)
• Excludes an observation from a calculation only
when it is missing a value needed for that
particular calculation.
• Limitations:
– Each mean, variance, covariance, etc. that is calculated
is based on a different sample size.
– Pairwise deletion may lead to out-of-bounds values,
resulting in nonpositive definite/singular covariance
matrices, negative variances, etc.
– Pairwise deletion is not recommended for SEM
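The contrast between the two deletion strategies can be sketched on a small hypothetical dataset with missing entries coded as NaN; note how pairwise deletion gives a different effective sample size for each variable pair:

```python
import numpy as np

# Toy data with missing values (NaN); a hypothetical example
X = np.array([
    [1.0, 2.0, 3.0],
    [4.0, np.nan, 6.0],
    [7.0, 8.0, np.nan],
    [2.0, 3.0, 4.0],
])

# Listwise deletion: keep only complete cases
complete = X[~np.isnan(X).any(axis=1)]
print(complete.shape[0])  # 2 cases survive out of 4

# Pairwise deletion: each statistic uses whatever pairs are observed,
# so the effective n differs cell by cell
mask = ~np.isnan(X)
pair_n = mask.astype(int).T @ mask.astype(int)
print(pair_n)  # n available for each variable pair
```

Because each covariance in the pairwise case rests on a different n, the assembled matrix need not be a legitimate covariance matrix, which is the out-of-bounds problem noted above.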
Mean Imputation (MI)
• Replaces the missing value with an estimate
of the value based on the complete data.
(e.g., the mean of the value for those
persons who reported the data)
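A minimal sketch of mean replacement on hypothetical data; each missing entry is filled with the mean of the values that were reported for that variable:

```python
import numpy as np

# Mean replacement: fill each missing value with that variable's
# observed mean (hypothetical two-variable data)
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 14.0]])

col_means = np.nanmean(X, axis=0)            # means from the complete data
filled = np.where(np.isnan(X), col_means, X)
print(filled[1, 1])  # 12.0, the mean of 10 and 14

# Side effect: imputing at the mean shrinks the variable's variance
# relative to the truth, which is why MI biases covariance estimates.
```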
Data Imputation (AMOS)
• Regression Imputation. The model is initially
fitted with ML. After setting model parameters to
their ML estimates, linear regression is used to
predict unobserved values for each case as a linear
combination of the observed values for the same
case.
• Stochastic Regression Imputation. Imputes
values for each case by drawing at random from
the conditional distribution of the missing values
given the observed values with the unknown
model parameters fixed at their ML estimates.
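The difference between plain and stochastic regression imputation can be sketched with a single predictor (hypothetical data; a simple OLS fit stands in for the ML-fitted model): the stochastic variant adds a random residual draw so the imputed values retain variability around the regression line.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical: y has missing entries; x is fully observed
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, np.nan, 8.2, np.nan])
obs = ~np.isnan(y)

# Fit the regression on the observed cases
slope, intercept = np.polyfit(x[obs], y[obs], 1)
resid_sd = np.std(y[obs] - (slope * x[obs] + intercept), ddof=2)

# Regression imputation: the predicted value only
y_reg = np.where(obs, y, slope * x + intercept)

# Stochastic regression imputation: add a draw from the residual
# distribution, restoring variability around the line
noise = rng.normal(0.0, resid_sd, x.size)
y_stoch = np.where(obs, y, slope * x + intercept + noise)
```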
Data Imputation (AMOS)
• Bayesian Imputation. Like stochastic
regression imputation, except that it takes
into account the fact that the parameter
values are only estimated, not known.
Performance of the Various Methods
to Deal with Missing Data
• When the missing data are MCAR (missingness is
entirely unrelated statistically to the values that
would have been observed):
– PD, LD and FIML all yield consistent solutions
– PD and LD are not as efficient as FIML
– MI yields consistent estimates of the first moments
(means) but biased variance and covariance estimates.
– MI is not recommended for structural equation
modeling which is based on variance and covariance
information.
Performance of the Various Methods
to Deal with Missing Data
• When the missing data are MAR
(missingness and data values are
statistically unrelated conditional on a set of
predictor or stratifying variables):
– PD, LD, and MI can produce severely biased
results, independent of the sample size.
– FIML yields parameter estimates that are
consistent and efficient.
Performance of the Various Methods
to Deal with Missing Data
• When the missing data are nonignorable
(missingness conveys probabilistic information
about the values that would have been observed):
– All standard multivariate approaches can yield biased
results.
– There is some evidence, however, that FIML estimates
tend to be less biased than other methods.
– FIML is recommended for handling missing data.
NORMALITY
• Many SEM estimation procedures assume
multivariate normal distributions
• Lack of univariate normality is indicated when the
absolute skew index is > 3.0 or the kurtosis index is > 10.
• Departures from multivariate normality can be detected
by indices of multivariate skew or kurtosis
• Non-normal distributions can sometimes be
corrected by transforming variables
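The screening rule above can be sketched with moment-based skew and kurtosis indices on simulated data (the cutoffs of 3 and 10 come from the slide; the lognormal sample and log transform are illustrative assumptions):

```python
import numpy as np

def skew_kurtosis(x):
    """Univariate skew and excess kurtosis from standardized moments."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return (z ** 3).mean(), (z ** 4).mean() - 3.0

rng = np.random.default_rng(1)
sample = rng.lognormal(mean=0.0, sigma=1.5, size=5000)  # strongly skewed

skew, kurt = skew_kurtosis(sample)
flagged = abs(skew) > 3 or kurt > 10   # the screening rule quoted above

# A log transform often corrects positive, right-skewed variables
skew_log, _ = skew_kurtosis(np.log(sample))
```

After the transform the skew index drops to near zero, illustrating how non-normal distributions can sometimes be corrected by transforming variables.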
OUTLIERS
• Univariate outliers are scores more than three SDs away
from the mean
• Detection by inspecting frequency distributions and
univariate measures of skewness and kurtosis
• Multivariate outliers may have extreme scores on two
or more variables, or their configurations of scores may
be unusual
• Detection by inspecting indices of multivariate
skewness and kurtosis. Squared Mahalanobis distance
is distributed as chi-square with df equal to the
number of variables.
• Can be remedied by correcting errors, by dropping
these cases, or by transforming the variables
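A sketch of the multivariate screen on simulated bivariate data (the planted outlier and the .999 cutoff are illustrative choices; for df = 2 the chi-square critical value has a closed form, which avoids needing a distribution library):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
X[0] = [6.0, -6.0]                 # planted multivariate outlier

# Squared Mahalanobis distance for every case
mu = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum('ij,jk,ik->i', X - mu, S_inv, X - mu)

# Chi-square cutoff with df = number of variables. For df = 2 the
# CDF is 1 - exp(-x/2), so the .999 critical value is exactly:
cutoff = -2.0 * np.log(0.001)      # about 13.82
outliers = np.where(d2 > cutoff)[0]
```

The planted case lands far beyond the cutoff even though neither of its coordinates alone is wildly extreme relative to the inflated sample covariance.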
MULTICOLLINEARITY
• Occurs when intercorrelations among some variables are so high that
certain mathematical operations are impossible or results are unstable
because denominators are close to 0.
• Bivariate correlations > 0.85; multiple correlations > 0.90
• May cause a nonpositive definite/singular covariance matrix
• May be due to including both individual variables and composites
built from them
• Detection: Tolerance = 1 - R2 < 0.10;
Variance Inflation Factor (VIF) = 1/(1 - R2) > 10
• Can be corrected by eliminating or combining redundant variables
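The tolerance and VIF diagnostics above can be sketched by regressing each predictor on the others (hypothetical data; here x3 nearly duplicates x1, so its tolerance collapses and its VIF explodes):

```python
import numpy as np

def tolerance_vif(X, j):
    """Tolerance and VIF for column j: regress X[:, j] on the others."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])   # add intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r2 = 1.0 - (y - A @ beta).var() / y.var()
    tol = 1.0 - r2
    return tol, 1.0 / tol                            # VIF = 1/(1 - R2)

rng = np.random.default_rng(3)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = x1 + rng.normal(scale=0.05, size=100)  # near-duplicate of x1
X = np.column_stack([x1, x2, x3])

tol, vif = tolerance_vif(X, 2)   # screen x3: tol < 0.10, VIF > 10
```

Dropping x3, or combining it with x1 into a single composite, removes the redundancy.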
RELATIVE VARIANCES
• Covariance matrices where the ratio of the largest to
the smallest variance is greater than 10 are ill scaled
• Most SEM estimation methods are iterative
• Estimates may not converge to stable values when
variances of observed variables are very different in
magnitude
• To prevent this problem, variables with extremely low
or high variances can be rescaled by multiplying or
dividing observed scores by a constant. This changes
a variable's mean and variance but not its correlations
with other variables.
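The rescaling remedy can be sketched on hypothetical data: dividing an ill-scaled variable by a constant shrinks its variance into the same range as the others while leaving every correlation unchanged.

```python
import numpy as np

rng = np.random.default_rng(4)
income = rng.normal(50_000, 10_000, size=300)      # variance ~ 1e8
outcome = income / 10_000 + rng.normal(size=300)   # variance ~ 2

ratio = income.var() / outcome.var()   # badly ill scaled (>> 10)

income_k = income / 10_000             # express income in $10k units
r_before = np.corrcoef(income, outcome)[0, 1]
r_after = np.corrcoef(income_k, outcome)[0, 1]
# r_before == r_after up to floating point; the variance ratio
# of the rescaled pair is now modest, so iteration can converge
```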
LINEARITY
• SEMs assume linearity in the relations
among the variables
• Estimation of curvilinear and interactive
effects is possible.
VIOLATIONS OF
ASSUMPTIONS
• The best-known distribution with no excess
kurtosis is the multinormal.
• Leptokurtic (more peaked) distributions
result in too many rejections of H0 based on
the chi-square statistic.
• Platykurtic (flatter) distributions lead to
underestimates of the chi-square statistic.
VARIABLE SCALES
• SEM in general assumes observed variables are
measured on a linear continuous scale
• Dichotomous and ordinal variables cause problems
because correlations/covariances tend to be truncated.
These scores are not normally distributed and responses
to individual items may not be very reliable.
• Some SEM programs like LISCOMP can analyze
dichotomous and ordinal variables
• PRELIS can be used to prepare a corrected covariance
matrix for non-continuous variables.
VIOLATIONS OF
ASSUMPTIONS
• High degrees of skewness lead to
excessively large chi-square estimates.
• In small samples (N