An Introduction to Stata
Ekki Syamsulhakim
ekki.syamsulhakim@fe.unpad.ac.id
Yangki Imade Swara
Yangki.swara@fe.unpad.ac.id
Session 2
• Importing Data from Excel
• Creating graphs / scatterplot
• Dropping variable(s)
• Keeping variable(s)
• Creating new variable(s)
• Combining dataset
Session 3
• Merging Dataset
• Some important notes on doing applied economic
research
• Introduction to IFLS Data
• The Books
• The Codebooks
• The Data
• Cleaning IFLS Data
Merge
Exercise: merging
Make a do-file!
• clear
• cd D:\stata-training\data
• use nlsw88_indchar
• sort idcode
• des
• save nlsw88_indchar, replace
Exercise: merging
Continue the do file
• clear
• use nlsw88_empl
• sort idcode
• des
• save nlsw88_empl, replace
• use nlsw88_indchar, clear
• merge 1:1 idcode using nlsw88_empl
• drop _merge
• save nlsw88_merged, replace
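Before dropping _merge it can help to look at it, because it records which observations were matched. A minimal sketch, assuming Stata's built-in nlsw88 data stand in for the training files (the split into two pieces and the tempfile name empl are hypothetical, not the course data):

```stata
* Build two hypothetical pieces of nlsw88, then merge them back on idcode
sysuse nlsw88, clear
preserve
    keep idcode wage hours occupation industry
    tempfile empl
    save `empl'
restore
keep idcode age race married grade

merge 1:1 idcode using `empl'
tabulate _merge          // 1 = master only, 2 = using only, 3 = matched
assert _merge == 3       // stop the do-file if anything failed to match
drop _merge
```

Run the block in one go, since the tempfile name lives in a local macro. The assert line is optional, but it makes an incomplete merge fail loudly instead of passing unnoticed.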
Checking for duplicates
Continue the do file
• duplicates report idcode
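A 1:1 merge only works when the key identifies each observation uniquely, so it is worth checking this first. A short sketch on the built-in nlsw88 data (used here as a stand-in for the training files; the variable name dup is hypothetical):

```stata
sysuse nlsw88, clear
duplicates report idcode               // table of how many copies each idcode has
isid idcode                            // errors out if idcode is not a unique key
duplicates tag idcode, generate(dup)   // dup = number of extra copies per observation
assert dup == 0
```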
if qualifier
• Adding if to the end of a command restricts the command to the
observations that satisfy the condition. The if qualifier is
allowed with most Stata commands.
• tab occupation never_married if age>40, missing
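One point to keep in mind with if: Stata treats missing values as larger than any number, so a condition such as x>0 is also true where x is missing. A small sketch on the built-in nlsw88 data (chosen for illustration because its union variable has missing values):

```stata
sysuse nlsw88, clear
count if age > 40
tabulate occupation never_married if age > 40, missing

* Missing values sort above every number, so this also counts missing union
count if union > 0
* Excluding them explicitly gives the count of actual union members
count if union > 0 & !missing(union)
```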
IFLS
INDONESIAN FAMILY LIFE SURVEY
IFLS
Some practical guide
Getting familiar with the data
• Rule #3: Know the context! (Kennedy, 2000, page 5)
• “It is crucial that one become intimately familiar with the
phenomenon being investigated — its history, institutions,
operating constraints, measurement peculiarities, cultural customs,
and so on, going beyond a thorough literature review.”
Getting familiar with the data
• Rule #3: Know the context! (Kennedy, 2000, page 5)
• “Exactly how were the data gathered? Did government agencies
impute the data using unknown formulas? How were the
interviewees selected? What instructions were given to the
participants? What accounting conventions were followed? How
were the variables defined? What is the precise wording of the
questionnaire? How closely do measured variables match their
theoretical counterparts?”
Getting familiar with the data
• Rule #4: Inspect the Data! (Kennedy, 2000, page 5-6)
• “Inspecting the data involves summary statistics, graphs, and data
cleaning, to both check and ‘get a feel for’ the data. Summary
statistics can be very simple, such as calculating means, standard
errors, maximums, minimums, and correlation matrices,…”
• “The advantage of graphing is that graphics broadcast whereas
statistics narrowcast, or, as Tukey (1977, p.vi) notes: ‘The greatest
value of a picture is when it forces us to notice what we never
expected to see’.”
Getting familiar with the data
• Rule #4: Inspect the Data! (Kennedy, 2000, page 5-6)
• “Data cleaning looks for inconsistencies in the data — are any
observations impossible, unrealistic, or suspicious? According to
Rao (1997, p. 152), ‘Every number is guilty unless proved
innocent’…”
• “…Do you know how missing data were coded? Are dummies all
coded zero or one? Are any observations born in two different
months? Do all observations obey logical constraints they must
satisfy?”
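The checks in Rule #4 translate into a handful of Stata commands. A minimal sketch on the built-in nlsw88 data, used only as a stand-in (the IFLS variables will differ, and the constraints tested here are examples rather than a complete list):

```stata
sysuse nlsw88, clear
describe                        // variable names, types and labels
summarize age wage hours        // means, standard deviations, min and max
correlate wage hours tenure     // correlation matrix
misstable summarize             // which variables have missing values, and how many

* Are dummies all coded zero or one?
assert inlist(married, 0, 1) if !missing(married)
* Do observations obey logical constraints?
assert hours >= 0 if !missing(hours)
assert age > 0 & age < 120
```

Each assert is silent when the condition holds and stops the do-file when it does not, which makes it a cheap way to document what you believe about the data. A graph such as a scatterplot of two key variables is the natural next step.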
Getting familiar with the data
#3. Thou shalt know the context.
• Corollary: Thou shalt not perform ignorant statistical analyses.
#4. Thou shalt inspect the data.
• Corollary: Thou shalt place data cleanliness ahead of econometric
godliness.
Getting familiar with the data
• Read the user’s guide(s) of IFLS 3 Data
• Get to know about
• The story / history of IFLS (including IFLS 1, 2…)
• The sampling mechanism
• The questionnaire(s)
• Which software can be used to analyse the data
• Also, read the codebook
Know the context
• Even though the IFLS records “income”, both labor
(Book IIIA, module TK1) and non-labor (Book II, module
HI), we must admit that respondents may understate or
overstate their income
Know the context
• It may be useful to use “expenditure” instead
• Often used as a better proxy of “permanent income” (Cornwell,
2009, among others)
• However, households may borrow, so we cannot assume that
expenditure equals income
Our Case
• We want to know the relationship between Subjective
Well-Being (SWB) and Income
• Even though what we are doing is probably a ‘stylised
fact’, we need an economic theory (perhaps combined with
psychological theory) to explain the relationship
• We also want to find other variables affecting SWB
Model and Variables
• We will use a Linear Probability Model and a Probit model to
estimate the equation
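For readers who have not run these two estimators before, here is a sketch of the commands. It uses the built-in nlsw88 data with union membership as the 0/1 outcome; the outcome and regressors are stand-ins chosen for illustration and are not the happiness equation of this training.

```stata
sysuse nlsw88, clear

* Linear probability model: OLS on a 0/1 outcome.
* Robust standard errors, because LPM errors are heteroskedastic by construction.
regress union wage age collgrad, vce(robust)

* Probit on the same specification
probit union wage age collgrad

* Average marginal effects, on the same scale as the LPM coefficients
margins, dydx(*)
```

Comparing the LPM coefficients with the probit average marginal effects is a common sanity check; the two are often close.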
Variables
Variable (D or C) | Question in Questionnaire | Instrument for the variable | Book
Happiness (D) | Taken all things together how would you say things are these days | SW12 | 3A
Age (C) | Age; Birth date | AR09; BTH_DAY | K; Ptrack
Gender (D) | Sex | AR07; SEX | K; Ptrack
Level of Education (D) | Highest level of schooling ever completed by HHM | AR16 | K
Marital Status (C) | Marital Status | AR13 | K
Variables
Variable (D or C) | Question in Questionnaire | Instrument for the variable | Book
HH Income (C) | Proxied by total HH expenditure per adult (built from many questions) | Downloaded from RAND website (made by Firman Witoelar) | 3A; 3B
Optimism (D) | Knowing about how prices change in recent year, do you think you can keep the standard of living you have today in the next 5 years? | SW03A | 3A
Working (D) | What was …’s primary activity past week? | AR15c | K
Variables
Variable | Question in Questionnaire | Code | Book
Ethnicity | Ethnicity | AR15D | K
Has children (D) | Relationship with HH Head | AR02 | K; Ptrack
Urban (D) | Urban / rural residence | SC_0597 | Htrack
Location (D) | Provincial codes | SC_01xx | Htrack
Other relevant instruments will be used later when constructing the variables
IFLS: Book and DTA.file
Codebook
Downloading IFLS Data
• Create a folder for original IFLS Data
• For this training, create D:\stata-training\IFLS4m
• Download from https://sites.google.com/a/fe.unpad.ac.id/ekki/stata-training
• bk_ar1.dta: save in D:\stata-training\IFLS4m
• All IFLS4-*.zip files: save in D:\stata-training\IFLS4m
From instrument to variable
• Create a do file, save it as “IFLS-training-step1.do”
* STEP 1 *
clear
* CREATING A MACRO TO DEFINE FOLDERS
* ---------------------------------
global dir00 "D:\stata-training\log\"
global dir01 "D:\stata-training\data\"
global dir02 "D:\stata-training\output\"
* Directories being used to get original data
* ----------------------
global dir1 "D:\stata-training\IFLS4m"
* STEP 2 *
* THIS DO-FILE CONTAINS STEPS
* IN CLEANING DATA FOR HAPPINESS MODEL USING IFLS 4
* By Ekki Syamsulhakim, CEDS UNPAD
clear
* set mem is needed only in Stata 11 or earlier; newer versions ignore it
set mem 200m
* Loading Original Data
use $dir1\bk_ar1, clear
des
sort pidlink
* Keeping important variables
keep ar01a ar02 ar02b ar07 ar07x ar09 ar10 ar11 ar13 ///
ar15c ar15d ar16 ar17 ar18h hhid07 pid07 pidlink
* Saving to modified data folder
* NOTE: the file name is garbled in the source ("bk_ar1!");
* bk_ar1_m is a placeholder, use the name given in class
save "${dir01}bk_ar1_m", replace
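The folder macros above only pay off if the paths expand the way you expect. Here is a sketch of one way to set up and test such macros, including a log file. The names root and dirlog and the file step1.log are hypothetical, and the current working directory is used as the root so the sketch runs on any machine:

```stata
* Hypothetical root folder: the current working directory
global root "`c(pwd)'"
capture mkdir "${root}/log"      // capture: no error if the folder already exists
global dirlog "${root}/log/"

* Keep a log of everything the do-file prints
capture log close
log using "${dirlog}step1.log", replace text
sysuse nlsw88, clear
describe, short
log close

* Pitfall: a backslash placed directly before $ stops macro expansion,
* so forward slashes or braces such as ${dirlog} are the safer habit
display "${dirlog}step1.log"
```

Forward slashes work in Stata on Windows as well, which also makes the do-file portable to other systems.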
Next step
• Renaming variables
• Which is better: SEX or MALE or FEMALE?
• Which is better: AR15c or Activity_Past_Week or ActvtPastWk?
• Should we change PIDLINK, PID07 or HHID07?
• Generating variables to be used in regression
• Merging with other DTA files
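Merging individual-level data with household-level data is a many-to-one merge rather than the 1:1 merge from the exercise. A sketch on toy data follows. The identifiers hhid07 and pid07 are taken from the keep list above, but every value is invented and the real IFLS files will look different:

```stata
* Toy household-level file: one row per household
clear
input str3 hhid07 byte urban
"001" 1
"002" 0
end
tempfile hh
save `hh'

* Toy individual-level file: several members per household
clear
input str3 hhid07 byte pid07 byte age
"001" 1 40
"001" 2 38
"002" 1 55
end

* m:1 because many individuals share one household record
merge m:1 hhid07 using `hh'
assert _merge == 3
drop _merge
list, sepby(hhid07)
```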
Renaming variables
* renaming variables
rename ar01a hhm_lives_inhh
rename ar02 rel_to_hhhead
* NOTE: the new name for ar07x is garbled in the source ("male!");
* male_x is a placeholder, since "!" is not allowed in a Stata variable name
rename ar07x male_x
rename ar07 male
rename ar09 age
rename ar02b rel_to_hh
rename ar10 id_num_father
rename ar11 id_num_mother
rename ar13 marital_status
rename ar15d ethnic
rename ar15c activt_pstwk
rename ar16 educ_lvl
rename ar17 educ_grade
rename ar18h alive
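Two habits make renamed variables easier to work with: labelling them, and naming a dummy after the category coded 1 (which speaks to the SEX versus MALE question). A sketch on the built-in nlsw88 data, where the new names total_exper, central_city and fulltime and the 35-hour cutoff are hypothetical choices:

```stata
sysuse nlsw88, clear

* Several renames in one command (group syntax, Stata 12 or later)
rename (ttl_exp c_city) (total_exper central_city)
label variable total_exper "Total work experience (years)"

* A dummy named after the category coded 1 needs no lookup to interpret
generate byte fulltime = (hours >= 35) if !missing(hours)
label define yesno 0 "No" 1 "Yes"
label values fulltime yesno
tabulate fulltime, missing
```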
Generating Married Dummy Variables
• First we want to create a dummy variable “married”
• 1 if married, 0 otherwise
• In IFLS, marital status is coded as:
1. Not married
2. Married
3. Separated
4. Divorced
5. Widow/er
6. Don’t know and missing (the do-file tests for these as codes 8 and 9)
Generating Married Dummy Variables
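* dummy variable married: 1 if married, 0 otherwise
gen married=1 if marital_status==2
replace married=0 if marital_status!=2
* don't know (8), missing (9) and empty answers carry no information
replace married=. if marital_status==8 | marital_status==9 | missing(marital_status)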
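It is worth cross-tabulating a new dummy against the variable it came from. A sketch on toy data (the values are invented; only the variable name marital_status and the codes 2, 8 and 9 come from the slides), which also shows a one-line way to build the dummy:

```stata
* Toy data covering every code plus an empty answer
clear
input byte marital_status
1
2
3
4
5
8
9
.
end

* (marital_status == 2) evaluates to 1 or 0; the if clause leaves
* don't know (8), missing (9) and empty answers as missing
generate byte married = (marital_status == 2) ///
    if !inlist(marital_status, 8, 9) & !missing(marital_status)

tabulate marital_status married, missing
assert married == 1 if marital_status == 2
assert missing(married) if inlist(marital_status, 8, 9) | missing(marital_status)
```

The missing option in tabulate matters here: without it, the rows with a missing dummy would drop out of the table and the check would look cleaner than it is.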