An Introduction to Stata

Ekki Syamsulhakim
ekki.syamsulhakim@fe.unpad.ac.id
Yangki Imade Swara
Yangki.swara@fe.unpad.ac.id

Session 2
• Importing Data from Excel
• Creating graphs / scatterplot
• Dropping variable(s)
• Keeping variable(s)
• Creating new variable(s)
• Combining datasets

Session 3
• Merging datasets
• Some important notes on doing applied economic research
• Introduction to IFLS Data

• The Books
• The Codebooks
• The Data

• Cleaning IFLS Data

Merge


Exercise: merging
Make a do-file!
• clear
• cd D:\stata-training\data
• use nlsw88_indchar
• sort idcode
• des
• save nlsw88_indchar, replace

Exercise: merging
Continue the do file
• clear
• use nlsw88_empl
• sort idcode
• des
• save nlsw88_empl, replace
• use nlsw88_indchar, clear
• merge 1:1 idcode using nlsw88_empl
• drop _merge
• save nlsw88_merged, replace
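The merge exercise above can also be tried without the course files. The following is a minimal sketch that assumes only the nlsw88 dataset shipped with Stata; the two temporary files (indchar, empl) and the variables kept in each are illustrative choices, not the contents of nlsw88_indchar or nlsw88_empl.

```stata
* Sketch only: split Stata's shipped nlsw88 data into two files, then merge them back
* (the tempfile names and variable choices are assumptions for illustration)
clear
sysuse nlsw88, clear

* File A: individual characteristics
preserve
keep idcode age race married
sort idcode
tempfile indchar
save `indchar'
restore

* File B: employment variables
keep idcode wage hours occupation
sort idcode
tempfile empl
save `empl'

* One-to-one merge on the person identifier
use `indchar', clear
merge 1:1 idcode using `empl'

* Inspect the match result before dropping _merge
tab _merge
assert _merge == 3
drop _merge
```

Tabulating _merge before dropping it shows whether any observation appeared in only one of the two files; the assert stops the do-file if that happens.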


Checking for duplicates
Continue the do file
• duplicates report idcode
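A short sketch of how a duplicate identifier shows up, assuming the nlsw88 data shipped with Stata. The duplicate is created on purpose with expand, and the variable name dup is hypothetical.

```stata
* Sketch only: create one deliberate duplicate and see how Stata reports it
sysuse nlsw88, clear
duplicates report idcode
isid idcode                      // exits with an error if idcode is not unique

* Copy the first observation so that its idcode appears twice
expand 2 in 1
duplicates report idcode
duplicates tag idcode, generate(dup)
list idcode age dup if dup > 0

* Keep one copy per idcode and confirm the identifier is unique again
duplicates drop idcode, force
isid idcode
```

isid is a convenient check to run just before merge 1:1, because the merge fails if the key is not unique.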

if command
• An if qualifier at the end of a command restricts the command to the observations that satisfy the condition. if is allowed with most Stata commands.
• tab occupation never_married if age>40, missing
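A few more uses of the if qualifier, sketched on the nlsw88 data shipped with Stata (assumed here to stand in for the course file). Note that Stata treats numeric missing values as larger than any number, so a condition such as age>40 also selects observations where age is missing unless they are excluded explicitly.

```stata
* Sketch only: the if qualifier with several commands
sysuse nlsw88, clear

* Count observations that satisfy a condition
count if age > 40

* Exclude missing values explicitly when using ">"
count if age > 40 & !missing(age)

* Combine conditions with & (and) and | (or)
summarize wage if married == 1 & collgrad == 1
tab occupation never_married if age > 40, missing
```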

IFLS
INDONESIAN FAMILY LIFE SURVEY


Some practical guidance

Getting familiar with the data
• Rule #3: Know the context! (Kennedy, 2000, page 5)
• “It is crucial that one become intimately familiar with the phenomenon being investigated — its history, institutions, operating constraints, measurement peculiarities, cultural customs, and so on, going beyond a thorough literature review.”

Getting familiar with the data
• Rule #3: Know the context! (Kennedy, 2000, page 5)
• “Exactly how were the data gathered? Did government agencies impute the data using unknown formulas? How were the interviewees selected? What instructions were given to the participants? What accounting conventions were followed? How were the variables defined? What is the precise wording of the questionnaire? How closely do measured variables match their theoretical counterparts?”


Getting familiar with the data
• Rule #4: Inspect the Data! (Kennedy, 2000, page 5-6)
• “Inspecting the data involves summary statistics, graphs, and data cleaning, to both check and ‘get a feel for’ the data. Summary statistics can be very simple, such as calculating means, standard errors, maximums, minimums, and correlation matrices,…”
• “The advantage of graphing is that graphics broadcast whereas statistics narrowcast, or, as Tukey (1977, p.vi) notes: ‘The greatest value of a picture is when it forces us to notice what we never expected to see’.”

Getting familiar with the data
• Rule #4: Inspect the Data! (Kennedy, 2000, page 5-6)
• “Data cleaning looks for inconsistencies in the data — are any observations impossible, unrealistic, or suspicious? According to Rao (1997, p. 152), ‘Every number is guilty unless proved innocent’…”
• “…Do you know how missing data were coded? Are dummies all coded zero or one? Are any observations born in two different months? Do all observations obey logical constraints they must satisfy?”
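The checks named in Rule #4 map onto a handful of Stata commands. The sketch below assumes the nlsw88 data shipped with Stata rather than IFLS, and the two logical constraints it tests are illustrative examples only.

```stata
* Sketch only: a first inspection pass over a dataset
sysuse nlsw88, clear

* Summary statistics and variable definitions
describe
summarize
correlate wage hours tenure

* How many missing values does each variable have?
misstable summarize

* Are dummies all coded zero or one?
assert inlist(married, 0, 1) if !missing(married)

* Logical constraint (illustrative): hours worked cannot be negative
assert hours >= 0 if !missing(hours)
```

assert prints nothing when the condition holds for every observation and stops the do-file with an error otherwise, which makes it a cheap way to document what the data must satisfy.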

Getting familiar with the data
#3. Thou shalt know the context.
• Corollary: Thou shalt not perform ignorant statistical analyses.

#4. Thou shalt inspect the data.
• Corollary: Thou shalt place data cleanliness ahead of econometric godliness.

Getting familiar with the data
• Read the user’s guide(s) for the IFLS 3 data
• Get to know
• The story / history of IFLS (including IFLS 1, 2…)
• The sampling mechanism
• The questionnaire(s)
• The software that can be used to analyse the data
• Also, read the codebook

Know the context
• Even though the IFLS records “income”, both labor (Book IIIA, module TK1) and non-labor (Book II, module HI), we must accept that respondents may under- or overstate their income

Know the context
• It may also be useful to use “expenditure” instead
• Often used as a better proxy of “permanent income” (Cornwell, 2009, among others)
• However, we cannot assume that borrowing is not possible


Our Case
• We want to know the relationship between subjective well-being (SWB) and income
• Even though what we are doing is probably a ‘stylised fact’, we need an economic theory (perhaps combined with psychological theory) to explain the relationship
• We also want to find other variables affecting SWB

Model and Variables


We will use a linear probability model (LPM) and a probit model to estimate the equation
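As a rough illustration of the two estimators, the sketch below fits an LPM and a probit to a binary outcome. It assumes the nlsw88 data shipped with Stata, with union membership standing in for the happiness dummy; the regressors are arbitrary and are not the specification used in this training.

```stata
* Sketch only: LPM and probit on a 0/1 outcome (union is a stand-in for happiness)
sysuse nlsw88, clear

* Linear probability model: OLS on a binary outcome.
* Robust standard errors, because LPM errors are heteroskedastic by construction.
regress union wage age i.married, vce(robust)

* Probit with the same regressors
probit union wage age i.married

* Average marginal effects, which are comparable in scale to the LPM coefficients
margins, dydx(*)
```

Probit coefficients are not probability changes, so margins is the usual way to put the two sets of results side by side.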

Variables
Variable (D or C) | Question in Questionnaire | Instrument for the variable | Book
Happiness (D) | Taken all things together how would you say things are these days | SW12 | 3A
Age (C) | Birth date; Age | AR09; BTH_DAY | K; Ptrack
Gender (D) | Sex | AR07; SEX | K; Ptrack
Level of Education (D) | Highest level of schooling ever completed by HHM | AR16 | K
Marital Status (C) | Marital Status | AR13 | K

Variables
Variable (D or C) | Question in Questionnaire | Instrument for the variable | Book
HH Income (C) | Proxied by Total HH expenditure per adults (lots of questions!) | Downloaded from RAND website (made by Firman Witoelar) | 3A; 3B
Optimism (D) | Knowing about how prices change in recent year, do you think you can keep the standard of living you have today in the next 5 years? | SW03A | 3A
Working (D) | What was …’s primary activity past week? | AR15c | K

Variables
Variable | Question in Questionnaire | Code | Book
Ethnicity | Ethnicity | AR15D | K
Has children (D) | Relationship with HH Head | AR02 | K; Ptrack
Urban (D) | Urban / rural residence | SC_0597 | Htrack
Location (D) | Provincial codes | SC_01xx | Htrack

And other relevant instruments used to construct the variables later

IFLS: Book and DTA.file

Codebook

Downloading IFLS Data
• Create a folder for original IFLS Data
• For this training, create D:\stata-training\IFLS4m
• Download from https://sites.google.com/a/fe.unpad.ac.id/ekki/stata-training
• bk_ar1.dta: save in d:\stata-training\IFLS4m
• All IFLS4-*.zip files: save in d:\stata-training\IFLS4m

From instrument to variable
• Create a do file, save it as “IFLS-training-step1.do”

* STEP 1 *
clear
* CREATING A MACRO TO DEFINE FOLDERS
* ---------------------------------
global dir00 "D:\stata-training\log\"
global dir01 "D:\stata-training\data\"
global dir02 "D:\stata-training\output\"
* Directories being used to get original data
* ----------------------
global dir1 "D:\stata-training\IFLS4m"
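How a global macro expands inside a file path can be checked with a small round trip. This sketch assumes the auto data shipped with Stata and writes to the current working directory; the names demo_dir and demo_auto are hypothetical and are not part of the training folders.

```stata
* Sketch only: define a folder macro, then save and reload a file through it
* (demo_dir and demo_auto are made-up names for illustration)
global demo_dir "`c(pwd)'"
display "$demo_dir"

sysuse auto, clear
* Quoting the whole path keeps it intact if a folder name contains spaces
save "$demo_dir/demo_auto", replace
use "$demo_dir/demo_auto", clear

* Remove the demonstration file again
erase "$demo_dir/demo_auto.dta"
```

Forward slashes work in Stata paths on Windows as well. They also avoid a known trap: a backslash placed directly before a macro reference suppresses the macro expansion.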

* STEP 2 *
* THIS DO-FILE CONTAINS STEPS
* IN CLEANING DATA FOR HAPPINESS MODEL USING IFLS 4
* By Ekki Syamsulhakim, CEDS UNPAD
clear
* set mem is only needed in Stata 11 and earlier; later versions ignore it
set mem 200m
* Loading Original Data
use $dir1\bk_ar1, clear
des
sort pidlink
* Keeping important variables
keep ar01a ar02 ar02b ar07 ar07x ar09 ar10 ar11 ar13 ///
     ar15c ar15d ar16 ar17 ar18h hhid07 pid07 pidlink
* Saving to modified data folder
save "$dir01\bk_ar1!", replace

Next step
• Renaming variables
• Which is better: SEX or MALE or FEMALE?
• Which is better: AR15c or Activity_Past_Week or ActvtPastWk?
• Should we change PIDLINK, PID07 or HHID07?
• Generating variables to be used in regression
• Merging with other DTA files

Renaming variables
* renaming variables
rename ar01a hhm_lives_inhh
rename ar02 rel_to_hhhead
* the slide shows "male!" here, which is not a valid Stata name; male_x is an assumed placeholder
rename ar07x male_x
rename ar07 male
rename ar09 age
rename ar02b rel_to_hh
rename ar10 id_num_father
rename ar11 id_num_mother
rename ar13 marital_status
rename ar15d ethnic
rename ar15c activt_pstwk
rename ar16 educ_lvl
rename ar17 educ_grade
rename ar18h alive
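Several variables can also be renamed in one command, and a variable label keeps the original meaning visible after the rename. The sketch below assumes Stata 12 or later (for the grouped rename syntax) and uses the shipped auto data; the new names are made up for illustration.

```stata
* Sketch only: grouped rename plus variable labels (requires Stata 12 or later)
sysuse auto, clear

* Old names in the first group, new names in the second, matched by position
rename (mpg rep78) (miles_per_gallon repair_record)

* Labels document what each renamed variable holds
label variable miles_per_gallon "Mileage (mpg)"
label variable repair_record "Repair record 1978"

describe miles_per_gallon repair_record
```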

Generating Married Dummy Variables
• First  we want to create a dummy variable “married”
• 1 if married, 0 otherwise
• In IFLS, marital status is coded as:
1.

Not married

2.

Married

3.

Separated

4.

Divorced

5.

Widow/er

6.

Don’t know and missing

Generating Married Dummy Variables
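* dummy variable married
gen married=1 if marital_status==2
* 0 for every other known status
replace married=0 if marital_status!=2 & !missing(marital_status)
* don't know / missing codes carry no information, so set the dummy to missing
replace married=. if marital_status==8 | marital_status==9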
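The same dummy can be built in a single line, because a logical expression in Stata evaluates to 1 or 0. The sketch below uses a tiny made-up dataset entered with input, and it assumes that codes 8 and 9 mean "don't know" and "missing", as in the code above.

```stata
* Sketch only: one-line dummy on a made-up marital_status column
clear
input marital_status
1
2
3
4
5
8
9
.
end

* (marital_status == 2) is 1 when true and 0 when false;
* the if qualifier leaves unknown statuses as missing instead of 0
gen married = (marital_status == 2) if !inlist(marital_status, 8, 9) & !missing(marital_status)

* Cross-check the new dummy against the original codes
tab marital_status married, missing
```

Cross-tabulating the new variable against its source with the missing option is a quick way to confirm that every code ended up where intended.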