Manajemen | Fakultas Ekonomi Universitas Maritim Raja Ali Haji jbes%2E2009%2E0006

Journal of Business & Economic Statistics

ISSN: 0735-0015 (Print) 1537-2707 (Online) Journal homepage: http://www.tandfonline.com/loi/ubes20

Sequential Causal Models for the Evaluation of
Labor Market Programs
Michael Lechner
To cite this article: Michael Lechner (2009) Sequential Causal Models for the Evaluation of
Labor Market Programs, Journal of Business & Economic Statistics, 27:1, 71-83, DOI: 10.1198/
jbes.2009.0006
To link to this article: http://dx.doi.org/10.1198/jbes.2009.0006

Published online: 01 Jan 2012.

Submit your article to this journal

Article views: 148

View related articles

Citing articles: 4 View citing articles


Full Terms & Conditions of access and use can be found at
http://www.tandfonline.com/action/journalInformation?journalCode=ubes20
Download by: [Universitas Maritim Raja Ali Haji]

Date: 12 January 2016, At: 01:07

Sequential Causal Models for the Evaluation of
Labor Market Programs
Michael LECHNER
Professor of Econometrics, University of St. Gallen, Swiss Institute for Empirical Economic Research, Varnbuelstr.
14, CH-9000 St. Gallen, Switzerland (Michael.Lechner@unisg.ch)
This article reviews inverse selection probability weighting to estimate dynamic causal effects. A distribution theory based on sequential generalized method of moments estimation is proposed and the
method is applied to a reevaluation of some parts of the Swiss active labor market policy to obtain new
results and discuss several issues about the implementation of the estimation procedure.
KEY WORDS: Causal effects; Dynamic treatment effects; Nonparametric identification; Panel data;
Program evaluation; Sequential randomization.

Downloaded by [Universitas Maritim Raja Ali Haji] at 01:07 12 January 2016


1.

selection problem. His approach is based on the key assumption that it is possible to control for the variables jointly
influencing outcomes and selection at each particular selection
step of the sequential selection process. Robins’ (1986)
approach was applied subsequently in epidemiology and biostatistics. Gill and Robins (2001) extend this approach to
continuous treatments. Estimation strategies proposed for this
model by Robins and various coauthors usually make parametric or semiparametric modeling assumptions about the
relation of the potential outcomes to the confounding variables,
like in g-estimation proposed by Robins (1986). G-estimation
is based on recursively modeling the dependence of the outcome variables on the histories of confounders and treatments.
To do so more parsimoniously, so-called ‘‘structural nested
models’’ have been proposed by Robins (1993, 1994). However, estimation (typically by some form of maximum likelihood) of such models, which are necessarily complex, is very
tedious. An alternative to this class of models are the so-called
marginal structural models (MSM) again proposed by Robins
(1998a, b). MSM are more tractable and probably more robust
because they model the dependence of the potential outcomes
only on those confounders that are time constant. Because
controlling for those variables does not remove selection bias,
weighting by the inverse of the treatment probabilities is used

in addition. Robins (1999) discusses the advantages and disadvantages of MSM and structural nested models. Various
types of MSM are discussed in Robins, Greenland, and Hu
(1999); Robins, Hernan, and Brumback (2000); Hernan,
Brumback, and Robins (2001); as well as Joffe, Ten Have,
Feldman, and Kimmel (2004). Miller et al. (2001) relate MSM
to the literature on missing data problems. Estimation of MSM
typically appears to be done by the (weighted) method of
generalized estimating equations (Liang and Zeger 1986),
which has conceptional similarities to the generalized method
of moments (GMM) (Hansen 1982) that is more common in
econometrics. So far, there seem to be no applications of
dynamic potential outcome models to economic problems.
Lechner and Miquel (2001) adapt Robins’ (1986) framework
to comparisons of more general sequences and different

INTRODUCTION

There is a substantial body of literature about the estimation
of average ‘‘causal effects’’ of policy interventions using large
micro data. Examples are numerous studies concerned with

employment effects for unemployed who participate in publicly financed training programs. In this microeconometric
framework, a causal effect is defined as a comparison of what
would have happened in two (or more) different potential states
of the world. This binary comparison is based on those
observations that are actually observed in the two different
states. If the allocation to a specific state is random, then the
difference of the means of the outcome variables of the
respective participants in the different states is a valid estimator
for the average causal effect. If participation is not random,
then selection problems arise and weighted means are required.
In its static version, the selection problem received considerable attention in the microeconometric and statistics literature, resulting in numerous suggestions about how to derive
the required weights. Angrist and Krueger (1999); Heckman,
LaLonde, and Smith (1999); and Imbens (2004) provide surveys of various parts of this vast literature.
However, the static model may not be able to address all
relevant selection issues that may occur. Consider the evaluation of training programs: Suppose that interest is not in the
effect of one particular training course, but in a sequence of
courses. However, if the first course in such a sequence is
effective, some unemployed individuals take up a job before
the second course starts. If interest is not in the effect of the first
course, but in the effect of the sequence of two courses, such

behavior creates a selection problem that cannot be addressed
in the static model. For example, solely controlling for pretraining variables does not work for obvious reasons. One
important selection variable—namely, the outcome of the first
participation—is missing. However, controlling for variables
realized after the first training course that influence selection
into the second course bears the potential problem that they
may be influenced by the first part of the training sequence.
Thus, they are ‘‘endogenous’’ and are ruled out as control
variables in standard econometric approaches. Finally, it would
be hard to find an instrument that is valid in that particular case.
Robins (1986) suggests an explicitly dynamic causal
framework that allows defining causal effects of dynamic
interventions and systematically addressing this type of

 2009 American Statistical Association
Journal of Business & Economic Statistics
January 2009, Vol. 27, No. 1
DOI 10.1198/jbes.2009.0006
71


Downloaded by [Universitas Maritim Raja Ali Haji] at 01:07 12 January 2016

72

parameters and selection processes, and reformulate the identifying assumptions. The assumptions used by Robins et al. and
Lechner and Miquel (2001) bear considerable similarity to the
selection-on-observables or conditional independence assumption (CIA) that is prominent in the static evaluation literature.
When the CIA holds, inverse selection probability weighting
(IPW) estimators as well as matching estimators are typically
used (see Imbens 2004). Both classes of estimators have the
advantage that the relation of the outcome variables to the
confounders does not need to be specified. They require only
the specification of the relation of the confounders to the
selection process. In typical labor market evaluation studies,
this feature is considered an important advantage, because
applied researchers may have better knowledge about the
selection process than about the outcome process. Although
IPW estimators are, in principle, easy to implement and efficient under suitable conditions (e.g., Hirano, Imbens, and
Ridder 2003), they are known to have problematic finite
sample properties (e.g., Frölich 2004) when very small or very

large probabilities occur, which could lead to too much
importance (weight) for a single (or a few) observations.
However, although matching estimators may be more robust
(but less efficient) in this sense, for many important versions of
these estimators the asymptotic distribution has not yet been
established (see Abadie and Imbens 2006).
This article proposes estimators for the models of Robins
(1986) in the formulation suggested by Lechner and Miquel
(2001) that retain the flexible and convenient properties of the
aforementioned estimators, although some increase in complexity is unavoidable. Although sequential propensity score
matching estimators could be applied in this context as well (and
are presented in an earlier discussion paper version of this article
that is available on the Internet), this article focuses on IPW
estimation of the dynamic causal models. As mentioned earlier,
it is an advantage that both methods do not require modeling the
dependence of the outcomes and the confounders. The required
specification of the relation between selection and confounders
is done sequentially (i.e., each selection step over time is considered explicitly), which allows a detailed and flexible specification of this relation. Although these advantages are shared
by both—matching and IPW estimators—it is easy to derive its
asymptotic distribution only for IPW. For the case of a parametric modeling of the relation of the selection process to

confounders, in each period, which is the common case in static
evaluation studies, this article provides the asymptotic distribution of the IPW estimators based on the theory of sequential
GMM estimation as proposed, for example, by Newey (1984).
As a further addition to the literature in epidemiology, this
article considers not only the estimation of the effects for some
general population (the ‘‘average treatment effect,’’ as it would
be called in the static evaluation literature), but also some effect
of heterogeneity by treatment status (average treatment effects
on the treated). Moreover, it discusses issues of common support
that arise in this context with IPW estimation in some detail. An
empirical analysis of different components of the Swiss active
labor market policy illustrates its application and addresses a
couple of practical issues.
The article proceeds as follows: Section 2 outlines the
dynamic causal framework using the notation suggested by

Journal of Business & Economic Statistics, January 2009

Lechner and Miquel (2001), and details the IPW estimator. In
Section 3, the practical application of the estimator is illustrated

using Swiss administrative data. Section 4 concludes. Appendix
A addresses issues that arise with multiple treatments as used
in the empirical part. Appendix B shows the derivation of the
sequential IPW estimator and its asymptotic distribution.
2. THE DYNAMIC CAUSAL MODEL: DEFINITION OF
THE MODEL AND IDENTIFICATION
2.1 Notation and Causal Effects
This section briefly repeats the definition of the dynamic
selection model as well as the identification results as presented
by Lechner and Miquel (2001) using the terminology of the
econometric evaluation literature. Because a three-periods–
two-treatments model is sufficient to discuss most relevant
issues that distinguish the dynamic from the static model and
comes with a much lighter notational burden than the general
model, this section shows the results only for this basic version.
Suppose there is an initial period 0, in which everybody is in
the same treatment state (e.g., everybody in the population of
interest is unemployed), followed by two periods in which
different treatment states could be realized (e.g., unemployed
may participate in some training courses). Periods are indexed

by t or t (t, t 2 {0,1,2}). For this population, the complete
treatment is described by a vector of random variables, S ¼ (S1,
S2). A particular realization of St is denoted by st 2 {0,1}.
Denote the history of variables up to period t by a bar below a
variable, such as s2 ¼ ðs1 ; s2 Þ: To differentiate between different sequences, sometimes a letter (e.g., j) is used to index a
sequence, like sjt : As a further notational convention, uppercase
letters usually denote random variables, whereas lowercase
letters denote specific values of the random variable. In the first
period, a unit is observed in one of two treatments (participating in training, s1 ¼ 1; or not participating in training, s1 ¼
0). In period 2, the unit participates in one of four treatment
sequences {(0,0), (1,0), (0,1), (1,1)}, depending on what happened in period 1. Therefore, every unit can participate in one
sequence defined by s1 and another sequence defined by the
values s1 and s2. To sum up, in the two (plus one)-periods–twotreatments example, we consider six different overlapping
potential outcomes corresponding to two mutually exclusive
states defined by treatment status in period 1, plus four mutually exclusive states defined by treatment status in periods 1
and 2.
Variables used to measure the effects of the treatment (i.e.,
the potential outcomes), such as like earnings, are indexed by
s
treatments, and are denoted by Y t t : Potential outcomes are

measured at the end of (or just after) each period, whereas
treatment status is measured at the beginning of each period. For
each length of a sequence (one or two periods), one of the
potential outcomes is observable and is denoted by Yt. The
resulting two observation rules (called ‘‘consistency condition’’
by Robins 1986, or ‘‘stable unit value treatment assumption’’ by
Rubin 1980) are defined in the following equation:
10
Y t ¼ S1 Y 1t þ ð1  S1 ÞY 0t ¼ S1 S2 Y 11
t þ S1 ð1  S2 ÞY t

00
þ ð1  S1 ÞS2 Y 01
t þ ð1  S1 Þð1  S2 ÞY t ; t ¼ 0; 1; 2:

Lechner: Sequential Causal Models

73

Finally, variables that may influence treatment selection and
potential outcomes are denoted by X. The K-dimensional
vector Xt, which may contain functions of Yt, is observable at
the same time as Yt.
Next, the average causal effect (for period t) of a sequence of
treatments defined up to period 1 or period 2 (t) compared with
 k l
s ; s
an alternative sequence ut t t is defined as follows:
sk ; s~jt

ut t

sl

k

ðsjt Þ :¼ EðY stt j St ¼ s~jt Þ  EðY t t j S~t ¼ s~jt Þ;

sk ; sl

t

0 # ~t # 2; 1 # t # 2; ~t # t; k 6¼ l; k; l 2 ð1; . . . ; 2 Þ;
~t

j 2 ð1; . . . ; 2 Þ:
Analogous to the static treatment literature, Lechner and
sk ; slt

Downloaded by [Universitas Maritim Raja Ali Haji] at 01:07 12 January 2016

Miquel (2001) call ut t

the dynamic average treatment effect

sk ; sl
(DATE). Accordingly, utt t ðskt Þ;

sk ; sl

as well as ut t t ðslt Þ are termed
DATE on the treated (DATET) and DATE on the nontreated,
sk ; sl

respectively. There are cases in between, like ut 2 2 ðsl1 Þ, for
which the conditioning set is defined by a sequence shorter
than the ones defining the causal contrast, denoted by
DATE(T). Note that the effects are symmetric for the same
sk ; slt

population ðut t

sl ; sk

ðs~jt Þ ¼ ut t t ðs~jt ÞÞ; which, however, does not
sk ; slt

exclude effect heterogeneity ðut t

changes in treatment status as well as outcomes and whether
they are observable. If they are observable, and if there is
common support (defined in part (c) of W-DCIA, then we have
identification, even if some or all conditioning variables in
period 2 are influenced by the outcome of the treatment in
period 1. To be more precise, the previously mentioned literature shows that, for example, quantities like EðY 11
2 Þ;
11
11
EðY 11
2 j S1 ¼ 0Þ; EðY 2 j S1 ¼ 1Þ; or E½Y 2 j S2 ¼ ð1; 0Þ are
11
identified, but that E½Y 11
2 j S2 ¼ ð0; 0Þ or E½Y 2 j S2 ¼ ð0; 1Þ

sl ; sk

ðskt Þ ¼
6 ut t t ðslt Þ; sj2 6¼ sl2 Þ:

2.2 Identification
Assume that a large random sample, fs1i ; s2i ; x0i ; x1i ; x2i ;
y1i ; y2i gi¼1:N ; is available. It is drawn from the large population
of interest (with the same value of s0) and is characterized by
the random variables ðS1 ; S2 ; X 0 ; X 1 ; X 2 ; Y 1 ; Y 2 Þ: Furthermore,
all conditional expectations of all random variables that are
of interest shall exist. Under these conditions, Robins (1986)
and Lechner and Miquel (2001) show that, if we can observe the
variables that jointly influence selection and outcomes at each
stage, then particular average treatment effects are identified.
This assumption is called the ‘‘weak conditional independence
assumption’’ (W-DCIA) by Lechner and Miquel (2001).
Weak Dynamic Conditional Independence Assumption

10
01
11
aÞ Y 00
2 ; Y 2 ; Y 2 ; Y 2 ‘ S1 j X 0 ¼ x 0 :
11
01
10
S2 j X 1 ¼ x1 ; S1 ¼ s1 :
bÞ Y 00
2 ; Y2 ; Y2 ; Y2
cÞ 1 > PðS1 ¼ 1 j X 0 ¼ x0 Þ > 0;
1 > PðS2 ¼ 1 j X 1 ¼ x1 ; S1 ¼ s1 Þ > 0;
8x1 2 x 1 ; 8s1 : s1 2 f0; 1g:

A BjC ¼ c means that each element of the vector of random variables B is independent of the random variable A
conditional on the random variable C taking a value of c in the
sense of Dawid (1979). Part (a) of W-DCIA states that, conditional on X0, potential outcomes are independent of treatment
choice in period 1 (S1). This is the standard version of the static
CIA. Part (b) states that conditional on the treatment in period
1, on observable outcomes (which may be part of X1) and on
confounding variables of period 0 and 1 ðX 1 Þ; potential outcomes are independent of participation in period 2 (S2).
To determine whether such an assumption is plausible in a
particular application, we have to ask which variables influence

sk ; sl

are not identified. Thus, u22 2 ; u22 2 ðsj1 Þ are identified, 8sk1 ; sk2 ;
sk ; sl

sl1 ; sl2 ; sj1 ; sj2 2 f0; 1g; but u22 2 ðsj2 Þ are not identified if sl1 6¼
sk1 : This result states that pairwise comparisons of all sequences
are identified, but only for groups of individuals defined by
their treatment status in periods 0 or 1. The reason is that although the first period treatment choice is random conditional
on exogenous variables, which is the result of the initial condition
S0 ¼ 0 for all observations, in the second period, randomization
into these treatments is conditional on variables already
influenced by the first period treatment. W-DCIA has an appeal
for applied work as a natural extension of the static framework.
Lechner and Miquel (2001) show that to identify all treatment parameters, W-DCIA must be strengthened by imposing
that the confounding variables used to control selection into the
treatment of the second period are not influenced by the selection into the first treatment. Thus, the dynamic selection problem
collapses into a static selection problem with multiple treatments as discussed by Imbens (2000) and Lechner (2001a).
Appendix A contains some brief considerations for the
dynamic case with multiple treatments. The extension to more
than two periods is direct.
2.3

Estimation

All relevant estimators of the causal effects of the treatment
sequences are based on differences of weighted means of the
observable outcomes of those actually receiving the two
treatments of interest:
k

l

^ust 2 ;s2 ðsj Þ ¼
t
P
3

1
sk2 ; sjt

^i
i2 sk2 w

X

sl ; sj

X
i2sk2

sk ;sj

^i 2 t yt;i  P
w
sl ;sj

sk ;sjt

^i 2 t ; w
^i 2
^i 2 t yt; i ; w
w

1
sl ;sj
^i 2 t
i2 sl2 w

$ 0;

i2 sl2

sk ; sj
^i 2 t
w

sl ;sj
^i 2 t
w

t 2 f0; 1g j 2 fk; lg:

and
are the appropriate weights (‘‘^’’ indicates that
they are estimated).
An obvious way to reweight the observations of a specific
treatment population toward the target population ðstj Þ is to use
the inverse of their respective selection probabilities of being
observed in that treatment as opposed to the target population,
following the ideas of Horvitz and Thomson (1952). Recently,
the idea of directly reweighting the observations based on the
choice probabilities has found renewed interest (e.g., Hahn
1998; Hernan et al. 2001; Hirano and Imbens 2001; Hirano
et al. 2003; Nevo 2003; Robins, Hernan, and Brumback 2000;
Robins and Rotnitzky 1995; Robins, Rotnitzky, and Zhao 1995;

74

Journal of Business & Economic Statistics, January 2009

Wooldridge, 2002, 2004). The reason is probably twofold.
First, as advocated, for example, by Robins et al. (2000), they
are easy to use, if a parametric model is used to estimate the
choice probabilities. Second, Hirano, Imbens, and Ridder
(2003) show for the static model, that for a suitably chosen
nonparametric estimator of the choice probabilities, the inverse
probability estimator is efficient.
For sjt ¼ s0 ; Robins et al. showed in various papers that the
following weights lead to consistent estimators (empirical
equivalent to the g-formula):
1
sk
^i 2 ¼ sk jsk
8i 2 S2 ¼ sk2 ; 8k:
w
sk1
2 1
p^2 ðx1; i Þ^
p1 ðx0; i Þ
sk jsk

p^22 1 ðx1;i Þ denotes a consistent estimate of the transition

Downloaded by [Universitas Maritim Raja Ali Haji] at 01:07 12 January 2016

sk

probability PðS2 ¼ sk2 j S1 ¼ sk1 ; X 1 ¼ x1;i Þ and p^11 ðx0;i Þ is a
consistent estimate of PðS1 ¼ sk1 j X 0 ¼ x0;i Þ: A similar result
can be derived for the DATE(T)—namely,
sk ;sj
^i 2 1
w

sj

¼

p^11 ðx0;i Þ

sk
sk jsk
p11 ðx0;i Þ
p^22 1 ðx1;i Þ^

8i 2 S2 ¼ sk2 ; 8k; j:

Appendix B shows that such weights solve the selection
problem.
It is, so far, common practice in the static evaluation literature to estimate such selection probabilities with parametric
binary or multiple-response models (for each selection or
transition). Under the standard regularity conditions of GMM
estimation that are usually considered to be weak (see Hansen
1982
pffiffiffiffi or, for this special case Newey 1984), such an estimator is
N —consistent with an asymptotically normal distribution.
Appendix B shows its asymptotic covariance matrix.
3. AN EMPIRICAL APPLICATION
In this section, the IPW estimator is applied to an evaluation
of active labor market programs in Switzerland to show that it
can provide interesting new insights, and discusses issues of
implementation that may arise.
3.1

Setting the Scene

The study by Gerfin and Lechner (2002) serves as an example
for this exercise. They are interested in the effects of different
components of the Swiss active labor market policy on subsequent labor market outcomes using a rich administrative
database coming from the merged records of the Swiss unemployment insurance system and the public pension system. They
consider several types of programs and argue extensively that
the data are informative enough to make the CIA plausible.
Their study contains very good information about variables
explaining program participation as well as potential outcomes,
such as long employment histories; regional, sociodemographic, and occupational information; as well as subjective
information on the unemployed provided by the caseworkers
who are in charge of program assignment. For more details
about the programs evaluated and the data used, see Gerfin and
Lechner (2002). This application relies on the same population
as Gerfin and Lechner (2002), but uses the extended version of
the sample described in the follow-up study by Gerfin, Lechner,
and Steiger (2005).

Gerfin and Lechner (2002) estimate the effects of starting the
first program in an unemployment spell. They focus on the
beginning of the first program, because in a static model there
are two endogeneity problems. First, any timing of the outcome
measurement related to the end of the program may already
incorporate some program effects. Second, subsequent program participation may depend on the effect of the first program. The dynamic causal framework allows going further and
accounting for the endogeneity of program duration and subsequent participation.
Here, interest is in the effects of being two periods in different states for individuals entering unemployment in the last
quarter of 1997. The different states used are unemployment
(U), participating in training courses (C) or employment programs (E), and receiving temporary wage subsidies (T). A
period is defined as an interval of two months. Given the usual
duration of Swiss programs, a bimonthly period appears to be
appropriate to capture the essential dynamics. Any bimonthly
period during which an individual participates at least two weeks
in a program (C, E, or T) is considered program participation.
In the rare event of participation in two different programs
during the same period, the individual is defined to participate
in the longest of those programs. Treatment occurs between
January and April 1998. For example, considering sequences
like EE compared with CC allows assessing the effects of four
months of employment programs compared with four months
of courses. Obviously, this approach allows investigating the
effects of program combinations as well—for example, by
comparing CE with UU or EE. As another example, it is possible to check the effect of waiting for future participation by,
for example, comparing UC (or UCC) with CC. For the sake of
brevity, only effects of the types CC, EE, TT, and UU are
considered. Because W-DCIA is assumed to hold, we consider
only effects for subpopulations defined by the treatment in the
first period.
The outcome variables of interest are the probability of
finding unsubsidized employment as well as monthly earnings
between May 1998 and the end of the observation period in
December 1999. If we believe in the validity of the identification arguments of Gerfin and Lechner (2002) for the
beginning of the first program, then the part of W-DCIA
relating to the first period is no problem. When the intermediate outcomes that determine the selection in March are
observable, then the part of W-DCIA relating to the second
selection step is plausible as well.
Table 1 gives descriptive statistics on some important control
variables related to W-DCIA (see Table 2 for a full list of
control variables used), treatments, and intermediate as well as
final outcomes for the different subsamples of interest. The
control variables (X0) considered in this table are the monthly
earnings in the previous job, the subjective valuation of the
caseworkers, as well as the number of sanction days without
benefit (imposed by the caseworkers on unemployed individuals violating the rules). The subjective valuation gives an
assessment of the employability of the unemployed. This
assessment is a time-varying variable that the caseworker may
change at the monthly interview of the unemployed (it is
observed for every month). Comparing descriptive statistics
across subsamples defined by treatment status reveals that

Lechner: Sequential Causal Models

75

Table 1. Descriptive statistics of control variables and intermediate outcomes
Means/shares in % in subsamples
Subpopulations (treatment status January/February, March/April 1998)

Downloaded by [Universitas Maritim Raja Ali Haji] at 01:07 12 January 2016

Variables

U

UU

C

CC

E

EE

Selected control variables (measured in December 1997 or before)
Insured earnings in CHF
3,943 3,904 4,008 3,932 3,592 3,517
Chances to find a job (employability): no information
9
8
5
5
8
7
Very easy
8
8
6
5
3
4
Easy
17
16
19
18
10
12
Medium
55
56
59
60
58
58
Difficult
9
10
11
10
17
15
Special case
1
1
1
1
4
4
Sanction days without benefits (current spell; until December 1997)
3.1
3.2
3.8
3.2
3.5
3.3
Selected intermediate outcome variables (measured in February 1998) after first period
Employed
13
4
2
1
1
0
Earnings in CHF
509
164 1,283
51
33
0
Chances to find a job (employability): no information
2
2
2
2
0
0
Very easy (or some employment)
20
12
8
5
4
4
Easy
16
16
17
16
13
14
Medium
52
57
60
62
59
59
Difficult
9
11
13
14
20
19
Special case
1
2
1
1
3
4
Sanction days without benefits January þ February 1998
1.2
1.3
0.9
0.8
0.5
0.4
Outcome variables
Gross monthly earnings in May 1998 (in CHF)
1,648 1,345
987
867
590
371
Gross monthly earnings in September 1998 (in CHF)
2,151 1,913 1,809 1,728 1,356 1,112
Gross monthly earnings in September 1999 (in CHF)
2,792 2,533 2,805 2,683 2,243 2,163
Gross monthly earnings in December 1999 (in CHF)
2,420 2,171 2,758 2,699 2,105 2,068
Employed in May 1998
28
22
17
14
12
8
Employed in September 1998
36
31
30
28
28
25
Employed in September 1999
49
44
49
46
40
39
Employed in December 1999
40
36
44
42
37
37
Sample size
7,985 5,122
573
316
118
99

T

TT

4,023 4,067
11
14
6
6
18
18
56
54
7
7
1
2
2.6
2.3
6
229
3
12
20
56
7
1
0.5
1,919
2,407
3,046
2,713
29
39
52
45
790

1
30
3
8
21
59
7
1
0.6
1,368
2,124
2,921
2,595
22
35
52
46
382

NOTE: The sample is based on the selection criteria of Gerfin and Lechner (2002), but is constrained to those entering unemployment in the fourth quarter of 1997. The sequences are
defined on a bimonthly basis. The states in each period are abbreviated as U (unemployed), C (training course), E (employment program), and T (temporary wage subsidy). X (XX) means
being unemployed at the end of 1997 and in state X (XX) for the first 2 (4) months in 1998 (X ¼ U, C, E, T).

participants in employment programs are having the worst a
priori chances in the labor market. Participants in temporary
wage subsidies appear to be those with the best a priori chances.
The lower panel in Table 1 shows statistics for the four
intermediate outcomes (X1)—namely, employment, earnings,
the subjective valuation of the caseworkers, and sanction days
(i.e., reduced benefits if the unemployed violates the rules).
Comparing the variables across treatments confirms the previous conclusion. However, it also shows that participants
leaving the sequence are different from participants who finish
the sequence.
3.2.

The Inverse Probability Weighting Estimator
in Practice

3.2.1. Estimation of the Conditional Participation Probabilities. Conditional probabilities are estimated at each step
in each subsample using binary probits, which are subjected to
specification tests. Table 2 presents the six binary probits that
are required to estimate all selection equations that are of
interest for the effects of the treatments defined earlier. The
specifications broadly follow the specifications of Gerfin and
Lechner (2002) and Gerfin et al. (2005) adjusted for the spe-

cifics of the dynamic framework, the different aggregation
levels, and sample sizes. Because the outcome variables after
period 1, X1, do not influence S1 conditional on X0, they are
omitted from the first period probits (their entries are denoted
as X). Furthermore, the sample sizes and the variation of the
dependent and independent variables in some cases require
omitting specific variables from the specification (their entries
are denoted as 0). The latter problem is particularly pronounced
for the transition from E to EE, with only 19 observations
leaving the sequence at this stage.
3.2.2. Common Support and the Sensitivity of the Inverse
Probability Weighting Estimator to Extreme Probabilities. It
may not be possible to obtain reliable estimates of the mean
effects of the target populations as a whole, when some groups
of the target population have characteristics that make them
very unlikely to appear in one (or both) of the sequences under
consideration. If it is impossible for such groups to appear in
one of the treatments, then the support assumption of W-DCIA
is violated, which means that the data are uninformative about
this part of the target population (other than by extrapolation
from other parts of the population). The common remedy is to
remove this subpopulation from the target sample and redefine
the causal effect appropriately. Even if their probability of

76

Journal of Business & Economic Statistics, January 2009

Table 2. Probit specifications and coefficient estimates

Downloaded by [Universitas Maritim Raja Ali Haji] at 01:07 12 January 2016

Variable
Age, y/10
Female
Mother tongue not German/French/Italian
Mother tongue German/French/Italian, not main
language in own canton
French mother tongue
Italian mother tongue
Nationality: foreign with yearly permit
Foreign languages: other Swiss language
Job position very low
Qualification level: Skilled (highest)
Unskilled (lowest)
Chances to find a job: No information
(reference category: medium) Very easy
Easy
Difficult
Special case
Chances to find a job: No information , February 1998
(reference category: medium) Very easy, February 1998
Easy, February 1998
Difficult, February 1998
Special case, February 1998
Looking for job as % of full-time job, February 1998 (0–1)
Unemployment status: Full-time unemployed
Part-time unemployed
No information on desired full- or part-time job
Desired ¼ previous occupation, three-digit level,
February 1998
Previous occupation: Metals
Painting, technical drawing
Entrepreneurs, senior officials, justice
Office and computer
Retail trade
Science
Duration of UE spell (days/10)
Unemployment benefits /1,000 In 1996
In 1997
Month of entry into social security system
Share of employment 1988–1997
Average duration of UE spells 1988–1997/10
Subsidized temporary job before December 1997
Size of town (previous employment) < 30.000
Region of placement office (PO) in rural area
Inflow rate to long-term unemployment (by region of PO)
Region (reference central) West
East
Zurich
Southwest
Northwest
Ticino
Additional regional effects by canton: Bern
French main language in canton
Monthly insured earnings (in last job) in CHF/1,000
Employed February 2002
Current unemployment spell is first spell
Sanction days without benefits (current spell)/10
Sanction days without benefits/10, January þ February 1998
Subsample
Number of observations in subsample
Dependent variable
Mean of dependent variable in subsample

U vs. C

U vs. E

U vs. T

UU if U

CC if C

EE if E

TT if T

0.04
0.16
0.20
0.09

0.00
0.26
0.31
0.09

0.02
0.22
0.14
0.16

0.06
0.06
0.04
0.04

0.00
0.05
0.23
0.14

0.44
0.02
0
0.16

0.00
0.17
0.11
0.19

0.26
0.16
0.03
0.05
0.03
0.17
0.01
0.25
0.11
0.02
0.05
0.07
X
X
X
X
X
X
0.20
0.15
0.03
X

0.18
0.33
0.26
0.26
0
0.13
0.11
0.01
0.36
0.28
0.11
0.14
X
X
X
X
X
X
0.14
0.11
0.00
X

0.12
0.12
0.08
0.13
0.03
0.03
0.07
0.03
0.06
0.01
0.13
0.07
X
X
X
X
X
X
0.15
0.04
0.11
X

0.07
0.09
11
0.03
0.11
0.04
0.04
0.00
0.23
0.06
0.00
0.02
0.06
0.19
0.00
0.01
0.05
1.95
0.52
0.13
1.54
0.09

0.41
0.59
0.02
0.30
0.16
0.00
0.19
0.06
0.52
0.22
0.05
0
0
0.61
0.31
0.02
0
0.52
1.00
1.43
0.67
0.11

0.65
0
0.40
1.14
0.58
1.11
0.62
0
0
0
0
0
0
0.32
0.71
0.32
0
0.52
0.71
0
0
0.44

0.25
0.03
0.03
0.06
0.02
0.22
0.00
0.37
0.08
0.10
0.15
0
0
0.05
0.09
0.19
0
1.12
0.16
0.38
1.14
0.14

0.26
0
0
0.24
0.10
0.45
0.78
0.01
0.01
0.46
0.35
0.80
0
0.03
0.00
0.53
0.01
0
0.18
0.55
0.42
0.56
0.24
0.11
0.02
X
0.02
0.01
X

0.15
0
0
0.14
0.44
0.16
0.43
0.00
0.01
0.02
0.59
0.73
0
0.05
0.11
0.41
0.08
0.67
0.69
0.68
0.43
0.82
0.02
0.15
0.09
X
0.01
0.02
X

0
0.12
0.51
0.27
0.24
0
0
0
0.01
0.18
0.49
1.21
0.38
0.03
0.07
0.56
0.20
0
0.12
0.24
0.21
0.37
0.14
0.39
0.02
X
0.12
0.05
X

0.06
0
0
0.04
0.08
0.02
0.12
0.01
0.00
0.09
0.34
0.01
0
0.14
0.07
0.19
0.20
0
0.24
0.08
0.14
0.20
0.37
0.47
0.04
1.16
0.07
0.03
0.02

0.76
0
0
0.01
0.38
0.07
0.13
0.01
0.00
0.72
0.57
3.95
0
0
0.01
0.37
0.57
0
0.24
0.68
0.74
1.01
0.02
0.18
0.01
0.58
0.31
0.03
0.20

0
0
0
0
0
0
0.05
0
0.00
1.07
0
0
0
0
0.20
1.98
0
0
0.29
0
0.38
0
0
0
0.26
0
0.80
0.09
0

0
0.33
0.14
0.06
0.17
0
0
0
0.01
0.36
0.75
0.31
0.19
0
0.15
0.25
0.06
0
0.41
0.13
0.03
0.19
0.21
0.13
0.05
1.62
0.07
0.05
0.17

U or C
8,512
U
0.94

U or E
8,100
U
0.98

U or T
8,772
U
0.91

U
7,982
UU
0.64

C
574
CC
0.55

E
118
EE
0.83

T
790
TT
0.48

NOTE: Binary probit model estimated on the respective subsample. All specifications include an intercept and are subjected to specification tests (omitted variables, nonnormality). If
not stated otherwise, all information in the variables relates to the last day in December 1997. Exclusion of variables: 0: Variables omitted from specification. X: Variable not temporarily
prior to dependent variable. Bold numbers in italics denote significance at the 1% level. Bold numbers denote significance at the 5% level. Italics denote significance at the 10% level. See
note in Table 1.

Downloaded by [Universitas Maritim Raja Ali Haji] at 01:07 12 January 2016

Lechner: Sequential Causal Models

occurrence is positive but small, IPW estimators may have bad
properties (e.g., Frölich 2004; Robins et al. 2000; and the
Monte Carlo study in the discussion paper version of this
article). Therefore, even in this case the target population needs
to be adjusted. Although, there is some discussion of support
issues in the matching literature (e.g., Lechner 2001b; Smith
and Todd 2005), the literature on IPW appears to be silent on
that topic. The typical IPW solution to the problem is to remove
(or rescale) observations with very large weights and check the
sensitivity of the results with respect to the trimming rules used.
Trimming observations, however, changes the target distribution (shrinking large weights changes the target population
as well, but in a way that is very hard to describe empirically).
This change of the target distribution can be monitored by
computing the same weights for the target population as for the
treated population (i.e., the weights they would have obtained
had they participated in such treatment). Then the subpopulation with values of those probabilities that lead to a
removal in one of the treatment samples is compared descriptively with the subpopulation with no such extreme values.
Because two potential outcomes are estimated for the same
target population, trimming has to be done in both treated
subsamples. However, when the trimming is done independently for both subsamples, the estimates of the two mean
potential outcomes may end up being valid for different target
populations. Therefore, the same ‘‘types’’ of observations
should be removed. Here, this idea is implemented in a simple
way: Denote the weights used to weight observations in
sequence 1 by w1i21 and those for observations in sequence 0
by w0i20 : Compute both weights for the target population
ðw0i2s j ; w1i2s j Þ and both treatment populations ðw1i20 ; w0i21 Þ as
1
well. Define the desired cutoff level for each treatment as w
0 : Reduce w
1 ðw
0 Þ to the largest value of w1i20 ðw0i21 Þ if it is
and w
 1 ðw
0 Þ: This procedure could be iterated,
below the original w
1 and
but this was not necessary in the application. Based on w
0 , the target population is adjusted by considering only
w
1 and w
0 :
observations with values of w0i2s j and w1i2s j less than w
Comparing descriptive statistics of confounders in the
retained versus the deleted target sample shows how this
reduction affects the interpretation of the parameters estimated.
Furthermore, predicting the means of the confounders for the
reduced sample in the same way as the mean potential outcomes
is a second way to characterize the adjusted target population.
In the application, we compare the results of four different
trimming rules for the various sequences and subpopulations.
The first rule does not trim any observation at all; the second to
fourth ones trim a given share of the observations with the
largest weights (applied in both treated subsamples separately).
The shares that are used are 0.1%, 1%, and 5%, respectively.
3.2.3 Results for the Different Estimators. Table 3 contains four blocks, with each block corresponding to an IPW estimator using one of the four cutoff rules. For each estimator, Table
3 contains a part related to common support issues (columns 3–6
and 12–15) and the weights (columns 7, 8, and 16, 17), as well as
a part showing the estimates of the potential outcomes (monthly
earnings in Swiss franks, CHF in December 1999; columns 5, 6
and 18, 19) and the effects of the sequences (columns 11, 20).
The part about the common support begins with the share of
observations that would be deleted by adjusting the reference

77

distribution (column 3, 12). The deleted share increases with
the decreasing cutoff points for the weights and it varies with
the different sequences investigated.
Concerning trimming, using the 5% trimming leads to a loss
of between 30% and 55% of the observations, whereas a .1%
trimming rule leads to considerably smaller losses between 3%
and 18%. Comparing different sequences, the largest losses
occur when comparing participants in employment programs
(EE) with recipients of temporary wage subsidies (TT). This is
expected, because they are the most dissimilar groups.
The next column (4, 13) gives the mean of the earnings
variable in the last job before treatment. This variable cannot be
influenced by the treatment and is thus exogenous. Comparing
its mean in the adjusted reference sample with its mean in the
original reference sample (same as those for the IPW without
trimming) shows the direction of the adjustment. With the
exception of some comparisons based on the 5% rule, adjusting
the reference distribution does not change the mean of this
lagged earnings measure very much. Columns 5, 14 and 6, 15
contain the predictions of this variable based on the two
treatment samples using the same weights as used for computing the mean potential outcomes. Because the effect for this
variable is zero, those estimates should be similar (if the
probability model is correctly specified) and both values should
be similar to its mean in the adjusted reference population
(column 4, 13). Taking into account the sampling variability,
these conditions appear to be fulfilled. Thus, the comparisons,
which are like balancing tests used with static matching, give
no indication of a misspecification of the selection model used.
Columns 7, 16 and 8, 17 contain the 10% concentration ratio
of the weights in the two subsamples of treated observations
(after adjusting for common support). This ratio shows the
share of the target population covered by those 10% of the
treatment population that have the largest weights. In the case
of random assignment, this ratio would be about 10. However,
if only a single observation is used to predict outcomes for all
members of the target population, this ratio is 100. The results
show systematically higher ratios when small treated groups
are used to predict outcomes for large (and diverse) target
populations, but overall do not indicate that the estimates are
dominated by very few observations.
Columns 9, 18 and 10, 19 show the predicted counterfactual
mean earnings of the sequences (1) for the respective target
populations (2). Comparing the exogenous lagged earnings of
both treatment groups with those of the target group using the
lower panel of Table 1 gives an idea of whether we expect IPW
to adjust mean observed posttreatment earnings upward or
downward. Indeed, these adjustments appear to move the
estimators in the expected direction in almost all cases.
Comparing the stability of the results across estimators (and
thus across different common supports) shows qualitatively
similar results, although the 5% trimming rule leads to a fairly
different magnitude of the effect for the comparison of UU with
CC for population U. In this case, the latter estimator finds
considerably lower mean potential earnings for participating in
CC compared with the other estimators. This goes along with
the fact that when using the weights for the subpopulation in
CC, the predicted mean lagged earnings are lower than the
sample mean for the 5% rule, but are higher for the other

Downloaded by [Universitas Maritim Raja Ali Haji] at 01:07 12 January 2016
78

Table 3. Estimation results for gross monthly earnings (in CHF) in December 1999
Common support

Common support
Predicted
value of
E(Y0 j s1)

% of Mean of Y0
target
Target
In adjusted target sample
Sequences population sample
deleted
s12  s02
s1
s1
s12
s02
(1)
(2)
(3)
(4)
(5)
(6)
UU-CC
UU-EE
UU-TT
TT-EE

UU-CC

UU-TT
TT-EE

0
0
0
0
0
0
0
0

U
C
U
E
U
T
T
E

22
9
24
11
17
14
36
25

s12
(7)

IPW without adjustment
3,943
3,912 4,048
16
4,008
3,971 4,054
32
3,943
3,911 3,912
17
3,592
3,557 3,598
42
3,943
3,911 3,886
17
4,023
3,969 4,053
29
4,023
4,053 4,165
19
3,590
3,624 3,598
45
IPW, 1% of largest weights removed
3,925
3,926 4,065
14
3,985
3,989 3,987
31
3,816
3,801 3,711
15
3,572
3,542 3,596
39
3,940
3,947 3,813
14
4,037
3,998 4,034
25
3,827
3,785 3,669
16
3,666
3,652 3,717
41

% of Mean of Y0
target
In adjusted target sample
sample
1 0
s
;s
s1
s02 ut 2 2 ðs1 Þ deleted
s12
s02
(10)
(11)
(12)
(13)
(14)
(15)

s02
(8)

s12
(9)

42
21
46
18
393
20
48
18

2232
2,316
2,232
2,037
2,234
2,341
2,592
2,313

3,140
2,943
1,971
2,076
2,480
2,574
2,047
2,076

907
626
260
39
246
233
545
237

15
3
18
5
13
8
16
10

3,920
4,005
3,844
3,532
3,935
4,040
3,941
3,581

37
21
36
17
29
16
39
17

2,231
2,293
2,185
2,056
2,179
2,257
2,412
2,271

2,991
2,848
2,394
2,146
2,454
2,620
2,196
2,197

759
555
209
89
274
362
215
73

44
30
50
32
30
30
55
43

3,966
3,895
3,675
3,602
3,992
4,018
3,763
3,637

Concentration ratio of
weights (in %)
s12
(16)

s02
(17)

IPW, 0.1% of largest weights removed
3,911 4,116
15
41
3,972 4,034
30
20
3,822 3,666
16
42
3,497 3,558
42
18
3,925 3,841
15
32
3,996 4,037
26
18
3,978 3,827
18
40
3,604 3,613
44
17
IPW, 5% of largest weights removed
3,972 3,904
13
30
3,945 3,914
29
19
3,674 3,683
13
32
3,552 3,603
34
15
3,998 4,005
13
24
4,026 4,005
22
15
3,730 3,655
15
33
3,741 3,657
35
16

Effects on
current
 s2 earnings

E Y t j s1
s1 ;s0

s12
(18)

s02 ut 2 2 ðs1 Þ
(19)
(20)

2,227
2,271
2,207
2,014
2,194
2,261
2,500
2,249

3,098
2,907
2,166
2,118
2,536
2,567
2,361
2,161

871
636
41
103
341
306
139
88

2,307
2,281
2,077
2,111
2,198
2,284
2,479
2,561

2,692
2,731
1,973
2,201
2,458
2,590
2,503
2,263

385
449
104
90
260
305
23
298

NOTE: Sequences: Treatments for which the effects are estimated. Target population: Population for which the effects of the treatments are estimated. Common support: Part of the target sample for which the effects could be estimated from
observations in the two subsamples of treatment participants. In the adjusted target sample, observations for which the treatment sample could not be used to predict the effects are removed (the share of the removed part is shown in the column headed by
‘‘% of target sample deleted’’). Predicted value of means that the mean of Y0 is predicted using the same weights as used for the estimation of the effects. Concentration ratio of weights: Sum of the 10% largest weights divided by the sum of all weights 3
100. Y0 denotes insured monthly earnings in CHF (used by the unemployment insurance to calculate unemployment benefits). Because Y0 is computed from the UE insurance data, they may not be perfectly comparable with the outcome variable that is
calculated from the pension records. Earnings are coded as 0 if individuals receive a temporary wage subsidy, participate in a program, or are unemployed. For the effects, bold numbers in italics denote significance at the 1% level. Bold numbers denote
significance at the 5% level. Italics denote significance at the 10% level. Estimates for the same counterfactual mean outcome may differ across different comparisons because the common support is comparison specific. See note in Table 1.

Journal of Business & Economic Statistics, January 2009

UU-EE

U
C
U
E
U
T
T
E

Concentration ratio
of weights (in %)

Predicted
value of
E(Y0 j s1)

Effects
on current
 earnings

s
E Y t 2 j s1

Downloaded by [Universitas Maritim Raja Ali Haji] at 01:07 12 January 2016

Lechner: Sequential Causal Models

79

estimators with more extreme cutoff values. This difference is
reflected in the effect estimates as well. However, even though
these differences exist, standard errors for the mean counterfactual are rather large. Therefore, it is probably not possible to
reject the hypothesis that the estimators give the same results.
Finally, it should be pointed out that, for several comparisons, the number of observations ending up in the sequences is
too small to lead to significant results, although the estimated
effects are not too small (like for the comparison of UU with
EE or EE with TT).
3.2.4. Effects of the Program Sequences on Labor Market
Outcomes. In addition to the result for December 1999, Table
4 presents results for outcome variables 1, 6, 17, and 20 months
after the end of the sequence of interest to shed some light on
the development of the effects over time. Furthermore, because
it is the ‘‘official’’ objective for the active labor market policy
to increase reemployment chances, in addition to the earnings
variable that is shown in the upper panel of Table 4, an outcome
variable measuring whether an individual is employed is presented in the lower panel of Table 4.
To summarize, the results suggest that it is clearly better to
participate in a course or receive a wage subsidy for two
periods compared with remaining unemployed over that
period. However, for employment programs it does not matter
whether an unemployed person participates in such a program
or remains unemployed. Of course, because the latter is much
cheaper, employment programs appear to be ineffective.
Looking at the dynamics of the program effects, the reason
seems to be a large lock-in effect (individuals in the program
search less intensively than those outside the program),
because the typical duration of an employment program would
be 6 months, which is longer than for the other measures. The
comparison between wage subsidies and employment pro-

grams indicates the superiority of the former when both have at
least a duration of two periods, but this effect is not estimated
precisely enough for a firm conclusion.
3.2.5. Standard Errors. Table 5 shows the results of five
different estimators for the standard errors. The first two proce