
A Note on Rubin’s Statistical Matching Using
File Concatenation With Adjusted Weights
and Multiple Imputations
Chris Moriarity
U.S. General Accounting Office (chrismor@cpcug.org)

Fritz Scheuren

NORC, University of Chicago (scheuren@aol.com)
Statistical matching has been used for more than 30 years to combine information contained in two sample survey files. Rubin (1986) outlined an imputation procedure for statistical matching that is different from almost all other work on this topic. Here we evaluate and extend Rubin's procedure.
KEY WORDS: Complex survey design; Multivariate normal; Predictive mean matching; Resampling; Robustness; Variance-covariance structures.

© 2003 American Statistical Association
Journal of Business & Economic Statistics, January 2003, Vol. 21, No. 1
DOI 10.1198/073500102288618766

1. INTRODUCTION

Statistical matching began to be widely practiced with the availability of public use files in the 1960s (Citro and Hanushek 1991). Arguably, the desire to use this technique was even an impetus for the release of several early public use files, including those involving U.S. tax and census data (e.g., Okner 1972).

Statistical matching continues to be widely used by economists for policy microsimulation modeling in government, its original home (as the references herein attest). It also has begun to play a role in many business settings as well, especially as a way—in an era of data warehousing and data mining—to bridge across information silos in large organizations. These applications have not yet reached refereed journals, however—partly, we believe, because they are being treated as proprietary.

Despite their widespread and, in some areas, growing use, statistical matching techniques seem to have received insufficient attention regarding their theoretical properties. Statistical matching always has had an ad hoc flavor (Scheuren 1989), although parts of the subject have been examined with care (e.g., Cohen 1991; Rodgers 1984; Sims 1972). In this article we return to one of the important attempts to underpin practice with theory, important work of Don Rubin (1986), now more than 15 years old.

We begin by describing Rubin's contribution. Needed refinements are then made that go hand-in-hand with advances in computing in the time since Rubin's article was written. To frame the results presented, after this introduction (Sec. 1) we include a section (Sec. 2) titled "What is Statistical Matching?" This is followed by a restatement of Rubin's original results, along with a detailed examination of the method's properties (Sec. 3). Then in Section 4 we present improvements of Rubin's procedure and develop their properties. In Section 5 we provide some simulation results that illustrate the new approaches, and in Section 6 discuss applications and generalizations of the approaches. Finally, in Section 7 we give a summary and draw some conclusions.

2. WHAT IS STATISTICAL MATCHING?

Perhaps the best description to date of statistical matching has been given by Rodgers (1984); other good descriptions have been given by Cohen (1991) and Radner, Allen, Gonzalez, Jabine, and Muller (1980). A brief summary of the method is provided here.

Suppose that there are two sample files, A and B, taken from two different surveys. Suppose further that file A contains potentially vector-valued variables (X, Y), whereas file B contains potentially vector-valued variables (X, Z). The objective of statistical matching is to combine these two files to obtain at least one file containing (X, Y, Z). In contrast to record linkage, or exact matching (e.g., Fellegi and Sunter 1969; Scheuren and Winkler 1993, 1997), the two files to be combined are not assumed to have records for the same entities. In statistical matching, the files are assumed to have little or no overlap, and hence records for similar entities are combined, rather than records for the same entities.

All statistical matches described in the literature have used the X variables in the two files as a bridge to create synthetic records containing (X, Y, Z). To illustrate, suppose that file A consisted in part of records

X_1, Y_1,
X_2, Y_2,
and
X_3, Y_3,

whereas file B had records of the form

X_1, Z_1,
X_3, Z_3,



and
X_4, Z_4.
The matching methodologies used almost always have made the assumption that (Y, Z) are conditionally independent given X, as pointed out initially by Sims (1972). From this assumption, it would be immediate that one could create

X_1, Y_1, Z_1

and

X_3, Y_3, Z_3.

Notice that matching on X_1 and X_3 in no way implies that [(X_1, Y_1) and (X_1, Z_1)], or [(X_3, Y_3) and (X_3, Z_3)], are taken from the same entities.

What to do with the remaining records is less clear, and techniques vary. Broadly, the various strategies used for statistical matching can be grouped into two general categories, "constrained" and "unconstrained." Constrained statistical matching requires the use of all records in the two files and basically preserves the marginal Y and Z distributions (e.g., Barr, Stewart, and Turner 1982). In the foregoing (simplistic) example, for a constrained match one would have to end up with a combined file that also had records

X_2, Y_2, Z_??

and

X_4, Y_??, Z_4.

Unconstrained matching does not have this requirement, and one might stop after creating X_2, Y_2, Z_??. How the statistical matching procedure defined records to be similar would determine the values of the variables without specific subscripts.
A number of practical issues, not part of our present scope, need to be addressed during a statistical matching process. Among these issues are alignment of universes (i.e., agreement of the weighted sums of the data files) and alignment of units of analysis (i.e., individual records representing the same units). Usually, too, the bridging X variables can have different measurement or nonsampling properties in the two files (see Cohen 1991; Ingram, O'Hare, Scheuren, and Turek 2000 for further details).
Statistical matching is by no means the only way to combine information from two files. Sims (1978), for instance, described alternative methodologies to statistical matching that could be used under conditional independence. Other authors (e.g., Singh, Mantel, Kinack, and Rowe 1993; Paass 1986, 1989) have described methodologies for statistical matching if auxiliary information about the (Y, Z) relationship is available. Although an important special case, this option is seldom available (Ingram et al. 2000). (See also National Research Council 1992, where the subject of combining information has been taken up quite generally.)

Rodgers (1984) included a more detailed example of combining two files using both constrained and unconstrained matching than the example that we provide here. We encourage the interested reader to consult that reference for an illustration of how sample weights are used in the matching process.

RUBIN’S PROCEDURE FOR
STATISTICAL MATCHING

In the framework described earlier, Rubin (1986) outlined a methodology for what he termed the "concatenation" of two sample files. Assuming a trivariate outcome file (X, Y, Z), where X could be vector-valued, Rubin suggested a methodology for multiple imputation (Rubin 1987, 1996) of Z in file A and Y in file B, based on several specified values of the partial correlation of Y and Z, given X. We first describe Rubin's procedure, and then examine aspects of the procedure in more detail.

3.1 Description of the Procedure

Broadly, Rubin proposes starting his procedure by using regression. This in turn creates the predictions of the variables that he wants to use for the statistical matching. Finally, he advocates concatenating the resulting files.

Each of these steps is described briefly in this section, followed by a discussion in Section 3.2 of the procedure's theoretical properties. As will be seen, many of these theoretical results are new and in places corrective. In fact, we would not advocate using Rubin's approach without major modifications, as we specify in Sections 4 and 6.
3.1.1. Regression Step. Rubin's procedure begins by postulating a value for the partial correlation of (Y, Z) given X, and calculating the regressions of Y on X in file A and Z on X in file B. The regression coefficients, the variances of the residuals from the two regressions, and the assumed value of the partial correlation of (Y, Z), given X, then are used to construct the matrix
$$
\begin{pmatrix}
0 & R_{Y \text{ on } X} & R_{Z \text{ on } X} \\
-(R_{Y \text{ on } X})' & \mathrm{pvar}_{Y|X} & \mathrm{pcov}_{Y,Z|X} \\
-(R_{Z \text{ on } X})' & \mathrm{pcov}_{Y,Z|X} & \mathrm{pvar}_{Z|X}
\end{pmatrix}. \qquad (1)
$$

Here R_{Y on X} and R_{Z on X} are the column vectors of the regression coefficients of Y on X and Z on X, respectively, and −(R_{Y on X})′ and −(R_{Z on X})′ are the respective negative transposes. R_{Y on X} and R_{Z on X} are of dimension (m + 1) by 1, where the dimension of X is m. pcov_{Y,Z|X} is the partial covariance of (Y, Z) given X; pvar_{Y|X} is the partial variance of Y given X, and pvar_{Z|X} is the partial variance of Z given X. pvar_{Y|X} and pvar_{Z|X} are estimated using the variances of the residuals of the corresponding regressions, and pcov_{Y,Z|X} is estimated using the assumed value of the partial correlation of (Y, Z), given X, multiplied by the square root of (pvar_{Y|X} · pvar_{Z|X}).

The "sweep" matrix operator (Goodnight 1979; Seber 1977) is then applied to (1) to obtain estimates of R_{Y on X,Z} and R_{Z on X,Y}. Beginning with (1) and then "sweeping on Y" gives the matrix
$$
\begin{pmatrix}
\# & \# & (R_{Z \text{ on } X,Y})_{1,\ldots,n-1} \\
\# & \# & (R_{Z \text{ on } X,Y})_{n} \\
-(R_{Z \text{ on } X,Y})'_{1,\ldots,n-1} & -(R_{Z \text{ on } X,Y})_{n} & \mathrm{pvar}_{Z|X,Y}
\end{pmatrix},
$$
and beginning with (1) and “sweeping on Z” gives the matrix
$$
\begin{pmatrix}
\# & (R_{Y \text{ on } X,Z})_{1,\ldots,n-1} & \# \\
-(R_{Y \text{ on } X,Z})'_{1,\ldots,n-1} & \mathrm{pvar}_{Y|X,Z} & -(R_{Y \text{ on } X,Z})_{n} \\
\# & (R_{Y \text{ on } X,Z})_{n} & \#
\end{pmatrix},
$$

where “#” denotes immaterial entries and 4RA on B1 C 512n
denotes the full set of regression coefŽcients of A on B and C.
RY on X1 Z is used to obtain what we call the “primary” estimates of Y in Žle B, and RZ on X1 Y is used to obtain primary
estimates of Z in Žle A. These primary predicted values then
are used to produce what we call the “secondary” predicted
values. RY on X1 Z is used, along with observed X and predicted
Z, to obtain secondary estimates of Y in Žle A, and RZ onX 1 Y
is used, along with observed X and predicted Y , to obtain secondary estimates of Z in Žle B.
At the completion of the estimation step, Žle A consists of
observed X, observed Y , primary predicted Z, and secondary
predicted Y . File B consists of observed X, observed Z, primary predicted Y , and secondary predicted Z.
Note that if the partial correlation of Y and Z, given X, is
assumed to be 0, then the foregoing procedure simpliŽes. In
this case, RY on X1 Z equals RY on X and RZ on X1 Y equals RZ on X .
We note in passing that the regression coefficients obtained by use of the sweep operator as described by Rubin also can be obtained by the regression method described by Kadane (1978). (See Goodnight 1979 or Seber 1977, chap. 12 for more details.) The methods are identical if applied to just one dataset, and differences in regression coefficients obtained from the two methods in the framework discussed here can be expected to be small if the datasets are large.

To replicate the regression coefficients shown in Rubin's table 5, we need to use the sweep operator, because the datasets are very small (eight records in file A and six records in file B). However, even in this case, Kadane's method produces regression coefficients of Y on X and Z in good agreement with Rubin's method. (Note, however, that the agreement is not too good for the regression coefficients of Z on X and Y.)
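To make the regression step concrete, the following sketch shows one way to build matrix (1) for univariate X, Y, and Z and to "sweep on Y" with a textbook SWEEP operator (Goodnight 1979). It is our own illustration in Python/NumPy, not code from Rubin's article; the simulated data, the postulated partial correlation rho_yz_given_x, and all function names are ours.

    # Illustrative sketch only: build matrix (1) for univariate X, Y, Z and sweep on Y
    # to recover R_{Z on X,Y} and pvar_{Z|X,Y}. All names and data here are hypothetical.
    import numpy as np

    def sweep(a, k):
        """Standard SWEEP operator (Goodnight 1979) applied to row/column k."""
        a = a.copy()
        d = a[k, k]
        a[k, :] = a[k, :] / d
        for i in range(a.shape[0]):
            if i != k:
                b = a[i, k]
                a[i, :] = a[i, :] - b * a[k, :]
                a[i, k] = -b / d
        a[k, k] = 1.0 / d
        return a

    def regress(y, x):
        """Return (intercept, slope) coefficients and residual variance of y on x."""
        design = np.column_stack([np.ones_like(x), x])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ coef
        return coef, resid.var(ddof=design.shape[1])

    rng = np.random.default_rng(0)
    x_a = rng.normal(size=1000); y_a = 0.6 * x_a + rng.normal(size=1000)   # file A: (X, Y)
    x_b = rng.normal(size=1000); z_b = 0.4 * x_b + rng.normal(size=1000)   # file B: (X, Z)

    r_y_on_x, pvar_y_x = regress(y_a, x_a)          # regression of Y on X in file A
    r_z_on_x, pvar_z_x = regress(z_b, x_b)          # regression of Z on X in file B
    rho_yz_given_x = 0.3                            # postulated partial correlation of (Y, Z) given X
    pcov_yz_x = rho_yz_given_x * np.sqrt(pvar_y_x * pvar_z_x)

    m1 = np.zeros((4, 4))                           # matrix (1); row/column order: 1, X, Y, Z
    m1[0:2, 2], m1[0:2, 3] = r_y_on_x, r_z_on_x
    m1[2, 0:2], m1[3, 0:2] = -r_y_on_x, -r_z_on_x
    m1[2, 2], m1[3, 3] = pvar_y_x, pvar_z_x
    m1[2, 3] = m1[3, 2] = pcov_yz_x

    swept_y = sweep(m1, 2)                          # "sweeping on Y"
    r_z_on_xy = np.r_[swept_y[0:2, 3], swept_y[2, 3]]   # coefficients of Z on (1, X, Y)
    pvar_z_xy = swept_y[3, 3]
    print(r_z_on_xy, pvar_z_xy)

The last column of the swept matrix holds R_{Z on X,Y}, and the bottom-right element holds pvar_{Z|X,Y}, matching the description above; sweeping (1) on Z works analogously.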
3.1.2. Matching Step. The matching step in Rubin's approach involves using unconstrained matches. The final value of Z assigned to the jth record in file A is obtained by doing an unconstrained match between the (primary) predicted value of Z for the jth record and the (secondary) predicted values of Z in file B. (The matching criterion is a minimum distance or nearest-neighbor approach.) The observed Z value in the matched record in file B is then assigned to the jth record in file A. Similarly, the final value of Y assigned to the ith record in file B is obtained by doing an unconstrained match between the (primary) predicted value of Y for the ith record and the (secondary) predicted values of Y in file A. The matching steps are done separately for Y and Z.

Note that a careful reading of Rubin's article is necessary to discern his methodology for the matching step. The statistical matching literature contains references that incorrectly state that Rubin's procedure is to do an unconstrained match between the (primary) predicted value of Z for the jth record in file A and the observed values of Z in file B, and an unconstrained match between the (primary) predicted value of Y for the ith record in file B and the observed values of Y in file A.
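The following minimal sketch (ours, not code from the article) shows the unconstrained matching step for Z in the univariate case: each file A record is paired with the file B record whose secondary predicted Z is nearest its primary predicted Z, and that donor's observed Z is assigned. The array names and toy values are hypothetical.

    # Illustrative sketch of the unconstrained (nearest-neighbor) matching step for Z.
    import numpy as np

    def unconstrained_match(primary_pred_a, secondary_pred_b, observed_b):
        """For each file A record, find the file B record whose secondary predicted value
        is closest to file A's primary predicted value, and donate file B's observed value."""
        dist = np.abs(primary_pred_a[:, None] - secondary_pred_b[None, :])
        donor = dist.argmin(axis=1)      # nearest neighbor; donors may repeat (unconstrained)
        return observed_b[donor], donor

    z_hat_a = np.array([0.1, 0.9, -0.4])          # primary predicted Z on file A
    z_tilde_b = np.array([0.0, 1.0, -0.5, 0.3])   # secondary predicted Z on file B
    z_obs_b = np.array([0.2, 1.1, -0.6, 0.4])     # observed Z on file B
    z_assigned, donor = unconstrained_match(z_hat_a, z_tilde_b, z_obs_b)
    print(z_assigned, donor)   # some file B records may serve as donor repeatedly, others never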


3.1.3. Concatenation Step. Rubin then suggests concatenating the resulting statistically matched files and assigning the weight (w_A^{-1} + w_B^{-1})^{-1} to each record, where w_A is the weight corresponding to the file A portion of the record and w_B is the weight corresponding to the file B portion of the record.
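As a quick illustration of this weight, the sketch below (ours, with toy weight values) computes (w_A^{-1} + w_B^{-1})^{-1} for the records of a concatenated file.

    # Minimal sketch of Rubin's suggested weight for a concatenated record.
    import numpy as np

    def concatenated_weight(w_a, w_b):
        """Combine the file A and file B weights attached to a concatenated record."""
        w_a, w_b = np.asarray(w_a, float), np.asarray(w_b, float)
        return 1.0 / (1.0 / w_a + 1.0 / w_b)

    print(concatenated_weight([100.0, 250.0], [400.0, 250.0]))   # -> [ 80. 125.]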
Rubin’s method is one of only two procedures described
in the statistical matching literature for assessing the effect
of alternative assumptions of the inestimable value cov4Y 1 Z5.
(See Kadane 1978, supplemented by Moriarity 2001, for the
other procedure.)
Rubin suggests that his basic procedure be repeated for several assumed values of the partial correlation of 4Y 1 Z5 given
X, as an operation akin to multiple imputation (Rubin 1987,
1996). Note that Kadane (1978) had already emphasized the
necessity of repeating the matching procedure for a range of
corr4Y 1 Z5 values, thus anticipating Rubin’s multiple imputation concept as applied in this arena.
3.2 Further Aspects of Rubin's Method

Here we discuss some aspects of Rubin’s method separately
for the regression step, matching step, and concatenation step.
3.2.1. Regression Step. In this section we discuss Rubin's method within the framework of (X, Y, Z) having a nonsingular multivariate normal distribution with mean (μ_X, μ_Y, μ_Z) and covariance matrix

$$
\Sigma = \begin{pmatrix}
\Sigma_{XX} & \Sigma_{XY} & \Sigma_{XZ} \\
\Sigma_{YX} & \Sigma_{YY} & \Sigma_{YZ} \\
\Sigma_{ZX} & \Sigma_{ZY} & \Sigma_{ZZ}
\end{pmatrix}.
$$

All elements of Σ can be estimated from file A or file B except for Σ_YZ and its transpose, Σ_ZY. Specification of the partial correlation of (Y, Z) given X, as Rubin does, can be considered equivalent to specification of Σ_YZ in this framework.
It can be shown (Moriarity 2001) that for file A, after the "primary" prediction of Z, the joint distribution of (X_j, Y_j, Ẑ_j) in file A is normal with mean (μ_X, μ_Y, μ_Z) and (singular) covariance matrix

$$
S_{A1} = \begin{pmatrix}
\Sigma_{XX} & \Sigma_{XY} & \Sigma_{XZ} \\
\Sigma_{YX} & \Sigma_{YY} & \Sigma_{YZ} \\
\Sigma_{ZX} & \Sigma_{ZY} & \Sigma_{\hat{Z}\hat{Z}}
\end{pmatrix}.
$$

We use the "A1" subscript to emphasize that this is the covariance matrix for file A, after the primary prediction of Z. Σ_ẐẐ can be shown to equal

$$
\begin{pmatrix} \Sigma_{ZX} & \Sigma_{ZY} \end{pmatrix}
\begin{pmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{pmatrix}^{-1}
\begin{pmatrix} \Sigma_{XZ} \\ \Sigma_{YZ} \end{pmatrix}.
$$

Similarly, it can be shown that for file B, after the primary prediction of Y, the joint distribution of (X_i, Ŷ_i, Z_i) in file B is normal with mean (μ_X, μ_Y, μ_Z) and (singular) covariance matrix

$$
S_{B1} = \begin{pmatrix}
\Sigma_{XX} & \Sigma_{XY} & \Sigma_{XZ} \\
\Sigma_{YX} & \Sigma_{\hat{Y}\hat{Y}} & \Sigma_{YZ} \\
\Sigma_{ZX} & \Sigma_{ZY} & \Sigma_{ZZ}
\end{pmatrix}.
$$

Σ_ŶŶ can be shown to equal

$$
\begin{pmatrix} \Sigma_{YX} & \Sigma_{YZ} \end{pmatrix}
\begin{pmatrix} \Sigma_{XX} & \Sigma_{XZ} \\ \Sigma_{ZX} & \Sigma_{ZZ} \end{pmatrix}^{-1}
\begin{pmatrix} \Sigma_{XY} \\ \Sigma_{ZY} \end{pmatrix}.
$$

Hence, although the variances of the variables created by prediction are smaller than the variances of the variables being predicted (see Sec. 4 for more discussion of this assertion), other relationships are preserved. In particular, the postulated value of Σ_YZ is preserved.
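A small numeric check of these expressions in the univariate case is sketched below. It is our own illustration (the example Σ, with a postulated Σ_YZ = .3, is hypothetical): it computes Σ_ẐẐ from the observable blocks and the postulated Σ_YZ and confirms that it is smaller than Σ_ZZ.

    # Numeric check (ours) of the S_A1 expressions above for univariate X, Y, Z:
    # after primary prediction the variance of Z-hat shrinks to the quantity below.
    import numpy as np

    sigma = np.array([[1.0, 0.5, 0.4],    # order (X, Y, Z); Sigma_YZ = 0.3 is a postulated value
                      [0.5, 1.0, 0.3],
                      [0.4, 0.3, 1.0]])

    cov_xy = sigma[:2, :2]                # Cov((X, Y)), estimable from file A
    szx_szy = sigma[2, :2]                # (Sigma_ZX, Sigma_ZY)
    var_z_hat = szx_szy @ np.linalg.inv(cov_xy) @ szx_szy   # Sigma_{Z-hat, Z-hat}
    print(var_z_hat, "<=", sigma[2, 2])   # predicted Z has smaller variance than Z itself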
In file A, after the "secondary" prediction of Y, it can be shown (Moriarity 2001) that the joint distribution of (X_j, Ỹ_j, Ẑ_j) is normal with mean (μ_X, μ_Y, μ_Z) and singular covariance matrix

$$
S_{A2} = \begin{pmatrix}
\Sigma_{XX} & \Sigma_{XY} & \Sigma_{XZ} \\
\Sigma_{YX} & \Sigma_{\tilde{Y}\tilde{Y}} & \Sigma_{\tilde{Y}\hat{Z}} \\
\Sigma_{ZX} & \Sigma_{\hat{Z}\tilde{Y}} & \Sigma_{\hat{Z}\hat{Z}}
\end{pmatrix},
$$

where Σ_ẐẐ is defined as before, Σ_ỸỸ can be shown to equal

$$
\begin{pmatrix} \Sigma_{YX} & \Sigma_{YZ} \end{pmatrix}
\begin{pmatrix} \Sigma_{XX} & \Sigma_{XZ} \\ \Sigma_{ZX} & \Sigma_{ZZ} \end{pmatrix}^{-1}
\begin{pmatrix} \Sigma_{XX} & \Sigma_{XZ} \\ \Sigma_{ZX} & \Sigma_{\hat{Z}\hat{Z}} \end{pmatrix}
\begin{pmatrix} \Sigma_{XX} & \Sigma_{XZ} \\ \Sigma_{ZX} & \Sigma_{ZZ} \end{pmatrix}^{-1}
\begin{pmatrix} \Sigma_{XY} \\ \Sigma_{ZY} \end{pmatrix},
$$

and Σ_ỸẐ can be shown to equal

$$
\begin{pmatrix} \Sigma_{YX} & \Sigma_{YZ} \end{pmatrix}
\begin{pmatrix} \Sigma_{XX} & \Sigma_{XZ} \\ \Sigma_{ZX} & \Sigma_{ZZ} \end{pmatrix}^{-1}
\begin{pmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{ZX} & \Sigma_{ZY} \end{pmatrix}
\begin{pmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{pmatrix}^{-1}
\begin{pmatrix} \Sigma_{XZ} \\ \Sigma_{YZ} \end{pmatrix}.
$$

In general, Σ_ỸẐ is not equal to Σ_YZ.

Similarly, in file B, after the secondary prediction of Z, it can be shown that the joint distribution of (X_i, Ŷ_i, Z̃_i) is normal with mean (μ_X, μ_Y, μ_Z) and singular covariance matrix

$$
S_{B2} = \begin{pmatrix}
\Sigma_{XX} & \Sigma_{XY} & \Sigma_{XZ} \\
\Sigma_{YX} & \Sigma_{\hat{Y}\hat{Y}} & \Sigma_{\hat{Y}\tilde{Z}} \\
\Sigma_{ZX} & \Sigma_{\tilde{Z}\hat{Y}} & \Sigma_{\tilde{Z}\tilde{Z}}
\end{pmatrix}.
$$

Here Σ_ŶŶ is defined as before, Σ_Z̃Z̃ can be shown to equal

$$
\begin{pmatrix} \Sigma_{ZX} & \Sigma_{ZY} \end{pmatrix}
\begin{pmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{pmatrix}^{-1}
\begin{pmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{\hat{Y}\hat{Y}} \end{pmatrix}
\begin{pmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{pmatrix}^{-1}
\begin{pmatrix} \Sigma_{XZ} \\ \Sigma_{YZ} \end{pmatrix},
$$

and Σ_ŶZ̃ can be shown to equal Σ_ỸẐ.

3.2.2. Matching Step. Rubin's procedure uses unconstrained matching. This allows for the possibility that some records in each file act as a "donor" more than once, whereas other records are not used. Unlike constrained matching, which forces the use of all records, unconstrained matching can lead to distortions in means, variances, and other parameters (see, e.g., Rodgers 1984).
3.2.3. Concatenation Step. As mentioned earlier, Rubin suggests concatenating the files and assigning the weight (w_A^{-1} + w_B^{-1})^{-1} to each record. If a given record consists mostly of data from file A, then w_A is the weight corresponding to the file A portion of the record, and w_B is the weight corresponding to the file B portion of the record under the assumption that it was sampled according to the protocol used to sample file A. If the given record consists mostly of data from file B, then w_B is the weight corresponding to the file B portion of the record and w_A is the weight corresponding to the file A portion of the record under the assumption that it was sampled according to the protocol used to sample file B.

This suggested weight is intuitively reasonable and feasible to compute in simple cases such as iid sampling. For iid sampling, it can be shown (as Rubin does) that estimates are unbiased. However, for more complex sampling designs, it may not be feasible to compute the suggested weight. The needed sample design information may not be available, and/or it could be difficult or impossible to compute selection probabilities for records in one file under the assumption of sampling according to the protocol used to sample the other file.
A serious problem with concatenation after the use of unconstrained matching is that estimates may not be unbiased, given the use of unconstrained matching, for non-iid sampling. Furthermore, concatenation can give the illusion of the creation of additional information. If file A and file B have 200 records each, it seems apparent that it is not possible to form more than 200 matched Y-Z pairs; however, concatenation can give the illusion of up to 400 Y-Z pairs. This problem of the illusion of having more information is a criticism that can, of course, in one way or another be leveled at all statistical matching methods.
4. LOSS OF SPECIFIED VALUE OF Σ_YZ IN RUBIN'S PROCEDURE: ALTERNATIVES CONSIDERED

In the ideal situation of statistically matching two data files, each having many observations with variables that are multivariate normally distributed, the preferred outcome of a procedure such as Rubin's would be that a specified value of Σ_YZ is reproduced accurately in the final product of the statistical matching procedure. Other characteristics such as means, variances, and covariances that are observable in files A and B should also be preserved.

For Rubin's procedure, the results presented in Section 3.2.1 show that the specified value of Σ_YZ is preserved during the primary estimation of Y and Z in the multivariate normal framework. However, the specified value is not guaranteed to be preserved during the secondary estimation of Y and Z. In fact, significant deviation is possible.
Our extensive simulation of Rubin’s suggested matching procedure (see Sec. 5 for details of the simulation

Moriarity and Scheuren: A Note on Rubin’s Statistical Matching Using File Concatenation

69

Table 1. Summary of Simulation Results for Rubin’s Method and Related Methods

Corr(Y1 Z) values
near conditional
independence value
(1,049 simulations)

Corr(Y1 Z)
values away
from conditional
independence value
(824 simulations)

(1) Rubin’s method
File A
File B
File A/B difference

.04
.10
.10

.14
.25
.13

Bad for variances;
OK for means

Bad

(2) Same as (1), except
no secondary prediction
File A
File B
File A/B difference

.03
.02
.02

.03
.02
.01

Bad for variances;
OK for means

Good

(3) Same as (2), except add
residuals to primary predictions
File A
File B
File A/B difference

.03
.05
.03

.03
.03
.02

Good

Good

(4) Same as (3), except use univariate
constrained matching on Y and on Z
File A
File B
File A/B difference

.03
.05
.03

.03
.03
.01

Good (as is expected from
using constrained matching)

Good

(5) Same as (4), except use multivariate
constrained matching on (Y1 Z)
File A
File B
File A/B difference

.01
.01
.00

.01
.01
.00

Good (as is expected from
using constrained matching)

Good

Downloaded by [Universitas Maritim Raja Ali Haji] at 00:38 13 January 2016

Matching
procedure

Performance
reproducing
variances,
means, etc.

Performance
reproducing
Corr(X 1 Z) in
’le A and
Corr(X1 Y) in ’le B

NOTE: The ’le A and ’le B rows show the average absolute differences between estimated Corr(Y1 Z) and speci’ed Corr(Y1 Z). The ’le A/B difference row shows the average absolute
difference between estimates of Corr(Y1 Z) in the two ’les.

methodology) showed that the speciŽed value of èY Z is not
always preserved in the Žnal Žles. Not surprisingly, there was
considerable correlation between the divergence of the estimated value of èY Z from the speciŽed value of èY Z and the
divergence of èbY Ze from èY Z . Furthermore, the estimates of
èY Z computed from the two Žles often were far apart—a troubling inconsistency. Hence, Rubin’s methodology needed to
be revised to address these anomalies.
One alternative possibility is to carry out a modification of Rubin's procedure in which the secondary estimation step is omitted, to eliminate the distortions introduced by that step. That is, an alternative procedure is to use actual values, rather than secondary estimates, during the matching step. As shown in Table 1, our simulations provided strong evidence that this modification of Rubin's procedure was far more successful on average in retaining the specified value of Σ_YZ. This modification also attained much better consistency of the estimated values of Σ_YZ from the two files compared with Rubin's procedure.
A second alternative possibility to consider is to impute residuals to the primary predicted values before the matching step. Note that Σ_ZZ − Σ_ẐẐ and Σ_YY − Σ_ŶŶ can be identified as variances of random variables with certain conditional distributions (e.g., Anderson 1984, p. 37). (This also shows that the variances of the predicted variable values are smaller than the variances of the variables themselves.) Hence the covariance matrices can be made equal by imputing independently drawn normally distributed random residuals with mean 0 and variance as specified and adding these residuals to Ẑ_j and Ŷ_i. As shown in Table 1, simulations suggest that this methodology is comparable in most ways to Rubin's procedure without the secondary estimation step. However, this methodology provided improved performance in reproducing variances.
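A minimal sketch of this residual step in the univariate case appears below. It is our own illustration: the variance deficit Σ_ZZ − Σ_ẐẐ is assumed known (here a toy value), and the residuals are drawn independently as described above.

    # Sketch (ours) of residual imputation: inflate the primary predictions so their
    # variance matches Sigma_ZZ by adding independent N(0, Sigma_ZZ - Sigma_{Z-hat,Z-hat}) draws.
    import numpy as np

    def add_residuals(z_hat, var_z, var_z_hat, rng):
        """Return Z_hat plus independent normal residuals carrying the variance deficit."""
        deficit = var_z - var_z_hat                 # Sigma_ZZ - Sigma_{Z-hat,Z-hat} >= 0
        return z_hat + rng.normal(scale=np.sqrt(deficit), size=z_hat.shape)

    rng = np.random.default_rng(1)
    z_hat = rng.normal(scale=np.sqrt(0.17), size=1000)   # toy primary predictions, variance ~ .17
    z_star = add_residuals(z_hat, var_z=1.0, var_z_hat=0.17, rng=rng)
    print(z_hat.var(), z_star.var())                     # the second variance is close to 1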
A third alternative methodology is to proceed as in the aforementioned alternative, except to replace unconstrained matching on Y and on Z with univariate constrained matching on Y and on Z (two separate matches). As shown in Table 1, the results of this alternative are comparable to those of the two previous alternatives. However, this procedure has the additional benefit of guaranteeing the elimination of distortion in variances that can occur when unconstrained matching is used.
A fourth alternative to consider is to proceed as in the foregoing alternative, but replace two univariate constrained matches on Y and on Z with a single multivariate constrained match on (Y, Z). This alternative, which consists of the primary estimation step, skipping the secondary estimation step, and multivariate constrained matching on (Y, Z) after imputation of residuals, has been discussed at length by Moriarity (2001) and Moriarity and Scheuren (2001). The multivariate constrained matching step uses a Mahalanobis distance computed on (Y, Z) and was implemented in our simulations using the RELAX-IV public domain software (Bertsekas 1991; Bertsekas and Tseng 1994). If the matching step links the jth record in file A to the ith record of file B, then the final value of Z assigned to the jth record in file A comes from the ith record of file B, and the final value of Y assigned to the ith record of file B comes from the jth record of file A.

This fourth alternative has the benefits of univariate constrained matching, and it also eliminates any inconsistency between (Y, Z) estimates using file A and (Y, Z) estimates using file B. As shown in Table 1, this alternative gave the best performance of the methods considered.
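The sketch below illustrates the fourth alternative's matching step for equal-sized files with equal weights. The article's simulations used the RELAX-IV network-flow code; here, as a stand-in of our own choosing rather than the authors' implementation, the constrained match is posed as an assignment problem and solved with SciPy. The Mahalanobis distance is computed on the (Y, Z) pairs, and every record in each file is used exactly once.

    # Sketch (ours) of multivariate constrained matching on (Y, Z) for equal-sized,
    # equally weighted files, using an assignment solver as a stand-in for RELAX-IV.
    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from scipy.spatial.distance import cdist

    def constrained_match_yz(yz_file_a, yz_file_b):
        """Match each file A record to exactly one file B record, minimizing total
        Mahalanobis distance computed on the (Y, Z) pairs."""
        pooled_cov = np.cov(np.vstack([yz_file_a, yz_file_b]).T)
        cost = cdist(yz_file_a, yz_file_b, metric="mahalanobis", VI=np.linalg.inv(pooled_cov))
        rows, cols = linear_sum_assignment(cost)   # constrained: every record used once
        return cols[np.argsort(rows)]              # file B partner for each file A record

    rng = np.random.default_rng(2)
    yz_a = rng.normal(size=(200, 2))   # file A: (observed Y, primary-predicted Z + residual)
    yz_b = rng.normal(size=(200, 2))   # file B: (primary-predicted Y + residual, observed Z)
    partner = constrained_match_yz(yz_a, yz_b)
    print(partner[:10])                # file B record matched to records 0..9 of file A

With unequal weights or unequal file sizes the problem becomes a transportation (minimum cost flow) problem, which is what the RELAX-IV code solves.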

5. DESCRIPTION AND RESULTS OF THE SIMULATION METHODOLOGY

To assess the performance of Rubin's procedure, and variants of that procedure, we carried out simulations. In these simulations we used univariate X, Y, and Z, with (X, Y, Z) assumed to have a trivariate normal distribution. Without loss of generality, (X, Y, Z) were assumed to have zero means and unit variances. In this simple framework, Σ_YZ is equal to Corr(Y, Z), and so on.

Corr(X, Y) was allowed to vary from 0 to .95 in increments of .05. For a given value of Corr(X, Y), Corr(X, Z) was allowed to vary from Corr(X, Y) to .95 in increments of .05. For given values of Corr(X, Y) and Corr(X, Z), Corr(Y, Z) was allowed to take a range of 4–10 different values within the bounds specified by

$$
\mathrm{Corr}(X,Y) \cdot \mathrm{Corr}(X,Z) \pm \sqrt{[1 - \mathrm{Corr}(X,Y)^2] \cdot [1 - \mathrm{Corr}(X,Z)^2]},
$$

which can be estimated using file A and file B. These bounds have been variously derived (Rodgers and DeVol 1982; Moriarity 2001). All start with the requirement that the correlation matrix of (X, Y, Z) must be positive definite.

Note that the number of values of Corr(Y, Z) depended on the length of the interval of admissible values of Corr(Y, Z).
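The admissible interval is easy to compute; the short sketch below (our own, with hypothetical inputs) evaluates the bounds displayed above for one (Corr(X, Y), Corr(X, Z)) pair.

    # Sketch (ours) of the admissible range of Corr(Y,Z) for given Corr(X,Y) and Corr(X,Z),
    # required for the correlation matrix of (X, Y, Z) to be positive definite.
    import numpy as np

    def corr_yz_bounds(corr_xy, corr_xz):
        half_width = np.sqrt((1.0 - corr_xy**2) * (1.0 - corr_xz**2))
        center = corr_xy * corr_xz            # the "conditional independence value"
        return center - half_width, center + half_width

    low, high = corr_yz_bounds(0.5, 0.7)
    print(low, high)                          # specified Corr(Y,Z) values were taken from this interval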
For given values of Corr(X, Y), Corr(X, Z), and Corr(Y, Z), we drew two independent samples of size 1,000 from the specified multivariate normal distribution. We felt that using a sample size of 1,000 was a reasonable compromise to simulate a dataset of realistic size with minimal sampling variability, while avoiding excessive computational burden.

We carried out the regression steps as previously described in Section 3.1.1. Note that although it is possible to pool the X values from both files to estimate var(X), this can and does lead to occasional problems of nonpositive definite covariance matrices (Moriarity 2001). To avoid such problems, we used X values only from file A to estimate var(X) when predicting Z from X and Y for file A, and used X values only from file B to estimate var(X) when predicting Y from X and Z for file B.
All of the simulation work, except for the fourth alternative
discussed in the previous section, was carried out on a Pentium II PC with a 400 MHz processor and 128 MB of RAM.
A set of about 2,000 simulations typically took several hours
of continuous computer processing to complete. The simulation work for the fourth alternative was done separately (as
discussed in Moriarity and Scheuren 2001), and required more
computer processing (days, as compared to hours) than the
other alternatives.

5.1 Results and Innovations

Table 1 summarizes the results of applying:
1. Rubin’s originally-proposed procedure and then the following modiŽcations, as described in Section 4:
2. Skipping the use of secondary predicted values
3. Skipping the use of secondary predicted values and
adding residuals to the primary predicted values before matching
4. Same as 3, except using univariate constrained matching on Y and on Z (two separate matches) instead of unconstrained matching on Y and on Z
5. Same as 4, except using multivariate constrained matching on 4Y 1 Z5.
Comparing the first two columns within a given row reveals the relative performance of a matching procedure for values near conditional independence versus values far from conditional independence. For Rubin's original method, performance was worse for values far from conditional independence. All of the modified procedures had robust performance relative to conditional independence.

Comparing the rows within one of the first two columns illustrates the relative ability of different procedures to reproduce the specified value of Corr(Y, Z) after the matching step. It can be seen that Rubin's method had the poorest performance of all methods considered, due to the distortion introduced by the secondary estimation step. The best-performing procedure was the one that added residuals to primary predictions and then performed multivariate constrained matching using (Y, Z).
As shown in Table 1, we performed a broader evaluation of the various procedures. We examined each procedure's ability to reproduce other estimates, such as E(Z), var(Z), and Corr(X, Z) in file A and E(Y), var(Y), and Corr(X, Y) in file B. Overall, Rubin's method again had the poorest comparative performance for reproducing variances and covariances. We think that this is because of the secondary estimation step, because methods that used unconstrained matching but not the secondary estimation step generally performed better. Overall, multivariate constrained matching had the best comparative performance for reproducing variances and covariances.
5.2 Simulation Summary

To summarize, we found that it was very important to
eschew the secondary estimation step. Several procedures that
omitted the secondary estimation step gave comparable results,
with multivariate constrained matching performing the best.
6. APPLICATIONS AND GENERALIZATIONS

In this article we have made the assumption that the variables to be statistically matched come from multivariate normal distributions. This does not really fit most situations in practice, where the variables come from complex survey designs and do not have a standard theoretical distribution, let alone a normal one.
In addition to problems that arise with Rubin's method from carrying out secondary estimation, another limitation of the method is that it assumes bivariate (Y, Z). Generalizations to higher dimensions are not immediate. It may be possible to apply techniques akin to multivariate predictive mean matching, as discussed by Little (1988). An alternative procedure that appears to generalize in a straightforward way to higher dimensions is discussed by Moriarity and Scheuren (2001).
Although developing a complete paradigm is beyond our
scope here, we can make some suggestions:
1. Constrained matching is a good starting point. It is expensive but now affordable with recent advances in computing. The use of unconstrained matching by Rubin does not seem to be essential to his ideas; indeed, it may have been advocated just for the sake of specificity.

2. Applications that match files as large as 1,000 (the sample size that we simulated) would be unusual. Even in large-scale projects, like matching the full Current Population Survey with the Survey of Income and Program Participation, the matching generally would be done separately in modest-sized demographic subsets defined by categorical variables such as gender or race (e.g., as described in Ingram et al. 2000).

3. We believe that the general robustness of normal methods can be appealed to, even when the individual observations are not normal. Although not necessarily optimal, the statistics calculated from the resulting combined file will be approximately normal because of the central limit theorem.

4. Resampling of the original sample before using the techniques presented in this article could help expose potential lack of robustness to failures in the iid assumption. One such technique was discussed by Hinkins, Oh, and Scheuren (1997). This can be computationally expensive, depending on the sample designs of the two files being matched.

5. Often researchers who do statistical matching do not bring survey designers into the matching process. This is needed. The use of sample replication, even if only approximately, is one way that designers can help matchers. Without help from samplers, it generally will not be possible to create credible sample variance estimates for statistics created from the matched files. This is formidable in any case but could be especially so with a concatenated file.
6. Deep subject matter knowledge is needed to deal with
differences in the measurement error and other nonsampling
concerns (e.g., edit and imputation issues) that arise and that
may even be a dominant limitation to statistical matching.
7. In all applications, no matter what the matcher’s experience level, caution would recommend that with a new problem, simulations always be done and a small prototype involving real data be conducted before beginning on a large scale.
No decision on how (or even whether) to do a statistical match
should be made until these steps have been taken.
7. SUMMARY AND CONCLUSIONS

Rubin’s 1986 article in the Journal of Business and Economic Statistics presented innovative ideas in the area of statistical matching that, although important, have not until now
been followed up and developed. This is unfortunate, because
the approach advocated by Rubin once modiŽed, has value to
practitioners.

7.1 Secondary Estimation

We have shown that Rubin’s procedure is sound during the
primary estimation step, but not necessarily during the secondary estimation step. In fact, we strongly advise against the
routine use of Rubin’s secondary estimation step.
Because of the loss of the speciŽed value of èY Z that can
occur during the secondary estimation step, Rubin’s procedure
is not feasible as originally described. However, innovations
such as avoiding the secondary estimation step, adding residuals to the primary estimates before matching, and using constrained matching, particularly multivariate constrained matching on 4Y 1 Z5, appear to make the procedure feasible.
The end result is a collection of datasets formed from various assumed values of èY Z , where analyses can be repeated
over the collection and the results can be averaged or summarized in some other meaningful way. (For a recent reference on
ways to average the resulting values, see Hoeting, Madigan,
Raftery, and Volinsky 1999). The methods described by Rubin
(1987, 1996) also might be used; however, we consider this
an area in which additional research is warranted.
7.2 Unconstrained Matching

Legitimate concerns also arise when unconstrained matching is used. The suggestion of using unconstrained matching was appealing when made originally by Rubin, because
it requires much less computational effort than multivariate
constrained matching. However, it can be shown that unconstrained matching leads to distortion of means and variances,
as discussed by Rodgers (1984).
A simple form of univariate constrained matching (Goel
and Ramalingam 1989) that matches ranked values is comparable in computational effort to unconstrained matching. This
method appears to work acceptably when used in tandem with
avoiding secondary estimation and adding residuals to the primary estimates. Ingram et al. (2000) evaluated this procedure,
but without residuals added. Multivariate constrained matching requires more computational effort, but advances in computer hardware and software (e.g., Bertsekas 1991; Bertsekas
and Tseng 1994) have made multivariate constrained matching feasible.
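A minimal sketch of the rank-based constrained match, for equal-sized and equally weighted files, is given below; it is our own illustration with hypothetical inputs, not code from Goel and Ramalingam.

    # Sketch (ours) of a simple rank-based univariate constrained match: the kth smallest
    # value in one file is matched to the kth smallest in the other, using every record once.
    import numpy as np

    def rank_match(values_a, values_b):
        """Return, for each file A record, the index of its rank-matched file B record."""
        order_a = np.argsort(values_a)        # ranks within file A
        order_b = np.argsort(values_b)        # ranks within file B
        partner = np.empty_like(order_a)
        partner[order_a] = order_b            # kth smallest in A gets kth smallest in B
        return partner

    z_pred_a = np.array([0.7, -1.2, 0.1, 2.0])   # e.g., primary predictions + residuals in file A
    z_obs_b = np.array([0.3, 1.8, -0.9, 0.0])    # observed Z in file B
    print(rank_match(z_pred_a, z_obs_b))         # donor (file B) index for each file A record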
7.3 Concatenation

The notion of file concatenation is appealing. However, on close examination, it seems to have limited applicability. It is not clear that the suggested weights can always be computed for non-iid sampling for complex survey designs. Moreover, if the two sample files are simply brought together, then there is danger of giving the illusion of creation of information by repetition of observations in the concatenated file. This incorrect conclusion is less likely if no concatenation is done.

Nonetheless, in many past applications of statistical matching, the matching of the two files was done in only one direction, thus wasting information. Rubin's essential idea of matching in both directions is important and should become standard practice. Use of this approach would imply that two estimates would be available for inference, a helpful way of displaying both sampling and nonsampling error.

To calculate the set of w_ABi = (w_A^{-1} + w_B^{-1})^{-1}, the weight suggested by Rubin for the ith observation on the concatenated file, one needs to calculate the weight that file A sample observations would have had on file B had they been selected into the file B sample. Similarly, one needs to be able to calculate for a file B case what its probability is of being selected into file A.

Now, in general, for a given sample design there is an index set I of labels needed to assign a probability of selection. In order for the type of concatenation that Rubin advocated to be feasible, these files would need to be of the form [X_A, Y_A, Weight_A, I_A, I_B] and [X_B, Z_B, Weight_B, I_A, I_B] before the matching step.

Now, except in very special cases (e.g., simple random sampling (SRS) on both files or stratified SRS with a subset of the X variables being the stratifiers), I_A and I_B would not be known for all observations. This would almost certainly be true when public use files are being statistically matched. Indeed, for public use files that generally contain limited information because of confidentiality concerns, w_ABi cannot be calculated unless I_A and I_B are effectively captured by the X variables and the sample weights. (Rubin makes a related point in sec. 3.3 of his article.)
Ignore for a moment that in most settings, fully weighting a concatenated file of the form that Rubin describes may be impossible. Suppose instead that it were possible. Usually, there are known quantities about the population in either the file A or the file B samples that are conditioned on either during selection or after the fact. Although only sketching this in his article, Rubin seems to be leading us to this point with his comment on adjusted ratio weights. To illustrate, suppose one matched a stratified sample (file A) with a simple random sample (file B). Suppose further that the stratifying variables were among the common X variables. Then after constructing the concatenated weights one would be led, almost naturally, to condition them on stratum totals, if known. If both files were stratified simple random samples, with the stratifiers among the common X variables, then jointly conditioning on both sets of strata totals might be done, possibly using a raking estimator (Deming and Stephan 1940).
We would argue that Rubin’s adjusted ratio version of concatenated weighting, as illustrated here, moves practitioners
back toward constrained matching, our preferred approach.
Now what does Rubin say about constrained matching?
I regard the automatic matching of margins to the original Žles as a relatively minor beneŽt of the constrained approach in most circumstances, especially considering that the real payoff in matching census margins arises when
samples are not large and census data on the margins exist; and then this
constrained approach, that matches census margins, is not as appropriate as
methods designed to match population margins such as ratio and regression
adjustment, which can be applied after the matched Žle is created.

The phrase “in most circumstances” may be where the problem lies.
We agree with Rubin about the benefit (or lack thereof) when the samples are very large. But applications usually are to small subpopulations for which census data on the margins may exist or where the practitioner believes one file's estimates to be superior to the other, perhaps due to higher response rates or better measurement properties.


Now an important problem with the usual application of constrained matching—and one on which we wholly agree with Rubin—is that we are treating the constrained totals as fixed and without error. Replication approaches can address this but usually are not applied (see Sec. 6).
7.4 Final Observations

We conclude with some general comments about statistical matching and the procedures discussed herein. We take a pragmatic view about statistical matching; it has been used for many years and will continue to be used in the future.

Most statistical matching procedures that have been implemented have assumed, implicitly or explicitly, that (Y, Z) are conditionally independent given X. Clearly, this is a plausible assumption that provides a consistent set of relationships between X, Y, and Z, but it is not the only possible plausible assumption. For example, as discussed in Section 5 for the case of univariate X, Y, and Z with unit variances, the "conditional independence value" Corr(X, Y) · Corr(X, Z) is the midpoint of a range of plausible values for Corr(Y, Z), and in general, the range of plausible values for Corr(Y, Z) in this case is wide (Rodgers and DeVol 1982). A similar situation exists in higher dimensions. Thus any synthetic data file produced by statistical matching procedures that do not exhibit the effect of alternative plausible values of the (Y, Z) relationship to conditional independence has serious limitations, unless the conditional independence assumption happens to be more or less correct.
The techniques described here, which are extensions of work of Kadane (1978) and Rubin (1986), provide a means for exhibiting the effect of alternative plausible assumptions about the unknowable relationship between Y and Z, and we advocate the creation of a reasonable number of datasets to display the effect of a wide range of plausible values. Procedures for generating a wide range of plausible values for the (Y, Z) relationship have been outlined by Moriarity and Scheuren (2001).

Also, if sufficient resources are available, then we recommend, for a given plausible value for the (Y, Z) relationship, imputation of several different sets of residuals to display the variability introduced by that procedure.
Although the techniques described herein can help exhibit variability due to various plausible assumptions about the (Y, Z) relationship, only auxiliary information can provide an accurate idea of the (Y, Z) relationship. If auxiliary information about the (Y, Z) relationship is available, then multivariate matching after imputing residuals can produce a dataset that accurately reproduces that relationship while preserving other observed relationships in the multivariate normal case (Moriarity and Scheuren 2001). Further research is needed to determine whether our technique works well in other cases.
ACKNOWLEDGMENTS
This article is based in large part on the unpublished doctoral dissertation cited as Moriarity (2001). Results sketched here are fully developed in that reference. The authors thank Tapan Nayak, Reza Modarres, and Hubert Lilliefors of The George Washington University Department of Statistics for useful discussions. The authors also thank the referees for their constructive suggestions, which improved the clarity of this article. The views expressed are ours and do not necessarily reflect the views or positions of the U.S. General Accounting Office.
[Received February 2002. Revised April 2002.]

REFERENCES

Anderson, T. W. (1984), An Introduction to Multivariate Statistical Analysis (2nd ed.), New York: Wiley.

Barr, R. S., Stewart, W. H., and Turner, J. S. (1982), "An Empirical Evaluation of Statistical Matching Strategies," unpublished technical report, Southern Methodist University, Edwin L. Cox School of Business.

Bertsekas, D. P. (1991), Linear Network Optimization: Algorithms and Codes, Cambridge, MA: MIT Press.

Bertsekas, D. P., and Tseng, P. (1994), "RELAX-IV: A Faster Version of the RELAX Code for Solving Minimum Cost Flow Problems," unpublished technical report, available at http://web.mit.edu/dimitrib/www/home.html.

Citro, C. F., and Hanushek, E. A. (eds.) (1991), Improving Information for Social Policy Decisions: The Uses of Microsimulation Modeling, Vol. I: Review and Recommendations, Washington, DC: National Academy Press.

Cohen, M. L. (1991), "Statistical Matching and Microsimulation Models," in Improving Information for Social Policy Decisions: The Uses of Microsimulation Modeling, Vol. II: Technical Papers, eds. C. F. Citro and E. A. Hanushek, Washington, DC: National Academy Press, pp. 62–88.

Deming, W. E., and Stephan, F. F. (1940), "On a Least Squares Adjustment of a Sampled Frequency Table When the Expected Marginal Totals Are Known," Annals of Mathematical Statistics, 11, 427–444.

Fellegi, I. P., and Sunter, A. B. (1969), "A Theory for Record Linkage," Journal of the American Statistical Association, 64, 1183–1210.

Goel, P. K., and Ramalingam, T. (1989), The Matching Methodology: Some Statistical Properties, Lecture Notes in Statistics Vol. 52, New York: Springer-Verlag.

Goodnight, J. H. (1979), "A Tutorial on the SWEEP Operator," The American Statistician, 33, 149–158.

Hinkins, S., Oh, H. L., and Scheuren, F. (1997), "Inverse Sampling Design Algorithms," Survey Methodology, 23, 11–21.

Hoeting, J. A., Madigan, D., Raftery, A., and Volinsky, C. T. (1999), "Bayesian Model Averaging: A Tutorial," Statistical Science, 14, 382–417.

Ingram, D. D., O'Hare, J., Scheuren, F., and Turek, J. (2000), "Statistical Matching: A New Validation Case Study," in Proceedings of the Survey Research Methods Section, American Statistical Association, pp. 746–751.

Kadane, J. B. (1978), "Some Statistical Problems in Merging Data Files," in 1978 Compendium of Tax Research, Washington, DC: U.S. Department of the Treasury, pp. 159–171. (Reprinted in Journal of Official Statistics, 17, 423–433.)

Little, R. J. A. (1988), "Missing-Data Adjustments in Large Surveys," Journal of Business and Economic Statistics, 6, 287–301.

Moriarity, C. (2001), "Statistical Properties of Statistical Matching," unpublished Ph.D. dissertation, The George Washington University, Dept. of Statistics.

Moriarity, C., and Scheuren, F. (2001), "Statistical Matching: A Paradigm for Assessing the Uncertainty in the Procedure," Journal of Official Statistics, 17, 407–422.

National Research Council (1992), Combining Information: Statistical Issues and Opportunities for Research, Washington, DC: National Academy Press.

Okner, B. A. (1972), "Constructing a New Data Base From Existing Microdata Sets: The 1966 Merge File," Annals of Economic and Social Measurement, 1, 325–342.

Paass, G. (1986), "Statistical Match: Evaluation of Existing Procedures and Improvements by Using Additional Information," in Microanalytic Simulation Models to Support Social a