Steps of an Ideal Research Program in the Empirical Social Sciences:
From Context of Discovery to Context of Justification

Erich H. Witte1
University of Hamburg, Germany

Frank Zenker2
Lund University, Sweden

Abstract
The current crisis in the social sciences, especially in psychology, is not readily
comparable to the usual crisis discourse, because many researchers have by now in
fact lost confidence in knowledge based on empirical results. This paper argues that,
in order to make headway, researchers must seek to integrate extant knowledge
within a research program, and they must adapt their inference strategy as they go
along. This entails moving from the context of discovery to the context of
justification. Justification (confirmation) is a comparative notion, however, and is
therefore possible only when alternative hypotheses are specified. Without such
specification no justification can be achieved, and so only a vague and intuitive
confidence in our empirical discoveries would remain warranted.
Keywords: confirmation, knowledge accumulation, replicability crisis, research program, significance test

1 Prof. (em.) Dr. Erich H. Witte
Social and Economic Psychology
University of Hamburg
Von-Melle-Park 5
D-20146 Hamburg
witte_e_h@uni-hamburg.de
http://www.epb.uni-hamburg.de/de/personen/witte-0
2 Frank Zenker
Department of Philosophy
Lund University
Kungshuset, Lundagård
22 222 Lund, Sweden
frank.zenker@fil.lu.se
http://www.fil.lu.se/person/FrankZenker


1. Introduction
Psychology and other empirical social sciences currently experience a state of crisis.
This is not an entirely new claim (Sturm & Mühlberger, 2012); scholars have
diagnosed a crisis of their discipline from 1889 (Willy) to the present day, most
recently in Perspectives on Psychological Science (Pashler & Wagenmakers, 2012;
Spellman, 2012). Similarly, the special crisis of significance testing is as old as the
significance test (Harlow, Mulaik & Steiger, 1997; Witte, 1980). Regardless, we are
certainly in a crisis when many of us have lost confidence about which among a
plethora of published empirical results can be trusted. What do we really know from
our results?
Below, we provide a diagnosis of the current crisis (Sect. 2). If we are right,
then the predominant use of statistical inference methods is implicated negatively.
We by and large seem to employ such methods in underpowered studies which
produce theoretically disconnected one-off discoveries. Such results, we argue, are
neither stable and replicable discoveries nor justifications of a theoretical hypothesis,
something that can only arise in the framework of a cumulative research program.
We lay out four steps of such a program (Sect. 3), and go on to argue that our
scientific knowledge can only improve by entering the context of justification (Sect.
4). Moreover, we point out critically that many of our best theories currently remain
theoretically underdeveloped, because they do not yet predict precise effects (Sect.
5). Unless the community makes significant headway in conducting coordinated
long-term research endeavors, or so we prophesy, the crisis discourse will linger
on.

2. The status quo
Many proposals have been voiced to overcome the shortcomings of the present
situation. The perhaps too optimistic conclusions from a recent survey (Fuchs, Jenny
& Fiedler, 2012) indicate that psychologists are open to change. However, such
change should apparently not be too intense either; for instance, the accepted rules
of good practice should reportedly not be turned into binding publication conditions;
reviewers should reportedly be more tolerant of imperfections in results submitted
for publication: an 84% agreement among respondents (ibid., 640). Among the
solutions proposed one also finds the intensification of communication and the
exchange of data, or the exchange of the design characteristics of experimental
studies prior to an experiment (Nosek & Bar-Anan, 2012).
The crisis seems to involve a dilemma between publishing significant results, on the
one hand, and increasing the trustworthiness of scientific knowledge, on the other
(Nosek, Spies & Motyl, 2012; Bakker, van Dijk & Wicherts, 2012). Generally, a
very long path leads from the discovery of a new effect to its theoretical explanation.
But the distinction—well-known in the philosophy of science—between a context of
discovery and a context of justification (Reichenbach, 1938; Schickore & Steinle,
2006) has been mostly ignored in our fundamental approach to theory testing.
Significance tests and Bayesian tests alike, whether with a Cauchy-distributed prior or
a normally distributed prior (Dienes, 2012; Rouder et al., 2009; Wagenmakers et al.,
2012), do not normally specify alternative explanatory hypotheses. Usually, only
the null hypothesis is well specified, so that a sufficiently large deviation from this
parameter can be accepted as a non-random result, which is then taken to translate
into a discovery of theoretical interest. But a statistically supported discovery of this
kind emphatically does not amount to a justification (confirmation) of a parameter
derived from a theory.
What are the crucial characteristics that differentiate the two contexts (Table 1)?

Characteristics | Context of discovery | Context of justification
Uncertainty model | Probability model with P(x|θ): the probability of some data x given the hypothesis θ. | Likelihood model with L(θ|x): the likelihood of a hypothetical parameter θ given some data x.
Specification of parameters | Specification only of a random parameter: θ_r = 0. The theoretical alternative parameter remains unspecified: θ ≠ 0. | Specification of two parameters: θ_r = 0 and θ_th = x.
Theoretical construction | Simple random model without adaptation to the empirical condition. | Construction of a random model under the specific conditions with a precise parameter, construction of a theoretical model, and deduction of the specified parameter.
Integration of empirical results | Only single discoveries without further integration of results. One-shot decisions depend only on the significance of single experiments. | An integrated research program combines various empirical results under a complex theory.
Knowledge of wrong decisions | Specification of false positives (α error); no specification of true positives (1-β error; test power). | Specification of false and true positives (α error and 1-β error).

Table 1. Comparison of salient characteristics in the context of discovery and the context of justification.
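
To make the contrast in the first row of Table 1 concrete, the following minimal Python sketch (ours, purely illustrative, with made-up numbers) evaluates the same binomial expression once as P(x|θ), the probability of fixed data under the single specified null parameter, and once as L(θ|x), the likelihood of two specified candidate parameters for the same fixed data.

```python
from scipy.stats import binom

# Fixed data: k = 106 hits in n = 200 trials (illustrative numbers only).
n, k = 200, 106

# Context of discovery: P(x | theta), the probability of the data under the
# single specified random parameter theta_r = 0.5.
p_data_given_null = binom.pmf(k, n, 0.5)

# Context of justification: L(theta | x), the same expression read as a
# function of theta for fixed data, evaluated at both specified parameters
# theta_r = 0.5 and theta_th = 0.531.
likelihood_null = binom.pmf(k, n, 0.5)
likelihood_alt = binom.pmf(k, n, 0.531)

print(f"P(x | theta = 0.5)    = {p_data_given_null:.4f}")
print(f"L(theta = 0.5   | x)  = {likelihood_null:.4f}")
print(f"L(theta = 0.531 | x)  = {likelihood_alt:.4f}")
print(f"Likelihood ratio (alt / null) = {likelihood_alt / likelihood_null:.2f}")
```

Reading the same formula as a likelihood of parameters only becomes informative once a second, theoretically specified parameter is on the table, which is exactly what the context of justification demands.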

For instance, Wagenmakers et al. (2012) pursued an agenda for purely confirmatory
research and attempted to replicate the results published by Bem (2011). They could
stop their inquiry at 200 sessions with 100 subjects (see their figure 2; ibid., 636),
because the results obtained up to this point lent the null hypothesis a degree of
support 6.2 times larger than that lent to the alternative hypothesis, under the
knowledge-based prior with a more acceptable distribution of the observed effects
(Bem, Utts & Johnson, 2011).
“substantial evidence” for the null-hypothesis (Jeffreys, 1961; Wagenmakers et al.,
2011).
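
Wagenmakers et al. (2012) computed their Bayes factor with a t-test-based procedure and a knowledge-based prior; the simplified sketch below (our illustration, using a binomial hit rate with a Beta prior whose parameters are hypothetical stand-ins) serves only to show how such comparative support for the null over the alternative is obtained, and how strongly it depends on the prior placed on the alternative.

```python
import numpy as np
from scipy.special import betaln

def bf01_binomial(k, n, a, b):
    """Bayes factor BF01 for H0: p = 0.5 against H1: p ~ Beta(a, b).

    Both marginal likelihoods omit the binomial coefficient, which cancels
    in the ratio; the H1 marginal uses the beta-binomial closed form.
    """
    log_m0 = n * np.log(0.5)
    log_m1 = betaln(k + a, n - k + b) - betaln(a, b)
    return np.exp(log_m0 - log_m1)

# Illustrative data: 102 hits in 200 trials, i.e. close to chance level.
k, n = 102, 200

# A vague prior on the alternative versus a prior concentrated near small
# positive effects (hypothetical Beta parameters).
print("BF01 under a flat Beta(1, 1) prior:       ",
      round(bf01_binomial(k, n, 1.0, 1.0), 2))
print("BF01 under an informed Beta(106, 94) prior:",
      round(bf01_binomial(k, n, 106.0, 94.0), 2))
```

Values above 1 favor the null; how far above depends on the prior on the alternative, which is why a "substantial evidence" threshold only acquires meaning relative to a specified alternative hypothesis.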
The method of hypothesis testing that Wagenmakers et al. employed is based on the
sequential analysis developed by Wald (1947). As a tool within a research strategy,
this method oscillates between the context of discovery and the context of
justification. The method cannot reflect a purely confirmatory strategy, because
neither the alpha error (the chance of a false positive result) nor the beta error (the
chance of a false negative result) in sequentially testing the two specified hypotheses
(p = 0.50 against p = 0.531 in Wagenmakers et al.'s case, based on Bem's (2011)
results) is known if the number of observations varies with the observed results.
Without knowledge of these errors, the replicability of, and the trust in, the observed
empirical result as a foundation for justification are forfeited. The scientific
community needs trustworthy data to test its theories, and such trustworthiness is
owed in large part to the knowledge of errors (Mayo, 1996).
The crucial question is whether a given test condition is strong enough to provide a
clear justification of the null hypothesis. After all, the same sequential testing
strategy would have resulted in the acceptance of the alternative hypothesis if one
had simply stopped testing after about 38 sessions, as can be read off the curve in
figure 2 (see Wagenmakers et al., 2012, 636; Simmons et al., 2011 illustrate the
same problem). If we accept an effect size of g=3.1% as a theoretical specification
originating from empirical results obtained in the past (see Bem, 2011, 409,
experiment 1), then we must devise a test condition which can in principle provide
us with a clear answer. This condition has to be specified by the α- and β-error.
Neyman-Pearson theory defines the number of observations (subjects, sessions) that
are necessary to decide between two hypotheses separated by g = 3.1% (50%
against 53.1% according to Bem (2011, experiment 1), with α = β = 0.05). This test
pits the null against a specified alternative hypothesis, i.e., the β-error has to be
taken as seriously as the α-error.
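
The point that optional stopping leaves the error rates unspecified can be illustrated with a small Monte Carlo sketch (ours; the stopping thresholds, the Beta prior, and the binomial Bayes factor are simplified, hypothetical stand-ins for the t-test-based procedure actually used). It estimates how often a sequential Bayes-factor rule decides for H0 even though the alternative p = 0.531 is true, an error frequency that is nowhere fixed in advance by the design.

```python
import numpy as np
from scipy.special import betaln

rng = np.random.default_rng(seed=1)

def bf01(k, n, a=106.0, b=94.0):
    """Simplified BF01 for H0: p = 0.5 vs H1: p ~ Beta(a, b) (hypothetical prior)."""
    return np.exp(n * np.log(0.5) - (betaln(k + a, n - k + b) - betaln(a, b)))

def sequential_run(p_true, stop_h0=6.0, stop_h1=6.0, n_max=2000, check_every=10):
    """Accumulate Bernoulli(p_true) trials; stop when BF01 crosses a threshold."""
    k = 0
    for n in range(1, n_max + 1):
        k += int(rng.random() < p_true)
        if n % check_every == 0:
            bf = bf01(k, n)
            if bf >= stop_h0:
                return "H0"
            if bf <= 1.0 / stop_h1:
                return "H1"
    return "undecided"

# Under the alternative (p = 0.531), how often does optional stopping
# nevertheless end in a decision for the null?
decisions = [sequential_run(0.531) for _ in range(500)]
for outcome in ("H0", "H1", "undecided"):
    print(f"{outcome:>9}: {decisions.count(outcome) / len(decisions):.2f}")
```

Changing the thresholds, the prior, or the maximum number of sessions changes these frequencies, which is precisely why sequential monitoring by itself does not deliver the known α- and β-errors required in the context of justification.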


When employing this kind of empirical rigor as a decision criterion, the null
hypothesis must, in order to be accepted, be 19 times more probable than the
alternative hypothesis, given that the selected strength of the experimental condition
(0.95/0.05) counts as trustworthy enough for a scientifically acceptable decision. In
Jeffreys’s (1961) classification for the plausibility of a hypothesis, this is called
“strong evidence” for H0. So, how many observations are necessary to support the
null hypothesis in the context of justification? The answer is 2828 (see Cohen, 1977,
169 on calculating the number of observations). If we want to be sure, that is, sure
under specified α- and β-errors, then we incur an extreme number of data points
(observations, subjects, sessions, etc.). The main reasons for this large number are
the very small effect size separating the two rival hypotheses and the rigorous
control of both errors.
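
The order of magnitude of this figure can be checked with the standard normal-approximation formula for two specified proportions (our back-of-the-envelope sketch; the exact value of 2828 comes from Cohen's tables and depends on the approximation used):

```python
from scipy.stats import norm

# Normal-approximation sample size for deciding H0: p = 0.50 against
# H1: p = 0.531 with alpha = beta = 0.05 (one-sided). The figure of 2828
# quoted in the text comes from Cohen's (1977) tables; this rough check
# lands in the same range (about 2800), with small differences depending
# on the approximation.
p0, p1 = 0.50, 0.531
alpha = beta = 0.05

z_a, z_b = norm.ppf(1 - alpha), norm.ppf(1 - beta)
numerator = (z_a * (p0 * (1 - p0)) ** 0.5 + z_b * (p1 * (1 - p1)) ** 0.5) ** 2
n = numerator / (p1 - p0) ** 2
print(f"Required number of observations: {n:.0f}")
```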
If so, then it cannot suffice to publish the testing strategy before, and the stopping
rule only after, inspecting the data obtained. In comparison, sequential testing is
certainly not as problematic as "double-dipping" (Kriegeskorte et al., 2009), which
inflates the α-error. Nevertheless, sequential testing does necessarily inflate the β-error
(under fixed effects), and so increases the chance that a true difference is not
detected in the empirical condition used. In the case at hand, the β-error is at least
β=0.58 with α=0.05 (one-sided), g=0.05 and n=200 (see Cohen, 1977, 155). Here,
the size of the beta error is incompatible with an acceptable empirical condition that
can decide between the two specified hypotheses.
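
The size of this β-error can likewise be approximated directly (our sketch; Cohen's tables give the slightly different value of 0.58 quoted above):

```python
from scipy.stats import norm

# Post-hoc beta error for a one-sided test of H0: p = 0.50 against
# H1: p = 0.55 (g = 0.05) with alpha = 0.05 and n = 200, via the normal
# approximation; Cohen's (1977, 155) tables give roughly 0.58.
p0, p1, alpha, n = 0.50, 0.55, 0.05, 200

# Critical proportion above which H0 would be rejected at alpha.
crit = p0 + norm.ppf(1 - alpha) * (p0 * (1 - p0) / n) ** 0.5
# Probability of staying below that critical value although H1 is true.
beta = norm.cdf((crit - p1) / (p1 * (1 - p1) / n) ** 0.5)
print(f"beta at n = {n}: {beta:.2f}")  # close to 0.6, i.e. a likely miss
```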
Generally, in the context of justification and for the purpose of a (more or less)
definite decision between two specified hypotheses, the effect size is known and the
α- and β-error can be chosen as tolerable. But doing so effectively fixes the number
of necessary observations. Specifically, the smaller the effect size, the larger is the
number of necessary observations. Therefore, theories that can only predict small
effects must be confronted with a large number of observations, much larger than is
normally the case in our empirical studies. Conversely, given the typically small
effect sizes of our theories and the typically low number of observations reported in
our journals, few published effects can even hope to count as replicable, i.e., as real
discoveries.
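
The same approximation makes the trade-off explicit: halving the predicted effect roughly quadruples the number of observations needed for a decision with α = β = 0.05 (our illustrative sketch, reusing the normal approximation from above):

```python
from scipy.stats import norm

# Required n for deciding H0: p = 0.50 against H1: p = 0.50 + g with
# alpha = beta = 0.05 (one-sided), for several predicted effect sizes g.
p0, alpha, beta = 0.50, 0.05, 0.05
z_a, z_b = norm.ppf(1 - alpha), norm.ppf(1 - beta)

for g in (0.20, 0.10, 0.05, 0.031, 0.01):
    p1 = p0 + g
    numerator = (z_a * (p0 * (1 - p0)) ** 0.5 + z_b * (p1 * (1 - p1)) ** 0.5) ** 2
    print(f"g = {g:.3f}  ->  n of about {numerator / g ** 2:.0f}")
```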
What, then, is the way out? Currently, we by and large fail to specify the alternative
hypothesis and hope to find something significant among our results; that failing, we
move on to torture the data. Such statistically significant results are then declared to
be scientifically significant observations. This is the predominant strategy in
psychology, in contrast to the strategies employed in biology, physics, and sociology (Witte &
Strohmeier, 2013). Predictably, one finds a huge number of underpowered studies in
our journals (Bakker, van Dijk & Wicherts, 2012)—studies which, by their very
design, cannot even hope to produce real and replicable discoveries. This problem
frequently arises when a significant result is obtained, but the power of the observed
effect is nevertheless too low (