
## What Does It Mean for a Replication To Be Successful?

What should be the metric for deciding whether a given replication has succeeded in replicating the original? I thought the question through to try to understand it and came up with these notes.

Study A (“Original”) finds an effect of 8 with a 95% confidence interval of [2, 14]. By itself, it rejects the null of no effect.

Study B (“Replicator”) finds an effect of 3, with a confidence interval of [-3, 9]. By itself, it fails to reject the null of no effect. It also fails to reject the null of an effect of 8.
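The three test results above can be checked with a short sketch. It assumes the estimates are approximately normal and that the intervals are symmetric 95% intervals, so the standard error can be backed out from the interval width. The function names `se_from_ci` and `rejects` are my own, for illustration only.

```python
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value, about 1.96


def se_from_ci(lo, hi):
    """Standard error implied by a symmetric 95% interval (normality assumed)."""
    return (hi - lo) / (2 * Z95)


def rejects(estimate, se, null_value, alpha=0.05):
    """Two-sided z-test of the null hypothesis effect == null_value."""
    z = (estimate - null_value) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return p < alpha


se_a = se_from_ci(2, 14)   # Study A: effect 8, CI [2, 14]
se_b = se_from_ci(-3, 9)   # Study B: effect 3, CI [-3, 9]

a_rejects_zero = rejects(8, se_a, 0)    # True: A rejects "no effect"
b_rejects_zero = rejects(3, se_b, 0)    # False: B cannot reject "no effect"
b_rejects_eight = rejects(3, se_b, 8)   # False: B cannot reject "effect of 8"

print(a_rejects_zero, b_rejects_zero, b_rejects_eight)
```

Under these assumptions both studies have the same standard error (the intervals are equally wide), which is why the disagreement comes entirely from the point estimates.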

Has the replication succeeded?

Dan Gilbert, Gary King, Stephen Pettigrew, and Tim Wilson (2016) say YES. Srivastava (2016) is unclear.

Gilbert et al. (2016): http://science.sciencemag.org/content/351/6277/1037.2

Srivastava (2016): “Evaluating a new critique of the Reproducibility Project”

Gilbert et al. can’t be right. They are saying that Study A, which finds an effect, is *replicated* by Study B, which does not. Or, more precisely, they are saying that we *should accept* Study A’s conclusion that an effect exists, because Study B does not offer enough evidence to reject it. But that more precise rendering is not a good match for what we mean by “replicate”. If we use the word that way, we must also say that Study A has replicated Study B, and we should accept Study B’s conclusion that no effect exists because Study A does not offer enough evidence to reject it. But we can’t accept both conclusions.

If Gilbert et al. want to be picky, they can say that to be even more precise, we *cannot reject* Study A’s conclusion that an effect exists, because Study B does not offer enough evidence to reject it. But they also have to say that we cannot reject Study B’s conclusion that no effect exists. And if we can’t reject any of the possible conclusions, we’re back to where we were before anybody did any studies, and we have to say that the discipline hasn’t discovered anything on this subject.

The Gilbert et al. definition shifts the burden of proof from the claim that an effect exists onto the replication. The burden of proof is crucial in law and in classical statistics, so shifting it is not the way to do things.

Rather, we should proceed one of two ways.

(1) The simplest way, methodologically, is to combine the data in Studies A and B and see if the conclusion is the same as in Study A.

That’s not what we mean in ordinary language by the word “replication”, but it is what we mean by “confirmation” or “support”.
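Here is a rough sketch of method (1) applied to Studies A and B. I don't have the raw data, so instead of literally merging the samples it pools the two estimates with inverse-variance (fixed-effect) weights. That substitution is an assumption of mine, as is the normal approximation used to recover the standard errors; `pool_fixed_effect` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # about 1.96


def pool_fixed_effect(estimates, ses):
    """Inverse-variance weighted mean of the estimates and its standard error."""
    weights = [1 / se ** 2 for se in ses]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = sqrt(1 / sum(weights))
    return pooled, pooled_se


# Both intervals are 12 wide, so both studies imply the same standard error.
se = (14 - 2) / (2 * Z95)

pooled, pooled_se = pool_fixed_effect([8, 3], [se, se])
ci = (pooled - Z95 * pooled_se, pooled + Z95 * pooled_se)

print(pooled)  # 5.5, up to floating-point rounding
print(ci)      # roughly (1.26, 9.74)
```

With these numbers the pooled interval excludes zero, so under method (1) the combined evidence reaches the same conclusion as Study A, even though Study B alone does not.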

(2) The closest to what we mean in ordinary language is to repeat the study and see if it comes to the same conclusion.

In that case, Study B above fails to replicate Study A. The problem relative to (1) is that each replication adds more data points, increasing the power of the combined test, so we could have ten replications that each narrowly miss significance at the 5% level, and so each fail under (2), yet together confirm Study A under (1).
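To see how methods (1) and (2) can disagree, here is a small sketch with made-up numbers. It assumes ten independent replications that each produce a z-statistic of 1.9 (a hypothetical value chosen to fall just short of 1.96), and it combines them with Stouffer's method, which is one standard way to pool z-statistics and not necessarily the one a pooled-data analysis would use.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical: ten replications, each just short of the 1.96 cutoff.
z_values = [1.9] * 10

p_each = [two_sided_p(z) for z in z_values]
each_significant = [p < 0.05 for p in p_each]

# Stouffer's method: sum of z-statistics divided by sqrt(number of studies).
z_combined = sum(z_values) / sqrt(len(z_values))
p_combined = two_sided_p(z_combined)

print(any(each_significant))  # False: every replication "fails" under (2)
print(p_combined < 0.05)      # True: together they are strongly significant
```

Each individual p-value is a bit under 0.06, while the combined z-statistic is about 6, so counting individual failures and pooling the evidence give opposite verdicts.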

It would be easy to report both methods 1 and 2.

What Gilbert et al. and Srivastava are looking at is more complicated, because they are looking at, say, 100 different Study A’s on 100 different topics. I’ll leave that for another day.
