[Updated Jan. 2020 to cover unified definition, and clarify]


Replication means repeating a previous study to see whether you get the same results. Ideally this happens a lot in science; in practice, not so much. But what counts as a replication?

According to Brian Nosek (“What is replication?”, 2019), people commonly define replication as repeating a study’s procedure and observing whether the previous finding recurs. But this definition fails, because every replication changes the procedure in some way, and deciding which changes matter is itself part of what defines the replication.

Consider: when replicating an Israeli study in the US, he didn’t use the original materials – they were in Hebrew! Replications change participants, campus, country, language, and so on. But if any of those changes negated something like the Stroop effect, that would be a surprising and important discovery.

So, Nosek argues that replication is really a conceptual notion, not a procedural one.

(He says “no a priori reason”: reasons made up after the fact don’t count. We also assume a competent, good-faith replication.)

Claiming something is a replication “is a theoretical commitment.”

Practical Definition for DARPA SCORE

Originally, we distinguished between direct replications and data-analytic replications, and privileged direct replications. Direct replications test the original claim by gathering new data – for example, rerunning a psychology experiment with new participants.

Data-analytic replications test the original claim using new “found” data: data appropriate for the replication but not collected by the replication team itself. For example, using the same economic indicator as the original study, but from a different time period.

But direct replication is infeasible for most claims in economics, sociology, and political science. So, as of February 2020 (Round 6), we will adopt DARPA’s new unified definition, which counts both direct and data-analytic replications.

Unifying the definition increases the number of evaluated claims in SCORE from 100 to 250. (However, the R1–R5 markets will pay prizes only for direct replications, because that is what participants were told at the time.)

Sample Size & Statistical Power

A decent replication needs a sample large enough that a failure to detect the effect is almost certainly due to the claim being wrong, rather than to not looking hard enough. If we redo the study with only 3 participants, the result is almost certainly noise.

We cannot simply reuse the original sample size: most published studies are themselves underpowered. Instead, we need to choose a replication sample size that guarantees sufficient statistical power.
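To make this concrete, here is a minimal sketch of how a required sample size might be computed. It uses the standard normal-approximation formula for a two-sample comparison, n per group = 2 · ((z₁₋α/2 + z_power) / d)², where d is the standardized effect size (Cohen’s d). The function name and the example effect sizes are illustrative assumptions, not part of the SCORE procedure:

```python
# Sketch: per-group sample size needed to detect a standardized effect d
# at significance level alpha with the given power, via the normal
# approximation for a two-sample test. Illustrative only.
from math import ceil
from statistics import NormalDist

def n_per_group(d: float, alpha: float = 0.05, power: float = 0.90) -> int:
    """Per-group n to detect Cohen's d with a two-sided test."""
    z = NormalDist().inv_cdf  # standard normal quantile function
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) / d) ** 2)

print(n_per_group(0.5))  # a "medium" effect
print(n_per_group(0.2))  # a "small" effect needs far more data
```

Note how quickly the requirement grows as the effect shrinks: halving the effect size roughly quadruples the needed sample, which is why replications of weak published effects often need samples much larger than the originals.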
