When testing hypotheses, there are two types of errors one can make: finding a relationship where one does not exist (false positive), and finding no relationship where one does exist (false negative). Statistical power measures how hard you looked. It is just your chance to find something if it’s there. Or if you like double negatives, it’s your chance to avoid a false negative. Higher power means less chance the guilty go free. (Thanks to CK for the metaphor.)

Statistical power depends on three factors:

  1. Statistical significance criterion (for example, 0.05 or 0.01). Setting a higher (less strict) threshold for statistical significance results in higher power, but also raises the chance of false positives. In our case this is fixed at 0.05, so we move on.
  2. Effect Size. Larger effects are easier to detect (have higher power). Again, in our case these are fixed.
  3. Sample Size. Larger sample sizes make it easier to see subtle or noisy effects (have higher power).
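
To make the three factors concrete, here is a minimal sketch of a power calculation. It assumes a two-sided, two-sample comparison with equal group sizes, a standardized effect size (Cohen's d), and a normal approximation; the function name `approx_power` and its parameters are illustrative, not part of any particular library.

```python
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    Assumptions (illustrative only): equal group sizes, a standardized
    effect size (Cohen's d), and a normal approximation to the test statistic.
    """
    z = NormalDist()  # standard normal distribution
    z_crit = z.inv_cdf(1 - alpha / 2)  # critical value for the chosen alpha
    # Expected value of the test statistic when the effect is real.
    shift = effect_size * (n_per_group / 2) ** 0.5
    # Probability of landing beyond either critical value.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


if __name__ == "__main__":
    # Factor 1: a looser significance criterion raises power.
    print(approx_power(0.5, 64, alpha=0.01), approx_power(0.5, 64, alpha=0.05))
    # Factor 2: a larger effect raises power.
    print(approx_power(0.2, 64), approx_power(0.5, 64))
    # Factor 3: a larger sample raises power.
    print(approx_power(0.5, 20), approx_power(0.5, 64))
```

Under these assumptions, a medium effect (d = 0.5) with 64 participants per group lands at roughly 80% power, and each pair printed above shows power rising as one factor changes while the others stay put.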

When interpreting the results of a hypothesis test, we must be aware that low statistical power can lead to invalid conclusions. Therefore, experiments often aim for a minimum of 80% power (or 0.8): an 80% chance to find the effect if it is there, or a 20% chance of a false negative. But aim doesn’t mean achieve. Real power is often much less than 80%. That affects the results of significance tests.


Given a statistically significant result (most of our claims!), low power leads to type M and type S errors (Gelman & Carlin, 2014). A type M (magnitude) error occurs when the measured effect exaggerates the true effect size. A type S (sign) error occurs when the measured effect is in the opposite direction to the true effect.
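
A small simulation can show why low power produces these errors. The sketch below is in the spirit of the design calculations in Gelman & Carlin (2014), but it is not their code: it assumes the estimate is normally distributed around a nonzero true effect with a known standard error, and the function name `design_errors` and the example numbers are illustrative.

```python
import random
from statistics import NormalDist, mean


def design_errors(true_effect, std_error, alpha=0.05, n_sims=20000, seed=0):
    """Simulate power, type S rate, and type M ratio for one design.

    Assumptions (illustrative only): the estimate is normally distributed
    around a nonzero true effect with a known standard error.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    # Imagine running the same study many times.
    estimates = [rng.gauss(true_effect, std_error) for _ in range(n_sims)]
    # Keep only the runs that reach statistical significance.
    significant = [e for e in estimates if abs(e) > z_crit * std_error]
    power = len(significant) / n_sims
    # Type S: share of significant results with the wrong sign.
    type_s = sum(1 for e in significant if e * true_effect < 0) / len(significant)
    # Type M: average exaggeration among significant results.
    type_m = mean(abs(e) for e in significant) / abs(true_effect)
    return power, type_s, type_m


if __name__ == "__main__":
    print("low power: ", design_errors(true_effect=0.1, std_error=1.0))
    print("high power:", design_errors(true_effect=3.0, std_error=1.0))
```

In the low-power design, the few results that cross the significance threshold overstate the true effect many times over and often point the wrong way. In the high-power design, significant results sit close to the truth and almost never have the wrong sign.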


Statistical power calculations are ideally performed before an experiment is conducted, and are often used to calculate the minimum sample size required to reach a statistical power threshold for a given effect size. When applying power calculations to replications, the original study can be used to inform the a priori power of the replication. For instance, the replication's sample size can be chosen to reach a target power for the original effect size (e.g. 95% power based on the original effect size), or set by increasing the original sample size by a given amount. As the characteristics and nature of each study are different, we rely on a set of rules to ensure that the replications are appropriately powered. (See “What is a high-quality replication?”)
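
As a rough illustration of a sample-size calculation, the sketch below inverts the usual normal-approximation power formula. It assumes a two-sided, two-sample comparison with equal group sizes and a standardized effect size; the function name `min_n_per_group` and the example effect size of 0.5 are hypothetical, and this is not the set of rules the project itself uses.

```python
from math import ceil
from statistics import NormalDist


def min_n_per_group(effect_size, power=0.8, alpha=0.05):
    """Smallest per-group sample size reaching the target power.

    Assumptions (illustrative only): two-sided, two-sample z-test with
    equal group sizes and a standardized effect size (Cohen's d).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for significance
    z_power = z.inv_cdf(power)          # quantile for the target power
    # Standard closed form: n = 2 * ((z_alpha + z_power) / d)^2, rounded up.
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    original_effect = 0.5  # hypothetical effect size from an original study
    print("80% power:", min_n_per_group(original_effect, power=0.80))
    print("95% power:", min_n_per_group(original_effect, power=0.95))
```

Asking for 95% rather than 80% power on the same effect size requires a noticeably larger sample. If the original study overstated the effect (a type M error), a replication sized this way will have less power than intended.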


  • Gelman, A., & Carlin, J. (2014). Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors. Perspectives on Psychological Science, 9(6), 641–651. https://doi.org/10.1177/1745691614551642

