Mon Apr 8

The replication crisis is good for science

Written by Eric Loken, Assistant Professor of Educational Psychology, University of Connecticut

Science is in the midst of a crisis: A surprising fraction of published studies fail to replicate when the procedures are repeated.

For example, take the study, published in 2007, that claimed that tricky math problems requiring careful thought are easier to solve when presented in a fuzzy font ^[1]. When researchers found in a small study that using a fuzzy font improved performance accuracy, it supported a claim that encountering perceptual challenges could induce people to reflect more carefully.

However, 16 attempts to replicate the result failed ^[2], definitively demonstrating that the original claim was erroneous. Plotted together on a graph, the studies formed a perfect bell curve centered around zero effect. As is frequently the case with failures to replicate, of the 17 total attempts, the original had both the smallest sample size and the most extreme result.

The Reproducibility Project, a collaboration of 270 psychologists, has attempted to replicate 100 psychology studies ^[3], while a 2018 report ^[4] examined studies published in the prestigious scholarly journals Nature and Science between 2010 and 2015. These efforts find that about two-thirds of studies do replicate to some degree, but that the strength of the findings is often weaker than originally claimed.

Is this bad for science? It’s certainly uncomfortable for many scientists whose work gets undercut, and the rate of failures may currently be unacceptably high. But, as a psychologist and a statistician, I believe confronting the replication crisis is good for science as a whole.

Practicing good science

First, these replication attempts are examples of good science operating as it should. They are focused applications of the scientific method, careful experimentation and observation in the pursuit of reproducible results.

Many people incorrectly assume that, due to the “p<.05” threshold for statistical significance, only 5% of discoveries will prove to be errors. However, 15 years ago, physician John Ioannidis pointed to some fallacies in that assumption, arguing that false discoveries made up the majority of the published literature ^[5]. Replication efforts are confirming that the false discovery rate is much higher than 5%.
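To see why a 5% significance threshold does not cap the share of false discoveries at 5%, it helps to work through the arithmetic. The sketch below is a simplified illustration, not Ioannidis's full model. The 10% share of true hypotheses and the 50% statistical power are assumed values chosen only for the example, and the function name is hypothetical.

```python
def false_discovery_rate(prior: float, power: float, alpha: float = 0.05) -> float:
    """Share of statistically significant findings that are false positives.

    prior: assumed fraction of tested hypotheses that are actually true
    power: assumed probability that a study detects a true effect
    alpha: significance threshold (the familiar p < .05)
    """
    false_positives = alpha * (1.0 - prior)  # true nulls that still cross the threshold
    true_positives = power * prior           # real effects that reach significance
    return false_positives / (false_positives + true_positives)


# Hypothetical field: 10% of tested hypotheses are true, studies have 50% power.
print(round(false_discovery_rate(prior=0.10, power=0.50), 3))  # 0.474
```

Under these assumed inputs, nearly half of the significant results would be false discoveries, even though every one of them passed the p<.05 test. Different assumptions about the prior and the power give different answers, which is the point: the threshold alone does not determine the error rate.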

Awareness about the replication crisis appears to be promoting better behavior among scientists. Twenty years ago, the cycle for publication was basically complete after a scientist convinced three reviewers and an editor that the work was sound. Yes, the published research would become part of the literature, and therefore open to review – but that was a slow-moving process.

Today, the stakes have been raised for researchers. They know that there’s the possibility that their study might be reviewed by thousands of opinionated commenters on the internet or by a high-profile group like the Reproducibility Project. Some journals now require scientists to make their data and computer code available, which makes it likelier that others will catch errors in their work. What’s more, some scientists can now “preregister” their hypotheses before starting their study – the equivalent of calling your shot before you take it.

Combined with open sharing of materials and data, preregistration improves the transparency and reproducibility of science, hopefully ensuring that a smaller fraction of future studies will fail to replicate.

While there are signs that scientists are indeed reforming their ways ^[6], there is still a long way to go. Out of the 1,500 accepted presentations at the annual meeting of the Society of Behavioral Medicine in March ^[7], only 1 in 4 of the authors reported using these open science techniques in the work they presented.

Improving statistical intuition

Finally, the replication crisis is helping improve scientists’ intuitions about statistical inference.

Researchers now better understand how weak designs with high uncertainty – in combination with choosing to publish only when results are statistically significant – produce exaggerated results. In fact, it is one of the reasons more than 800 scientists recently argued in favor of abandoning statistical significance testing ^[8].
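A small simulation can make this exaggeration concrete. The sketch below assumes a modest true effect of 0.2 standard deviations, 20 participants per group, and a simple two-sided test with a 1.96 cutoff. All of these numbers, and the function name, are illustrative assumptions rather than values from any of the studies discussed here.

```python
import math
import random
import statistics


def simulate_published_effects(true_effect, n_per_group, n_studies, seed=0):
    """Return the effect estimates that survive a 'publish only if significant' filter.

    Each study's estimate is drawn from a normal distribution centered on the
    true effect, with the standard error of a difference between two group
    means when each group has unit standard deviation.
    """
    rng = random.Random(seed)
    standard_error = math.sqrt(2.0 / n_per_group)
    cutoff = 1.96 * standard_error  # smallest estimate that reaches p < .05
    published = []
    for _ in range(n_studies):
        estimate = rng.gauss(true_effect, standard_error)
        if abs(estimate) > cutoff:  # nonsignificant results are never published
            published.append(estimate)
    return published


published = simulate_published_effects(true_effect=0.2, n_per_group=20, n_studies=20000)
print("studies published:", len(published))
print("true effect: 0.2, mean published effect:", round(statistics.mean(published), 2))
```

With a design this noisy, an estimate has to be roughly three times the true effect just to clear the significance cutoff, so the published results overstate the effect by construction. Raising `n_per_group` shrinks the standard error and narrows the gap.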

We also better appreciate how isolated research findings fit into the broader pattern of results. In another study, Ioannidis and oncologist Jonathan Schoenfeld surveyed the epidemiology literature ^[9] for studies associating 40 common food ingredients with cancer. There were some broad consistent trends – unsurprisingly, bacon, salt and sugar are never found to be protective against cancer.

But plotting the effects from 264 studies produced a confusing pattern. The magnitudes of the reported effects were highly variable. In other words, one study might say that a given ingredient was very bad for you, while another might conclude that the harms were small. In many cases, the studies even disagreed on whether a given ingredient was harmful or beneficial.

Each of the studies had at some point been reported in isolation in a newspaper or a website as the latest finding in health and nutrition. But taken as a whole, the evidence from all the studies was not nearly as definitive as each single study may have appeared.

Schoenfeld and Ioannidis also graphed the 264 published effect sizes. Unlike the fuzzy font replications, their graph of published effects looked like the tails of a bell curve. It was centered at zero with all the nonsignificant findings carved out. The unmistakable impression from seeing all the published nutrition results presented at once is that many of them might be like the fuzzy font result – impressive in isolation, but anomalous under replication.

The breathtaking possibility that a large fraction of published research findings might just be serendipitous is exactly why people speak of the replication crisis. But it’s not really a scientific crisis, because the awareness is bringing improvements in research practice, new understandings about statistical inference and an appreciation that isolated findings must be interpreted as part of a larger pattern.

Rather than undermining science, I feel that this is reaffirming the best practices of the scientific method.

References

^[1] are easier to solve when presented in a fuzzy font (doi.org)
^[2] 16 attempts to replicate the result failed (doi.org)
^[3] attempted to replicate 100 psychology studies (osf.io)
^[4] a 2018 report (doi.org)
^[5] made up the majority of the published literature (doi.org)
^[6] that scientists are indeed reforming their ways (fivethirtyeight.com)
^[7] at the annual meeting for the Society for Behavioral Medicine in March (plan.core-apps.com)
^[8] statistical significance testing (www.nature.com)
^[9] surveyed the epidemiology literature (doi.org)
