These People Are Trying To Fix A Huge Problem In Science

    The results of too many scientific studies aren't standing up to scrutiny. Here's how a group of scientists think they can (partly) change that.

    Parts of science are in crisis. Big, important findings of previous studies – things we thought we knew – have failed to stand up to scrutiny: When other scientists re-created the experiments, the same results didn't appear.

    The problem is widespread. One major study, led by the psychologist Brian Nosek of the University of Virginia, tried to replicate the findings of 100 psychological experiments; fewer than half of them stood up to scrutiny. No field, from physics to economics, has been entirely unaffected, although the crisis is felt most keenly in medical science and psychology.

    Some of the best-known psychological studies ever have failed to replicate. For instance, a famous 1988 study found that facial expressions affect our mood – people who held a pencil between their teeth, forcing them to smile, apparently found cartoons funnier. It had a huge impact and was cited more than 1,500 times by other researchers – but last year a major replication attempt failed to find any effect. In another instance, researchers found in 2010 that "power poses" – standing in ways associated with dominance – made people more assertive. That study, too, was hugely influential, and a TED talk by one of its authors has been viewed more than 42 million times – but again, it has failed to replicate, and another of the authors has said she doesn't believe the "power pose" effect is real. These are two of the most high-profile examples, but there are hundreds more.

    In an upcoming paper in the journal Nature Human Behaviour, available in preprint form on the site PsyArXiv, 72 scientists – including Nosek – have suggested a partial solution. Marcus Munafo, a professor of experimental psychology at the University of Bristol and one of the authors of the paper, told BuzzFeed News: "We're past the point where we can just highlight the problem and say how terrible it is. We need to think about ways in which we can improve the quality of what we do." Their suggestion is to make it much harder to declare that you've found a "statistically significant" result in the first place.

    The proposal has sparked significant controversy, but on balance, the scientists BuzzFeed News spoke to were in favour. "I'm still absorbing it," Sanjay Srivastava, a professor of psychology at the University of Oregon, who was not involved in the study, told BuzzFeed News. "But I think it has a lot going for it."

    "It raises a very important issue," David Spiegelhalter, Winton professor of the public understanding of risk at the University of Cambridge, told BuzzFeed News. "It's crude, but I've got some sympathy for it."

    To explain what it means, we're going to have to discuss how science works. At its most basic level, science is simple: A scientist puts forward a hypothesis, then tests it by collecting data. She could, for instance, put forward the hypothesis "this die is loaded so it always rolls a six" and test it by rolling the die.

    If she rolls a six, that doesn't prove the die is loaded. It could have been a fluke. If she rolls it again and it's a six again, that doesn't prove it either. In fact, she could keep rolling sixes forever, and she'd still never strictly prove the die is loaded.

    What you can say is how likely it is that you'd see the results you're getting if the die isn't loaded. On a fair die, there's a 1 in 6 chance you'd roll a six. So if your hypothesis is wrong – in technical language, if the "null hypothesis" is true – there's a 1/6 probability you'd see a six anyway. There's a 1/36 chance that you'd see two sixes in a row on a fair die, and a 1/216 chance that you'd see three in a row.

    This probability is known as the "p-value", and it's usually written as a decimal between 0 and 1: For instance, something that you'd see 1 time in 10 if the null hypothesis is true is written as p=0.1, i.e. 1 divided by 10.

    By convention, in science a finding is considered "statistically significant" if you'd see that result no more than 1 time in every 20 if the null hypothesis were true: if p=0.05 or less. In our die experiment, that would mean that a single roll of the die wouldn't be enough. Rolling one six on a fair die has a p-value of 0.1667 (1 divided by 6), much higher than 0.05. But if you rolled two sixes in a row, you could comfortably declare the result statistically significant, at p=0.0278 (1 divided by 36).

    The Nature Human Behaviour paper proposes a simple but stark change. Instead of a p-value of 0.05 being considered the threshold for significance, it should be moved to 0.005, or 1 in 200. In die-rolling terms, that means that two sixes in a row would not be enough – but a third would take you to p=0.0046, just below the threshold. The scientists say that results between p=0.05 and p=0.005 should be considered "suggestive", but not enough to declare a discovery.
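
    For readers who want to check the dice arithmetic for themselves, here is a rough Python sketch of the numbers above; the die is only this article's illustration, not anything taken from the paper.

        # A quick check of the die-rolling arithmetic: the p-value for n sixes
        # in a row on a fair die, against the current 0.05 threshold and the
        # proposed 0.005 one.

        for n in range(1, 4):
            p = (1 / 6) ** n                      # chance of n sixes if the die is fair
            print(f"{n} six(es): p = {p:.4f}  "
                  f"passes 0.05? {p <= 0.05}  passes 0.005? {p <= 0.005}")

        # 1 six(es): p = 0.1667  passes 0.05? False  passes 0.005? False
        # 2 six(es): p = 0.0278  passes 0.05? True  passes 0.005? False
        # 3 six(es): p = 0.0046  passes 0.05? True  passes 0.005? True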

    The problem with the p=0.05 threshold, the authors say, is that it's too lax. People assume that a p=0.05 finding means there's a 95% chance that the finding is real, but in fact it's much less robust than that. Essentially, that's because true hypotheses are pretty rare.

    Go back to the dice. If you've ever played a game with dice, you'll have rolled a few double sixes – possibly even a few triple sixes. But you probably didn't assume the dice were loaded. That's because there aren't very many loaded dice around.

    Imagine that one die in every 100 is loaded, and that loaded dice roll a six almost every time. That one in 100 is your "prior probability rate". Say you take 1,000 dice and try to find the loaded ones by rolling each of them twice. There would be 10 loaded dice, and they'd each roll two sixes. But also, of the 990 fair dice, on average, 28 of them would roll two sixes just by chance. Your p-value of 0.0278 – what looks like a 1 in 36 chance of being wrong – translates in this example to barely a 1 in 4 chance of actually having found a loaded die. Out of 38 positive results, only 10 would be real. The rest would be "false positives".
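
    Here is that back-of-the-envelope calculation as a short Python sketch, using the made-up figures above (1 loaded die per 100, two rolls each, and loaded dice treated as always rolling sixes).

        # The loaded-dice example, worked through with expected counts. The
        # 1-in-100 rate and "loaded dice always roll sixes" are this article's
        # illustrative assumptions, not figures from the paper.

        total_dice = 1000
        loaded = total_dice // 100                # 1 die in every 100 is loaded
        fair = total_dice - loaded                # 990 fair dice

        true_positives = loaded                   # every loaded die rolls two sixes
        false_positives = fair * (1 / 36)         # fair dice do so 1 time in 36: 27.5 on average

        positives = true_positives + false_positives
        print(f"expected double-sixes: {positives:.0f} "
              f"({true_positives} loaded, {false_positives:.0f} fair)")
        print(f"share of double-sixes from loaded dice: {true_positives / positives:.0%}")

        # expected double-sixes: 38 (10 loaded, 28 fair)
        # share of double-sixes from loaded dice: 27%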

    This isn't an abstract point – it matters in lots of areas, including screening for diseases. "If you screen for Alzheimer’s, you find you get lots of false positives, because it’s relatively rare in the population," David Colquhoun, a professor of pharmacology at University College London who has written extensively about p-values, told BuzzFeed News. If a test is 90% accurate, but only 1% of the people you're testing have Alzheimer's, then about 90% of your positives will be false.
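
    A rough sketch of that screening arithmetic, taking "90% accurate" to mean the test catches 90% of real cases and wrongly flags 10% of healthy people – one reasonable reading of the figure, not Colquhoun's exact model:

        # Screening a population where only 1% actually have the disease, with a
        # test assumed to have 90% sensitivity and 90% specificity.

        population = 10_000
        sick = population * 0.01                  # 1% prevalence: 100 people
        healthy = population - sick               # 9,900 people

        true_positives = sick * 0.90              # 90 real cases flagged
        false_positives = healthy * 0.10          # 990 healthy people wrongly flagged

        share_false = false_positives / (true_positives + false_positives)
        print(f"share of positive results that are false: {share_false:.0%}")
        # share of positive results that are false: 92%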

    But it also matters in research. Not every hypothesis you test will be right – obviously, or you wouldn't be testing them. "It's hard to estimate, and obviously it differs by field," says Munafo, but the Nature Human Behaviour paper estimates that in psychology the prior probability rate is about 1 in 11 – that is, only one hypothesis in every 11 tested turns out to be correct. Given that, and no other problems, 1 in every 3 positive results at p=0.05 would be false. But lowering the threshold to p=0.005 would, if the 1-in-11 estimate is right, cut the rate of false positives to 1 in 20.
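
    Those figures follow from the same kind of arithmetic as the dice, if you make the simplifying assumption that a study always detects a real effect when one is there (perfect statistical power). A rough sketch – the function name here is just for illustration, not anything from the paper:

        # Expected share of false positives among "significant" results when only
        # 1 hypothesis in 11 is true, assuming for simplicity that real effects
        # are always detected (perfect power). An idealisation, not the paper's
        # exact calculation.

        prior_true = 1 / 11                       # rough estimate for psychology
        prior_false = 1 - prior_true

        def false_positive_share(alpha, power=1.0):
            false_pos = prior_false * alpha       # nulls that clear the threshold by luck
            true_pos = prior_true * power         # real effects that clear it
            return false_pos / (false_pos + true_pos)

        print(f"at p<0.05:  {false_positive_share(0.05):.0%} of positives are false")
        print(f"at p<0.005: {false_positive_share(0.005):.0%} of positives are false")
        # at p<0.05:  33% of positives are false
        # at p<0.005: 5% of positives are false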

    These numbers assume that there are no other problems in science. But in fact there are lots – most notably the problem of "outcome switching": that is, changing your hypothesis after the data has come in. If you test your data for 20 different hypotheses, then simply by chance you'd expect to find a 1-in-20 fluke, and there's your p=0.05. "Anyone can make a p-value small using various tricks like that," says Spiegelhalter.
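
    A quick illustration of how easily those flukes arise, under the simplifying assumption that the 20 tests are independent:

        # The chance of at least one p < 0.05 "hit" across 20 hypotheses that are
        # all actually false, if the tests are independent.

        tests = 20
        chance_of_fluke = 1 - 0.95 ** tests
        print(f"chance of a spurious 'significant' result: {chance_of_fluke:.0%}")
        # chance of a spurious 'significant' result: 64%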

    That's why he and others – including the authors – think that while this is important, it's not the only or even the main solution to science's replication problem. "Dealing with the size of the p-value fixes some things," he says. "But it's not dealing with the most important issues." What would really help, he says, is if researchers distinguished between simply exploring, looking for interesting stuff – when a p=0.05 result is definitely worth noting – and confirming a discovery, which should need a much more stringent test. In physics and some branches of genetics, this is done already: The Higgs boson discovery was announced at a p-value of about 0.0000003 – the particle physicists' "five sigma" standard, more than 100,000 times stricter than the usual threshold for "significance".

    Munafo says that lowering the threshold will make these "p-hacking" tricks harder. "It's much harder to p-hack over a 0.005 threshold than over a 0.05 threshold," he said. "You could do it if the p-value was at 0.006, but that's pretty good evidence already."

    But it can, he says, only be part of the solution, alongside things such as preregistration of hypotheses to prevent outcome switching, and encouraging scientists to replicate each other's findings – something that rarely happens at the moment, because journals are more likely to publish novel, exciting science. That makes science less reliable, he says, because it incentivises scientists to "take a punt on wacky ideas".

    Not everyone agrees with the move. Colquhoun says that, first, it underestimates the problem: Taking other statistical facts into account, the risk of false positives is much higher than the Nature Human Behaviour paper acknowledges. He has written a full response to the paper, in which he argues that in situations where Munafo and his colleagues put the false positive rate at 5%, it could be as high as 24%. And second, he says, it replaces one arbitrary threshold with another. Instead, he says, scientists should state how likely their hypothesis would have to be – their prior probability rate – to get an acceptable false positive rate given the p-value they've got.

    Munafo agrees to some degree. "I've avoided using the word 'significance' for the last few years," he says. "There's an inherent problem with this bright-line threshold. It gives an impression of a dichotomy, when in reality it's a continuous spectrum." He thinks scientists need more statistical training and a more nuanced approach to statistics.

    But reducing the threshold, he says, is a pragmatic, realistic step. And it may have positive knock-on effects. "In genomics," he says, "once it became possible to test whole genomes, people realised we needed a strong threshold." That threshold was set at p=0.00000005. "And to get there, people realised they needed bigger sample sizes. So they started sharing data worldwide, and they've produced really robust findings. And data-sharing became routine. These positive effects cascaded down."

    CORRECTION

    Brian Nosek is a psychologist. An earlier version of this piece misstated his specialty.