Counterpoint: Pretty sure p-hacking is older than 50 years.
Yeah, the term was only coined recently. But ultimately p-hacking is "just" exploiting freedom in the possible interpretations of statistical results. Also, it wasn't really "invented"; it's just what humans do if they have data in front of them and some expectation of what the data should tell them. I'm pretty sure the first p-hacker was the first person to do any form of early statistics.
I guess the greatest invention would be a method to prevent p-hacking. (And yeah, I know about preregistration. It doesn't work the way it's done right now, because a) most people don't do it, b) preregistration is rarely precise enough to really prevent p-hacking and c) nobody cares if you switch your outcomes.)
What kind of solution is there to this? Only trust studies which have been successfully reproduced several times? Raise the bar for an acceptable p value so it becomes less practical to p-hack?
Preregistration of studies [0]. Submit design and models (e.g., logistic regression with these covariates and these exclusion criteria) before collecting data. This is relatively easy to implement and hard to get around, but does need to be paired with more acceptance of the importance of null results in advancing science.
This is the right answer. I worked in clinical trials for a while, and they're required to submit a Statistical Analysis Plan before the study is even allowed to proceed. After all the data are collected, you just run whatever was originally proposed and you get the answer you get.
This really should be the standard for any meaningful science.
The idea is that preregistration is done publicly, so a reviewer of a paper could easily see that you have registered multiple analyses and reject the research based on this.
Yes, that's entirely possible. My comment was made keeping in mind that the scientific norm is to have a single preregistration for a single experiment. If a researcher wants to test N hypotheses, the entire set of hypotheses, along with an appropriate p-value correction to account for them, would usually be contained in a single registration. And that would be perfectly fine.
Submitting a large number of individual registrations for a single set of data is just not the norm, so it would be suspicious to begin with. (Okay, I will admit that on the order of 2-3 registrations for very different categories of study may not be as strange; but certainly 20 would signal something very odd.)
You can, but that's not necessarily problematic in itself (perhaps slightly depending on what your aim is). By definition, using a statistical significance threshold of 0.05 means that you will have a 5% false positive rate. You always expect that _some_ results are erroneous (either false positives or false negatives), and the point of using p-values is merely to control the rate at which this happens.
That said, you can apply a multiple-comparisons correction, where you take into account the fact that you're doing multiple tests in the same go. In this case you'll end up being more conservative, and it's no longer the per-test false positive rate you'll be controlling; instead you'll control the family-wise error rate, which basically has to do with whether _any_ of your positive results are false positives.
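A minimal sketch of the simplest such correction, Bonferroni, using made-up p-values (numpy only): each of the m tests is held to the stricter threshold alpha/m, which bounds the family-wise error rate at alpha.

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Reject H0 only where the p-value clears the Bonferroni threshold
    alpha/m. By the union bound, the probability of ANY false positive
    across the m tests is then at most alpha."""
    pvals = np.asarray(pvals)
    m = len(pvals)
    return pvals < alpha / m

# Hypothetical p-values from 10 tests run on the same data set:
pvals = [0.001, 0.020, 0.049, 0.30, 0.70, 0.04, 0.11, 0.003, 0.60, 0.25]
print(bonferroni(pvals))  # only tests 1 and 8 clear the 0.05/10 = 0.005 bar
```

Note that the nominally "significant" p = 0.020 and p = 0.049 no longer count once the ten tests are treated as a family.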
It's much, much harder to lie with effect sizes. For instance, in genetics there are entire careers built on the association of a certain genetic polymorphism with a particular phenotype. Even if the p-value can be reproduced with huge sample sizes, the effect size of the association is usually tiny.
In other words, if you carry the risk allele the odds of developing a disease are usually only slightly bigger than if you didn't. But this is often not told to funders! Focusing on effect sizes would drive research effort to things that actually make a difference.
You can do effect sizes with something as simple as a confidence interval, or with a fully fledged Bayesian model that tries to estimate the posterior distribution of some parameter.
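As a minimal sketch of the confidence-interval route (the 2x2 counts below are made up): report the odds ratio for carriers vs. non-carriers of a hypothetical risk allele, with a 95% CI from the standard Woolf log-odds approximation, instead of just a p-value.

```python
import math

# Hypothetical 2x2 table: disease status by risk-allele carrier status.
#                disease  no disease
carrier     = [    530,      9470]
non_carrier = [    500,      9500]

a, b = carrier
c, d = non_carrier

odds_ratio = (a * d) / (b * c)

# 95% CI on the log odds ratio (Woolf approximation): the standard error
# of log(OR) is sqrt(1/a + 1/b + 1/c + 1/d).
se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
lo = math.exp(math.log(odds_ratio) - 1.96 * se)
hi = math.exp(math.log(odds_ratio) + 1.96 * se)
print(f"OR = {odds_ratio:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

With these invented numbers the odds ratio is only about 1.06 and the interval straddles 1, which is exactly the "statistically detectable but practically tiny" situation described above.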
Modern genome association studies rarely look for single-allele effects anymore. The current view is shifting towards polygenic, or sometimes even omnigenic (meaning the majority of genes can have indirect effects on the disease phenotype), associations. No one expects to see big effect sizes, because very few people will have differences at any one particular locus. So under these models, simple effect sizes are less important than p-values.
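Under that view, the quantity of interest is typically a polygenic score, which is nothing more than a weighted sum of allele counts across many loci, each with a tiny per-SNP weight. A minimal sketch with entirely made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: genotypes coded as 0/1/2 risk-allele counts per SNP,
# and per-SNP weights (e.g. log odds ratios reported by a GWAS).
genotypes = rng.integers(0, 3, size=(5, 1000))  # 5 people x 1000 SNPs
weights = rng.normal(0.0, 0.01, size=1000)      # tiny individual effects

# A polygenic score is just the weighted sum of allele counts:
scores = genotypes @ weights
print(scores)
```

No single weight matters much on its own; the aggregate score is what carries the (still modest) predictive signal.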
I think effect size is a very useful tool if you know some odds a priori, e.g. if you can compare the effect size with some known features/variables. It can then give you a rough feeling for how plausible the result is. If the features are as small as genetic variants, then people will have a hard time estimating the supposed effect (unless it's, e.g., a functional variant that alters the encoded protein).
Unfortunately, raising the bar for p by a factor of 10 roughly multiplies the required data set by 100, if the usual 1/sqrt(n) rule of thumb is a guide.
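The 1/sqrt(n) scaling behind that rule of thumb is easy to check by simulation: the standard error of a sample mean shrinks by 10x only when n grows by 100x.

```python
import numpy as np

rng = np.random.default_rng(0)

# Empirical standard error of the sample mean at two sample sizes,
# estimated from 2000 simulated studies each.
ses = {}
for n in (100, 10_000):
    means = [rng.normal(size=n).mean() for _ in range(2000)]
    ses[n] = float(np.std(means))

print(ses)  # roughly {100: 0.1, 10000: 0.01} -- a 10x gain costs 100x the data
```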
While impractical in many cases, I think the gold standard is to approach a problem from as many angles as possible, and relate it to disparate sources of knowledge, including theory.
For example, the accepted values of the physical constants are established by munging the results from a wide variety of different kinds of experiments, guided by theory. They don't just repeat the same experiment over and over. These constants are robust enough that you can hang your hat on them for virtually any practical use.
I think a lot of publications and scientists are now aware of this, so they adjust their experimental designs to avoid having a whole bunch of free parameters or a small sample/population size.
Technically, p-hacking isn't just ethically wrong, it's also statistically incorrect. If you're testing the same data against multiple hypotheses, or performing the same test on multiple data sets, you need a multiple-hypothesis-testing correction such as Bonferroni.
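A quick simulation (numpy only, all data generated under the null) of why the correction is needed: with 20 independent tests at alpha = 0.05, the chance of at least one spurious "discovery" is 1 - 0.95^20, about 64%.

```python
import numpy as np

rng = np.random.default_rng(42)

# Under the null hypothesis, p-values are uniform on [0, 1]. Simulate
# 10,000 "studies" that each run 20 independent tests on pure noise.
pvals = rng.uniform(size=(10_000, 20))

# Fraction of studies that report at least one "significant" result:
naive = (pvals < 0.05).any(axis=1).mean()           # uncorrected threshold
corrected = (pvals < 0.05 / 20).any(axis=1).mean()  # Bonferroni threshold

print(naive)      # around 0.64: most null studies find "something"
print(corrected)  # around 0.05: the family-wise error rate is controlled
```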
I knew when I saw that title it would trigger me! Sure did: number two, the Bootstrap.
It is called the "Stationary Bootstrap" because it can only be used on stationary data (for very obvious reasons). The problem is that tests for stationarity are very poor.
I have seen it used in academic finance to prove that technical analysis works. Yes, really. (A clue for the clueless: except in very special circumstances, technical analysis does not work.) It is completely inappropriate to bootstrap financial returns data, as they are not stationary.
I am talking about White's Reality Check, which was devised by Halbert White, a Distinguished Professor of Economics at the University of California. It is completely bananas (financial return data is not stationary; bootstrapping it gives random noise. Duh!). When I looked into it about a decade ago, the papers it was developed in (two of them, actually) were among the most cited in the field.
I am not convinced that it can be used even on stationary data, though I only have experience debunking it, not making good use of it. If something seems too good to be true... And the bootstrap promises to take your too-small number of data points and turn them into a lot of data.
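For reference, the method under discussion is the stationary bootstrap of Politis & Romano: resample the series in blocks whose lengths are geometrically distributed, so that the resampled series preserves short-range dependence. A minimal numpy sketch (my own simplified implementation, not any paper's reference code):

```python
import numpy as np

def stationary_bootstrap(x, mean_block=10, rng=None):
    """One stationary-bootstrap resample (after Politis & Romano, 1994).

    Each step continues the current block with probability
    1 - 1/mean_block, or jumps to a uniformly random new start index.
    Indices wrap around, so block lengths are geometric with the given
    mean and every observation is equally likely to appear.
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x)
    n = len(x)
    p = 1.0 / mean_block
    idx = np.empty(n, dtype=int)
    idx[0] = rng.integers(n)
    for t in range(1, n):
        if rng.random() < p:               # start a fresh block
            idx[t] = rng.integers(n)
        else:                              # continue the current block
            idx[t] = (idx[t - 1] + 1) % n
    return x[idx]

series = np.cumsum(np.random.default_rng(0).normal(size=200))  # a random walk
resample = stationary_bootstrap(series, mean_block=20, rng=1)
```

The catch, per the objection above: the resampling scheme assumes the original series is stationary, and feeding it something like the random walk here (or raw financial prices) simply produces dependence-scrambled noise.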
I see the paper here did not bother with stationarity. I stopped reading there. I cannot bear the myths and self-deception statisticians tell themselves. I did it as a graduate student for far too long.
They probably can but simply haven't because they haven't thought about the issue at all, aren't willing to put in the extra work (the PDF version was probably all but mandatory if they're from academia), didn't want to put in the significant amount of extra work to make an HTML paper look halfway decent, didn't want to learn a new set of tools right now when their PDF toolchain works just fine, ....
More importantly, they're people and probably had _some_ reason for doing what they did. If we'd prefer an HTML version instead or in addition, we ought to make an honest effort to understand their perspective and address their concerns rather than just bemoan their choices.