
What kind of solution is there to this? Only trust studies that have been successfully reproduced several times? Raise the bar for an acceptable p-value so it becomes less practical to p-hack?



Preregistration of studies [0]. Submit design and models (e.g., logistic regression with these covariates and these exclusion criteria) before collecting data. This is relatively easy to implement and hard to get around, but does need to be paired with more acceptance of the importance of null results in advancing science.

[0] https://en.wikipedia.org/wiki/Preregistration_(science)
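
To make this concrete, here is a minimal sketch of what "run exactly what you filed" could look like once the data come in. Everything is made up for illustration (simulated data, a hypothetical treated/age/outcome design); the point is only that the formula, covariates, and exclusion rule would already be on public record before collection:

    # Hypothetical preregistered analysis plan, expressed as code.
    # The model formula, covariates, and exclusion criterion below are the kind
    # of thing filed publicly before any data exist; the data here are simulated.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 500
    df = pd.DataFrame({
        "age": rng.integers(16, 80, n),
        "treated": rng.integers(0, 2, n),
    })
    df["outcome"] = rng.binomial(1, 0.3 + 0.1 * df["treated"])

    df = df[df["age"] >= 18]  # pre-specified exclusion criterion

    # Pre-specified model: logistic regression with the registered covariates.
    fit = smf.logit("outcome ~ treated + age", data=df).fit()
    print(fit.summary())      # report whatever comes out, significant or not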


This is the right answer. I worked in clinical trials for a while, and they're required to submit a Statistical Analysis Plan before the study is even allowed to proceed. After all the data are collected, you just run whatever was originally proposed and you get the answer you get.

This really should be the standard for any meaningful science.


Can't you just pre-register 20 studies of random correlations and still get one with p < 0.05?


The idea is that preregistration is done publicly, so a reviewer of a paper could easily see that you have registered multiple analyses and reject the research based on this.


Can't it be compensated for[1] instead?

[1]: https://en.wikipedia.org/wiki/Look-elsewhere_effect


Yes, that's entirely possible. My comment was made keeping in mind that the scientific norm is to have a single preregistration for a single experiment. If a researcher wants to test N hypotheses with an appropriate p-value correction to account for this, usually the entire set of hypotheses and the correction itself would be contained in a single registration. And that would be perfectly fine.

Submitting a large number of individual registrations for a single set of data is just not the norm, and so it would be suspicious to begin with. (Okay, I will admit that on the order of 2-3 registrations for very different categories of study may not be as strange; but certainly 20 would signal something very odd.)


You can, but that's not necessarily problematic in itself (perhaps slightly depending on what your aim is). By definition, using a statistical significance threshold of 0.05 means that you will have a 5% false positive rate. You always expect that _some_ results are erroneous (either false positives or false negatives), and the point of using p-values is merely to control the rate at which this happens.

That said, you can do something called a multiple comparisons correction, where you take into account the fact that you're doing multiple tests in one go. In this case you'll end up being more conservative, and it's not the false positive rate you'll be controlling; instead you'll control the family-wise error rate, which basically has to do with whether _any_ of your positive results are false positives.
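
For what it's worth, this correction is a one-liner in most stats libraries. A small sketch, using 20 random p-values as stand-ins and Holm's method from statsmodels (the 0.05 threshold and the number 20 are just assumptions for illustration):

    # Family-wise error rate control over 20 hypothetical p-values.
    import numpy as np
    from statsmodels.stats.multitest import multipletests

    rng = np.random.default_rng(0)
    p_values = rng.uniform(size=20)  # stand-ins for 20 unrelated tests

    # Holm's method limits the chance that ANY reported positive is a false positive.
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")
    for p, p_adj, r in zip(p_values, p_adjusted, reject):
        print(f"raw p = {p:.3f}   adjusted p = {p_adj:.3f}   significant: {r}")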


Focusing on effect sizes!

It's much, much harder to lie with effect sizes. For instance, in genetics there are entire careers built on the association of a certain genetic polymorphism with a particular phenotype. Even if the p-value can be reproduced with huge sample sizes, the effect size of the association is usually tiny.

In other words, if you carry the risk allele, the odds of developing the disease are usually only slightly higher than if you didn't. But funders are often not told this! Focusing on effect sizes would drive research effort toward things that actually make a difference.

You can report effect sizes with something as simple as a confidence interval, or with a fully fledged Bayesian model that estimates the posterior distribution of some parameter.
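
As a toy illustration of "solid p-value, underwhelming effect size", here is a sketch that computes an odds ratio and its 95% confidence interval from a made-up 2x2 allele-by-disease table, using the standard log-odds-ratio (Woolf) approximation. The counts are invented:

    # Odds ratio and 95% CI for a hypothetical risk allele (invented counts).
    import numpy as np
    from scipy import stats

    a, b = 1200, 8800   # allele carriers:  cases, controls
    c, d = 1000, 9000   # non-carriers:     cases, controls

    or_hat = (a * d) / (b * c)                     # sample odds ratio
    se_log_or = np.sqrt(1/a + 1/b + 1/c + 1/d)     # SE of log(OR), Woolf's method
    z = stats.norm.ppf(0.975)
    lo, hi = np.exp(np.log(or_hat) + np.array([-z, z]) * se_log_or)
    print(f"odds ratio = {or_hat:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
    # With samples this large the association is clearly "significant",
    # yet an odds ratio of ~1.2 barely moves anyone's absolute risk.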


Modern genome-wide association studies rarely look for single-allele effects anymore. The current view has shifted towards polygenic or sometimes even omnigenic (meaning the majority of genes can have indirect effects on the phenotype or disease) associations. No one expects to see big effect sizes because very few people will have differences at any one particular locus. So under these models, simple effect sizes are less important than p-values.
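
A quick way to see the polygenic intuition is to simulate it: thousands of loci, each with a negligible effect, still add up to a score that varies meaningfully between people. Everything below is simulated, not taken from any real GWAS:

    # Simulated polygenic score: many tiny per-SNP effects summed per person.
    import numpy as np

    rng = np.random.default_rng(1)
    n_people, n_snps = 1_000, 10_000

    genotypes = rng.binomial(2, 0.3, size=(n_people, n_snps))  # 0/1/2 allele counts
    per_snp_beta = rng.normal(0.0, 0.01, size=n_snps)          # tiny per-locus effects

    score = genotypes @ per_snp_beta                           # weighted sum per person
    print(f"largest single-SNP |beta|: {np.abs(per_snp_beta).max():.3f}")
    print(f"spread of the combined score across people: {score.std():.2f}")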


I think effect size is a very useful tool if you have some sense of the odds a priori, e.g. if you can compare the effect size against known features/variables. It can then give you a rough feeling for how plausible the result is. If the features are as small as individual genetic variants, people will have a hard time gauging the supposed effect (unless it's a functional variant, e.g. one that alters the encoded protein).


Unfortunately, tightening the acceptable p-value by a factor of 10 multiplies the required data set by roughly 100, if the rule of thumb is a guide.

While impractical in many cases, I think the gold standard is to approach a problem from as many angles as possible, and relate it to disparate sources of knowledge, including theory.

For example, the accepted values of the physical constants are established by munging the results from a wide variety of different kinds of experiments, guided by theory. They don't just repeat the same experiment over and over. These constants are robust enough that you can hang your hat on them for virtually any practical use.
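
As a toy version of that "munging", here is a sketch of combining several independent measurements of the same quantity by inverse-variance weighting. The real adjustments of physical constants are full least-squares fits across many linked quantities, and the numbers below are invented:

    # Inverse-variance weighted combination of independent measurements.
    import numpy as np

    values = np.array([6.6743, 6.6741, 6.6752])   # hypothetical measured values
    sigmas = np.array([0.0007, 0.0004, 0.0015])   # their quoted uncertainties

    weights = 1.0 / sigmas**2
    combined = np.sum(weights * values) / np.sum(weights)
    combined_sigma = np.sqrt(1.0 / np.sum(weights))
    print(f"combined value: {combined:.4f} +/- {combined_sigma:.4f}")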


I think a lot of publications and scientists are now aware of this, so they adjust their experimental designs to avoid a whole bunch of free parameters or small sample/population sizes.



