
Banning p-testing/NHST
Admin
Although this method is widely used (even by me), I think it is pretty much bogus and should be avoided whenever possible. Recently, one journal took a double step in that direction by simply banning it. A double step because they also banned confidence intervals, which in my opinion are much better than NHST and still easy to understand. CIs are based on similar (frequentist) theory but supply more information (the standard error/precision of the study) than a mere p value, and in my opinion they are less amenable to misinterpretation. The full alternative is to use Bayesian methods, but they have their own problems, as well as being pretty difficult to learn.
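
To illustrate what a CI adds over a bare p value, here is a minimal R sketch with simulated (made-up) data; the t.test call is standard, but the numbers are purely hypothetical:

  # Simulated example: the CI carries the effect estimate and its precision;
  # the p value alone carries neither.
  set.seed(1)
  treatment <- rnorm(40, mean = 0.3)   # hypothetical true effect of 0.3 SD
  control   <- rnorm(40, mean = 0.0)
  fit <- t.test(treatment, control)
  fit$p.value    # one number: "significant or not", nothing more
  fit$estimate   # the two group means, i.e. the effect size
  fit$conf.int   # the 95% CI: the effect plus the precision of the study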

Personally, I think we should get rid of p-values and rely on CIs alone. Thoughts?

http://www.sciencebasedmedicine.org/psychology-journal-bans-significance-testing/

See also discussion in:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3095378/
http://www.phil.vt.edu/dmayo/personal_website/Schmidt_Hunter_Eight_Common_But_False_Objections.pdf
http://www.researchgate.net/profile/Raymond_Nickerson/publication/12384017_Null_hypothesis_significance_testing_a_review_of_an_old_and_continuing_controversy/links/54905e910cf214269f26684e.pdf
I think p values are not a problem per se. What is bogus and should be avoided is the convention of dichotomizing results at a threshold (e.g. 0.05) and treating anything that fails to cross it as evidence of no effect. Authors could just report the p value without saying silly things such as "there is no difference" when there is a difference that merely isn't statistically significant. So they should report the exact p value instead of only saying whether it is below or above 0.05, as most authors still do. I propose we instead ban the expression "statistically significant".
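
For what it's worth, exact reporting is trivial to do in software; a small R illustration (my own example, not from the post):

  p <- c(0.054, 0.0004, 0.62)
  format.pval(p, digits = 3, eps = 0.001)  # reports exact values, flooring only below 0.001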
I see nothing wrong with banning it. In psychology, a non-trivial number of researchers seem to understand what a p value is and what it is not. But among economists, it's as if no one knows. When they run a simple OLS regression, they display tables with the unstandardized coefficients and p-values (they never use standardized coefficients), but sometimes they don't discuss the coefficient at all, only the p-value, and conclude: "it's significant, so there is an effect". That's a problem, because the test of the null vs. the alternative hypothesis is a dichotomous question: yes vs. no. It does not answer "how much" when the answer is yes. If the effect is trivial, you will look stupid stating "yes, there is an effect!". Once you add the known fact that the p-value is a composite of effect size and sample size, you understand that the p-value should never be trusted on its own.
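
To see concretely how the p-value mixes effect size with sample size, a quick R sketch with simulated data (illustrative only): the same trivial effect of 0.05 SD flips from "non-significant" to "significant" purely because n grows.

  set.seed(42)
  small_n <- t.test(rnorm(50,    mean = 0.05), rnorm(50))      # n = 50 per group
  large_n <- t.test(rnorm(50000, mean = 0.05), rnorm(50000))   # n = 50,000 per group
  small_n$p.value   # typically well above 0.05
  large_n$p.value   # typically far below 0.05 -- same tiny effect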

There is something even more troubling about econometrics. Some of the techniques widely used in causality testing (e.g., Granger causality) use the p-value to answer the question of the direction of causality (X->Y, Y->X, or bi-directional). Just the p value, and nothing else. To make things worse, several preliminary tests must be run before the causality test itself, such as checking that your variables are all stationary, and the tests of stationarity (e.g., the Augmented Dickey-Fuller test) likewise rest on a p-value and nothing else. So when you prepare your variables you commit the fallacy, and when you run the causality test with those variables you commit it again.
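
To make that workflow concrete, here is a minimal R sketch with simulated data (tseries and lmtest are commonly used packages for these tests; this only illustrates where the p-values enter, not anyone's actual analysis):

  library(tseries)   # adf.test
  library(lmtest)    # grangertest
  set.seed(7)
  n <- 200
  x <- rnorm(n)
  y <- c(0, 0.5 * x[-n]) + rnorm(n)   # y built to depend on lagged x
  adf.test(x)$p.value                 # small p -> x declared "stationary"
  adf.test(y)$p.value                 # same decision rule for y
  grangertest(y ~ x, order = 1, data = data.frame(x, y))
  # small p -> "x Granger-causes y"; the direction claim rests on p alone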

And I have never heard an economist claim that p-value-based statistical methods are wrong. They don't want to question it. They seem to believe that if all of their colleagues are using it, you must use it too! And don't dare question the conventional wisdom!

Concerning banning CIs, I don't know. I always thought they had their uses nonetheless, but I have recently changed my mind a little bit. See:
http://andrewgelman.com/2014/12/11/fallacy-placing-confidence-confidence-intervals/
http://andrewgelman.com/wp-content/uploads/2014/09/fundamentalError.pdf
Admin
There are a number of problems that one can try to reduce/solve by policy changes.

Some problems:
  1. publication bias
  2. p-rounding to get p=.05, which may have been p=.054
  3. interpretation of p>α to mean no population effect when power is low.
  4. data snooping and conditional stopping of data collection based on p value
  5. interpretation of CIs containing 0 to mean no population effect when power is low.
  6. not reporting effect sizes, only test statistics


These have various solutions.

1.
a) Encourage or mandate preregistration. b) Encourage publishing results without p<α. c) Develop meta-analysis methods that can estimate the population effect despite biased reporting. I think (c) is the safest solution to rely on, as I don't think one can easily sway research practice as in (b). As for (a), if pre-registration can gain popularity as a mark of quality studies, this would perhaps help the issue. Personally, I will try to do this for all my survey studies in the future.

2.
a) obligatory reporting of p values to 3 digits, b) banning p values. I prefer (b).

3.
Reviewers must catch errors like this. Generally, if studies are small, estimation of the population effect is best left to meta-analyses.
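
As a quick illustration of why low power matters here (hypothetical numbers, sketched in R):

  # With a true effect of d = 0.3 and only 30 subjects per group, power is ~0.2,
  # so a "non-significant" p says very little about the population effect.
  power.t.test(n = 30, delta = 0.3, sd = 1, sig.level = 0.05)$power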

4.
Pre-registration of sample sizes. This can sometimes be met with e.g. Mechanical Turk; often it cannot, because it is unknown how large a sample one can attract.

5.
Same as (3). However, CIs are hopefully less conducive to this error. There is some evidence of this.

6.
Can be banned by policy.
My R code (a Bayesian-Laplacian binomial calculator), with an explanation based on continuous (rather than discrete) Bayesian hypothesis testing: https://thewinnower.com/papers/binomial-calculator-based-on-continuous-hypothesis-testing-there-is-no-such-a-thing-as-a-discrete-hypothesis
It is useful when prior probabilities are unknown.
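The linked page has the actual code; purely to sketch the underlying idea (my wording, assuming a uniform/Laplace prior on a binomial proportion, not the linked calculator), the posterior is a Beta distribution and the probability of an interval hypothesis follows directly:

  # Illustrative sketch, not the linked calculator. A uniform (Laplace) prior plus
  # k successes in n trials gives a Beta(k+1, n-k+1) posterior for the proportion.
  k <- 27; n <- 50                       # hypothetical data
  p_in <- function(lo, hi) pbeta(hi, k + 1, n - k + 1) - pbeta(lo, k + 1, n - k + 1)
  p_in(0.5, 1)                           # P(proportion > 0.5 | data)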
We (my brother and I) have developed code to compute credible intervals within a Bayesian framework. The explanation is here: http://figshare.com/articles/Credible_Interval_Probability_Calculator/1394536
and the code is here: http://figshare.com/articles/Credible_Interval_Probability_Calculator/1394537

This method differs from traditional significance testing, in which, given the vector's mean and standard deviation, one computes the probability (p) that, on repeated sampling, the value of the variable x would fall within that interval. The function of this program is instead to compute the probability that the real mean of x lies within the interval [X, Y]: that interval is the hypothesis H (which usually corresponds to the null hypothesis, but not necessarily).
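
As a rough sketch of that idea (my own illustration, not the linked code, assuming a normal likelihood and a noninformative prior), the posterior of the mean is a t distribution centred at the sample mean, so P(mean in [X, Y]) can be computed directly:

  # Rough sketch, not the linked calculator: noninformative prior + normal
  # likelihood => posterior of the mean is mean(x) + (sd(x)/sqrt(n)) * t_{n-1}.
  prob_mean_in <- function(x, lo, hi) {
    se <- sd(x) / sqrt(length(x))
    pt((hi - mean(x)) / se, df = length(x) - 1) -
      pt((lo - mean(x)) / se, df = length(x) - 1)
  }
  set.seed(3)
  x <- rnorm(30, mean = 1.2)     # hypothetical data vector
  prob_mean_in(x, 1, 1.5)        # posterior probability that the true mean lies in [1, 1.5]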