Teaching in a school of public health, I often listen to presentations by master’s degree students who undertake analyses of primary data collected to answer a question of public health relevance. Invariably, the presentation leads to an analysis slide depicting the results of a multivariate modeling exercise (in which the associations between more than one identified factor and the outcome of interest are analysed). Strategic rows indicating a significant p-value will be highlighted or marked with an asterisk (*), and the student will conclude with a statement on which of the identified factors had “statistically significant p-values”.
Use of the p-value as part of tests of statistical significance is not the exception; it is the norm in most health research. RA Fisher, who propounded the concept of the p-value, suggested: “It is usual and convenient for experimenters to take 5 percent as a standard level of significance, in the sense that they are prepared to ignore all results that fail to reach this standard, and, by this means, eliminate from further discussion the greater part of the fluctuations which chance causes have introduced into their experimental results.” However, he prefaced this with: “It is open to the experimenter to be more or less exacting in respect of the smallness of the probability he would require before he would be willing to admit that his observations have demonstrated a positive result” (1). This allowance made by Fisher seems to have been lost in the effort to find an easy standard to apply.
Often, there is no discussion of the size of the effect (how a unit change in the identified factor is expected to alter the outcome of interest), of the adequacy of the sample size to yield a valid estimate of that effect, or of the efficiency of the specified model (how well the identified factors explain the outcome). Most of us with a better understanding are guilty of remaining silent through such presentations, or of asking one or two pointed questions without going through the whole gamut of explanations that are needed. This is possibly because of a collective angst regarding the outcome of the learning process for the master’s degree. I suspect part of the silence is also shaped by the difficulty of finding simple, lay-language explanations of why this use and interpretation of the p-value is limited in its scientific merit. While student presentations do not result in public harm, public policy choices informed by misinterpreted or limited readings of results can be damaging.
The American Statistical Association (ASA), one of the oldest professional bodies of statisticians, with a global membership, took the unusual step in 2016 of speaking out on the reading of evidence using statistical analysis (2). It followed this statement, published in the American Statistician, with an explicit list of dos and don’ts regarding p-values, confidence intervals and the power of a test, in the online supplement to the journal (3). This supplement, authored by a who’s who of statisticians, elucidates the common errors in the interpretation of p-values, confidence intervals and power, and makes an excellent accompaniment to the ASA guidance. The two need to be read in tandem by the scientific community to understand the forms and implications of such misinterpretations. The guidance by itself, however, can be read by all who profess a scientific temper in their thinking and action.
The publication of the ASA guidance, though unusual, was timely and relevant. The effort was prompted by the routine use of p-values in empirical research to emphasise the statistical significance of a finding. This is not to say that statisticians and researchers across disciplines have remained silent on this issue. They have repeatedly pointed out, in their publications, the limitations of the p-value (4). The prestigious science journal Nature, in its editorial of Feb 12, 2014, said: “…most scientists would look at a P-value of 0.01 and ‘say that there was just a 1% chance’ of the result being a false alarm. ‘But they would be wrong’. In other words, most researchers do not understand the basis for a term many use every day. Worse, scientists misuse it. In doing so, they help to bury scientific truth beneath an avalanche of false findings that fail to survive replication.” (5)
A look at the reference list of the ASA’s statement indicates that not only statisticians as a profession, but also researchers in medicine, psychology, economics, epidemiology, law and public health have recognised the misuse of the p-value. The earliest reference in this list dates to 1960, in the Psychological Bulletin, and the most recent to 2014, in the American Scientist. This long engagement, and the repeated calls for caution and for better use of the tools of inferential statistics, have not been heeded.
What did the ASA say about p-values? It said that the validity of scientific conclusions in any discipline should be based on appropriate interpretation of statistical results. In this context, it singled out the use of the p-value to assess statistical significance, and its misuse and misinterpretation. It defined the p-value as “the probability under a specified statistical model that a statistical summary of data (for example, the sample mean difference between two compared groups) would be equal to or more extreme than its observed value”. The guidance explains this quite lucidly for a lay audience (by “lay”, I mean those without statistical training). It outlines six principles that state what the p-value indicates and what it does not.
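The ASA definition can be made concrete with a small simulation. The sketch below is my own illustration, not part of the ASA guidance: a simple permutation test on hypothetical data (the numbers are invented, not from any real study), in which the "specified statistical model" is the null model that group labels are exchangeable, and the "statistical summary" is the difference in sample means.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=10000, seed=42):
    """Two-sided permutation p-value for a difference in means.

    Counts how often, under random re-labelling of the pooled data
    (the null model), the absolute mean difference is equal to or
    more extreme than the observed one -- the ASA's definition
    applied to this particular summary statistic.
    """
    rng = random.Random(seed)
    mean = lambda xs: sum(xs) / len(xs)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # re-assign group labels under the null
        if abs(mean(pooled[:n_a]) - mean(pooled[n_a:])) >= observed:
            extreme += 1
    return extreme / n_permutations

# Hypothetical outcome measurements for two groups
treated = [5.1, 4.8, 6.2, 5.9, 5.5, 6.0]
control = [4.9, 5.0, 4.7, 5.2, 4.6, 4.8]
p = permutation_p_value(treated, control)
print(f"permutation p-value: {p:.3f}")
```

Note what the resulting number is, and is not: it is the probability of a summary this extreme *given the null model*, not the probability that the null model is true, which is precisely the misreading the guidance warns against.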
These principles are well known but worth repeating, as they are more often honoured in the breach than in the observance. The ASA guidelines are a reiteration of what ethical statistical practice should be. Insofar as the building of scientific evidence rests largely on empiricism, good statistical practice is ethical scientific practice. For this reason, the guidance bears repetition and wider dissemination, in the hope that the requirements it advocates are heeded by all.
Disenchantment with the misuse of the p-value has even led some journals to ban statistical tests altogether, as Greenland et al report in their companion to the ASA guidelines. The ASA guidelines themselves offer solutions to this “pernicious statistical practice” (3, 7). They point to the possibilities of using estimation in place of testing, Bayesian methods, or alternative measures of evidence such as likelihood ratios, Bayes factors, or decision-theoretic modeling. The guidance concludes by emphasising the need to use multiple approaches to understand the phenomenon being studied, and to recognise the context while interpreting results, instead of relying on a single index like the p-value.
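One of the alternatives mentioned, estimation, can also be illustrated briefly. The sketch below, again my own and again on invented data, reports an effect size with a percentile bootstrap confidence interval for the difference in means, rather than a bare “significant / not significant” verdict.

```python
import random

def bootstrap_ci(group_a, group_b, n_boot=5000, alpha=0.05, seed=1):
    """Percentile bootstrap confidence interval for
    mean(group_a) - mean(group_b)."""
    rng = random.Random(seed)
    mean = lambda xs: sum(xs) / len(xs)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement and record the difference
        resample_a = [rng.choice(group_a) for _ in group_a]
        resample_b = [rng.choice(group_b) for _ in group_b]
        diffs.append(mean(resample_a) - mean(resample_b))
    diffs.sort()
    low = diffs[int((alpha / 2) * n_boot)]
    high = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return low, high

# Hypothetical outcome measurements for two groups
treated = [5.1, 4.8, 6.2, 5.9, 5.5, 6.0]
control = [4.9, 5.0, 4.7, 5.2, 4.6, 4.8]
low, high = bootstrap_ci(treated, control)
print(f"estimated difference in means, 95% CI: ({low:.2f}, {high:.2f})")
```

The interval conveys both the size of the effect and the precision with which it is estimated, which is exactly the information a lone p-value suppresses.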
The Indian Journal of Medical Ethics, by virtue of its disciplinary orientation, attracts a wide variety of submissions, some of them quantitative. We recognise the utility of the ASA guidance, which underscores many of the ethical concerns we have dealt with during the review of submissions to the journal. The ASA guidelines are meant to be just that: a timely caution on the interpretation of results, whatever the statistical approach used to establish the “truth”. This guidance needs wider dissemination across the scientific community in India and the subcontinent; we hope we have made a start with this editorial. Nature has begun to recognise the value of statistics in scientific reporting by running a statistical review in parallel with standard peer review for some papers (5). IJME recognised this need in 2015 and has put in place a similar parallel process for papers with statistical content.