By Geoffrey Hart
Previously published as: Hart, G. 2025. (False) precision: Part 2: Statistical probability. https://www.worldts.com/english-writing/503/index.html
In part 1 of this article, I described some of the guidelines for determining how many decimal places of precision it is legitimate and useful to report. In this part, I’ll explain how those guidelines apply to statistical probability.
The issue of precision also arises for statistical probabilities. One of the standard tools of statistical analysis is the P value, which tells you the probability that a statistical result is due to a type I or type II error (defined below) rather than being a real result. Most fields of science perform hypothesis testing using statistical significance based on the P value. Tests of significance are designed to determine whether the null hypothesis (that no significant relationship exists between two variables, or that no difference exists between treatments) is likely to be correct. The P value is most commonly used to reveal the probability of two types of error:
Type I error: incorrectly rejecting the null hypothesis (thus, obtaining a significant result) when there is no real difference between treatments. This is often called the α probability.
Type II error: incorrectly accepting the null hypothesis (thus, obtaining a non-significant result) when there is a real difference between treatments. This is often called the β probability.
Note that the distinction between type I and type II errors is not the same thing as a two-tailed test, which tests both whether a result is less than a comparison value and whether it is greater than that value; that is, it tests both the lower and the upper “tails” of the statistical distribution. Researchers most often report only the probability of a type I error, since the goal is usually to correctly reject the null hypothesis (i.e., to detect a statistically significant result).
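To make the α probability concrete, here is a minimal simulation sketch in Python (the choice of a two-sample t-test, the sample sizes, and the seed are my own illustrative assumptions, not part of the original article). Both samples are drawn from the same population, so the null hypothesis is true by construction and every “significant” result is a type I error:

import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
alpha = 0.05        # the significance threshold (the α probability)
n_trials = 10_000   # number of simulated experiments
false_rejections = 0

for _ in range(n_trials):
    # Both samples come from the SAME population: the null hypothesis is true.
    a = rng.normal(loc=0.0, scale=1.0, size=30)
    b = rng.normal(loc=0.0, scale=1.0, size=30)
    _, p = stats.ttest_ind(a, b)  # two-sample t-test
    if p < alpha:
        false_rejections += 1     # a type I error: "significant" purely by chance

print(f"Observed type I error rate: {false_rejections / n_trials:.3f}")  # ≈ 0.05

With α = 0.05, roughly 5% of the simulated experiments declare a significant difference even though none exists, which is exactly what the α probability predicts.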
The probability of incorrectly rejecting the null hypothesis is expressed as the P value. The P value is the decimal form of a percentage between 0 and 100%. It represents the number of times per 100 trials that you could expect to incorrectly reject the null hypothesis due to random chance rather than due to the existence of a real difference. The standard threshold used by many journals is P = 0.05, which means that you might see such an error 5 times in 100 trials (1 time in 20). Some journals prefer to define significance as P < 0.01, which means that the error would occur no more than 1 time in 100 trials.

However, the choice of the P level that must be met before declaring a result statistically significant is completely arbitrary. An error that occurs 1 time in 20 is uncomfortably frequent for many researchers, particularly in cases such as testing pharmaceutical drugs, where human lives are at risk, so some journals prefer P = 0.001, one-fiftieth the error frequency at P = 0.05. Some physics journals that publish exceptionally precise measurements require “five sigma” significance (P = 0.0000003, or roughly 1 chance in 3.5 million), where sigma (σ) represents the standard deviation. That is, a significant result must lie at least five standard deviations from the value used as the basis for comparison (e.g., from a value of 0).
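Readers who want to check the “five sigma” figure can do so with a few lines of Python (assuming scipy is installed; the one-tailed convention used here is an assumption on my part, since conventions vary by field). The normal distribution’s survival function gives the probability of a result at least k standard deviations above the mean:

from scipy import stats

# One-tailed probability of a result at least k standard deviations (sigma)
# above the mean of a normal distribution
for k in (1, 2, 3, 5):
    p = stats.norm.sf(k)  # survival function: 1 - CDF
    print(f"{k} sigma: P = {p:.7f}")
# 5 sigma yields P ≈ 0.0000003, the "five sigma" criterion described above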
P values are statistical expectations, not laws of physics. The thresholds used to declare significance are arbitrary values that reflect your desired level of confidence, and beyond a certain point, striving for more decimal places of precision is meaningless: the difference between P = 0.0010 and P = 0.0001 rarely matters in most research. This is why most journals ask authors to express statistical significance using only three standard levels: 0.05, 0.01, and 0.001. In practice, achieving a good P value in a single experiment suggests the result is likely to be meaningful, but doesn’t confirm it until several other researchers have replicated the result. Successful replication, not the P level in a single study, is the true test of a hypothesis and the key to determining when a hypothesis matures into a theory.
Note: Because the three P levels used by many journals are entirely arbitrary, some authors have made a strong case for reporting the actual P value rather than using arbitrary categories. For example, P = 0.049 would traditionally be considered significant because the value is less than 0.05, but P = 0.051 would be considered non-significant because the value is greater than 0.05. However, both in practice and in theory, it is unlikely that these two P values, which differ by only 0.002, represent a real difference.
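The fragility of this dichotomy is easy to demonstrate. In the following Python sketch, the t statistics and degrees of freedom are hypothetical values chosen purely for illustration; two nearly identical test statistics fall on opposite sides of the P = 0.05 threshold:

from scipy import stats

df = 30  # degrees of freedom (a hypothetical value)
for t in (2.00, 2.08):
    p = 2 * stats.t.sf(abs(t), df)  # two-tailed P value
    print(f"t = {t:.2f} -> P = {p:.3f}")
# t = 2.00 gives P ≈ 0.055 ("non-significant"), whereas t = 2.08 gives
# P ≈ 0.046 ("significant"), although the two results are nearly identical.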
More importantly, a P level only tells you whether your result is likely to reflect a real difference. It does not tell you whether that result has any practical significance. For example, if you are confident at P < 0.001 that, based on statistical expectations, only one person in a million will win the next lottery, this doesn’t mean that investing in lottery tickets is a wise use of your money. Conversely, if you are confident at only P < 0.10 that the bridge you’re thinking about crossing is 50% likely to collapse, you’d be unwise to cross that bridge, even though the P level seems weak.
Statistical significance also assumes that errors are random rather than inherent to the system you are studying (i.e., it assumes that there are no systematic errors). For example, if you flip an ordinary coin with two different sides (heads and tails) many times, it’s highly likely that your long-term average frequency will approach 50% heads and 50% tails. But if both sides of the coin are heads, the frequency will always be 100% heads, no matter how often you repeat the test. (It’s also possible, though highly improbable, that the coin will land on its edge and display neither result.) A great deal of experimental design involves trying to eliminate systematic errors so that you can focus on the real phenomenon you’re trying to study.
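A quick simulation sketch (again in Python; the seed and the number of flips are arbitrary) contrasts the two cases: random error averages out over many flips, but the systematic error of a two-headed coin never does:

import numpy as np

rng = np.random.default_rng(seed=2)
n_flips = 100_000

fair = rng.integers(0, 2, size=n_flips)   # 0 = tails, 1 = heads: random error only
two_headed = np.ones(n_flips, dtype=int)  # systematic error: every flip is heads

print(f"Fair coin:       {fair.mean():.3f} heads")        # ≈ 0.500
print(f"Two-headed coin: {two_headed.mean():.3f} heads")  # always 1.000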
In summary, P values tell you how confident you should be in your result, not whether your result is meaningful. The more important point is the strength of the relationship you’re studying, which is usually expressed as the r value for a correlation analysis or the R² value for a regression analysis. A relatively weak r or R² value combined with P < 0.001 means only that you can be highly confident you have found a real but weak relationship rather than one that exists purely by chance; conversely, a relatively strong r or R² value combined with P < 0.05 tells you only that you can be moderately confident you have found a real and strong relationship.
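As an illustration, the following sketch computes the two-tailed P value for a Pearson correlation from r and the sample size n using the standard t transformation (the specific r and n values are hypothetical). A weak correlation in a large sample can be far more “significant” than a strong correlation in a small one:

import math
from scipy import stats

def p_from_r(r: float, n: int) -> float:
    """Two-tailed P value for a Pearson correlation r based on n observations."""
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
    return 2 * stats.t.sf(abs(t), df=n - 2)

print(p_from_r(0.05, 5000))  # ≈ 0.0004: weak relationship, high confidence it is real
print(p_from_r(0.70, 9))     # ≈ 0.036: strong relationship, only moderate confidence

Both numbers matter: the P value measures your confidence that the relationship is real, whereas r and R² measure how strong it is.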