Geoff-Hart.com: Editing, Writing, and Translation

Be cautious with linear regression: some datasets are not linear!

By Geoffrey Hart

Previously published as: Hart, G. 2020. Be cautious with linear regression: some datasets are not linear! https://www.worldts.com/english-writing/eigo-ronbun74/index.html

One of the most common problems I encounter in my editing is the use of linear regression. Linear regression is, of course, a perfectly appropriate way to describe phenomena in which a change in an independent (causal) variable causes a proportional change in the dependent variable. Linear relationships are common in nature. Consider a simplistic example: If you double the temperature at which an endothermic chemical reaction occurs, you often double the reaction rate because you have doubled the amount of energy available to drive the reaction. A similar relationship exists for processes such as drying a sample to determine its moisture content or dry weight.

Note: In this article, I will focus on relationships between two variables at a time. Similar advice applies to relationships among multiple variables, although the solutions are more complex.

The problem arises when the phenomenon you’re describing is not linear. Many natural phenomena are nonlinear. For example, in a nuclear fission reactor, the fission process releases neutrons that can trigger the release of additional neutrons. Left unstopped, this can lead to a chain reaction in which the reaction rate (the amount of fission) increases exponentially until the reactor escapes control, leading to a meltdown. Such mechanisms are common in biology too. For example, during the degradation of a vegetation ecosystem, vegetation loss can expose the soil surface, making the soil more vulnerable to erosion by wind or rainfall. As erosion increases, it removes the most nutrient-rich surface soil, which decreases vegetation health and makes the vegetation more vulnerable to mortality. If that mortality occurs, it exposes more of the soil surface, which accelerates soil erosion and further decreases vegetation health. (These are examples of what is called positive feedback.)
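
To make the consequence of positive feedback concrete, here is a minimal Python sketch. It assumes a hypothetical process in which each generation multiplies the previous count by 1.5; that factor, the starting count of 100, and the helper names `fit_line` and `r_squared` are illustrative assumptions, not measured values. The sketch fits a straight line to the raw counts, then fits a straight line to the logarithm of the counts, which is equivalent to fitting an exponential curve.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(ys, predicted):
    """Proportion of the variance in ys explained by the predictions."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Hypothetical positive feedback: each generation multiplies the count by 1.5.
generations = [float(g) for g in range(11)]
counts = [100.0 * 1.5 ** g for g in generations]

# Straight line fitted to the raw counts.
slope, intercept = fit_line(generations, counts)
linear_pred = [intercept + slope * g for g in generations]
r2_linear = r_squared(counts, linear_pred)

# Straight line fitted to ln(count), equivalent to count = a * exp(b * generation).
log_slope, log_intercept = fit_line(generations, [math.log(c) for c in counts])
exp_pred = [math.exp(log_intercept + log_slope * g) for g in generations]
r2_exponential = r_squared(counts, exp_pred)

print(f"linear R^2 = {r2_linear:.3f}, exponential R^2 = {r2_exponential:.3f}")
print(f"linear prediction at generation 0: {linear_pred[0]:.0f}")
```

With these assumed values, the straight line predicts a negative count at generation 0, which is physically impossible, whereas the exponential form recovers the growth factor.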

Natural systems also commonly have ranges of conditions that show different behaviors, with the ranges separated by threshold values. Even when responses are linear within each range of conditions, the slopes and intercepts of the response lines differ between the ranges. For example, consider the different behaviors of water in its most commonly observed phases: solid water (ice), liquid water, and gaseous water. To increase the temperature of water in these three phases by 1°C, it’s necessary to add 2.11, 4.18, and 2.00 J of energy, respectively, per gram of water. Thus, to fully describe the response of water temperature to the addition of energy, it’s necessary to use a different equation for each phase.

Water temperature also shows discontinuities (plateaus) that represent thresholds between these phases. For example, before solid ice at 0°C becomes liquid water, it must absorb about 333.55 J of energy per gram, and its temperature does not begin to rise again until melting is complete. Liquid water at 100°C must absorb an even larger amount of energy (about 2260 J per gram) before it becomes a vapor and its temperature begins increasing again.

For such phenomena, simple linear regression is clearly not appropriate for the whole range of conditions under which the phenomenon will be studied. In such cases, it’s necessary to use piecewise (segmented) regression, with a separate linear regression performed for each phase.
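
The water example can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses the specific heats quoted above, latent heats of about 333.55 and 2260 J per gram, an assumed starting temperature of -20°C, and synthetic (noise-free) data. It also assumes the thresholds are already known; with real data, you would usually need to estimate the breakpoints too. The names `fit_line` and `temperature` are hypothetical helpers.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Specific heats (J per gram per degree C) and latent heats (J per gram).
C_ICE, C_LIQUID, C_VAPOR = 2.11, 4.18, 2.00
L_FUSION, L_VAPORIZATION = 333.55, 2260.0
START_TEMP = -20.0  # assumed starting temperature of the ice, in degrees C

# Cumulative energy (J) at which each threshold is reached for 1 g of water.
MELT_START = -START_TEMP * C_ICE
MELT_END = MELT_START + L_FUSION
BOIL_START = MELT_END + 100.0 * C_LIQUID
BOIL_END = BOIL_START + L_VAPORIZATION

def temperature(energy):
    """Temperature (degrees C) of 1 g of water after absorbing `energy` joules."""
    if energy <= MELT_START:
        return START_TEMP + energy / C_ICE
    if energy <= MELT_END:
        return 0.0  # melting: energy is absorbed with no temperature change
    if energy <= BOIL_START:
        return (energy - MELT_END) / C_LIQUID
    if energy <= BOIL_END:
        return 100.0  # boiling: again no temperature change
    return 100.0 + (energy - BOIL_END) / C_VAPOR

energies = [float(e) for e in range(0, 3101, 10)]
temps = [temperature(e) for e in energies]

# A single straight line through all of the data.
slope_all, intercept_all = fit_line(energies, temps)
worst_error = max(abs(t - (intercept_all + slope_all * e))
                  for e, t in zip(energies, temps))

# Separate lines for the three ranges in which temperature actually changes.
ranges = {
    "ice": (0.0, MELT_START),
    "liquid": (MELT_END, BOIL_START),
    "vapor": (BOIL_END, 3100.0),
}
recovered_heat = {}
for phase, (low, high) in ranges.items():
    pairs = [(e, t) for e, t in zip(energies, temps) if low <= e <= high]
    slope, _ = fit_line([e for e, _ in pairs], [t for _, t in pairs])
    recovered_heat[phase] = 1.0 / slope  # J per gram per degree C

print(f"single line: worst error = {worst_error:.1f} degrees C")
for phase, heat in recovered_heat.items():
    print(f"{phase}: recovered specific heat = {heat:.2f} J/(g*C)")
```

The reciprocal of each segment's slope recovers the specific heat for that phase, whereas the single line fitted to the whole range misses the observed temperature by tens of degrees at some points.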

Another problem with linear regression is that it assumes no bounds to the relationship you’re studying. In practice, most natural processes have a boundary they cannot exceed, such as an asymptote that defines a maximum or minimum value. For example, mortality within a population can never be less than 0% and can never exceed 100%, so any linear regression that does not account for that minimum and maximum will produce misleading or completely wrong results. Similarly, if we’re studying how life expectancy improves as we invest more money in access to healthcare or in the quality of the care that is provided, we can expect life expectancy to increase with increasing access and quality. However, for the foreseeable future, we cannot expect these increases to provide immortality, so any regression analysis must be bounded by some maximum age.
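
The mortality example can be illustrated with a short Python sketch. The dose values and the S-shaped mortality curve are synthetic assumptions made up for the illustration, and the function names are hypothetical. The sketch compares a straight line fitted to the mortality proportions with a straight line fitted to their logit, ln(p / (1 - p)), which is one common way to keep predictions between 0 and 1.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def logit(p):
    """Map a proportion in (0, 1) onto the whole real line."""
    return math.log(p / (1.0 - p))

def inverse_logit(z):
    """Map any real number back to a proportion in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Synthetic data: mortality proportion at hypothetical doses 1 to 9.
doses = [float(d) for d in range(1, 10)]
mortality = [inverse_logit(d - 5.0) for d in doses]

# Unbounded fit: straight line through the proportions themselves.
slope, intercept = fit_line(doses, mortality)

# Bounded fit: straight line through the logit of the proportions.
logit_slope, logit_intercept = fit_line(doses, [logit(p) for p in mortality])

def linear_mortality(dose):
    return intercept + slope * dose

def bounded_mortality(dose):
    return inverse_logit(logit_intercept + logit_slope * dose)

for dose in (0.0, 5.0, 12.0):
    print(f"dose {dose:4.1f}: linear = {100 * linear_mortality(dose):6.1f}%, "
          f"bounded = {100 * bounded_mortality(dose):6.1f}%")
```

Outside the observed doses, the straight line predicts mortality below 0% and above 100%, whereas the logit-based fit cannot leave that range. Note that the logit is undefined for observed proportions of exactly 0 or 1, so real analyses typically fit a logistic regression directly instead of transforming the data by hand.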

This leads to an important caution that researchers often forget: It is dangerous to extrapolate a regression equation beyond the range of your data or beyond your experimental conditions. If you lack data for conditions outside those ranges, you have no way to know whether and where limits such as asymptotes exist and no way to know whether different phases exist that will require an additional, different analysis to detect thresholds.
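
The danger of extrapolation can be shown with a minimal Python sketch. Everything numeric here is a hypothetical assumption chosen for illustration: a life expectancy that rises from 50 years toward an asymptote of 85 years as spending increases, with data available only for spending values between 0 and 2 (arbitrary units).

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

ASYMPTOTE = 85.0  # hypothetical maximum life expectancy, in years

def true_life_expectancy(spending):
    """Hypothetical saturating response that approaches ASYMPTOTE."""
    return 50.0 + (ASYMPTOTE - 50.0) * (1.0 - math.exp(-spending / 2.0))

# Observations exist only for spending between 0 and 2 (arbitrary units).
observed_spending = [0.25 * i for i in range(9)]
observed_life = [true_life_expectancy(s) for s in observed_spending]

slope, intercept = fit_line(observed_spending, observed_life)

def linear_prediction(spending):
    return intercept + slope * spending

# Largest error of the straight line inside the observed range.
worst_in_range = max(abs(y - linear_prediction(s))
                     for s, y in zip(observed_spending, observed_life))
print(f"worst error within the observed range: {worst_in_range:.1f} years")

# Predictions at the edge of the data and well beyond it.
for spending in (2.0, 5.0, 10.0):
    print(f"spending {spending:4.1f}: linear = {linear_prediction(spending):6.1f}, "
          f"true = {true_life_expectancy(spending):5.1f}")
```

Within the observed range, the straight line follows the data reasonably closely, so nothing in the fit itself warns you of a problem. Beyond that range, the line keeps climbing past the asymptote, which only additional data or knowledge of the process could reveal.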

To detect such problems and guide you in choosing the most appropriate form of regression analysis, always do three things:

  • First, think carefully about the phenomena your data represent. Think about the physical process you are trying to describe mathematically. If you suspect the existence of phase changes, as in my example of the three main phases of water, or the existence of a threshold (again, for water) or an asymptote, as in my example of mortality, consider a form of regression that will detect the need for different equations for different ranges of conditions. If you know that alternative stable states exist, separated by a threshold, examine those states separately.

  • Second, inspect a scatterplot of your data visually to see whether any obvious trend exists. If all of the data appear to cluster closely around the same straight line, then linear regression may be perfectly appropriate. However, if the data follow a curving path, nonlinear regression will be necessary.

  • Third, once you have detected the possibility of a conceptual or visual trend, try several different equation forms that could potentially describe the trend. For example, if you have reason to believe that a phenomenon is nonlinear, try a power function (e.g., y = x²), an exponential function (e.g., y = e^x), and a logarithmic function (e.g., y = ln x) to see which provides the best fit to your data. For processes that might be cyclical (e.g., for diurnal temperature changes), consider using a regression based on a sine function.
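
The third step can be sketched in Python as follows. The data are synthetic (a logarithmic trend with a small artificial wiggle), and the helper names are hypothetical. Each candidate form is fitted by transforming the data so that ordinary straight-line fitting applies, and the predictions are then converted back to the original units so that the R² values are comparable.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(ys, predicted):
    """Proportion of the variance in ys explained by the predictions."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

def identity(value):
    return value

# Synthetic data: a logarithmic trend plus a small alternating wiggle.
xs = [float(x) for x in range(1, 21)]
ys = [2.0 + 3.0 * math.log(x) + 0.1 * (-1) ** i for i, x in enumerate(xs)]

def fit_and_score(transform_x, transform_y, inverse_y):
    """Fit a line to the transformed data; score it in the original units."""
    slope, intercept = fit_line([transform_x(x) for x in xs],
                                [transform_y(y) for y in ys])
    predicted = [inverse_y(intercept + slope * transform_x(x)) for x in xs]
    return r_squared(ys, predicted)

# Each entry: (transform for x, transform for y, inverse transform for y).
candidates = {
    "linear: y = a + b*x": (identity, identity, identity),
    "logarithmic: y = a + b*ln(x)": (math.log, identity, identity),
    "exponential: y = a*exp(b*x)": (identity, math.log, math.exp),
    "power: y = a*x**b": (math.log, math.log, math.exp),
}
scores = {name: fit_and_score(*forms) for name, forms in candidates.items()}

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{score:.4f}  {name}")
best = max(scores, key=scores.get)
print("best fit:", best)
```

A higher R² alone does not prove that a form is correct, so the winning equation should still make sense for the physical process you identified in the first step.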

These steps greatly increase the likelihood that you will detect something new and interesting, and possibly something that previous researchers missed because they insisted on using linear regression for an inherently nonlinear phenomenon. Graphs that don’t conform to your expectations may reveal important phenomena, such as when a process changes from linear to nonlinear or back again.

 


©2004–2024 Geoffrey Hart. All rights reserved.