Editorial: Outliers—the forgotten 5%

Previously published as: Hart, G. 2009. Editorial: Outliers—the forgotten 5%. The Exchange 16(4):2, 10–11.

You'll often see scientists make an interesting choice when it comes time to interpret their data: they focus on the main results, and ignore any minor results that don't conform. These inconsistencies are often called outliers (http://en.wikipedia.org/wiki/Outliers) because choosing that word helps the scientists see them as "lying outside" the other data. This makes it easier to attribute these outlying results to experimental error, whether due to human imprecision or due to the presence of subtle factors that weren't detected when the experiment was designed but that nonetheless affected the results. If they're errors, you can then ignore them.

Even if these outliers are real, not just errors, they probably aren't truly representative of the majority of the things being studied. Scientists often frame these things in terms of statistical probabilities, and focus on regions of probability such as the 95% "confidence interval" (http://en.wikipedia.org/wiki/Confidence_interval): if you'll pardon my doing a gross injustice to the idea of a confidence interval, the goal is to focus on the 95% of the results that are most significant. (Please treat that description as nothing more than a metaphor to make the concept more approachable.) The logic follows the prevailing scientific dogma, namely that the universe follows consistent rules and that close examination will reveal those rules. Once the rules are known, then events proceed from a given starting point, like clockwork, in accordance with those rules. Anything that reaches a different endpoint must have had a different starting point from the other things being studied, and can therefore follow a different path. In botany, for instance, 95 out of 100 plants may respond identically to an experimental treatment because they were in the same state of health and had nearly identical physiological parameters at the start of the experiment; the other 5 may have inadvertently been stressed by receiving insufficient water or being handled unusually roughly—or maybe they're just having a bad day.

(Un)fortunately, sometimes the 5% that follow different paths and produce outliers are important. Alexander Fleming is generally credited with the discovery of the antibiotic effects of penicillin, and his discovery is traditionally described as an accident (http://en.wikipedia.org/wiki/Penicillin): a culture dish became contaminated, and although this dish was unique in its contamination (thus, an outlier) and might have been simply thrown away by a less alert researcher, Fleming saw something that made him preserve it for further study. In truth, the antibiotic properties of certain bread moulds (probably ones in the genus Penicillium, the source of penicillin) were known centuries before Fleming's experiments, so it may be more accurate to describe his finding as the rediscovery of penicillin. Be that as it may, the story illustrates why it's not always wise to ignore the outliers.

For scientific communicators, this scientific attitude towards outliers has equally important consequences. First, and extending the metaphor of the confidence interval further than is really safe, we generally choose to focus on the 95% of our audience with identical needs, because they represent the majority of the information needs we must satisfy. But second, the needs of a few key individuals or a few rare situations (the outliers) are sometimes too important to ignore. When we form hypotheses about the needs of our audience, or actually survey them to learn their needs, it's natural and appropriate to focus on majority needs, but whenever we encounter rare situations or rare audience members (outliers), we should pause to ask ourselves a few questions about those rarities:

Do they represent something real, or just random events?
If they're random, are they likely to occur again with sufficient frequency that our audience needs to know how to deal with those situations?
If they're real, and represent an important underlying phenomenon, such as a previously unidentified audience need, how can we address that need without compromising our ability to meet majority needs?

An example from my own career illustrates the point. Many years ago, I had an opportunity to survey the readers of the reports I edited for a former employer, with the goal of learning how to improve these publications so they would better meet the readers' needs. The reports were based on field studies designed to solve operational problems in forestry, and although the investigative approach followed the scientific model, the readers of the resulting reports were not scientists and were far more interested in what the results would mean for them than in how we obtained the results. Thus, the overwhelming majority told us that they did not read the Methods section, in which we described the study approach. (Our inclusion of a Methods section stemmed from the original model for these reports, which was an uncritical adoption of the model for a scientific journal paper.)

Based on this result, it was tempting to simply discard the entire Methods section because we were (in theory) wasting time writing information that few people read. But in practice, some people (including our own researchers) did read this section and found it useful, often because they wanted to repeat the study themselves. These people needed a clear description of how to perform the study (the reason why the Methods section is so detailed in a journal paper), so it seemed that we could not entirely eliminate the section. As a compromise, we chose to retain the Methods section, but with greatly reduced detail. Where we felt it was relevant, we also preserved detailed descriptions of the methods in an "internal" report (one that would generally not be distributed to our audience) so that these details would be available, both to the few audience members who requested this information and to future researchers, as a form of organizational memory. Because our approach to knowledge transfer emphasized ongoing personal contact with our audience, we made it clear to them that they should feel free to contact us to request a copy of that internal report or to request assistance designing similar studies.

In the field of scientific and technological risk communication, outliers are even more important. Consider, for example, the operation of a factory that uses or produces toxic chemicals. For the majority of that factory's operational life, it will function precisely as it was designed to do: safely, and with little or no damage caused by those chemicals. But because no complex device designed by humans is ever perfect, and because accidents, unexpected mechanical failures, and even extreme events such as terrorist attacks can disrupt the factory's normal operation, risk managers must consider the possibility that the toxins might be released. In effect, they must not only document the normal, routine operation of the factory (the 95%), but must also look for outliers such as industrial accidents (the 5%) and plan for them. For example, someone must document the emergency response procedures.

More interestingly for scientific communicators, outliers may indicate a situation where we have an opportunity to identify changes in practices that would reduce the risk of adverse events, much as STC members in the computer industry report bugs and interface glitches to computer programmers so that they can fix the problems. We may also discover opportunities to act as audience advocates by finding ways to increase communication and understanding between our employer and their audience. In my personal example, we found ways to improve communication by removing rarely read information from our reports while preserving that information elsewhere for those who needed it. In the risk communication example, we might translate between the managers of a factory and members of the community downwind from that factory who might be affected by industrial accidents. There's always a risk that by revealing information about the possibilities of disaster to the community, we will increase their fear. But the literature on risk communication has also shown convincingly that two things can lead to constructive, mutually respectful dialogue instead of the confrontations that have often characterized relationships between industry and the general public: communicating honestly and openly to improve understanding, and demonstrating concern for those who might be affected by a factory's operations by addressing their concerns in operating and emergency-response plans.

Particularly for those of us who were trained as scientists, there's a strong temptation to ignore outliers and focus on the main results. But as I hope I've shown, that's a limited perspective that often leads us to ignore important, if less common, needs. Next time you begin developing documentation or a communication plan related to a matter of science and technology, ask yourself whether you've done an adequate job of identifying and responding to "the needs of the 5%". The answer may surprise you.

My essays on scientific communication have now been collected in the following book:

Hart, G. 2011. Exchanges: 10 years of essays on scientific communication. Diaskeuasis Publishing, Pointe-Claire, Que. Printed version, 242 p.; eBook in PDF format, 327 p.