Managing your study data and the supporting documentation. Part 3: Data validation and processing software

By Geoffrey Hart

In part 1 and part 2 of this series, I described how to set up your project directory and the files it will contain, as well as precautions you should take to protect key files and ways to detect errors during data entry. In this part, I will discuss additional ways to validate and process your data.

Note: This series of articles is based on the following book, but with the details modified for non-psychological research: Berenson, K.R. 2018. Managing Your Research Data and Documentation. American Psychological Association. 105 p. including index.

“Command” files are one common type of program created to input, validate, and process data. This name is based on the terminology used by the SPSS statistical software to describe a sequence of commands that will implement a series of analyses, such as calculating the means for two groups or treatments. A more general name would be “software scripts”, since that would include programs written using the R software to analyze a dataset, scripts that control the operation of a program (e.g., to set parameters for processing a specific type of data such as waveforms), and protocols or checklists that must be followed during data processing. It can also be used for data-entry forms that record survey responses and any software you develop to perform certain analyses that aren’t available in other software.

However, before you begin developing your scripts, it’s important to remember a famous phrase from computer science: “garbage in, garbage out”. This refers to the importance of ensuring that the data you enter is valid and ready for analysis. It doesn’t matter how carefully you analyze your data if the data itself is erroneous. Ideally, data-entry software should let you use tools such as pick lists (a menu that provides a list of choices from which you can pick the most appropriate choice), radio buttons (which let you choose only one option from a series of options), and autofill (the same technology that your smartphone uses to complete words as you type them) to ensure that data is entered correctly. For example, instead of typing a complex treatment name manually (thus risking typing errors), you can type all the valid names only once in your data-entry software and edit those names to ensure they are correct. Subsequently, anyone who is doing data entry can select the correct name from the carefully edited list of names, thereby eliminating a whole category of errors (i.e., typing errors). Of course, this can't prevent you from choosing the wrong item from a list of options, so someone must still check the entered data.
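The idea behind a pick list can also be applied inside a script: check every entered name against a list of valid names that was typed and proofread once. The following is a minimal sketch in Python; the treatment names and the function name are hypothetical and chosen only for illustration.

```python
# Hypothetical list of valid treatment names, typed and proofread once.
VALID_TREATMENTS = ("control", "low nitrogen", "high nitrogen")


def validate_treatment(entry, valid=VALID_TREATMENTS):
    """Return the matching valid name, or raise ValueError if there is none."""
    # Ignore differences in capitalization and surrounding spaces.
    cleaned = entry.strip().lower()
    if cleaned not in valid:
        raise ValueError(f"{entry!r} is not one of {valid}")
    return cleaned
```

A check like this catches typing errors, but it cannot detect a valid name that was chosen by mistake, so the entered data must still be reviewed.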

Grouping all your data from one treatment before you enter it into your software reduces the risk of another type of error. For example, first enter all the control data in a single file and include "control" in the file name so that it’s clear it contains data from the control. Since you are not entering any non-control data, this means that there is no need to review the entered data to confirm that every line of data is correctly labeled as coming from the control treatment. Create additional files for each group of treatment data, so that each file includes only the results of a given treatment. After you complete your data entry and validation for the control and each treatment, you can merge the files, if necessary.
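If each file contains data from only one treatment, a script can add the treatment label when the files are merged, so nobody needs to type the label on every line. The following is a minimal sketch in Python. It assumes that every file is a CSV file with the same columns and that files are named in the form control_data.csv; both the naming convention and the function name are assumptions made for this example.

```python
import csv
from pathlib import Path


def merge_treatment_files(paths, out_path):
    """Merge per-treatment CSV files, adding a 'treatment' column to each row."""
    merged = []
    for path in map(Path, paths):
        # Assumed naming convention: <treatment>_data.csv
        treatment = path.stem.removesuffix("_data")
        with path.open(newline="") as f:
            for row in csv.DictReader(f):
                row["treatment"] = treatment
                merged.append(row)
    # All files are assumed to share the same columns.
    fieldnames = list(merged[0].keys()) if merged else ["treatment"]
    with Path(out_path).open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(merged)
    return merged
```

Because the label comes from the file name, a wrongly named file would mislabel every row it contains, so check the file names before merging.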

If you are performing human research and use a computer for participants to enter their responses to a questionnaire or survey, store a copy of the data-entry program you used and document any steps you took to avoid data-entry errors. The problem with such software is that if it contains an error, that error will spread throughout your subsequent analyses. Thus, ask an expert to review your program, and validate it carefully using test data with a known outcome. For example, it's often possible to quickly calculate an approximate total or average (e.g., by rounding decimal values to the nearest integer). If your software provides the same total or average, you can be more confident that the program is working correctly. For additional validation, you should also test extreme cases, such as unusually large or small values. You can also create a larger dataset by duplicating a small test dataset (e.g., 10 responses) 100 times, then run the software again to confirm that the average value doesn’t change.
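The following minimal sketch in Python shows this kind of validation. The function summarize is a hypothetical stand-in for the analysis you want to test, and the 10 test values were chosen so that the total and the average are easy to calculate by hand.

```python
from statistics import mean


def summarize(values):
    """Stand-in for the analysis under test: returns (total, average)."""
    return sum(values), mean(values)


# Ten test values whose total (60) and average (6) are easy to verify by hand.
test_data = [2, 4, 4, 5, 5, 6, 7, 8, 9, 10]
total, average = summarize(test_data)
assert total == 60
assert average == 6

# Duplicating the dataset 100 times changes the total but not the average.
big_total, big_average = summarize(test_data * 100)
assert big_total == 6000
assert big_average == average
```

If any of these checks fails, the script stops with an error, which tells you that the analysis code needs to be corrected before you use it on real data.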

Note: Professional computer programmers have usually graduated from a 2- to 4-year program of intensive study, but even these experts make mistakes. If you spent a few hours or days reading the user manual for software such as the R statistical programming language, you should not expect to be as good at programming in that language as someone who spent several years learning both the language and how to use it correctly. Thus, always ask an expert to review your programs to ensure that they’re correct.

Frequency tables are useful for validating data. For example, ensure that the total frequency for all response categories combined equals the total sample size (i.e., to ensure that no data were added by mistake and that no data are missing). If you know, for example, that the values of a given variable range from 0 to 10, any value <0 or >10 that has a frequency of 1 or more is clearly an error. If you correct any datapoints that you detect this way, clearly document this change so that a future researcher can repeat your analysis and obtain the same result, or can change the data in a different way if they disagree with your reasoning. Note that when you detect what seem to be errors, don’t guess at the true value unless you have no alternative. It’s better to mark data as “missing” than to choose an incorrect value that will bias the results of your subsequent analyses.
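Both frequency-table checks can be automated. The following is a minimal sketch in Python, assuming responses that should range from 0 to 10; the function name and the example responses are hypothetical.

```python
from collections import Counter


def check_frequencies(values, expected_n, low=0, high=10):
    """Build a frequency table and list any problems found in it."""
    freq = Counter(values)
    problems = []
    # Check 1: the frequencies must add up to the sample size.
    total = sum(freq.values())
    if total != expected_n:
        problems.append(f"expected {expected_n} values, found {total}")
    # Check 2: no value may fall outside the valid range.
    for value in sorted(freq):
        if not low <= value <= high:
            problems.append(
                f"value {value} (n={freq[value]}) is outside {low}-{high}"
            )
    return freq, problems


# Hypothetical responses; the value 11 is outside the valid range.
scores = [0, 3, 3, 7, 10, 11]
freq, problems = check_frequencies(scores, expected_n=6)
for value in sorted(freq):
    print(value, freq[value])
print(problems)
```

The function only reports problems. It deliberately does not change any values, since the decision about how to handle a suspect value should be made and documented by a person.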

If it is truly necessary to fill in missing data, carefully document what you’ve done so that if you have to return to your original data to generate a new working copy, you’ll remember to make the same changes in the new working copy.

Learn how your statistics software encodes missing data. For example, if it uses a code such as 99 that could also appear as a real value in your data, change the missing-data code to a value that cannot occur in your data, or add steps in your script to ensure that any values of 99 are highlighted so you can decide whether they should be included in the calculations or not (i.e., so that missing-data codes don’t affect the mean value).
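The following minimal sketch in Python shows why this matters. It assumes a hypothetical dataset of participant ages in which 99 was used as the missing-data code; the function name is also hypothetical.

```python
from statistics import mean

MISSING_CODE = 99  # hypothetical code used to mark missing values


def mean_excluding_missing(values, missing_code=MISSING_CODE):
    """Return the mean of the non-missing values and the flagged positions."""
    # Record where the code occurs so that a person can review each case.
    flagged = [i for i, v in enumerate(values) if v == missing_code]
    kept = [v for v in values if v != missing_code]
    return mean(kept), flagged


ages = [23, 31, 99, 45, 99]
naive_mean = mean(ages)  # treats both 99s as real ages
clean_mean, flagged = mean_excluding_missing(ages)
print(naive_mean, clean_mean, flagged)  # 59.4 33 [2, 4]
```

Treating the two codes as real ages raises the mean from 33 to 59.4. Note that a participant who really is 99 years old would also be flagged, which is why a code that cannot occur in the data is the safer choice.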

Decide in advance how to handle missing data and how to prevent it. One useful form of prevention, which may not always be available, is to double-check that all questions have been answered before you let a study participant leave an interview; for field research, ensure that all data fields have been filled in for each group of measurements (e.g., to ensure that you did not accidentally skip to the next line in an Excel table before continuing to enter your data, leaving blank cells in the worksheet) and ensure that each data item has been recorded correctly before you move to the next measurement.
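A check for blank fields is easy to automate and can be run before you leave the field site or end the interview. The following is a minimal sketch in Python, assuming that each record is stored as a dictionary of field names and values (for example, as produced by reading a CSV file); the function name and field names are hypothetical.

```python
def find_blank_fields(rows, required):
    """Return (row_number, field) pairs for every blank or absent field."""
    blanks = []
    for number, row in enumerate(rows, start=1):
        for field in required:
            value = row.get(field)
            # Treat absent fields, empty strings, and spaces as blank.
            if value is None or str(value).strip() == "":
                blanks.append((number, field))
    return blanks


# Hypothetical field measurements; rows 2 and 3 are incomplete.
measurements = [
    {"plot": "1", "height": "10.2"},
    {"plot": "2", "height": ""},
    {"plot": "3"},
]
print(find_blank_fields(measurements, required=["plot", "height"]))
# [(2, 'height'), (3, 'height')]
```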

Sometimes it's important to replace missing data with a plausible estimate. For example, when economic data is missing for one year in a longer study period, it may be possible to interpolate between years to generate an estimate of the missing data. There are several defensible ways to fill in missing data when you can’t ask a study participant to clarify their meaning or can’t return to a distant field site to repeat a measurement. For example, if the data you are collecting forms a normally distributed dataset, and there are relatively few missing data, you could use the mean to replace the missing values; for sets with values that don’t have a mean (e.g., a list of cultivar or species names) or for data distributions that may be skewed, you could use the mode. For more complex data, you can use more complex methods such as the mean of a 7-day moving window, linear regression, kriging interpolation, or randomly sampling values from a “prior” distribution.
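Three of the simpler approaches (the mean, the mode, and linear interpolation between neighboring values) can be sketched in a few lines of Python. This is a minimal illustration, not a recommendation of any one method. It assumes that missing values are stored as None, and the function names are hypothetical.

```python
from statistics import mean, mode


def fill_with_mean(values):
    """Replace each None with the mean of the known values."""
    m = mean(v for v in values if v is not None)
    return [m if v is None else v for v in values]


def fill_with_mode(values):
    """Replace each None with the most common known value."""
    m = mode(v for v in values if v is not None)
    return [m if v is None else v for v in values]


def interpolate_gaps(values):
    """Linearly interpolate None values that lie between two known values.

    Gaps at the start or end of the series are left as None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled
```

For example, interpolate_gaps([10.0, None, None, 16.0]) returns [10.0, 12.0, 14.0, 16.0], and fill_with_mode(["oak", None, "oak", "pine"]) replaces the missing name with "oak". Because each method is a function, the same rule is applied to every missing value, and the script itself documents what was done.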

Whatever method you choose, it must be objective (i.e., the logic for your choice must be clear and correct), repeatable, and clearly described to ensure that if another researcher repeats your data analysis, they will not bias the result based on their personal subjective choices. For example, it is reasonable to interpolate outdoor temperatures between 9 AM and 10 AM because temperatures tend to change predictably over short periods. It would be more difficult to justify assigning a value of 50% for a missing test score in a class where students have, in the past, generally received test scores of 60 to 100%.

Similarly, develop objective and quantitative criteria for when to exclude data. This is particularly common for field or laboratory research in which “outliers” occur. For example, if your data are approximately normally distributed, calculate the range that is expected to contain 99% of the values (the mean ± 2.58 standard deviations), and consider any data that lie outside this range to be outliers that should be excluded.
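An exclusion rule can be written as a short function so that it is applied identically every time. The following is a minimal sketch in Python, assuming approximately normally distributed data and a cutoff of the mean ± 2.58 sample standard deviations, which is the range expected to contain about 99% of normally distributed values. The cutoff, the function name, and the example heights are assumptions made for this illustration.

```python
from statistics import mean, stdev


def split_outliers(values, z=2.58):
    """Split values into (kept, outliers) using a mean ± z * SD rule."""
    center = mean(values)
    spread = stdev(values)  # sample standard deviation
    low, high = center - z * spread, center + z * spread
    kept = [v for v in values if low <= v <= high]
    outliers = [v for v in values if not low <= v <= high]
    return kept, outliers


# Hypothetical plant heights (cm); the last value is suspiciously large.
heights = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 9.9,
           10.0, 10.2, 9.8, 10.1, 10.0, 9.9, 10.0, 10.1, 9.9, 25.0]
kept, outliers = split_outliers(heights)
print(outliers)  # [25.0]
```

A rule of this kind works poorly for very small samples, because a single extreme value inflates the standard deviation enough to hide itself. Whatever rule you choose, report the excluded values along with the rule.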

Professional programmers sometimes neglect a point that you should not neglect: document each line in any software you develop. For example, define the meaning of every variable name, and explain the purpose of performing a specific calculation. This will not only help you remember what you did and why in each step of the program, but will also help a colleague or team member validate your logic and help future researchers update your program to account for any improvements in your field of study since you created the original program. If you document your program code inside the program, you don’t need to create a separate document. Human nature means that you are likely to someday forget to update a separate document, so it will gradually diverge from the software, making it difficult to determine whether the description in the software or the description in the document is correct.

In Part 4 of this series, I’ll discuss how to create and document your replication files (the files you will make available to future researchers who want to replicate your study or who want to use your data in meta-analyses).